Hi Spark Community,

Thank you for bringing up this issue. We've also encountered the same challenge and are actively working on finding a solution. It's reassuring to know that we're not alone in this.
If you have any insights or suggestions regarding how to address this problem, please feel free to share them. Looking forward to hearing from others who might have encountered similar issues.

Thanks and regards,
Gowtham S

On Tue, 19 Sept 2023 at 17:23, Karthick <ibmkarthickma...@gmail.com> wrote:

> Subject: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem
>
> Dear Spark Community,
>
> I recently reached out to the Apache Flink community for assistance with a
> critical issue we are facing in our IoT platform, which relies on Apache
> Kafka and real-time data processing. We received some valuable insights and
> suggestions from the Apache Flink community, and now, we would like to seek
> your expertise and guidance on the same problem.
>
> In our IoT ecosystem, we are dealing with data streams from thousands of
> devices, each uniquely identified. To maintain data integrity and ordering,
> we have configured a Kafka topic with ten partitions, ensuring that each
> device's data is directed to its respective partition based on its unique
> identifier. While this architectural choice has been effective in
> maintaining data order, it has unveiled a significant challenge:
>
> *Slow Consumer and Data Skew Problem:* When a single device experiences
> processing delays, it acts as a bottleneck within the Kafka partition,
> leading to delays in processing data from other devices sharing the same
> partition. This issue severely affects the efficiency and scalability of
> our entire data processing pipeline.
>
> Here are some key details:
>
> - Number of Devices: 1000 (with potential growth)
> - Target Message Rate: 1000 messages per second (with expected growth)
> - Kafka Partitions: 10 (some partitions are overloaded)
> - We are planning to migrate from Apache Storm to Apache Flink/Spark.
>
> We are actively seeking guidance on the following aspects:
>
> *1. Independent Device Data Processing*: We require a strategy that
> guarantees one device's processing speed does not affect other devices in
> the same Kafka partition. In other words, we need a solution that ensures
> the independent processing of each device's data.
>
> *2. Custom Partitioning Strategy:* We are looking for a custom
> partitioning strategy to distribute the load evenly across Kafka
> partitions. Currently, we are using Murmur hashing with the device's unique
> identifier, but we are open to exploring alternative partitioning
> strategies.
>
> *3. Determining Kafka Partition Count:* We seek guidance on how to
> determine the optimal number of Kafka partitions to handle the target
> message rate efficiently.
>
> *4. Handling Data Skew:* Strategies or techniques for handling data skew
> within Apache Flink.
>
> We believe that many in your community may have faced similar challenges
> or possess valuable insights into addressing them. Your expertise and
> experiences can greatly benefit our team and the broader community dealing
> with real-time data processing.
>
> If you have any knowledge, solutions, or references to open-source
> projects, libraries, or community-contributed solutions that align with our
> requirements, we would be immensely grateful for your input.
>
> We appreciate your prompt attention to this matter and eagerly await your
> responses and insights. Your support will be invaluable in helping us
> overcome this critical challenge.
>
> Thank you for your time and consideration.
>
> Thanks & regards,
> Karthick.
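
To make the skew described above concrete, here is a minimal sketch of how hashing device identifiers onto 10 partitions can leave some partitions with more devices than others. Everything beyond the figures quoted in the message (1000 devices, 10 partitions) is an assumption: the `device-0000` ID format is hypothetical, and `zlib.crc32` is only a stable stand-in for the Murmur hash mentioned, so the exact counts will differ from a real topic.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 10   # figure quoted in the message
NUM_DEVICES = 1000    # figure quoted in the message


def partition_for(device_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a device ID to a partition with a stable hash.

    zlib.crc32 is a stand-in chosen for illustration; it is NOT the
    Murmur hash mentioned in the message.
    """
    return zlib.crc32(device_id.encode("utf-8")) % num_partitions


def devices_per_partition(device_ids, num_partitions: int = NUM_PARTITIONS) -> Counter:
    """Count how many devices land on each partition."""
    counts = Counter({p: 0 for p in range(num_partitions)})
    for device_id in device_ids:
        counts[partition_for(device_id, num_partitions)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical device ID format, for illustration only.
    ids = [f"device-{i:04d}" for i in range(NUM_DEVICES)]
    counts = devices_per_partition(ids)
    mean = NUM_DEVICES / NUM_PARTITIONS
    for partition in range(NUM_PARTITIONS):
        print(f"partition {partition}: {counts[partition]} devices")
    print(f"max/mean ratio: {max(counts.values()) / mean:.2f}")
```

This only counts devices per partition. If devices send at different rates, weighting each device by its message rate would probably give a closer picture of the load, and an even spread of devices would still not stop one slow device from delaying the others on its partition.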