Hi All,

We are also looking for a custom partitioning strategy; it would be helpful for us too.
Thanks and regards,
Gowtham S

On Mon, 18 Sept 2023 at 12:13, Karthick <ibmkarthickma...@gmail.com> wrote:

> Thanks Liu Ron for the suggestion.
>
> Can you please give any pointers/references for the custom partitioning
> strategy? We are currently using murmur hashing with the device unique id.
> It would be helpful if you could point us to any other strategies.
>
> Thanks and regards,
> Karthick.
>
> On Mon, Sep 18, 2023 at 9:18 AM liu ron <ron9....@gmail.com> wrote:
>
>> Hi, Karthick
>>
>> It looks like a data-skew problem. I think one of the easiest and most
>> efficient ways to address it is to increase the number of partitions and
>> see how it works first, e.g. try expanding to 100 first.
>>
>> Best,
>> Ron
>>
>> Karthick <ibmkarthickma...@gmail.com> wrote on Sun, Sep 17, 2023 at 17:03:
>>
>>> Thanks Wei Chen and Giannis for your time.
>>>
>>>> For starters, you need to better size and estimate the required number
>>>> of partitions you will need on the Kafka side in order to process 1000+
>>>> messages/second. The number of partitions should also define the
>>>> maximum parallelism for the Flink job reading from Kafka.
>>>
>>> Thanks for the pointer. Can you please guide us on which factors we
>>> need to consider here?
>>>
>>>> use a custom partitioner that spreads those devices to somewhat
>>>> separate partitions.
>>>
>>> Please suggest a working solution for the custom partitioner to
>>> distribute the load. It would be helpful.
>>>
>>>> What we were doing at that time was to define multiple topics, each
>>>> with a different # of partitions
>>>
>>> Thanks for the suggestion. Is there any calculation for choosing the
>>> topic count? Are there any formulae/factors to determine this number?
>>> If so, please let us know; it would help us choose it.
>>>
>>> Thanks and regards,
>>> Karthick.
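The "which factors to consider" question about sizing the partition count is commonly answered with a back-of-the-envelope rule: take the larger of the producer-side and consumer-side partition requirements and add headroom. The sketch below is one illustrative version of that rule, not an official formula, and all of the rates in it are hypothetical placeholders.

```python
import math

def estimate_partitions(target_msgs_per_sec: float,
                        per_partition_produce_rate: float,
                        per_consumer_consume_rate: float,
                        headroom: float = 1.5) -> int:
    """Rule-of-thumb partition count: enough partitions that neither the
    producers nor the (slower) consumers become the bottleneck, times a
    headroom factor for growth and skew. All rates are messages/second."""
    by_produce = target_msgs_per_sec / per_partition_produce_rate
    by_consume = target_msgs_per_sec / per_consumer_consume_rate
    return math.ceil(max(by_produce, by_consume) * headroom)

# Hypothetical numbers: a 1000 msg/s target, each partition absorbing
# 500 msg/s of writes, but one consumer handling only 50 msg/s.
print(estimate_partitions(1000, 500, 50))  # consumer-bound -> 30 partitions
```

The consumer rate usually dominates when processing per record is expensive, which also matches the advice in the thread that the partition count caps the Flink job's useful parallelism.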
>>>
>>> On Sun, Sep 17, 2023 at 4:04 AM Wei Chen <sagitc...@qq.com> wrote:
>>>
>>>> Hi Karthick,
>>>> We've experienced a similar issue before. What we did at the time was
>>>> to define multiple topics, each with a different # of partitions, which
>>>> means the topics with more partitions get higher parallelism for
>>>> processing.
>>>> You can further divide the topics into several groups, where each group
>>>> has a similar # of partitions. You can then define each group as a
>>>> source of a Flink data stream and run them in parallel with different
>>>> parallelisms.
>>>>
>>>> ------------------------------
>>>>
>>>> ------------------ Original ------------------
>>>> *From:* Giannis Polyzos <ipolyzos...@gmail.com>
>>>> *Date:* Sat, Sep 16, 2023 11:52 PM
>>>> *To:* Karthick <ibmkarthickma...@gmail.com>
>>>> *Cc:* Gowtham S <gowtham.co....@gmail.com>, user <user@flink.apache.org>
>>>> *Subject:* Re: Urgent: Mitigating Slow Consumer Impact and Seeking
>>>> Open-Source Solutions in Apache Kafka Consumers
>>>>
>>>> Can you provide some more context on what your Flink job will be doing?
>>>> There might be some things you can do to fix the data skew on the Flink
>>>> side, but first, you want to start with Kafka.
>>>> For starters, you need to better size and estimate the required number
>>>> of partitions you will need on the Kafka side in order to process 1000+
>>>> messages/second.
>>>> The number of partitions should also define the maximum parallelism for
>>>> the Flink job reading from Kafka.
>>>> If you know your "hot devices" in advance, you might want to use a
>>>> custom partitioner that spreads those devices to somewhat separate
>>>> partitions.
>>>> Overall this is somewhat of a trial-and-error process. You might also
>>>> want to check that these partitions are evenly balanced among your
>>>> brokers and don't cause too much stress on particular brokers.
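The idea of spreading known hot devices over somewhat separate partitions could look roughly like the sketch below. This is a hedged illustration, not a drop-in Kafka `Partitioner` implementation: the class name, the hot-device set, and the md5 stand-in hash are all made up for the example (Kafka's default partitioner actually uses murmur2 on the key bytes).

```python
import hashlib
import itertools

def stable_hash(key: str) -> int:
    # Kafka's default partitioner hashes key bytes with murmur2; md5 is
    # used here only to keep the sketch dependency-free and deterministic.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

class HotAwarePartitioner:
    """Sticky hashing for normal devices (preserves per-device ordering);
    known hot devices are fanned out round-robin over a dedicated slice of
    partitions so one hot/slow device cannot pin down a shared partition.
    Caveat: fanning out a hot device gives up strict ordering for it."""

    def __init__(self, num_partitions: int, hot_devices: set, hot_slice: int = 4):
        assert 0 < hot_slice < num_partitions
        self.cold = num_partitions - hot_slice      # partitions 0..cold-1
        hot_range = range(self.cold, num_partitions)
        self.hot_rr = {d: itertools.cycle(hot_range) for d in hot_devices}

    def partition(self, device_id: str) -> int:
        rr = self.hot_rr.get(device_id)
        if rr is not None:
            return next(rr)                         # spread the hot device
        return stable_hash(device_id) % self.cold   # sticky: order preserved

p = HotAwarePartitioner(10, hot_devices={"dev-hot"}, hot_slice=4)
print([p.partition("dev-hot") for _ in range(5)])  # [6, 7, 8, 9, 6]
```

Note the trade-off stated in the docstring: records of a fanned-out hot device lose strict ordering, which may or may not be acceptable for a given device's data.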
>>>>
>>>> Best
>>>>
>>>> On Sat, Sep 16, 2023 at 6:03 PM Karthick <ibmkarthickma...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Gowtham, I agree with you.
>>>>>
>>>>> I'm eager to resolve the issue or gain a better understanding. Your
>>>>> assistance would be greatly appreciated.
>>>>>
>>>>> If there are any additional details or context needed to address my
>>>>> query effectively, please let me know, and I'll be happy to provide
>>>>> them.
>>>>>
>>>>> Thank you in advance for your time and consideration. I look forward
>>>>> to hearing from you and benefiting from your expertise.
>>>>>
>>>>> Thanks and regards,
>>>>> Karthick.
>>>>>
>>>>> On Sat, Sep 16, 2023 at 11:04 AM Gowtham S <gowtham.co....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Karthick,
>>>>>>
>>>>>> This appears to be a common challenge related to a slow-consuming
>>>>>> situation. Those with relevant experience in addressing such matters
>>>>>> should be capable of providing assistance.
>>>>>>
>>>>>> Thanks and regards,
>>>>>> Gowtham S
>>>>>>
>>>>>> On Fri, 15 Sept 2023 at 23:06, Giannis Polyzos
>>>>>> <ipolyzos...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Karthick,
>>>>>>>
>>>>>>> On a high level this seems like a data-skew issue: do some
>>>>>>> partitions have way more data than others?
>>>>>>> What is the number of your devices? How many messages are you
>>>>>>> processing?
>>>>>>> Most of the things you share above sound like you are looking for
>>>>>>> suggestions around load distribution for Kafka, i.e. number of
>>>>>>> partitions, how to distribute your device data, etc.
>>>>>>> It would be good to also share what your Flink job is doing, as I
>>>>>>> don't see anything mentioned around that. Are you observing back
>>>>>>> pressure in the Flink UI?
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> On Fri, Sep 15, 2023 at 3:46 PM Karthick
>>>>>>> <ibmkarthickma...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Dear Apache Flink Community,
>>>>>>>>
>>>>>>>> I am writing to urgently address a critical challenge we've
>>>>>>>> encountered in our IoT platform, which relies on Apache Kafka and
>>>>>>>> real-time data processing. We believe this issue is of paramount
>>>>>>>> importance and may have broad implications for the community.
>>>>>>>>
>>>>>>>> In our IoT ecosystem, we receive data streams from numerous
>>>>>>>> devices, each uniquely identified. To maintain data integrity and
>>>>>>>> ordering, we've configured a Kafka topic with ten partitions,
>>>>>>>> ensuring that each device's data is directed to its respective
>>>>>>>> partition based on its unique identifier. This architectural choice
>>>>>>>> has proven effective in maintaining data order, but it has also
>>>>>>>> unveiled a significant problem:
>>>>>>>>
>>>>>>>> *One device's slow data processing interferes with other devices'
>>>>>>>> data, causing a detrimental ripple effect throughout our system.*
>>>>>>>>
>>>>>>>> To put it simply, when a single device experiences processing
>>>>>>>> delays, it acts as a bottleneck within the Kafka partition, leading
>>>>>>>> to delays in processing data from other devices sharing the same
>>>>>>>> partition. This issue undermines the efficiency and scalability of
>>>>>>>> our entire data processing pipeline.
>>>>>>>>
>>>>>>>> Additionally, we are currently using the default partitioner to
>>>>>>>> choose the partition for each device's data. If there are
>>>>>>>> alternative partitioning strategies that can help alleviate this
>>>>>>>> problem, we are eager to explore them.
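The sharing problem described here is structural: with far more devices than partitions, a hash-based default assignment must co-locate many devices per partition, so one slow device stalls every neighbor. The small self-contained simulation below makes the pigeonhole argument concrete; the device ids and the md5 stand-in hash are illustrative (Kafka's default partitioner uses murmur2 on the key bytes).

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 10

def partition_for(device_id: str) -> int:
    # Stand-in for the default hash(key) % num_partitions assignment.
    h = int.from_bytes(hashlib.md5(device_id.encode()).digest()[:4], "big")
    return h % NUM_PARTITIONS

# 500 hypothetical devices over 10 partitions: by the pigeonhole principle
# at least one partition hosts >= 50 devices, so a single slow device
# delays every device co-located on its partition.
assignment = defaultdict(list)
for i in range(500):
    device = f"device-{i}"
    assignment[partition_for(device)].append(device)

print(max(len(devices) for devices in assignment.values()))  # >= 50
```

More partitions shrink the blast radius of a slow device but can never eliminate sharing while devices outnumber partitions, which is why the replies in this thread also suggest hot-device-aware partitioning.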
>>>>>>>>
>>>>>>>> We are in dire need of a highly scalable solution that not only
>>>>>>>> ensures each device's data processing is independent but also
>>>>>>>> prevents any interference or collisions between devices' data
>>>>>>>> streams. Our primary objectives are:
>>>>>>>>
>>>>>>>> 1. *Isolation and Independence:* We require a strategy that
>>>>>>>> guarantees one device's processing speed does not affect other
>>>>>>>> devices in the same Kafka partition. In other words, we need a
>>>>>>>> solution that ensures the independent processing of each device's
>>>>>>>> data.
>>>>>>>>
>>>>>>>> 2. *Open-Source Implementation:* We are actively seeking pointers
>>>>>>>> to open-source implementations or references to working solutions
>>>>>>>> that address this specific challenge within the Apache ecosystem.
>>>>>>>> Any existing projects, libraries, or community-contributed
>>>>>>>> solutions that align with our requirements would be immensely
>>>>>>>> valuable.
>>>>>>>>
>>>>>>>> We recognize that many Apache Flink users face similar issues and
>>>>>>>> may have already found innovative ways to tackle them. We implore
>>>>>>>> you to share your knowledge and experiences on this matter.
>>>>>>>> Specifically, we are interested in:
>>>>>>>>
>>>>>>>> *- Strategies or architectural patterns that ensure independent
>>>>>>>> processing of device data.*
>>>>>>>> *- Insights into load balancing, scalability, and efficient data
>>>>>>>> processing across Kafka partitions.*
>>>>>>>> *- Any existing open-source projects or implementations that
>>>>>>>> address similar challenges.*
>>>>>>>>
>>>>>>>> We are confident that your contributions will not only help us
>>>>>>>> resolve this critical issue but also assist the broader Apache
>>>>>>>> Flink community facing similar obstacles.
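Objective 1 (isolation and independence) can be modeled in miniature: give every device its own queue downstream of the Kafka source and serve the queues fairly, so a backlogged device delays only itself. The sketch below is a toy scheduling model, not Flink code; in a Flink job a related idea is keying the stream by device id so each key's state and ordering are isolated, though keys that land on the same subtask still share its processing time.

```python
from collections import defaultdict, deque

def process_round_robin(events):
    """Toy model of per-device isolation: every device gets its own FIFO
    queue, and one record per device is served each round. A device with
    a huge backlog only delays itself, and per-device order is preserved.
    Returns a list of (round_finished, device, record) tuples."""
    queues = defaultdict(deque)
    for device, record in events:
        queues[device].append(record)
    finished, rnd = [], 0
    while any(queues.values()):
        rnd += 1
        for device, q in queues.items():
            if q:
                finished.append((rnd, device, q.popleft()))
    return finished

# "slow-dev" has 100 records queued before "fast-dev" even arrives, yet
# fast-dev's 3 records complete in rounds 1-3 instead of waiting behind it.
events = [("slow-dev", i) for i in range(100)] + [("fast-dev", i) for i in range(3)]
result = process_round_robin(events)
print([rnd for rnd, dev, _ in result if dev == "fast-dev"])  # [1, 2, 3]
```

Contrast this with a single shared FIFO per partition, where fast-dev's records would sit behind all 100 of slow-dev's: that is exactly the ripple effect described above.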
>>>>>>>>
>>>>>>>> Please respond to this thread with your expertise, solutions, or
>>>>>>>> any relevant resources. Your support will be invaluable to our team
>>>>>>>> and the entire Apache Flink community.
>>>>>>>>
>>>>>>>> Thank you for your prompt attention to this matter.
>>>>>>>>
>>>>>>>> Thanks & Regards,
>>>>>>>> Karthick.
>>>>>>>