[ https://issues.apache.org/jira/browse/KAFKA-18974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Wang reassigned KAFKA-18974:
----------------------------------

    Assignee: Jimmy Wang

> Uneven distribution of topic partitions across consumers while using 
> Cooperative Sticky Assignor
> ------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-18974
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18974
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>    Affects Versions: 3.8.1
>            Reporter: Gangadharan
>            Assignee: Jimmy Wang
>            Priority: Major
>
> I came across a scenario where topic partitions are spread unevenly across 
> consumer threads. The topic with high TPS (e.g. 85% of traffic) had more 
> partitions than the topics with low TPS (e.g. 15% of traffic). The consumer 
> threads had subscribed to both sets of topics. As a result, some consumer 
> threads were assigned more partitions of the low-TPS topics, and the pods 
> whose consumer threads held more partitions of the high-TPS topics had to 
> work harder, resulting in higher lag. If we choose round robin instead, the 
> distribution is even between threads and across pods, but we are then 
> limited by its stop-the-world rebalance.
> An issue on this topic was already raised and fixed (KAFKA-16277, 
> "CooperativeStickyAssignor does not spread topics evenly among consumer 
> group"), but the fix does not solve the whole problem. I suspect this is 
> because, during a rebalance, only the partitions that are supposed to be 
> moved away from existing consumers are sorted and distributed. There is no 
> logic to also check whether the retained partitions should be moved to 
> ensure an even spread across consumers.
> If this behavior is intended, is there a way to guarantee even distribution 
> with cooperative sticky? The stop-the-world pause during a rebalance 
> prevents users from relying on round robin distribution.
> Below is a sample test:
> 2 pods with 6 consumer threads each, and two topics with 18 partitions each 
> (test_topic_1 has higher inflow than test_topicone_1). As the listings show, 
> with the cooperative sticky strategy test_topic_1 is concentrated in pod1, 
> which therefore starts to build up lag. With round robin, it is distributed 
> between the pods.
> Note: The sample test uses the same partition count for both topics for ease 
> of understanding. The behavior appears to be the same regardless of the 
> partition counts of the topics.
>  
> Cooperative Sticky:
> Pod1
> c--> consumer 1912486590767 [test_topic_1-1, test_topic_1-3, test_topicone_1-1]
> c--> consumer 1922696734819 [test_topic_1-11, test_topic_1-6, test_topicone_1-6]
> c--> consumer 1941340051228 [test_topic_1-12, test_topic_1-7, test_topicone_1-7]
> c--> consumer 1940955938996 [test_topic_1-0, test_topic_1-8, test_topicone_1-0]
> c--> consumer 1941837822481 [test_topic_1-2, test_topic_1-9, test_topicone_1-2]
> c--> consumer 1942719746188 [test_topic_1-10, test_topic_1-4, test_topicone_1-4]
>  
> Pod2
> c--> consumer 1941486742305 [test_topic_1-13, test_topicone_1-13, test_topicone_1-5]
> c--> consumer 1941837974018 [test_topic_1-14, test_topicone_1-14, test_topicone_1-8]
> c--> consumer 1942719897724 [test_topic_1-15, test_topicone_1-15, test_topicone_1-9]
> c--> consumer 1942696886353 [test_topic_1-16, test_topicone_1-10, test_topicone_1-16]
> c--> consumer 1941340202762 [test_topic_1-17, test_topicone_1-11, test_topicone_1-17]
> c--> consumer 1940956090534 [test_topic_1-5, test_topicone_1-12, test_topicone_1-3]
> -----------------------------------------------------------------------------------------
> Round Robin:
> Pod1
> c--> consumer 1941408797822 [test_topic_1-0, test_topic_1-12, test_topicone_1-6]
> c--> consumer 1941456423553 [test_topic_1-9, test_topicone_1-15, test_topicone_1-3]
> c--> consumer 1942070859325 [test_topic_1-14, test_topic_1-2, test_topicone_1-8]
> c--> consumer 1941385036886 [test_topic_1-16, test_topic_1-4, test_topicone_1-10]
> c--> consumer 1941105638483 [test_topic_1-6, test_topicone_1-0, test_topicone_1-12]
> c--> consumer 1941885698382 [test_topic_1-10, test_topicone_1-16, test_topicone_1-4]
>  
> Pod2
> c--> consumer 1941456538287 [test_topic_1-8, test_topicone_1-14, test_topicone_1-2]
> c--> consumer 1942070974058 [test_topic_1-15, test_topic_1-3, test_topicone_1-9]
> c--> consumer 1941885813119 [test_topic_1-11, test_topicone_1-19, test_topicone_1-5]
> c--> consumer 1941408912555 [test_topic_1-1, test_topic_1-13, test_topicone_1-7]
> c--> consumer 1941385151618 [test_topic_1-17, test_topic_1-5, test_topicone_1-11]
> c--> consumer 1941105753216 [test_topic_1-7, test_topicone_1-1, test_topicone_1-13]
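
The per-pod skew in the listings above can be tallied with a short script. The following is only an illustrative sketch, not part of the original report: the partition numbers are transcribed from the listings, the 85%/15% split is the example figure from the description, and the names `ASSIGNMENTS`, `partition_counts`, and `load_share` are hypothetical. It also assumes each topic's traffic is spread uniformly over its 18 partitions, which the report does not state.

```python
# Partition numbers per pod and topic, transcribed from the listings above.
# Partition 19 of test_topicone_1 is kept exactly as it appears in the report.
ASSIGNMENTS = {
    "cooperative-sticky": {
        "pod1": {
            "test_topic_1": [1, 3, 11, 6, 12, 7, 0, 8, 2, 9, 10, 4],
            "test_topicone_1": [1, 6, 7, 0, 2, 4],
        },
        "pod2": {
            "test_topic_1": [13, 14, 15, 16, 17, 5],
            "test_topicone_1": [13, 5, 14, 8, 15, 9, 10, 16, 11, 17, 12, 3],
        },
    },
    "round-robin": {
        "pod1": {
            "test_topic_1": [0, 12, 9, 14, 2, 16, 4, 6, 10],
            "test_topicone_1": [6, 15, 3, 8, 10, 0, 12, 16, 4],
        },
        "pod2": {
            "test_topic_1": [8, 15, 3, 11, 1, 13, 17, 5, 7],
            "test_topicone_1": [14, 2, 9, 19, 5, 7, 11, 1, 13],
        },
    },
}

# Example traffic split from the description (assumption: uniform per partition).
TRAFFIC_SHARE = {"test_topic_1": 0.85, "test_topicone_1": 0.15}
PARTITIONS_PER_TOPIC = 18


def partition_counts(strategy):
    """Return {pod: {topic: number of assigned partitions}} for a strategy."""
    return {
        pod: {topic: len(parts) for topic, parts in topics.items()}
        for pod, topics in ASSIGNMENTS[strategy].items()
    }


def load_share(strategy):
    """Estimate the fraction of total traffic each pod handles."""
    return {
        pod: sum(
            TRAFFIC_SHARE[topic] * count / PARTITIONS_PER_TOPIC
            for topic, count in counts.items()
        )
        for pod, counts in partition_counts(strategy).items()
    }


if __name__ == "__main__":
    for strategy in ASSIGNMENTS:
        print(strategy, partition_counts(strategy))
        for pod, share in load_share(strategy).items():
            print(f"  {pod}: {share:.1%} of traffic")
```

Under these assumptions, cooperative sticky leaves pod1 with 12 of the 18 test_topic_1 partitions and roughly 62% of the traffic, against roughly 38% for pod2. Round robin gives each pod 9 partitions of each topic and an even 50/50 split.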



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
