[
https://issues.apache.org/jira/browse/KAFKA-18974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jimmy Wang reassigned KAFKA-18974:
----------------------------------
Assignee: Jimmy Wang
> Uneven distribution of topic partitions across consumers while using
> Cooperative Sticky Assignor
> ------------------------------------------------------------------------------------------------
>
> Key: KAFKA-18974
> URL: https://issues.apache.org/jira/browse/KAFKA-18974
> Project: Kafka
> Issue Type: Bug
> Components: clients, consumer
> Affects Versions: 3.8.1
> Reporter: Gangadharan
> Assignee: Jimmy Wang
> Priority: Major
>
> I came across a scenario where the per-topic spread of partitions across
> consumer threads is uneven. The topic with high TPS (e.g. 85% of the
> traffic) had more partitions than the topics with low TPS (e.g. 15% of the
> traffic). The consumer threads were subscribed to both sets of topics.
> After assignment, some consumer threads held mostly partitions of the
> low-TPS topics. As a result, the pods whose consumer threads held more
> partitions of the high-TPS topic had to do more work, resulting in higher
> lag. With the round robin assignor, the distribution is even between
> threads and across pods, but every rebalance then becomes a stop-the-world
> event, which limits us.
> An issue on this topic was already raised and fixed, but the fix does not
> solve the whole problem:
> KAFKA-16277 CooperativeStickyAssignor does not spread topics evenly among
> consumer group
> I suspect this is because, during a rebalance, only the partitions that
> are to be moved away from existing consumers are sorted and distributed.
> There is no logic that also checks whether the retained partitions should
> be moved to ensure an even spread across consumers.
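To make the hypothesis above concrete, here is a minimal Python sketch of a sticky-style rebalance. It is a toy model and not the CooperativeStickyAssignor code: the function names, the "hot" and "cold" topics, and the rule that a consumer keeps the first partitions of its previous list up to its quota are all assumptions made for illustration.

```python
from collections import Counter


def toy_sticky_rebalance(previous, consumers, partitions):
    """Toy model of a sticky rebalance (NOT Kafka's implementation).

    Each consumer keeps previously owned partitions up to its quota;
    only the leftover partitions are sorted and redistributed.
    """
    quota = -(-len(partitions) // len(consumers))  # ceiling division
    valid = set(partitions)
    assignment = {c: [] for c in consumers}
    claimed = set()

    # Step 1: retain previously owned partitions and never revisit them.
    for c in consumers:
        for tp in previous.get(c, []):
            if tp in valid and tp not in claimed and len(assignment[c]) < quota:
                assignment[c].append(tp)
                claimed.add(tp)

    # Step 2: sort only the unclaimed partitions and give each one to the
    # consumer that currently holds the fewest partitions.
    for tp in sorted(valid - claimed):
        target = min(consumers, key=lambda c: len(assignment[c]))
        assignment[target].append(tp)
    return assignment


def per_topic_counts(assignment):
    """Count partitions per topic for every consumer."""
    return {c: dict(Counter(topic for topic, _ in tps))
            for c, tps in assignment.items()}


# Hypothetical scenario: c1 owned all partitions, then c2 joins the group.
partitions = [(t, p) for t in ("hot", "cold") for p in range(4)]
previous = {"c1": list(partitions)}
result = toy_sticky_rebalance(previous, ["c1", "c2"], partitions)
print(per_topic_counts(result))
```

In this toy scenario both consumers end up with four partitions, so the total count is balanced, yet c1 holds every "hot" partition and c2 every "cold" one. Whether the real assignor behaves this way is exactly what the report asks about.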
> If this behavior is intended, is there a way to guarantee an even
> distribution with the cooperative sticky assignor? The stop-the-world
> pause during rebalances prevents users from falling back to round robin
> distribution.
> Below is a sample test:
> 2 pods with 6 consumer threads each, and two topics with 18 partitions each
> (test_topic_1 has a higher inflow than test_topicone_1). In the assignment
> listings below, the cooperative sticky strategy concentrates test_topic_1
> in pod1, which starts to build up lag. With round robin, test_topic_1 is
> distributed evenly between the pods.
> Note: The topics have the same partition count in this sample test only to
> make it easier to follow. The behavior seems to be the same regardless of
> the topics' partition counts.
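For comparison, the round robin side of this setup can be modelled in a few lines of Python. This is a simplified sketch that assumes every member subscribes to both topics, sorts partitions by topic and number, and deals them out to the sorted members in turn. The member ids ("m00" to "m11") and their mapping to pods are hypothetical.

```python
from collections import Counter
from itertools import cycle


def round_robin_assign(members, topics):
    """Deal partitions, sorted by topic then number, to sorted members in turn.

    Simplified model: every member subscribes to every topic.
    `members` maps a member id to its pod; `topics` maps a topic name to
    its partition count.
    """
    assignment = {m: [] for m in members}
    turn = cycle(sorted(members))
    for topic in sorted(topics):
        for partition in range(topics[topic]):
            assignment[next(turn)].append((topic, partition))
    return assignment


def per_pod_topic_counts(members, assignment):
    """Aggregate the number of partitions of each topic held by each pod."""
    counts = {}
    for member, tps in assignment.items():
        counts.setdefault(members[member], Counter()).update(t for t, _ in tps)
    return {pod: dict(c) for pod, c in counts.items()}


topics = {"test_topic_1": 18, "test_topicone_1": 18}

# Hypothetical member ids whose sorted order alternates between the pods.
interleaved = {f"m{i:02d}": f"pod{1 + i % 2}" for i in range(12)}
print(per_pod_topic_counts(interleaved, round_robin_assign(interleaved, topics)))

# Hypothetical member ids whose sorted order groups each pod together.
grouped = {f"m{i:02d}": f"pod{1 + i // 6}" for i in range(12)}
print(per_pod_topic_counts(grouped, round_robin_assign(grouped, topics)))
```

In this model each pod gets 9 partitions of each topic when the sorted member ids alternate between pods, but a 12/6 split when each pod's ids sort together. So the pod-level evenness of round robin may depend on how the member ids happen to sort.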
>
> Cooperative Sticky:
> Pod1
> c--> consumer 1912486590767 [test_topic_1-1, test_topic_1-3, test_topicone_1-1]
> c--> consumer 1922696734819 [test_topic_1-11, test_topic_1-6, test_topicone_1-6]
> c--> consumer 1941340051228 [test_topic_1-12, test_topic_1-7, test_topicone_1-7]
> c--> consumer 1940955938996 [test_topic_1-0, test_topic_1-8, test_topicone_1-0]
> c--> consumer 1941837822481 [test_topic_1-2, test_topic_1-9, test_topicone_1-2]
> c--> consumer 1942719746188 [test_topic_1-10, test_topic_1-4, test_topicone_1-4]
>
> Pod2
> c--> consumer 1941486742305 [test_topic_1-13, test_topicone_1-13, test_topicone_1-5]
> c--> consumer 1941837974018 [test_topic_1-14, test_topicone_1-14, test_topicone_1-8]
> c--> consumer 1942719897724 [test_topic_1-15, test_topicone_1-15, test_topicone_1-9]
> c--> consumer 1942696886353 [test_topic_1-16, test_topicone_1-10, test_topicone_1-16]
> c--> consumer 1941340202762 [test_topic_1-17, test_topicone_1-11, test_topicone_1-17]
> c--> consumer 1940956090534 [test_topic_1-5, test_topicone_1-12, test_topicone_1-3]
> -----------------------------------------------------------------------------------------
> Round Robin:
> Pod1
> c--> consumer 1941408797822 [test_topic_1-0, test_topic_1-12, test_topicone_1-6]
> c--> consumer 1941456423553 [test_topic_1-9, test_topicone_1-15, test_topicone_1-3]
> c--> consumer 1942070859325 [test_topic_1-14, test_topic_1-2, test_topicone_1-8]
> c--> consumer 1941385036886 [test_topic_1-16, test_topic_1-4, test_topicone_1-10]
> c--> consumer 1941105638483 [test_topic_1-6, test_topicone_1-0, test_topicone_1-12]
> c--> consumer 1941885698382 [test_topic_1-10, test_topicone_1-16, test_topicone_1-4]
> Pod2
> c--> consumer 1941456538287 [test_topic_1-8, test_topicone_1-14, test_topicone_1-2]
> c--> consumer 1942070974058 [test_topic_1-15, test_topic_1-3, test_topicone_1-9]
> c--> consumer 1941885813119 [test_topic_1-11, test_topicone_1-19, test_topicone_1-5]
> c--> consumer 1941408912555 [test_topic_1-1, test_topic_1-13, test_topicone_1-7]
> c--> consumer 1941385151618 [test_topic_1-17, test_topic_1-5, test_topicone_1-11]
> c--> consumer 1941105753216 [test_topic_1-7, test_topicone_1-1, test_topicone_1-13]
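The two listings above can be tallied per pod with a short Python script. The partition numbers are transcribed from the listings. The traffic estimate is only a rough sketch: it takes the report's example split of 85% / 15% and assumes, hypothetically, that each topic's traffic is spread uniformly over its partitions.

```python
# Partition numbers per pod and topic, transcribed from the listings above
# (test_topicone_1-19 is kept exactly as reported).
COOPERATIVE_STICKY = {
    "Pod1": {"test_topic_1": [1, 3, 11, 6, 12, 7, 0, 8, 2, 9, 10, 4],
             "test_topicone_1": [1, 6, 7, 0, 2, 4]},
    "Pod2": {"test_topic_1": [13, 14, 15, 16, 17, 5],
             "test_topicone_1": [13, 5, 14, 8, 15, 9, 10, 16, 11, 17, 12, 3]},
}
ROUND_ROBIN = {
    "Pod1": {"test_topic_1": [0, 12, 9, 14, 2, 16, 4, 6, 10],
             "test_topicone_1": [6, 15, 3, 8, 10, 0, 12, 16, 4]},
    "Pod2": {"test_topic_1": [8, 15, 3, 11, 1, 13, 17, 5, 7],
             "test_topicone_1": [14, 2, 9, 19, 5, 7, 11, 1, 13]},
}

# Assumption: the report's example 85% / 15% traffic split, spread
# uniformly over each topic's partitions.
TRAFFIC_SHARE = {"test_topic_1": 0.85, "test_topicone_1": 0.15}


def pod_summary(listing, traffic_share):
    """Return {pod: (partition count per topic, estimated traffic share)}."""
    totals = {}
    for topics in listing.values():
        for topic, parts in topics.items():
            totals[topic] = totals.get(topic, 0) + len(parts)
    summary = {}
    for pod, topics in listing.items():
        counts = {topic: len(parts) for topic, parts in topics.items()}
        load = sum(traffic_share[t] * n / totals[t] for t, n in counts.items())
        summary[pod] = (counts, round(load, 3))
    return summary


for name, listing in (("Cooperative Sticky", COOPERATIVE_STICKY),
                      ("Round Robin", ROUND_ROBIN)):
    for pod, (counts, load) in pod_summary(listing, TRAFFIC_SHARE).items():
        print(f"{name:18} {pod}: {counts} estimated traffic share {load:.1%}")
```

With cooperative sticky, Pod1 holds 12 of the 18 test_topic_1 partitions and Pod2 holds 6, while round robin gives each pod 9. Under the stated assumptions that corresponds to roughly 62% of the traffic landing on Pod1 with cooperative sticky, against 50% with round robin.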