[ 
https://issues.apache.org/jira/browse/KAFKA-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827134#comment-17827134
 ] 

BDeus edited comment on KAFKA-16361 at 3/14/24 3:46 PM:
--------------------------------------------------------

Is it a regression related to this feature 
[https://issues.apache.org/jira/browse/KAFKA-14450|http://example.com] ?


was (Author: baz33):
Is it a regression related to this ticket 
https://issues.apache.org/jira/browse/KAFKA-14867 ?

> Rack aware sticky assignor minQuota violations
> ----------------------------------------------
>
>                 Key: KAFKA-16361
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16361
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 3.5.1, 3.7.0, 3.6.1
>            Reporter: Luke D
>            Priority: Major
>         Attachments: illegalstateexception.log
>
>
> In some low topic replication scenarios the rack aware assignment in the 
> StickyAssignor fails to balance consumers to its own expectations and throws 
> an IllegalStateException, commonly crashing the application (depending on 
> application implementation). While uncommon the error is deterministic, and 
> so persists until the replication state changes. 
>  
> We have observed this in the wild in 3.5.1, and 3.6.1. We have reproduced it 
> locally in a test case in 3.6.1 and 3.7.0 (3.5.1 we did not try but likely 
> would also be reproducible there) 
>  
> Here is the error and stack from our test case against 3.7.0
> {code:java}
> We haven't reached the expected number of members with more than the minQuota 
> partitions, but no more partitions to be assigned
> java.lang.IllegalStateException: We haven't reached the expected number of 
> members with more than the minQuota partitions, but no more partitions to be 
> assigned
>     at 
> org.apache.kafka.clients.consumer.internals.AbstractStickyAssignor$ConstrainedAssignmentBuilder.verifyUnfilledMembers(AbstractStickyAssignor.java:820)
>     at 
> org.apache.kafka.clients.consumer.internals.AbstractStickyAssignor$ConstrainedAssignmentBuilder.build(AbstractStickyAssignor.java:652)
>     at 
> org.apache.kafka.clients.consumer.internals.AbstractStickyAssignor.assignPartitions(AbstractStickyAssignor.java:113)
>     at 
> org.apache.kafka.clients.consumer.internals.AbstractPartitionAssignor.assign(AbstractPartitionAssignor.java:91)
>  {code}
> Here is a specific test case from 3.7.0 that fails when passed to 
> StickyAssignor.assign:
> {code:java}
> Cluster(id = cluster-id, nodes = [host-3:1 (id: 4 rack: rack-3), host-3:1 
> (id: 3 rack: rack-3), host-2:1 (id: 2 rack: rack-2), host-1:1 (id: 1 rack: 
> rack-1)], partitions = [Partition(topic = topic_name, partition = 57, leader 
> = 4, replicas = [4], isr = [4], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 90, leader = 2, replicas = [2], isr = [2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 28, leader = 
> 3, replicas = [3], isr = [3], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 53, leader = 4, replicas = [4], isr = [4], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 86, leader = 
> 2, replicas = [2], isr = [2], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 24, leader = 4, replicas = [4,3,1], isr = [4,3,1], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 49, leader = 
> 1, replicas = [1,2], isr = [1,2], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 82, leader = 4, replicas = [4,2], isr = [4,2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 20, leader = 
> 2, replicas = [2,1], isr = [2,1], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 45, leader = 2, replicas = [2], isr = [2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 78, leader = 
> 1, replicas = [1], isr = [1], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 16, leader = 4, replicas = [4], isr = [4], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 41, leader = 
> 1, replicas = [1,2], isr = [1,2], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 74, leader = 4, replicas = [4,3,1], isr = [4,3,1], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 12, leader = 
> 2, replicas = [2], isr = [2], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 37, leader = 1, replicas = [1], isr = [1], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 70, leader = 
> 2, replicas = [2], isr = [2], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 8, leader = 4, replicas = [4,3,1], isr = [4,3,1], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 33, leader = 
> 1, replicas = [1], isr = [1], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 66, leader = 4, replicas = [4], isr = [4], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 4, leader = 
> 2, replicas = [2], isr = [2], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 29, leader = 3, replicas = [3,1,2], isr = [3,1,2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 62, leader = 
> 3, replicas = [3,2,1], isr = [3,2,1], offlineReplicas = []), Partition(topic 
> = topic_name, partition = 95, leader = 4, replicas = [4,3,2], isr = [4,3,2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 0, leader = 
> 4, replicas = [4,1,2], isr = [4,1,2], offlineReplicas = []), Partition(topic 
> = topic_name, partition = 25, leader = 4, replicas = [4,3,2], isr = [4,3,2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 58, leader = 
> 4, replicas = [4,3,1], isr = [4,3,1], offlineReplicas = []), Partition(topic 
> = topic_name, partition = 91, leader = 4, replicas = [4,3,1], isr = [4,3,1], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 21, leader = 
> 4, replicas = [4], isr = [4], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 54, leader = 4, replicas = [4], isr = [4], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 87, leader = 
> 4, replicas = [4,3], isr = [4,3], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 17, leader = 2, replicas = [2,1], isr = [2,1], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 50, leader = 
> 4, replicas = [4], isr = [4], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 83, leader = 4, replicas = [4,3,2], isr = [4,3,2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 13, leader = 
> 4, replicas = [4,1], isr = [4,1], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 46, leader = 4, replicas = [4,2], isr = [4,2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 79, leader = 
> 4, replicas = [4], isr = [4], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 9, leader = 4, replicas = [4,2], isr = [4,2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 42, leader = 
> 2, replicas = [2], isr = [2], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 75, leader = 4, replicas = [4,2], isr = [4,2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 5, leader = 
> 4, replicas = [4], isr = [4], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 38, leader = 4, replicas = [4], isr = [4], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 71, leader = 
> 3, replicas = [3,2], isr = [3,2], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 1, leader = 2, replicas = [2], isr = [2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 34, leader = 
> 4, replicas = [4], isr = [4], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 67, leader = 4, replicas = [4], isr = [4], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 30, leader = 
> 3, replicas = [3], isr = [3], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 63, leader = 1, replicas = [1,2], isr = [1,2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 26, leader = 
> 4, replicas = [4,3], isr = [4,3], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 59, leader = 4, replicas = [4,2], isr = [4,2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 92, leader = 
> 4, replicas = [4], isr = [4], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 22, leader = 4, replicas = [4,2], isr = [4,2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 55, leader = 
> 4, replicas = [4,1,2], isr = [4,1,2], offlineReplicas = []), Partition(topic 
> = topic_name, partition = 88, leader = 2, replicas = [2], isr = [2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 18, leader = 
> 4, replicas = [4], isr = [4], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 51, leader = 2, replicas = [2], isr = [2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 84, leader = 
> 4, replicas = [4,2,1], isr = [4,2,1], offlineReplicas = []), Partition(topic 
> = topic_name, partition = 14, leader = 1, replicas = [1], isr = [1], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 47, leader = 
> 4, replicas = [4,3,1], isr = [4,3,1], offlineReplicas = []), Partition(topic 
> = topic_name, partition = 80, leader = 4, replicas = [4,1,2], isr = [4,1,2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 10, leader = 
> 1, replicas = [1,2], isr = [1,2], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 43, leader = 2, replicas = [2,1], isr = [2,1], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 76, leader = 
> 4, replicas = [4,1], isr = [4,1], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 6, leader = 3, replicas = [3,2], isr = [3,2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 39, leader = 
> 3, replicas = [3,1], isr = [3,1], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 72, leader = 4, replicas = [4], isr = [4], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 2, leader = 
> 4, replicas = [4,3,2], isr = [4,3,2], offlineReplicas = []), Partition(topic 
> = topic_name, partition = 35, leader = 4, replicas = [4,3,1], isr = [4,3,1], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 68, leader = 
> 4, replicas = [4,3], isr = [4,3], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 93, leader = 4, replicas = [4], isr = [4], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 31, leader = 
> 2, replicas = [2], isr = [2], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 64, leader = 4, replicas = [4], isr = [4], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 89, leader = 
> 3, replicas = [3], isr = [3], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 27, leader = 3, replicas = [3], isr = [3], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 60, leader = 
> 4, replicas = [4,1,2], isr = [4,1,2], offlineReplicas = []), Partition(topic 
> = topic_name, partition = 85, leader = 2, replicas = [2,1], isr = [2,1], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 23, leader = 
> 3, replicas = [3], isr = [3], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 56, leader = 2, replicas = [2], isr = [2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 81, leader = 
> 4, replicas = [4,2], isr = [4,2], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 19, leader = 2, replicas = [2], isr = [2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 52, leader = 
> 4, replicas = [4,3,2], isr = [4,3,2], offlineReplicas = []), Partition(topic 
> = topic_name, partition = 77, leader = 4, replicas = [4,1], isr = [4,1], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 15, leader = 
> 3, replicas = [3], isr = [3], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 48, leader = 3, replicas = [3,2], isr = [3,2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 73, leader = 
> 3, replicas = [3], isr = [3], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 11, leader = 3, replicas = [3,1], isr = [3,1], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 44, leader = 
> 4, replicas = [4], isr = [4], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 69, leader = 4, replicas = [4,3], isr = [4,3], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 7, leader = 
> 2, replicas = [2], isr = [2], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 40, leader = 4, replicas = [4], isr = [4], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 65, leader = 
> 3, replicas = [3], isr = [3], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 3, leader = 4, replicas = [4,3], isr = [4,3], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 36, leader = 
> 2, replicas = [2,1], isr = [2,1], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 61, leader = 4, replicas = [4,3,2], isr = [4,3,2], 
> offlineReplicas = []), Partition(topic = topic_name, partition = 94, leader = 
> 4, replicas = [4,3], isr = [4,3], offlineReplicas = []), Partition(topic = 
> topic_name, partition = 32, leader = 2, replicas = [2,1], isr = [2,1], 
> offlineReplicas = [])], controller = host-1:1 (id: 1 rack: rack-1))
>  
> GroupSubscription(subscriptions={Consumer-12=Subscription(topics=[topic_name],
>  userDataSize=4, ownedPartitions=[], groupInstanceId=null, generationId=-1, 
> rackId=rack-1), Consumer-8=Subscription(topics=[topic_name], userDataSize=4, 
> ownedPartitions=[], groupInstanceId=null, generationId=-1, rackId=rack-3), 
> Consumer-10=Subscription(topics=[topic_name], userDataSize=4, 
> ownedPartitions=[], groupInstanceId=null, generationId=-1, rackId=rack-2), 
> Consumer-7=Subscription(topics=[topic_name], userDataSize=4, 
> ownedPartitions=[], groupInstanceId=null, generationId=-1, rackId=rack-2), 
> Consumer-11=Subscription(topics=[topic_name], userDataSize=4, 
> ownedPartitions=[], groupInstanceId=null, generationId=-1, rackId=rack-3), 
> Consumer-9=Subscription(topics=[topic_name], userDataSize=4, 
> ownedPartitions=[], groupInstanceId=null, generationId=-1, rackId=rack-1), 
> Consumer-0=Subscription(topics=[topic_name], userDataSize=4, 
> ownedPartitions=[], groupInstanceId=null, generationId=-1, rackId=rack-1), 
> Consumer-2=Subscription(topics=[topic_name], userDataSize=4, 
> ownedPartitions=[], groupInstanceId=null, generationId=-1, rackId=rack-3), 
> Consumer-1=Subscription(topics=[topic_name], userDataSize=4, 
> ownedPartitions=[], groupInstanceId=null, generationId=-1, rackId=rack-2), 
> Consumer-4=Subscription(topics=[topic_name], userDataSize=4, 
> ownedPartitions=[], groupInstanceId=null, generationId=-1, rackId=rack-2), 
> Consumer-3=Subscription(topics=[topic_name], userDataSize=4, 
> ownedPartitions=[], groupInstanceId=null, generationId=-1, rackId=rack-1), 
> Consumer-6=Subscription(topics=[topic_name], userDataSize=4, 
> ownedPartitions=[], groupInstanceId=null, generationId=-1, rackId=rack-1), 
> Consumer-5=Subscription(topics=[topic_name], userDataSize=4, 
> ownedPartitions=[], groupInstanceId=null, generationId=-1, rackId=rack-3)})
> {code}
> A more general reproduction we have performed showed ~1/500 failure rate 
> using a 96 partition topic in a cluster with 4 nodes across 3 hosts on 3 
> racks being consumed by 13 consumers spread evenly across the 3 racks . Then 
> randomly replicating each partition to 1-3 nodes (all fully in sync). The 
> expectation here would be that any number of replicas >= 1 should be able to 
> be correctly assigned out, even if not rack sympathetic in 100% of scenarios. 
>  
> An actual assignment from one of the above sampled scenarios from 3.6.1 ended 
> with these assignments:
> 2 Consumers with 7 partitions
> 3 Consumers with 6 partitions
> 8 Consumers with 8 partitions
>  
> So, at least on that version, the assignment seems to be over assigning some 
> (3) consumers leaving some (3) consumers under-assigned and failing the 
> minQuota check. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to