[ 
https://issues.apache.org/jira/browse/KAFKA-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434579#comment-17434579
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12984:
------------------------------------------------

Mm, yeah, I should've remembered that the `ownedPartitions` field was used for 
this validation as well. I mean we did fix the [onJoinPrepare 
issue|https://issues.apache.org/jira/browse/KAFKA-12983] which was the only 
specific known edge case at the time which could lead to the `ownedPartitions` 
set being incorrect/out of date like this. Not surprised there might be others 
out there though, in fact I originally wanted to expand the fix in 
[#10986|https://github.com/apache/kafka/pull/10986] to reset the 
generation/ownedPartitions in some other cases to be safe but was talked out of 
it (not trying to shift the blame here, I still should've caught this!)

My point is, even if we do either 1/2, users won't be able to write a custom 
cooperative assignor without running into this. It's a huge bummer if we can't 
trust the `ownedPartitions` to be accurate – the whole point of adding it to 
the ConsumerProtocol was to use it for this verification and also to provide 
this info to users who may want to write a custom cooperative assignor :/ 

I suppose for the time being we could/should consider putting in a hotfix for 
the CooperativeStickyAssignor while we work out a better long-term solution – 
for example a public interface/abstract class like 
CooperativeConsumerPartitionAssignor that users can implement/extend to for a 
custom cooperative assignor, which at the very least stores the generation like 
we do in the CooperativeStickyAssignor and makes it available for the user as 
well as to the ConsumerCoordinator for it to know when to invalidate a member's 
ownedPartitions during this validation step – does that make sense? WDYT?

cc also [~guozhang] [~hachikuji]

FWIW I'd prefer either option (2), or possibly a modified version of (1) that 
only uses the generation rather than trying to get the partitions from the 
assignor's userdata. But again, all of these will only work for the 
CooperativeStickyAssignor, so we may need to start thinking about a KIP if we 
don't feel like we can trust the ownedPartitions to be accurate

> Cooperative sticky assignor can get stuck with invalid SubscriptionState 
> input metadata
> ---------------------------------------------------------------------------------------
>
>                 Key: KAFKA-12984
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12984
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>            Reporter: A. Sophie Blee-Goldman
>            Assignee: A. Sophie Blee-Goldman
>            Priority: Blocker
>             Fix For: 2.8.1, 3.0.0
>
>         Attachments: image-2021-10-25-11-53-40-221.png, 
> log-events-viewer-result-kafka.numbers, logs-insights-results-kafka.csv, 
> logs-insights-results-kafka.numbers
>
>
> Some users have reported seeing their consumer group become stuck in the 
> CompletingRebalance phase when using the cooperative-sticky assignor. Based 
> on the request metadata we were able to deduce that multiple consumers were 
> reporting the same partition(s) in their "ownedPartitions" field of the 
> consumer protocol. Since this is an invalid state, the input causes the 
> cooperative-sticky assignor to detect that something is wrong and throw an 
> IllegalStateException. If the consumer application is set up to simply retry, 
> this will cause the group to appear to hang in the rebalance state.
> The "ownedPartitions" field is encoded based on the ConsumerCoordinator's 
> SubscriptionState, which was assumed to always be up to date. However there 
> may be cases where the consumer has dropped out of the group but fails to 
> clear the SubscriptionState, allowing it to report some partitions as owned 
> that have since been reassigned to another member.
> We should (a) fix the sticky assignment algorithm to resolve cases of 
> improper input conditions by invalidating the "ownedPartitions" in cases of 
> double ownership, and (b) shore up the ConsumerCoordinator logic to better 
> handle rejoining the group and keeping its internal state consistent. See 
> KAFKA-12983 for more details on (b)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to