[ 
https://issues.apache.org/jira/browse/KAFKA-10793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-10793.
--------------------------------------------
    Fix Version/s: 2.7.1
       Resolution: Fixed

> Race condition in FindCoordinatorFuture permanently severs connection to 
> group coordinator
> ------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-10793
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10793
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer, streams
>    Affects Versions: 2.5.0
>            Reporter: A. Sophie Blee-Goldman
>            Assignee: A. Sophie Blee-Goldman
>            Priority: Critical
>             Fix For: 2.8.0, 2.7.1
>
>
> Pretty much as soon as we started actively monitoring the 
> _last-rebalance-seconds-ago_ metric in our Kafka Streams test environment, we 
> started seeing something weird. Every so often one of the StreamThreads (i.e. 
> a single Consumer instance) would appear to permanently fall out of the group, 
> as evidenced by a monotonically increasing _last-rebalance-seconds-ago_. We 
> inject artificial network failures every few hours at most, so the group 
> rebalances quite often, but this one consumer never rejoins and shows no other 
> symptoms (besides a slight drop in throughput, since the remaining threads 
> have to take over its work). We're confident the problem lives in the client 
> layer, since the logs confirmed that the unhealthy consumer was still calling 
> poll. It was also calling Consumer#committed in its main poll loop, and that 
> call was consistently failing with a TimeoutException.
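> For reference, here is a minimal sketch of the kind of per-consumer check we 
> run (the class name, partition, and timeout below are illustrative; only the 
> metric name and the committed(Set, Duration) call come from the actual setup):
> {code:java}
> import java.time.Duration;
> import java.util.Collections;
> import org.apache.kafka.clients.consumer.Consumer;
> import org.apache.kafka.common.TopicPartition;
> import org.apache.kafka.common.errors.TimeoutException;
>
> public class RebalanceAgeCheck {
>     static void check(Consumer<?, ?> consumer, TopicPartition tp) {
>         // A last-rebalance-seconds-ago value that only ever grows means this
>         // member has not gone through a rebalance since we started watching it.
>         consumer.metrics().forEach((name, metric) -> {
>             if (name.name().equals("last-rebalance-seconds-ago"))
>                 System.out.println(name.name() + " = " + metric.metricValue());
>         });
>
>         try {
>             // On the unhealthy member this call kept timing out, since the
>             // group coordinator was never re-discovered.
>             consumer.committed(Collections.singleton(tp), Duration.ofSeconds(5));
>         } catch (TimeoutException e) {
>             System.out.println("committed() timed out: " + e.getMessage());
>         }
>     }
> }
> {code}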
> When I attached a remote debugger to an instance experiencing this issue, the 
> network client's connection to the group coordinator (the one that uses 
> MAX_VALUE - node.id as the coordinator id) was in the DISCONNECTED state. But 
> for some reason it never tried to re-establish this connection, although it 
> did successfully connect to that same broker through the "normal" connection 
> (i.e. the one that just uses node.id).
> The tl;dr is that the AbstractCoordinator's FindCoordinatorRequest has failed 
> (presumably due to a disconnect), but the _findCoordinatorFuture_ is non-null 
> so a new request is never sent. This shouldn't be possible since the 
> FindCoordinatorResponseHandler is supposed to clear the 
> _findCoordinatorFuture_ when the future is completed. But somehow that didn't 
> happen, so the consumer continues to assume there's still a FindCoordinator 
> request in flight and never even notices that it's dropped out of the group.
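> To make the failure mode concrete, here is a rough sketch of the 
> single-in-flight-lookup pattern (written with a plain CompletableFuture rather 
> than the real AbstractCoordinator/RequestFuture machinery, so the names below 
> are illustrative, not the actual code):
> {code:java}
> import java.util.concurrent.CompletableFuture;
>
> // Simplified stand-in for the coordinator lookup described above.
> public class CoordinatorLookupSketch {
>
>     // Non-null while a FindCoordinator lookup is believed to be in flight.
>     private CompletableFuture<Void> findCoordinatorFuture;
>
>     public synchronized CompletableFuture<Void> lookupCoordinator() {
>         if (findCoordinatorFuture != null) {
>             // A lookup is (supposedly) already in flight, so don't send another.
>             // If the cached future failed without ever being cleared, we stay
>             // stuck in this branch forever and never retry the lookup.
>             return findCoordinatorFuture;
>         }
>         CompletableFuture<Void> future = sendFindCoordinatorRequest();
>         findCoordinatorFuture = future;
>         // Clearing the cached future on completion is what allows a failed
>         // lookup to be retried. A failure path that completes (or abandons)
>         // the future without running this step leaves findCoordinatorFuture
>         // permanently non-null -- which matches the behavior we're seeing.
>         future.whenComplete((result, error) -> clearFindCoordinatorFuture());
>         return future;
>     }
>
>     private synchronized void clearFindCoordinatorFuture() {
>         findCoordinatorFuture = null;
>     }
>
>     private CompletableFuture<Void> sendFindCoordinatorRequest() {
>         // Placeholder for the actual FindCoordinator round trip to a broker.
>         return CompletableFuture.completedFuture(null);
>     }
>
>     public static void main(String[] args) {
>         new CoordinatorLookupSketch().lookupCoordinator().join();
>     }
> }
> {code}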
> These are the only confirmed findings so far; we have some guesses as well, 
> which I'll leave in the comments. Note that we only noticed this thanks to the 
> newly added _last-rebalance-seconds-ago_ metric, and there's no reason to 
> believe this bug hasn't been flying under the radar since the Consumer's 
> inception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
