[
https://issues.apache.org/jira/browse/KAFKA-10793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
A. Sophie Blee-Goldman resolved KAFKA-10793.
--------------------------------------------
Fix Version/s: 2.7.1
Resolution: Fixed
> Race condition in FindCoordinatorFuture permanently severs connection to
> group coordinator
> ------------------------------------------------------------------------------------------
>
> Key: KAFKA-10793
> URL: https://issues.apache.org/jira/browse/KAFKA-10793
> Project: Kafka
> Issue Type: Bug
> Components: consumer, streams
> Affects Versions: 2.5.0
> Reporter: A. Sophie Blee-Goldman
> Assignee: A. Sophie Blee-Goldman
> Priority: Critical
> Fix For: 2.8.0, 2.7.1
>
>
> Pretty much as soon as we started actively monitoring the
> _last-rebalance-seconds-ago_ metric in our Kafka Streams test environment, we
> started seeing something weird. Every so often one of the StreamThreads (i.e. a
> single Consumer instance) would appear to permanently fall out of the group,
> as evidenced by a monotonically increasing _last-rebalance-seconds-ago_. We
> inject artificial network failures every few hours at most, so the group
> rebalances quite often. But the one consumer never rejoins, with no other
> symptoms (besides a slight drop in throughput since the remaining threads had
> to take over this member's work). We're confident that the problem exists in
> the client layer, since the logs confirmed that the unhealthy consumer was
> still calling poll. It was also calling Consumer#committed in its main poll
> loop, and those calls were consistently failing with a TimeoutException.
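> For illustration, this is roughly the shape of the main loop in the affected
> client (a minimal sketch, not our actual test code; the topic, group id, and
> timeouts here are made up):
> {code:java}
> import java.time.Duration;
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> import org.apache.kafka.clients.consumer.KafkaConsumer;
> import org.apache.kafka.common.errors.TimeoutException;
> import org.apache.kafka.common.serialization.StringDeserializer;
>
> public class StuckConsumerSketch {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
>         props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-group");
>         props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
>                   StringDeserializer.class.getName());
>         props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
>                   StringDeserializer.class.getName());
>
>         try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
>             consumer.subscribe(Collections.singleton("input-topic"));
>             while (true) {
>                 // poll() keeps returning, so the thread looks alive from the outside...
>                 consumer.poll(Duration.ofMillis(100));
>                 try {
>                     // ...but committed() has to go through the group coordinator,
>                     // and in the stuck state it times out on every call
>                     consumer.committed(consumer.assignment(), Duration.ofSeconds(5));
>                 } catch (TimeoutException e) {
>                     // consistently thrown by the unhealthy member
>                 }
>             }
>         }
>     }
> }
> {code}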
> When I attached a remote debugger to an instance experiencing this issue, the
> network client's connection to the group coordinator (the one that uses
> MAX_VALUE - node.id as the coordinator id) was in the DISCONNECTED state. But
> for some reason it never tried to re-establish this connection, although it
> did successfully connect to that same broker through the "normal" connection
> (i.e. the one that just uses node.id).
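> For reference, the coordinator is tracked as a synthetic Node whose id is
> derived from the real broker id, something like the following (paraphrased,
> not the exact AbstractCoordinator source; the broker details are made up):
> {code:java}
> import org.apache.kafka.common.Node;
>
> public class CoordinatorIdSketch {
>     public static void main(String[] args) {
>         Node broker = new Node(3, "broker-3.example", 9092);
>         // the coordinator gets its own connection, kept distinct from the
>         // "normal" connection to the same broker by a synthetic node id
>         int coordinatorConnectionId = Integer.MAX_VALUE - broker.id();
>         Node coordinator =
>             new Node(coordinatorConnectionId, broker.host(), broker.port());
>         System.out.println(coordinator); // id 2147483644 for broker 3
>     }
> }
> {code}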
> The tl;dr is that the AbstractCoordinator's FindCoordinatorRequest has failed
> (presumably due to a disconnect), but the _findCoordinatorFuture_ is non-null
> so a new request is never sent. This shouldn't be possible since the
> FindCoordinatorResponseHandler is supposed to clear the
> _findCoordinatorFuture_ when the future is completed. But somehow that didn't
> happen, so the consumer continues to assume there's still a FindCoordinator
> request in flight and never even notices that it's dropped out of the group.
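> To make the race concrete, here is a heavily simplified sketch of the pattern
> (CompletableFuture stands in for Kafka's internal RequestFuture; this is not
> the actual AbstractCoordinator source):
> {code:java}
> import java.util.concurrent.CompletableFuture;
>
> class CoordinatorLookupSketch {
>     private CompletableFuture<Void> findCoordinatorFuture; // guarded by `this`
>
>     synchronized CompletableFuture<Void> lookupCoordinator() {
>         // a new FindCoordinator request is only sent when no attempt
>         // appears to be in flight
>         if (findCoordinatorFuture == null)
>             findCoordinatorFuture = sendFindCoordinatorRequest();
>         return findCoordinatorFuture;
>     }
>
>     // the response handler is supposed to call this once the request
>     // succeeds or fails, which is what re-enables future lookups
>     synchronized void clearFindCoordinatorFuture() {
>         findCoordinatorFuture = null;
>     }
>
>     private CompletableFuture<Void> sendFindCoordinatorRequest() {
>         return new CompletableFuture<>(); // completed by the network layer
>     }
> }
> {code}
> If the future is ever completed without clearFindCoordinatorFuture() running,
> the field stays non-null forever, lookupCoordinator() keeps returning the
> stale future, and no FindCoordinator request is ever sent again, which matches
> exactly what we observed.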
> These are the only confirmed findings so far; however, we have some guesses,
> which I'll leave in the comments. Note that we only noticed this thanks to the
> newly added _last-rebalance-seconds-ago_ metric, and there's no reason to
> believe this bug hasn't been flying under the radar since the Consumer's
> inception.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)