Sophie Blee-Goldman created KAFKA-9140:
------------------------------------------

             Summary: Consumer gets stuck rejoining the group indefinitely
                 Key: KAFKA-9140
                 URL: https://issues.apache.org/jira/browse/KAFKA-9140
             Project: Kafka
          Issue Type: Bug
          Components: clients, consumer
    Affects Versions: 2.4.0
            Reporter: Sophie Blee-Goldman


There appears to be a race condition that can cause a rejoining member to get 
stuck initiating a rejoin indefinitely. The relevant logs are attached, but in 
short the consumer repeats this message (and nothing else) continuously until 
it is killed or shut down:

 
{code:java}
[2019-11-05 01:53:54,699] INFO [Consumer clientId=StreamsUpgradeTest-a4c1cff8-7883-49cd-82da-d2cdfc33a2f0-StreamThread-1-consumer, groupId=StreamsUpgradeTest] Generation data was cleared by heartbeat thread. Initiating rejoin. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
{code}
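
To check other logs for the same symptom, a small scanner along the following lines may help. This is only a sketch: it assumes each log entry occupies a single line in the log file, and the function names and the repeat threshold are hypothetical choices for illustration, not part of Kafka.

```python
import re
from collections import Counter

# Message text taken from the log excerpt in this report.
REJOIN_MSG = "Generation data was cleared by heartbeat thread. Initiating rejoin."
# Extracts the clientId value up to the next comma, bracket, or whitespace.
CLIENT_ID = re.compile(r"clientId=([^,\]\s]+)")


def count_rejoin_messages(lines):
    """Return a Counter mapping clientId to the number of rejoin messages seen."""
    counts = Counter()
    for line in lines:
        if REJOIN_MSG in line:
            match = CLIENT_ID.search(line)
            counts[match.group(1) if match else "<unknown>"] += 1
    return counts


def stuck_clients(lines, threshold=100):
    """Return clientIds that logged the rejoin message at least `threshold` times.

    The threshold is an arbitrary, hypothetical cutoff; tune it to the log volume.
    """
    counts = count_rejoin_messages(lines)
    return sorted(cid for cid, n in counts.items() if n >= threshold)
```

For example, `stuck_clients(open("streams.log"))` would list the suspect consumers, where `streams.log` stands in for whichever log file is being inspected.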
 

This log message was added as part of the bugfix (PR #7460) for a related race 
condition, KAFKA-8104.

This issue was uncovered by the Streams version probing upgrade test, which 
fails intermittently. Here are the failure rates for the different system test 
runs so far:

trunk (cooperative): 1/1 and 2/10 failures

2.4 (cooperative): 0/10 and 1/15 failures

trunk (eager): 0/10 failures
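
For a rough comparison, the batches listed above can be pooled per configuration. The sketch below simply sums failures and runs. Treating the separate batches as comparable is an assumption, and the sample sizes are small, so the resulting rates are indicative at best.

```python
# (failures, runs) per batch, copied from the counts listed above.
RUNS = {
    "trunk (cooperative)": [(1, 1), (2, 10)],
    "2.4 (cooperative)": [(0, 10), (1, 15)],
    "trunk (eager)": [(0, 10)],
}


def pooled(batches):
    """Sum failures and runs across batches; return (failures, runs)."""
    failures = sum(f for f, _ in batches)
    runs = sum(r for _, r in batches)
    return failures, runs


for name, batches in RUNS.items():
    failures, runs = pooled(batches)
    print(f"{name}: {failures}/{runs} ({100 * failures / runs:.1f}%)")
```

Pooled this way, the counts come to 3/11 for trunk (cooperative), 1/25 for 2.4 (cooperative), and 0/10 for trunk (eager).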

I've kicked off some high-repeat runs to complete overnight and hopefully shed 
more light.

Note that I have also kicked off runs of both 2.4 and trunk with the PR for 
KAFKA-8104 reverted. Both of them saw 2/10 failures, due to hitting the bug 
that was fixed by PR #7460. It is therefore unclear whether PR #7460 introduced 
a new race condition, or merely uncovered an existing one that was previously 
hidden because the test would have failed first due to KAFKA-8104.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
