Hey Peter,

It does sound like you may have hit https://issues.apache.org/jira/browse/KAFKA-9752
You will need to upgrade your brokers in order to get the fix, since it's a broker-side issue.

On Tue, Feb 9, 2021 at 2:48 AM Péter Sinóros-Szabó <peter.sinoros-sz...@transferwise.com.invalid> wrote:

> Hi,
>
> I have an application running with 6 instances on Kubernetes. All 6
> instances (pods) are the same, using the same consumer group id.
> Recently we have seen that when the application is restarted (rolling
> restart on K8s), the triggered rebalancing sometimes doesn't finish at all
> and the Kafka client gets stuck in rebalancing. Occasionally it finishes
> after 30-60 minutes, sometimes it doesn't.
>
> If it is stuck, and we stop the application, wait until
> kafka-consumer-groups.sh no longer shows the group, and then restart the
> application, the initial rebalancing finishes just fine and all is
> good... until some hours or days later a rolling restart starts it all
> over again.
>
> I grabbed some logs from the time when it was continuously rebalancing.
> Logs are mixed from 6 pods, but all pods have the same logs. (The Kafka
> brokers appear to be running on localhost, but that's not the case; traffic
> is routed through a service mesh...)
>
> 2021-02-05T17:00:18.261422532Z: fin-df8d589bd-95bsz: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] Group coordinator localhost:9204 (id:
> 2147482641 rack: null) is unavailable or invalid
> 2021-02-05T17:00:18.261454952Z: fin-df8d589bd-95bsz: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] Rebalance failed.:
> org.apache.kafka.common.errors.DisconnectException: null
>
> 2021-02-05T17:00:18.499108799Z: fin-df8d589bd-85zf9: INFO: Camel (camel-1)
> thread #42 - KafkaConsumer[topicY]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-43, groupId=fin] Discovered group coordinator
> localhost:9204 (id: 2147482641 rack: null)
> 2021-02-05T17:00:18.499300612Z: fin-df8d589bd-85zf9: INFO: Camel (camel-1)
> thread #42 - KafkaConsumer[topicY]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-43, groupId=fin] (Re-)joining group
>
> There are no more logs from the Kafka consumer. It seems that the
> rebalancing doesn't finish at all: I don't see logs in any of the pods
> about the partition assignments being calculated, so my _guess_ is that the
> rebalancing gets stuck in the PreparingRebalance phase and never progresses
> from there.
>
> --- About 2 minutes 10 seconds later (sometimes I see a difference here of
> 1 minute 10 seconds).
>
> 2021-02-05T17:02:29.615402388Z: fin-df8d589bd-95bsz: INFO:
> kafka-coordinator-heartbeat-thread | fin:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-9, groupId=fin] Group coordinator localhost:9204 (id:
> 2147482641 rack: null) is unavailable or invalid, will attempt rediscovery
> 2021-02-05T17:02:29.615520075Z: fin-df8d589bd-95bsz: INFO: Camel (camel-1)
> thread #28 - KafkaConsumer[twcard.plastic.events.finance.reconciliation]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-29, groupId=fin] Rebalance failed.:
> org.apache.kafka.common.errors.RebalanceInProgressException: The group is
> rebalancing, so a rejoin is needed.
>
> --- This last line may also show a different reason for the failed
> rebalance:
> "Rebalance failed.: org.apache.kafka.common.errors.DisconnectException:
> null"
>
> 2021-02-05T17:02:29.74932507Z: fin-df8d589bd-j8mw6: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] Discovered group coordinator
> localhost:9204 (id: 2147482641 rack: null)
> 2021-02-05T17:02:29.749488204Z: fin-df8d589bd-j8mw6: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] (Re-)joining group
>
> ... and the same repeats forever.
>
> Kafka Client version: 2.6.x
> Kafka Broker version: 2.4.1
>
>
> What can be the reason for this failing rebalance?
>
> I found this bug affecting 2.4.1; is it possible that I hit this issue?
> https://issues.apache.org/jira/browse/KAFKA-9752
> "Consumer rebalance can be stuck after new member timeout with old
> JoinGroup version"
>
>
> Thanks for the help,
> Peter
>
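
Two details in the quoted logs can be checked with a few lines of Python. This is only a sketch, not something from the original thread. It measures the gap between a "(Re-)joining group" line and the next "unavailable or invalid" line, and it decodes the coordinator id. The decoding rests on an assumption worth verifying against your own cluster: as far as I know, the Java consumer uses a dedicated connection to the coordinator and labels it with Integer.MAX_VALUE minus the broker id, so the large id in the logs should map back to a real broker. The function names and the shortened `LOG_LINES` (two quoted log lines with the thread and class fields dropped) are illustrative.

```python
import re
from datetime import datetime

INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE

# Two lines from the quoted logs, shortened to the fields used below.
LOG_LINES = [
    "2021-02-05T17:00:18.499300612Z: fin-df8d589bd-85zf9: (Re-)joining group",
    "2021-02-05T17:02:29.615402388Z: fin-df8d589bd-95bsz: Group coordinator "
    "localhost:9204 (id: 2147482641 rack: null) is unavailable or invalid",
]


def parse_timestamp(line):
    """Parse the leading timestamp, truncating nanoseconds to microseconds."""
    m = re.match(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(\d+)Z", line)
    if m is None:
        raise ValueError("no timestamp at start of line: " + line)
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    # The fraction can have fewer than 9 digits, so pad before truncating.
    micros = int(m.group(2)[:6].ljust(6, "0"))
    return base.replace(microsecond=micros)


def coordinator_broker_id(line):
    """Assumption: the logged coordinator id is INT_MAX minus the broker id."""
    m = re.search(r"\(id: (\d+) rack:", line)
    if m is None:
        raise ValueError("no coordinator id in line: " + line)
    return INT_MAX - int(m.group(1))


def rejoin_gap_seconds(join_line, failure_line):
    """Seconds between the join attempt and the following failure."""
    delta = parse_timestamp(failure_line) - parse_timestamp(join_line)
    return delta.total_seconds()


if __name__ == "__main__":
    print("gap (s):", round(rejoin_gap_seconds(LOG_LINES[0], LOG_LINES[1]), 3))
    print("coordinator broker id:", coordinator_broker_id(LOG_LINES[1]))
```

For these two lines the gap is about 131 seconds, which matches the "about 2 minutes 10 seconds" noted above, and the id decodes to broker 1006 under the stated assumption. If that matches a broker in the cluster, that is the broker whose server logs are worth checking for this group.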