Hey Peter,

It does sound like you may have hit
https://issues.apache.org/jira/browse/KAFKA-9752

You will need to upgrade your brokers to get the fix, since it's a
broker-side issue.
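
Until the brokers are upgraded, it may help to confirm from the client logs
which consumers are looping. Below is a minimal sketch, not a definitive
diagnostic: it assumes one log entry per line, it takes the "(Re-)joining
group" text and the clientId=... field from the logs quoted below, and it
assumes that a completed join logs a line containing "Successfully joined
group". The function name stuck_clients is hypothetical.

```python
import re
from collections import defaultdict

# Fragment copied from the quoted client logs.
JOIN = "(Re-)joining group"
# Assumption: text the client logs once a join completes.
DONE = "Successfully joined group"
# Extracts the client id from "[Consumer clientId=consumer-fin-3, groupId=fin]".
CLIENT = re.compile(r"clientId=([^,\]]+)")


def stuck_clients(lines):
    """Map each client id to its number of join attempts since the last success.

    Clients whose most recent join attempt was followed by a success line
    are left out of the result.
    """
    pending = defaultdict(int)
    for line in lines:
        match = CLIENT.search(line)
        if not match:
            continue  # not a consumer coordinator line
        client_id = match.group(1)
        if JOIN in line:
            pending[client_id] += 1
        elif DONE in line:
            pending[client_id] = 0
    return {cid: count for cid, count in pending.items() if count > 0}
```

Feeding it the merged log lines from all pods would list the clients that
keep rejoining without ever completing, which matches the symptom described
in the ticket.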

On Tue, Feb 9, 2021 at 2:48 AM Péter Sinóros-Szabó
<peter.sinoros-sz...@transferwise.com.invalid> wrote:

> Hi,
>
> I have an application running with 6 instances of it on Kubernetes. All 6
> instances (pods) are the same, using the same consumer group id.
> Recently we have seen that when the application is restarted (rolling
> restart on K8s), the triggered rebalance sometimes doesn't finish at all
> and the Kafka client gets stuck rebalancing. Occasionally it finishes
> after 30-60 minutes, sometimes it doesn't.
>
> If it is stuck and we stop the application, wait until
> kafka-consumer-groups.sh no longer shows the group, and then restart the
> application, the initial rebalance finishes just fine and all is
> good... until some hours or days later a rolling restart triggers it all
> again.
>
> I grabbed some logs from the time when it was continuously rebalancing.
> Logs are mixed from 6 pods, but all pods have the same logs. (The Kafka
> brokers appear to be running on localhost, but that's not the case; traffic
> is routed through a service mesh...)
>
> 2021-02-05T17:00:18.261422532Z:  fin-df8d589bd-95bsz: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] Group coordinator localhost:9204 (id:
> 2147482641 rack: null) is unavailable or invalid
> 2021-02-05T17:00:18.261454952Z:  fin-df8d589bd-95bsz: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] Rebalance failed.:
> org.apache.kafka.common.errors.DisconnectException: null
>
> 2021-02-05T17:00:18.499108799Z:  fin-df8d589bd-85zf9: INFO: Camel (camel-1)
> thread #42 - KafkaConsumer[topicY]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-43, groupId=fin] Discovered group coordinator
> localhost:9204 (id: 2147482641 rack: null)
> 2021-02-05T17:00:18.499300612Z:  fin-df8d589bd-85zf9: INFO: Camel (camel-1)
> thread #42 - KafkaConsumer[topicY]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-43, groupId=fin] (Re-)joining group
>
> There are no more logs from the Kafka consumer. It seems that the rebalance
> doesn't finish at all: I don't see logs in any of the pods about the
> partition assignments being calculated, so my _guess_ is that the rebalance
> gets stuck in the PreparingRebalance phase and never progresses from there.
>
> --- About 2 minutes 10 seconds later (sometimes I see a difference here of
> 1 minute 10 seconds).
>
> 2021-02-05T17:02:29.615402388Z:  fin-df8d589bd-95bsz: INFO:
> kafka-coordinator-heartbeat-thread | fin:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-9, groupId=fin] Group coordinator localhost:9204 (id:
> 2147482641 rack: null) is unavailable or invalid, will attempt rediscovery
> 2021-02-05T17:02:29.615520075Z:  fin-df8d589bd-95bsz: INFO: Camel (camel-1)
> thread #28 - KafkaConsumer[twcard.plastic.events.finance.reconciliation]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-29, groupId=fin] Rebalance failed.:
> org.apache.kafka.common.errors.RebalanceInProgressException: The group is
> rebalancing, so a rejoin is needed.
>
> --- This last line may show a different reason for the failed rebalance too:
> "Rebalance failed.: org.apache.kafka.common.errors.DisconnectException:
> null"
>
> 2021-02-05T17:02:29.74932507Z:  fin-df8d589bd-j8mw6: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] Discovered group coordinator
> localhost:9204 (id: 2147482641 rack: null)
> 2021-02-05T17:02:29.749488204Z:  fin-df8d589bd-j8mw6: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] (Re-)joining group
>
> ... and the same repeats forever.
>
> Kafka Client version: 2.6.x
> Kafka Broker version: 2.4.1
>
>
> What can be the reason for this failing rebalance?
>
> I found this bug affecting 2.4.1. Is it possible that I hit this issue?
> https://issues.apache.org/jira/browse/KAFKA-9752
> "Consumer rebalance can be stuck after new member timeout with old
> JoinGroup version"
>
>
> Thanks for the help,
> Peter
>
