Hi James,

I've filed a bug in JIRA: KAFKA-13563 <https://issues.apache.org/jira/browse/KAFKA-13563>. I'll investigate this issue.
Thank you.
Luke

On Wed, Dec 22, 2021 at 2:49 AM James Olsen <ja...@inaseq.com> wrote:

> This failure occurred again during this month's rolling OS security
> updates to the Brokers (no change to Broker version). I have also been
> able to reproduce it locally with the following process:
>
> 1. Start a 3 Broker cluster with a Topic having Replicas=3.
> 2. Start a Client with Producer and Consumer communicating over the Topic.
> 3. Stop the Broker that is acting as the Group Coordinator.
> 4. Observe successful Rediscovery of new Group Coordinator.
> 5. Restart the stopped Broker.
> 6. Stop the Broker that became the new Group Coordinator at step 4.
> 7. Observe "Rediscovery will be attempted" message but no "Discovered
> group coordinator" message.
>
> In short, Group Coordinator Rediscovery only works for the first Broker
> failover, not any subsequent failover.
>
> I conducted tests using 2.7.1 servers. The issue occurs with 2.7.1 and
> 2.7.2 Clients. The issue does not occur with 2.5.1 and 2.7.0 Clients.
> This makes me suspect that
> https://issues.apache.org/jira/browse/KAFKA-10793 introduced this issue.
>
> Regards, James.
>
> On 24/11/2021, at 14:35, James Olsen <ja...@inaseq.com> wrote:
>
> Luke,
>
> We did not upgrade to resolve the issue. We simply restarted the failing
> clients.
>
> Regards, James.
>
> On 23/11/2021, at 16:10, Luke Chen <show...@gmail.com> wrote:
>
> Hi James,
>
> > Bouncing the clients resolved the issue
>
> Could you please describe which version you upgraded to, to resolve this
> issue? That should also help other users encountering the same issue.
>
> And the code snippet you listed has existed since 2018; I don't think
> there is any problem there.
> Maybe there are bugs in other places that got fixed indirectly.
>
> Thank you.
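[Editor's note: for step 3 of the reproduction above, the broker currently acting as Group Coordinator can be looked up programmatically. A minimal sketch using the kafka-clients Admin API; the bootstrap address and group name are placeholders, and this requires a running cluster. The same information is also printed by `kafka-consumer-groups.sh --describe --state`.]

```java
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

public class FindGroupCoordinator {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        Properties props = new Properties();
        // Placeholder bootstrap address; point at any of the three brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // describeConsumerGroups reports the coordinator Node per group,
            // which tells us which broker to stop in step 3 (or step 6).
            ConsumerGroupDescription description = admin
                .describeConsumerGroups(Collections.singletonList("my-group"))
                .describedGroups().get("my-group").get();
            System.out.println("Group coordinator: " + description.coordinator());
        }
    }
}
```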
> Luke
>
> On Tue, Nov 23, 2021 at 10:27 AM James Olsen <ja...@inaseq.com> wrote:
>
>> We had a 2.5.1 Broker/Client system running for some time with regular
>> rolling OS upgrades to the Brokers without any problems. A while ago we
>> upgraded both Broker and Clients to 2.7.1, and now on the first rolling OS
>> upgrade to the 2.7.1 Brokers we encountered some Consumer issues. We have
>> a 3 Broker setup with min-ISRs configured to avoid any outage.
>>
>> So maybe we just got lucky 6 times in a row with 2.5.1, or maybe there
>> is an issue with 2.7.1.
>>
>> The observable symptom is a continuous stream of "The coordinator is not
>> available" messages when trying to commit offsets. It starts with the
>> usual messages you might expect during a rolling upgrade...
>>
>> 2021-11-22 04:41:25,269 WARN
>> [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
>> 'pool-7-thread-132' [Consumer clientId=consumer-MyService-group-58,
>> groupId=MyService-group] Offset commit failed on partition MyTopic-0 at
>> offset 866799313: The coordinator is loading and hence can't process
>> requests.
>>
>> ... then 5 minutes of all OK, then ...
>>
>> 2021-11-22 04:46:33,258 WARN
>> [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
>> 'pool-7-thread-132' [Consumer clientId=consumer-MyService-group-58,
>> groupId=MyService-group] Offset commit failed on partition MyTopic-0 at
>> offset 866803953: This is not the correct coordinator.
>>
>> 2021-11-22 04:46:33,258 INFO
>> [org.apache.kafka.clients.consumer.internals.AbstractCoordinator]
>> 'pool-7-thread-132' [Consumer clientId=consumer-MyService-group-58,
>> groupId=MyService-group] Group coordinator b-2.xxx.com:9094
>> (id: 2147483645 rack: null) is unavailable or
>> invalid due to cause: error response NOT_COORDINATOR.isDisconnected: false.
>> Rediscovery will be attempted.
>>
>> 2021-11-22 04:46:33,258 WARN [xxx.KafkaConsumerRunner]
>> 'pool-7-thread-132' Offset commit with offsets
>> {MyTopic-0=OffsetAndMetadata{offset=866803953, leaderEpoch=null,
>> metadata=''}} failed:
>> org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset
>> commit failed with a retriable exception. You should retry committing the
>> latest consumed offsets.
>> Caused by: org.apache.kafka.common.errors.NotCoordinatorException: This
>> is not the correct coordinator.
>>
>> ... then the following message for every subsequent attempt to commit
>> offsets ...
>>
>> 2021-11-22 04:46:33,284 WARN [xxx.KafkaConsumerRunner]
>> 'pool-7-thread-132' Offset commit with offsets
>> {MyTopic-0=OffsetAndMetadata{offset=866803954, leaderEpoch=82,
>> metadata=''}, MyOtherTopic-0=OffsetAndMetadata{offset=12654756,
>> leaderEpoch=79, metadata=''}} failed:
>> org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset
>> commit failed with a retriable exception. You should retry committing the
>> latest consumed offsets.
>> Caused by:
>> org.apache.kafka.common.errors.CoordinatorNotAvailableException: The
>> coordinator is not available.
>>
>> In the above example we are doing manual async commits, but we also had
>> offset commit failures for a different consumer group (observed through
>> lag monitoring) that uses auto-commit; it just didn't log the ongoing
>> failures. In both cases messages were still being processed; it was just
>> the commits not working. These are our two busiest consumer groups and
>> both have static Topic assignments. Other consumer groups continued OK.
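[Editor's note: the RetriableCommitFailedException message says to retry the commit, but as the logs above show, retrying forever against a coordinator that is never rediscovered just spins. A dependency-free sketch of a bounded retry policy; the CommitResult type and attempt limit are illustrative, not Kafka API.]

```java
import java.util.function.Supplier;

public class BoundedCommitRetry {
    enum CommitResult { OK, RETRIABLE, FATAL }

    /**
     * Retries a commit attempt while it fails retriably, up to maxAttempts.
     * Returns true only if the commit eventually succeeded.
     */
    static boolean commitWithRetry(Supplier<CommitResult> attempt, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            CommitResult result = attempt.get();
            if (result == CommitResult.OK) return true;
            if (result == CommitResult.FATAL) return false;
            // RETRIABLE: real code should back off here before the next try.
        }
        // Still failing retriably after maxAttempts: surface it (e.g. alert on
        // commit lag, as the auto-commit group needed) rather than retrying
        // silently forever.
        return false;
    }

    public static void main(String[] args) {
        // Simulated commit that fails retriably twice, then succeeds.
        int[] calls = {0};
        boolean ok = commitWithRetry(() -> {
            calls[0]++;
            return calls[0] < 3 ? CommitResult.RETRIABLE : CommitResult.OK;
        }, 5);
        System.out.println("committed=" + ok + " attempts=" + calls[0]); // committed=true attempts=3
    }
}
```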
>>
>> I've spent some time examining the (Java) client code and started to
>> wonder whether there is a bug or race condition that means the coordinator
>> never gets reassigned after being invalidated, and we simply keep hitting
>> the following short-circuit:
>>
>> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
>>
>> RequestFuture<Void> sendOffsetCommitRequest(final Map<TopicPartition,
>> OffsetAndMetadata> offsets) {
>>     if (offsets.isEmpty())
>>         return RequestFuture.voidSuccess();
>>
>>     Node coordinator = checkAndGetCoordinator();
>>     if (coordinator == null)
>>         return RequestFuture.coordinatorNotAvailable();
>>
>> I'm not sure what the exact pathway is to getting the coordinator set,
>> but I note that
>> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(Timer)
>> and other methods that look like they may be related tend to only log at
>> debug when they encounter RetriableException, which could explain why I
>> don't have more detail to provide.
>>
>> I'm not familiar enough with the code to be able to trace this through
>> any further, but if you've had the patience to keep reading this far then
>> maybe you do!
>>
>> Bouncing the clients resolved the issue, but I'd be interested if any
>> experts out there can identify whether there is any weakness in the 2.7.1
>> version.
>>
>> Regards, James.
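[Editor's note: the short-circuit James quotes only heals if something later repopulates the coordinator. A dependency-free model of that interaction; the method names mirror the Kafka ones for readability, but this is an illustration of the suspected failure mode, not the real client.]

```java
public class CoordinatorModel {
    private volatile String coordinator; // null == unknown, as after NOT_COORDINATOR

    /** Mirrors checkAndGetCoordinator(): commits fail fast while this is null. */
    String checkAndGetCoordinator() {
        return coordinator;
    }

    /** Mirrors markCoordinatorUnknown(): invoked on a NOT_COORDINATOR response. */
    void markCoordinatorUnknown() {
        coordinator = null;
    }

    /** Mirrors a successful FindCoordinator round trip (rediscovery). */
    void lookupCoordinator(String node) {
        coordinator = node;
    }

    /** Mirrors the quoted sendOffsetCommitRequest() short-circuit. */
    String sendOffsetCommit() {
        String c = checkAndGetCoordinator();
        if (c == null) {
            // Nothing on this path triggers rediscovery; unless some other
            // code path calls lookupCoordinator(), every commit keeps failing
            // with COORDINATOR_NOT_AVAILABLE -- the symptom in the logs above.
            return "COORDINATOR_NOT_AVAILABLE";
        }
        return "OK via " + c;
    }

    public static void main(String[] args) {
        CoordinatorModel m = new CoordinatorModel();
        m.lookupCoordinator("b-2");               // initial discovery
        m.markCoordinatorUnknown();               // NOT_COORDINATOR during failover
        System.out.println(m.sendOffsetCommit()); // fails fast until rediscovery
        m.lookupCoordinator("b-1");               // successful rediscovery
        System.out.println(m.sendOffsetCommit()); // commits work again
    }
}
```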