Hi James,

I've filed a bug in JIRA: KAFKA-13563 <https://issues.apache.org/jira/browse/KAFKA-13563>. I'll investigate this issue.
Thank you.
Luke

On Wed, Dec 22, 2021 at 2:49 AM James Olsen <ja...@inaseq.com> wrote:

> This failure occurred again during this month's rolling OS security
> updates to the Brokers (no change to Broker version). I have also been
> able to reproduce it locally with the following process:
>
> 1. Start a 3 Broker cluster with a Topic having Replicas=3.
> 2. Start a Client with Producer and Consumer communicating over the Topic.
> 3. Stop the Broker that is acting as the Group Coordinator.
> 4. Observe successful Rediscovery of new Group Coordinator.
> 5. Restart the stopped Broker.
> 6. Stop the Broker that became the new Group Coordinator at step 4.
> 7. Observe "Rediscovery will be attempted" message but no "Discovered
> group coordinator" message.
>
> In short, Group Coordinator Rediscovery only works for the first Broker
> failover, not any subsequent failover.
>
> I conducted tests using 2.7.1 servers. The issue occurs with 2.7.1 and
> 2.7.2 Clients. The issue does not occur with 2.5.1 and 2.7.0 Clients.
> This makes me suspect that
> https://issues.apache.org/jira/browse/KAFKA-10793 introduced this issue.
>
> Regards, James.
>
> On 24/11/2021, at 14:35, James Olsen <ja...@inaseq.com> wrote:
>
> Luke,
>
> We did not upgrade to resolve the issue. We simply restarted the failing
> clients.
>
> Regards, James.
>
> On 23/11/2021, at 16:10, Luke Chen <show...@gmail.com> wrote:
>
> Hi James,
>
> > Bouncing the clients resolved the issue
>
> Could you please describe which version you upgraded to, to resolve this
> issue? That should also help other users encountering the same issue.
>
> And the code snippet you listed has existed since 2018; I don't think
> there is any problem there.
> Maybe there are bugs in other places that got fixed indirectly.
>
> Thank you.
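[Editor's note: for step 3 of the reproduction above, the broker currently acting as Group Coordinator can be looked up programmatically. A minimal sketch using the kafka-clients Admin API; the bootstrap address and group name are placeholders, and this requires a running cluster. The same information is also printed by `kafka-consumer-groups.sh --describe --state`.]

```java
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

public class FindGroupCoordinator {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        Properties props = new Properties();
        // Placeholder bootstrap address; point at any of the three brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // describeConsumerGroups reports the coordinator Node per group,
            // which tells us which broker to stop in step 3 (or step 6).
            ConsumerGroupDescription description = admin
                .describeConsumerGroups(Collections.singletonList("my-group"))
                .describedGroups().get("my-group").get();
            System.out.println("Group coordinator: " + description.coordinator());
        }
    }
}
```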
> Luke
>
> On Tue, Nov 23, 2021 at 10:27 AM James Olsen <ja...@inaseq.com> wrote:
>
>> We had a 2.5.1 Broker/Client system running for some time with regular
>> rolling OS upgrades to the Brokers without any problems. A while ago we
>> upgraded both Broker and Clients to 2.7.1, and now on the first rolling OS
>> upgrade to the 2.7.1 Brokers we encountered some Consumer issues. We have
>> a 3 Broker setup with min-ISRs configured to avoid any outage.
>>
>> So maybe we just got lucky 6 times in a row with 2.5.1, or maybe there
>> is an issue with 2.7.1.
>>
>> The observable symptom is a continuous stream of "The coordinator is not
>> available" messages when trying to commit offsets. It starts with the
>> usual messages you might expect during a rolling upgrade...
>>
>> 2021-11-22 04:41:25,269 WARN
>> [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
>> 'pool-7-thread-132' [Consumer clientId=consumer-MyService-group-58,
>> groupId=MyService-group] Offset commit failed on partition MyTopic-0 at
>> offset 866799313: The coordinator is loading and hence can't process
>> requests.
>>
>> ... then 5 minutes of all OK, then ...
>>
>> 2021-11-22 04:46:33,258 WARN
>> [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
>> 'pool-7-thread-132' [Consumer clientId=consumer-MyService-group-58,
>> groupId=MyService-group] Offset commit failed on partition MyTopic-0 at
>> offset 866803953: This is not the correct coordinator.
>>
>> 2021-11-22 04:46:33,258 INFO
>> [org.apache.kafka.clients.consumer.internals.AbstractCoordinator]
>> 'pool-7-thread-132' [Consumer clientId=consumer-MyService-group-58,
>> groupId=MyService-group] Group coordinator b-2.xxx.com:9094
>> (id: 2147483645 rack: null) is unavailable or
>> invalid due to cause: error response NOT_COORDINATOR.isDisconnected: false.
>> Rediscovery will be attempted.
>>
>> 2021-11-22 04:46:33,258 WARN [xxx.KafkaConsumerRunner]
>> 'pool-7-thread-132' Offset commit with offsets
>> {MyTopic-0=OffsetAndMetadata{offset=866803953, leaderEpoch=null,
>> metadata=''}} failed:
>> org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset
>> commit failed with a retriable exception. You should retry committing the
>> latest consumed offsets.
>> Caused by: org.apache.kafka.common.errors.NotCoordinatorException: This
>> is not the correct coordinator.
>>
>> ... then the following message for every subsequent attempt to commit
>> offsets ...
>>
>> 2021-11-22 04:46:33,284 WARN [xxx.KafkaConsumerRunner]
>> 'pool-7-thread-132' Offset commit with offsets
>> {MyTopic-0=OffsetAndMetadata{offset=866803954, leaderEpoch=82,
>> metadata=''}, MyOtherTopic-0=OffsetAndMetadata{offset=12654756,
>> leaderEpoch=79, metadata=''}} failed:
>> org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset
>> commit failed with a retriable exception. You should retry committing the
>> latest consumed offsets.
>> Caused by:
>> org.apache.kafka.common.errors.CoordinatorNotAvailableException: The
>> coordinator is not available.
>>
>> In the above example we are doing manual async commits, but we also had
>> offset commit failures for a different consumer group (observed through
>> lag monitoring) that uses auto-commit; it just didn't log the ongoing
>> failures. In both cases messages were still being processed; it was just
>> the commits not working. These are our two busiest consumer groups and
>> both have static Topic assignments. Other consumer groups continued OK.
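[Editor's note: the RetriableCommitFailedException message says to retry the commit, but as the logs above show, retrying forever against a coordinator that is never rediscovered just spins. A dependency-free sketch of a bounded retry policy; the CommitResult type and attempt limit are illustrative, not Kafka API.]

```java
import java.util.function.Supplier;

public class BoundedCommitRetry {
    enum CommitResult { OK, RETRIABLE, FATAL }

    /**
     * Retries a commit attempt while it fails retriably, up to maxAttempts.
     * Returns true only if the commit eventually succeeded.
     */
    static boolean commitWithRetry(Supplier<CommitResult> attempt, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            CommitResult result = attempt.get();
            if (result == CommitResult.OK) return true;
            if (result == CommitResult.FATAL) return false;
            // RETRIABLE: real code should back off here before the next try.
        }
        // Still failing retriably after maxAttempts: surface it (e.g. alert on
        // commit lag, as the auto-commit group needed) rather than retrying
        // silently forever.
        return false;
    }

    public static void main(String[] args) {
        // Simulated commit that fails retriably twice, then succeeds.
        int[] calls = {0};
        boolean ok = commitWithRetry(() -> {
            calls[0]++;
            return calls[0] < 3 ? CommitResult.RETRIABLE : CommitResult.OK;
        }, 5);
        System.out.println("committed=" + ok + " attempts=" + calls[0]); // committed=true attempts=3
    }
}
```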
>>
>> I've spent some time examining the (Java) client code and started to
>> wonder whether there is a bug or race condition that means the coordinator
>> never gets reassigned after being invalidated, and we simply keep hitting
>> the following short-circuit:
>>
>> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
>>
>> RequestFuture<Void> sendOffsetCommitRequest(final Map<TopicPartition,
>> OffsetAndMetadata> offsets) {
>>     if (offsets.isEmpty())
>>         return RequestFuture.voidSuccess();
>>
>>     Node coordinator = checkAndGetCoordinator();
>>     if (coordinator == null)
>>         return RequestFuture.coordinatorNotAvailable();
>>
>> I'm not sure what the exact pathway is to getting the coordinator set,
>> but I note that
>> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(Timer)
>> and other methods that look like they may be related tend to only log at
>> debug when they encounter RetriableException, which could explain why I
>> don't have more detail to provide.
>>
>> I'm not familiar enough with the code to be able to trace this through
>> any further, but if you've had the patience to keep reading this far then
>> maybe you do!
>>
>> Bouncing the clients resolved the issue, but I'd be interested if any
>> experts out there can identify whether there is any weakness in the 2.7.1
>> version.
>>
>> Regards, James.
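[Editor's note: the short-circuit James quotes only heals if something later repopulates the coordinator. A dependency-free model of that interaction; the method names mirror the Kafka ones for readability, but this is an illustration of the suspected failure mode, not the real client.]

```java
public class CoordinatorModel {
    private volatile String coordinator; // null == unknown, as after NOT_COORDINATOR

    /** Mirrors checkAndGetCoordinator(): commits fail fast while this is null. */
    String checkAndGetCoordinator() {
        return coordinator;
    }

    /** Mirrors markCoordinatorUnknown(): invoked on a NOT_COORDINATOR response. */
    void markCoordinatorUnknown() {
        coordinator = null;
    }

    /** Mirrors a successful FindCoordinator round trip (rediscovery). */
    void lookupCoordinator(String node) {
        coordinator = node;
    }

    /** Mirrors the quoted sendOffsetCommitRequest() short-circuit. */
    String sendOffsetCommit() {
        String c = checkAndGetCoordinator();
        if (c == null) {
            // Nothing on this path triggers rediscovery; unless some other
            // code path calls lookupCoordinator(), every commit keeps failing
            // with COORDINATOR_NOT_AVAILABLE -- the symptom in the logs above.
            return "COORDINATOR_NOT_AVAILABLE";
        }
        return "OK via " + c;
    }

    public static void main(String[] args) {
        CoordinatorModel m = new CoordinatorModel();
        m.lookupCoordinator("b-2");               // initial discovery
        m.markCoordinatorUnknown();               // NOT_COORDINATOR during failover
        System.out.println(m.sendOffsetCommit()); // fails fast until rediscovery
        m.lookupCoordinator("b-1");               // successful rediscovery
        System.out.println(m.sendOffsetCommit()); // commits work again
    }
}
```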