Re: Rebalancing stuck, never finishes

2021-02-26 Thread Sophie Blee-Goldman
Peter,

It does seem like KAFKA-9752 is the most likely suspect, although if your
clients
were upgraded to 2.6.1 then I don't believe they would be on an early
enough version
of the JoinGroup to run into this. I'm not 100% sure though, it may be a
good idea
to leave a comment on that ticket and ping Jason directly since he
implemented the fix

Murilo,

I agree that your problem is not likely to be KAFKA-9752, since that was
caused by
KAFKA-9232 and that code is not present  in 2.2.1. But maybe you're hitting
up on the
issue which  KAFKA-9232 was originally intended to fix? In any case, 2.2.1
is quite old now
so there may be other known bugs which have since been fixed.

I know it's not always possible/easy, but I would still recommend to
upgrade your brokers to a
more recent version if you can.


On Fri, Feb 26, 2021 at 7:19 AM Murilo Tavares  wrote:

> Just to provide a bit more detail, I noticed Peter's pattern:
> "Rebalance failed. org.apache.kafka.common.errors.DisconnectException:
> null"
> "(Re-)joining group"
>
> But I also get a different pattern, interchangeably:
> Group coordinator broker-1:9092 (id: 2147483646 rack: null) is unavailable
> or invalid due to cause: null.isDisconnected: true. Rediscovery will be
> attempted.
> Followed by
> Discovered group coordinator broker-1:9092 (id: 2147483646 rack: null)
>
>
>
> On Fri, 26 Feb 2021 at 09:59, Murilo Tavares  wrote:
>
> > Hi
> > I got the same behaviour yesterday while trying to upgrade my
> KafkaStreams
> > app from 2.4.1 to 2.7.0. Our brokers are on 2.2.1.
> >
> > Looking at KAFKA-9752 it mentions the cause being two other tickets:
> > https://issues.apache.org/jira/browse/KAFKA-7610
> > https://issues.apache.org/jira/browse/KAFKA-9232
> >
> > Although the first ticket seems fixed in 2.2.0, the latter was just fixed
> > in 2.2.3, so my brokers shouldn't have the code for KAFKA-9232.
> > But what I don't understand is that KAFKA-9752 says:
> > "Note that this is only possible if 1) we have a consumer using an old
> > JoinGroup version, 2) the consumer times out and disconnects from its
> > initial JoinGroup request."
> > In this case, I guess my consumer is not using an old JoinGroup, as my
> > consumers (KafkaStreams) are on 2.7.0...
> >
> > Thanks
> > Murilo
> >
> > On Fri, 26 Feb 2021 at 06:06, Péter Sinóros-Szabó
> >  wrote:
> >
> >> Hey Sophie,
> >>
> >> thanks for the link, I was checking that ticket, but I was not sure if
> it
> >> is relevant for our case.
> >> Eventually we "fixed" our problem with reducing the session.timeout.ms
> >> (it
> >> was set to a high value for other reasons).
> >>
> >> But today, in another service, we faced the same problem when upgrading
> >> the
> >> Kafka Client from 2.5.1 to 2.6.1. We are still using 2.4.1 on the
> brokers.
> >> Do you think the same problem (KAFKA-9752) might cause this problem too?
> >> It's hard to judge just based on the description of that ticket.
> >>
> >> Thanks,
> >> Peter
> >>
> >
>


Re: Rebalancing stuck, never finishes

2021-02-26 Thread Murilo Tavares
Just to provide a bit more detail, I noticed Peter's pattern:
"Rebalance failed. org.apache.kafka.common.errors.DisconnectException: null"
"(Re-)joining group"

But I also get a different pattern, interchangeably:
Group coordinator broker-1:9092 (id: 2147483646 rack: null) is unavailable
or invalid due to cause: null.isDisconnected: true. Rediscovery will be
attempted.
Followed by
Discovered group coordinator broker-1:9092 (id: 2147483646 rack: null)



On Fri, 26 Feb 2021 at 09:59, Murilo Tavares  wrote:

> Hi
> I got the same behaviour yesterday while trying to upgrade my KafkaStreams
> app from 2.4.1 to 2.7.0. Our brokers are on 2.2.1.
>
> Looking at KAFKA-9752 it mentions the cause being two other tickets:
> https://issues.apache.org/jira/browse/KAFKA-7610
> https://issues.apache.org/jira/browse/KAFKA-9232
>
> Although the first ticket seems fixed in 2.2.0, the latter was just fixed
> in 2.2.3, so my brokers shouldn't have the code for KAFKA-9232.
> But what I don't understand is that KAFKA-9752 says:
> "Note that this is only possible if 1) we have a consumer using an old
> JoinGroup version, 2) the consumer times out and disconnects from its
> initial JoinGroup request."
> In this case, I guess my consumer is not using an old JoinGroup, as my
> consumers (KafkaStreams) are on 2.7.0...
>
> Thanks
> Murilo
>
> On Fri, 26 Feb 2021 at 06:06, Péter Sinóros-Szabó
>  wrote:
>
>> Hey Sophie,
>>
>> thanks for the link, I was checking that ticket, but I was not sure if it
>> is relevant for our case.
>> Eventually we "fixed" our problem with reducing the session.timeout.ms
>> (it
>> was set to a high value for other reasons).
>>
>> But today, in another service, we faced the same problem when upgrading
>> the
>> Kafka Client from 2.5.1 to 2.6.1. We are still using 2.4.1 on the brokers.
>> Do you think the same problem (KAFKA-9752) might cause this problem too?
>> It's hard to judge just based on the description of that ticket.
>>
>> Thanks,
>> Peter
>>
>


Re: Rebalancing stuck, never finishes

2021-02-26 Thread Murilo Tavares
Hi
I got the same behaviour yesterday while trying to upgrade my KafkaStreams
app from 2.4.1 to 2.7.0. Our brokers are on 2.2.1.

Looking at KAFKA-9752 it mentions the cause being two other tickets:
https://issues.apache.org/jira/browse/KAFKA-7610
https://issues.apache.org/jira/browse/KAFKA-9232

Although the first ticket seems fixed in 2.2.0, the latter was just fixed
in 2.2.3, so my brokers shouldn't have the code for KAFKA-9232.
But what I don't understand is that KAFKA-9752 says:
"Note that this is only possible if 1) we have a consumer using an old
JoinGroup version, 2) the consumer times out and disconnects from its
initial JoinGroup request."
In this case, I guess my consumer is not using an old JoinGroup, as my
consumers (KafkaStreams) are on 2.7.0...

Thanks
Murilo

On Fri, 26 Feb 2021 at 06:06, Péter Sinóros-Szabó
 wrote:

> Hey Sophie,
>
> thanks for the link, I was checking that ticket, but I was not sure if it
> is relevant for our case.
> Eventually we "fixed" our problem with reducing the session.timeout.ms (it
> was set to a high value for other reasons).
>
> But today, in another service, we faced the same problem when upgrading the
> Kafka Client from 2.5.1 to 2.6.1. We are still using 2.4.1 on the brokers.
> Do you think the same problem (KAFKA-9752) might cause this problem too?
> It's hard to judge just based on the description of that ticket.
>
> Thanks,
> Peter
>


Re: Rebalancing stuck, never finishes

2021-02-26 Thread Péter Sinóros-Szabó
Hey Sophie,

thanks for the link, I was checking that ticket, but I was not sure if it
is relevant for our case.
Eventually we "fixed" our problem with reducing the session.timeout.ms (it
was set to a high value for other reasons).

But today, in another service, we faced the same problem when upgrading the
Kafka Client from 2.5.1 to 2.6.1. We are still using 2.4.1 on the brokers.
Do you think the same problem (KAFKA-9752) might cause this problem too?
It's hard to judge just based on the description of that ticket.

Thanks,
Peter


Re: Rebalancing stuck, never finishes

2021-02-25 Thread Sophie Blee-Goldman
Hey Peter,

It does sound like you may have hit
https://issues.apache.org/jira/browse/KAFKA-9752

You will need to upgrade your brokers in order to get the fix, since it's a
broker-side issue

On Tue, Feb 9, 2021 at 2:48 AM Péter Sinóros-Szabó
 wrote:

> Hi,
>
> I have an application running with 6 instances of it on Kubernetes. All 6
> instances (pods) are the same, using the same consumer group id.
> Recently we see that when the application is restarted (rolling restart on
> K8s), the triggered rebalancing sometimes doesn't finish at all and the
> Kafka Client stucks in rebalancing. Occasionally it finishes after 30-60
> minutes, sometimes it doesn't.
>
> If it is stuck, then if we stop the application and wait until
> kafka-consumer-groups.sh doesn't show the group, and then we restart the
> application, then the initial rebalancing finishes just fine and all is
> good... until some hours or days later a rolling restart restarts it all
> again.
>
> I grabbed some logs from the time when it was continuously rebalancing.
> Logs are mixed from 6 pods, but all pods have the same logs. (Kafka brokers
> seem like running on localhost, but that's not true, traffic is routed on a
> service mesh...)
>
> 2021-02-05T17:00:18.261422532Z:  fin-df8d589bd-95bsz: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] Group coordinator localhost:9204 (id:
> 2147482641 rack: null) is unavailable or invalid
> 2021-02-05T17:00:18.261454952Z:  fin-df8d589bd-95bsz: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] Rebalance failed.:
> org.apache.kafka.common.errors.DisconnectException: null
>
> 2021-02-05T17:00:18.499108799Z:  fin-df8d589bd-85zf9: INFO: Camel (camel-1)
> thread #42 - KafkaConsumer[topicY]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-43, groupId=fin] Discovered group coordinator
> localhost:9204 (id: 2147482641 rack: null)
> 2021-02-05T17:00:18.499300612Z:  fin-df8d589bd-85zf9: INFO: Camel (camel-1)
> thread #42 - KafkaConsumer[topicY]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-43, groupId=fin] (Re-)joining group
>
> No more logs from Kafka Consumer, it seems that the rebalancing doesn't
> finish at all, I don't see logs in any of the pods about the partition
> assignments being calculated, so my _guess_ is that the rebalancing stucks
> in PreparingRebalance phase and never progress from there.
>
> --- About 2 minutes 10 seconds later (sometimes I see a difference here of
> 1 minutes 10 seconds).
>
> 2021-02-05T17:02:29.615402388Z:  fin-df8d589bd-95bsz: INFO:
> kafka-coordinator-heartbeat-thread | fin:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-9, groupId=fin] Group coordinator localhost:9204 (id:
> 2147482641 rack: null) is unavailable or invalid, will attempt rediscovery
> 2021-02-05T17:02:29.615520075Z:  fin-df8d589bd-95bsz: INFO: Camel (camel-1)
> thread #28 - KafkaConsumer[twcard.plastic.events.finance.reconciliation]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-29, groupId=fin] Rebalance failed.:
> org.apache.kafka.common.errors.RebalanceInProgressException: The group is
> rebalancing, so a rejoin is needed.
>
> --- This last line may has a difference reason for rebalance too:
> "Rebalance failed.: org.apache.kafka.common.errors.DisconnectException:
> null"
>
> 2021-02-05T17:02:29.74932507Z:  fin-df8d589bd-j8mw6: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] Discovered group coordinator
> localhost:9204 (id: 2147482641 rack: null)
> 2021-02-05T17:02:29.749488204Z:  fin-df8d589bd-j8mw6: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] (Re-)joining group
>
> ... and the same repeats forever.
>
> Kafka Client version: 2.6.x
> Kafka Broker version: 2.4.1
>
>
> What can be the reason for this failing rebalance?
>
> I found this bug on 2.4.1, is it possible that I hit this issue?
> https://issues.apache.org/jira/browse/KAFKA-9752
> "Consumer rebalance can be stuck after new member timeout with old
> JoinGroup version"
>
>
> Thanks for the help,
> Peter
>