Re: Kafka Streams endless rebalancing

Peter Levart Sat, 11 Apr 2020 11:40:27 -0700



On 4/11/20 2:45 PM, Alex Craig wrote:

Yep, max poll interval is 2147483647 and session timeout is 120000 (2
minutes).  I don't have anything set for heartbeat.interval.ms, so it must
be using the default. (3 seconds I think?)  Hmm, is it possible the
heartbeat might not happen if the client app was so swamped with restoring
data (and maybe starved for CPU) and then the broker ultimately kicked it
out of the group?  Seems unlikely, since it would have to have been missed
for a full 2 minutes..   Getting the broker logs can be a challenge, but
I'll see if I can get ahold of them.  Is there any text I should be
looking for, or are rebalances pretty clear in the logs?  Thanks again for
the help!

Alex


Hi Alex,

What GC algorithm are you using and how big is your heap? GC pauses can be measured in minutes sometimes, depending on the GC algorithm chosen, heap size and workload.


Peter
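
One rough way to check whether GC could be eating into the heartbeat budget is to read the JVM's cumulative GC counters from inside the application. The sketch below uses only the standard java.lang.management API; the class name GcSnapshot is made up for illustration. Note that getCollectionTime() reports accumulated time per collector, not the longest single pause, so GC logs remain the better source for individual pause lengths.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Kafka: prints cumulative GC activity for this JVM.
final class GcSnapshot {

    // Sums accumulated collection time (ms) across all collectors.
    // A collector may report -1 when the value is undefined; those are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms");
        }
        System.out.println("total GC time: " + totalGcMillis() + " ms");
    }
}
```

Sampling this total periodically and comparing the deltas against session.timeout.ms gives a coarse indication of whether long collections are plausible as the cause.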


On Fri, Apr 10, 2020 at 4:15 PM John Roesler <vvcep...@apache.org> wrote:

Hey Alex,

Huh.

Unprefixed configs apply to all consumers, but in this case, it's
irrelevant because only the "main" consumer participates in group
management (so the config only applies to the main consumer).

So you actually have max.poll.interval.ms set to Integer.MAX_VALUE,
which amounts to 25 days? I agree, in that case it doesn't seem like
it could be a slow batch. In fact, it couldn't be anything related to
polling, since you see rebalances sooner than 25 days.
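
The "25 days" figure can be checked with a quick conversion of Integer.MAX_VALUE milliseconds into days. This is a minimal sketch; the class and method names are made up for illustration.

```java
import java.time.Duration;

// Hypothetical helper: converts a millisecond timeout into fractional days.
final class PollIntervalMath {

    static double millisToDays(long millis) {
        return millis / (double) Duration.ofDays(1).toMillis();
    }

    public static void main(String[] args) {
        // Integer.MAX_VALUE is 2147483647, the value reported for max.poll.interval.ms.
        System.out.println(millisToDays(Integer.MAX_VALUE) + " days");
    }
}
```

The result is a little under 25 days, so a missed poll deadline cannot explain rebalances that occur within hours.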

If you have the broker logs, they'll contain the reason for the rebalance.
The only other thing I can think of that causes rebalances is failing to
heartbeat. What do you have for session.timeout.ms and
heartbeat.interval.ms ?
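
To put the two heartbeat settings in relation: with the values reported elsewhere in this thread (session.timeout.ms = 120000 and heartbeat.interval.ms left at its 3000 ms default), the sketch below counts how many heartbeat intervals fit into one session timeout. The class and method names are made up for illustration; the 1/3 ratio is the guideline given in the Kafka consumer configuration docs.

```java
// Hypothetical helper: relates heartbeat.interval.ms to session.timeout.ms.
final class HeartbeatBudget {

    // Number of whole heartbeat intervals that fit into one session timeout.
    static long intervalsPerSession(long sessionTimeoutMs, long heartbeatIntervalMs) {
        if (heartbeatIntervalMs <= 0) {
            throw new IllegalArgumentException("heartbeat interval must be positive");
        }
        return sessionTimeoutMs / heartbeatIntervalMs;
    }

    // The consumer docs suggest a heartbeat interval no higher than 1/3 of the session timeout.
    static boolean withinRecommendedRatio(long sessionTimeoutMs, long heartbeatIntervalMs) {
        return heartbeatIntervalMs * 3 <= sessionTimeoutMs;
    }

    public static void main(String[] args) {
        long sessionTimeoutMs = 120_000L;  // value reported in this thread
        long heartbeatIntervalMs = 3_000L; // consumer default
        System.out.println("intervals per session: "
                + intervalsPerSession(sessionTimeoutMs, heartbeatIntervalMs));
        System.out.println("within recommended ratio: "
                + withinRecommendedRatio(sessionTimeoutMs, heartbeatIntervalMs));
    }
}
```

With these numbers the client would have to miss about 40 consecutive heartbeat intervals before the broker expired its session.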

If anyone else has any ideas, please jump in.

Thanks,
-John

On Fri, Apr 10, 2020, at 14:55, Alex Craig wrote:

Thanks John, I double-checked my configs and I've actually got the
max.poll.interval.ms set to the max (not prefixed with anything so
presumably that’s the “main” consumer).  So I think that means the problem
isn’t due to a single batch of messages not getting processed/committed
within the polling cycle right?  I guess what I’m wondering is, could the
OVERALL length of time needed to fully restore the state stores (which
could be multiple topics with multiple partitions) be exceeding some
timeout or threshold?  Thanks again for any ideas,



Alex C


On Thu, Apr 9, 2020 at 9:36 AM John Roesler <vvcep...@apache.org> wrote:

Hi Alex,

It sounds like your theory is plausible. After a rebalance, Streams needs
to restore its stores from the changelog topics. Currently, Streams
performs this restore operation in the same loop that does processing and
polls the consumer for more records. If the restore batches (or the
processing) take too long, Streams won’t be able to call Consumer#poll (on
the “main” consumer) within the max.poll.interval.ms, which causes the
Consumer’s heartbeat thread to assume the instance is unhealthy and stop
sending heartbeats, which in turn causes another rebalance.

You could try either adjusting the max poll interval for the _main_
consumer or decreasing the batch size for the _restore_ consumer to make
sure Streams can call poll() frequently enough to stay in the group. There
are prefixes you can add to the consumer configuration portions to target
the main or restore consumer.
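
As a sketch of what such prefixed settings can look like: Kafka Streams recognizes the "main.consumer." and "restore.consumer." prefixes (exposed as constants on StreamsConfig). The example below uses plain string keys so it runs without the Kafka jars. It assumes "batch size" refers to max.poll.records, and both numeric values are illustrative placeholders, not recommendations. The class name is made up.

```java
import java.util.Properties;

// Hypothetical example: builds consumer overrides for a Kafka Streams app.
final class StreamsConsumerOverrides {

    static Properties build() {
        Properties props = new Properties();
        // Applies only to the main consumer (the one that participates in group management).
        // 600000 ms is an illustrative value.
        props.setProperty("main.consumer.max.poll.interval.ms", "600000");
        // Applies only to the restore consumer; smaller batches return control
        // to the processing loop sooner. 100 is an illustrative value.
        props.setProperty("restore.consumer.max.poll.records", "100");
        return props;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

These properties would be merged into the same Properties object that carries application.id and bootstrap.servers before constructing the KafkaStreams instance.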

Also worth noting, we’re planning to change this up pretty soon, so that
restoration happens in a separate thread and doesn’t block polling like
this.

I hope this helps!
-John

On Thu, Apr 9, 2020, at 08:33, Alex Craig wrote:

Hi all, I’ve got a Kafka Streams application running in a Kubernetes
environment.  The topology on this application has 2 aggregations (and
therefore 2 KTables), both of which can get fairly large – the first is
around 200GB and the second around 500GB.  As with any K8s platform, pods
can occasionally get rescheduled or go down, which of course will cause my
application to rebalance.  However, what I’m seeing is the application will
literally spend hours rebalancing, without any errors being thrown or other
obvious causes for the frequent rebalances – all I can see in the logs is
an instance will be restoring a state store from the changelog topic, then
suddenly it will have its partitions revoked and begin the join-group
process all over again.  (I’m running 10 pods/instances of my app, and I
see this same pattern in each instance)  In some cases it never really
recovers from this rebalancing cycle – even after 12 hours or more - and
I’ve had to scale down the application completely and start over by purging
the application state and re-consuming from earliest on the source topic.
Interestingly, after purging and starting from scratch the application
seems to recover from rebalances pretty easily.

The storage I’m using is a NAS device, which admittedly is not particularly
fast.  (it’s using spinning disks and is shared amongst other tenants) As
an experiment, I’ve tried switching the k8s storage to an in-memory option
(this is at the k8s layer - the application is still using the same RocksDB
stores) to see if that helps.  As it turns out, I never have the rebalance
problem when using an in-memory persistence layer.  If a pod goes down, the
application spends around 10 - 15 minutes rebalancing and then is back to
processing data again.

At this point I guess my main question is: when I’m using the NAS storage
and the state stores are fairly large, could I be hitting some timeout
somewhere that isn’t allowing the restore process to complete, which then
triggers another rebalance?  In other words, the restore process is simply
taking too long given the amount of data needed to restore and the slow
storage?   I’m currently using Kafka 2.4.1, but I saw this same behavior in
2.3.  I am using a custom RocksDB config setter to limit off-heap memory,
but I’ve tried removing that and saw no difference in the rebalance
problem.  Again, no errors that I’m seeing or anything else in the logs
that seems to indicate why it can never finish rebalancing.  I’ve tried
turning on DEBUG logging but I’m having a tough time sifting through the
amount of log messages, though I’m still looking.

If anyone has any ideas I would appreciate it, thanks!

Alex C
