Hey,

Just skimming the config list, there are two things that immediately jumped
out at me:

1. The default session timeout was bumped up to 45 seconds a while back.
Not sure if you're overriding it or just on an older version that still
defaults to 10s, but either way I'd definitely recommend raising it to 45s
(45000ms). Especially in combination with...
2. internal.leave.group.on.close should always be set to "false" by Kafka
Streams. Are you overriding this? If so, that definitely explains a lot of
the rebalances. This config is basically an internal backdoor used by
Kafka Streams to do exactly what it sounds like you want to do -- avoid
triggering a rebalance when closing the consumer/KafkaStreams. It works in
combination with the session timeout, and basically means "don't kick off
an extra rebalance if a bounced consumer rejoins within the session
timeout". There's a rough sketch of both overrides right after this list.

I'd start with that and see how it goes before fiddling with other things
like probing.rebalance.interval.ms and max.warmup.replicas, since those
come with tradeoffs you may not want.

Lastly: I know this is somewhat contrary to common sense, but with consumer
groups/Kafka Streams it can often be much better to bounce as many nodes as
you can at once, rather than doing a true rolling bounce. If for any reason
you can't bounce multiple nodes at once, at the very least make sure they
are bounced as quickly as possible, i.e. minimize the time between when one
node comes back up and the next one is bounced. Often people will wait for
each node to come online, rejoin the consumer group, and fully stabilize
before bouncing the next node. But that means every single bounce not only
necessitates a rebalance, it also guarantees that partitions will be
shuffled around the entire time. So my main piece of advice (besides fixing
the two configs above) is: do the rolling restart as fast as you can!
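
For what it's worth, if these instances sit behind a standard Kubernetes
Deployment, that mostly means loosening the rolling update strategy so
multiple pods get replaced in parallel. The numbers below are purely
illustrative, not a recommendation for your setup:

    # Illustrative sketch only -- tune the percentages to what you can tolerate.
    spec:
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 25%          # bring up several new pods at the same time
          maxUnavailable: 50%    # allow a large chunk of old pods to bounce together

And if each step of the rollout waits on a readiness probe, make sure that
probe isn't gating on the consumer group fully stabilizing, since that's
exactly the wait that drags out the partition shuffling.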

On Mon, May 6, 2024 at 7:02 AM Nagendra Mahesh (namahesh)
<namah...@cisco.com.invalid> wrote:

> Hi,
>
>
> We have multiple replicas of an application running on a kubernetes
> cluster. Each application instance runs a stateful kafka stream application
> with an in-memory state-store (backed by a changelog topic). All instances
> of the stream apps are members of the same consumer group.
>
>
> Deployments happen using the “rolling restart” method i.e. new replica(s)
> come up successfully, and existing (old) replica(s) are killed. Due to
> members joining the consumer group (new app instances) and members leaving
> the consumer group (old app instances), there is rebalancing of topic
> partitions within the group.
>
>
> Ultimately, when all instances of the app have completed the rolling
> restart, we see that partitions have been rebalanced an excessive number
> of times. For example, the app has 48 instances, and each partition (say,
> partition #50) has been rebalanced 50 - 57 times, moving across several
> app instances. The total count of partition movements during the entire
> rolling restart is greater than 3000.
>
>
> This excessive rebalancing incurs an overall lag on message processing
> SLAs, and is creating reliability issues.
>
>
> So, we are wondering:
>
>
> (1) is this expected, especially since cooperative rebalancing should
> ensure that not a lot of partitions get rebalanced
>
>
> (2) why would any partition undergo so many rebalances across several app
> instances?
>
>
> (3) is there some configuration (broker config or client config) that we
> can apply to reduce the total rebalances and partition movements during
> rolling restarts? We cannot consider static membership due to other
> technical constraints.
>
>
> The runtime and network are extremely stable — no heartbeat misses,
> session timeouts, etc.
>
>
> DETAILS
>
> -----------
>
>   *   Kafka Broker Version = 2.6
>
>   *   Kafka Streams Client Version = 2.7.0
>
>   *   No. of app instances = 48
>
>   *   No. of stream threads per stream app = 3
>
>   *   Total partition count = 60
>
>   *   Warmup Replicas (max.warmup.replicas) = 5
>
>   *   Standby Replicas (num.standby.replicas) = 2
>
>   *   probing.rebalance.interval.ms = 300000 (5 minutes)
>
>   *   session.timeout.ms = 10000 (10 seconds)
>
>   *   heartbeat.interval.ms = 3000 (3 seconds)
>
>   *   internal.leave.group.on.close = true
>
>   *   linger.ms = 5
>
>   *   processing.guarantee = at_least_once
>
>
> Any help or information would be greatly appreciated.
>
> Thanks,
> Nagendra U M
>
