One of our internal customers is working on a service that spans around 120
Kubernetes pods.  Due to design constraints, every one of these pods has a
single Kafka consumer, and they're all using the same consumer group ID.
Since it's Kubernetes, and the service is sized according to volume
throughout the day, pods are added/removed constantly, at least a few times
per hour.

What we are seeing with initial testing is that, whenever a single pod
joins or leaves the consumer group, it triggers a rebalance that sometimes
takes 60+ seconds to resolve.  Consumption resumes after the
rebalance event, but of course now there's a 60+ second lag in consumption
for that topic.  Whenever there's a code deploy to these pods, and we need
to re-create all 120 pods, the problem seems to be exacerbated, and we run
into rebalances taking 200+ seconds.  This particular service is somewhat
sensitive to lag, so we'd like to keep the rebalance time to a minimum.

With that context, which Kafka configs should we focus on, on the consumer
side (and maybe the broker side?), to minimize the time spent on the
rebalance?

Thanks,

Marcos Juarez
