One of our internal customers is working on a service that spans around 120 Kubernetes pods. Due to design constraints, each pod runs a single Kafka consumer, and all of them use the same consumer group ID. Because the service runs on Kubernetes and is scaled according to traffic volume throughout the day, pods are added and removed constantly, at least a few times per hour.
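To make the setup concrete, here is a minimal sketch of how the per-pod consumer configuration is shaped. The broker address, group name, and pod names are placeholders invented for illustration, not our real values; only the property keys (`bootstrap.servers`, `group.id`, `client.id`) are standard Kafka consumer settings.

```python
# Hypothetical values: the broker address and group name are placeholders.
BOOTSTRAP_SERVERS = "kafka.example.internal:9092"
GROUP_ID = "example-service"
POD_COUNT = 120  # approximate pod count described above


def consumer_config(pod_name: str) -> dict:
    """Build the config for the single consumer running in one pod."""
    return {
        "bootstrap.servers": BOOTSTRAP_SERVERS,
        "group.id": GROUP_ID,   # identical across all pods
        "client.id": pod_name,  # differs per pod, for identification only
    }


# One consumer per pod, all joining the same consumer group.
configs = [consumer_config(f"pod-{i}") for i in range(POD_COUNT)]
group_ids = {c["group.id"] for c in configs}
print(len(configs), "consumers in", len(group_ids), "group")
```

The point of the sketch is only that all ~120 consumers are members of one group, so membership changes in any pod affect the whole group.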
What we are seeing in initial testing is that whenever a single pod joins or leaves the consumer group, it triggers a rebalance that sometimes takes 60 seconds or more to resolve. Consumption resumes after the rebalance, but by then the consumers are 60+ seconds behind on that topic. A code deploy, which re-creates all 120 pods, seems to make the problem worse: we run into rebalances taking 200+ seconds. This particular service is somewhat sensitive to lag, so we'd like to keep rebalance time to a minimum.

With that context, which Kafka configs should we focus on, on the consumer side (and maybe the broker side?), to minimize the time spent rebalancing?

Thanks,
Marcos Juarez
