What I mean by "flapping" in this context is unnecessary rebalancing. The
example I would give is what a Hadoop Datanode does on shutdown: by
default, the cluster waits 10 minutes before re-replicating the blocks
owned by the Datanode, so that routine maintenance doesn't cause
unnecessary shuffling of blocks.
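As a back-of-the-envelope check on that 10-minute figure (this is from
memory of the Hadoop 2.x defaults, so treat the property names and
values as my assumptions):

```python
# Rough sketch of how HDFS decides a Datanode is dead (and so when
# re-replication starts). Values below are the Hadoop 2.x defaults as
# far as I recall -- assumptions, not verified against a cluster.
heartbeat_recheck_interval_ms = 5 * 60 * 1000  # dfs.namenode.heartbeat.recheck-interval
heartbeat_interval_s = 3                       # dfs.heartbeat.interval

# Commonly cited dead-node formula:
#   2 * recheck interval + 10 * heartbeat interval
dead_node_timeout_s = (2 * heartbeat_recheck_interval_ms / 1000
                       + 10 * heartbeat_interval_s)
print(dead_node_timeout_s / 60)  # -> 10.5 (minutes)
```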
In this context, if I'm performing a rolling restart, as soon as worker 1
shuts down, its work is picked up by other workers. But worker 1 comes
back 3 seconds (or whatever) later and requests the work back. Then
worker 2 goes down and its work is assigned to other workers for 3
seconds before yet another rebalance. So, in theory, the order of
operations will look something like this:

STOP (1) -> REBALANCE -> START (1) -> REBALANCE -> STOP (2) ->
REBALANCE -> START (2) -> REBALANCE -> ...

From what I understand, there's currently no way to prevent this type of
shuffling of partitions from worker to worker while the consumers are
under maintenance. I'm also not sure if this is an issue I don't need to
worry about.

- Pradeep

On Thu, Jan 5, 2017 at 8:29 PM, Ewen Cheslack-Postava <e...@confluent.io>
wrote:

> Not sure I understand your question about flapping. The
> LeaveGroupRequest is only sent on a graceful shutdown. If a consumer
> knows it is going to shut down, it is good to proactively make sure
> the group knows it needs to rebalance work because some of the
> partitions that were handled by the consumer need to be handled by
> some other group members.
>
> There's no "flapping" in the sense that the leave group requests
> should just inform the other members that they need to take over some
> of the work. I would normally think of "flapping" as meaning that
> things start/stop unnecessarily. In this case, *someone* needs to deal
> with the rebalance and pick up the work being dropped by the worker.
> There's no flapping because it's a one-time event -- one worker is
> shutting down, decides to drop the work, and a rebalance sorts it out
> and reassigns it to another member of the group. This happens once and
> then the "issue" is resolved without any additional interruptions.
>
> -Ewen
>
> On Thu, Jan 5, 2017 at 3:01 PM, Pradeep Gollakota <pradeep...@gmail.com>
> wrote:
>
> > I see... doesn't that cause flapping though?
> >
> > On Wed, Jan 4, 2017 at 8:22 PM, Ewen Cheslack-Postava <e...@confluent.io>
> > wrote:
> >
> > > The coordinator will immediately move the group into a rebalance if
> > > it needs it. The reason LeaveGroupRequest was added was to avoid
> > > having to wait for the session timeout before completing a
> > > rebalance. So aside from the latency of cleanup/committing
> > > offsets/rejoining after a heartbeat, rolling bounces should be fast
> > > for consumer groups.
> > >
> > > -Ewen
> > >
> > > On Wed, Jan 4, 2017 at 5:19 PM, Pradeep Gollakota <pradeep...@gmail.com>
> > > wrote:
> > >
> > > > Hi Kafka folks!
> > > >
> > > > When a consumer is closed, it will issue a LeaveGroupRequest.
> > > > Does anyone know how long the coordinator waits before
> > > > reassigning the partitions that were assigned to the leaving
> > > > consumer to a new consumer? I ask because I'm trying to
> > > > understand the behavior of consumers if you're doing a rolling
> > > > restart.
> > > >
> > > > Thanks!
> > > > Pradeep
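To be concrete about the pattern I'm worried about, here's a toy sketch
(plain Python, nothing Kafka-specific; the assumption that every leave
and every rejoin triggers one full rebalance is mine) counting the
rebalances in a rolling restart:

```python
# Toy model of the rolling-restart scenario above. Assumption: each
# membership change -- a worker leaving via LeaveGroupRequest, or the
# same worker rejoining moments later -- triggers one group rebalance.

def rolling_restart_rebalances(num_workers):
    """Count rebalances when num_workers consumers are bounced one by one."""
    events = []
    rebalances = 0
    for w in range(1, num_workers + 1):
        events.append(f"STOP ({w})")
        rebalances += 1          # group rebalances when the worker leaves
        events.append("REBALANCE")
        events.append(f"START ({w})")
        rebalances += 1          # and again when it rejoins
        events.append("REBALANCE")
    return rebalances, events

count, log = rolling_restart_rebalances(3)
print(count)                     # -> 6, i.e. two rebalances per worker bounced
print(" -> ".join(log[:4]))      # -> STOP (1) -> REBALANCE -> START (1) -> REBALANCE
```

Under that model a rolling restart of N workers costs 2N rebalances,
which is the "flapping" I was trying to describe.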