What I mean by "flapping" in this context is unnecessary rebalancing. The
example I would give is what a Hadoop Datanode does on shutdown: by default,
HDFS waits about 10 minutes before re-replicating the blocks owned by the
Datanode, so that routine maintenance doesn't cause unnecessary shuffling of
blocks.
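
For reference, the delay in that Hadoop example comes from HDFS's dead-node
detection, which is derived from two settings. A rough back-of-the-envelope
with what I believe are the stock defaults (treat the exact values as an
assumption on my part):

    2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval
      = 2 * 300,000 ms + 10 * 3,000 ms
      = 630,000 ms, i.e. roughly the 10 minutes above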

In this context, if I'm performing a rolling restart, as soon as worker 1
shuts down, its work is picked up by other workers. But worker 1 comes back
3 seconds (or whatever) later and requests the work back. Then worker 2 goes
down and its work is assigned to other workers for 3 seconds before yet
another rebalance. So, in theory, the order of operations will look
something like this:

STOP (1) -> REBALANCE -> START (1) -> REBALANCE -> STOP (2) -> REBALANCE ->
START (2) -> REBALANCE -> ....
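
To make that churn visible, here's a minimal sketch of a consumer that logs
every revoke/assign pair (written against a recent Java client; the broker
address, group id, and topic name are placeholders I made up):

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceObserver {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "worker-group");            // placeholder
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // Every STOP/START in the rolling restart above shows up here as a
        // revoke/assign pair -- the flapping I'm describing.
        consumer.subscribe(Collections.singletonList("my-topic"),
                new ConsumerRebalanceListener() {
                    @Override
                    public void onPartitionsRevoked(Collection<TopicPartition> parts) {
                        System.out.println("Revoked:  " + parts);
                    }
                    @Override
                    public void onPartitionsAssigned(Collection<TopicPartition> parts) {
                        System.out.println("Assigned: " + parts);
                    }
                });

        while (true) {
            consumer.poll(Duration.ofMillis(100))
                    .forEach(r -> System.out.println(r.value()));
        }
    }
}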

From what I understand, there's currently no way to prevent this type of
shuffling of partitions from worker to worker while the consumers are under
maintenance. I'm also not sure whether this is an issue I even need to worry
about.
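
For completeness, this is roughly the shutdown pattern I have in mind, along
the lines of the wakeup() pattern in the KafkaConsumer javadoc: a graceful
close() sends the LeaveGroupRequest right away, whereas a hard kill leaves
the coordinator waiting out session.timeout.ms. Broker address, group id,
and topic are again placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

public class GracefulWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "worker-group");            // placeholder
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        final Thread mainThread = Thread.currentThread();
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            consumer.wakeup();      // break out of a blocked poll()
            try {
                mainThread.join();  // let close() below finish before the JVM exits
            } catch (InterruptedException ignored) { }
        }));

        try {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                consumer.poll(Duration.ofMillis(100))
                        .forEach(r -> System.out.println(r.value()));
            }
        } catch (WakeupException e) {
            // expected: wakeup() was called from the shutdown hook
        } finally {
            consumer.close(); // sends LeaveGroupRequest -> immediate rebalance
        }
    }
}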

- Pradeep

On Thu, Jan 5, 2017 at 8:29 PM, Ewen Cheslack-Postava <e...@confluent.io>
wrote:

> Not sure I understand your question about flapping. The LeaveGroupRequest
> is only sent on a graceful shutdown. If a consumer knows it is going to
> shut down, it is good to proactively make sure the group knows it needs to
> rebalance work, because some of the partitions that were handled by the
> consumer need to be handled by other group members.
>
> There's no "flapping" in the sense that the leave group requests should
> just inform the other members that they need to take over some of the work.
> I would normally think of "flapping" as meaning that things start/stop
> unnecessarily. In this case, *someone* needs to deal with the rebalance and
> pick up the work being dropped by the worker. There's no flapping because
> it's a one-time event -- one worker is shutting down, decides to drop the
> work, and a rebalance sorts it out and reassigns it to another member of
> the group. This happens once and then the "issue" is resolved without any
> additional interruptions.
>
> -Ewen
>
> On Thu, Jan 5, 2017 at 3:01 PM, Pradeep Gollakota <pradeep...@gmail.com>
> wrote:
>
> > I see... doesn't that cause flapping though?
> >
> > On Wed, Jan 4, 2017 at 8:22 PM, Ewen Cheslack-Postava <e...@confluent.io>
> > wrote:
> >
> > > The coordinator will immediately move the group into a rebalance if it
> > > needs it. The reason LeaveGroupRequest was added was to avoid having to
> > > wait for the session timeout before completing a rebalance. So aside from
> > > the latency of cleanup/committing offsets/rejoining after a heartbeat,
> > > rolling bounces should be fast for consumer groups.
> > >
> > > -Ewen
> > >
> > > On Wed, Jan 4, 2017 at 5:19 PM, Pradeep Gollakota <pradeep...@gmail.com>
> > > wrote:
> > >
> > > > Hi Kafka folks!
> > > >
> > > > When a consumer is closed, it will issue a LeaveGroupRequest. Does anyone
> > > > know how long the coordinator waits before reassigning the partitions that
> > > > were assigned to the leaving consumer to a new consumer? I ask because I'm
> > > > trying to understand the behavior of consumers if you're doing a rolling
> > > > restart.
> > > >
> > > > Thanks!
> > > > Pradeep
> > > >
> > >
> >
>
