Hi Zaiming,

Yeah, you're right. A coordinator change won't cause a rebalance (it hasn't
been that way since we added group metadata persistence). I went back and
checked the code and we actually do not reset the heartbeat timer when a
commit is received. I'm not sure whether there's a good reason for that,
but nothing is coming to mind. At least when the group is stable, the
commit could be treated as an implicit heartbeat. Feel free to create a
JIRA and we can see what others think. Out of curiosity, is this a
significant problem for the Erlang client you're writing?
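
To make the "implicit heartbeat" idea concrete, here is a minimal Python
sketch of the behavior I have in mind. It is not the broker's actual code:
MemberSession, on_heartbeat, on_commit, and the group_stable flag are all
hypothetical names made up for illustration, and the real coordinator logic
is more involved.

```python
import time


class MemberSession:
    """Hypothetical per-member session state kept by a group coordinator."""

    def __init__(self, session_timeout_s, clock=time.monotonic):
        self.session_timeout_s = session_timeout_s
        self._clock = clock  # injectable so the timer can be tested
        self._deadline = clock() + session_timeout_s

    def _reset_timer(self):
        # Push the expiration deadline one full session timeout into the future.
        self._deadline = self._clock() + self.session_timeout_s

    def on_heartbeat(self):
        # An explicit heartbeat always keeps the session alive.
        self._reset_timer()

    def on_commit(self, group_stable):
        # Assumption: a commit counts as an implicit heartbeat only while the
        # group is stable. During a rebalance the member still has to rejoin.
        if group_stable:
            self._reset_timer()

    def expired(self):
        return self._clock() >= self._deadline
```

With a 30 second session timeout, a commit received 25 seconds after the
last heartbeat would move the deadline out to 55 seconds instead of letting
the session expire at 30.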

-Jason

On Fri, Mar 25, 2016 at 1:38 PM, Zaiming Shi <zmst...@gmail.com> wrote:

> Hi Jason
>
> If I understand correctly, when the coordinator changes, the consumer
> should get a 'NotCoordinatorForGroup' exception, not 'IllegalGenerationId'.
> By topic metadata change, do you mean something like a change in the
> number of partitions?
> I was testing in a pretty stable cluster, and the problem was reproduced
> several times. I had no such issue when we changed the session timeout to
> 3 minutes. Does this rule out the topic metadata change?
>
> The logs are lost because I was running our Erlang client in debug mode to
> help debug this issue for my colleague, who is using the new Java client.
> My colleague has very likely observed the same pattern as I described
> above.
> He is trying to get hold of a minimal setup for a reliable reproduction.
>
> I will also try to reproduce it in Erlang, and post here a (hopefully
> sensible)
> sequence of timestamped heartbeat and commit requests and responses.
>
> Will ask more questions if we have new findings.
>
> Regards
> -Zaiming
>
>
>
> On Fri, Mar 25, 2016 at 5:43 PM, Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hi Zaiming,
> >
> > It rules out the most likely cause of rebalance, but not the only one.
> > Rebalances can also be caused by a topic metadata change or a coordinator
> > change. Can you post some logs from the consumer around the time that the
> > unexpected rebalance occurred?
> >
> > -Jason
> >
> > On Fri, Mar 25, 2016 at 12:09 AM, Zaiming Shi <zmst...@gmail.com> wrote:
> >
> > > Hi Jason
> > >
> > > thanks for the reply!
> > >
> > > Forgot to mention that we tested the simplest scenario, in which
> > > there was only one member in the group. I think that should rule out
> > > group rebalancing, right?
> > >
> > > On Thursday, March 24, 2016, Jason Gustafson <ja...@confluent.io> wrote:
> > >
> > > > Hi Zaiming,
> > > >
> > > > I think the problem is not that commit requests aren't considered as
> > > > effective as heartbeats (they are), but that you can't rejoin the group
> > > > using only commits/heartbeats. Every time the group rebalances, all
> > > > members must rejoin the group by sending a JoinGroup request. Once a
> > > > rebalance has begun (e.g. because a new consumer has been started), then
> > > > each member must send the JoinGroup before expiration of the session
> > > > timeout. If not, then they will be kicked out of the group even if they
> > > > are still sending heartbeats. Does that make sense?
> > > >
> > > > -Jason
> > > >
> > > >
> > > >
> > > > On Wed, Mar 23, 2016 at 10:03 AM, Zaiming Shi <zmst...@gmail.com> wrote:
> > > >
> > > > > Hi there!
> > > > >
> > > > > We have noticed that when commit requests are sent intensively, we
> > > > > receive IllegalGenerationId.
> > > > > Here are the settings we had the problem with: session-timeout: 30 sec,
> > > > > heartbeat-rate: 3 sec.
> > > > > The problem was resolved by increasing the session timeout to 180 sec.
> > > > >
> > > > > So I suppose that, for whatever reason (either the client didn't send
> > > > > heartbeats, or the broker didn't process the heartbeats in time), the
> > > > > session was considered dead by the group coordinator.
> > > > >
> > > > > My question is: why can't commit requests be taken as an indicator
> > > > > that the member is alive, so that the session is not killed?
> > > > >
> > > > > Regards
> > > > > -Zaiming
> > > > >
> > > >
> > >
> >
>
