Hi Zaiming,

Yeah, you're right. Changing the coordinator won't cause a rebalance (it hasn't been that way since we added group metadata persistence). I went back and checked the code, and we actually do not reset the heartbeat timer when a commit is received. I'm not sure whether there's a good reason for that, but nothing is coming to mind. At least when the group is stable, the commit could be treated as an implicit heartbeat. Feel free to create a JIRA and we can see what others think.

Out of curiosity, is this a significant problem for the Erlang client you're writing?

-Jason

On Fri, Mar 25, 2016 at 1:38 PM, Zaiming Shi <zmst...@gmail.com> wrote:

> Hi Jason
>
> If I understand correctly, when the coordinator is changed the consumer
> should get a 'NotCoordinatorForGroup' exception, not 'IllegalGenerationId'.
> Topic metadata change? Like the number of partitions changed?
> I was testing it in a pretty stable cluster, and it was reproduced several
> times.
> I had no such issue if we change the session timeout to 3 minutes.
> --- does this rule out the topic metadata change?
>
> The logs are lost because I was running debug mode in our Erlang client to
> help debug this issue for my colleague who's using the new Java client.
> My colleague has observed very likely the same pattern as I described
> above.
> He is trying to get hold of a minimal setup for a reliable reproduction.
>
> I will also try to reproduce it in Erlang, and post here a (hopefully
> sensible) sequence of timestamped heartbeat and commit requests and
> responses.
>
> Will ask more questions if we have new findings.
>
> Regards
> -Zaiming
>
> On Fri, Mar 25, 2016 at 5:43 PM, Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hi Zaiming,
> >
> > It rules out the most likely cause of rebalance, but not the only one.
> > Rebalances can also be caused by a topic metadata change or a coordinator
> > change. Can you post some logs from the consumer around the time that the
> > unexpected rebalance occurred?
> >
> > -Jason
> >
> > On Fri, Mar 25, 2016 at 12:09 AM, Zaiming Shi <zmst...@gmail.com> wrote:
> >
> > > Hi Jason
> > >
> > > thanks for the reply!
> > >
> > > Forgot to mention that we tried to test the simplest scenario, in which
> > > there was only one member in the group. I think that should rule out
> > > group rebalancing, right?
> > >
> > > On Thursday, March 24, 2016, Jason Gustafson <ja...@confluent.io> wrote:
> > >
> > > > Hi Zaiming,
> > > >
> > > > I think the problem is not that commit requests aren't considered as
> > > > effective as heartbeats (they are), but that you can't rejoin the group
> > > > using only commits/heartbeats. Every time the group rebalances, all
> > > > members must rejoin the group by sending a JoinGroup request. Once a
> > > > rebalance has begun (e.g. because a new consumer has been started), then
> > > > each member must send the JoinGroup before expiration of the session
> > > > timeout. If not, then they will be kicked out of the group even if they
> > > > are still sending heartbeats. Does that make sense?
> > > >
> > > > -Jason
> > > >
> > > > On Wed, Mar 23, 2016 at 10:03 AM, Zaiming Shi <zmst...@gmail.com> wrote:
> > > >
> > > > > Hi there!
> > > > >
> > > > > We have noticed that when commit requests are sent intensively, we
> > > > > receive IllegalGenerationId.
> > > > > Here are the settings we had the problem with: session-timeout: 30 sec,
> > > > > heartbeat-rate: 3 sec.
> > > > > The problem was resolved by increasing the session timeout to 180 sec.
> > > > >
> > > > > So I suppose, for whatever reason (either the client didn't send
> > > > > heartbeats, or the broker didn't process the heartbeats in time), the
> > > > > session was considered dead by the group coordinator.
> > > > >
> > > > > My question is: why can't commit requests be taken as an indicator of
> > > > > the member being alive, and hence not kill the session?
> > > > >
> > > > > Regards
> > > > > -Zaiming
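
The session-timer behavior discussed in this thread can be illustrated with a toy model. This is only a sketch under stated assumptions: it is not Kafka's coordinator code, and every name in it (`MemberSession`, `commit_counts_as_heartbeat`, `survives_commit_only`) is hypothetical. It contrasts the behavior described above, where a commit does not reset the session timer, with the suggested alternative of treating a commit as an implicit heartbeat while the group is stable.

```python
from dataclasses import dataclass


@dataclass
class MemberSession:
    """Toy model of one group member's session (hypothetical, not Kafka code)."""

    session_timeout_s: float
    commit_counts_as_heartbeat: bool = False  # assumption: models the proposed change
    group_stable: bool = True
    last_seen_s: float = 0.0

    def on_heartbeat(self, now_s: float) -> None:
        # A heartbeat always resets the session timer.
        self.last_seen_s = now_s

    def on_commit(self, now_s: float) -> None:
        # Behavior described in the thread: a commit leaves the timer untouched.
        # Proposed alternative: count it as a heartbeat while the group is stable.
        if self.commit_counts_as_heartbeat and self.group_stable:
            self.last_seen_s = now_s

    def is_expired(self, now_s: float) -> bool:
        return now_s - self.last_seen_s > self.session_timeout_s


def survives_commit_only(session: MemberSession,
                         commit_every_s: float,
                         until_s: float) -> bool:
    """Send only commits (no heartbeats) and report whether the session survives."""
    t = 0.0
    while t <= until_s:
        if session.is_expired(t):
            return False
        session.on_commit(t)
        t += commit_every_s
    return not session.is_expired(until_s)


if __name__ == "__main__":
    # 30 s session timeout, as in the original report; commits every second.
    print(survives_commit_only(MemberSession(30.0), 1.0, 60.0))
    print(survives_commit_only(
        MemberSession(30.0, commit_counts_as_heartbeat=True), 1.0, 60.0))
```

In this model a member that sends only commits is expired once the session timeout elapses, unless commits are allowed to reset the timer. The `group_stable` flag mirrors the caveat that the implicit heartbeat would only apply while no rebalance is in progress.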