Guozhang Wang created KAFKA-12352:
-------------------------------------

             Summary: Improve debuggability with continuous consumer rebalances
                 Key: KAFKA-12352
                 URL: https://issues.apache.org/jira/browse/KAFKA-12352
             Project: Kafka
          Issue Type: Improvement
          Components: consumer, streams
            Reporter: Guozhang Wang
            Assignee: Guozhang Wang


There are several scenarios where a consumer/streams client can fall into 
continuous rebalances and hence make no progress. Today, when this 
happens, developers usually need to do a lot of digging to work out what is 
going on. Here's a short summary of the different scenarios where we 
(re-)trigger rebalances: 

1. Group member kicked out of the group: after the coordinator kicks out the 
member, the member's next join / sync / heartbeat / offset-commit request 
will fail and the member will try to re-join. If the member keeps calling 
poll too late, it will continuously fall into this scenario and not make 
progress. 

2. Group is rebalancing: if the group is rebalancing at the moment, the 
member's heartbeat / offset commit / sync-group will fail. In this case the 
member re-joining the group is not the root cause of the rebalance. 

3. The caller enforces a rebalance via `enforceRebalance`. This is one-off and 
should not cause rebalance storms. 

4. After a rebalance is completed, the member finds out that a) its 
subscription has changed or b) its subscribed topics' number of partitions 
has changed since the re-join request was sent. In this case it needs to 
re-trigger the rebalance in order to get the new assignment. Since the 
subscription change is one-off, it should not cause rebalance storms; topic 
metadata changes should also be infrequent, but there are some rare cases 
where topic metadata keeps "vibrating" due to broker side issues. 

5. After a rebalance is completed, the member needs to revoke some partitions 
as indicated by the assignment. After the revocation it needs to re-join the 
group. This may cause rebalance storms when the partition assignor is 
sub-optimal in determining the assignment, so that partitions keep 
migrating around and rebalances are triggered continuously. 
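
To make scenario 1 easier to spot from the application side, here is a minimal sketch of a helper that tracks the gap between consecutive poll calls and flags gaps exceeding the consumer's `max.poll.interval.ms` setting. `PollGapTracker` and its methods are hypothetical names used only for illustration; they are not part of the Kafka client.

```java
import java.time.Duration;

/** Hypothetical helper (not a Kafka API): tracks gaps between poll() calls. */
final class PollGapTracker {
    private final long maxPollIntervalMs;
    private long lastPollMs = -1L; // -1 means no poll has been recorded yet
    private long worstGapMs = 0L;
    private int lateGaps = 0;

    /** @param maxPollInterval assumed to mirror the consumer's max.poll.interval.ms */
    PollGapTracker(Duration maxPollInterval) {
        this.maxPollIntervalMs = maxPollInterval.toMillis();
    }

    /**
     * Call right before each poll(). Returns true if the gap since the
     * previous poll exceeded the configured limit, i.e. the member has
     * likely been kicked out of the group in the meantime.
     */
    boolean recordPoll(long nowMs) {
        boolean late = false;
        if (lastPollMs >= 0) {
            long gap = nowMs - lastPollMs;
            worstGapMs = Math.max(worstGapMs, gap);
            if (gap > maxPollIntervalMs) {
                lateGaps++;
                late = true;
            }
        }
        lastPollMs = nowMs;
        return late;
    }

    long worstGapMs() { return worstGapMs; }

    int lateGaps() { return lateGaps; }
}
```

In an application this would be called as `tracker.recordPoll(System.currentTimeMillis())` before every poll; a steadily growing `lateGaps()` value would point at scenario 1 rather than at the assignor.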

As we can see, scenarios 1 and 5 above could potentially cause rebalance 
storms, while 2, 3 and 4 should not in normal cases. In all of these 
scenarios, we should expose exactly why the member is re-joining the group, 
and whether this re-join would trigger a rebalance or the group is already 
in a rebalance (hence the join-group is not the cause of it, but the result 
of it). This could help operators quickly nail down which of the above may 
be the root cause of continuous rebalances. 
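
As a rough illustration of what "exposing the reason" could look like, the sketch below models the five scenarios as an enum carrying the two facts an operator needs: whether the re-join itself triggers a rebalance, and whether the scenario can lead to a storm. `RejoinReason`, its constants and `describe` are hypothetical names, and the flag values are one reading of the list above, not existing Kafka code.

```java
/** Hypothetical classification of the five re-join scenarios (not a Kafka API). */
enum RejoinReason {
    // (triggersRebalance, stormRisk); values are an interpretation of the list above
    MEMBER_KICKED_OUT(true, true),                  // scenario 1
    GROUP_ALREADY_REBALANCING(false, false),        // scenario 2
    ENFORCED_BY_CALLER(true, false),                // scenario 3
    SUBSCRIPTION_OR_METADATA_CHANGED(true, false),  // scenario 4
    PARTITIONS_REVOKED(true, true);                 // scenario 5

    private final boolean triggersRebalance;
    private final boolean stormRisk;

    RejoinReason(boolean triggersRebalance, boolean stormRisk) {
        this.triggersRebalance = triggersRebalance;
        this.stormRisk = stormRisk;
    }

    /** True if this re-join causes a rebalance rather than resulting from one. */
    boolean triggersRebalance() { return triggersRebalance; }

    /** True if the scenario could repeat indefinitely and cause a rebalance storm. */
    boolean stormRisk() { return stormRisk; }

    /** Builds a single log line that states the reason and both flags. */
    String describe(String memberId) {
        return String.format(
            "Member %s re-joining group: reason=%s, triggersRebalance=%b, stormRisk=%b",
            memberId, name(), triggersRebalance, stormRisk);
    }
}
```

Logging `reason.describe(memberId)` at every re-join would let an operator grep for a single reason name and see at a glance whether the member is causing the rebalance or merely reacting to one.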

I'd suggest we first go through the log4j hierarchy to make sure this is the 
right place, and maybe in the future we can expose a single state metric on top 
of the logging categorization for even more convenient troubleshooting.
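
One possible shape for such a state metric, sketched under the assumption that each re-join is recorded under a reason label: count re-joins per reason and report the most frequent one. `RejoinStateMetric` and the reason strings are hypothetical; this is not tied to Kafka's metrics classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical per-reason re-join counter (not a Kafka API). */
final class RejoinStateMetric {
    // Insertion order is kept so that ties resolve to the reason seen first.
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    /** Records one re-join attributed to the given reason label. */
    void record(String reason) {
        counts.merge(reason, 1, Integer::sum);
    }

    /** Number of re-joins recorded for the given reason. */
    int count(String reason) {
        return counts.getOrDefault(reason, 0);
    }

    /** The most frequent reason so far, or empty if nothing was recorded. */
    Optional<String> dominantReason() {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Exposing `dominantReason()` as a gauge would give operators a first hint at the root cause without reading through the logs.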



--
This message was sent by Atlassian Jira
(v8.3.4#803005)