Guozhang Wang created KAFKA-12352:
-------------------------------------
Summary: Improve debuggability with continuous consumer rebalances
Key: KAFKA-12352
URL: https://issues.apache.org/jira/browse/KAFKA-12352
Project: Kafka
Issue Type: Improvement
Components: consumer, streams
Reporter: Guozhang Wang
Assignee: Guozhang Wang
There are several scenarios where a consumer/streams client can fall into
continuous rebalances and hence does not make any progress. Today when this
happens, developers usually need to do a lot of digging in order to get insight
into what is happening. Here's a short summary of the different scenarios where
we (re-)trigger rebalances:
1. Group member kicked out of the group: once the coordinator has kicked out
the member, the member's next join / sync / heartbeat / offset-commit request
will fail and the member will try to re-join. If the member constantly calls
poll too late, it will continuously fall into this scenario and not make
progress.
2. Group is rebalancing: if the group is rebalancing at the moment, the
member's heartbeat / offset commit / sync-group will fail. In this case the
member re-joining the group is not the root cause of the rebalance anyway.
3. The caller enforces a rebalance via `enforceRebalance`. This is one-off and
should not cause rebalance storms.
4. After a rebalance is completed, the member finds out that a) its
subscription has changed or b) its subscribed topics' number of partitions has
changed since the re-join request was sent. In this case it needs to re-trigger
the rebalance in order to get the new assignment. Since the subscription change
is one-off, it should not cause rebalance storms; topic metadata changes should
also be infrequent, but there are some rare cases where topic metadata keeps
"vibrating" due to broker side issues.
5. After a rebalance is completed, the member needs to revoke some partitions
as indicated by the assignment. After the revocation it needs to re-join the
group. This may cause rebalance storms when the partition assignor is
sub-optimal in determining the assignment, so that partitions keep migrating
around and rebalances are triggered continuously.
As we can see, scenarios 1 and 5 above could potentially cause rebalance
storms, while scenarios 2, 3 and 4 should not in normal cases. In all of these
scenarios, we should expose exactly the reason why the member is re-joining the
group, and whether this re-join would trigger a rebalance or whether the group
is already in a rebalance (in which case the join-group is a result of the
rebalance, not its cause). This could help operators quickly nail down which of
the above may be the root cause of continuous rebalances.
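
To make the idea concrete, here is a minimal sketch of how the five scenarios
above could be categorized and attached to a log line. It is only an
illustration: `RejoinReasonDemo`, `RejoinReason`, `logLine` and the flag names
are all hypothetical and are not existing Kafka client classes or APIs, and the
flag values simply restate the classification given in this ticket.

```java
/** Hypothetical sketch; none of these names exist in the Kafka clients. */
class RejoinReasonDemo {

    /** One entry per re-join scenario listed in the ticket. */
    enum RejoinReason {
        // (triggersRebalance, mayCauseStorm), following the ticket's classification
        MEMBER_KICKED_OUT(true, true),                  // scenario 1
        GROUP_ALREADY_REBALANCING(false, false),        // scenario 2
        ENFORCED_BY_CALLER(true, false),                // scenario 3
        SUBSCRIPTION_OR_METADATA_CHANGED(true, false),  // scenario 4
        PARTITIONS_REVOKED(true, true);                 // scenario 5

        /** True if this re-join itself causes a rebalance rather than resulting from one. */
        final boolean triggersRebalance;
        /** True if the scenario can plausibly repeat and lead to a rebalance storm. */
        final boolean mayCauseStorm;

        RejoinReason(boolean triggersRebalance, boolean mayCauseStorm) {
            this.triggersRebalance = triggersRebalance;
            this.mayCauseStorm = mayCauseStorm;
        }
    }

    /** Builds the message a client could log each time it re-joins the group. */
    static String logLine(String memberId, RejoinReason reason) {
        return String.format(
            "Member %s re-joining group: reason=%s, triggersRebalance=%b, mayCauseStorm=%b",
            memberId, reason, reason.triggersRebalance, reason.mayCauseStorm);
    }

    public static void main(String[] args) {
        // Print one example log line per scenario.
        for (RejoinReason reason : RejoinReason.values()) {
            System.out.println(logLine("consumer-1", reason));
        }
    }
}
```

With a categorization along these lines, an operator could grep the client logs
for a single reason value and see at a glance whether a member is causing the
rebalances or merely reacting to them.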
I'd suggest we first go through the log4j hierarchy to make sure this is the
right place, and maybe in the future we can expose a single state metric on top
of the logging categorization for even more convenient troubleshooting.