Guozhang Wang created KAFKA-12472:
-------------------------------------

             Summary: Add a Consumer / Streams metric to indicate the current 
rebalance status
                 Key: KAFKA-12472
                 URL: https://issues.apache.org/jira/browse/KAFKA-12472
             Project: Kafka
          Issue Type: Improvement
          Components: consumer, streams
            Reporter: Guozhang Wang


Today, to troubleshoot a rebalance issue operators need to perform a lot of manual
steps: locating the problematic members, searching through log entries, and looking
for related metrics. It would be great to add a single metric that covers all
these manual steps, so that operators only need to check this single signal to
determine the root cause. A concrete idea is to expose two enum gauge
metrics on consumer and streams, respectively:

* Consumer level (the order below is by design, see the Streams level for details; 
a rough enum sketch follows this list):
  0. None => there is no rebalance ongoing.
  1. CoordinatorRequested => any coordinator response contains a 
RebalanceInProgress error code.
  2. NewMember => the join group response has a MemberIdRequired error 
code.
  3. UnknownMember => any coordinator response contains an 
UnknownMember error code, indicating this member has already been kicked out of 
the group.
  4. StaleMember => any coordinator response contains an 
IllegalGeneration error code.
  5. DroppedGroup => the heartbeat thread decides to call leaveGroup because the 
heartbeat has expired.
  6. UserRequested => leaveGroup is called upon the shutdown / unsubscribeAll API, 
as well as upon calling the enforceRebalance API.
  7. MetadataChanged => requestRejoin is triggered because metadata has changed.
  8. SubscriptionChanged => requestRejoin is triggered because the subscription 
has changed.
  9. RetryOnError => the join/syncGroup response contains a retriable error 
which causes the consumer to back off and retry.
 10. RevocationNeeded => requestRejoin is triggered because the set of revoked 
partitions is not empty.
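
To make the proposal concrete, here is a rough Java sketch of what such an enum 
gauge value could look like. The RebalanceStatus name and constant spellings are 
purely illustrative assumptions, not an existing Kafka class:

{code:java}
// Illustrative sketch only: RebalanceStatus and its constants are hypothetical
// names for the consumer-level enum gauge proposed above.
public enum RebalanceStatus {
    NONE(0),
    COORDINATOR_REQUESTED(1),
    NEW_MEMBER(2),
    UNKNOWN_MEMBER(3),
    STALE_MEMBER(4),
    DROPPED_GROUP(5),
    USER_REQUESTED(6),
    METADATA_CHANGED(7),
    SUBSCRIPTION_CHANGED(8),
    RETRY_ON_ERROR(9),
    REVOCATION_NEEDED(10);

    private final int code;

    RebalanceStatus(final int code) {
        this.code = code;
    }

    // The numeric code is what the enum gauge metric would report.
    public int code() {
        return code;
    }
}
{code}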

The transition rule is that a non-zero status code can only transition to zero or 
to a higher code, but not to a lower code (the same holds for streams, see the 
rationales below).
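
As a minimal sketch of that rule, building on the hypothetical RebalanceStatus 
enum above (class and method names are again made up for illustration):

{code:java}
// Hypothetical guard for the transition rule: a status may always reset to
// NONE (0) or move to a strictly higher code, but never fall back to a lower
// non-zero code.
public class RebalanceStatusGauge {
    private RebalanceStatus current = RebalanceStatus.NONE;

    public synchronized boolean maybeTransitTo(final RebalanceStatus next) {
        if (next == RebalanceStatus.NONE || next.code() > current.code()) {
            current = next;
            return true;
        }
        // A lower non-zero code does not override the current status.
        return false;
    }

    public synchronized RebalanceStatus current() {
        return current;
    }
}
{code}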

* Streams level: today a streams client can have multiple consumers. We 
introduce some new enum states as well as aggregation rules across consumers: 
if there are no streams-layer events (listed below) that transition its status 
(i.e. the streams layer thinks it is 0), then we aggregate across all the embedded 
consumers and take the largest status code value as the streams metric; if 
there is a streams-layer event that determines its status should be 10 or higher, 
then it overrides all embedded consumer-layer status codes. In addition, when 
creating an aggregated metric across the streams instances within an app, we follow 
the same aggregation rule, e.g. if there are two streams instances where one 
instance's status code is 1 and the other's is 10, then the app's status is 
10. A sketch of this aggregation rule is given after the list below.

 10. RevocationNeeded => the definition is extended from the original 10) 
defined for the consumer above: it also covers the case where the leader decides 
to revoke either active or standby tasks and hence schedules a follow-up rebalance.
 11. AssignmentProbing => the leader decides to schedule a follow-up rebalance 
because the current assignment is unstable.
 12. VersionProbing => the leader decides to schedule a follow-up rebalance due 
to version probing.
 13. EndpointUpdate => any member decides to schedule a follow-up rebalance due 
to endpoint updates.
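
As a rough illustration of the aggregation rule described above (the class and 
method names are hypothetical; plain integer codes are used so that the 
streams-only values 11-13 fit in as well):

{code:java}
import java.util.Collection;

// Hypothetical sketch of the aggregation rule: a streams-layer status of 10+
// overrides the embedded consumers; otherwise the largest consumer code wins.
public final class RebalanceStatusAggregation {

    public static int aggregate(final int streamsLayerCode,
                                final Collection<Integer> consumerCodes) {
        if (streamsLayerCode >= 10) {
            // Streams-layer events (RevocationNeeded, AssignmentProbing,
            // VersionProbing, EndpointUpdate) take precedence.
            return streamsLayerCode;
        }
        // Otherwise report the largest status code among the embedded consumers.
        return consumerCodes.stream()
                            .mapToInt(Integer::intValue)
                            .max()
                            .orElse(0);
    }
}
{code}

The same largest-code rule would then apply once more when aggregating across the 
streams instances of an application.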


The main motivations for the proposed precedence order are the following:
1. When a rebalance is triggered by one member, all other members would only 
know it is due to CoordinatorRequested from the coordinator error codes, and hence 
CoordinatorRequested should be overridden by any other status when aggregating 
across clients.
2. DroppedGroup could cause unknown/stale members that would fail and retry 
immediately, and hence it should take higher precedence.
3. The revocation definition is extended in Streams, and hence it needs to take the 
highest precedence among all consumer-only statuses so that it is not 
overridden by any of them.
4. In general, rarer events get higher precedence.

Any comments on the precedence rules / categorization are more than welcome!


