[ https://issues.apache.org/jira/browse/KAFKA-12472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306664#comment-17306664 ]
A. Sophie Blee-Goldman commented on KAFKA-12472:
------------------------------------------------

[~guozhang] what do you think of these names & ordering to replace/expand on the current #6 status "*UserRequested* => when leaveGroup upon the shutdown / unsubscribeAll API, as well as upon calling the enforceRebalance API":

 6. *ConsumerClosed* => when leaveGroup is called upon shutdown of the client.
 7. *UnsubscribedAll* => when leaveGroup is called upon unsubscribing from all topics.
 8. *UserRequested* => when the user requests a rebalance via the enforceRebalance API.

> Add a Consumer / Streams metric to indicate the current rebalance status
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-12472
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12472
>             Project: Kafka
>          Issue Type: Improvement
>          Components: consumer, streams
>            Reporter: Guozhang Wang
>            Priority: Major
>              Labels: needs-kip
>
> Today, to troubleshoot a rebalance issue, operators need to do a lot of manual
> steps: locating the problematic members, searching the log entries, and looking
> for related metrics. It would be great to add a single metric that covers all
> these manual steps, so that operators only need to check this single signal
> to find the root cause. A concrete idea is to expose two enum gauge
> metrics, on consumer and streams respectively:
>
> * Consumer level (the order below is by design, see Streams level for
> details):
>   0. *None* => there is no rebalance ongoing.
>   1. *CoordinatorRequested* => any coordinator response contains a
> RebalanceInProgress error code.
>   2. *NewMember* => when the join group response has a MemberIdRequired error
> code.
>   3. *UnknownMember* => when any coordinator response contains an
> UnknownMember error code, indicating this member has already been kicked out
> of the group.
>   4. *StaleMember* => when any coordinator response contains an
> IllegalGeneration error code.
>   5. *DroppedGroup* => when the hb thread decides to call leaveGroup because
> the hb expired.
>   6. *UserRequested* => when leaveGroup is called upon the shutdown /
> unsubscribeAll API, as well as upon calling the enforceRebalance API.
>   7. *MetadataChanged* => requestRejoin triggered since metadata has changed.
>   8. *SubscriptionChanged* => requestRejoin triggered since subscription has
> changed.
>   9. *RetryOnError* => when the join/syncGroup response contains a retriable
> error which would cause the consumer to back off and retry.
>   10. *RevocationNeeded* => requestRejoin triggered since the set of revoked
> partitions is not empty.
>
> The transition rule is that a non-zero status code can only transition to
> zero or to a higher code, but not to a lower code (same for streams, see
> rationale below).
>
> * Streams level: today a streams client can have multiple consumers. We
> introduce some new enum states as well as aggregation rules across
> consumers: if there are no streams-layer events as below that change its
> status (i.e. the streams layer thinks it is 0), then we aggregate across all
> the embedded consumers and take the largest status code value as the streams
> metric; if there are streams-layer events that determine its status should
> be in 10+, then it ignores all embedded consumer-layer status codes since it
> should have a higher precedence. In addition, when creating an aggregated
> metric across streams instances (a.k.a. at the app level, which is usually
> what we would care about and alert on), we also follow the same aggregation
> rule, e.g. if there are two streams instances where one instance's status
> code is 1) and the other is 10), then the app's status is 10).
>   10. *RevocationNeeded* => the definition of this is changed to the original
> 10) defined in consumer above, OR the leader decides to revoke either
> active/standby tasks and hence schedules follow-ups.
>   11. *AssignmentProbing* => the leader decides to schedule follow-ups since
> the current assignment is unstable.
>   12. *VersionProbing* => the leader decides to schedule follow-ups due to
> version probing.
>   13. *EndpointUpdate* => anyone decides to schedule follow-ups due to
> endpoint updates.
>   14. *EOSViolated* => when OutOfOrderSequenceException is thrown, causing
> TaskMigratedException.
>   15. *EOSProducerFenced* => when ProducerFencedException /
> InvalidProducerEpochException / UnknownProducerIdException is thrown,
> causing TaskMigratedException.
>   16. *ConsumerDropped* => when CommitFailedException is thrown, causing
> TaskMigratedException.
>
> The main motivations of the above proposed precedence order are the following:
> 1. When a rebalance is triggered by one member, all other members would only
> know it is due to CoordinatorRequested from coordinator error codes, and
> hence CoordinatorRequested should be overridden by any other status when
> aggregating across clients.
> 2. DroppedGroup could cause unknown/stale members that would fail and retry
> immediately, and hence should take higher precedence.
> 3. The Revocation definition is extended in Streams, and hence it needs to
> take the highest precedence among all consumer-only statuses so that it would
> not be overridden by any of them.
> 4. In general, rarer events get higher precedence.
>
> This is proposed on top of KAFKA-12352. Any comments on the precedence rules
> / categorization are more than welcome!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
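
For illustration, the transition rule and the aggregation rules from the quoted description could be sketched in Java roughly as below. This is only a sketch under assumptions: the constants follow the numbering in the issue description (not the renamed statuses 6-8 suggested in the comment), the names RebalanceStatus, RebalanceStatusSketch, transition, highest and streamsStatus are hypothetical and not part of any Kafka API, and a disallowed move to a lower code is silently ignored, which the description does not specify.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical enum: ordinal() doubles as the status code from the description.
enum RebalanceStatus {
    NONE,                   // 0
    COORDINATOR_REQUESTED,  // 1
    NEW_MEMBER,             // 2
    UNKNOWN_MEMBER,         // 3
    STALE_MEMBER,           // 4
    DROPPED_GROUP,          // 5
    USER_REQUESTED,         // 6
    METADATA_CHANGED,       // 7
    SUBSCRIPTION_CHANGED,   // 8
    RETRY_ON_ERROR,         // 9
    REVOCATION_NEEDED,      // 10 (consumer level, extended at Streams level)
    ASSIGNMENT_PROBING,     // 11
    VERSION_PROBING,        // 12
    ENDPOINT_UPDATE,        // 13
    EOS_VIOLATED,           // 14
    EOS_PRODUCER_FENCED,    // 15
    CONSUMER_DROPPED;       // 16

    int code() {
        return ordinal();
    }
}

final class RebalanceStatusSketch {
    private RebalanceStatusSketch() { }

    // Transition rule: a status may move to zero or to a code that is not lower.
    // Assumption: a request for a lower non-zero code is ignored rather than rejected.
    static RebalanceStatus transition(RebalanceStatus current, RebalanceStatus requested) {
        boolean allowed = requested == RebalanceStatus.NONE
                || requested.code() >= current.code();
        return allowed ? requested : current;
    }

    // Largest status code in the list; NONE when the list is empty.
    // Also serves as the app-level aggregation across Streams instances.
    static RebalanceStatus highest(List<RebalanceStatus> statuses) {
        return statuses.stream()
                .max(Comparator.naturalOrder())
                .orElse(RebalanceStatus.NONE);
    }

    // Streams-level aggregation: a non-zero streams-layer status takes precedence;
    // otherwise fall back to the highest status among the embedded consumers.
    static RebalanceStatus streamsStatus(RebalanceStatus streamsLayer,
                                         List<RebalanceStatus> consumers) {
        return streamsLayer != RebalanceStatus.NONE ? streamsLayer : highest(consumers);
    }

    public static void main(String[] args) {
        List<RebalanceStatus> consumers = Arrays.asList(
                RebalanceStatus.COORDINATOR_REQUESTED, RebalanceStatus.DROPPED_GROUP);
        System.out.println(transition(RebalanceStatus.DROPPED_GROUP, RebalanceStatus.NEW_MEMBER));
        System.out.println(streamsStatus(RebalanceStatus.NONE, consumers));
        System.out.println(streamsStatus(RebalanceStatus.VERSION_PROBING, consumers));
    }
}
```

Using the enum's natural ordering keeps the precedence order in one place: reordering or splitting a status (such as the 6/7/8 split proposed in the comment) would only change the constant list, not the aggregation code.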