[ 
https://issues.apache.org/jira/browse/KAFKA-12472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305217#comment-17305217
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12472:
------------------------------------------------

That all makes sense to me, but I guess I'm still confused about why we would 
enforce that it can only transition to a higher state or to 0. I agree that the 
higher the code, the more important it is from an observability standpoint, and 
I agree with your proposed ordering, but if we enforce this then it seems like 
we won't be getting the most up-to-date view on the consumer status, and worse 
we may have some conflicting reports as to the cause of a rebalance where some 
of the consumers are reporting a condition that's no longer afflicting them.

For example, suppose member A has dropped out of the group and transitions to 
the *DroppedGroup* state. After A rejoins, but before this rebalance is 
finalized, some other member B triggers a probing rebalance and transitions to 
*AssignmentProbing*, which causes A to get the RebalanceInProgressException. If 
A can only transition to a higher status, then it will remain in *DroppedGroup* 
until the probing rebalance has completed, which is misleading since A is no 
longer dropped out during the probing rebalance. If we get into a cycle of 
rebalancing, the problem will just get worse: if yet another member C then 
triggers a rebalance, this time due to *RevocationNeeded*, B will be stuck in 
*AssignmentProbing*, which is also out of date since the true cause of the 
latest rebalance is *RevocationNeeded*.
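
To make the scenario concrete, here is a minimal Java sketch that contrasts the proposed monotonic rule with one that simply tracks the latest observation. The enum, class, and method names are hypothetical and not part of any Kafka API; only the status names and codes are taken from the proposal quoted below.

```java
// Hypothetical subset of the proposed status codes (numbering from the proposal).
enum RebalanceStatus {
    NONE(0), COORDINATOR_REQUESTED(1), DROPPED_GROUP(5),
    REVOCATION_NEEDED(10), ASSIGNMENT_PROBING(11);

    final int code;

    RebalanceStatus(int code) {
        this.code = code;
    }
}

final class TransitionRules {
    // Proposed rule: a status may only move to zero or to a higher code.
    static RebalanceStatus monotonic(RebalanceStatus current, RebalanceStatus observed) {
        if (observed == RebalanceStatus.NONE || observed.code > current.code) {
            return observed;
        }
        return current; // observations with a lower or equal code are ignored
    }

    // Alternative: the status always reflects the most recent observation.
    static RebalanceStatus latest(RebalanceStatus current, RebalanceStatus observed) {
        return observed;
    }

    public static void main(String[] args) {
        // Member A dropped out, rejoined, and then sees RebalanceInProgress
        // because member B triggered a probing rebalance.
        RebalanceStatus a = RebalanceStatus.DROPPED_GROUP;
        RebalanceStatus seen = RebalanceStatus.COORDINATOR_REQUESTED;
        System.out.println(monotonic(a, seen)); // prints DROPPED_GROUP (stale)
        System.out.println(latest(a, seen));    // prints COORDINATOR_REQUESTED
    }
}
```

Under the monotonic rule, A keeps reporting *DroppedGroup* until it passes through *None*, which is the stale view described above.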

If such a thing occurs, don't we want only the member that caused the current 
rebalance to report a specific status, while all others report that the current 
rebalance is just due to another member triggering it? In the example above the 
status of these members during each rebalance should be:

Rebalance 1: member A is *DroppedGroup*, B and C are *CoordinatorRequested*
Rebalance 2: member B is *AssignmentProbing*, A and C are *CoordinatorRequested*
Rebalance 3: member C is *RevocationNeeded*, A and B are *CoordinatorRequested*
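
The table above follows a simple rule, sketched here in Java under the assumption that each rebalance has exactly one triggering member. `RebalanceReport` and `statusesFor` are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RebalanceReport {
    // The triggering member reports its own cause; every other member only
    // knows that the coordinator asked it to rebalance.
    static Map<String, String> statusesFor(String trigger, String cause, List<String> members) {
        Map<String, String> statuses = new LinkedHashMap<>();
        for (String member : members) {
            statuses.put(member, member.equals(trigger) ? cause : "CoordinatorRequested");
        }
        return statuses;
    }

    public static void main(String[] args) {
        List<String> members = List.of("A", "B", "C");
        System.out.println(statusesFor("A", "DroppedGroup", members));
        System.out.println(statusesFor("B", "AssignmentProbing", members));
        System.out.println(statusesFor("C", "RevocationNeeded", members));
    }
}
```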

Or are you suggesting that member A would have to transition back to *None* in 
between the *DroppedGroup* and the *CoordinatorRequested* state, even though it 
was actually never stable and was rebalancing this entire time? I just feel 
this will be more confusing, and don't see why we would need to enforce a rule 
on the consumer's status transitions -- shouldn't it just reflect what is 
currently going on?

> Add a Consumer / Streams metric to indicate the current rebalance status
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-12472
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12472
>             Project: Kafka
>          Issue Type: Improvement
>          Components: consumer, streams
>            Reporter: Guozhang Wang
>            Priority: Major
>              Labels: needs-kip
>
> Today, to troubleshoot a rebalance issue, operators need to do a lot of manual 
> steps: locating the problematic members, searching the log entries, and 
> looking for related metrics. It would be great to add a single metric that 
> covers all these manual steps, so that operators would only need to check this 
> single signal to find the root cause. A concrete idea is to expose two enum 
> gauge metrics on consumer and streams, respectively:
> * Consumer level (the order below is by-design, see Streams level for 
> details):
>   0. *None* => there is no rebalance ongoing.
>   1. *CoordinatorRequested* => when any coordinator response contains a 
> RebalanceInProgress error code.
>   2. *NewMember* => when the join group response has a MemberIdRequired error 
> code.
>   3. *UnknownMember* => when any coordinator response contains an 
> UnknownMember error code, indicating this member has already been kicked out 
> of the group.
>   4. *StaleMember* => when any coordinator response contains an 
> IllegalGeneration error code.
>   5. *DroppedGroup* => when the heartbeat thread decides to call leaveGroup 
> because the heartbeat expired.
>   6. *UserRequested* => when leaveGroup is called upon the shutdown / 
> unsubscribeAll API, as well as upon calling the enforceRebalance API.
>   7. *MetadataChanged* => requestRejoin triggered since metadata has changed.
>   8. *SubscriptionChanged* => requestRejoin triggered since subscription has 
> changed.
>   9. *RetryOnError* => when join/syncGroup response contains a retriable 
> error which would cause the consumer to back off and retry.
>  10. *RevocationNeeded* => requestRejoin triggered since the set of revoked 
> partitions is not empty.
> The transition rule is that a non-zero status code can only transition to 
> zero or to a higher code, but not to a lower code (same for streams, see 
> rationales below).
> * Streams level: today a streams client can have multiple consumers. We 
> introduce some new enum states as well as aggregation rules across 
> consumers: if there are no streams-layer events, as listed below, that 
> transition its status (i.e. the streams layer thinks it is 0), then we 
> aggregate across all the embedded consumers and take the largest status code 
> value as the streams metric; if there are streams-layer events that determine 
> its status should be 10+, then it ignores all embedded consumer-layer status 
> codes since it should have a higher precedence. In addition, when creating an 
> aggregated metric across streams instances (a.k.a. at the app level, which is 
> usually what we would care about and alert on), we also follow the same 
> aggregation rule, e.g. if there are two streams instances where one 
> instance's status code is 1) and the other's is 10), then the app's status is 
> 10).
>  10. *RevocationNeeded* => the definition of this is changed to the original 
> 10) defined in consumer above, OR the leader decides to revoke either 
> active/standby tasks and hence schedules follow-ups.
>  11. *AssignmentProbing* => the leader decides to schedule follow-ups since 
> the current assignment is unstable.
>  12. *VersionProbing* => the leader decides to schedule follow-ups due to 
> version probing.
>  13. *EndpointUpdate* => anyone decides to schedule follow-ups due to 
> endpoint updates.
>  14. *EOSViolated* => when OutOfOrderSequenceException is thrown, causing 
> TaskMigratedException
>  15. *EOSProducerFenced* => when ProducerFencedException / 
> InvalidProducerEpochException / UnknownProducerIdException is thrown, 
> causing TaskMigratedException
>  16. *ConsumerDropped* => when CommitFailedException is thrown, causing 
> TaskMigratedException
> The main motivations of the above proposed precedence order are the following:
> 1. When a rebalance is triggered by one member, all other members would only 
> know it is due to CoordinatorRequested from coordinator error codes, and 
> hence CoordinatorRequested should be overridden by any other status when 
> aggregating across clients.
> 2. DroppedGroup could cause unknown/stale members that would fail and retry 
> immediately, and hence should take higher precedence.
> 3. The Revocation definition is extended in Streams, and hence it needs to 
> take the highest precedence among all consumer-only statuses so that it would 
> not be overridden by any of them.
> 4. In general, rarer events get higher precedence.
> This is proposed on top of KAFKA-12352. Any comments on the precedence rules 
> / categorization are more than welcome!
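
The aggregation rule in the quoted description can be sketched roughly as follows. This is only an illustration in Java: `StatusAggregation` and its methods are hypothetical names, codes are plain integers as numbered in the proposal, and the streams layer is assumed to report either 0 or a code of 10 and above.

```java
import java.util.List;

final class StatusAggregation {
    // Largest status code in the list, or 0 (None) if the list is empty.
    static int largest(List<Integer> codes) {
        int result = 0;
        for (int code : codes) {
            result = Math.max(result, code);
        }
        return result;
    }

    // A streams-layer status (10+) takes precedence over the embedded
    // consumers; otherwise take the largest code across the consumers.
    static int streamsStatus(int streamsLayerCode, List<Integer> consumerCodes) {
        return streamsLayerCode >= 10 ? streamsLayerCode : largest(consumerCodes);
    }

    // App level: the same "largest code wins" rule across streams instances.
    static int appStatus(List<Integer> instanceCodes) {
        return largest(instanceCodes);
    }

    public static void main(String[] args) {
        int first = streamsStatus(0, List.of(1, 5));  // consumers only: 5
        int second = streamsStatus(11, List.of(1, 5)); // streams layer wins: 11
        System.out.println(appStatus(List.of(first, second))); // prints 11
    }
}
```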



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
