[ 
https://issues.apache.org/jira/browse/KAFKA-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17401408#comment-17401408
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12550:
------------------------------------------------

Yeah, the RESTORING would be an additional KafkaStreams state that exists in 
parallel to the existing REBALANCING state. imo REBALANCING should still take 
precedence over RESTORING, ie as long as at least one thread is going through a 
rebalance then the overall state should be REBALANCING. And if no threads are 
rebalancing but at least one is still restoring, then the overall state is 
RESTORING. And so on.

I also think we should consider breaking away from the rebalance callbacks 
since the thread states don't really make sense to couple with these callbacks 
anymore since cooperative rebalancing was introduced. Before that, 
PARTITIONS_REVOKED always indicated the beginning of a rebalance, and 
PARTITIONS_ASSIGNED the end. But now with cooperative rebalancing, you may 
never invoke #onPartitionsRevoked to begin with, so it's actually possible for 
Streams to stay in RUNNING for the duration of the actual rebalance and then 
only when the rebalance ends do the threads transition to PARTITIONS_ASSIGNED 
and the overall state to REBALANCING. It's also a bit confusing since 
PARTITIONS_ASSIGNED is supposed to indicate the end of a rebalance, but if a 
followup rebalance is immediately triggered and the consumer must rejoin, then 
it may actually still be rebalancing even after entering PARTITIONS_ASSIGNED. 
The whole thing makes less and less sense.

So, I'd propose to also clean up the StreamThread FSM at the same time by 
removing the PARTITIONS_ASSIGNED/PARTITIONS_REVOKED states and replacing them 
with the equivalent REBALANCING/RESTORING. As the names suggest, when the 
thread first rejoins the group (ie sends the Subscription for the rebalance) 
then it will transition to REBALANCING. At the end of the rebalance, when it 
receives its Assignment, it then transitions to RESTORING. That way it's always 
clear what the thread is doing, and if a followup rebalance is ever triggered 
then it will automatically transition back to the appropriate state, ie 
REBALANCING.

Does that make sense? Unfortunately we'll now need to wait for another major 
release, since changing the FSM is a breaking change. But it would probably be 
a good idea to at least start the KIP now and get it accepted so that we can be 
ready when 4.0 comes along

> Introduce RESTORING state to the KafkaStreams FSM
> -------------------------------------------------
>
>                 Key: KAFKA-12550
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12550
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Assignee: Sagar Rao
>            Priority: Major
>              Labels: needs-kip
>             Fix For: 4.0.0
>
>
> We should consider adding a new state to the KafkaStreams FSM: RESTORING
> This would cover the time between the completion of a stable rebalance and 
> the completion of restoration across the client. Currently, Streams will 
> report the state during this time as REBALANCING even though it is generally 
> spending much more time restoring than rebalancing in most cases.
> There are a few motivations/benefits behind this idea:
> # Observability is a big one: using the umbrella REBALANCING state to cover 
> all aspects of rebalancing -> task initialization -> restoring has been a 
> common source of confusion in the past. It’s also proved to be a time sink 
> for us, during escalations, incidents, mailing list questions, and bug 
> reports. It often adds latency to escalations in particular as we have to go 
> through GTS and wait for the customer to clarify whether their “Kafka Streams 
> is stuck rebalancing” ticket means that it’s literally rebalancing, or just 
> in the REBALANCING state and actually stuck elsewhere in Streams
> # Prereq for global thread improvements: for example [KIP-406: 
> GlobalStreamThread should honor custom reset policy 
> |https://cwiki.apache.org/confluence/display/KAFKA/KIP-406%3A+GlobalStreamThread+should+honor+custom+reset+policy]
>  was ultimately blocked on this as we needed to pause the Streams app while 
> the global thread restored from the appropriate offset. Since there’s 
> absolutely no rebalancing involved in this case, piggybacking on the 
> REBALANCING state would just be shooting ourselves in the foot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to