[ 
https://issues.apache.org/jira/browse/KAFKA-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17399394#comment-17399394
 ] 

Matthias J. Sax commented on KAFKA-12550:
-----------------------------------------

Yes, it's about state store restoration. Also note that there is a difference 
between the state of individual threads and the Kafka Streams client. 
REBALANCING is a client state, not a thread state.

The consumer client doesn't really have a state machine similar to that of 
Kafka Streams. Of course, consumers will follow the rebalance protocol etc., 
but nothing is exposed to users. Thus, there is no need to worry about 
consumer clients.

We basically want to add a new state RESTORING to the threads and the client, 
and need to decide how to translate the thread states into the client state. 
At the moment, the client is in state REBALANCING if at least one thread is 
in state PARTITIONS_REVOKED or PARTITIONS_ASSIGNED. I guess we might want to 
follow a similar pattern and transition the client into state RESTORING if at 
least one thread is in state RESTORING. One corner case is if one thread is 
in state PARTITIONS_REVOKED or PARTITIONS_ASSIGNED and a second thread is in 
state RESTORING: we might want to keep the client state in REBALANCING for 
this case?
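
To make the proposed translation concrete, here is a rough sketch of how the 
client state could be derived from the thread states. This is not the actual 
KafkaStreams code: the class ClientStateSketch, the method deriveClientState, 
and the reduced enums are hypothetical, the RESTORING state does not exist 
yet, and the other real thread states (e.g. the startup and shutdown states) 
are left out. It assumes that REBALANCING wins over RESTORING in the corner 
case above, which is still an open question.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only; not part of the Kafka Streams code base.
final class ClientStateSketch {

    // Reduced set of thread states; RESTORING is the proposed new state.
    enum ThreadState { PARTITIONS_REVOKED, PARTITIONS_ASSIGNED, RESTORING, RUNNING }

    // Reduced set of client states; RESTORING is the proposed new state.
    enum ClientState { REBALANCING, RESTORING, RUNNING }

    // Derives the client state from the states of all threads.
    // Assumption: REBALANCING takes precedence over RESTORING.
    // Assumption: a client with no threads is reported as RUNNING.
    static ClientState deriveClientState(List<ThreadState> threadStates) {
        boolean anyRestoring = false;
        for (ThreadState state : threadStates) {
            if (state == ThreadState.PARTITIONS_REVOKED
                    || state == ThreadState.PARTITIONS_ASSIGNED) {
                // A single rebalancing thread puts the whole client into REBALANCING.
                return ClientState.REBALANCING;
            }
            if (state == ThreadState.RESTORING) {
                anyRestoring = true;
            }
        }
        // No thread is rebalancing: RESTORING if any thread still restores.
        return anyRestoring ? ClientState.RESTORING : ClientState.RUNNING;
    }

    public static void main(String[] args) {
        // Corner case: one thread rebalancing, a second thread restoring.
        System.out.println(deriveClientState(
            Arrays.asList(ThreadState.PARTITIONS_ASSIGNED, ThreadState.RESTORING)));
        // One thread still restoring, the other already running.
        System.out.println(deriveClientState(
            Arrays.asList(ThreadState.RUNNING, ThreadState.RESTORING)));
    }
}
```

If we decide the corner case the other way, only the early return would need 
to change.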

> Introduce RESTORING state to the KafkaStreams FSM
> -------------------------------------------------
>
>                 Key: KAFKA-12550
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12550
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Assignee: Sagar Rao
>            Priority: Major
>              Labels: needs-kip
>             Fix For: 4.0.0
>
>
> We should consider adding a new state to the KafkaStreams FSM: RESTORING.
> This would cover the time between the completion of a stable rebalance and 
> the completion of restoration across the client. Currently, Streams will 
> report the state during this time as REBALANCING even though it is generally 
> spending much more time restoring than rebalancing in most cases.
> There are a few motivations/benefits behind this idea:
> # Observability is a big one: using the umbrella REBALANCING state to cover 
> all aspects of rebalancing -> task initialization -> restoring has been a 
> common source of confusion in the past. It’s also proved to be a time sink 
> for us, during escalations, incidents, mailing list questions, and bug 
> reports. It often adds latency to escalations in particular as we have to go 
> through GTS and wait for the customer to clarify whether their “Kafka Streams 
> is stuck rebalancing” ticket means that it’s literally rebalancing, or just 
> in the REBALANCING state and actually stuck elsewhere in Streams.
> # Prereq for global thread improvements: for example [KIP-406: 
> GlobalStreamThread should honor custom reset policy 
> |https://cwiki.apache.org/confluence/display/KAFKA/KIP-406%3A+GlobalStreamThread+should+honor+custom+reset+policy]
>  was ultimately blocked on this as we needed to pause the Streams app while 
> the global thread restored from the appropriate offset. Since there’s 
> absolutely no rebalancing involved in this case, piggybacking on the 
> REBALANCING state would just be shooting ourselves in the foot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
