[ https://issues.apache.org/jira/browse/KAFKA-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17401408#comment-17401408 ]
A. Sophie Blee-Goldman commented on KAFKA-12550: ------------------------------------------------ Yeah, the RESTORING would be an additional KafkaStreams state that exists in parallel to the existing REBALANCING state. imo REBALANCING should still take precedence over RESTORING, ie as long as at least one thread is going through a rebalance then the overall state should be REBALANCING. And if no threads are rebalancing but at least one is still restoring, then the overall state is RESTORING. And so on. I also think we should consider breaking away from the rebalance callbacks since the thread states don't really make sense to couple with these callbacks anymore since cooperative rebalancing was introduced. Before that, PARTITIONS_REVOKED always indicated the beginning of a rebalance, and PARTITIONS_ASSIGNED the end. But now with cooperative rebalancing, you may never invoke #onPartitionsRevoked to begin with, so it's actually possible for Streams to stay in RUNNING for the duration of the actual rebalance and then only when the rebalance ends do the threads transition to PARTITIONS_ASSIGNED and the overall state to REBALANCING. It's also a bit confusing since PARTITIONS_ASSIGNED is supposed to indicate the end of a rebalance, but if a followup rebalance is immediately triggered and the consumer must rejoin, then it may actually still be rebalancing even after entering PARTITIONS_ASSIGNED. The whole thing makes less and less sense. So, I'd propose to also clean up the StreamThread FSM at the same time by removing the PARTITIONS_ASSIGNED/PARTITIONS_REVOKED states and replacing them with the equivalent REBALANCING/RESTORING. As the names suggest, when the thread first rejoins the group (ie sends the Subscription for the rebalance) then it will transition to REBALANCING. At the end of the rebalance, when it receives its Assignment, it then transitions to RESTORING. That way it's always clear what the thread is doing, and if a followup rebalance is ever triggered then it will automatically transition back to the appropriate state, ie REBALANCING. Does that make sense? Unfortunately we'll now need to wait for another major release, since changing the FSM is a breaking change. But it would probably be a good idea to at least start the KIP now and get it accepted so that we can be ready when 4.0 comes along > Introduce RESTORING state to the KafkaStreams FSM > ------------------------------------------------- > > Key: KAFKA-12550 > URL: https://issues.apache.org/jira/browse/KAFKA-12550 > Project: Kafka > Issue Type: Improvement > Components: streams > Reporter: A. Sophie Blee-Goldman > Assignee: Sagar Rao > Priority: Major > Labels: needs-kip > Fix For: 4.0.0 > > > We should consider adding a new state to the KafkaStreams FSM: RESTORING > This would cover the time between the completion of a stable rebalance and > the completion of restoration across the client. Currently, Streams will > report the state during this time as REBALANCING even though it is generally > spending much more time restoring than rebalancing in most cases. > There are a few motivations/benefits behind this idea: > # Observability is a big one: using the umbrella REBALANCING state to cover > all aspects of rebalancing -> task initialization -> restoring has been a > common source of confusion in the past. It’s also proved to be a time sink > for us, during escalations, incidents, mailing list questions, and bug > reports. It often adds latency to escalations in particular as we have to go > through GTS and wait for the customer to clarify whether their “Kafka Streams > is stuck rebalancing” ticket means that it’s literally rebalancing, or just > in the REBALANCING state and actually stuck elsewhere in Streams > # Prereq for global thread improvements: for example [KIP-406: > GlobalStreamThread should honor custom reset policy > |https://cwiki.apache.org/confluence/display/KAFKA/KIP-406%3A+GlobalStreamThread+should+honor+custom+reset+policy] > was ultimately blocked on this as we needed to pause the Streams app while > the global thread restored from the appropriate offset. Since there’s > absolutely no rebalancing involved in this case, piggybacking on the > REBALANCING state would just be shooting ourselves in the foot. -- This message was sent by Atlassian Jira (v8.3.4#803005)