[jira] [Updated] (KAFKA-12550) Introduce RESTORING state to the KafkaStreams FSM
[ https://issues.apache.org/jira/browse/KAFKA-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman updated KAFKA-12550:
-------------------------------------------
    Fix Version/s:     (was: 4.0.0)

> Introduce RESTORING state to the KafkaStreams FSM
> -------------------------------------------------
>
>                 Key: KAFKA-12550
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12550
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Priority: Major
>              Labels: needs-kip
>
> We should consider adding a new state to the KafkaStreams FSM: RESTORING
> This would cover the time between the completion of a stable rebalance and
> the completion of restoration across the client. Currently, Streams will
> report the state during this time as REBALANCING even though it is generally
> spending much more time restoring than rebalancing in most cases.
> There are a few motivations/benefits behind this idea:
> # Observability is a big one: using the umbrella REBALANCING state to cover
> all aspects of rebalancing -> task initialization -> restoring has been a
> common source of confusion in the past. It’s also proved to be a time sink
> for us, during escalations, incidents, mailing list questions, and bug
> reports. It often adds latency to escalations in particular as we have to go
> through GTS and wait for the customer to clarify whether their “Kafka Streams
> is stuck rebalancing” ticket means that it’s literally rebalancing, or just
> in the REBALANCING state and actually stuck elsewhere in Streams.
> # Prereq for global thread improvements: for example [KIP-406:
> GlobalStreamThread should honor custom reset policy
> |https://cwiki.apache.org/confluence/display/KAFKA/KIP-406%3A+GlobalStreamThread+should+honor+custom+reset+policy]
> was ultimately blocked on this as we needed to pause the Streams app while
> the global thread restored from the appropriate offset. Since there’s
> absolutely no rebalancing involved in this case, piggybacking on the
> REBALANCING state would just be shooting ourselves in the foot.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
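To make the proposal in the ticket concrete, the following is a minimal sketch of what an FSM with a separate RESTORING state could look like. It is not the real `KafkaStreams.State` enum and not a proposed patch: the class name `StreamsFsmSketch`, the reduced state set, and the transition table are all hypothetical. In particular, the direct RUNNING -> RESTORING edge is an assumption drawn from the global-thread case in the second motivation, where restoration happens with no rebalance involved.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; the real KafkaStreams.State enum has no
// RESTORING state and has additional states that are omitted here.
final class StreamsFsmSketch {

    enum State { CREATED, REBALANCING, RESTORING, RUNNING, PENDING_SHUTDOWN, NOT_RUNNING }

    // Assumed transition table: each state maps to the states it may move to.
    private static final Map<State, Set<State>> VALID = new EnumMap<>(State.class);

    static {
        VALID.put(State.CREATED, EnumSet.of(State.REBALANCING, State.PENDING_SHUTDOWN));
        // After a stable rebalance the client either restores or, if there is
        // nothing to restore, goes straight to RUNNING.
        VALID.put(State.REBALANCING,
                EnumSet.of(State.RESTORING, State.RUNNING, State.PENDING_SHUTDOWN));
        // A new rebalance can interrupt restoration.
        VALID.put(State.RESTORING,
                EnumSet.of(State.RUNNING, State.REBALANCING, State.PENDING_SHUTDOWN));
        // Assumption: RUNNING -> RESTORING covers a global thread restoring
        // without any rebalance taking place.
        VALID.put(State.RUNNING,
                EnumSet.of(State.REBALANCING, State.RESTORING, State.PENDING_SHUTDOWN));
        VALID.put(State.PENDING_SHUTDOWN, EnumSet.of(State.NOT_RUNNING));
        VALID.put(State.NOT_RUNNING, EnumSet.noneOf(State.class));
    }

    // Returns true if the assumed table allows moving from one state to the other.
    static boolean canTransition(State from, State to) {
        return VALID.get(from).contains(to);
    }

    // State reported once a rebalance has completed: RESTORING while any
    // restoration is still pending on the client, RUNNING otherwise.
    static State afterRebalance(boolean restorationPending) {
        return restorationPending ? State.RESTORING : State.RUNNING;
    }

    public static void main(String[] args) {
        System.out.println(afterRebalance(true));
        System.out.println(canTransition(State.RUNNING, State.RESTORING));
    }
}
```

With a table like this, a "stuck" client would report RESTORING rather than REBALANCING once the rebalance itself has finished, which is the observability distinction the ticket asks for.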
[jira] [Updated] (KAFKA-12550) Introduce RESTORING state to the KafkaStreams FSM
[ https://issues.apache.org/jira/browse/KAFKA-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman updated KAFKA-12550:
-------------------------------------------
    Fix Version/s:     (was: 3.0.0)
                   4.0.0

> Introduce RESTORING state to the KafkaStreams FSM
> -------------------------------------------------
>
>                 Key: KAFKA-12550
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12550
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Assignee: A. Sophie Blee-Goldman
>            Priority: Major
>              Labels: needs-kip
>             Fix For: 4.0.0
[jira] [Updated] (KAFKA-12550) Introduce RESTORING state to the KafkaStreams FSM
[ https://issues.apache.org/jira/browse/KAFKA-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman updated KAFKA-12550:
-------------------------------------------
    Priority: Major  (was: Blocker)

> Introduce RESTORING state to the KafkaStreams FSM
> -------------------------------------------------
>
>                 Key: KAFKA-12550
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12550
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Assignee: A. Sophie Blee-Goldman
>            Priority: Major
>              Labels: needs-kip
>             Fix For: 3.0.0
[jira] [Updated] (KAFKA-12550) Introduce RESTORING state to the KafkaStreams FSM
[ https://issues.apache.org/jira/browse/KAFKA-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman updated KAFKA-12550:
-------------------------------------------
    Labels: needs-kip  (was: )

> Introduce RESTORING state to the KafkaStreams FSM
> -------------------------------------------------
>
>                 Key: KAFKA-12550
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12550
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Assignee: A. Sophie Blee-Goldman
>            Priority: Blocker
>              Labels: needs-kip
>             Fix For: 3.0.0