[jira] [Commented] (KAFKA-12462) Threads in PENDING_SHUTDOWN entering a rebalance can cause an illegal state exception

2022-03-01 Thread Walker Carlson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499736#comment-17499736
 ] 

Walker Carlson commented on KAFKA-12462:


This also affects 2.6

> Threads in PENDING_SHUTDOWN entering a rebalance can cause an illegal state 
> exception 
> --
>
> Key: KAFKA-12462
> URL: https://issues.apache.org/jira/browse/KAFKA-12462
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.7.0, 2.8.0
>Reporter: Walker Carlson
>Assignee: A. Sophie Blee-Goldman
>Priority: Blocker
>  Labels: streams
> Fix For: 2.8.0, 2.7.1
>
>
> A thread was removed, sending it to the PENDING_SHUTDOWN state, but went 
> through a rebalance before completing the shutdown.
> {code:java}
> // [2021-03-07 04:33:39,385] DEBUG [i-07430efc31ad166b7-StreamThread-6] 
> stream-thread [i-07430efc31ad166b7-StreamThread-6] Ignoring request to 
> transit from PENDING_SHUTDOWN to PARTITIONS_REVOKED: only DEAD state is a 
> valid next state (org.apache.kafka.streams.processor.internals.StreamThread)
> {code}
> Inside StreamsRebalanceListener#onPartitionsRevoked, we have
> {code:java}
> // 
> if (streamThread.setState(State.PARTITIONS_REVOKED) != null && 
> !partitions.isEmpty())
> taskManager.handleRevocation(partitions);
> {code}
> Since PENDING_SHUTDOWN → PARTITIONS_REVOKED is a disallowed transition, we 
> never invoke TaskManager#handleRevocation. Currently handleRevocation is 
> responsible for preparing any active tasks for close, including committing 
> offsets and writing the checkpoint as well as suspending the task. We can’t 
> close the task in handleRevocation since we still support EAGER rebalancing, 
> which invokes handleRevocation at the beginning of a rebalance on all tasks.
> The tasks that are actually revoked will be closed during 
> TaskManager#handleAssignment . The IllegalStateException is specifically 
> because we don’t suspend the task before attempting to close it, and the 
> direct transition from RUNNING → CLOSED is forbidden.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (KAFKA-12462) Threads in PENDING_SHUTDOWN entering a rebalance can cause an illegal state exception

2021-03-12 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300654#comment-17300654
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12462:


The only real downside in 2.7 is that we won't properly clean up the task, ie 
we'll skip committing the offsets and writing the checkpoint. So we'd lose any 
work we did since the last commit – for EOS this would be a perf hit since we'd 
probably need to restore the state stores from scratch after starting back up, 
whereas for ALOS we could get some overcounting. Not the end of the world, but 
worth fixing if we can

> Threads in PENDING_SHUTDOWN entering a rebalance can cause an illegal state 
> exception 
> --
>
> Key: KAFKA-12462
> URL: https://issues.apache.org/jira/browse/KAFKA-12462
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0
>Reporter: Walker Carlson
>Assignee: A. Sophie Blee-Goldman
>Priority: Blocker
>  Labels: streams
> Fix For: 2.8.0, 2.7.1
>
>
> A thread was removed, sending it to the PENDING_SHUTDOWN state, but went 
> through a rebalance before completing the shutdown.
> {code:java}
> // [2021-03-07 04:33:39,385] DEBUG [i-07430efc31ad166b7-StreamThread-6] 
> stream-thread [i-07430efc31ad166b7-StreamThread-6] Ignoring request to 
> transit from PENDING_SHUTDOWN to PARTITIONS_REVOKED: only DEAD state is a 
> valid next state (org.apache.kafka.streams.processor.internals.StreamThread)
> {code}
> Inside StreamsRebalanceListener#onPartitionsRevoked, we have
> {code:java}
> // 
> if (streamThread.setState(State.PARTITIONS_REVOKED) != null && 
> !partitions.isEmpty())
> taskManager.handleRevocation(partitions);
> {code}
> Since PENDING_SHUTDOWN → PARTITIONS_REVOKED is a disallowed transition, we 
> never invoke TaskManager#handleRevocation. Currently handleRevocation is 
> responsible for preparing any active tasks for close, including committing 
> offsets and writing the checkpoint as well as suspending the task. We can’t 
> close the task in handleRevocation since we still support EAGER rebalancing, 
> which invokes handleRevocation at the beginning of a rebalance on all tasks.
> The tasks that are actually revoked will be closed during 
> TaskManager#handleAssignment . The IllegalStateException is specifically 
> because we don’t suspend the task before attempting to close it, and the 
> direct transition from RUNNING → CLOSED is forbidden.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12462) Threads in PENDING_SHUTDOWN entering a rebalance can cause an illegal state exception

2021-03-12 Thread Walker Carlson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300652#comment-17300652
 ] 

Walker Carlson commented on KAFKA-12462:


If we were shutting down the whole client the thread would become dead either 
way. In 2.7 I think the only impact it would have is that the handler would get 
called after the close call when it shouldn’t. But otherwise it might not have 
an effect. I suppose there is no harm to back-porting though.

I defiantly don't think it is worth cutting a new RC for anything that does not 
have removeThread in it

 

> Threads in PENDING_SHUTDOWN entering a rebalance can cause an illegal state 
> exception 
> --
>
> Key: KAFKA-12462
> URL: https://issues.apache.org/jira/browse/KAFKA-12462
> Project: Kafka
>  Issue Type: Bug
>Reporter: Walker Carlson
>Priority: Blocker
> Fix For: 2.8.0
>
>
> A thread was removed, sending it to the PENDING_SHUTDOWN state, but went 
> through a rebalance before completing the shutdown.
> {code:java}
> // [2021-03-07 04:33:39,385] DEBUG [i-07430efc31ad166b7-StreamThread-6] 
> stream-thread [i-07430efc31ad166b7-StreamThread-6] Ignoring request to 
> transit from PENDING_SHUTDOWN to PARTITIONS_REVOKED: only DEAD state is a 
> valid next state (org.apache.kafka.streams.processor.internals.StreamThread)
> {code}
> Inside StreamsRebalanceListener#onPartitionsRevoked, we have
> {code:java}
> // 
> if (streamThread.setState(State.PARTITIONS_REVOKED) != null && 
> !partitions.isEmpty())
> taskManager.handleRevocation(partitions);
> {code}
> Since PENDING_SHUTDOWN → PARTITIONS_REVOKED is a disallowed transition, we 
> never invoke TaskManager#handleRevocation. Currently handleRevocation is 
> responsible for preparing any active tasks for close, including committing 
> offsets and writing the checkpoint as well as suspending the task. We can’t 
> close the task in handleRevocation since we still support EAGER rebalancing, 
> which invokes handleRevocation at the beginning of a rebalance on all tasks.
> The tasks that are actually revoked will be closed during 
> TaskManager#handleAssignment . The IllegalStateException is specifically 
> because we don’t suspend the task before attempting to close it, and the 
> direct transition from RUNNING → CLOSED is forbidden.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12462) Threads in PENDING_SHUTDOWN entering a rebalance can cause an illegal state exception

2021-03-12 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300650#comment-17300650
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12462:


Thanks Walker! This actually seems like a long-lurking bug that was just 
surfaced by the removeStreamThread() feature, not caused by it. Before we could 
remove threads this was only possible when shutting down the client, which we 
don’t test as frequently as we now do removeStreamThread(). It’s also hard to 
notice that a bug has caused thread(s) to die when the threads were supposed to 
shut down anyways. But now we might only be removing one thread, and thanks to 
the new exception handler we’ll shut down the whole application upon hitting 
this so the thread won’t just quietly die.

We should consider backporting the fix to 2.7, even though the bug isn't going 
to be as frequent or as bad in earlier versions. I wouldn't cut a new RC for 
2.6.2 over this, but we might as well backport to get the fix in 2.7.1 whenever 
that comes out

> Threads in PENDING_SHUTDOWN entering a rebalance can cause an illegal state 
> exception 
> --
>
> Key: KAFKA-12462
> URL: https://issues.apache.org/jira/browse/KAFKA-12462
> Project: Kafka
>  Issue Type: Bug
>Reporter: Walker Carlson
>Priority: Blocker
> Fix For: 2.8.0
>
>
> A thread was removed, sending it to the PENDING_SHUTDOWN state, but went 
> through a rebalance before completing the shutdown.
> {code:java}
> // [2021-03-07 04:33:39,385] DEBUG [i-07430efc31ad166b7-StreamThread-6] 
> stream-thread [i-07430efc31ad166b7-StreamThread-6] Ignoring request to 
> transit from PENDING_SHUTDOWN to PARTITIONS_REVOKED: only DEAD state is a 
> valid next state (org.apache.kafka.streams.processor.internals.StreamThread)
> {code}
> Inside StreamsRebalanceListener#onPartitionsRevoked, we have
> {code:java}
> // 
> if (streamThread.setState(State.PARTITIONS_REVOKED) != null && 
> !partitions.isEmpty())
> taskManager.handleRevocation(partitions);
> {code}
> Since PENDING_SHUTDOWN → PARTITIONS_REVOKED is a disallowed transition, we 
> never invoke TaskManager#handleRevocation. Currently handleRevocation is 
> responsible for preparing any active tasks for close, including committing 
> offsets and writing the checkpoint as well as suspending the task. We can’t 
> close the task in handleRevocation since we still support EAGER rebalancing, 
> which invokes handleRevocation at the beginning of a rebalance on all tasks.
> The tasks that are actually revoked will be closed during 
> TaskManager#handleAssignment . The IllegalStateException is specifically 
> because we don’t suspend the task before attempting to close it, and the 
> direct transition from RUNNING → CLOSED is forbidden.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)