[jira] [Updated] (KAFKA-10563) Make sure task directories don't remain locked by dead threads

2022-03-01 Thread Guozhang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang updated KAFKA-10563:
--
Labels: new-streams-runtime-should-fix  (was: )

> Make sure task directories don't remain locked by dead threads
> --
>
>          Key: KAFKA-10563
>          URL: https://issues.apache.org/jira/browse/KAFKA-10563
>      Project: Kafka
>   Issue Type: Bug
>   Components: streams
>     Reporter: A. Sophie Blee-Goldman
>     Priority: Major
>       Labels: new-streams-runtime-should-fix
>
> Most common/expected exceptions within Streams are handled gracefully, and 
> the thread makes sure to clean up all resources, such as task locks, during 
> shutdown. However, in some instances an unexpected exception such as an 
> IllegalStateException can leave some resources orphaned.
> We have seen this happen to task directories after an IllegalStateException 
> is hit during the TaskManager's rebalance handling logic: the thread shuts 
> down, but loses track of some tasks before unlocking them. This blocks any 
> further work on those tasks by any other thread in the same instance.
> Previously we decided that this was "ok" because an IllegalStateException 
> means all bets are off. But with the upcoming work of KIP-663 and KIP-671, 
> users will be able to react smartly to dying threads and replace them with 
> new ones, making it more important than ever to ensure that the application 
> can continue with no lasting repercussions from a thread death. If we allow 
> users to revive or replace a thread that dies due to an IllegalStateException, 
> the new thread should not be blocked from doing any work by the ghost of its 
> predecessor.
> It might be easiest to just add some logic to the cleanup thread to verify 
> all the existing locks against the list of live threads, and remove any 
> zombie locks. But we probably want to do this purging more frequently than 
> the cleanup thread runs (every 10 minutes by default), so maybe we can leverage 
> the work in KIP-671 and have each thread purge any locks still owned by it 
> after the uncaught exception handler runs, but before the thread dies.
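The two purging strategies proposed above can be sketched as follows. This is a minimal, hypothetical illustration and not the Kafka Streams `StateDirectory` API. `TaskLockRegistry`, `purgeZombieLocks`, `releaseAllOwnedBy`, and `releasingHandler` are invented names. Task ids and thread names are plain strings. The standard `java.lang.Thread.UncaughtExceptionHandler` stands in for the KIP-671 handler.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// HYPOTHETICAL stand-in for per-instance task directory locks; not Kafka's StateDirectory.
class TaskLockRegistry {
    // task id -> name of the thread currently holding that task directory's lock
    private final Map<String, String> lockOwners = new ConcurrentHashMap<>();

    // Acquires the task's lock; also succeeds if this thread already holds it.
    public boolean tryLock(String taskId, String threadName) {
        String owner = lockOwners.putIfAbsent(taskId, threadName);
        return owner == null || owner.equals(threadName);
    }

    // Releases the lock only if threadName is its current owner.
    public void unlock(String taskId, String threadName) {
        lockOwners.remove(taskId, threadName);
    }

    // Strategy 1: a periodic sweep (e.g. run by the cleanup thread) that drops
    // every lock whose owner is not in the set of live thread names.
    public int purgeZombieLocks(Set<String> liveThreadNames) {
        int purged = 0;
        for (Map.Entry<String, String> entry : lockOwners.entrySet()) {
            String owner = entry.getValue();
            // remove(key, value) only succeeds if the owner has not changed meanwhile
            if (!liveThreadNames.contains(owner) && lockOwners.remove(entry.getKey(), owner)) {
                purged++;
            }
        }
        return purged;
    }

    // Strategy 2: a dying thread releases every lock it still owns.
    public int releaseAllOwnedBy(String threadName) {
        int released = 0;
        for (Map.Entry<String, String> entry : lockOwners.entrySet()) {
            if (entry.getValue().equals(threadName) && lockOwners.remove(entry.getKey(), threadName)) {
                released++;
            }
        }
        return released;
    }

    // Wraps a user handler so locks are released after it runs,
    // but before the thread terminates.
    public Thread.UncaughtExceptionHandler releasingHandler(Thread.UncaughtExceptionHandler userHandler) {
        return (thread, error) -> {
            try {
                userHandler.uncaughtException(thread, error);
            } finally {
                releaseAllOwnedBy(thread.getName());
            }
        };
    }
}
```

In Java, the uncaught exception handler runs on the dying thread itself after `run()` exits. Releasing locks in the wrapper's `finally` therefore gives the ordering the issue asks for: after the handler, before the thread terminates. The sweep in `purgeZombieLocks` remains a safety net for any thread that dies without reaching its handler.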



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (KAFKA-10563) Make sure task directories don't remain locked by dead threads

2020-10-20 Thread Bill Bejeck (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bejeck updated KAFKA-10563:

Fix Version/s: 2.8.0  (was: 2.7.0)

> Make sure task directories don't remain locked by dead threads
> --
>
>          Key: KAFKA-10563
>          URL: https://issues.apache.org/jira/browse/KAFKA-10563
>      Project: Kafka
>   Issue Type: Bug
>   Components: streams
>     Reporter: A. Sophie Blee-Goldman
>     Priority: Major
>      Fix For: 2.8.0
>





[jira] [Updated] (KAFKA-10563) Make sure task directories don't remain locked by dead threads

2021-03-18 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-10563:
---
Fix Version/s: (was: 2.8.0)

> Make sure task directories don't remain locked by dead threads
> --
>
>          Key: KAFKA-10563
>          URL: https://issues.apache.org/jira/browse/KAFKA-10563
>      Project: Kafka
>   Issue Type: Bug
>   Components: streams
>     Reporter: A. Sophie Blee-Goldman
>     Priority: Major
>


