[jira] [Commented] (HDFS-9409) DataNode shutdown does not guarantee full shutdown of all threads due to race condition.

2017-02-06 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854634#comment-15854634
 ] 

Chris Nauroth commented on HDFS-9409:
-

Using a hidden configuration flag for this sounds appropriate to me.  I agree 
that there is no need for a strict long wait on all threads in production 
operations if correctness doesn't depend on it.

> DataNode shutdown does not guarantee full shutdown of all threads due to race 
> condition.
> 
>
> Key: HDFS-9409
> URL: https://issues.apache.org/jira/browse/HDFS-9409
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Chris Nauroth
>
> {{DataNode#shutdown}} is documented to return "only after shutdown is 
> complete".  Even after completion of this method, it's possible that threads 
> started by the DataNode are still running.  Race conditions in the shutdown 
> sequence may cause it to skip stopping and joining the {{BPServiceActor}} 
> threads.
> This is likely not a big problem in normal operations, because these are 
> daemon threads that won't block overall process exit.  It is more of a 
> problem for tests, because it makes it impossible to write reliable 
> assertions that these threads exited cleanly.  For large test suites, it can 
> also cause an accumulation of unneeded threads, which might harm test 
> performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9409) DataNode shutdown does not guarantee full shutdown of all threads due to race condition.

2017-01-11 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15819152#comment-15819152
 ] 

Kihwal Lee commented on HDFS-9409:
--

{{BlockScanner}} is the same way. It removes a thread from its tracking data 
structure after a timed join. This is probably less problematic because the 
timeout in each {{join()}} is 5 minutes.  Although this wait may be good for 
unit tests, waiting up to 5 minutes per each volume scanner thread is 
unreasonable when the datanode needs to be restarted quickly.  We actually 
internally reduced it down to 100ms and the rolling upgrades works a lot better.

Perhaps we need a flag in datanode that tells whether it should wait until 
everything is terminated or quickly shutdown without waiting a long time for 
daemon threads to terminate.  The shutdown code of each subsystem would then 
check this flag and behave differently. We would turn it on for unit testing 
and off for production.  How does it sound?

> DataNode shutdown does not guarantee full shutdown of all threads due to race 
> condition.
> 
>
> Key: HDFS-9409
> URL: https://issues.apache.org/jira/browse/HDFS-9409
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Chris Nauroth
>
> {{DataNode#shutdown}} is documented to return "only after shutdown is 
> complete".  Even after completion of this method, it's possible that threads 
> started by the DataNode are still running.  Race conditions in the shutdown 
> sequence may cause it to skip stopping and joining the {{BPServiceActor}} 
> threads.
> This is likely not a big problem in normal operations, because these are 
> daemon threads that won't block overall process exit.  It is more of a 
> problem for tests, because it makes it impossible to write reliable 
> assertions that these threads exited cleanly.  For large test suites, it can 
> also cause an accumulation of unneeded threads, which might harm test 
> performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9409) DataNode shutdown does not guarantee full shutdown of all threads due to race condition.

2015-11-10 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999257#comment-14999257
 ] 

Chris Nauroth commented on HDFS-9409:
-

{{DataNode#shutdown}} calls {{BlockPoolManager#getAllNamenodeThreads}} to get 
every {{BPOfferService}}.  Then, later in {{shutdown}}, these are passed to 
{{BlockPoolManager#shutDownAll}}, which eventually stops and joins each 
{{BPServiceActor}} thread.  There are a few problems:

# {{BlockPoolManager#getAllNamenodeThreads}} returns an unmodifiable wrapper 
over its underlying list, so callers can't mutate the list, but it's still the 
same shared backing list.  Later during shutdown, the {{BPServiceActor}} is 
told that it can exit its main loop.  Part of that is a call on the 
{{BPServiceActor}} thread to {{BlockPoolManager#remove}}.  This effectively 
removes it from the backing list returned by 
{{BlockPoolManager#getAllNamenodeThreads}}, so it will appear to vanish from 
the list before the call to {{BlockPoolManager#shutDownAll}}.
# Even if point 1 is fixed by changing 
{{BlockPoolManager#getAllNamenodeThreads}} to return a deep copy, there is a 
similar problem in that {{BPOfferService#shutdownActor}} will remove the actor 
from its internal tracking list.

Because of these 2 problems, {{DataNode#shutdown}} might no longer have a 
reference to the {{BPServiceActor}} threads when it tries to stop and join on 
them.  Therefore, those threads might still be alive even after completion of 
{{DataNode#shutdown}}.  I noticed this while trying to write a test that 
asserts a particular thread has exited after DataNode shutdown.

> DataNode shutdown does not guarantee full shutdown of all threads due to race 
> condition.
> 
>
> Key: HDFS-9409
> URL: https://issues.apache.org/jira/browse/HDFS-9409
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Chris Nauroth
>
> {{DataNode#shutdown}} is documented to return "only after shutdown is 
> complete".  Even after completion of this method, it's possible that threads 
> started by the DataNode are still running.  Race conditions in the shutdown 
> sequence may cause it to skip stopping and joining the {{BPServiceActor}} 
> threads.
> This is likely not a big problem in normal operations, because these are 
> daemon threads that won't block overall process exit.  It is more of a 
> problem for tests, because it makes it impossible to write reliable 
> assertions that these threads exited cleanly.  For large test suites, it can 
> also cause an accumulation of unneeded threads, which might harm test 
> performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)