[jira] [Created] (HADOOP-18217) shutdownhookmanager should not be multithreaded (deadlock possible)

Catherinot Remi (Jira) Fri, 22 Apr 2022 08:35:05 -0700

Catherinot Remi created HADOOP-18217:
----------------------------------------


             Summary: shutdownhookmanager should not be multithreaded (deadlock 
possible)
                 Key: HADOOP-18217
                 URL: https://issues.apache.org/jira/browse/HADOOP-18217
             Project: Hadoop Common
          Issue Type: Bug
          Components: util
    Affects Versions: 2.10.1
         Environment: linux, windows, any version
            Reporter: Catherinot Remi


The ShutdownHookManager class uses an executor to run hooks so that it can apply
a "timeout" to them. It does this with a single-threaded executor. This can lead
to a deadlock that leaves the JVM never shutting down, with this execution flow:
 * the JVM needs to exit (only daemon threads remaining, or someone called 
System.exit)
 * ShutdownHookManager kicks in
 * the SHMngr executor starts running some hooks
 * the SHMngr executor thread runs code from one of the hooks that, as a side 
effect, calls System.exit (from an external lib, for example)
 * the executor thread waits for a lock because another thread has already 
entered System.exit and holds its internal lock, so the executor never returns
 * SHMngr never returns
 * the 1st call to System.exit never returns
 * the JVM is stuck
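
The flow above can be imitated without touching the real shutdown machinery. 
The sketch below is only an illustration under stated assumptions: EXIT_LOCK is 
a hypothetical stand-in for whatever lock the JVM holds while System.exit runs, 
the class and method names are invented, and the wait is bounded only so that 
the demo terminates.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

class ExitLockDeadlockDemo {

    // Hypothetical stand-in for the lock held by the thread inside System.exit.
    static final ReentrantLock EXIT_LOCK = new ReentrantLock();

    /** Returns true if the hook finished before waitMillis elapsed. */
    static boolean hookCompleted(long waitMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hook-runner");
            t.setDaemon(true);
            return t;
        });
        EXIT_LOCK.lock(); // the "first System.exit caller" owns the lock
        try {
            Future<?> hook = executor.submit(() -> {
                // Simulates a hook that calls System.exit again: it needs the
                // lock that the submitting thread is holding while it waits.
                EXIT_LOCK.lock();
                EXIT_LOCK.unlock();
            });
            try {
                hook.get(waitMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // the hook was still blocked on the lock
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        } finally {
            EXIT_LOCK.unlock();
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("hook completed: " + hookCompleted(200));
    }
}
```

In this simulation the hook can only make progress once the waiting thread 
releases the lock, which is the circular wait described in the list.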

 

Using an executor with a single thread only gives "fake" timeouts: the task 
keeps running; you can interrupt it, but until it reaches some piece of code 
that is interruptible (like an IO) it will keep running. This is especially 
harmful because the executor is single-threaded, so it has this bug, for example:
 * the caller submits the 1st hook (a bad one that needs 1 hour of runtime and 
cannot be interrupted)
 * the executor starts the 1st hook
 * the caller waiting on the 1st hook's future times out
 * the caller submits the 2nd hook
 * bug: the 1st hook is still running, so the 2nd hook hits its timeout without 
ever getting the chance to run. One faulty hook makes it impossible for any 
other hook to run, so running hooks in a single separate thread does not allow 
other hooks to run in parallel with long ones.
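
The starvation above can be reproduced with a plain single-thread executor. 
This is a minimal sketch, not Hadoop's actual code: the class name, the 2 second 
busy loop standing in for the "1 hour" hook, and the 100 ms timeouts are all 
assumptions chosen to keep the demo short.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

class SingleThreadTimeoutDemo {

    /** Returns true if the second hook got to run at all. */
    static boolean secondHookStarted() {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hook-runner");
            t.setDaemon(true);
            return t;
        });
        AtomicBoolean secondStarted = new AtomicBoolean(false);
        try {
            // 1st hook: busy-waits and never checks the interrupt flag.
            Future<?> first = executor.submit(() -> {
                long end = System.nanoTime() + TimeUnit.SECONDS.toNanos(2);
                while (System.nanoTime() < end) {
                    // deliberately ignores interruption
                }
            });
            awaitOrCancel(first, 100);  // times out; cancel(true) has no effect

            // 2nd hook: trivial, but queued behind the 1st one.
            Future<?> second = executor.submit(() -> secondStarted.set(true));
            awaitOrCancel(second, 100); // times out while still in the queue

            return secondStarted.get();
        } finally {
            executor.shutdownNow();
        }
    }

    /** Waits up to millis for the future, then cancels it on timeout. */
    static void awaitOrCancel(Future<?> future, long millis) {
        try {
            future.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second hook started: " + secondHookStarted());
    }
}
```

Here the second hook is reported as timed out even though it never left the 
executor's queue.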

 

If we really, really want to time out the JVM shutdown, even if that means 
accepting a possibly dirty shutdown, the manager should rather run the hooks 
inside the initial thread (not spawning new ones, so not triggering the deadlock 
described in the first place) and, if a timeout was configured, spawn only a 
single parallel daemon thread that sleeps for the timeout delay and then calls 
Runtime.halt (which bypasses the hook system, so it should not trigger the 
deadlock). If the normal System.exit ends before the timeout delay, everything 
is fine. If System.exit takes too much time, the JVM is killed, and so the 
reason this multithreaded shutdown hook implementation was created is satisfied 
(avoiding hanging JVMs).
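
A rough sketch of that watchdog idea follows. It is not a patch for 
ShutdownHookManager: the class name, the install method and the exit status 1 
are hypothetical, and the timeout action is passed in as a Runnable so that the 
sketch can be exercised without actually halting the JVM.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class ShutdownWatchdog {

    /** Starts a daemon thread that runs onTimeout after timeoutMillis. */
    static Thread start(long timeoutMillis, Runnable onTimeout) {
        Thread watchdog = new Thread(() -> {
            try {
                Thread.sleep(timeoutMillis);
            } catch (InterruptedException e) {
                return; // watchdog cancelled: do nothing
            }
            onTimeout.run();
        }, "shutdown-watchdog");
        watchdog.setDaemon(true);
        watchdog.start();
        return watchdog;
    }

    /**
     * Hypothetical wiring: run all hooks in the JVM shutdown thread itself and
     * let the watchdog call Runtime.halt if they exceed the timeout.
     */
    static void install(long timeoutMillis, Runnable hooks) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            start(timeoutMillis, () -> Runtime.getRuntime().halt(1));
            hooks.run(); // no executor, no extra hook thread
        }, "hooks-in-shutdown-thread"));
    }

    public static void main(String[] args) throws InterruptedException {
        // Demo with a harmless action instead of Runtime.halt.
        AtomicBoolean fired = new AtomicBoolean(false);
        Thread watchdog = start(50, () -> fired.set(true));
        watchdog.join();
        System.out.println("watchdog fired: " + fired.get());
    }
}
```

If the hooks finish before the delay, the shutdown thread ends, the JVM exits 
normally, and the daemon watchdog goes away with it.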

 

I hit the bug with both Oracle and OpenJDK builds, all on major version 1.8. 
Hadoop 2.6 and 2.7 did not have the issue because they do not run hooks in 
another thread.

 

Another solution is of course to configure the timeout AND to have as many 
threads as needed to run the hooks, so as to get at least some gain to offset 
the pain of the deadlock scenario.
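
As a sketch of this alternative, the hooks could be handed to a pool that 
creates one thread per hook, with one overall timeout. Everything here is an 
assumption for illustration (class name, cached thread pool, 200 ms timeout, 
the 2 second busy loop as the bad hook), and it does not remove the deadlock 
described earlier; it only stops one slow hook from starving the others.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class ParallelHooksDemo {

    /** Returns true if the fast hook ran despite the slow one. */
    static boolean fastHookRan() {
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "hook-runner");
            t.setDaemon(true);
            return t;
        });
        AtomicBoolean fastRan = new AtomicBoolean(false);

        List<Callable<Void>> hooks = new ArrayList<>();
        hooks.add(() -> {
            // Slow hook: busy-waits and ignores interruption.
            long end = System.nanoTime() + TimeUnit.SECONDS.toNanos(2);
            while (System.nanoTime() < end) {
                // deliberately ignores interruption
            }
            return null;
        });
        hooks.add(() -> {
            fastRan.set(true); // fast hook
            return null;
        });

        try {
            // Runs the hooks on separate threads; unfinished ones are
            // cancelled when the overall timeout expires.
            pool.invokeAll(hooks, 200, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return fastRan.get();
    }

    public static void main(String[] args) {
        System.out.println("fast hook ran: " + fastHookRan());
    }
}
```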



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
