[ 
https://issues.apache.org/jira/browse/HADOOP-18217?focusedWorklogId=789513&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-789513
 ]

ASF GitHub Bot logged work on HADOOP-18217:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 11/Jul/22 12:54
            Start Date: 11/Jul/22 12:54
    Worklog Time Spent: 10m 
      Work Description: HerCath commented on code in PR #4255:
URL: https://github.com/apache/hadoop/pull/4255#discussion_r917895399


##########
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ExitUtil.java:
##########
@@ -159,138 +163,155 @@ public static void disableSystemHalt() {
    */
   public static boolean terminateCalled() {
     // Either we set this member or we actually called System#exit
-    return firstExitException != null;
+    return FIRST_EXIT_EXCEPTION.get()!=null;
   }
 
   /**
    * @return true if halt has been called.
    */
   public static boolean haltCalled() {
-    return firstHaltException != null;
+    // Either we set this member or we actually called Runtime#halt
+    return FIRST_HALT_EXCEPTION.get()!=null;
   }
 
   /**
-   * @return the first ExitException thrown, null if none thrown yet.
+   * @return the first {@code ExitException} thrown, null if none thrown yet.
    */
   public static ExitException getFirstExitException() {
-    return firstExitException;
+    return FIRST_EXIT_EXCEPTION.get();
   }
 
   /**
    * @return the first {@code HaltException} thrown, null if none thrown yet.
    */
   public static HaltException getFirstHaltException() {
-    return firstHaltException;
+    return FIRST_HALT_EXCEPTION.get();
   }
 
   /**
    * Reset the tracking of process termination. This is for use in unit tests
    * where one test in the suite expects an exit but others do not.
    */
   public static void resetFirstExitException() {
-    firstExitException = null;
+    FIRST_EXIT_EXCEPTION.set(null);
   }
 
+  /**
+   * Reset the tracking of process termination. This is for use in unit tests
+   * where one test in the suite expects a halt but others do not.
+   */
   public static void resetFirstHaltException() {
-    firstHaltException = null;
+    FIRST_HALT_EXCEPTION.set(null);
   }
 
   /**
+   * Exits the JVM if exit is enabled, rethrows the provided exception or any raised error otherwise.
    * Inner termination: either exit with the exception's exit code,
    * or, if system exits are disabled, rethrow the exception.
    * @param ee exit exception
+   * @throws ExitException if {@link System#exit(int)} is disabled and not suppressed by an Error
+   * @throws Error if {@link System#exit(int)} is disabled and an Error arises, suppressing
+   * anything else, even <code>ee</code>
    */
   public static void terminate(ExitException ee)
       throws ExitException {
     final int status = ee.getExitCode();
-    Error catched = null;
+    Error caught = null;
     if (status != 0) {
       try {
-        //exit indicates a problem, log it
+        // exit indicates a problem, log it
         String msg = ee.getMessage();
         LOG.debug("Exiting with status {}: {}",  status, msg, ee);
         LOG.info("Exiting with status {}: {}", status, msg);
       } catch (Error e) {
-        catched = e; // errors have higher priority than HaltException, it may be re-thrown. OOM and ThreadDeath are 2 examples of Errors to re-throw
+        // errors have higher priority than HaltException, it may be re-thrown.
+        // OOM and ThreadDeath are 2 examples of Errors to re-throw
+        caught = e;
       } catch (Throwable t) {
-        // all other kind of throwables are supressed
+        // all other kind of throwables are suppressed
         ee.addSuppressed(t);
       }
     }
     if (systemExitDisabled) {
       try {
         LOG.error("Terminate called", ee);
       } catch (Error e) {
-        if (catched == null) {
-          catched = e; // errors will be re-thrown
+        // errors have higher priority again, if it's a 2nd error, the 1st one suppresses it
+        if (caught == null) {
+          caught = e;
         } else {
-          catched.addSuppressed(e); // 1st raised error has priority and will be re-thrown, so the 1st error supresses the 2nd
+          caught.addSuppressed(e);
         }
       } catch (Throwable t) {
-        ee.addSuppressed(t); // all other kind of throwables are supressed
-      }
-      synchronized (ExitUtil.class) {
-        if (!terminateCalled()) {
-          firstExitException = ee;
-        }
+        // all other kind of throwables are suppressed
+        ee.addSuppressed(t);
       }
-      if (catched != null) {
-        catched.addSuppressed(ee);
-        throw catched;
+      FIRST_EXIT_EXCEPTION.compareAndSet(null, ee);
+      if (caught != null) {

Review Comment:
   Hi, sorry to have taken so long before an update.
   
   I don't think we need such a check, because the only things able to throw 
something here are either the JVM (which never reuses already-thrown 
instances) or the LOG object (which should never do that either). My 
understanding of exceptions, and of throwables in general, is that the same 
instance should never be thrown twice. If that rule is respected, a 
suppressed throwable can never become a suppressor of itself through any 
suppressor/suppressed/rootCause chain. If addSuppressed ever fails because of 
this, it rather means something broken is implemented elsewhere, and that 
"elsewhere" should be fixed. And if some piece of code reuses already-thrown 
throwables, even without failing, the suppression logic could leak memory 
(adding new throwables again and again to a singleton one kept and reused 
somewhere).
   
   In fact, if we really want to be safe, we can drop the suppression logic 
entirely: that keeps the code simple (no broken-suppression error, no leak) 
at the cost of a slight loss of debug info when dumping the throwable.
   
   Maybe we should go for this safe solution instead.
   
   What do you think?
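   A small standalone sketch of the JDK behaviour this reasoning relies on 
(plain java.lang.Throwable, nothing from the PR itself): addSuppressed only 
rejects direct self-suppression, longer cycles are not detected, so the 
guarantee really does rest on no instance ever being thrown or reused twice.
   
       public class SuppressionSketch {
         public static void main(String[] args) {
           Exception a = new Exception("a");
           Exception b = new Exception("b");
           try {
             a.addSuppressed(a);   // direct self-suppression is rejected by the JDK
           } catch (IllegalArgumentException expected) {
             System.out.println("self-suppression rejected: " + expected);
           }
           // a two-element cycle is NOT rejected; it can only arise if the same
           // instances are reused as both suppressor and suppressed somewhere
           a.addSuppressed(b);
           b.addSuppressed(a);
           a.printStackTrace();    // OpenJDK guards the dump against circular references
         }
       }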





Issue Time Tracking
-------------------

    Worklog Id:     (was: 789513)
    Time Spent: 2h  (was: 1h 50m)

> shutdownhookmanager should not be multithreaded (deadlock possible)
> -------------------------------------------------------------------
>
>                 Key: HADOOP-18217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 2.10.1
>         Environment: linux, windows, any version
>            Reporter: Catherinot Remi
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: wtf.java
>
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> the ShutdownHookManager class uses an executor to run hooks in order to have 
> a "timeout" notion around them. It does this using a single-threaded 
> executor. This can lead to a deadlock, leaving a never-shutting-down JVM, 
> with this execution flow (a rough sketch follows the list):
>  * the JVM needs to exit (only daemon threads remaining, or someone called 
> System.exit)
>  * ShutdownHookManager kicks in
>  * the SHMngr executor starts running some hooks
>  * the SHMngr executor thread kicks in and, as a side effect, runs some code 
> from one of the hooks that calls System.exit (for example as a side effect 
> of an external lib)
>  * the executor thread waits for a lock because another thread already 
> entered System.exit and holds its internal lock, so the executor never 
> returns
>  * SHMngr never returns
>  * the 1st call to System.exit never returns
>  * the JVM is stuck
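> A rough, self-contained sketch of that flow (made-up class name, not the 
> attached wtf.java): a shutdown hook hands work to a worker thread and waits 
> for it, while the worker calls System.exit, which blocks because a shutdown 
> is already in progress.
>
>     import java.util.concurrent.ExecutorService;
>     import java.util.concurrent.Executors;
>     import java.util.concurrent.Future;
>
>     public class ShutdownDeadlockSketch {
>       public static void main(String[] args) {
>         ExecutorService executor = Executors.newSingleThreadExecutor();
>         Runtime.getRuntime().addShutdownHook(new Thread(() -> {
>           Future<?> hook = executor.submit(() -> {
>             // e.g. some library code deep inside the hook decides to exit
>             System.exit(1);   // blocks: a shutdown is already in progress
>           });
>           try {
>             hook.get();       // never returns, so the shutdown hook never returns
>           } catch (Exception e) {
>             e.printStackTrace();
>           }
>         }));
>         System.exit(0);       // never returns either: JVM stuck
>       }
>     }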
>  
> using an executor with a single thread only does "fake" timeouts: the task 
> keeps running, and although you can interrupt it, it keeps going until it 
> stumbles upon some piece of code that is interruptible (like an I/O call). 
> Being single threaded, the executor also has this bug, for example (sketched 
> after the list):
>  * the caller submits the 1st hook (a bad one that needs 1 hour of runtime 
> and cannot be interrupted)
>  * the executor starts the 1st hook
>  * the caller waiting on the 1st hook's future times out
>  * the caller submits the 2nd hook
>  * bug: the 1st hook is still running, and the 2nd hook triggers a timeout 
> without ever getting a chance to run; the faulty 1st hook makes it 
> impossible for any other hook to run, so running hooks in a single separate 
> thread does not allow other hooks to run in parallel with long ones
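> A sketch of that second problem (made-up class name; daemon worker thread 
> only so the demo itself can exit):
>
>     import java.util.concurrent.ExecutorService;
>     import java.util.concurrent.Executors;
>     import java.util.concurrent.Future;
>     import java.util.concurrent.TimeUnit;
>     import java.util.concurrent.TimeoutException;
>
>     public class FakeTimeoutSketch {
>       public static void main(String[] args) throws Exception {
>         ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
>           Thread t = new Thread(r);
>           t.setDaemon(true);
>           return t;
>         });
>         Runnable badHook = () -> { while (true) { } };   // uninterruptible 1st hook
>         Future<?> first = executor.submit(badHook);
>         try {
>           first.get(1, TimeUnit.SECONDS);
>         } catch (TimeoutException e) {
>           System.out.println("1st hook timed out but keeps running");
>         }
>         Future<?> second = executor.submit(() -> System.out.println("2nd hook ran"));
>         try {
>           second.get(1, TimeUnit.SECONDS);
>         } catch (TimeoutException e) {
>           System.out.println("2nd hook timed out without ever starting");
>         }
>       }
>     }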
>  
> If we really want to time out the JVM shutdown, even accepting a possibly 
> dirty shutdown, it should rather run the hooks inside the initial thread 
> (not spawning new ones, so not triggering the deadlock described in the 1st 
> place) and, if a timeout was configured, only spawn a single parallel daemon 
> thread that sleeps for the timeout delay and then calls Runtime.halt (which 
> bypasses the hook system, so it should not trigger the deadlock). If the 
> normal System.exit ends before the timeout delay, everything is fine. If 
> System.exit takes too much time, the JVM is killed, and the reason why this 
> multithreaded shutdown hook implementation was created is still satisfied 
> (avoiding hanging JVMs).
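> Roughly, that idea (hypothetical names, just to illustrate, not an actual 
> patch):
>
>     import java.util.concurrent.TimeUnit;
>
>     public class WatchdogShutdownSketch {
>       static void runHooksWithWatchdog(Runnable hooks, long timeoutMs) {
>         if (timeoutMs > 0) {
>           Thread watchdog = new Thread(() -> {
>             try {
>               TimeUnit.MILLISECONDS.sleep(timeoutMs);
>               Runtime.getRuntime().halt(1);   // dirty but guaranteed, bypasses the hooks
>             } catch (InterruptedException ignored) {
>               // orderly shutdown finished in time
>             }
>           }, "shutdown-watchdog");
>           watchdog.setDaemon(true);
>           watchdog.start();
>         }
>         hooks.run();   // hooks run in the initial thread, no extra thread to deadlock
>       }
>     }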
>  
> Had the bug with both Oracle and OpenJDK builds, all on the 1.8 major 
> version. Hadoop 2.6 and 2.7 did not have the issue because they do not run 
> hooks in another thread.
>  
> Another solution is of course to configure the timeout AND to have as many 
> threads as needed to run the hooks, so as to have at least some gain to 
> offset the pain of the deadlock scenario.
>  
> EDIT: added some logs and reproduced the problem. In fact it is located 
> after triggering all the hook entries and before shutting down the executor. 
> The current code, after running the hooks, creates a new Configuration 
> object, reads the configured timeout from it, and applies this timeout to 
> shut down the executor. I sometimes run with a classloader doing remote 
> classloading; Configuration loads its content using this classloader, so 
> when the JVM is shutting down and some network error occurs, the classloader 
> fails to load the resources needed by Configuration. So the code crashes 
> before shutting down the executor and ends up inside the thread's default 
> uncaught throwable handler, which was calling System.exit, which got stuck, 
> so shutting down the executor never returned, and neither did the JVM.
> So, forget about the halt stuff (even if it is a last-resort, very robust 
> safety net). Still, I'll do a small adjustment to the final executor 
> shutdown code to make it slightly more robust to even the strangest 
> exceptions/errors it encounters.
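> Something along these lines (hypothetical helper, only to illustrate the 
> intent of that adjustment):
>
>     import java.util.concurrent.ExecutorService;
>     import java.util.concurrent.TimeUnit;
>
>     public class RobustExecutorShutdownSketch {
>       static void shutdownQuietly(ExecutorService executor, long timeoutMs) {
>         try {
>           executor.shutdown();
>           if (!executor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
>             executor.shutdownNow();
>           }
>         } catch (InterruptedException ie) {
>           executor.shutdownNow();
>           Thread.currentThread().interrupt();
>         } catch (Throwable t) {
>           // nothing thrown here may escape: the exit path must not get stuck
>           executor.shutdownNow();
>         }
>       }
>     }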



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
