[ https://issues.apache.org/jira/browse/HADOOP-18217?focusedWorklogId=790089&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-790089 ]
ASF GitHub Bot logged work on HADOOP-18217:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Jul/22 16:31
            Start Date: 12/Jul/22 16:31
    Worklog Time Spent: 10m
      Work Description: hadoop-yetus commented on PR #4255:
URL: https://github.com/apache/hadoop/pull/4255#issuecomment-1181989449

   :confetti_ball: **+1 overall**

   | Vote | Subsystem | Runtime | Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: | reexec | 0m 54s | | Docker mode activated. |
   |||| _ Prechecks _ |
   | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
   | +0 :ok: | codespell | 0m 0s | | codespell was not available. |
   | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
   | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
   | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: | mvninstall | 40m 15s | | trunk passed |
   | +1 :green_heart: | compile | 24m 55s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
   | +1 :green_heart: | compile | 21m 36s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | +1 :green_heart: | checkstyle | 1m 29s | | trunk passed |
   | +1 :green_heart: | mvnsite | 1m 57s | | trunk passed |
   | +1 :green_heart: | javadoc | 1m 32s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
   | +1 :green_heart: | javadoc | 1m 4s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | +1 :green_heart: | spotbugs | 3m 3s | | trunk passed |
   | +1 :green_heart: | shadedclient | 25m 57s | | branch has no errors when building and testing our client artifacts. |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: | mvninstall | 1m 5s | | the patch passed |
   | +1 :green_heart: | compile | 24m 2s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
   | +1 :green_heart: | javac | 24m 2s | | the patch passed |
   | +1 :green_heart: | compile | 21m 37s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | +1 :green_heart: | javac | 21m 37s | | the patch passed |
   | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
   | +1 :green_heart: | checkstyle | 1m 23s | | the patch passed |
   | +1 :green_heart: | mvnsite | 1m 56s | | the patch passed |
   | +1 :green_heart: | javadoc | 1m 23s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
   | +1 :green_heart: | javadoc | 1m 3s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | +1 :green_heart: | spotbugs | 3m 3s | | the patch passed |
   | +1 :green_heart: | shadedclient | 25m 51s | | patch has no errors when building and testing our client artifacts. |
   |||| _ Other Tests _ |
   | +1 :green_heart: | unit | 18m 24s | | hadoop-common in the patch passed. |
   | +1 :green_heart: | asflicense | 1m 17s | | The patch does not generate ASF License warnings. |
   | | | | 224m 32s | | |

   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4255/5/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4255 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux d87a77336653 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 518fefc06263d557206b21afe17ceb5338025371 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4255/5/testReport/ |
   | Max. process+thread count | 3135 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4255/5/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

   This message was automatically generated.

Issue Time Tracking
-------------------
            Worklog Id:     (was: 790089)
            Time Spent: 4h 10m  (was: 4h)

> shutdownhookmanager should not be multithreaded (deadlock possible)
> -------------------------------------------------------------------
>
>                 Key: HADOOP-18217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 2.10.1
>         Environment: linux, windows, any version
>            Reporter: Catherinot Remi
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: wtf.java
>
>          Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> The ShutdownHookManager class uses an executor to run hooks so that it can put a "timeout" around them. It does this with a single-threaded executor. This can lead to a deadlock that leaves a never-shutting-down JVM, with this execution flow:
> * the JVM needs to exit (only daemon threads remain, or someone called System.exit)
> * ShutdownHookManager kicks in
> * the SHMngr executor starts running some hooks
> * the SHMngr executor thread kicks in and, as a side effect, runs code from one of the hooks that calls System.exit (for example as a side effect of an external lib)
> * the executor thread then waits for a lock, because another thread has already entered System.exit and holds its internal lock, so the executor task never returns
> * SHMngr never returns
> * the 1st call to System.exit never returns
> * the JVM is stuck
>
> Using an executor with a single thread also gives only "fake" timeouts: the task keeps running, and you can interrupt it, but until it stumbles on some piece of code that is interruptible (like an IO call) it will keep running, especially since the executor is single-threaded.
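To make the deadlock flow described above concrete, here is a minimal, self-contained sketch (hypothetical class and names, not Hadoop's actual ShutdownHookManager code): a JVM shutdown hook delegates its work to a single-threaded executor, and the submitted task itself calls System.exit. Per the Runtime.exit javadoc, an exit invoked while shutdown hooks are already running blocks indefinitely, so the future below never completes and the JVM hangs.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShutdownDeadlockDemo {
  public static void main(String[] args) {
    ExecutorService executor = Executors.newSingleThreadExecutor();

    // Simplified stand-in for ShutdownHookManager: a JVM shutdown hook that
    // delegates the real work to a single-threaded executor.
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      Future<?> hook = executor.submit(() -> {
        // A hook whose code (e.g. via an external library) calls System.exit.
        // Shutdown is already in progress, so this call stalls forever.
        System.exit(1);
      });
      try {
        hook.get(); // blocks forever: the submitted task never finishes
      } catch (Exception e) {
        e.printStackTrace();
      }
    }));

    System.exit(0); // first exit: starts the shutdown hooks, which hang above
  }
}
```

The real ShutdownHookManager waits on the future with a timeout rather than indefinitely, but as the description notes, a thread stuck inside System.exit cannot be interrupted, so the single worker thread stays blocked for good.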
> So, with a single worker thread, it also has this bug, for example:
> * the caller submits the 1st hook (a bad one that needs 1 hour of runtime and cannot be interrupted)
> * the executor starts the 1st hook
> * the caller's wait on the 1st hook's future times out
> * the caller submits the 2nd hook
> * bug: the 1st hook is still running, so the 2nd hook hits a timeout without ever getting a chance to run. The faulty 1st hook makes it impossible for any other hook to run, so running hooks on a single separate thread does not allow other hooks to run in parallel with long ones.
>
> If we really, really want to time out the JVM shutdown, even accepting a possibly dirty shutdown, it should instead run the hooks inside the initial thread (not spawning new ones, and therefore not triggering the deadlock described in the first place) and, if a timeout was configured, only spawn a single parallel daemon thread that sleeps for the timeout delay and then calls Runtime.halt (which bypasses the hook system, so it should not trigger the deadlock). If the normal System.exit finishes before the timeout delay, everything is fine. If System.exit takes too long, the JVM is killed, which still satisfies the reason this multithreaded shutdown hook implementation was created (avoiding hanging JVMs). A sketch of this approach appears after this message.
>
> I hit the bug with both Oracle and OpenJDK builds, all on the 1.8 major version. Hadoop 2.6 and 2.7 did not have the issue because they do not run hooks in another thread.
>
> Another solution is of course to configure the timeout AND to have as many threads as needed to run the hooks, so there is at least some gain to offset the pain of the deadlock scenario.
>
> EDIT: I added some logs and reproduced the problem. It is in fact located after triggering all the hook entries and before shutting down the executor. The current code, after running the hooks, creates a new Configuration object, reads the configured timeout from it, and applies this timeout when shutting down the executor. I sometimes run with a classloader doing remote classloading; Configuration loads its content through this classloader, so when the JVM is shutting down and a network error occurs, the classloader fails to load the resources needed by Configuration. The code then crashes before shutting down the executor and ends up in the thread's default uncaught-throwable handler, which was calling System.exit, so it got stuck, so shutting down the executor never returned, and neither did the JVM.
> So, forget about the halt stuff (even if it is a last-resort, very robust safety net). Still, I'll make a small adjustment to the final executor shutdown code so it is slightly more robust to even the strangest exceptions/errors it encounters.
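For reference, a rough sketch of the alternative proposed above: run the hooks in the thread that is already shutting down, and use one daemon watchdog thread that calls Runtime.halt if the configured timeout is exceeded. This is hypothetical illustration code, not the Hadoop patch (class and method names are invented), and the reporter's EDIT notes their actual hang turned out to be elsewhere; the sketch only shows the idea, relying on the fact that Runtime.halt does not run shutdown hooks at all.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

public final class HaltWatchdogSketch {

  /**
   * Runs the given hooks in the calling thread. If timeoutMillis > 0, a single
   * daemon watchdog thread is started; if the hooks are still running when the
   * timeout expires, the JVM is terminated with Runtime.halt, which skips the
   * shutdown-hook machinery entirely.
   */
  public static void runHooks(List<Runnable> hooks, long timeoutMillis) {
    Thread watchdog = null;
    if (timeoutMillis > 0) {
      watchdog = new Thread(() -> {
        try {
          TimeUnit.MILLISECONDS.sleep(timeoutMillis);
        } catch (InterruptedException ie) {
          return; // hooks finished in time; stand down
        }
        Runtime.getRuntime().halt(1); // hard stop, no further hooks run
      }, "shutdown-watchdog");
      watchdog.setDaemon(true);
      watchdog.start();
    }

    for (Runnable hook : hooks) {
      try {
        hook.run(); // runs in the calling thread, as proposed: no extra executor
      } catch (Throwable t) {
        // keep going: one failing hook must not block the remaining ones
      }
    }

    if (watchdog != null) {
      watchdog.interrupt(); // everything finished before the deadline
    }
  }
}
```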