Sergey Shelukhin created SLIDER-1052: ----------------------------------------
Summary: Deadlock in slider AM
Key: SLIDER-1052
URL: https://issues.apache.org/jira/browse/SLIDER-1052
Project: Slider
Issue Type: Bug
Reporter: Sergey Shelukhin
Priority: Critical

I have a hung slider AM in the following state.
It is a weird situation: the first app attempt failed to start, so this is the 2nd one. However, the 1st app attempt process is still running on the same machine, and it is in a state where I cannot jstack it even with -F. I will kill it shortly and see what happens.
The 2nd attempt received the container death notification for the first one:
{noformat}
2016-01-07 03:59:41,828 [AMRM Callback Handler Thread] INFO appmaster.SliderAppMaster - Container Completion for containerID=container_e02_1450721565699_0007_01_000001, state=COMPLETE, exitStatus=-105, diagnostics=Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
{noformat}
Note that this is from the logs of the 2nd container (container_e02_1450721565699_0007_02_000001).

{noformat}
Found one Java-level deadlock:
=============================

"AMRM Callback Handler Thread":
  waiting to lock Monitor@0x00007f1b953b18b8 (Object@0x00000000c022c6f0, a org/apache/slider/server/appmaster/state/AppState),
  which is held by "main"
"main":
  waiting to lock Monitor@0x00007f1b953b1128 (Object@0x00000000c00db378, a org/apache/slider/server/appmaster/SliderAppMaster),
  which is held by "AMRM Callback Handler Thread"
{noformat}

The jstack is with -F, so I cannot actually see thread names in the dump, but these look like it (not sure about the first one):
{noformat}
Thread 11054: (state = BLOCKED)
 - org.apache.slider.server.appmaster.state.AppState.onCompletedNode(org.apache.hadoop.yarn.api.records.ContainerStatus) @bci=0, line=1534 (Interpreted frame)
 - org.apache.slider.server.appmaster.SliderAppMaster.onContainersCompleted(java.util.List) @bci=119, line=1606 (Interpreted frame)
 - org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run() @bci=141, line=300 (Interpreted frame)
...

Thread 10254: (state = BLOCKED)
 - org.apache.hadoop.service.AbstractService.getConfig() @bci=0, line=403 (Interpreted frame)
 - org.apache.slider.server.appmaster.SliderAppMaster.getClusterFS() @bci=5, line=1369 (Interpreted frame)
 - org.apache.slider.server.appmaster.SliderAppMaster.createAndRunCluster(java.lang.String) @bci=1291, line=822 (Interpreted frame)
 - org.apache.slider.server.appmaster.SliderAppMaster.runService() @bci=162, line=576 (Interpreted frame)
 - org.apache.slider.core.main.ServiceLauncher.launchService(org.apache.hadoop.conf.Configuration, java.lang.String[], boolean) @bci=128, line=188 (Interpreted frame)
 - org.apache.slider.core.main.ServiceLauncher.launchServiceRobustly(org.apache.hadoop.conf.Configuration, java.lang.String[]) @bci=4, line=475 (Interpreted frame)
 - org.apache.slider.core.main.ServiceLauncher.launchServiceAndExit(java.util.List) @bci=21, line=403 (Interpreted frame)
 - org.apache.slider.core.main.ServiceLauncher.serviceMain(java.util.List) @bci=143, line=630 (Interpreted frame)
 - org.apache.slider.server.appmaster.SliderAppMaster.main(java.lang.String[]) @bci=24, line=2327 (Interpreted frame)
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
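
For illustration only, here is a minimal Java sketch of the kind of lock-order inversion the deadlock report above describes: one thread holds the first monitor and waits for the second, while the other thread holds the second and waits for the first. This is not Slider code. The class name, method names, thread names and the two lock objects are hypothetical stand-ins for the SliderAppMaster and AppState monitors, and the sketch does not claim to reproduce the actual code paths at the line numbers in the dump. The only JDK facility it relies on is ThreadMXBean.findMonitorDeadlockedThreads(), which reports threads stuck in a monitor cycle.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

final class LockInversionDemo {

    /**
     * Starts two daemon threads that take two monitors in opposite order,
     * then polls the JVM until it reports both threads as deadlocked.
     * Returns true if the deadlock was detected within about five seconds.
     */
    static boolean provokeAndDetectDeadlock() {
        // Hypothetical stand-ins for the SliderAppMaster and AppState monitors.
        final Object appMasterMonitor = new Object();
        final Object appStateMonitor = new Object();
        // Makes sure each thread holds its first monitor before asking for the second.
        final CountDownLatch bothHoldFirst = new CountDownLatch(2);

        // Simulated callback thread: holds the app master monitor, wants the app state monitor.
        Thread callback = lockInOrder("callback-sim", appMasterMonitor, appStateMonitor, bothHoldFirst);
        // Simulated main thread: holds the app state monitor, wants the app master monitor.
        Thread launcher = lockInOrder("main-sim", appStateMonitor, appMasterMonitor, bothHoldFirst);
        callback.start();
        launcher.start();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        try {
            for (int attempt = 0; attempt < 100; attempt++) {
                long[] deadlocked = threads.findMonitorDeadlockedThreads();
                if (contains(deadlocked, callback.getId()) && contains(deadlocked, launcher.getId())) {
                    return true;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    private static Thread lockInOrder(String name, Object first, Object second, CountDownLatch bothHoldFirst) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await();
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // Never reached: the other thread already holds this monitor.
                }
            }
        }, name);
        t.setDaemon(true); // lets the JVM exit even though the two threads stay stuck
        return t;
    }

    private static boolean contains(long[] ids, long id) {
        if (ids == null) {
            return false; // findMonitorDeadlockedThreads() returns null when nothing is deadlocked
        }
        for (long candidate : ids) {
            if (candidate == id) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + provokeAndDetectDeadlock());
    }
}
```

In a sketch like this, the usual remedy is to have every thread acquire the two monitors in the same order, or to avoid calling into one synchronized object while holding the other's monitor. Whether either applies to the real AM code is not something the dump alone establishes.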