Sergey Shelukhin created SLIDER-1052:
----------------------------------------

             Summary: Deadlock in slider AM
                 Key: SLIDER-1052
                 URL: https://issues.apache.org/jira/browse/SLIDER-1052
             Project: Slider
          Issue Type: Bug
            Reporter: Sergey Shelukhin
            Priority: Critical


I have a hung Slider AM in the following state.
The situation is unusual: the first app attempt failed to start, so this is the 
2nd one. However, the process of the 1st app attempt is still running on the 
same machine, in a state where I cannot jstack it even with -F. I will 
kill it shortly and see what happens.
The 2nd attempt received the container death notification for the first one:
{noformat}
2016-01-07 03:59:41,828 [AMRM Callback Handler Thread] INFO  appmaster.SliderAppMaster - Container Completion for containerID=container_e02_1450721565699_0007_01_000001, state=COMPLETE, exitStatus=-105, diagnostics=Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
{noformat}
Note that this is from the logs of the 2nd container 
(container_e02_1450721565699_0007_02_000001).

{noformat}
Found one Java-level deadlock:
=============================

"AMRM Callback Handler Thread":
  waiting to lock Monitor@0x00007f1b953b18b8 (Object@0x00000000c022c6f0, a org/apache/slider/server/appmaster/state/AppState),
  which is held by "main"
"main":
  waiting to lock Monitor@0x00007f1b953b1128 (Object@0x00000000c00db378, a org/apache/slider/server/appmaster/SliderAppMaster),
  which is held by "AMRM Callback Handler Thread"

{noformat}

The jstack was taken with -F, so I cannot see thread names in the dump, but 
these look like the two threads involved (I am not sure about the first one):
{noformat}
Thread 11054: (state = BLOCKED)
 - org.apache.slider.server.appmaster.state.AppState.onCompletedNode(org.apache.hadoop.yarn.api.records.ContainerStatus) @bci=0, line=1534 (Interpreted frame)
 - org.apache.slider.server.appmaster.SliderAppMaster.onContainersCompleted(java.util.List) @bci=119, line=1606 (Interpreted frame)
 - org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run() @bci=141, line=300 (Interpreted frame)
...
Thread 10254: (state = BLOCKED)
 - org.apache.hadoop.service.AbstractService.getConfig() @bci=0, line=403 (Interpreted frame)
 - org.apache.slider.server.appmaster.SliderAppMaster.getClusterFS() @bci=5, line=1369 (Interpreted frame)
 - org.apache.slider.server.appmaster.SliderAppMaster.createAndRunCluster(java.lang.String) @bci=1291, line=822 (Interpreted frame)
 - org.apache.slider.server.appmaster.SliderAppMaster.runService() @bci=162, line=576 (Interpreted frame)
 - org.apache.slider.core.main.ServiceLauncher.launchService(org.apache.hadoop.conf.Configuration, java.lang.String[], boolean) @bci=128, line=188 (Interpreted frame)
 - org.apache.slider.core.main.ServiceLauncher.launchServiceRobustly(org.apache.hadoop.conf.Configuration, java.lang.String[]) @bci=4, line=475 (Interpreted frame)
 - org.apache.slider.core.main.ServiceLauncher.launchServiceAndExit(java.util.List) @bci=21, line=403 (Interpreted frame)
 - org.apache.slider.core.main.ServiceLauncher.serviceMain(java.util.List) @bci=143, line=630 (Interpreted frame)
 - org.apache.slider.server.appmaster.SliderAppMaster.main(java.lang.String[]) @bci=24, line=2327 (Interpreted frame)


{noformat}
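
For reference, the two traces look like a lock-order inversion: the callback thread appears to hold the SliderAppMaster monitor and wants the AppState one, while main appears to hold AppState and blocks on entry to AbstractService.getConfig() on the SliderAppMaster instance. Below is a minimal standalone sketch of that pattern, not Slider code. LockOrderDemo, holdThenTake and reproduceAndDetect are hypothetical names, and the two plain Objects only stand in for the real monitors. The standard ThreadMXBean.findMonitorDeadlockedThreads() call is used to confirm the cycle from inside the JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class; nothing here is taken from the Slider code base.
class LockOrderDemo {

    // Takes `first`, waits until the peer thread also holds its first lock,
    // then tries to take `second`. If the peer uses the opposite order,
    // neither thread can make progress.
    static Thread holdThenTake(String name, Object first, Object second,
                               CountDownLatch bothHoldFirst) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await();
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // Not reached once both threads hold their first lock.
                }
            }
        }, name);
        t.setDaemon(true); // let the JVM exit despite the stuck threads
        t.start();
        return t;
    }

    // Returns true if the JVM reports both demo threads as monitor-deadlocked.
    static boolean reproduceAndDetect() {
        Object appMaster = new Object(); // stand-in for the SliderAppMaster monitor
        Object appState = new Object();  // stand-in for the AppState monitor
        CountDownLatch bothHoldFirst = new CountDownLatch(2);

        // Opposite acquisition orders, as the deadlock report suggests.
        Thread callback = holdThenTake("callback-like", appMaster, appState, bothHoldFirst);
        Thread mainLike = holdThenTake("main-like", appState, appMaster, bothHoldFirst);

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + 5000;
        while (System.currentTimeMillis() < deadline) {
            long[] ids = threads.findMonitorDeadlockedThreads(); // null if none
            if (ids != null) {
                int found = 0;
                for (long id : ids) {
                    if (id == callback.getId() || id == mainLike.getId()) {
                        found++;
                    }
                }
                if (found == 2) {
                    return true;
                }
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + reproduceAndDetect());
    }
}
```

The usual ways out are to take both monitors in the same order on every path, or to avoid calling synchronized methods on the AM while the AppState lock is held. Which of these fits Slider is not something the traces alone settle.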

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
