[ 
https://issues.apache.org/jira/browse/SLIDER-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran reassigned SLIDER-1052:
--------------------------------------

    Assignee: Steve Loughran

> Deadlock in slider AM
> ---------------------
>
>                 Key: SLIDER-1052
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1052
>             Project: Slider
>          Issue Type: Bug
>    Affects Versions: Slider 0.80
>            Reporter: Sergey Shelukhin
>            Assignee: Steve Loughran
>            Priority: Critical
>
> I have a hung slider AM in the following state.
> The first app attempt failed to start, so this is the 2nd one. -However, the 
> 1st app attempt process is still running on the same machine, and it is in a 
> state where I cannot jstack it even with -F. I will kill it shortly and see 
> what happens. YARN thinks it's killed.- Never mind, it was some other process. 
> The first container was on a different machine and did die.
> The 2nd attempt received the container death notification for the first one:
> {noformat}
> 2016-01-07 03:59:41,828 [AMRM Callback Handler Thread] INFO  appmaster.SliderAppMaster - Container Completion for containerID=container_e02_1450721565699_0007_01_000001, state=COMPLETE, exitStatus=-105, diagnostics=Container killed by the ApplicationMaster.
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143
> {noformat}
> Note that this is from the logs of the 2nd attempt's container 
> (container_e02_1450721565699_0007_02_000001). The jstack for the 2nd 
> attempt shows the deadlock:
> {noformat}
> Found one Java-level deadlock:
> =============================
> "AMRM Callback Handler Thread":
>   waiting to lock Monitor@0x00007f1b953b18b8 (Object@0x00000000c022c6f0, a org/apache/slider/server/appmaster/state/AppState),
>   which is held by "main"
> "main":
>   waiting to lock Monitor@0x00007f1b953b1128 (Object@0x00000000c00db378, a org/apache/slider/server/appmaster/SliderAppMaster),
>   which is held by "AMRM Callback Handler Thread"
> {noformat}
> The jstack was taken with -F, so the dump does not show thread names, but 
> these look like the two threads involved (not sure about the first one):
> {noformat}
> Thread 11054: (state = BLOCKED)
>  - org.apache.slider.server.appmaster.state.AppState.onCompletedNode(org.apache.hadoop.yarn.api.records.ContainerStatus) @bci=0, line=1534 (Interpreted frame)
>  - org.apache.slider.server.appmaster.SliderAppMaster.onContainersCompleted(java.util.List) @bci=119, line=1606 (Interpreted frame)
>  - org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run() @bci=141, line=300 (Interpreted frame)
> ...
> Thread 10254: (state = BLOCKED)
>  - org.apache.hadoop.service.AbstractService.getConfig() @bci=0, line=403 (Interpreted frame)
>  - org.apache.slider.server.appmaster.SliderAppMaster.getClusterFS() @bci=5, line=1369 (Interpreted frame)
>  - org.apache.slider.server.appmaster.SliderAppMaster.createAndRunCluster(java.lang.String) @bci=1291, line=822 (Interpreted frame)
>  - org.apache.slider.server.appmaster.SliderAppMaster.runService() @bci=162, line=576 (Interpreted frame)
>  - org.apache.slider.core.main.ServiceLauncher.launchService(org.apache.hadoop.conf.Configuration, java.lang.String[], boolean) @bci=128, line=188 (Interpreted frame)
>  - org.apache.slider.core.main.ServiceLauncher.launchServiceRobustly(org.apache.hadoop.conf.Configuration, java.lang.String[]) @bci=4, line=475 (Interpreted frame)
>  - org.apache.slider.core.main.ServiceLauncher.launchServiceAndExit(java.util.List) @bci=21, line=403 (Interpreted frame)
>  - org.apache.slider.core.main.ServiceLauncher.serviceMain(java.util.List) @bci=143, line=630 (Interpreted frame)
>  - org.apache.slider.server.appmaster.SliderAppMaster.main(java.lang.String[]) @bci=24, line=2327 (Interpreted frame)
> {noformat}
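
The deadlock report above reads as a lock-order inversion: the callback thread holds the SliderAppMaster monitor and wants the AppState monitor, while "main" holds the AppState monitor and wants the SliderAppMaster monitor. The following is only a minimal sketch of that pattern, not Slider code. The class name LockOrderSketch, the two lock objects, and the thread names are hypothetical stand-ins, and the latch exists purely to make the inversion happen on every run. Detection uses the JDK's ThreadMXBean.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

/** Hypothetical illustration of the reported lock-order inversion; not Slider code. */
final class LockOrderSketch {
    // Stand-in for the SliderAppMaster instance monitor.
    private static final Object APP_MASTER = new Object();
    // Stand-in for the AppState instance monitor.
    private static final Object APP_STATE = new Object();

    /** Returns true if the JVM reports a monitor deadlock between the two sketch threads. */
    static boolean deadlockDetected() {
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Mirrors the callback thread: holds the app-master monitor, then wants app state.
        Thread callback = new Thread(() -> {
            synchronized (APP_MASTER) {
                awaitOther(bothHoldFirstLock);
                synchronized (APP_STATE) {
                    // never reached once the inversion occurs
                }
            }
        }, "callback-sketch");

        // Mirrors "main": holds the app-state monitor, then wants the app-master monitor.
        Thread starter = new Thread(() -> {
            synchronized (APP_STATE) {
                awaitOther(bothHoldFirstLock);
                synchronized (APP_MASTER) {
                    // never reached once the inversion occurs
                }
            }
        }, "main-sketch");

        // Daemon threads so the stuck pair does not keep the JVM alive.
        callback.setDaemon(true);
        starter.setDaemon(true);
        callback.start();
        starter.start();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        try {
            // Poll for up to about five seconds.
            for (int i = 0; i < 100; i++) {
                if (threads.findMonitorDeadlockedThreads() != null) {
                    return true;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    // Signals that this thread holds its first lock, then waits for the other thread.
    private static void awaitOther(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + deadlockDetected());
    }
}
```

In a sketch like this, the usual ways out are to take the two monitors in the same order on both paths, or to avoid calling a synchronized method of one object while holding the other's monitor. Whether either applies to the actual Slider code is not something the traces above settle.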



