[ 
https://issues.apache.org/jira/browse/SLIDER-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098755#comment-15098755
 ] 

Steve Loughran commented on SLIDER-1052:
----------------------------------------

W.r.t. the deadlock: the blocking of the callback thread is deliberate.

The container death event is one that YARN queues up while awaiting the AM
restart. The moment the AM restarts, it gets the list of failures, which have
to be delayed until the AM has actually rebuilt its state; otherwise, the AM
sees the container events and says "I don't know it".
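
As a rough illustration of that delay, here is a minimal sketch. It is not
Slider's code: the class, the method names and the container ID are all
hypothetical, and a CountDownLatch stands in for the locking the AM really
uses. The callback simply waits until the state has been rebuilt before it
looks the container up:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

class DelayedCompletionDemo {
    // Opened once the (hypothetical) AM has finished rebuilding its state.
    private final CountDownLatch stateRebuilt = new CountDownLatch(1);
    private final List<String> knownContainers = new CopyOnWriteArrayList<>();
    private final List<String> handled = new CopyOnWriteArrayList<>();

    /** Records the containers the restarted AM knows about, then opens the gate. */
    void rebuildState(List<String> containers) {
        knownContainers.addAll(containers);
        stateRebuilt.countDown();
    }

    /** Completion callback: blocks until the state exists, then looks the container up. */
    void onContainerCompleted(String id) throws InterruptedException {
        stateRebuilt.await();
        handled.add(knownContainers.contains(id) ? "released " + id : "unknown " + id);
    }

    /** Delivers one queued completion event before the state is rebuilt. */
    static List<String> run() {
        DelayedCompletionDemo am = new DelayedCompletionDemo();
        Thread callback = new Thread(() -> {
            try {
                am.onContainerCompleted("container_01"); // made-up container ID
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "demo-callback");
        callback.start();
        am.rebuildState(Collections.singletonList("container_01"));
        try {
            callback.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return am.handled;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Without the gate, an event that arrives before rebuildState() would be
reported as unknown, which is the "I don't know it" case.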

The real issue is that the getClusterFS() call isn't completing. I don't
understand why it is blocked in getConfig(). I see that method is tagged as
synchronized: I think it's a deep-lurking problem, one I can fix in Slider,
but which should be addressed elsewhere.
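
The lock-order inversion in the quoted jstack can be reproduced in a few
lines. This is a sketch under assumptions: two plain Objects stand in for the
SliderAppMaster and AppState monitors, the thread names are made up, and
nothing here is Slider or Hadoop code. One thread takes the "app master" lock
and then wants "app state" (as a synchronized callback would); the other holds
"app state" and then needs the "app master" lock (as a synchronized
getConfig() on that object would):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class LockOrderDemo {

    /** Counts down, then waits until the other thread also holds its first lock. */
    private static void holdAndWait(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns true if the JVM reports a monitor deadlock within about five seconds. */
    static boolean reproduce() {
        final Object appMaster = new Object(); // stand-in for the SliderAppMaster monitor
        final Object appState = new Object();  // stand-in for the AppState monitor
        final CountDownLatch bothHoldFirst = new CountDownLatch(2);

        // Plays the callback thread: app master lock first, then app state.
        Thread callback = new Thread(() -> {
            synchronized (appMaster) {
                holdAndWait(bothHoldFirst);
                synchronized (appState) { /* never reached */ }
            }
        }, "demo-callback");

        // Plays the startup thread: app state lock first, then app master.
        Thread startup = new Thread(() -> {
            synchronized (appState) {
                holdAndWait(bothHoldFirst);
                synchronized (appMaster) { /* never reached */ }
            }
        }, "demo-startup");

        // Daemon threads, so the stuck pair does not keep the JVM alive.
        callback.setDaemon(true);
        startup.setDaemon(true);
        callback.start();
        startup.start();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        try {
            for (int i = 0; i < 100; i++) {
                if (threads.findMonitorDeadlockedThreads() != null) {
                    return true;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + reproduce());
    }
}
```

Taking the two locks in a consistent order, or not calling a synchronized
method of one object while holding the other's monitor, removes the cycle.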

> Deadlock in slider AM
> ---------------------
>
>                 Key: SLIDER-1052
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1052
>             Project: Slider
>          Issue Type: Bug
>    Affects Versions: Slider 0.80
>            Reporter: Sergey Shelukhin
>            Priority: Critical
>
> I have a hung slider AM in the following state.
> The first app attempt failed to start, so this is the 2nd one. -However, the 
> 1st app attempt process is still running on the same machine, and it is in a 
> state where I cannot jstack it even with -F. I will kill it shortly and see 
> what happens.  YARN thinks it's killed.-.nm, it was some other process. The 
> first container was on a different machine and did die.
> The 2nd attempt received the container death notification for the first one:
> {noformat}
> 2016-01-07 03:59:41,828 [AMRM Callback Handler Thread] INFO  
> appmaster.SliderAppMaster - Container Completion for 
> containerID=container_e02_1450721565699_0007_01_000001, state=COMPLETE, 
> exitStatus=-105, diagnostics=Container killed by the ApplicationMaster.
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143
> {noformat}
> Note that this is from the 2nd container 
> (container_e02_1450721565699_0007_02_000001) logs. The jstack for the 2nd 
> attempt shows the deadlock:
> {noformat}
> Found one Java-level deadlock:
> =============================
> "AMRM Callback Handler Thread":
>   waiting to lock Monitor@0x00007f1b953b18b8 (Object@0x00000000c022c6f0, a 
> org/apache/slider/server/appmaster/state/AppState),
>   which is held by "main"
> "main":
>   waiting to lock Monitor@0x00007f1b953b1128 (Object@0x00000000c00db378, a 
> org/apache/slider/server/appmaster/SliderAppMaster),
>   which is held by "AMRM Callback Handler Thread"
> {noformat}
> The jstack is with -F, so I cannot actually see thread names in the dump, but 
> these look like it (not sure about the first one):
> {noformat}
> Thread 11054: (state = BLOCKED)
>  - 
> org.apache.slider.server.appmaster.state.AppState.onCompletedNode(org.apache.hadoop.yarn.api.records.ContainerStatus)
>  @bci=0, line=1534 (Interpreted frame)
>  - 
> org.apache.slider.server.appmaster.SliderAppMaster.onContainersCompleted(java.util.List)
>  @bci=119, line=1606 (Interpreted frame)
>  - 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run()
>  @bci=141, line=300 (Interpreted frame)
> ...
> Thread 10254: (state = BLOCKED)
>  - org.apache.hadoop.service.AbstractService.getConfig() @bci=0, line=403 
> (Interpreted frame)
>  - org.apache.slider.server.appmaster.SliderAppMaster.getClusterFS() @bci=5, 
> line=1369 (Interpreted frame)
>  - 
> org.apache.slider.server.appmaster.SliderAppMaster.createAndRunCluster(java.lang.String)
>  @bci=1291, line=822 (Interpreted frame)
>  - org.apache.slider.server.appmaster.SliderAppMaster.runService() @bci=162, 
> line=576 (Interpreted frame)
>  - 
> org.apache.slider.core.main.ServiceLauncher.launchService(org.apache.hadoop.conf.Configuration,
>  java.lang.String[], boolean) @bci=128, line=188 (Interpreted frame)
>  - 
> org.apache.slider.core.main.ServiceLauncher.launchServiceRobustly(org.apache.hadoop.conf.Configuration,
>  java.lang.String[]) @bci=4, line=475 (Interpreted frame)
>  - 
> org.apache.slider.core.main.ServiceLauncher.launchServiceAndExit(java.util.List)
>  @bci=21, line=403 (Interpreted frame)
>  - org.apache.slider.core.main.ServiceLauncher.serviceMain(java.util.List) 
> @bci=143, line=630 (Interpreted frame)
>  - 
> org.apache.slider.server.appmaster.SliderAppMaster.main(java.lang.String[]) 
> @bci=24, line=2327 (Interpreted frame)
> {noformat}



