[ https://issues.apache.org/jira/browse/SLIDER-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran reassigned SLIDER-1052:
--------------------------------------

    Assignee: Steve Loughran

> Deadlock in slider AM
> ---------------------
>
>                 Key: SLIDER-1052
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1052
>             Project: Slider
>          Issue Type: Bug
>    Affects Versions: Slider 0.80
>            Reporter: Sergey Shelukhin
>            Assignee: Steve Loughran
>            Priority: Critical
>
> I have a hung slider AM in the following state.
> The first app attempt failed to start, so this is the 2nd one. -However, the
> 1st app attempt process is still running on the same machine, and it is in a
> state where I cannot jstack it even with -F. I will kill it shortly and see
> what happens. YARN thinks it's killed.-.nm, it was some other process. The
> first container was on a different machine and did die.
> The 2nd attempt received the container death notification for the first one:
> {noformat}
> 2016-01-07 03:59:41,828 [AMRM Callback Handler Thread] INFO appmaster.SliderAppMaster - Container Completion for containerID=container_e02_1450721565699_0007_01_000001, state=COMPLETE, exitStatus=-105, diagnostics=Container killed by the ApplicationMaster.
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143
> {noformat}
> Note that this is from the 2nd container
> (container_e02_1450721565699_0007_02_000001) logs.
> Jstack for the 2nd attempt has the deadlock:
> {noformat}
> Found one Java-level deadlock:
> =============================
> "AMRM Callback Handler Thread":
>   waiting to lock Monitor@0x00007f1b953b18b8 (Object@0x00000000c022c6f0, a org/apache/slider/server/appmaster/state/AppState),
>   which is held by "main"
> "main":
>   waiting to lock Monitor@0x00007f1b953b1128 (Object@0x00000000c00db378, a org/apache/slider/server/appmaster/SliderAppMaster),
>   which is held by "AMRM Callback Handler Thread"
> {noformat}
> The jstack is with -F, so I cannot actually see thread names in the dump, but
> these look like it (not sure about the first one):
> {noformat}
> Thread 11054: (state = BLOCKED)
>  - org.apache.slider.server.appmaster.state.AppState.onCompletedNode(org.apache.hadoop.yarn.api.records.ContainerStatus) @bci=0, line=1534 (Interpreted frame)
>  - org.apache.slider.server.appmaster.SliderAppMaster.onContainersCompleted(java.util.List) @bci=119, line=1606 (Interpreted frame)
>  - org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run() @bci=141, line=300 (Interpreted frame)
> ...
> Thread 10254: (state = BLOCKED)
>  - org.apache.hadoop.service.AbstractService.getConfig() @bci=0, line=403 (Interpreted frame)
>  - org.apache.slider.server.appmaster.SliderAppMaster.getClusterFS() @bci=5, line=1369 (Interpreted frame)
>  - org.apache.slider.server.appmaster.SliderAppMaster.createAndRunCluster(java.lang.String) @bci=1291, line=822 (Interpreted frame)
>  - org.apache.slider.server.appmaster.SliderAppMaster.runService() @bci=162, line=576 (Interpreted frame)
>  - org.apache.slider.core.main.ServiceLauncher.launchService(org.apache.hadoop.conf.Configuration, java.lang.String[], boolean) @bci=128, line=188 (Interpreted frame)
>  - org.apache.slider.core.main.ServiceLauncher.launchServiceRobustly(org.apache.hadoop.conf.Configuration, java.lang.String[]) @bci=4, line=475 (Interpreted frame)
>  - org.apache.slider.core.main.ServiceLauncher.launchServiceAndExit(java.util.List) @bci=21, line=403 (Interpreted frame)
>  - org.apache.slider.core.main.ServiceLauncher.serviceMain(java.util.List) @bci=143, line=630 (Interpreted frame)
>  - org.apache.slider.server.appmaster.SliderAppMaster.main(java.lang.String[]) @bci=24, line=2327 (Interpreted frame)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
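
The lock-order inversion in the jstack output above (one thread holds the SliderAppMaster monitor and waits for AppState, the other holds AppState and waits for SliderAppMaster) can be illustrated with a minimal, self-contained Java sketch. This is not Slider code: the class `LockOrderSketch`, the method `reproduce()`, the thread names, and the two plain `Object` monitors are hypothetical stand-ins chosen for illustration. The sketch forces the inversion with a latch and then asks the JVM whether it sees the same kind of monitor deadlock.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

final class LockOrderSketch {

    /**
     * Starts two daemon threads that take two monitors in opposite order,
     * then polls the JVM deadlock detector. Returns true once both sketch
     * threads are reported as deadlocked, false if that never happens.
     */
    static boolean reproduce() {
        // Hypothetical stand-ins for the two monitors named in the report.
        final Object appMaster = new Object(); // plays the SliderAppMaster monitor
        final Object appState = new Object();  // plays the AppState monitor
        final CountDownLatch bothHoldFirst = new CountDownLatch(2);

        // Plays the callback thread: holds appMaster, then wants appState.
        Thread callback = new Thread(() -> {
            synchronized (appMaster) {
                awaitOther(bothHoldFirst);
                synchronized (appState) { /* never reached */ }
            }
        }, "callback-sketch");

        // Plays the main thread: holds appState, then wants appMaster.
        Thread main = new Thread(() -> {
            synchronized (appState) {
                awaitOther(bothHoldFirst);
                synchronized (appMaster) { /* never reached */ }
            }
        }, "main-sketch");

        // Daemon threads, so the stuck pair does not keep the JVM alive.
        callback.setDaemon(true);
        main.setDaemon(true);
        callback.start();
        main.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        try {
            for (int i = 0; i < 100; i++) {
                long[] ids = mx.findMonitorDeadlockedThreads(); // null if none
                if (ids != null
                        && contains(ids, callback.getId())
                        && contains(ids, main.getId())) {
                    return true;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    // Signals that this thread holds its first monitor, then waits for the other.
    private static void awaitOther(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static boolean contains(long[] ids, long id) {
        for (long candidate : ids) {
            if (candidate == id) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + reproduce());
    }
}
```

The usual remedies for this pattern are to acquire the two monitors in one fixed order on every code path, or to avoid calling into the second object while still holding the first object's monitor. Which of these, if either, fits the Slider AM is not stated in the report.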