[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208087#comment-14208087 ]

Jason Lowe commented on YARN-2846:
----------------------------------

Thanks, Junping, patch looks better.  I'm +1 pending investigation of the 
ContainerLaunch path and why we don't have to deal with thread interruption 
there.

bq. But if a regular ContainerLaunch gets interrupted, we may not care about the 
running container's exit code, as these running containers should be killed soon

If we're going to kill normal containers on shutdown, then why wouldn't we also 
kill the containers we are recovering?  For the NM restart scenario we're not 
supposed to be killing any containers, so the question is really why interrupting 
the ContainerLaunch thread doesn't manifest as the container completing, the way 
it did for a recovered container.  If we know that's not possible, then we can 
put the patch in as-is; otherwise I'm wondering if there's another hole we need 
to plug.
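
For illustration only, here is a minimal sketch of the kind of guard being 
discussed, using hypothetical names rather than the actual NodeManager classes: 
skip persisting an exit code when the launching thread has been interrupted by 
shutdown, so a live container is not recorded as completed/LOST.

{code}
// Hypothetical sketch only -- illustrative names, not the real NM classes.
// It shows the guard under discussion: do not persist an exit code for a
// container whose launch/recovery thread was interrupted by NM shutdown,
// since the container process is still running.
public class ExitRecordGuardSketch {

  // Placeholder for the LOST marker (154 in the log below).
  static final int EXIT_CODE_LOST = 154;

  // Stand-in for the NM state store interface.
  interface StateStore {
    void storeContainerCompleted(String containerId, int exitCode);
  }

  static void maybeRecordCompletion(StateStore store, String containerId,
      int exitCode) {
    if (Thread.currentThread().isInterrupted()) {
      // NM is shutting down: the container is still alive, so skip recording
      // a bogus completion that would survive the restart.
      return;
    }
    store.storeContainerCompleted(containerId, exitCode);
  }
}
{code}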

> Incorrect persist exit code for running containers in reacquireContainer() 
> that interrupted by NodeManager restart.
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-2846
>                 URL: https://issues.apache.org/jira/browse/YARN-2846
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Blocker
>         Attachments: YARN-2846-demo.patch, YARN-2846.patch
>
>
> The NM restart work-preserving feature can cause a running AM container to get 
> LOST and killed while stopping the NM daemon. The exception looks like the following:
> {code}
> 2014-11-11 00:48:35,214 INFO  monitor.ContainersMonitorImpl 
> (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for 
> container-id container_1415666714233_0001_01_000084: 53.8 MB of 512 MB 
> physical memory used; 931.3 MB of 1.0 GB virtual memory used
> 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager 
> (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM
> 2014-11-11 00:48:35,299 INFO  mortbay.log (Slf4jLog.java:info(67)) - Stopped 
> HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060
> 2014-11-11 00:48:35,337 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - 
> Applications still running : [application_1415666714233_0001]
> 2014-11-11 00:48:35,338 INFO  ipc.Server (Server.java:stop(2437)) - Stopping 
> server on 45454
> 2014-11-11 00:48:35,344 INFO  ipc.Server (Server.java:run(706)) - Stopping 
> IPC Server listener on 45454
> 2014-11-11 00:48:35,346 INFO  logaggregation.LogAggregationService 
> (LogAggregationService.java:serviceStop(141)) - 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
>  waiting for pending aggregation during exit
> 2014-11-11 00:48:35,347 INFO  ipc.Server (Server.java:run(832)) - Stopping 
> IPC Server Responder
> 2014-11-11 00:48:35,347 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log 
> aggregation for application_1415666714233_0001
> 2014-11-11 00:48:35,348 WARN  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for 
> application application_1415666714233_0001
> 2014-11-11 00:48:35,358 WARN  monitor.ContainersMonitorImpl 
> (ContainersMonitorImpl.java:run(476)) - 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  is interrupted. Exiting.
> 2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch 
> (RecoveredContainerLaunch.java:call(87)) - Unable to recover container 
> container_1415666714233_0001_01_000001
> java.io.IOException: Interrupted while waiting for process 20001 to exit
>         at 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException: sleep interrupted
>         at java.lang.Thread.sleep(Native Method)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177)
>         ... 6 more
> {code}
> In reacquireContainer() of ContainerExecutor.java, the while loop that checks the 
> container process (the AM container) is interrupted by the NM stop. An IOException 
> is thrown and no ExitCodeFile is generated for the still-running container. The 
> IOException is then caught in the upper call (RecoveredContainerLaunch.call()) and 
> the exit code (which defaults to LOST when nothing is set) gets persisted in the 
> NMStateStore. 
> After the NM restarts, this container is recovered in the COMPLETE state but with 
> exit code LOST (154), which causes this (AM) container to get killed later. 
> We should avoid recording the exit code of a running container when the 
> process-checking loop is interrupted. 
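
For context, a minimal, self-contained sketch of the wait-loop pattern described 
above, with assumed names (not the actual ContainerExecutor code): it shows where 
the interruption surfaces and how it can be propagated so the caller skips 
persisting an exit code for a still-running container instead of defaulting it 
to LOST.

{code}
// Minimal sketch with assumed/hypothetical names, not the real ContainerExecutor
// method. It mirrors the loop described above: poll the reacquired process, and
// if NM shutdown interrupts the sleep, restore the interrupt status and rethrow
// so the caller can avoid recording an exit code for a running container.
import java.io.IOException;

public class ReacquireWaitSketch {

  // Stand-in for the executor's liveness probe (the real code checks the pid).
  static boolean processIsAlive(String pid) {
    return false; // placeholder so the sketch is runnable
  }

  static void waitForProcessExit(String pid) throws IOException {
    while (processIsAlive(pid)) {
      try {
        Thread.sleep(1000); // poll interval in the loop the stack trace points at
      } catch (InterruptedException e) {
        // The NM stop interrupted the recovery thread; the container is still
        // running, so surface the interruption instead of fabricating an exit code.
        Thread.currentThread().interrupt();
        throw new IOException(
            "Interrupted while waiting for process " + pid + " to exit", e);
      }
    }
    // Only a genuine process exit reaches this point; the caller may now read
    // the exit code file and persist the result.
  }

  public static void main(String[] args) throws IOException {
    waitForProcessExit("20001"); // pid taken from the log above, for illustration
  }
}
{code}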


