[ https://issues.apache.org/jira/browse/YARN-5103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294140#comment-15294140 ]
Jason Lowe commented on YARN-5103:
----------------------------------

Thanks for the patch!  I'm OK skipping the unit test for this case.

Rather than catching IOException and explicitly checking the instance, we should let the normal catch processing do it for us, e.g.:
{code}
    } catch (InterruptedException | InterruptedIOException e) {
      LOG.warn("Interrupted while waiting for exit code from " + containerId);
      notInterrupted = false;
    } catch (IOException e) {
      LOG.error("Unable to recover container " + containerIdStr, e);
    }
{code}

I noticed this is targeted to 2.9, but I would think this should go into at least 2.8 as well?

> With NM recovery enabled, restarting NM multiple times results in AM restart
> ----------------------------------------------------------------------------
>
>                 Key: YARN-5103
>                 URL: https://issues.apache.org/jira/browse/YARN-5103
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Sumana Sathish
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-5103-demo.patch, YARN-5103.patch
>
>
> AM is restarted when NM is restarted multiple times even though NM recovery
> is enabled.
> {Code:title=NM log on which AM attempt 1 was running }
> ERROR launcher.RecoveredContainerLaunch
> (RecoveredContainerLaunch.java:call(88)) - Unable to recover container
> container_e12_1463043063682_0002_01_000001
> java.io.IOException: java.lang.InterruptedException
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:579)
>     at org.apache.hadoop.util.Shell.run(Shell.java:487)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:478)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerProcessAlive(LinuxContainerExecutor.java:542)
>     at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:185)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:445)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {Code}
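
For anyone following along, here is a minimal standalone sketch of the catch ordering suggested in the comment. It is not the NodeManager code: the class name RecoveryCatchDemo, the Reacquire interface, and the returned labels are made up for illustration, and it assumes the blocking call reports an interruption as either InterruptedException or InterruptedIOException. Because InterruptedIOException is a subclass of IOException, the multi-catch clause has to come before the general IOException clause, and then no instanceof check is needed.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

public class RecoveryCatchDemo {

  /** Hypothetical stand-in for the blocking call that reacquires a container. */
  interface Reacquire {
    int run() throws IOException, InterruptedException;
  }

  /** Returns a label naming which branch handled the outcome of the call. */
  static String recover(Reacquire call) {
    try {
      call.run();
      return "completed";
    } catch (InterruptedException | InterruptedIOException e) {
      // Both interruption flavors land here; no instanceof check is needed.
      return "interrupted";
    } catch (IOException e) {
      // Every other IOException is treated as a real recovery failure.
      return "failed";
    }
  }

  public static void main(String[] args) {
    System.out.println(recover(() -> 0));
    System.out.println(recover(() -> { throw new InterruptedException("stop"); }));
    System.out.println(recover(() -> { throw new InterruptedIOException("stop"); }));
    System.out.println(recover(() -> { throw new IOException("disk error"); }));
  }
}
```

Note that a plain IOException that only carries the interruption in its message, like the java.io.IOException: java.lang.InterruptedException in the log above, would still fall through to the second clause in this sketch.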