[ https://issues.apache.org/jira/browse/YARN-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14687395#comment-14687395 ]
Anubhav Dhoot commented on YARN-4046:
-------------------------------------

[~cnauroth] appreciate your review

> Applications fail on NM restart on some linux distro because NM container
> recovery declares AM container as LOST
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4046
>                 URL: https://issues.apache.org/jira/browse/YARN-4046
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Anubhav Dhoot
>            Assignee: Anubhav Dhoot
>            Priority: Critical
>         Attachments: YARN-4096.001.patch
>
>
> On a debian machine we have seen node manager recovery of containers fail
> because the signal syntax for process group may not work. We see errors in
> checking if process is alive during container recovery which causes the
> container to be declared as LOST (154) on a NodeManager restart.
> The application will fail with error
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt
> recovered after RM restartAM Container for
> appattempt_1439244348718_0001_000001 exited with exitCode: 154
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)