[ https://issues.apache.org/jira/browse/YARN-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anubhav Dhoot updated YARN-4046:
--------------------------------
    Summary: Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST  (was: NM container recovery is broken on some linux distro because of syntax of signal)

> Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4046
>                 URL: https://issues.apache.org/jira/browse/YARN-4046
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Anubhav Dhoot
>            Assignee: Anubhav Dhoot
>            Priority: Critical
>
> On a Debian machine we have seen NodeManager recovery of containers fail
> because the signal syntax for a process group may not work. We see errors
> when checking whether the process is alive during container recovery, which
> causes the container to be declared as LOST (154) on a NodeManager restart.
> The application will fail with the error
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt recovered after RM restartAM Container for appattempt_1439244348718_0001_000001 exited with exitCode: 154
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)