Peter Simon created YARN-7686:
---------------------------------

             Summary: Yarn containers failover if datanode/nodemanager fails
                 Key: YARN-7686
                 URL: https://issues.apache.org/jira/browse/YARN-7686
             Project: Hadoop YARN
          Issue Type: New Feature
          Components: resourcemanager
    Affects Versions: 2.6.0
            Reporter: Peter Simon

While an application was running on Yarn, one of the datanodes/nodemanagers went offline due to power issues. The first application attempt failed because its containers were lost.

When the second attempt started, no heartbeat interval had yet elapsed toward the Namenode, so the second attempt was still offered the failed datanode/nodemanager as a possible worker node for its containers. Because the host was unreachable, the container attempts failed, which caused the second application attempt to fail as well and, with it, the whole application.

There could be a failover process for container attempts: if a new container can't be brought up on one node, the ResourceManager should try to allocate the new container on a different node.
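
To make the requested behaviour concrete, here is a minimal sketch of the failover policy described above. It is a plain Python simulation, not YARN code: `launch_with_failover`, `try_launch`, `max_attempts` and the node names are hypothetical and do not correspond to any ResourceManager API.

```python
from typing import Callable, Iterable, Optional, Set


def launch_with_failover(
    candidate_nodes: Iterable[str],
    try_launch: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the node the container started on, or None if every try failed.

    Hypothetical policy: a node on which the launch failed is remembered
    and skipped, and the next candidate node is tried instead.
    """
    failed_nodes: Set[str] = set()
    attempts = 0
    for node in candidate_nodes:
        if attempts >= max_attempts:
            break  # give up after a bounded number of launch attempts
        if node in failed_nodes:
            continue  # only matters if a node is listed more than once
        attempts += 1
        if try_launch(node):
            return node
        failed_nodes.add(node)
    return None


# Hypothetical cluster: node1 has lost power but is not yet marked as dead,
# so it is still offered as a candidate.
unreachable = {"node1"}
chosen = launch_with_failover(
    ["node1", "node2", "node3"],
    try_launch=lambda node: node not in unreachable,
)
print(chosen)  # node2
```

In this sketch the failed launch on `node1` costs one attempt, and the container is then placed on `node2` instead of the whole application attempt failing.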