Peter Simon created YARN-7686:
-
Summary: Yarn containers failover if datanode/nodemanager fails
Key: YARN-7686
URL: https://issues.apache.org/jira/browse/YARN-7686
Project: Hadoop YARN
Issue Type: New Feature
Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Peter Simon
While running an application on YARN, one of the datanode/nodemanager hosts
went offline due to power issues. The first application attempt failed because
its containers were lost. When the second attempt started, no heartbeat
interval had yet elapsed for the Namenode, so the second attempt was still
offered the failed datanode/nodemanager as a possible worker node for its
containers. Because the host was unreachable, those container attempts failed,
which caused the second application attempt to fail as well and, with it, the
whole application.
There could be a failover process for container attempts: if a new container
can't be brought up on one node, the ResourceManager should try to allocate
the new container on a different node.
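
To illustrate the proposal, here is a minimal, self-contained sketch of the
retry-on-another-node idea. It is not YARN code: the class ContainerFailover,
the method allocateWithFailover, the tryLaunch callback, and the maxAttempts
limit are all hypothetical names made up for this example, and nodes are
represented by plain host-name strings.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch of container failover; none of these names are YARN APIs. */
final class ContainerFailover {

    /**
     * Walks the candidate nodes in order and returns the first node on which
     * the launch succeeds. Nodes that failed are remembered and skipped, and
     * at most maxAttempts launches are tried.
     */
    static Optional<String> allocateWithFailover(
            List<String> candidateNodes,
            Predicate<String> tryLaunch,   // assumed callback: true if the container started
            int maxAttempts) {
        Set<String> failedNodes = new HashSet<>();
        int attempts = 0;
        for (String node : candidateNodes) {
            if (attempts >= maxAttempts) {
                break;                     // give up after the configured number of launches
            }
            if (failedNodes.contains(node)) {
                continue;                  // never retry a node that already failed
            }
            attempts++;
            if (tryLaunch.test(node)) {
                return Optional.of(node);  // container is running on this node
            }
            failedNodes.add(node);         // remember the failure and fail over
        }
        return Optional.empty();           // no node could host the container
    }

    public static void main(String[] args) {
        // node1 stands in for the host that lost power; every launch there fails.
        List<String> nodes = List.of("node1", "node2", "node3");
        Optional<String> placed =
                allocateWithFailover(nodes, node -> !node.equals("node1"), 3);
        System.out.println(placed.orElse("no node available")); // prints node2
    }
}
```

The point of the sketch is that a failed launch on one host costs only one
container attempt instead of failing the whole application attempt. How such
a retry would fit into the ResourceManager's scheduler is left open here.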