Zhankun Tang created YARN-9167:
----------------------------------

             Summary: [Submarine] Support fault tolerance when Tensorflow worker container fails
                 Key: YARN-9167
                 URL: https://issues.apache.org/jira/browse/YARN-9167
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Zhankun Tang
            Assignee: Zhankun Tang

A long-running TensorFlow job needs to restart failed worker containers when something unexpected happens. Fortunately, TF can restore from checkpoints and continue training in a worker, so restarting the failed worker container appears to be sufficient.
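
The restart-and-resume idea described in this issue can be illustrated with a minimal sketch. It is not Submarine or YARN code and uses no TensorFlow API: the "worker" is a plain Python function, the "checkpoint" is a small JSON file, and every name here (run_worker, supervise, fail_at, max_restarts) is hypothetical and chosen only for illustration. The assumption is that a restarted worker reads the last checkpoint and continues from that step instead of starting over.

```python
import json
import os


class WorkerFailure(Exception):
    """Simulated unexpected failure of a worker (hypothetical)."""


def load_checkpoint(path):
    # Return the last saved step, or 0 when no checkpoint exists yet.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["step"]
    return 0


def save_checkpoint(path, step):
    # Write to a temporary file first so a crash cannot leave a half-written checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def run_worker(path, total_steps, fail_at=None):
    # Resume from the last checkpoint, then "train" one step at a time.
    step = load_checkpoint(path)
    while step < total_steps:
        step += 1
        save_checkpoint(path, step)
        if fail_at is not None and step == fail_at:
            raise WorkerFailure(f"worker died at step {step}")
    return step


def supervise(path, total_steps, max_restarts=3, fail_at=None):
    # Restart the worker after a failure, up to max_restarts times.
    restarts = 0
    while True:
        try:
            return run_worker(path, total_steps, fail_at), restarts
        except WorkerFailure:
            if restarts >= max_restarts:
                raise
            restarts += 1
            fail_at = None  # the simulated fault happens only once
```

In this sketch, calling supervise with a fresh checkpoint path, total_steps=10 and fail_at=4 fails once at step 4, restarts the worker, and finishes the remaining steps from the checkpoint rather than from step 0. A real implementation would have the container orchestrator relaunch the worker container and rely on TensorFlow's own checkpoint restore.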