[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652271#comment-16652271
 ] 

Wangda Tan commented on YARN-8489:
----------------------------------

[~eyang],

Basically, there are four models in Submarine for training jobs. 

1) A single node notebook runs single node TF training: 

The user has a single node notebook in which they can do whatever they want. 
The TF job runs inside the notebook and is not visible to Submarine.

2) A single node notebook launches distributed TF training: 

Even though this doesn't exist today, it could be supported in the future, for 
example by adding a Submarine interpreter to Zeppelin. However, the notebook 
service and the TF jobs do not belong to the same service, so this statement 
is not true: 
{quote} It would be bad user experience, if jupyter notebook and all work 
suddenly disappear when one ps server failed.
{quote}
3) Distributed TF job w/o notebook.

4) Single node TF job w/o notebook.

We will not support a notebook and a distributed TF job running in the same 
service. I haven't heard of an open source community such as Jupyter supporting 
this (connecting to a running distributed TF job and using it as an executor), 
and I haven't seen TF claim to support it or plan to support it.

And even if the TF/notebook communities supported this case, the notebook and 
the executors should belong to two separate services, just like the 
relationship between Jupyter and Spark.

> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>
>                 Key: YARN-8489
>                 URL: https://issues.apache.org/jira/browse/YARN-8489
>             Project: Hadoop YARN
>          Issue Type: Task
>          Components: yarn-native-services
>            Reporter: Wangda Tan
>            Priority: Major
>
> The existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and 
> NEVER means that once all components have terminated, the service will be 
> terminated.
> The name "dominant" might not be the most appropriate; we can figure out a 
> better name. Put simply, it means a component whose final state determines 
> the job's final state regardless of the other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master 
> reaches a final state, whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and then mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service that has multiple 
> components, some of which are not restartable. For such services, if a 
> component fails, we should mark the whole service as failed. 
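
The "dominant" rule described above can be sketched roughly as follows. This is 
only an illustration of the proposed semantics, not YARN service code: the 
State enum, the service_final_state function and the component names are all 
hypothetical, and the fallback branch (fail the service if any component 
failed) is an assumption layered on the NEVER behaviour described in the issue.

```python
from enum import Enum
from typing import Dict, Optional


class State(Enum):
    """Hypothetical component states, for illustration only."""
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


FINAL_STATES = {State.SUCCEEDED, State.FAILED}


def service_final_state(
    components: Dict[str, State], dominant: Optional[str] = None
) -> Optional[State]:
    """Return the service's final state, or None if it should keep running."""
    # Dominant rule: once the dominant component reaches a final state, that
    # state decides the service's state regardless of the other components.
    if dominant is not None:
        state = components[dominant]
        return state if state in FINAL_STATES else None

    # Without a dominant component, mimic the NEVER behaviour from the issue:
    # the service ends only once every component has terminated.
    # Assumption: any failed component makes the whole service failed.
    if all(s in FINAL_STATES for s in components.values()):
        failed = any(s is State.FAILED for s in components.values())
        return State.FAILED if failed else State.SUCCEEDED
    return None


# Use case 1: the master finishes while ps/worker/tensorboard still run.
job = {
    "master": State.SUCCEEDED,
    "worker": State.RUNNING,
    "ps": State.RUNNING,
    "tensorboard": State.RUNNING,
}
result_with_dominant = service_final_state(job, dominant="master")
result_without_dominant = service_final_state(job)
```

With "master" marked as dominant, the job is considered finished as soon as the 
master is, and the remaining components would then be terminated. Without a 
dominant component, the same job keeps running because tensorboard and ps never 
exit on their own.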



