[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652163#comment-16652163
 ] 

Wangda Tan commented on YARN-8489:
----------------------------------

[~eyang], 

I have thought about this, but it seems to me that both existing readiness 
checks are insufficient. 

In YARN service, a dependency controls launch order as well as readiness, so 
the dependencies have to form a DAG.

However, in TF for example, master and ps do not depend on each other at 
launch time, but once master has succeeded or failed, we should give the same 
state to the job. And once ps has failed, we should mark the job as failed as 
well.
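
To make the rule concrete, here is a minimal sketch in Python (for brevity 
only; this is not YARN service code). The State enum, the job_state function 
and the component names used as dictionary keys are all hypothetical. Checking 
master before ps is also an assumption about how the two rules should be 
ordered when both apply.

```python
from enum import Enum


class State(Enum):
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# States after which a component will not run again.
TERMINAL = (State.SUCCEEDED, State.FAILED)


def job_state(component_states):
    """Derive the job state from a mapping of component name -> State.

    Hypothetical helper: it only encodes the two TF rules described above.
    """
    master = component_states.get("master", State.RUNNING)
    # Rule 1: once master reaches a final state, the job takes that state.
    if master in TERMINAL:
        return master
    # Rule 2: a failed ps fails the job, even while master is still running.
    if component_states.get("ps") is State.FAILED:
        return State.FAILED
    # Otherwise the job keeps running.
    return State.RUNNING
```

Note that neither rule implies a launch-order edge between master and ps, 
which is why the existing dependency DAG does not express it.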

Maybe "dominant" is not the best field to add, but for TF training use cases 
it seems sufficient. If we want better extensibility, we can add a 
ServiceControlPlugin to the service master, for which an app master can 
specify its own implementation. That should be good for people who want to 
integrate with the service framework. 
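
As a rough illustration of what such a plugin could look like, here is a 
sketch in Python (for brevity; a real plugin would live in the Java service 
master). Only the name ServiceControlPlugin comes from the proposal above. The 
method decide_final_state, the DominantComponentPlugin class and its 
parameters are hypothetical and are not part of any existing YARN API.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import Dict, Iterable, Optional


class State(Enum):
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


class ServiceControlPlugin(ABC):
    """Hypothetical extension point consulted by the service master."""

    @abstractmethod
    def decide_final_state(self, states: Dict[str, State]) -> Optional[State]:
        """Return the service's final state, or None to keep it running."""


class DominantComponentPlugin(ServiceControlPlugin):
    """One possible implementation covering the "dominant" behaviour."""

    def __init__(self, dominant: str, fail_fast: Iterable[str] = ()):
        self.dominant = dominant          # component whose final state wins
        self.fail_fast = tuple(fail_fast)  # components whose failure fails the job

    def decide_final_state(self, states: Dict[str, State]) -> Optional[State]:
        dominant_state = states.get(self.dominant)
        # The dominant component's final state becomes the service's state.
        if dominant_state in (State.SUCCEEDED, State.FAILED):
            return dominant_state
        # Any failed fail-fast component fails the whole service.
        for name in self.fail_fast:
            if states.get(name) is State.FAILED:
                return State.FAILED
        return None


# Example wiring for the TF case: master is dominant, a ps failure is fatal.
tf_plugin = DominantComponentPlugin("master", fail_fast=("ps",))
```

With this shape, the "dominant" field becomes just one implementation, and 
other frameworks could supply different rules without changes to the service 
spec.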

Suggestions? [~billie.rinaldi], [~gsaha].

 

> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>
>                 Key: YARN-8489
>                 URL: https://issues.apache.org/jira/browse/YARN-8489
>             Project: Hadoop YARN
>          Issue Type: Task
>          Components: yarn-native-services
>            Reporter: Wangda Tan
>            Priority: Major
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and 
> NEVER means that if all components have terminated, the service will be 
> terminated.
> The name "dominant" might not be the most appropriate; we can figure out a 
> better name. Put simply, it means a component whose final state determines 
> the job's final state regardless of other components.
> Use cases: 
> 1) A TensorFlow job has master/worker/services/tensorboard. Once master 
> reaches a final state, no matter whether it succeeded or failed, we should 
> terminate ps/tensorboard/workers and then mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if a 
> component fails, we should mark the whole service as failed. 
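
For comparison, the existing termination behaviour quoted above for the ALWAYS 
and NEVER restart policies can be sketched as follows. This is an illustrative 
Python sketch, not YARN code: the function name and the dictionary argument 
are hypothetical, and it models only the two policies mentioned in the 
description.

```python
def should_terminate_service(restart_policy, component_terminated):
    """Decide whether the service should terminate.

    restart_policy: "ALWAYS" or "NEVER", as named in the description.
    component_terminated: mapping of component name -> bool (hypothetical).
    """
    if restart_policy == "ALWAYS":
        # Components are always restarted, so the service never terminates.
        return False
    if restart_policy == "NEVER":
        # The service terminates only when every component has terminated.
        # An empty mapping is treated as "nothing has terminated yet".
        return bool(component_terminated) and all(component_terminated.values())
    raise ValueError(f"policy not modelled in this sketch: {restart_policy}")
```

Under this rule no single component can end the service on its own, which is 
the gap the "dominant" concept is meant to fill.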



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
