[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652606#comment-16652606
 ] 

Wangda Tan commented on YARN-8489:
----------------------------------

[~eyang],

Let me try to answer your questions:
{quote}Data scientist specify the cluster spec in notebook, parameter server 
partitions the models and tasks to increase workers effectiveness.
{quote}
Actually, people try to avoid using PS in TF as much as possible, given the 
poor performance of gRPC and the overhead of network communication. However, 
because it is currently the only solution for distributed TF, people use it 
when needed. Compared to standalone TF, distributed TF has a much smaller user 
base. 

What I'm thinking now is that only the final state of the dominant component 
(not of a component instance) will impact the service's state. Regarding your 
questions: 

bq. For example, what happen if during upgrade the dominant component is 
offline. Should the service terminate and clean up?
No, not if the dominant component is not in a final state yet. Upgrading is 
not considered a final state.

bq. How about flex dominant component to lesser nodes?
Flexing is not a final state, so it will not be impacted by the patch. 

bq. What is the order to evaluate dominant component and component dependencies?
No additional evaluation is needed: once the dominant component has succeeded 
or failed, the service master will finalize the service. 

bq.  How to handle restart policy in place of dominant component?
If the restart policy is NEVER, the dominant field will be ignored. Otherwise, 
the dominant field is allowed. 

I hope this explanation makes the scope clear. Here is the logic for how the 
dominant component affects the state of the service: 

{code} 
Component.state: 
- Transition to SUCCEEDED && component.dominant == true: Set service state to 
  SUCCEEDED. 
- Transition to FAILED && component.dominant == true: Set service state to 
  FAILED. 
{code}
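
For illustration only, here is a minimal sketch of the rule above, written in Python for brevity rather than in the Java of the actual service master. The names `State`, `Component`, `Service`, and `on_component_transition` are hypothetical and are not taken from the YARN code base or from the patch. The NEVER check mirrors the restart-policy answer above.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    # Hypothetical state names, chosen only for this sketch.
    STABLE = "STABLE"
    UPGRADING = "UPGRADING"
    FLEXING = "FLEXING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# Only these two count as final; upgrading and flexing do not.
FINAL_STATES = {State.SUCCEEDED, State.FAILED}


@dataclass
class Component:
    name: str
    dominant: bool = False
    restart_policy: str = "ON_FAILURE"  # illustrative default
    state: State = State.STABLE


@dataclass
class Service:
    state: State = State.STABLE


def on_component_transition(service, component, new_state):
    """Apply a component-level transition and propagate it to the service
    only when a dominant component reaches a final state."""
    component.state = new_state
    # Assumption from the comment: with restart policy NEVER the
    # dominant flag is ignored.
    honored = component.dominant and component.restart_policy != "NEVER"
    if honored and new_state in FINAL_STATES:
        service.state = new_state
```

With this sketch, a non-dominant worker failing, or a dominant master entering UPGRADING or FLEXING, leaves the service state untouched. Only the master reaching SUCCEEDED or FAILED finalizes the service.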

 

> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>
>                 Key: YARN-8489
>                 URL: https://issues.apache.org/jira/browse/YARN-8489
>             Project: Hadoop YARN
>          Issue Type: Task
>          Components: yarn-native-services
>            Reporter: Wangda Tan
>            Priority: Major
>
> The existing YARN service supports termination policies for different 
> restart policies. For example, ALWAYS means the service will not be 
> terminated, and NEVER means that if all components have terminated, the 
> service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out a 
> better name. Put simply, it means a component whose final state determines 
> the job's final state regardless of other components.
> Use cases: 
> 1) A TensorFlow job has master/worker/services/tensorboard. Once the master 
> reaches a final state, whether succeeded or failed, we should terminate 
> ps/tensorboard/workers and then mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service that has multiple 
> components, some of which are not restartable. For such services, if a 
> component fails, we should mark the whole service as failed. 



