[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652453#comment-16652453 ]
Wangda Tan commented on YARN-8489:
----------------------------------

[~eyang], this is a bit different from Spark executors. From an external view, Spark is a fully managed service that runs tasks inside its executors; Livy is only responsible for sending code to the Spark service and waiting for the result. For TF, the PS can be deployed outside the workers as you showed, but computation still executes inside the workers (in your example, inside the notebook).

The separate PS deployment is not a widely used feature. AFAIK, only Google deploys that way internally, partly because they have super-large models that require a distributed PS. The separate PS deployment approach is not easy to manage and requires users to modify their source code. For most use cases, people avoid the distributed model because it is very hard to manage, serve, etc. After talking to many companies, for Submarine, in the short to mid term I prefer to only support PS within each job.

To your concern:
{quote}Isn't this the easiest way to iterate in notebook without going through ps/worker setup per iteration? The only thing that user needs to write is worker.py which is use case driven. Am I missing something? {quote}
The easiest way is not to handle PS at all from the notebook; users can choose Keras, etc. to build their model inside the notebook. Handling separate logic for PS inside the notebook is just overhead for users.

> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>
>                 Key: YARN-8489
>                 URL: https://issues.apache.org/jira/browse/YARN-8489
>             Project: Hadoop YARN
>          Issue Type: Task
>          Components: yarn-native-services
>            Reporter: Wangda Tan
>            Priority: Major
>
> The existing YARN service supports termination behavior via different restart
> policies. For example, ALWAYS means the service will not be terminated, and NEVER
> means that once all components have terminated, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better
> names. But simply put, it means a dominant component whose final state will
> determine the job's final state regardless of other components.
> Use cases:
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master reaches a
> final state, no matter whether it succeeded or failed, we should terminate
> ps/tensorboard/workers and mark the job succeeded/failed.
> 2) Not sure if it is a real-world use case: a service which has multiple
> components, some of which are not restartable. For such services, if such a
> component fails, we should mark the whole service failed.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
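The "PS within each job" layout the comment prefers can be sketched with TensorFlow's TF_CONFIG convention, where every task in the job learns the cluster layout from an environment variable rather than from code the user writes in the notebook. This is a minimal sketch: the hostnames and ports are hypothetical, and in a Submarine job the launcher, not the user, would export the variable into each container.

```python
import json
import os

# Hypothetical cluster layout for one job: PS tasks are launched
# alongside the workers inside the same job, so no separate PS
# deployment (and no user code changes) are needed.
cluster = {
    "ps": ["ps-0.example.com:2222"],
    "worker": ["worker-0.example.com:2222", "worker-1.example.com:2222"],
}

def tf_config_for(task_type, task_index):
    """Build the TF_CONFIG value one task of the job would export."""
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })

# Each container exports its own role before starting TensorFlow;
# TensorFlow reads TF_CONFIG to join the right cluster role.
os.environ["TF_CONFIG"] = tf_config_for("worker", 0)
```

Because every task sees the same `cluster` dict and only its own `task` entry differs, the user's `worker.py` stays free of any PS-specific logic.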
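The proposed "dominant component" rule can be illustrated with a small decision function: the dominant component's final state decides the whole service's final state, regardless of the other components. This is only a sketch of the semantics described above; the function, the state names, and the dict layout are illustrative and are not part of any actual YARN service API.

```python
# Illustrative terminal and non-terminal states (hypothetical names,
# not actual YARN service API constants).
SUCCEEDED, FAILED, RUNNING = "SUCCEEDED", "FAILED", "RUNNING"

def service_final_state(component_states, dominant):
    """Decide the service state from {component name: state}.

    If the dominant component (e.g. the TF master) reaches a final
    state, the service adopts that state and the remaining components
    (ps/tensorboard/workers) would be terminated; otherwise the
    service keeps running.
    """
    state = component_states.get(dominant)
    if state in (SUCCEEDED, FAILED):
        return state
    return RUNNING

states = {"master": SUCCEEDED, "worker": RUNNING, "ps": RUNNING}
# service_final_state(states, "master") -> "SUCCEEDED"
```

Use case 2 from the description fits the same shape: a non-restartable component can be treated as dominant with respect to FAILED, so its failure marks the whole service failed.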