[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652420#comment-16652420 ]
Eric Yang edited comment on YARN-8489 at 10/16/18 8:49 PM:
-----------------------------------------------------------

[~leftnoteasy] {quote}We will not support notebook and distributed TF job running in the service. I don't hear open source community like jupyter has support of this (connecting to a running distributed TF job and use it as executor). And I didn't see TF claims to support this or plan to support.{quote}

Jupyter notebook is part of the official TensorFlow Docker image, and the architecture is [explained|https://www.tensorflow.org/extend/architecture] in the official [distributed Tensorflow|https://www.tensorflow.org/deploy/distributed] documentation. Here is an example of how to run distributed TensorFlow with a Jupyter notebook on YARN service:

{code}
{
  "name": "tensorflow-service",
  "version": "1.0",
  "kerberos_principal": {
    "principal_name": "hbase/_h...@example.com",
    "keytab": "file:///etc/security/keytabs/hbase.service.keytab"
  },
  "components": [
    {
      "name": "jupyter",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "true"
        }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "ps",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "launch_command": "python ps.py",
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "false"
        }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "worker",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "launch_command": "python worker.py",
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "false"
        }
      },
      "restart_policy": "NEVER"
    }
  ]
}
{code}

ps.py:
{code}
# ps.py: start a parameter server and block until the job ends.
# Assumes cluster (a tf.train.ClusterSpec) and FLAGS (job_name,
# task_index) are defined earlier in the script.
import tensorflow as tf

server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)
server.join()
{code}

In the Jupyter notebook, the user can write code on the fly:
{code}
with tf.Session("grpc://worker7.example.com:2222") as sess:
    for _ in range(10000):
        sess.run(train_op)
{code}

Isn't this the easiest way to iterate in a notebook without going through the ps/worker setup on every iteration? The only thing the user needs to write is worker.py, which is use-case driven. Am I missing something?
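The worker address passed to {{tf.Session}} above can be derived from the YARN service spec rather than hard-coded. As a minimal sketch (plain Python; the helper and the service/user/domain values are hypothetical, assuming the YARN Registry DNS naming convention component-instance.service.user.domain), the component names in the spec could be mapped to a {{tf.train.ClusterSpec}}-style dict like this:

```python
# Hypothetical helper: build a TF-style cluster spec dict from YARN service
# component instance names. Assumes Registry DNS names of the form
# <component>-<instance>.<service>.<user>.<domain>; the service name, user,
# and domain below are made up for illustration.

def cluster_spec(service, user, domain, n_ps, n_workers, port=2222):
    def host(component, index):
        return "%s-%d.%s.%s.%s:%d" % (component, index, service, user, domain, port)
    return {
        "ps": [host("ps", i) for i in range(n_ps)],
        "worker": [host("worker", i) for i in range(n_workers)],
    }

spec = cluster_spec("tensorflow-service", "hbase", "example.com",
                    n_ps=1, n_workers=1)
# spec["worker"][0] -> "worker-0.tensorflow-service.hbase.example.com:2222"
```

The resulting dict has the same shape TF 1.x expects for {{tf.train.ClusterSpec(spec)}}, so ps.py, worker.py, and the notebook can share one source of truth for addresses.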
> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>
>                 Key: YARN-8489
>                 URL: https://issues.apache.org/jira/browse/YARN-8489
>             Project: Hadoop YARN
>          Issue Type: Task
>          Components: yarn-native-services
>            Reporter: Wangda Tan
>            Priority: Major
>
> Existing YARN service supports termination policies tied to the different restart policies. For example, ALWAYS means the service will never be terminated, and NEVER means the service is terminated once all of its components have terminated.
> The name "dominant" might not be the most appropriate; we can figure out a better name. But in short, it means a dominant component whose final state determines the job's final state, regardless of the other components.
> Use cases:
> 1) A Tensorflow job has master/worker/ps/tensorboard components. Once the master reaches a final state, whether succeeded or failed, we should terminate ps/tensorboard/workers and mark the job succeeded/failed accordingly.
> 2) Not sure if this is a real-world use case: a service has multiple components, some of which are not restartable. For such services, if a non-restartable component fails, we should mark the whole service failed.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org