[ https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wangda Tan updated YARN-8135:
-----------------------------
    Description: 
Description:

*Goals:*
 - Allow infra engineers / data scientists to run *unmodified* TensorFlow jobs on YARN.
 - Allow jobs to easily access data/models in HDFS and other storage systems.
 - Launch services to serve TensorFlow/MXNet models.
 - Support running distributed TensorFlow jobs with simple configs.
 - Support running user-specified Docker images.
 - Support specifying GPU and other resources.
 - Support launching TensorBoard if the user specifies it.
 - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)

*Why this name?*
 - Because a submarine is the only vehicle that can take humans to deep places. B-)

Comparison with other projects:

!image-2018-04-09-14-44-41-101.png!

*Notes:*

*GPU isolation in the XLearning project is achieved by a patched YARN, which differs from the community’s GPU isolation solution.

**XLearning needs a few modifications to read ClusterSpec from env.

*References:*
 - TensorFlowOnSpark (Yahoo): [https://github.com/yahoo/TensorFlowOnSpark]
 - TensorFlowOnYARN (Intel): [https://github.com/Intel-bigdata/TensorFlowOnYARN]
 - Spark Deep Learning (Databricks): [https://github.com/databricks/spark-deep-learning]
 - XLearning (Qihoo360): [https://github.com/Qihoo360/XLearning]
 - Kubeflow (Google): [https://github.com/kubeflow/kubeflow]

  was:
Description:

*Goals:*
 - Allow infra engineers / data scientists to run *unmodified* TensorFlow jobs on YARN.
 - Allow jobs to easily access data/models in HDFS and other storage systems.
 - Launch services to serve TensorFlow/MXNet models.
 - Support running distributed TensorFlow jobs with simple configs.
 - Support running user-specified Docker images.
 - Support specifying GPU and other resources.
 - Support launching TensorBoard if the user specifies it.
 - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)

*Why this name?*
 - Because a submarine is the only vehicle that can take humans to deep places. B-)

Comparison with other projects:

!image-2018-04-09-14-35-16-778.png!
*Notes:*

* GPU isolation in the XLearning project is achieved by a patched YARN, which differs from the community’s GPU isolation solution.

** XLearning needs a few modifications to read ClusterSpec from env.

*References:*
 - TensorFlowOnSpark (Yahoo): https://github.com/yahoo/TensorFlowOnSpark
 - TensorFlowOnYARN (Intel): https://github.com/Intel-bigdata/TensorFlowOnYARN
 - Spark Deep Learning (Databricks): https://github.com/databricks/spark-deep-learning
 - XLearning (Qihoo360): https://github.com/Qihoo360/XLearning
 - Kubeflow (Google): https://github.com/kubeflow/kubeflow


> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8135
>                 URL: https://issues.apache.org/jira/browse/YARN-8135
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Major
>         Attachments: image-2018-04-09-14-35-16-778.png, image-2018-04-09-14-44-41-101.png
>
>
> Description:
> *Goals:*
>  - Allow infra engineers / data scientists to run *unmodified* TensorFlow jobs on YARN.
>  - Allow jobs to easily access data/models in HDFS and other storage systems.
>  - Launch services to serve TensorFlow/MXNet models.
>  - Support running distributed TensorFlow jobs with simple configs.
>  - Support running user-specified Docker images.
>  - Support specifying GPU and other resources.
>  - Support launching TensorBoard if the user specifies it.
>  - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because a submarine is the only vehicle that can take humans to deep places. B-)
> Comparison with other projects:
> !image-2018-04-09-14-44-41-101.png!
> *Notes:*
> *GPU isolation in the XLearning project is achieved by a patched YARN, which differs from the community’s GPU isolation solution.
> **XLearning needs a few modifications to read ClusterSpec from env.
> *References:*
>  - TensorFlowOnSpark (Yahoo): [https://github.com/yahoo/TensorFlowOnSpark]
>  - TensorFlowOnYARN (Intel): [https://github.com/Intel-bigdata/TensorFlowOnYARN]
>  - Spark Deep Learning (Databricks): [https://github.com/databricks/spark-deep-learning]
>  - XLearning (Qihoo360): [https://github.com/Qihoo360/XLearning]
>  - Kubeflow (Google): [https://github.com/kubeflow/kubeflow]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)