[ https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431369#comment-16431369 ]
Wangda Tan commented on YARN-8135:
----------------------------------

[~oliverhuh...@gmail.com], thanks for the responses.

{quote}what does w/o modification mean ? {quote}
It means a vanilla TF program does not have to be modified in order to run on the framework.

{quote}As far as Kubeflow is deployed in the same cluster as Hadoop, Kubeflow should be able to access HDFS, through libhdfs or webhdfs interface? {quote}
Since TensorFlow supports reading from HDFS, ideally every platform can support this :). What I meant is that reading HDFS from TF needs a lot of configuration, and that making HDFS access from a Docker container easier needs some specific optimizations / considerations. Our ongoing prototype covers some of these problems.

{quote}ToS kind of supports GPU scheduling (not isolation) base on memory: if you ask for 1 GPU and a machine has 4 GPU, it asks for total memory * the portion of GPU you asked. {quote}
This is not easy for users and cannot guarantee proper isolation, so I didn't put a (√) for ToS.

> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop
> -------------------------------------------------------------------------------------------------------------
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Wangda Tan
> Assignee: Wangda Tan
> Priority: Major
> Attachments: image-2018-04-09-14-35-16-778.png, image-2018-04-09-14-44-41-101.png
>
> Description:
> *Goals:*
> - Allow infra engineers / data scientists to run *unmodified* TensorFlow jobs on YARN.
> - Allow jobs easy access to data/models in HDFS and other storage systems.
> - Can launch services to serve TensorFlow/MXNet models.
> - Support running distributed TensorFlow jobs with simple configs.
> - Support running user-specified Docker images.
> - Support specifying GPU and other resources.
> - Support launching TensorBoard if the user requests it.
> - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
> - Because a submarine is the only vehicle that lets humans explore deep places. B-)
> Comparison with other projects:
> !image-2018-04-09-14-44-41-101.png!
> *Notes:*
> *GPU isolation in the XLearning project is achieved by a patched YARN, which is different from the community’s GPU isolation solution.
> **XLearning needs a few modifications to read ClusterSpec from env.
> *References:*
> - TensorFlowOnSpark (Yahoo): [https://github.com/yahoo/TensorFlowOnSpark]
> - TensorFlowOnYARN (Intel): [https://github.com/Intel-bigdata/TensorFlowOnYARN]
> - Spark Deep Learning (Databricks): [https://github.com/databricks/spark-deep-learning]
> - XLearning (Qihoo360): [https://github.com/Qihoo360/XLearning]
> - Kubeflow (Google): [https://github.com/kubeflow/kubeflow]

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org