[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728020#comment-16728020 ]
Debasish Das commented on SPARK-24374:
--------------------------------------

Hi [~mengxr], now that barrier mode is available, would it be possible to use the native TF parameter server in place of MPI? Although we are offloading compute from Spark to the TF workers/PS, if an exception comes out, tracking it with the native TF API might be easier than with an MPI exception. Great work, by the way. I was looking for a cloud-ml alternative using Spark over AWS/Azure/GCP, and it looks like barrier mode should help a lot. I am still not clear on the limitations of the TensorFlowOnSpark project from Yahoo [https://github.com/yahoo/TensorFlowOnSpark], which tried to provide barrier-like syntax: if a few partitions fail on tfrecord read / communication exceptions, can it re-run the full job, or will it only re-run the failed partitions? I guess the exceptions from the failed partitions can be thrown back to the Spark driver, and the driver can take the action to re-run. When multiple TF training jobs get scheduled on the same Spark cluster, I suspect TFoS might have issues as well.

> SPIP: Support Barrier Execution Mode in Apache Spark
> ----------------------------------------------------
>
>                 Key: SPARK-24374
>                 URL: https://issues.apache.org/jira/browse/SPARK-24374
>             Project: Spark
>          Issue Type: Epic
>          Components: ML, Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Major
>              Labels: Hydrogen, SPIP
>         Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users
> can properly embed distributed DL training as a Spark stage to simplify the
> distributed training workflow. For example, Horovod uses MPI to implement
> all-reduce to accelerate distributed TensorFlow training. The computation
> model is different from the MapReduce model used by Spark.
> In Spark, a task in a stage
> doesn’t depend on any other tasks in the same stage, and hence it can be
> scheduled independently. In MPI, all workers start at the same time and pass
> messages around. To embed this workload in Spark, we need to introduce a new
> scheduling model, tentatively named “barrier scheduling”, which launches all
> tasks at the same time and provides users enough information and tooling to
> embed distributed DL training. Spark can also provide an extra layer of fault
> tolerance in case some tasks fail in the middle, where Spark would abort
> all tasks and restart the stage.
> {quote}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
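To make the scheduling-model difference concrete, here is a minimal Python sketch (plain `threading`, not Spark code; `run_barrier_stage` and `flaky` are illustrative names, not any Spark API) of the semantics the SPIP describes: all tasks in a stage are launched together, synchronize at a barrier before communicating, and if any task fails, the whole stage is aborted and all tasks are retried together.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_barrier_stage(num_workers, work_fn, max_retries=3):
    """Launch all workers together; if any fails, abort and retry the whole stage."""
    for attempt in range(max_retries):
        barrier = threading.Barrier(num_workers)

        def worker(rank):
            # Every worker waits here until all workers have started,
            # mirroring MPI's "all workers start at the same time".
            # Note: this deadlocks if fewer than num_workers slots exist,
            # which is exactly why Spark needs gang (barrier) scheduling.
            barrier.wait()
            return work_fn(rank, attempt)

        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            futures = [pool.submit(worker, r) for r in range(num_workers)]
            try:
                return [f.result() for f in futures]
            except Exception:
                # One failed task fails the stage; restart all tasks together,
                # unlike regular Spark stages that retry only failed partitions.
                continue
    raise RuntimeError("stage failed after %d retries" % max_retries)

# Usage: a task that fails once, forcing a full-stage retry.
def flaky(rank, attempt):
    if attempt == 0 and rank == 2:
        raise RuntimeError("simulated task failure")
    return rank * rank

print(run_barrier_stage(4, flaky))  # all 4 tasks rerun together on the retry
```

The point of the sketch is the contrast with MapReduce-style retries: in regular Spark only the failed partition would be recomputed, but with barrier semantics the failure of rank 2 discards the results of ranks 0, 1, and 3 as well, and the whole gang is relaunched.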