[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728020#comment-16728020 ]

Debasish Das commented on SPARK-24374:
--------------------------------------

Hi [~mengxr], with barrier mode available, is it not possible to use the native
TF parameter server instead of MPI? Although we are offloading compute from
Spark to TF workers/PS, if an exception comes out, tracking it through the
native TF API might be easier than through an MPI exception. Great work, by the
way. I was looking for a Cloud ML alternative using Spark on AWS/Azure/GCP, and
barrier mode should help a lot. I am still not clear, though, on the
limitations of Yahoo's TensorFlowOnSpark project
[https://github.com/yahoo/TensorFlowOnSpark], which tried to add barrier-like
syntax: if a few partitions fail on TFRecord read or communication exceptions,
I am not sure whether it re-runs the full job or only the failed partitions. I
guess the exceptions from those partitions can be thrown back to the Spark
driver, and the driver can decide on the re-run. When multiple TF training jobs
get scheduled on the same Spark cluster, I suspect TFoS might have issues as
well.
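
For reference, here is a minimal sketch of what a barrier stage hosting TF
workers could look like with the Spark 2.4 barrier API. The
startTensorFlowWorker helper is a hypothetical placeholder, not an existing
integration; only RDD.barrier() and BarrierTaskContext are actual API.

{code:scala}
import org.apache.spark.BarrierTaskContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("barrier-tf-sketch").getOrCreate()
val sc = spark.sparkContext

// One barrier task per TF worker; all four tasks launch together or not at all.
sc.parallelize(0 until 4, 4).barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  // Addresses of all peer tasks in this barrier stage, usable to build a TF cluster spec.
  val peers = ctx.getTaskInfos().map(_.address)
  // startTensorFlowWorker(ctx.partitionId(), peers)  // hypothetical: launch TF worker/PS here
  ctx.barrier()  // global sync; an exception in any task aborts the whole stage on the driver
  iter
}.collect()
{code}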

> SPIP: Support Barrier Execution Mode in Apache Spark
> ----------------------------------------------------
>
>                 Key: SPARK-24374
>                 URL: https://issues.apache.org/jira/browse/SPARK-24374
>             Project: Spark
>          Issue Type: Epic
>          Components: ML, Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Major
>              Labels: Hydrogen, SPIP
>         Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from the MapReduce model used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks fail in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}
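
For illustration, a minimal sketch of the abort-and-restart behavior described
in the quote above, again using the Spark 2.4 barrier API and assuming an
existing SparkContext named sc. The failure injection is hypothetical and only
serves to show the stage-level retry, which is bounded by
spark.stage.maxConsecutiveAttempts.

{code:scala}
import org.apache.spark.BarrierTaskContext

// A failure in any barrier task aborts all tasks in the stage; Spark then retries
// the whole stage, rather than only the single failed task.
val out = sc.parallelize(0 until 4, 4).barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  if (ctx.partitionId() == 0 && ctx.stageAttemptNumber() == 0) {
    // Hypothetical failure injection: fail once, on the first stage attempt only.
    throw new RuntimeException("simulated worker failure")
  }
  ctx.barrier()  // peers waiting here are interrupted when partition 0 fails
  iter
}.collect()
{code}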



