[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489333#comment-16489333 ]

Xiangrui Meng commented on SPARK-24374:
---------------------------------------

[~galv] Thanks for your feedback!
 * SPARK-20327 allows Spark to request GPUs from YARN. Within Spark, we also 
need to address the issue of allocating custom resources to tasks. For 
example, a user can request that a task take 1 CPU core and 2 GPUs, and Spark 
needs to assign the executor as well as the GPU devices to that task. This is 
relevant to the proposal here, but I want to treat it as an orthogonal topic 
because barrier scheduling works with or without GPUs.
 * I picked version 3.0.0 because I'm not sure whether there will be breaking 
changes. The proposed API itself doesn't introduce breaking changes, but it 
adds a new model to the job scheduler and hence to the fault tolerance model. 
I want to see more design discussion first.
 * I think most MPI implementations use SSH. Spark standalone is usually 
configured with passwordless SSH, but this might be an issue on YARN / 
Kubernetes / Mesos. I thought about this, but I want to treat it as a separate 
issue. Spark executors can talk to each other in all those cluster modes, so 
users should have a way to set up a hybrid cluster within the barrier stage. 
The question is what info we need to provide at the very minimum.
 * "barrier" comes from "MPI_Barrier". Please check the example code in the 
SPIP. We need to set barrier in two scenarios: 1) configure a Spark stage to be 
in the barrier mode, where Spark job scheduler knows to wait until all task 
slots are ready to launch them, 2) within the mapPartitions closure, we still 
need to provide users contxt.barrier() to wait for data exchange or user 
program to finish on all tasks.
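
To make the two scenarios concrete, here is a minimal sketch of how the 
proposed API might look, following the example in the SPIP. The names 
RDD.barrier(), BarrierTaskContext.get(), getTaskInfos(), and 
context.barrier() are taken from the proposal and are not final APIs.

{code:scala}
import org.apache.spark.BarrierTaskContext

// Scenario 1: mark the stage as a barrier stage, so the scheduler waits
// until all task slots are available and launches the tasks together.
rdd.barrier().mapPartitions { iter =>
  val context = BarrierTaskContext.get()

  // Tasks can discover each other, e.g. to bootstrap a hybrid (MPI)
  // cluster. getTaskInfos() is an assumed helper exposing the addresses
  // of all tasks in the barrier stage.
  val hosts = context.getTaskInfos().map(_.address)

  // ... launch the distributed training program across `hosts` ...

  // Scenario 2: block until every task in the stage reaches this point,
  // analogous to MPI_Barrier.
  context.barrier()
  iter
}
{code}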

> SPIP: Support Barrier Scheduling in Apache Spark
> ------------------------------------------------
>
>                 Key: SPARK-24374
>                 URL: https://issues.apache.org/jira/browse/SPARK-24374
>             Project: Spark
>          Issue Type: Epic
>          Components: ML, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Major
>              Labels: SPIP
>         Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks fail in the middle, in which case Spark would 
> abort all tasks and restart the stage.
> {quote}


