[jira] [Resolved] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark

Xiangrui Meng (Jira) Fri, 19 Mar 2021 10:38:13 -0700


     [ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Xiangrui Meng resolved SPARK-24374.
-----------------------------------
       Fix Version/s: 2.4.0
    Target Version/s: 2.4.0
          Resolution: Fixed

I'm marking this epic Jira as done, given that the major feature was implemented
in 2.4.

> SPIP: Support Barrier Execution Mode in Apache Spark
> ----------------------------------------------------
>
>                 Key: SPARK-24374
>                 URL: https://issues.apache.org/jira/browse/SPARK-24374
>             Project: Spark
>          Issue Type: Epic
>          Components: ML, Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Major
>              Labels: Hydrogen, SPIP
>             Fix For: 2.4.0
>
>         Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance: if some tasks fail midway, Spark would abort all tasks and restart 
> the stage.
> {quote}
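
The scheduling semantics described in the quote (launch all tasks together, let them synchronize, and abort and restart the whole stage if any task fails) can be illustrated with a small standalone simulation. This is only a sketch under stated assumptions: it uses Python's standard `threading.Barrier` and plain threads rather than any Spark API, and the names `run_barrier_stage`, `make_allreduce_task`, and `StageAborted` are hypothetical, introduced here for illustration.

```python
import threading


class StageAborted(Exception):
    """Raised when the simulated stage fails on every attempt (hypothetical name)."""


def run_barrier_stage(task_fn, num_tasks, max_attempts=3):
    """Simulate a barrier stage: all tasks start together; any failure
    aborts the remaining tasks and the whole stage is retried."""
    for attempt in range(1, max_attempts + 1):
        barrier = threading.Barrier(num_tasks)
        results = [None] * num_tasks
        errors = []
        lock = threading.Lock()

        def worker(rank):
            try:
                barrier.wait()  # no task proceeds until all tasks are launched
                results[rank] = task_fn(rank, num_tasks, barrier, attempt)
            except threading.BrokenBarrierError:
                pass  # a peer failed, so this task is aborted too
            except Exception as exc:
                with lock:
                    errors.append((rank, exc))
                barrier.abort()  # release peers blocked at the barrier

        threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_tasks)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        if not errors:
            return results, attempt
        # Otherwise discard partial results and restart the whole stage.
    raise StageAborted(f"stage failed after {max_attempts} attempts")


def make_allreduce_task(flaky_rank=None):
    """Build a toy all-reduce task: each rank contributes rank + 1 and every
    rank returns the global sum. `flaky_rank` fails once, on attempt 1."""
    contributions = {}

    def task(rank, world_size, barrier, attempt):
        if rank == flaky_rank and attempt == 1:
            raise RuntimeError("simulated task failure")
        contributions[(attempt, rank)] = rank + 1
        barrier.wait()  # every peer has published its value
        return sum(contributions[(attempt, r)] for r in range(world_size))

    return task


if __name__ == "__main__":
    print(run_barrier_stage(make_allreduce_task(), 4))
    print(run_barrier_stage(make_allreduce_task(flaky_rank=2), 4))
```

In the second call, the simulated failure on the first attempt aborts the peers waiting at the barrier, and the stage succeeds only when all four tasks are rerun together. That all-or-nothing retry is the behavior the proposal contrasts with independently scheduled tasks.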



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

