Shivaram: Yes, we can call it "gang scheduling" or "barrier synchronization". Spark doesn't support it now. The proposal is to add proper support in Spark's job scheduler so that we can integrate well with MPI-like frameworks.
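As a concrete illustration of the all-or-nothing semantics being discussed (schedule every task in the gang at once, or give up after a timeout), here is a minimal sketch. The function and parameter names are hypothetical, not a Spark API:

```python
import time

def gang_schedule(num_workers, free_slots, timeout_s, poll_s=0.01):
    """All-or-nothing scheduling: launch all num_workers tasks in a
    single wave, or give up after timeout_s. free_slots is a callable
    returning the number of currently available executor slots.
    (Illustrative only; not a Spark API.)"""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if free_slots() >= num_workers:
            # Enough slots for the whole gang: launch all tasks at once.
            return [f"task-{i}" for i in range(num_workers)]
        time.sleep(poll_s)  # never launch a partial wave; wait instead
    raise TimeoutError(f"could not co-schedule {num_workers} tasks")

# Example: a cluster whose free capacity grows over time.
slots = iter([10, 30, 50])
wave = gang_schedule(num_workers=50, free_slots=lambda: next(slots),
                     timeout_s=1.0)
print(len(wave))  # 50 -- all tasks launched together
```

The key contrast with Spark's current scheduler is that no task starts until all of them can: a wave of 10 running tasks waiting on 40 unscheduled peers is exactly the hang the thread describes.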
On Tue, May 8, 2018 at 11:17 AM Nan Zhu <zhunanmcg...@gmail.com> wrote:
> .....how I skipped the last part........
>
> On Tue, May 8, 2018 at 11:16 AM, Reynold Xin <r...@databricks.com> wrote:
>> Yes, Nan, totally agree. To be on the same page, that's exactly what I wrote, wasn't it?
>>
>> On Tue, May 8, 2018 at 11:14 AM Nan Zhu <zhunanmcg...@gmail.com> wrote:
>>> Besides that, one thing needed by multiple frameworks is to schedule tasks in a single wave, i.e. if a framework like xgboost/mxnet requires 50 parallel workers, Spark should provide the capability to ensure that either all 50 tasks run at once, or the whole application/job quits after some timeout period.
>>>
>>> Best,
>>> Nan
>>>
>>> On Tue, May 8, 2018 at 11:10 AM, Reynold Xin <r...@databricks.com> wrote:
>>>> I think that's what Xiangrui was referring to. Instead of retrying a single task, retry the entire stage, and the entire stage of tasks needs to be scheduled all at once.
>>>>
>>>> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
>>>>>>> - Fault tolerance and execution model: Spark assumes fine-grained task recovery, i.e. if something fails, only that task is rerun. This doesn't match the execution model of distributed ML/DL frameworks, which are typically MPI-based, and rerunning a single task would lead to the entire system hanging. A whole stage needs to be re-run.
>>>>>>
>>>>>> This is not only useful for integrating with 3rd-party frameworks, but also useful for scaling MLlib algorithms. One of my earliest attempts in Spark MLlib was to implement an All-Reduce primitive (SPARK-1485 <https://issues.apache.org/jira/browse/SPARK-1485>). But we ended up with some compromised solutions.
>>>>>> With the new execution model, we can set up a hybrid cluster and do all-reduce properly.
>>>>>
>>>>> Is there a particular new execution model you are referring to, or do we plan to investigate a new execution model? For the MPI-like model, we also need gang scheduling (i.e. schedule all tasks at once or none of them), and I don't think we have support for that in the scheduler right now.
>>>>>
>>>>>> --
>>>>>> Xiangrui Meng
>>>>>> Software Engineer
>>>>>> Databricks Inc. <http://databricks.com/>

--
Xiangrui Meng
Software Engineer
Databricks Inc. <http://databricks.com/>
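To see why an all-reduce requires every rank to be present (and hence why a failed stage must be retried as a whole rather than task by task), the primitive can be sketched in pure Python. This is an illustrative toy, not SPARK-1485 or MPI code: in an MPI-style setting each step below would be a blocking collective, so if even one worker's task is missing or individually retried, every other worker hangs at that step.

```python
def all_reduce(worker_vectors):
    """Naive all-reduce: sum the vectors held by all workers, then give
    every worker a copy of the global sum (reduce + broadcast).
    Toy illustration only."""
    n = len(worker_vectors[0])
    # All ranks must participate with same-shaped data, or the collective
    # cannot complete -- the analogue of the hang described above.
    assert all(len(v) == n for v in worker_vectors)
    total = [sum(col) for col in zip(*worker_vectors)]  # reduce step
    return [list(total) for _ in worker_vectors]        # broadcast step

# 4 workers, each holding a partial gradient:
partials = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(all_reduce(partials))  # every worker now holds [16, 20]
```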