.....how I skipped the last part........

On Tue, May 8, 2018 at 11:16 AM, Reynold Xin <r...@databricks.com> wrote:
> Yes, Nan, totally agree. To be on the same page, that's exactly what I
> wrote, wasn't it?
>
> On Tue, May 8, 2018 at 11:14 AM Nan Zhu <zhunanmcg...@gmail.com> wrote:
>
>> Besides that, one of the things needed by multiple frameworks is the
>> ability to schedule tasks in a single wave.
>>
>> I.e., if a framework like XGBoost/MXNet requires 50 parallel workers,
>> Spark should provide a capability to ensure that either we run all 50
>> tasks at once, or we quit the complete application/job after some
>> timeout period.
>>
>> Best,
>>
>> Nan
>>
>> On Tue, May 8, 2018 at 11:10 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> I think that's what Xiangrui was referring to. Instead of retrying a
>>> single task, retry the entire stage, and the entire stage of tasks
>>> needs to be scheduled all at once.
>>>
>>> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman
>>> <shiva...@eecs.berkeley.edu> wrote:
>>>
>>>>>> - Fault tolerance and execution model: Spark assumes fine-grained
>>>>>> task recovery, i.e. if something fails, only that task is rerun.
>>>>>> This doesn't match the execution model of distributed ML/DL
>>>>>> frameworks, which are typically MPI-based, and rerunning a single
>>>>>> task would lead to the entire system hanging. A whole stage needs
>>>>>> to be re-run.
>>>>>
>>>>> This is not only useful for integrating with 3rd-party frameworks,
>>>>> but also useful for scaling MLlib algorithms. One of my earliest
>>>>> attempts in Spark MLlib was to implement an All-Reduce primitive
>>>>> (SPARK-1485 <https://issues.apache.org/jira/browse/SPARK-1485>),
>>>>> but we ended up with some compromised solutions. With the new
>>>>> execution model, we can set up a hybrid cluster and do all-reduce
>>>>> properly.
>>>>
>>>> Is there a particular new execution model you are referring to, or do
>>>> we plan to investigate a new execution model? For the MPI-like model,
>>>> we also need gang scheduling (i.e. schedule all tasks at once or none
>>>> of them), and I don't think we have support for that in the scheduler
>>>> right now.
>>>>
>>>>> --
>>>>> Xiangrui Meng
>>>>> Software Engineer
>>>>> Databricks Inc. <http://databricks.com/>
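[Editor's note: the gang-scheduling semantics discussed above (either run all
N tasks of a wave together, or fail the job after a timeout) can be sketched
as a small standalone program. This is not Spark's scheduler API; the
`GangScheduler` class, `run_gang` method, and the slot-counting logic are
illustrative names invented for this sketch.]

```python
import threading
import time


class GangScheduler:
    """Toy all-or-nothing (gang) scheduler: a gang of tasks is launched
    only when every requested slot is free at the same time; otherwise
    the whole job fails after a timeout, rather than running partially."""

    def __init__(self, total_slots):
        self.total_slots = total_slots
        self.free_slots = total_slots
        self.cond = threading.Condition()

    def run_gang(self, n_tasks, task_fn, timeout_s):
        deadline = time.monotonic() + timeout_s
        with self.cond:
            # Wait until ALL n_tasks slots are simultaneously free.
            while self.free_slots < n_tasks:
                remaining = deadline - time.monotonic()
                if remaining <= 0 or not self.cond.wait(timeout=remaining):
                    # Could not reserve the whole wave in time: fail the
                    # job instead of running a partial (hung) wave.
                    raise TimeoutError(
                        f"could not reserve {n_tasks} slots in {timeout_s}s")
            # Reserve the entire wave atomically.
            self.free_slots -= n_tasks

        # Launch all tasks of the wave at once and wait for completion.
        threads = [threading.Thread(target=task_fn, args=(i,))
                   for i in range(n_tasks)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        with self.cond:
            self.free_slots += n_tasks
            self.cond.notify_all()
```

For example, with 4 slots a gang of 4 runs as a single wave, while a gang of
5 can never be satisfied and fails with `TimeoutError` once the deadline
passes, mirroring the "run 50 at once or quit after a timeout" behavior Nan
asks for.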