Hi All,

I would like to invite you to review the design document for Barrier
Execution Mode:
https://docs.google.com/document/d/1GvcYR6ZFto3dOnjfLjZMtTezX0W5VYN9w1l4-tQXaZk/edit#

TL;DR: We announced Project Hydrogen at the recent Spark+AI Summit. A major
part of the project involves significant changes to Spark's execution
mode. This design doc proposes new APIs as well as a new execution mode
(known as barrier execution mode) to provide high-performance support for
DL workloads.

Major changes include:

   - Add RDDBarrier to support gang scheduling.
   - Add BarrierTaskContext to support global sync of all tasks in a stage.
   - Provide a better fault tolerance approach for barrier stages: if some
   tasks fail in the middle, retry all tasks in the same stage.
   - Integrate barrier execution mode with the Standalone cluster manager.
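
To make the intended semantics concrete, here is a minimal sketch in plain
Python. It is not Spark code and not the API proposed in the doc: threads
stand in for tasks, and the names run_barrier_stage and flaky_task are
hypothetical. It only illustrates the three ideas above, assuming that all
tasks of a stage start together, meet at a global barrier, and are all
retried if any one of them fails.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_barrier_stage(task_fn, num_tasks, max_attempts=3):
    """Run all tasks of a stage together; if any task fails, retry them all."""
    for attempt in range(1, max_attempts + 1):
        barrier = threading.Barrier(num_tasks)

        def run_one(task_id):
            try:
                return task_fn(task_id, attempt, barrier)
            except Exception:
                # Break the barrier so peers waiting on it fail fast
                # instead of hanging.
                barrier.abort()
                raise

        # One worker per task, so every task can run at the same time
        # (a stand-in for gang scheduling).
        with ThreadPoolExecutor(max_workers=num_tasks) as pool:
            futures = [pool.submit(run_one, i) for i in range(num_tasks)]
        # Leaving the "with" block waits for every task to finish.
        try:
            return [f.result() for f in futures], attempt
        except Exception:
            continue  # at least one task failed: rerun the whole stage
    raise RuntimeError(f"stage failed after {max_attempts} attempts")


def flaky_task(task_id, attempt, barrier):
    # Hypothetical task: task 2 fails on the first attempt only.
    if attempt == 1 and task_id == 2:
        raise ValueError("simulated task failure")
    barrier.wait()  # global sync: returns only once all tasks have arrived
    return task_id * 10


results, attempts = run_barrier_stage(flaky_task, num_tasks=4)
print(results, attempts)  # [0, 10, 20, 30] 2
```

Note that the single failure in the first attempt causes all four tasks to
be rerun, not just the failed one, which is the behavior proposed for
barrier stages.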

Please feel free to review and discuss the design proposal.

Thanks,
Xingbo
