Hi all,

I would like to invite you to review the design document for Barrier Execution Mode:
https://docs.google.com/document/d/1GvcYR6ZFto3dOnjfLjZMtTezX0W5VYN9w1l4-tQXaZk/edit#

TL;DR: We announced Project Hydrogen at the recent Spark+AI Summit. A major part of the project involves significant changes to Spark's execution mode. This design doc proposes new APIs as well as a new execution mode (known as barrier execution mode) to provide high-performance support for DL workloads.

Major changes include:
- Add RDDBarrier to support gang scheduling.
- Add BarrierTaskContext to support global sync of all tasks in a stage.
- Use a better fault tolerance approach for barrier stages: if some tasks fail midway, retry all tasks in the same stage.
- Integrate barrier execution mode with the Standalone cluster manager.

Please feel free to review and discuss the design proposal.

Thanks,
Xingbo