Hi Wenchen,

Glad to know that you like this idea. We also looked into making this pluggable in our early design phase. While the ShuffleManager API for pluggable shuffle systems does leave considerable room for customizing Spark's shuffle behavior, we feel it is still not enough for this case.
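To make the gap concrete: a shuffle plugin controls how blocks are written and read, but the driver-side state mapping each map output to its current location sits outside it. Below is a minimal, hypothetical sketch of that state and the update we need after a push; the class and method names are invented for illustration and are not Spark's actual MapOutputTracker API.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, illustrative model of driver-side shuffle location tracking.
// The key point: after blocks are pushed/merged to a new host, the tracked
// location must be updated so reducers fetch from the merged location.
class PushAwareTracker {
    // shuffleId -> (mapId -> host currently serving that map output)
    private final Map<Integer, Map<Integer, String>> locations = new HashMap<>();

    // Called when a map task completes on `host`.
    void registerMapOutput(int shuffleId, int mapId, String host) {
        locations.computeIfAbsent(shuffleId, k -> new HashMap<>()).put(mapId, host);
    }

    // Called once the push of this map output finishes: the merged block now
    // lives on `mergedHost`, so update the tracked location before reduce starts.
    void updateAfterPush(int shuffleId, int mapId, String mergedHost) {
        Map<Integer, String> perShuffle = locations.get(shuffleId);
        if (perShuffle != null) {
            perShuffle.put(mapId, mergedHost);
        }
    }

    // Reducers consult the tracker for the current location of their input.
    String getLocation(int shuffleId, int mapId) {
        Map<Integer, String> perShuffle = locations.get(shuffleId);
        return perShuffle == null ? null : perShuffle.get(mapId);
    }
}
```

Because this state lives in the driver and is maintained by the scheduler, a plugin that only intercepts block reads and writes cannot perform the `updateAfterPush` step on its own.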
Right now, the shuffle block location information is tracked inside MapOutputTracker and updated by DAGScheduler. Since we are relocating shuffle blocks to improve overall shuffle throughput and efficiency, we need to be able to update the information tracked inside MapOutputTracker so that reducers can access their shuffle input more efficiently. Letting DAGScheduler orchestrate this process also helps it cope better with stragglers; if DAGScheduler has no control over, or is agnostic of, the block push progress, a few gaps remain.

On the shuffle Netty protocol side, there is a lot that can be leveraged from the existing code. With the improvements in SPARK-24355 and SPARK-30512, the shuffle service Netty server has become much more reliable, and the work in SPARK-6237 provides much of what is needed for streaming pushes of shuffle blocks. Instead of building all of this from scratch, we took the alternative route of implementing the shuffle block push operation on top of the existing Netty protocol.

We feel that this design has the potential to further improve the Spark shuffle system's scalability and efficiency, making Spark an even better compute engine. We would like to explore how we can leverage the shuffle plugin API to make this design more acceptable.

-----
Min Shen
Staff Software Engineer
LinkedIn

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/