The name "push-based shuffle" is a little misleading. This seems like a better shuffle service that co-locates shuffle blocks of one reducer at the map phase. I think this is a good idea. Is it possible to make it completely external via the shuffle plugin API? This looks like a good use case of the plugin API.
On Wed, Jan 22, 2020 at 3:03 PM mshen <ms...@apache.org> wrote:

> Hi Reynold,
>
> Thanks for the comments.
> Although in the SPIP doc a big portion of the problem motivation is around
> optimizing small random reads for shuffle, I believe the benefit of this
> design goes beyond that.
>
> In terms of the approach we take, it is true that the map phase still
> needs to materialize the intermediate shuffle data and the reduce phase
> does not start until the map phase is done.
> However, pushing the shuffle data to the locations where the reducers will
> run while the map stage is still running provides additional latency
> reduction beyond the disk read improvements.
> In fact, for the benchmark job we used to measure improvements, the
> reduction in shuffle fetch wait time accounts for only about 30% of the
> total task runtime reduction.
>
> Another benefit we can expect is improved shuffle reliability. By reducing
> the number of blocks that need to be fetched remotely and by keeping two
> replicas of the intermediate shuffle data, we also reduce the likelihood
> of shuffle fetch failures that lead to expensive retries.
>
> For the alternative approach of merging map outputs on each node locally,
> which is similar to Facebook's SOS shuffle service, there are a few
> downsides:
> 1. The merge ratio might not be high enough, depending on the average
> number of mapper tasks per node.
> 2. It does not deliver the shuffle partition data to the reducers; most of
> the reducer task input still needs to be fetched remotely.
> 3. At least in Facebook's Riffle paper, the local merge is performed by
> the shuffle service, since it needs to read multiple mappers' output. This
> means the memory buffering happens on the shuffle service side instead of
> on the executor side. While our approach also does memory buffering right
> now, we do it on the executor side, which makes it far less constrained
> than doing it inside the shuffle service.
>
> -----
> Min Shen
> Staff Software Engineer
> LinkedIn
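For anyone skimming the thread, a toy sketch of the co-location idea described above (all names hypothetical, not the actual SPIP code): each finished map output is split by reduce partition and every slice is shipped to the node chosen to merge that partition, so the reducer later reads one large, mostly local block instead of many small remote ones.

final case class BlockSlice(mapId: Long, reduceId: Int, bytes: Array[Byte])

class MapSidePusher(
    mergerForPartition: Int => String,    // reduceId -> merger host
    send: (String, BlockSlice) => Unit) { // transport, e.g. an RPC call

  // Push one completed map task's output, one slice per reduce partition,
  // while other map tasks of the same stage are still running.
  def pushMapOutput(mapId: Long, slices: IndexedSeq[Array[Byte]]): Unit =
    slices.zipWithIndex.foreach { case (bytes, reduceId) =>
      send(mergerForPartition(reduceId), BlockSlice(mapId, reduceId, bytes))
    }
}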