The name "push-based shuffle" is a little misleading. This seems like a
better shuffle service that co-locates shuffle blocks of one reducer at the
map phase. I think this is a good idea. Is it possible to make it
completely external via the shuffle plugin API? This looks like a good use
case of the plugin API.
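A minimal sketch of what the external wiring might look like, assuming the
shuffle I/O plugin API proposed in SPARK-25299 and a hypothetical plugin
class (both the config key and the class name below are illustrative, not
an existing implementation):

  import org.apache.spark.sql.SparkSession

  // Assumed: the push-based implementation is packaged as a shuffle I/O
  // plugin and selected purely through configuration, with no core changes.
  val spark = SparkSession.builder()
    .appName("push-based-shuffle-demo")
    .config("spark.shuffle.sort.io.plugin.class",     // plugin API config key (assumed)
            "com.example.shuffle.PushShuffleDataIO")  // hypothetical plugin class
    .getOrCreate()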

On Wed, Jan 22, 2020 at 3:03 PM mshen <ms...@apache.org> wrote:

> Hi Reynold,
>
> Thanks for the comments.
> Although a big portion of the problem motivation in the SPIP doc is around
> optimizing small random reads for shuffle, I believe the benefits of this
> design go beyond that.
>
> Regarding the approach we take, it is true that the map phase would still
> need to materialize the intermediate shuffle data, and that the reduce phase
> does not start until the map phase is done.
> However, pushing the shuffle data to the locations where the reducers will
> run, while the map stage is still running, does provide additional latency
> reduction beyond the disk read improvements.
> In fact, for the benchmark job we used to measure improvements, the
> reduction in shuffle fetch wait time accounts for only about 30% of the
> total task runtime reduction.
>
> Another benefit we can expect is improved shuffle reliability. By reducing
> the # of blocks that need to be fetched remotely, and by keeping a second
> replica of the intermediate shuffle data, we also reduce the likelihood of
> shuffle fetch failures that lead to expensive retries.
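> As a rough, made-up illustration of the fetch-count reduction (the numbers
> below are purely illustrative, not from our benchmark):
>
>   // Back-of-the-envelope remote fetch counts; all values are illustrative.
>   val numMappers  = 1000
>   val numReducers = 500
>
>   // Without merging, every reducer fetches one (often tiny) block per mapper.
>   val fetchesWithoutMerge = numMappers.toLong * numReducers     // 500,000
>
>   // With push-based merging, each shuffle partition is consolidated at its
>   // merger location, so a reducer mostly reads a few large merged chunks.
>   val mergedChunksPerPartition = 1L   // assumed; depends on merger assignment
>   val fetchesWithMerge = mergedChunksPerPartition * numReducers // ~500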
>
> As for the alternative approach of merging map outputs locally on each node,
> which is similar to Facebook's SOS shuffle service, there are a few
> downsides:
> 1. The merge ratio might not be high enough, depending on the avg # of
> mapper tasks per node.
> 2. It does not deliver the shuffle partition data to the reducers. Most of
> the reducer task input still needs to be fetched remotely.
> 3. At least in Facebook's Riffle paper, the local merge is performed by the
> shuffle service, since it needs to read multiple mappers' output. This means
> the memory buffering happens on the shuffle service side instead of on the
> executor side. While our approach also does memory buffering right now, we
> do it on the executor side, which makes it much less constrained than doing
> it inside the shuffle service.
>
>
>
> -----
> Min Shen
> Staff Software Engineer
> LinkedIn
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
