Hi Reynold,

Thanks for the comments.
Although a big portion of the problem motivation in the SPIP doc is around
optimizing small random reads for shuffle, I believe the benefits of this
design go beyond that.

In terms of the approach we take, it is true that the map phase would still
need to materialize the intermediate shuffle data and the reduce phase does
not start until the map phase is done.
However, pushing the shuffle data to the location where the reducers will
run while the map stage is running does help to provide additional latency
reduction beyond the disk read improvements.
In fact, for the benchmark job we used to measure improvements, the
reduction in shuffle fetch wait time accounts for only about 30% of the
total task runtime reduction.

Another benefit we can expect is improved shuffle reliability. By
reducing the number of blocks that need to be fetched remotely and by
keeping two replicas of the intermediate shuffle data, we can also reduce
the likelihood of encountering shuffle fetch failures that lead to
expensive retries.
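As a rough back-of-the-envelope illustration of the reliability point (the per-block failure rate and block count below are hypothetical parameters, not measured numbers), keeping a second replica sharply lowers the chance that a reduce task hits at least one fetch failure:

```python
# Back-of-the-envelope: probability that a reduce task sees at least one
# remote fetch failure, with and without a second replica of each block.
# Assumes independent failures; a fetch fails only if every replica of
# that block is unavailable. p and num_blocks are hypothetical values.

def task_fetch_failure_prob(num_blocks, p, replicas=1):
    """Probability that at least one of num_blocks fetches fails."""
    per_block_fail = p ** replicas
    return 1 - (1 - per_block_fail) ** num_blocks

blocks = 1000   # remote blocks fetched by one reduce task (hypothetical)
p = 0.001       # per-block fetch failure rate (hypothetical)

single = task_fetch_failure_prob(blocks, p, replicas=1)
double = task_fetch_failure_prob(blocks, p, replicas=2)
print(f"1 replica:  {single:.4f}")    # ~0.63
print(f"2 replicas: {double:.6f}")    # ~0.001
```

The exact numbers are made up; the point is that replication turns a per-block failure rate of p into roughly p squared, so the task-level failure probability drops by orders of magnitude.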

Regarding the alternative approach of merging map outputs locally on each
node, which is similar to Facebook's SOS shuffle service, there are a few
downsides:
1. The merge ratio might not be high enough, depending on the average
number of map tasks per node.
2. It does not deliver the shuffle partition data to the reducers. Most of
the reducer task input still needs to be fetched remotely.
3. At least in Facebook's Riffle paper, the local merge is performed by
the shuffle service, since it needs to read multiple mappers' outputs. This
means the memory buffering happens on the shuffle service side instead of
on the executor side. While our approach also does memory buffering right
now, we do it on the executor side, which makes it much less constrained
than doing it inside the shuffle service.
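To make downside 1 concrete, here is a small sketch of the arithmetic (all numbers hypothetical): node-local merging can only combine outputs of mappers that ran on the same node, so the block-count reduction is capped at the average number of map tasks per node, M/N.

```python
# Hypothetical illustration of the merge-ratio limit of node-local
# merging: shuffle blocks drop from M*R (one per mapper/reducer pair)
# to roughly N*R (one merged block per node/reducer pair), so the
# merge ratio is bounded by M/N, the avg number of map tasks per node.

def shuffle_block_counts(num_mappers, num_nodes, num_reducers):
    unmerged = num_mappers * num_reducers       # blocks without merging
    locally_merged = num_nodes * num_reducers   # blocks after per-node merge
    return unmerged, locally_merged, unmerged / locally_merged

# e.g. 2000 map tasks spread over 500 nodes, 1000 reducers (hypothetical)
unmerged, merged, ratio = shuffle_block_counts(2000, 500, 1000)
print(unmerged, merged, ratio)  # 2000000 500000 4.0
```

With only a few map tasks per node, the merge ratio stays small, and each reducer still has to issue many small remote fetches.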



-----
Min Shen
Staff Software Engineer
LinkedIn