Hi Reynold,

Thanks for the comments. Although a big portion of the problem motivation in the SPIP doc is around optimizing small random reads for shuffle, I believe the benefits of this design go beyond that.

In terms of the approach we take, it is true that the map phase still needs to materialize the intermediate shuffle data, and the reduce phase does not start until the map phase is done. However, pushing the shuffle data to the locations where the reducers will run, while the map stage is still running, provides additional latency reduction beyond the disk read improvements. In fact, for the benchmark job we used to measure improvements, the reduction in shuffle fetch wait time accounts for only about 30% of the total task runtime reduction.
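To make that overlap concrete, here is a minimal, self-contained Scala sketch of the idea (not the actual implementation proposed in the SPIP; ShuffleBlock, pushBlock, and the merger host names are illustrative stand-ins): each map task still materializes its output locally, but its blocks are then pushed asynchronously toward the nodes where reducers are expected to run, so the network transfer overlaps with the remaining map computation.

import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

// Hypothetical block descriptor; the real shuffle uses richer block IDs.
final case class ShuffleBlock(mapId: Int, reduceId: Int, bytes: Array[Byte])

object PushBasedShuffleSketch {
  // Dedicated pool so pushes do not compete with the threads running map tasks.
  private val pushService = Executors.newFixedThreadPool(4)
  implicit val pushEc: ExecutionContext =
    ExecutionContext.fromExecutorService(pushService)

  // Simulated remote push: in the real design this would be an RPC asking the
  // remote shuffle service to append the block into a merged per-partition file.
  def pushBlock(block: ShuffleBlock, mergerHost: String): Future[Unit] = Future {
    Thread.sleep(10) // stand-in for network transfer time
    println(s"pushed map=${block.mapId} reduce=${block.reduceId} to $mergerHost")
  }

  def main(args: Array[String]): Unit = {
    // Merger locations chosen up front, i.e. nodes where reducers will likely run.
    val mergers = Vector("host-a", "host-b")

    // Simulate map tasks running one after another; each task's blocks are
    // pushed asynchronously, so transfers overlap with later map computation.
    val pushes = (0 until 3).flatMap { mapId =>
      // Local materialization of the map output would happen here first
      // (disk write elided).
      (0 until 4).map { reduceId =>
        val block = ShuffleBlock(mapId, reduceId, Array.fill(8)(mapId.toByte))
        pushBlock(block, mergers(reduceId % mergers.size))
      }
    }

    // The reduce stage still starts only after the map stage (and all pushes)
    // finish; the gain is that most bytes have already been moved by then.
    Await.result(Future.sequence(pushes), 10.seconds)
    pushService.shutdown()
  }
}

Note the sketch keeps the same constraint described above: reducers do not start early; the win is that by the time they do, most of the data is already sitting where they run.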
Another benefit we can expect is improved shuffle reliability. By reducing the number of blocks that need to be fetched remotely, and by keeping a second replica of the intermediate shuffle data, we also reduce the likelihood of shuffle fetch failures that lead to expensive retries.

As for the alternative approach of merging map outputs locally on each node, which is similar to Facebook's SOS shuffle service, there are a few downsides:

1. The merge ratio might not be high enough, depending on the average number of mapper tasks per node, since a node-local merge can reduce the number of blocks by at most that factor.
2. It does not deliver the shuffle partition data to the reducers. Most of the reducer task input still needs to be fetched remotely.
3. At least in Facebook's Riffle paper, the local merge is performed by the shuffle service, since it needs to read the outputs of multiple mappers. This means the memory buffering happens inside the shuffle service rather than inside the executor. While our approach also does memory buffering right now, we do it on the executor side, which makes it much less constrained than doing it inside the shuffle service.

-----
Min Shen
Staff Software Engineer
LinkedIn