Re: Enabling push-based shuffle in Spark

2020-06-24 Thread mshen
Our paper summarizing this work on push-based shuffle was recently accepted by VLDB 2020. We have uploaded a preprint version of the paper to the JIRA ticket, along with the production results we have so far. - Min Shen Staff Software

Re: Enabling push-based shuffle in Spark

2020-01-27 Thread Long, Andrew
The easiest approach would be to create a fork of the code on GitHub. I can also accept diffs. Cheers Andrew From: Min Shen Date: Monday, January 27, 2020 at 12:48 PM To: "Long, Andrew", "dev@spark.apache.org" Subject: Re: Enabling push-based shuffle in Spark Hi Andrew, We

Re: Enabling push-based shuffle in Spark

2020-01-27 Thread Min Shen
Hi Andrew, We are leveraging SPARK-6237 to control the off-heap memory consumption due to Netty. With that change, the data is processed in a streaming fashion, so Netty does not buffer an entire RPC in memory before handing it over to the RpcHandler. We tested with our internal stress testing

Re: Enabling push-based shuffle in Spark

2020-01-23 Thread mshen
Hi Wenchen, Glad to know that you like this idea. We also looked into making this pluggable in our early design phase. While the ShuffleManager API for pluggable shuffle systems does provide quite some room for customized behaviors for Spark shuffle, we feel that it is still not enough for this

Re: Enabling push-based shuffle in Spark

2020-01-23 Thread Wenchen Fan
The name "push-based shuffle" is a little misleading. This seems like a better shuffle service that co-locates shuffle blocks of one reducer at the map phase. I think this is a good idea. Is it possible to make it completely external via the shuffle plugin API? This looks like a good use case of

Re: Enabling push-based shuffle in Spark

2020-01-21 Thread mshen
Hi Reynold, Thanks for the comments. Although in the SPIP doc, a big portion of the problem motivation is around optimizing small random reads for shuffle, I believe the benefit of this design is beyond that. In terms of the approach we take, it is true that the map phase would still need to

Re: Enabling push-based shuffle in Spark

2020-01-21 Thread Reynold Xin
limiting the number of concurrent streams you can write to). On Tue, Jan 21, 2020 at 6:13 PM, mshen <ms...@apache.org> wrote: > I'd like to start a discussion on enabling push-based shuffle in Spark. > This is meant to address issues with existing shuffle inefficien

Enabling push-based shuffle in Spark

2020-01-21 Thread mshen
I'd like to start a discussion on enabling push-based shuffle in Spark. This is meant to address issues with existing shuffle inefficiency in a large-scale Spark compute infra deployment. Facebook's previous talks on SOS shuffle <https://databricks.com/session/sos-optimizing-shuffle-