[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026109#comment-14026109 ]
Weihua Jiang commented on SPARK-2044:
-------------------------------------

Hi Matei,

Thanks for the reply. I am glad that you think pushing sorting into the interface is useful.

Yes, you are right. I misunderstood the partition id and map id.

As for the partition id range, I am totally OK with it.

> Pluggable interface for shuffles
> --------------------------------
>
>                 Key: SPARK-2044
>                 URL: https://issues.apache.org/jira/browse/SPARK-2044
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>         Attachments: Pluggableshuffleproposal.pdf
>
>
> Given that a lot of the current activity in Spark Core is in shuffles, I
> wanted to propose factoring out shuffle implementations in a way that will
> make experimentation easier. Ideally we will converge on one implementation,
> but for a while, this could also be used to have several implementations
> coexist. I'm suggesting this because I'm aware of at least three efforts to
> look at shuffle (from Yahoo!, Intel and Databricks). Some of the things
> people are investigating are:
> * Push-based shuffle where data moves directly from mappers to reducers
> * Sorting-based instead of hash-based shuffle, to create fewer files (helps a
> lot with file handles and memory usage on large shuffles)
> * External spilling within a key
> * Changing the level of parallelism or even algorithm for downstream stages
> at runtime based on statistics of the map output (this is a thing we had
> prototyped in the Shark research project but never merged in core)
>
> I've attached a design doc with a proposed interface. It's not too crazy
> because the interface between shuffles and the rest of the code is already
> pretty narrow (just some iterators for reading data and a writer interface
> for writing it). Bigger changes will be needed in the interaction with
> DAGScheduler and BlockManager for some of the ideas above, but we can handle
> those separately, and this interface will allow us to experiment with some
> short-term stuff sooner.
>
> If things go well I'd also like to send a sort-based shuffle implementation
> for 1.1, but we'll see how the timing on that works out.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
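
For readers following the thread, here is a rough sketch of the kind of narrow, pluggable interface the issue describes: a writer that takes one map task's output, and a reader that returns an iterator over a range of partition ids. It is written in Python purely for illustration and is not the interface from the attached design doc. All names (`Shuffle`, `HashShuffle`, `SortShuffle`, `partition_of`, `file_count`) are hypothetical, the half-open range `[start, end)` is an assumption, and in-memory lists stand in for files.

```python
import zlib
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, int]


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic stand-in for a partitioner (hypothetical helper)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class Shuffle(ABC):
    """Hypothetical plug-in point: a writer per map task, a reader per partition range."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions

    @abstractmethod
    def write(self, map_id: int, records: Iterable[Record]) -> None:
        """Store the output of one map task."""

    @abstractmethod
    def read(self, start: int, end: int) -> Iterator[Record]:
        """Iterate over all records in partitions [start, end) (assumed half-open)."""

    @abstractmethod
    def file_count(self) -> int:
        """Number of simulated output files."""


class HashShuffle(Shuffle):
    """One simulated file per (map task, partition) pair that receives data."""

    def __init__(self, num_partitions: int) -> None:
        super().__init__(num_partitions)
        self.files: Dict[Tuple[int, int], List[Record]] = {}

    def write(self, map_id: int, records: Iterable[Record]) -> None:
        for key, value in records:
            pid = partition_of(key, self.num_partitions)
            self.files.setdefault((map_id, pid), []).append((key, value))

    def read(self, start: int, end: int) -> Iterator[Record]:
        for (_, pid), recs in self.files.items():
            if start <= pid < end:
                yield from recs

    def file_count(self) -> int:
        return len(self.files)


class SortShuffle(Shuffle):
    """One simulated file per map task, with records sorted by partition id."""

    def __init__(self, num_partitions: int) -> None:
        super().__init__(num_partitions)
        self.files: Dict[int, List[Tuple[int, str, int]]] = {}

    def write(self, map_id: int, records: Iterable[Record]) -> None:
        tagged = [(partition_of(k, self.num_partitions), k, v) for k, v in records]
        # Sorting by partition id keeps each partition contiguous within the file.
        self.files[map_id] = sorted(tagged, key=lambda t: t[0])

    def read(self, start: int, end: int) -> Iterator[Record]:
        for tagged in self.files.values():
            for pid, key, value in tagged:
                if start <= pid < end:
                    yield (key, value)

    def file_count(self) -> int:
        return len(self.files)
```

Because callers only see `write` and `read`, either implementation can be swapped in without changing the calling code. The sort-based variant never produces more simulated files than the hash-based one, which is the "fewer files" point from the issue description.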