[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles

Weihua Jiang (JIRA) Mon, 09 Jun 2014 21:42:22 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026109#comment-14026109
 ]


Weihua Jiang commented on SPARK-2044:
-------------------------------------

Hi Matei,

Thanks for the reply. 

I am glad that you thinking pushing sorting into the interface is useful.

Yes, you are right. I misunderstand the partition id and map id. For partition 
id range, I am totally OK with it.

> Pluggable interface for shuffles
> --------------------------------
>
>                 Key: SPARK-2044
>                 URL: https://issues.apache.org/jira/browse/SPARK-2044
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>         Attachments: Pluggableshuffleproposal.pdf
>
>
> Given that a lot of the current activity in Spark Core is in shuffles, I 
> wanted to propose factoring out shuffle implementations in a way that will 
> make experimentation easier. Ideally we will converge on one implementation, 
> but for a while, this could also be used to have several implementations 
> coexist. I'm suggesting this because I aware of at least three efforts to 
> look at shuffle (from Yahoo!, Intel and Databricks). Some of the things 
> people are investigating are:
> * Push-based shuffle where data moves directly from mappers to reducers
> * Sorting-based instead of hash-based shuffle, to create fewer files (helps a 
> lot with file handles and memory usage on large shuffles)
> * External spilling within a key
> * Changing the level of parallelism or even algorithm for downstream stages 
> at runtime based on statistics of the map output (this is a thing we had 
> prototyped in the Shark research project but never merged in core)
> I've attached a design doc with a proposed interface. It's not too crazy 
> because the interface between shuffles and the rest of the code is already 
> pretty narrow (just some iterators for reading data and a writer interface 
> for writing it). Bigger changes will be needed in the interaction with 
> DAGScheduler and BlockManager for some of the ideas above, but we can handle 
> those separately, and this interface will allow us to experiment with some 
> short-term stuff sooner.
> If things go well I'd also like to send a sort-based shuffle implementation 
> for 1.1, but we'll see how the timing on that works out.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles

Reply via email to