[ https://issues.apache.org/jira/browse/SPARK-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100043#comment-14100043 ]

Mridul Muralidharan commented on SPARK-3019:
--------------------------------------------

Unfortunately, I never dug into how MR does shuffle - though I was supposed to 
look into this with Tom in the Q1-Q2 timeframe - so hopefully I am not way off 
base here!

bq. It's true that we can't start on a function which requires a full view of 
the data coming in for a particular key, but we can start merging and combining.

In the case of Spark, unlike MR, we cannot start merging/combining until all 
blocks are fetched.
Well, technically we can - but we would end up repeating the merge/combine for 
each new map output fetched, which would be very suboptimal since we would be 
reading from disk many more times (hope I did not get this wrong /CC 
[~matei]).
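
To make the cost concrete, here is a toy back-of-the-envelope sketch (plain 
Scala, not Spark code; the block sizes are made up) counting how many records 
each strategy reads from disk:

{code}
// Toy cost model (not Spark code): compare records read from disk when
// merging once after all fetches vs. re-merging after every fetched block.
object MergeCost {
  def main(args: Array[String]): Unit = {
    val blockSizes = Seq(100L, 100L, 100L, 100L) // records per fetched map output

    // Merge once at the end: each record is read from disk exactly once.
    val mergeOnceReads = blockSizes.sum

    // Merge after every fetch: pass i re-reads everything merged so far
    // plus the newly fetched block.
    val eagerReads = blockSizes.indices.map { i =>
      blockSizes.take(i).sum + blockSizes(i)
    }.sum

    println(s"merge-once reads:  $mergeOnceReads records") // 400
    println(s"eager-merge reads: $eagerReads records")     // 1000
  }
}
{code}

With n equally sized blocks, the eager strategy reads O(n^2) records instead 
of O(n), which is the re-reading penalty described above.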


bq. MapReduce makes this assessment. Each reducer has a pool of memory for 
fetching data into, and avoids fetching more data than can fit into this pool. 
I was under the impression that Spark does something similar.

In the case of hash-based shuffle, this is obviously not possible.
In the case of sort-based shuffle, I can see this being possible, but it is 
not supported (IIRC /CC [~matei]).
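
For reference, a minimal sketch of that MR-style bounded fetch pool 
(illustrative Scala only; the class and method names are made up, this is not 
MR's or Spark's actual code):

{code}
// Illustrative sketch of a reducer-side fetch memory pool: a fetch is only
// issued once its block fits in the remaining budget, so the reducer never
// holds more fetched data in memory than the pool allows.
class FetchPool(maxBytes: Long) {
  private var usedBytes = 0L

  // Block until `size` bytes can be reserved from the pool.
  def reserve(size: Long): Unit = synchronized {
    while (usedBytes + size > maxBytes) wait()
    usedBytes += size
  }

  // Return the bytes to the pool once the fetched block has been consumed.
  def release(size: Long): Unit = synchronized {
    usedBytes -= size
    notifyAll()
  }
}

// Usage: pool.reserve(blockSize) before issuing the fetch,
// pool.release(blockSize) after the block has been merged/consumed.
{code}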


> Pluggable block transfer (data plane communication) interface
> -------------------------------------------------------------
>
>                 Key: SPARK-3019
>                 URL: https://issues.apache.org/jira/browse/SPARK-3019
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>         Attachments: PluggableBlockTransferServiceProposalforSpark - draft 1.pdf
>
>
> The attached design doc proposes a standard interface for block transferring, 
> which will make future engineering of this functionality easier, allowing the 
> Spark community to provide alternative implementations.
> Block transferring is a critical function in Spark. All of the following 
> depend on it:
> * shuffle
> * torrent broadcast
> * block replication in BlockManager
> * remote block reads for tasks scheduled without locality
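
A minimal sketch of what a standard interface along these lines could look 
like (the trait and method names below are illustrative assumptions, not the 
interface in the attached design doc):

{code}
import java.nio.ByteBuffer
import scala.concurrent.Future

// Illustrative only: a transport-agnostic block transfer interface. Shuffle,
// torrent broadcast, replication and remote reads would all go through it,
// so alternative transports (e.g. Netty vs. NIO) become drop-in swaps.
trait BlockTransferService {
  // Bind to a transport-specific port and start serving local blocks.
  def init(): Unit

  // Asynchronously fetch remote blocks by id.
  def fetchBlocks(host: String, port: Int, blockIds: Seq[String]): Future[Seq[ByteBuffer]]

  // Push a local block to a remote node (e.g. for replication).
  def uploadBlock(host: String, port: Int, blockId: String, data: ByteBuffer): Future[Unit]

  // Release transport resources.
  def stop(): Unit
}
{code}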


