[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

Sandy Ryza (JIRA) Tue, 02 Sep 2014 08:55:10 -0700

    [ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118286#comment-14118286 ]


Sandy Ryza commented on SPARK-2978:
-----------------------------------

IIUC, that would require using ShuffledRDD directly.  Would we be comfortable 
removing its @DeveloperApi tag?

Another option, which would let us avoid deciding how to expose the grouping, 
would be to expose a repartitionAndSortWithinPartition transform.  Hive would 
then handle the grouping itself on the sorted stream.
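
To make that second option concrete, here is a minimal sketch in plain Python 
(not Spark code) of what the consumer side could look like: records are routed 
to partitions and sorted by key within each partition, and the caller then 
groups consecutive equal keys on the sorted stream.  The names 
repartition_and_sort, group_sorted_stream and partition_of are hypothetical 
and only illustrate the idea; the default hash-modulo partitioner is likewise 
an assumption, not a statement about what Spark or Hive would actually use.

```python
from itertools import groupby
from operator import itemgetter


def repartition_and_sort(records, num_partitions, partition_of=None):
    """Route each (key, value) record to a partition, then sort each
    partition by key. Hypothetical stand-in for the proposed transform."""
    if partition_of is None:
        # Assumption: simple hash-modulo partitioning.
        partition_of = lambda key: hash(key) % num_partitions
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_of(key)].append((key, value))
    # sorted() is stable, so values keep their arrival order within a key.
    return [sorted(part, key=itemgetter(0)) for part in partitions]


def group_sorted_stream(sorted_records):
    """Yield (key, values) pairs from a stream already sorted by key.
    This is the grouping step the caller (e.g. Hive) would do itself."""
    for key, run in groupby(sorted_records, key=itemgetter(0)):
        # Materialized as a list for simplicity; an MR-style consumer
        # would more likely iterate over the values lazily.
        yield key, [value for _, value in run]


records = [(3, "c"), (1, "a"), (2, "b"), (1, "d"), (3, "e")]
parts = repartition_and_sort(records, num_partitions=2)
grouped = [list(group_sorted_stream(part)) for part in parts]
# grouped == [[(2, ['b'])], [(1, ['a', 'd']), (3, ['c', 'e'])]]
```

Because equal keys are adjacent after the sort, the grouping needs only a 
single pass and never has to hold more than one key's values at a time, which 
is why leaving it to the caller seems workable.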

> Provide an MR-style shuffle transformation
> ------------------------------------------
>
>                 Key: SPARK-2978
>                 URL: https://issues.apache.org/jira/browse/SPARK-2978
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Sandy Ryza
>
> For Hive on Spark joins in particular, and for running legacy MR code in 
> general, I think it would be useful to provide a transformation with the 
> semantics of the Hadoop MR shuffle, i.e. one that
> * groups by key: provides (Key, Iterator[Value])
> * within each partition, provides keys in sorted order
> A couple of ways it could make sense to expose this:
> * Add a new operator, e.g. "groupAndSortByKey", 
> "groupByKeyAndSortWithinPartition", or "hadoopStyleShuffle"
> * Allow groupByKey to take an ordering param for keys within a partition



