[ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105126#comment-14105126
 ] 

Sandy Ryza commented on SPARK-2978:
-----------------------------------

So I started looking into this a little more and wanted to bring up a semantics 
issue I came across.

The proposed implementation would be to use a similar path to that used by 
sortByKey in each reduce task, and then wrap the Iterator over sorted records 
with an Iterator that groups them.  I.e. wrap the Iterator[(K, V)] in an 
Iterator[(K, Iterator[V])].  The question is how to handle the validity of an 
inner V iterator with respect to the outer Iterator.  The options as I see it 
are:
1. Calling next() or hasNext() on the outer iterator invalidates the current 
inner V iterator.
2. The inner V iterator must be exhausted before calling next() or hasNext() on 
the outer iterator.
3. On each next() call on the outer iterator, scan over all the values for that 
key and put them in a separate buffer. 
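
To make (3) concrete, here is a rough sketch of what the buffering variant 
could look like.  It assumes the input is already sorted by key, and 
BufferedGroupingIterator is a made-up name for illustration, not existing 
Spark code.  The obvious cost is that all values for one key must fit in 
memory at once.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper (not part of Spark): groups runs of equal keys from an
// iterator that is assumed to be sorted by key already.
class BufferedGroupingIterator[K, V](sorted: Iterator[(K, V)])
    extends Iterator[(K, Iterator[V])] {

  // buffered gives us head, so we can peek at the next key without consuming it.
  private val in = sorted.buffered

  override def hasNext: Boolean = in.hasNext

  override def next(): (K, Iterator[V]) = {
    val key = in.head._1 // throws NoSuchElementException when input is empty
    // Option 3: copy every value for this key into a separate buffer, so the
    // returned inner iterator stays valid no matter what the outer one does.
    val values = ArrayBuffer.empty[V]
    while (in.hasNext && in.head._1 == key) {
      values += in.next()._2
    }
    (key, values.iterator)
  }
}
```

With this variant a caller can hold on to every inner iterator and read them 
in any order, since each one is backed by its own buffer.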

The MapReduce approach, where the outer iterator is replaced by a sequence of 
calls to the reduce function, is similar to (1).
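
For comparison, a rough sketch of (1), where nothing is buffered and 
advancing the outer iterator invalidates the current inner one.  Again this 
assumes sorted input; StreamingGroupingIterator and the generation counter 
are my own illustration, and throwing IllegalStateException on a stale inner 
iterator is just one possible way to surface the invalidation.

```scala
// Hypothetical helper (not part of Spark): streams values straight from the
// sorted input, so an inner iterator is only usable until the outer iterator
// is touched again.
class StreamingGroupingIterator[K, V](sorted: Iterator[(K, V)])
    extends Iterator[(K, Iterator[V])] {

  private val in = sorted.buffered
  private var generation = 0              // bumped on every outer call
  private var currentKey: Option[K] = None

  // Invalidate the current inner iterator and skip its unread values.
  private def advance(): Unit = {
    generation += 1
    currentKey.foreach { k =>
      while (in.hasNext && in.head._1 == k) in.next()
    }
    currentKey = None
  }

  override def hasNext: Boolean = { advance(); in.hasNext }

  override def next(): (K, Iterator[V]) = {
    advance()
    val key = in.head._1 // throws NoSuchElementException when input is empty
    currentKey = Some(key)
    val myGeneration = generation
    val values = new Iterator[V] {
      private def check(): Unit =
        if (myGeneration != generation)
          throw new IllegalStateException("inner iterator was invalidated")
      override def hasNext: Boolean = {
        check()
        in.hasNext && in.head._1 == key
      }
      override def next(): V = {
        if (!hasNext) throw new NoSuchElementException("no more values for key")
        in.next()._2
      }
    }
    (key, values)
  }
}
```

Memory use here is constant per key, at the price of the stricter contract: 
a caller that keeps an old inner iterator around gets an exception instead 
of values.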

When the Iterators returned by groupByKey are eventually disk-backed, we'll 
face the same issue, so we probably want to make the semantics there consistent 
with whatever we decide here.


> Provide an MR-style shuffle transformation
> ------------------------------------------
>
>                 Key: SPARK-2978
>                 URL: https://issues.apache.org/jira/browse/SPARK-2978
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Sandy Ryza
>
> For Hive on Spark joins in particular, and for running legacy MR code in 
> general, I think it would be useful to provide a transformation with the 
> semantics of the Hadoop MR shuffle, i.e. one that
> * groups by key: provides (Key, Iterator[Value])
> * within each partition, provides keys in sorted order
> A couple ways that could make sense to expose this:
> * Add a new operator.  "groupAndSortByKey", 
> "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle", maybe?
> * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.2#6252)

