[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105126#comment-14105126 ]
Sandy Ryza commented on SPARK-2978:
-----------------------------------

So I started looking into this a little more and wanted to bring up a semantics issue I came across.

The proposed implementation would use a path similar to the one sortByKey uses in each reduce task, and then wrap the Iterator over sorted records with an Iterator that groups them, i.e. wrap the Iterator[(K, V)] in an Iterator[(K, Iterator[V])].

The question is how to handle the validity of an inner V iterator with respect to the outer Iterator. The options as I see them are:

1. Calling next() or hasNext() on the outer iterator invalidates the current inner V iterator.
2. The inner V iterator must be exhausted before calling next() or hasNext() on the outer iterator.
3. On each next() call on the outer iterator, scan over all the values for that key and put them in a separate buffer.

The MapReduce approach, where the outer iterator is replaced by a sequence of calls to the reduce function, is similar to (1).

When the Iterators returned by groupByKey are eventually disk-backed, we'll face the same issue, so we probably want to make the semantics there consistent with whatever we decide here.

> Provide an MR-style shuffle transformation
> ------------------------------------------
>
>                 Key: SPARK-2978
>                 URL: https://issues.apache.org/jira/browse/SPARK-2978
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Sandy Ryza
>
> For Hive on Spark joins in particular, and for running legacy MR code in
> general, I think it would be useful to provide a transformation with the
> semantics of the Hadoop MR shuffle, i.e. one that
> * groups by key: provides (Key, Iterator[Value])
> * within each partition, provides keys in sorted order
> A couple of ways that could make sense to expose this:
> * Add a new operator. "groupAndSortByKey",
> "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle", maybe?
> * Allow groupByKey to take an ordering param for keys within a partition
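
To make the trade-off between the inner-iterator options in the comment concrete, here is a minimal Scala sketch of options (3) and (1). It is only an illustration, not the proposed Spark implementation: the class names BufferedGroupingIterator and InvalidatingGroupingIterator are hypothetical, and both assume the input iterator already yields records with equal keys adjacent to each other, as a sort would guarantee.

```scala
import scala.collection.mutable.ArrayBuffer

// Option (3), hypothetical name: on each next(), copy every value for the
// key into a buffer. The inner iterator stays valid, at the cost of memory.
class BufferedGroupingIterator[K, V](sorted: Iterator[(K, V)])
    extends Iterator[(K, Iterator[V])] {

  private val in = sorted.buffered

  override def hasNext: Boolean = in.hasNext

  override def next(): (K, Iterator[V]) = {
    if (!in.hasNext) throw new NoSuchElementException("no more keys")
    val key = in.head._1
    val values = ArrayBuffer.empty[V]
    while (in.hasNext && in.head._1 == key) values += in.next()._2
    (key, values.iterator)
  }
}

// Option (1), hypothetical name: hasNext or next() on the outer iterator
// skips unread values of the current key and invalidates its inner iterator.
class InvalidatingGroupingIterator[K, V](sorted: Iterator[(K, V)])
    extends Iterator[(K, Iterator[V])] {

  private val in = sorted.buffered
  private var current: Option[K] = None
  private var generation = 0 // bumped whenever the outer iterator is touched

  private def advance(): Unit = {
    current.foreach { k => while (in.hasNext && in.head._1 == k) in.next() }
    current = None
    generation += 1
  }

  override def hasNext: Boolean = { advance(); in.hasNext }

  override def next(): (K, Iterator[V]) = {
    advance()
    if (!in.hasNext) throw new NoSuchElementException("no more keys")
    val key = in.head._1
    current = Some(key)
    val myGeneration = generation
    val values = new Iterator[V] {
      // Fail fast if the outer iterator has been used since we were created.
      private def check(): Unit =
        if (myGeneration != generation)
          throw new IllegalStateException("inner iterator invalidated")
      override def hasNext: Boolean = { check(); in.hasNext && in.head._1 == key }
      override def next(): V = {
        if (!hasNext) throw new NoSuchElementException("no more values")
        in.next()._2
      }
    }
    (key, values)
  }
}
```

In this sketch the buffered variant holds all values for one key in memory at once, while the invalidating variant streams values but throws if a stale inner iterator is used. Option (2) would look like the invalidating variant with the skipping step replaced by an error when unread values remain.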