[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sandy Ryza updated SPARK-2978:
------------------------------
    Description:
For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator. "groupAndSortByKey", "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle", maybe?
* Allow groupByKey to take an ordering param for keys within a partition

  was:
For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator. "groupAndSortByKey", "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle"
* Allow groupByKey to take an ordering param for keys within a partition

> Provide an MR-style shuffle transformation
> ------------------------------------------
>
>                 Key: SPARK-2978
>                 URL: https://issues.apache.org/jira/browse/SPARK-2978
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Sandy Ryza
>
> For Hive on Spark joins in particular, and for running legacy MR code in
> general, I think it would be useful to provide a transformation with the
> semantics of the Hadoop MR shuffle, i.e. one that
> * groups by key: provides (Key, Iterator[Value])
> * within each partition, provides keys in sorted order
> A couple ways that could make sense to expose this:
> * Add a new operator. "groupAndSortByKey",
> "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle", maybe?
> * Allow groupByKey to take an ordering param for keys within a partition
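
To make the requested semantics concrete, here is a minimal sketch in plain Python that simulates them on an in-memory list of records. It is not Spark code and does not reflect any existing or planned Spark API: the function name mr_style_shuffle and its parameters num_partitions and partitioner are hypothetical, chosen only for illustration. It assumes keys are hashable and mutually comparable.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, Iterator, List, Tuple


def mr_style_shuffle(
    records: Iterable[Tuple[Hashable, object]],
    num_partitions: int,
    partitioner: Callable[[Hashable], int] = hash,
) -> List[List[Tuple[Hashable, Iterator]]]:
    """Simulate an MR-style shuffle (hypothetical helper, not a Spark API).

    Each output partition is a list of (key, iterator-of-values) pairs,
    with the keys of that partition in sorted order.
    """
    # Step 1: route every record to a partition and group its values by key.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in records:
        partitions[partitioner(key) % num_partitions][key].append(value)

    # Step 2: within each partition, emit the keys in sorted order,
    # pairing each key with an iterator over its grouped values.
    return [
        [(key, iter(groups[key])) for key in sorted(groups)]
        for groups in partitions
    ]


if __name__ == "__main__":
    # Integer keys keep the default hash-based partitioning deterministic.
    records = [(3, "c"), (1, "a"), (4, "d"), (1, "b"), (2, "e"), (3, "f")]
    for index, partition in enumerate(mr_style_shuffle(records, 2)):
        print(index, [(key, list(values)) for key, values in partition])
```

Note that the ordering guarantee is per partition only: keys are sorted inside each partition, but nothing is implied about the order of keys across partitions.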