[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126454#comment-14126454 ]
Sandy Ryza commented on SPARK-3441:
-----------------------------------

Right. It's not much work, but there are some questions (posted on SPARK-2978) about exactly what the semantics of such a wrapper should be. The concern was that we would want to make groupByKey consistent with it once it supports disk-backed keys, and I didn't feel comfortable locking that behavior down right now. Happy to add a wrapper if we can come to a decision there.

> Explain in docs that repartitionAndSortWithinPartitions enacts a Hadoop-style shuffle
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-3441
>                 URL: https://issues.apache.org/jira/browse/SPARK-3441
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Sandy Ryza
>
> I think it would be good to say something like this in the doc for
> repartitionAndSortWithinPartitions, and maybe also in the doc for groupBy:
> {code}
> This can be used to enact a "Hadoop-style" shuffle along with a call to
> mapPartitions, e.g.:
> rdd.repartitionAndSortWithinPartitions(part).mapPartitions(...)
> {code}
> It might also be nice to add a version that doesn't take a partitioner and/or
> to mention this in the groupBy javadoc. I guess it depends a bit on whether we
> consider this to be an API we want people to use more widely, or whether we
> just consider it a narrow stable API mostly for Hive-on-Spark. If we want
> people to consider this API when porting workloads from Hadoop, then it might
> be worth documenting better.
> What do you think [~rxin] and [~matei]?
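For reference, a minimal self-contained Scala sketch of the rdd.repartitionAndSortWithinPartitions(part).mapPartitions(...) pattern the issue proposes documenting. The HashPartitioner, the sample (word, count) data, and the running-sum "reducer" step are illustrative assumptions, not part of any proposed API or doc change:

{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object HadoopStyleShuffleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("hadoop-style-shuffle").setMaster("local[*]"))

    // Sample (word, count) pairs to be aggregated "reducer side".
    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3), ("c", 4)))

    // Partition by key and sort each partition by key -- the rough
    // equivalent of Hadoop's shuffle + sort phase.
    val shuffled = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

    // mapPartitions plays the role of the reduce phase: because each
    // partition is sorted by key, all values for a key arrive consecutively
    // and can be folded without buffering the whole group in memory.
    val reduced = shuffled.mapPartitions { iter =>
      new Iterator[(String, Int)] {
        private val buffered = iter.buffered
        def hasNext: Boolean = buffered.hasNext
        def next(): (String, Int) = {
          val (key, first) = buffered.next()
          var sum = first
          while (buffered.hasNext && buffered.head._1 == key) {
            sum += buffered.next()._2
          }
          (key, sum)
        }
      }
    }

    reduced.collect().foreach(println)
    sc.stop()
  }
}
{code}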