[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386541#comment-14386541 ]
Sean Owen commented on SPARK-3441:
----------------------------------

This is mentioned in the change for https://github.com/apache/spark/pull/5074 but I think the work here is to explain more deeply the rationale and partitioner details in the scaladoc

> Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-3441
>                 URL: https://issues.apache.org/jira/browse/SPARK-3441
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Sandy Ryza
>
> I think it would be good to say something like this in the doc for
> repartitionAndSortWithinPartitions and add also maybe in the doc for groupBy:
> {code}
> This can be used to enact a "Hadoop Style" shuffle along with a call to
> mapPartitions, e.g.:
>    rdd.repartitionAndSortWithinPartitions(part).mapPartitions(...)
> {code}
> It might also be nice to add a version that doesn't take a partitioner and/or
> to mention this in the groupBy javadoc. I guess it depends a bit whether we
> consider this to be an API we want people to use more widely or whether we
> just consider it a narrow stable API mostly for Hive-on-Spark. If we want
> people to consider this API when porting workloads from Hadoop, then it might
> be worth documenting better.
> What do you think [~rxin] and [~matei]?
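
For illustration, the pattern quoted in the issue could be fleshed out roughly as below. This is only a sketch under assumptions, not proposed scaladoc wording: the object name HadoopStyleShuffleSketch, the helper sumSortedRuns, the sample data, the choice of HashPartitioner, and the local[2] master are all made up for the example. The idea is that repartitionAndSortWithinPartitions delivers each partition sorted by key, so the function passed to mapPartitions can act like a Hadoop reducer and process one run of equal keys at a time without holding a whole group in memory.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object HadoopStyleShuffleSketch {
  // Reducer-like step (hypothetical helper): walks one partition's records,
  // which are assumed to arrive sorted by key, and sums each run of equal keys.
  def sumSortedRuns(records: Iterator[(String, Int)]): Iterator[(String, Int)] = {
    val it = records.buffered
    new Iterator[(String, Int)] {
      def hasNext: Boolean = it.hasNext
      def next(): (String, Int) = {
        val key = it.head._1
        var sum = 0
        // Consume the whole run of records that share the current key.
        while (it.hasNext && it.head._1 == key) sum += it.next()._2
        (key, sum)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setAppName("hadoop-style-shuffle").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq("b" -> 1, "a" -> 2, "b" -> 3, "a" -> 4, "c" -> 5))
      // The partitioner decides which partition each key lands in; sorting by key
      // happens inside the shuffle. Spark versions before 1.3 also need
      // `import org.apache.spark.SparkContext._` for the pair-RDD implicits.
      val part = new HashPartitioner(2)
      val reduced = pairs
        .repartitionAndSortWithinPartitions(part)
        .mapPartitions(iter => sumSortedRuns(iter), preservesPartitioning = true)
      reduced.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The helper relies on all records for a key being adjacent, which holds here because every key is routed to exactly one partition and each partition is sorted by key.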