[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle
[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386541#comment-14386541 ]

Sean Owen commented on SPARK-3441:
----------------------------------

This is mentioned in the change for https://github.com/apache/spark/pull/5074, but I think the work here is to explain the rationale and the partitioner details more deeply in the scaladoc.

> Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-3441
>                 URL: https://issues.apache.org/jira/browse/SPARK-3441
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Sandy Ryza
>
> I think it would be good to say something like this in the doc for repartitionAndSortWithinPartitions, and maybe also in the doc for groupBy:
> {code}
> This can be used to enact a Hadoop-style shuffle along with a call to mapPartitions, e.g.:
> rdd.repartitionAndSortWithinPartitions(part).mapPartitions(...)
> {code}
> It might also be nice to add a version that doesn't take a partitioner, and/or to mention this in the groupBy javadoc. I guess it depends a bit on whether we consider this to be an API we want people to use more widely, or whether we just consider it a narrow stable API mostly for Hive-on-Spark. If we want people to consider this API when porting workloads from Hadoop, then it might be worth documenting better. What do you think, [~rxin] and [~matei]?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
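To make the pattern suggested in the description concrete, here is a sketch (not part of the ticket) that simulates the guarantee repartitionAndSortWithinPartitions makes when given a HashPartitioner: records are routed to partitions by key hash, then sorted by key within each partition (never globally). It uses plain Scala collections so it runs without a Spark runtime; the names `RepartitionAndSortSketch` and `repartitionAndSort` are invented for illustration.

{code}
// Sketch: simulate repartitionAndSortWithinPartitions(new HashPartitioner(n))
// on a local collection, without a Spark dependency.
object RepartitionAndSortSketch {
  // Route each record to a partition by key hash (the same non-negative-mod
  // rule Spark's HashPartitioner applies), then sort by key per partition.
  def repartitionAndSort[K: Ordering, V](
      records: Seq[(K, V)], numPartitions: Int): Seq[Seq[(K, V)]] = {
    val buckets =
      Array.fill(numPartitions)(scala.collection.mutable.ArrayBuffer.empty[(K, V)])
    records.foreach { case rec @ (k, _) =>
      val p = ((k.hashCode % numPartitions) + numPartitions) % numPartitions
      buckets(p) += rec
    }
    // Sorting happens within each partition; there is no order across partitions.
    buckets.map(_.sortBy(_._1).toSeq).toSeq
  }

  def main(args: Array[String]): Unit = {
    val parts = repartitionAndSort(Seq(("b", 2), ("a", 1), ("b", 3), ("a", 4)), 2)
    // Each inner Seq is key-sorted; iterating over it in mapPartitions then
    // plays the role of a Hadoop reducer walking its sorted input.
    parts.foreach(println)
  }
}
{code}

Since all records sharing a key land in the same partition and arrive adjacently in sorted order, a single mapPartitions pass can process one key's values at a time, which is exactly the Hadoop reducer contract.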
[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365896#comment-14365896 ]

Apache Spark commented on SPARK-3441:
--------------------------------------

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/5074
[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336948#comment-14336948 ]

Sean Owen commented on SPARK-3441:
----------------------------------

Since another shuffle-related doc ticket came up for action (http://issues.apache.org/jira/browse/SPARK-5750), I wonder: is this still live? Is it just a matter of documenting something, or does it need a code change?
[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125936#comment-14125936 ]

Sandy Ryza commented on SPARK-3441:
-----------------------------------

I'll add a mention that this can be used to get a Hadoop-style shuffle. Because we opted not to provide the grouping, the API isn't ideal for porting workloads from Hadoop: the amount of code required to replicate the Hadoop shuffle semantics is probably more than we can fit in the javadoc. I'm not opposed to adding a note that it can be used as a building block.

I considered a version without a partitioner, but I couldn't think of a situation where one would care that records are sorted within a partition, yet not need to be specific about which keys end up in which partitions. Anything you can think of that I'm missing?
[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126175#comment-14126175 ]

Matei Zaharia commented on SPARK-3441:
---------------------------------------

I agree that we should have more of a doc here to explain the rationale; I missed it earlier. BTW, for a partitioner we should note that people can pass `new HashPartitioner` - no need to write your own.

One case where you may not care about giving a Partitioner is if you just want to do some kind of groupBy / join that spills externally. So that may also be useful, but since this is a pretty advanced API, I don't think it matters that much.
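For reference, the rule `new HashPartitioner(n)` applies when assigning a key to a partition can be written as a plain function; this is a sketch for illustration (the function name `hashPartition` is invented here), matching the documented behavior of taking a non-negative modulus of the key's hashCode and sending null keys to partition 0:

{code}
// Sketch: the partition-assignment rule of Spark's HashPartitioner,
// written standalone for illustration.
def hashPartition(key: Any, numPartitions: Int): Int = {
  if (key == null) 0 // HashPartitioner maps null keys to partition 0
  else {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw // keep result in [0, numPartitions)
  }
}
{code}

The non-negative adjustment matters because Java/Scala `%` can return negative values for negative hash codes, and a partition index must be in `[0, numPartitions)`.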
[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126192#comment-14126192 ]

Sandy Ryza commented on SPARK-3441:
-----------------------------------

bq. One case where you may not care about giving a Partitioner is if you just want to do some kind of groupBy / join that spills externally.

You mean so that the values for a single key can be disk-backed? Eventually we want join and groupByKey to handle this themselves, right?
[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126308#comment-14126308 ]

Patrick Wendell commented on SPARK-3441:
-----------------------------------------

Hey [~sandyr] - what do you mean by grouping? Just that the user has to write their own code to detect the boundaries between keys? I wonder if we could write a simple wrapper that does that (though @rxin pointed out to me this would be bad if a user cached it).
[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126454#comment-14126454 ]

Sandy Ryza commented on SPARK-3441:
-----------------------------------

Right. It's not much work, but there are some questions (posted on SPARK-2978) about exactly what the semantics of such a wrapper should be. The concern was that we would want to make groupByKey consistent with it when it supports disk-backed keys, and we didn't feel comfortable locking that behavior down right now. Happy to add a wrapper if we can come to a decision there.
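The wrapper being discussed would look roughly like the sketch below: given the key-sorted iterator that repartitionAndSortWithinPartitions yields per partition, detect the boundaries between runs of equal keys and emit (key, values) groups. This is a hypothetical helper written for illustration (`groupSorted` was never added to Spark), not the semantics that were ultimately adopted:

{code}
// Hypothetical wrapper (not in Spark): group a key-sorted iterator, such as
// one partition's input inside mapPartitions after
// repartitionAndSortWithinPartitions, into (key, values) pairs.
def groupSorted[K, V](iter: Iterator[(K, V)]): Iterator[(K, Seq[V])] = {
  val buffered = iter.buffered
  new Iterator[(K, Seq[V])] {
    def hasNext: Boolean = buffered.hasNext
    def next(): (K, Seq[V]) = {
      val key = buffered.head._1
      val values = scala.collection.mutable.ArrayBuffer.empty[V]
      // The sort guarantees all records for a key are adjacent, so consume
      // the run until the key changes.
      while (buffered.hasNext && buffered.head._1 == key) values += buffered.next()._2
      (key, values.toSeq)
    }
  }
}
{code}

Note how this relates to the caching concern raised above: materializing each key's values into a Seq is safe to cache but bounded by memory per key, whereas exposing the values as a lazy sub-iterator would avoid that cost but break if a user cached or re-traversed the result.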