[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once
[ https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072729#comment-14072729 ]

Apache Spark commented on SPARK-2568:

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/1562

> RangePartitioner should go through the data only once
>
> Key: SPARK-2568
> URL: https://issues.apache.org/jira/browse/SPARK-2568
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.0.0
> Reporter: Reynold Xin
> Assignee: Xiangrui Meng
>
> As of Spark 1.0, RangePartitioner goes through data twice: once to compute
> the count and once to do sampling. As a result, to do sortByKey, Spark goes
> through data 3 times (once to count, once to sample, and once to sort).
> RangePartitioner should go through data only once (remove the count step).

--
This message was sent by Atlassian JIRA
(v6.2#6252)
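The issue description asks for the count step to be removed. One way this could work, sketched here in plain Python rather than against Spark's API, is reservoir sampling: a single pass over each partition yields both a fixed-size sample and the partition's size, so no separate count job is needed. This is an illustration only and not necessarily what the linked pull request does; the names `reservoir_sample_and_count` and `range_bounds`, and the weighting scheme, are assumptions made for the example.

```python
import random
from typing import Iterable, List, Sequence, Tuple


def reservoir_sample_and_count(
    items: Iterable[int], k: int, seed: int = 0
) -> Tuple[List[int], int]:
    """Return (uniform sample of up to k items, total item count) in one pass."""
    rng = random.Random(seed)
    reservoir: List[int] = []
    count = 0
    for item in items:
        count += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / count.
            j = rng.randrange(count)
            if j < k:
                reservoir[j] = item
    return reservoir, count


def range_bounds(
    partitions: Sequence[Iterable[int]], num_ranges: int, k: int
) -> List[int]:
    """Pick up to num_ranges - 1 split keys from per-partition samples.

    Each sampled key is weighted by (partition size / sample size), so a
    large partition is not under-represented by its fixed-size sample.
    """
    weighted: List[Tuple[int, float]] = []
    for idx, part in enumerate(partitions):
        sample, n = reservoir_sample_and_count(part, k, seed=idx)
        if sample:
            weight = n / len(sample)
            weighted.extend((key, weight) for key in sample)
    weighted.sort(key=lambda kw: kw[0])

    total = sum(w for _, w in weighted)
    step = total / num_ranges
    bounds: List[int] = []
    acc, target = 0.0, step
    for key, w in weighted:
        acc += w
        if acc >= target and len(bounds) < num_ranges - 1:
            bounds.append(key)
            target += step
    return bounds
```

With this shape, sorting would need two passes over the data (one to sample and count together, one to sort) instead of three.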
[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once
[ https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066689#comment-14066689 ]

Reynold Xin commented on SPARK-2568:

Our PhD in Math is working on that :)
[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once
[ https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066553#comment-14066553 ]

Mark Hamstra commented on SPARK-2568:

Sure, if they can be cleanly separated -- but there's also interaction with the ShuffleManager refactoring. Do you have some strategy in mind for addressing just SPARK-2568 in isolation?
[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once
[ https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066541#comment-14066541 ]

Reynold Xin commented on SPARK-2568:

Yes. Let's solve the problems one at a time, though.
[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once
[ https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066539#comment-14066539 ]

Mark Hamstra commented on SPARK-2568:

What is at least as much a problem as the making of three passes through the data is that the count and sample are separate hidden/special jobs within the RangePartitioner that aren't launched by RDD actions under the user's control. This ends up not only breaking Spark's "transformations are lazy; jobs are only launched by actions" model, but it also messes up the construction of FutureActions on sorted RDDs, accounting of resource usage of jobs that include a sort, etc.
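The concern about hidden jobs can be illustrated with a toy model in plain Python. Nothing below is Spark code: `CountingSource`, `EagerSorted`, and `LazySorted` are hypothetical names, and a single median pivot stands in for the range bounds. The sketch only shows the difference between scanning the data when a sorted dataset is defined and deferring every scan until an action is called.

```python
from typing import Iterable, Iterator, List, Optional


class CountingSource:
    """Toy data source that records how many full passes were made over it."""

    def __init__(self, data: Iterable[int]) -> None:
        self._data = list(data)
        self.passes = 0

    def scan(self) -> Iterator[int]:
        self.passes += 1
        return iter(self._data)


def pick_pivot(source: CountingSource) -> int:
    """Stand-in for computing range bounds; costs one pass over the source."""
    keys = sorted(source.scan())
    return keys[len(keys) // 2]


def partition_and_sort(source: CountingSource, pivot: int) -> List[int]:
    """Split keys around the pivot in one pass, then sort each range."""
    low: List[int] = []
    high: List[int] = []
    for key in source.scan():
        (low if key < pivot else high).append(key)
    return sorted(low) + sorted(high)


class EagerSorted:
    """Scans the source in the constructor, before any action is requested."""

    def __init__(self, source: CountingSource) -> None:
        self.source = source
        self.pivot = pick_pivot(source)  # hidden pass at definition time

    def collect(self) -> List[int]:
        return partition_and_sort(self.source, self.pivot)


class LazySorted:
    """Defers all scanning until collect(), the only action in this model."""

    def __init__(self, source: CountingSource) -> None:
        self.source = source
        self._pivot: Optional[int] = None

    def collect(self) -> List[int]:
        if self._pivot is None:
            self._pivot = pick_pivot(self.source)
        return partition_and_sort(self.source, self._pivot)
```

Both variants make the same number of passes once `collect()` runs. The difference is when the work starts: the eager variant has already scanned the data by the time the object exists, which is the behavior the comment above describes as outside the user's control.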