[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once

2014-07-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072729#comment-14072729
 ] 

Apache Spark commented on SPARK-2568:
-------------------------------------

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/1562

> RangePartitioner should go through the data only once
> -----------------------------------------------------
>
>                 Key: SPARK-2568
>                 URL: https://issues.apache.org/jira/browse/SPARK-2568
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Reynold Xin
>            Assignee: Xiangrui Meng
>
> As of Spark 1.0, RangePartitioner goes through data twice: once to compute 
> the count and once to do sampling. As a result, to do sortByKey, Spark goes 
> through data 3 times (once to count, once to sample, and once to sort).
> RangePartitioner should go through data only once (remove the count step).
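
One way the count step could be folded into the sampling step is reservoir sampling, which yields both a fixed-size sample and the element count from a single pass over each partition. The sketch below is plain Python rather than Spark code, and it is not taken from the pull request: the names `reservoir_sample_and_count` and `range_bounds`, the list-of-iterables stand-in for partitions, and the weighting scheme are all assumptions made for illustration.

```python
import random
from typing import Iterable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def reservoir_sample_and_count(
    items: Iterable[T], k: int, rng: random.Random
) -> Tuple[List[T], int]:
    """Return (uniform sample of up to k items, total item count) in one pass."""
    reservoir: List[T] = []
    count = 0
    for item in items:
        count += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / count.
            j = rng.randrange(count)
            if j < k:
                reservoir[j] = item
    return reservoir, count


def range_bounds(
    partitions: Sequence[Iterable[T]],
    num_ranges: int,
    sample_per_partition: int,
    seed: int = 0,
) -> List[T]:
    """Hypothetical helper: derive num_ranges - 1 split keys from one pass."""
    rng = random.Random(seed)
    weighted = []
    for part in partitions:
        sample, n = reservoir_sample_and_count(part, sample_per_partition, rng)
        if sample:
            # Each sampled key stands for n / len(sample) original keys.
            weight = n / len(sample)
            weighted.extend((key, weight) for key in sample)
    weighted.sort(key=lambda kw: kw[0])
    total = sum(w for _, w in weighted)
    step = total / num_ranges
    bounds: List[T] = []
    cumulative, target = 0.0, step
    for key, w in weighted:
        cumulative += w
        if cumulative >= target and len(bounds) < num_ranges - 1:
            bounds.append(key)
            target += step
    return bounds
```

Because the count comes back alongside the sample, a partition that turns out to be much larger than the others can be detected from the same pass and given proportionally more weight, so no separate counting job is needed.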



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once

2014-07-18 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066689#comment-14066689
 ] 

Reynold Xin commented on SPARK-2568:


Our PhD in Math is working on that :)



[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once

2014-07-18 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066553#comment-14066553
 ] 

Mark Hamstra commented on SPARK-2568:
-------------------------------------

Sure, if they can be cleanly separated -- but there's also interaction with the 
ShuffleManager refactoring.

Do you have some strategy in mind for addressing just SPARK-2568 in isolation?



[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once

2014-07-18 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066541#comment-14066541
 ] 

Reynold Xin commented on SPARK-2568:


Yes. Let's solve the problems one at a time, though.



[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once

2014-07-18 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066539#comment-14066539
 ] 

Mark Hamstra commented on SPARK-2568:
-------------------------------------

At least as much of a problem as the three passes through the data is that the
count and sample run as separate, hidden jobs inside the RangePartitioner; they
are not launched by RDD actions under the user's control. This not only breaks
Spark's "transformations are lazy; jobs are only launched by actions" model, it
also messes up the construction of FutureActions on sorted RDDs, the accounting
of resource usage for jobs that include a sort, etc.
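
The difference between a transformation that secretly launches jobs and one that stays lazy can be shown with a toy model. The sketch below is plain Python, not Spark's API: `LazyDataset`, `sort_eager`, `sort_lazy`, and the job log are hypothetical names invented for this illustration.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run jobs."""

    def __init__(self, compute, log):
        self._compute = compute  # zero-argument function producing the data
        self._log = log          # shared list recording every launched job

    def _run_job(self, label):
        self._log.append(label)
        return self._compute()

    def collect(self):
        # An action: the only place a user expects a job to start.
        return self._run_job("user action: collect")

    def sort_eager(self):
        # Mimics the reported behavior: two hidden jobs run as soon as the
        # transformation is defined, before any action is called.
        count = len(self._run_job("hidden: count"))
        sample = self._run_job("hidden: sample")[: max(1, count // 2)]
        _ = sorted(sample)  # range bounds would be derived from this sample
        return LazyDataset(lambda: sorted(self._compute()), self._log)

    def sort_lazy(self):
        # Nothing runs here; all work is deferred until an action is called.
        return LazyDataset(lambda: sorted(self._compute()), self._log)


eager_log = []
eager = LazyDataset(lambda: [3, 1, 2], eager_log).sort_eager()
print(eager_log)  # two hidden jobs already recorded, no action called yet

lazy_log = []
lazy = LazyDataset(lambda: [3, 1, 2], lazy_log).sort_lazy()
print(lazy_log)   # empty until collect() is called
```

In the eager variant the job log is non-empty before the user has called any action, so anything that tracks work per user-launched job has no job to attach the count and sample to.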
