[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't

Apache Spark (JIRA) Mon, 03 Nov 2014 13:51:58 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195168#comment-14195168
 ]


Apache Spark commented on SPARK-1021:
-------------------------------------

User 'erikerlandson' has created a pull request for this issue:
https://github.com/apache/spark/pull/3079

> sortByKey() launches a cluster job when it shouldn't
> ----------------------------------------------------
>
>                 Key: SPARK-1021
>                 URL: https://issues.apache.org/jira/browse/SPARK-1021
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 0.8.0, 0.9.0, 1.0.0, 1.1.0
>            Reporter: Andrew Ash
>            Assignee: Erik Erlandson
>              Labels: starter
>             Fix For: 1.2.0
>
>
> The sortByKey() method is listed as a transformation, not an action, in the 
> documentation.  But it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Some discussion on the mailing list suggested that this is a problem with the 
> rdd.count() call inside Partitioner.scala's rangeBounds method.
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests that rangeBounds should be made into a lazy variable:
> {quote}
> I wonder whether making RangePartitoner .rangeBounds into a lazy val would 
> fix this 
> (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
>   We'd need to make sure that rangeBounds() is never called before an action 
> is performed.  This could be tricky because it's called in the 
> RangePartitioner.equals() method.  Maybe it's sufficient to just compare the 
> number of partitions, the ids of the RDDs used to create the 
> RangePartitioner, and the sort ordering.  This still supports the case where 
> I range-partition one RDD and pass the same partitioner to a different RDD.  
> It breaks support for the case where two range partitioners created on 
> different RDDs happened to have the same rangeBounds(), but it seems unlikely 
> that this would really harm performance since it's probably unlikely that the 
> range partitioners are equal by chance.
> {quote}
> Can we please make this happen?  I'll send a PR on GitHub to start the 
> discussion and testing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't

Reply via email to