[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't

Matei Zaharia (JIRA) Mon, 07 Apr 2014 10:19:29 -0700

    [ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962026#comment-13962026 ]


Matei Zaharia commented on SPARK-1021:
--------------------------------------

Note that if we do this, we'll need a similar fix in Python, which may be 
trickier.

> sortByKey() launches a cluster job when it shouldn't
> ----------------------------------------------------
>
>                 Key: SPARK-1021
>                 URL: https://issues.apache.org/jira/browse/SPARK-1021
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.8.0, 0.9.0
>            Reporter: Andrew Ash
>              Labels: starter
>
> The sortByKey() method is listed as a transformation, not an action, in the 
> documentation.  But it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Some discussion on the mailing list suggested that this is a problem with the 
> rdd.count() call inside Partitioner.scala's rangeBounds method.
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests that rangeBounds should be made into a lazy variable:
> {quote}
> I wonder whether making RangePartitioner.rangeBounds into a lazy val would 
> fix this 
> (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
>   We'd need to make sure that rangeBounds() is never called before an action 
> is performed.  This could be tricky because it's called in the 
> RangePartitioner.equals() method.  Maybe it's sufficient to just compare the 
> number of partitions, the ids of the RDDs used to create the 
> RangePartitioner, and the sort ordering.  This still supports the case where 
> I range-partition one RDD and pass the same partitioner to a different RDD.  
> It breaks support for the case where two range partitioners created on 
> different RDDs happened to have the same rangeBounds(), but it seems unlikely 
> that this would really harm performance since it's probably unlikely that the 
> range partitioners are equal by chance.
> {quote}
> Can we please make this happen?  I'll send a PR on GitHub to start the 
> discussion and testing.
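
The proposal quoted above has two parts: defer the range-bounds computation
until it is first needed, and make equality avoid touching the bounds at all.
The following is a minimal sketch of that idea in plain Python, not Spark's
actual code. The class LazyRangePartitioner, the dataset_id field, the
sample_keys callable (standing in for the expensive cluster job), and the
simple quantile logic are all hypothetical names and simplifications chosen
for illustration.

```python
from functools import cached_property


class LazyRangePartitioner:
    """Toy stand-in for a range partitioner (hypothetical, not Spark's class)."""

    def __init__(self, num_partitions, dataset_id, sample_keys, ascending=True):
        self.num_partitions = num_partitions
        self.dataset_id = dataset_id      # assumed identifier of the source dataset
        self.ascending = ascending
        self._sample_keys = sample_keys   # callable simulating the expensive job

    @cached_property
    def range_bounds(self):
        # Runs the "job" only on first access; the result is cached afterwards.
        keys = sorted(self._sample_keys())
        if self.num_partitions <= 1 or not keys:
            return []
        step = len(keys) / self.num_partitions
        return [keys[min(int(step * i), len(keys) - 1)]
                for i in range(1, self.num_partitions)]

    def __eq__(self, other):
        # Compare cheap metadata only, so equality never forces range_bounds.
        if not isinstance(other, LazyRangePartitioner):
            return NotImplemented
        return ((self.num_partitions, self.dataset_id, self.ascending)
                == (other.num_partitions, other.dataset_id, other.ascending))

    def __hash__(self):
        return hash((self.num_partitions, self.dataset_id, self.ascending))


jobs_launched = []


def sample_keys():
    # Records each simulated job so we can see when work actually happens.
    jobs_launched.append("sample")
    return [5, 1, 9, 3, 7, 2]


p = LazyRangePartitioner(3, dataset_id=42, sample_keys=sample_keys)
q = LazyRangePartitioner(3, dataset_id=42, sample_keys=sample_keys)
same = (p == q)                              # no job is triggered here
jobs_before_access = len(jobs_launched)
bounds = p.range_bounds                      # first access triggers one job
bounds_again = p.range_bounds                # cached, no second job
```

In this sketch, constructing and comparing partitioners does no work; the
simulated job runs once, on the first read of range_bounds. As the quote
notes, the trade-off is that two partitioners built from different datasets
are never considered equal, even if their bounds happen to match.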



--
This message was sent by Atlassian JIRA
(v6.2#6252)
