[ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195168#comment-14195168 ]
Apache Spark commented on SPARK-1021:
-------------------------------------

User 'erikerlandson' has created a pull request for this issue:
https://github.com/apache/spark/pull/3079

> sortByKey() launches a cluster job when it shouldn't
> ----------------------------------------------------
>
>                 Key: SPARK-1021
>                 URL: https://issues.apache.org/jira/browse/SPARK-1021
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 0.8.0, 0.9.0, 1.0.0, 1.1.0
>            Reporter: Andrew Ash
>            Assignee: Erik Erlandson
>              Labels: starter
>             Fix For: 1.2.0
>
>
> The sortByKey() method is listed as a transformation, not an action, in the
> documentation. But it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Some discussion on the mailing list suggested that this is a problem with the
> rdd.count() call inside Partitioner.scala's rangeBounds method.
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests that rangeBounds should be made into a lazy variable:
> {quote}
> I wonder whether making RangePartitioner.rangeBounds into a lazy val would
> fix this
> (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
> We'd need to make sure that rangeBounds() is never called before an action
> is performed. This could be tricky because it's called in the
> RangePartitioner.equals() method. Maybe it's sufficient to just compare the
> number of partitions, the ids of the RDDs used to create the
> RangePartitioner, and the sort ordering. This still supports the case where
> I range-partition one RDD and pass the same partitioner to a different RDD.
> It breaks support for the case where two range partitioners created on
> different RDDs happened to have the same rangeBounds(), but it seems unlikely
> that this would really harm performance since it's probably unlikely that the
> range partitioners are equal by chance.
> {quote}
> Can we please make this happen? I'll send a PR on GitHub to start the
> discussion and testing.
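
The lazy-val idea quoted above can be illustrated outside Spark with a minimal sketch. Everything here is hypothetical and is not taken from Partitioner.scala or from the pull request: FakeRdd stands in for an RDD and merely counts how often a "job" (the sampling pass) is launched, and LazyRangePartitioner is a simplified partitioner over Int keys. The sketch assumes, as the quote proposes, that equals() compares only the partition count, the source RDD id, and the sort direction, so that comparing partitioners never forces the bounds to be computed.

```scala
// Hypothetical stand-in for an RDD: sample() plays the role of the
// cluster job and records how many times it has been launched.
final class FakeRdd(val id: Int, data: Seq[Int]) {
  var jobsLaunched = 0
  def sample(): Seq[Int] = { jobsLaunched += 1; data.sorted }
}

// Simplified range partitioner (illustrative only, not Spark's class).
final class LazyRangePartitioner(
    val numPartitions: Int,
    val rdd: FakeRdd,
    val ascending: Boolean = true) {

  // lazy val: nothing runs at construction time; the first access
  // triggers the sampling pass and the result is cached afterwards.
  lazy val rangeBounds: Array[Int] = {
    val sorted = rdd.sample()
    if (numPartitions <= 1 || sorted.isEmpty) Array.empty[Int]
    else (1 until numPartitions).map { i =>
      sorted(math.min(sorted.length - 1, i * sorted.length / numPartitions))
    }.toArray
  }

  // Using the partitioner is what forces the bounds.
  def getPartition(key: Int): Int = {
    val p = rangeBounds.indexWhere(key <= _)
    val idx = if (p < 0) rangeBounds.length else p
    if (ascending) idx else rangeBounds.length - idx
  }

  // Equality on cheap fields only, so it never touches rangeBounds.
  override def equals(other: Any): Boolean = other match {
    case that: LazyRangePartitioner =>
      numPartitions == that.numPartitions &&
        rdd.id == that.rdd.id &&
        ascending == that.ascending
    case _ => false
  }

  override def hashCode: Int = (numPartitions, rdd.id, ascending).hashCode
}
```

In this sketch, constructing two partitioners on the same FakeRdd and comparing them leaves jobsLaunched at 0; only the first getPartition call increments it. The trade-off is the one the quote names: two partitioners built on different RDDs compare as unequal even if their bounds happen to coincide.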