[jira] [Updated] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
     [ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Ash updated SPARK-1021:
------------------------------
    Labels:   (was: starter)

> sortByKey() launches a cluster job when it shouldn't
> -----------------------------------------------------
>
>                 Key: SPARK-1021
>                 URL: https://issues.apache.org/jira/browse/SPARK-1021
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 0.8.0, 0.9.0, 1.0.0, 1.1.0
>            Reporter: Andrew Ash
>            Assignee: Erik Erlandson
>
> The sortByKey() method is listed as a transformation, not an action, in the
> documentation, but it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Discussion on the mailing list pointed to the rdd.count() call inside
> Partitioner.scala's rangeBounds method as the cause:
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests making rangeBounds a lazy val:
> {quote}
> I wonder whether making RangePartitioner.rangeBounds into a lazy val would
> fix this
> (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
> We'd need to make sure that rangeBounds() is never called before an action
> is performed. This could be tricky because it's called in the
> RangePartitioner.equals() method. Maybe it's sufficient to just compare the
> number of partitions, the ids of the RDDs used to create the
> RangePartitioner, and the sort ordering. This still supports the case where
> I range-partition one RDD and pass the same partitioner to a different RDD.
> It breaks support for the case where two range partitioners created on
> different RDDs happened to have the same rangeBounds(), but it seems unlikely
> that this would really harm performance since it's probably unlikely that the
> range partitioners are equal by chance.
> {quote}
> Can we please make this happen? I'll send a PR on GitHub to start the
> discussion and testing.
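To make Josh Rosen's suggestion concrete, here is a minimal, self-contained sketch in Scala. It is not the Spark implementation: SketchRangePartitioner, rddId, and sampleKeys are illustrative names, and the sampling pass is reduced to a caller-supplied function. The two essential moves are the same, though: rangeBounds becomes a lazy val, and equals() compares only cheap metadata so it can never force the bounds (and hence a job) to be computed.

{code:scala}
import scala.reflect.ClassTag

// Minimal sketch of the proposed change, not the actual Spark code.
// `SketchRangePartitioner`, `rddId`, and `sampleKeys` are illustrative
// stand-ins for Spark internals.
class SketchRangePartitioner[K](
    val numPartitions: Int,
    val rddId: Int,                    // id of the RDD the bounds are drawn from
    val ascending: Boolean,
    sampleKeys: () => Array[K]         // stands in for the sampling/count pass over the RDD
)(implicit ord: Ordering[K], ct: ClassTag[K]) {

  // Deferred until a partition is actually requested, i.e. until an action
  // runs -- so merely constructing the partitioner (as sortByKey() does) no
  // longer triggers a cluster job by itself.
  lazy val rangeBounds: Array[K] = {
    val sorted = sampleKeys().sorted(ord)
    if (sorted.isEmpty) Array.empty[K]
    else (1 until numPartitions)
      .map(i => sorted(math.min(sorted.length - 1, i * sorted.length / numPartitions)))
      .toArray
  }

  def getPartition(key: K): Int = {
    val idx = rangeBounds.indexWhere(b => ord.lteq(key, b)) match {
      case -1 => rangeBounds.length    // key is greater than every bound
      case i  => i
    }
    if (ascending) idx else rangeBounds.length - idx
  }

  // Equality deliberately avoids touching rangeBounds, so comparing two
  // partitioners can never force the sampling job to run.
  override def equals(other: Any): Boolean = other match {
    case o: SketchRangePartitioner[_] =>
      o.numPartitions == numPartitions && o.rddId == rddId && o.ascending == ascending
    case _ => false
  }

  override def hashCode(): Int = (numPartitions, rddId, ascending).hashCode()
}
{code}

The trade-off is the one Josh names above: two partitioners built over different RDDs that happen to produce identical bounds no longer compare equal, which costs an extra shuffle only in that unlikely coincidence, while the common case of reusing one partitioner across RDDs still works.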
[jira] [Updated] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
     [ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-1021:
-----------------------------------
    Target Version/s:   (was: 1.2.0)
[jira] [Updated] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
     [ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-1021:
-----------------------------------
    Fix Version/s:   (was: 1.3.0)
[jira] [Updated] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
     [ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-1021:
-----------------------------------
    Fix Version/s: 1.3.0  (was: 1.2.0)
[jira] [Updated] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
     [ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Hamstra updated SPARK-1021:
--------------------------------
    Assignee: Erik Erlandson  (was: Mark Hamstra)
[jira] [Updated] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
     [ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Erlandson updated SPARK-1021:
----------------------------------
    Issue Type: Sub-task  (was: Bug)
        Parent: SPARK-2992
[jira] [Updated] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
     [ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-1021:
-------------------------------
    Target Version/s: 1.2.0
    Affects Version/s: 1.1.0, 1.0.0