[jira] [Resolved] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik van Oosten resolved SPARK-27025.
    Resolution: Incomplete

> Speed up toLocalIterator
>
> Key: SPARK-27025
> URL: https://issues.apache.org/jira/browse/SPARK-27025
> Project: Spark
> Issue Type: Wish
> Components: Spark Core
> Affects Versions: 2.3.3
> Reporter: Erik van Oosten
> Priority: Major
>
> Method {{toLocalIterator}} fetches the partitions to the driver one by one. However, as far as I can see, any required computation for the yet-to-be-fetched partitions is not kicked off until each partition is fetched. Effectively only one partition is being computed at a time.
> Desired behavior: immediately start the computation of all partitions while retaining the download-one-partition-at-a-time behavior.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784854#comment-16784854 ]

Erik van Oosten commented on SPARK-27025:

If there is no obvious way to improve Spark, then it's probably better to close this issue until someone finds a better angle. By the way, the cache/count/iterate/unpersist cycle did not make it faster for my use case. I will try the 2-partition implementation of toLocalIterator. [~srowen], [~hyukjin.kwon], thanks for your input!
[jira] [Comment Edited] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783220#comment-16783220 ]

Erik van Oosten edited comment on SPARK-27025 at 3/4/19 10:36 AM:

[~hyukjin.kwon] maybe I misunderstood Sean's comment. I understood that every invocation of toLocalIterator will either benefit or not have any negative side effect. Under this assumption, it would be better to put the cache/count/iterate/unpersist logic directly in toLocalIterator. I cannot make any assumptions about the number of use cases.

was (Author: erikvanoosten):

[~hyukjin.kwon] maybe I misunderstood Sean's comment. I understood that every invocation of toLocalIterator will either benefit, or not have any negative side effect. Under this assumption, it would be better to put the cache/count/iterate/unpersist logic directly in toLocalIterator.
[jira] [Commented] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783220#comment-16783220 ]

Erik van Oosten commented on SPARK-27025:

[~hyukjin.kwon] maybe I misunderstood Sean's comment. I understood that every invocation of toLocalIterator will either benefit or not have any negative side effect. Under this assumption, it would be better to put the cache/count/iterate/unpersist logic directly in toLocalIterator.
[jira] [Commented] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783110#comment-16783110 ]

Erik van Oosten commented on SPARK-27025:

Thanks Sean, that is very useful. In my use case the entire data set is too big for the driver, but I can easily fit 1/10th of it. So even with as few as 20 partitions, holding 2 partitions on the driver would be fine. The use case contains 2 joins and a groupBy/count, so it probably involves a wide transformation. It seems, then, that the cache/count/toLocalIterator/unpersist approach is applicable. The ergonomics of this approach are far worse, though, so I don't agree that it is 'better' to do this in application code.
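For readers following along, the cache/count/toLocalIterator/unpersist cycle discussed in this thread can be sketched roughly as below. This is a minimal illustration, not code from the ticket: the helper name `eagerLocalIterator` is hypothetical, and the unpersist-on-exhaustion wrapper is one possible way to release the cache.

{noformat}
import org.apache.spark.rdd.RDD

// Hypothetical helper sketching the workaround: count() is an action,
// so it forces EVERY partition to be computed (in parallel) and cached;
// the subsequent toLocalIterator then only has to download the cached
// partitions one at a time.
def eagerLocalIterator[T](rdd: RDD[T]): Iterator[T] = {
  val cached = rdd.cache()  // request caching of computed partitions
  cached.count()            // compute all partitions now, in parallel
  new Iterator[T] {
    private val it = cached.toLocalIterator  // one partition at a time
    def hasNext: Boolean = {
      val more = it.hasNext
      if (!more) cached.unpersist()  // release the cache when exhausted
      more
    }
    def next(): T = it.next()
  }
}
{noformat}

The trade-off is the one raised in the thread: the cluster must hold the whole cached result, so this only helps when the data set fits in executor storage.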
[jira] [Comment Edited] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782322#comment-16782322 ]

Erik van Oosten edited comment on SPARK-27025 at 3/2/19 8:43 AM:

The point is to _not_ fetch proactively. I have a program in which several steps need to be executed before anything can be transferred to the driver. So why can't the executors start executing immediately, and only transfer the results to the driver when they are ready?

was (Author: erikvanoosten):

I have a program in which several steps need to be executed before anything can be transferred to the driver. So why can't the executors start executing immediately, and only transfer the results to the driver when its ready?
[jira] [Commented] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782322#comment-16782322 ]

Erik van Oosten commented on SPARK-27025:

I have a program in which several steps need to be executed before anything can be transferred to the driver. So why can't the executors start executing immediately, and only transfer the results to the driver when they are ready?
[jira] [Created] (SPARK-27025) Speed up toLocalIterator
Erik van Oosten created SPARK-27025:

Summary: Speed up toLocalIterator
Key: SPARK-27025
URL: https://issues.apache.org/jira/browse/SPARK-27025
Project: Spark
Issue Type: Wish
Components: Spark Core
Affects Versions: 2.3.3
Reporter: Erik van Oosten

Method {{toLocalIterator}} fetches the partitions to the driver one by one. However, as far as I can see, any required computation for the yet-to-be-fetched partitions is not kicked off until each partition is fetched. Effectively only one partition is being computed at a time.

Desired behavior: immediately start the computation of all partitions while retaining the download-one-partition-at-a-time behavior.
[jira] [Created] (SPARK-6878) Sum on empty RDD fails with exception
Erik van Oosten created SPARK-6878:

Summary: Sum on empty RDD fails with exception
Key: SPARK-6878
URL: https://issues.apache.org/jira/browse/SPARK-6878
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.2.0
Reporter: Erik van Oosten
Priority: Minor

{{sum}} on an empty RDD throws an exception. The expected result is {{0}}. A simple fix is to replace

{noformat}
class DoubleRDDFunctions {
  def sum(): Double = self.reduce(_ + _)
{noformat}

with:

{noformat}
class DoubleRDDFunctions {
  def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
{noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
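The difference between the two versions can be seen with plain Scala collections, whose `reduce` and `aggregate` mirror the RDD semantics (a minimal illustration on Scala 2.12, not Spark code):

{noformat}
// reduce has no identity element, so on an empty sequence it throws:
//   Seq.empty[Double].reduce(_ + _)  // java.lang.UnsupportedOperationException

// aggregate supplies the zero 0.0 explicitly. Its first function sums
// values within a partition, the second merges the per-partition sums;
// with no elements at all, the zero itself is the result.
val total = Seq.empty[Double].aggregate(0.0)(_ + _, _ + _)  // 0.0
{noformat}

In the RDD version the same zero also covers empty partitions, which is why the one-line change fixes the reported exception.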
[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception
[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492302#comment-14492302 ]

Erik van Oosten commented on SPARK-6878:

Ah, yes. I now see that fold also first reduces per partition.
[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception
[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492336#comment-14492336 ]

Erik van Oosten commented on SPARK-6878:

Pull request: https://github.com/apache/spark/pull/5489
[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception
[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492282#comment-14492282 ]

Erik van Oosten commented on SPARK-6878:

The answer is only defined because the RDD is an {{RDD[Double]}} :) Sure, I'll make a PR.
[jira] [Updated] (SPARK-6878) Sum on empty RDD fails with exception
[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik van Oosten updated SPARK-6878:
    Flags: Patch