[jira] [Resolved] (SPARK-27025) Speed up toLocalIterator

2019-03-05 Thread Erik van Oosten (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik van Oosten resolved SPARK-27025.
-------------------------------------
Resolution: Incomplete




[jira] [Commented] (SPARK-27025) Speed up toLocalIterator

2019-03-05 Thread Erik van Oosten (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784854#comment-16784854 ]

Erik van Oosten commented on SPARK-27025:
-----------------------------------------

If there is no obvious way to improve Spark, then it's probably better to close this issue until someone finds a better angle.

BTW, the cache/count/iterate/unpersist cycle did not make it faster for my use case. I will try the 2-partition implementation of toLocalIterator (a rough sketch of that idea follows below).

[~srowen], [~hyukjin.kwon], thanks for your input!
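
For reference, here is roughly what such a "2-partition" variant could look like on top of the public RDD API. This is only a sketch of the idea, not existing Spark code: the helper name {{toLocalIteratorPrefetch}} and the use of {{runJob}} are my own illustration. While the driver hands out the rows of partition i, partition i + 1 is already being computed and fetched in the background, so at most two partitions are held on the driver at any time.

{noformat}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper: consume an RDD on the driver while always having
// the next partition's computation already running in the background.
def toLocalIteratorPrefetch[T: ClassTag](rdd: RDD[T]): Iterator[T] = {
  val sc = rdd.sparkContext
  val numPartitions = rdd.partitions.length

  // Compute and collect a single partition as an asynchronous job.
  def fetch(p: Int): Future[Array[T]] =
    Future(sc.runJob(rdd, (it: Iterator[T]) => it.toArray, Seq(p)).head)

  val partitionIterator = new Iterator[Array[T]] {
    private var index = 0
    private var pending: Option[Future[Array[T]]] =
      if (numPartitions > 0) Some(fetch(0)) else None

    def hasNext: Boolean = pending.isDefined

    def next(): Array[T] = {
      val current = pending.get
      index += 1
      // Kick off the next partition before blocking on the current one,
      // so its computation/transfer overlaps with local consumption.
      pending = if (index < numPartitions) Some(fetch(index)) else None
      Await.result(current, Duration.Inf)
    }
  }
  partitionIterator.flatMap(_.iterator)
}
{noformat}

This only changes how eagerly per-partition jobs are submitted; driver memory stays bounded at roughly two partitions' worth of data.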




[jira] [Comment Edited] (SPARK-27025) Speed up toLocalIterator

2019-03-04 Thread Erik van Oosten (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783220#comment-16783220 ]

Erik van Oosten edited comment on SPARK-27025 at 3/4/19 10:36 AM:
------------------------------------------------------------------

[~hyukjin.kwon] maybe I misunderstood Sean's comment. I understood that every 
invocation of toLocalIterator will either benefit, or not have any negative 
side effect.

Under this assumption, it would be better to put the 
cache/count/iterate/unpersist logic directly in toLocalIterator.

I cannot make any assumptions about the number of use cases.


was (Author: erikvanoosten):
[~hyukjin.kwon] maybe I misunderstood Sean's comment. I understood that every 
invocation of toLocalIterator will either benefit, or not have any negative 
side effect.

Under this assumption, it would be better to put the 
cache/count/iterate/unpersist logic directly in toLocalIterator.




[jira] [Commented] (SPARK-27025) Speed up toLocalIterator

2019-03-04 Thread Erik van Oosten (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783220#comment-16783220 ]

Erik van Oosten commented on SPARK-27025:
-----------------------------------------

[~hyukjin.kwon] maybe I misunderstood Sean's comment. I understood that every 
invocation of toLocalIterator will either benefit, or not have any negative 
side effect.

Under this assumption, it would be better to put the 
cache/count/iterate/unpersist logic directly in toLocalIterator.




[jira] [Commented] (SPARK-27025) Speed up toLocalIterator

2019-03-04 Thread Erik van Oosten (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783110#comment-16783110 ]

Erik van Oosten commented on SPARK-27025:
-----------------------------------------

Thanks Sean, that is very useful.

In my use case the entire data set is too big for the driver, but I can easily fit 1/10th of it. So even with as few as 20 partitions, holding 2 partitions on the driver at a time would be fine.

The use case involves two joins and a groupBy/count, so this is probably a wide transformation. It therefore seems that the cache/count/toLocalIterator/unpersist approach is applicable; a sketch of that approach follows below.

The ergonomics of this approach are way worse though, so I don't agree that it is 'better' to do this in application code.
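
For completeness, this is roughly what that application-side pattern looks like. It is a sketch only: the helper name {{iterateLocally}} and the storage level are my own choices, not Spark API.

{noformat}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch of the cache/count/toLocalIterator/unpersist cycle:
// count() runs one job that computes all partitions in parallel, after
// which toLocalIterator only has to download the cached partitions.
def iterateLocally[T](rdd: RDD[T]): Iterator[T] = {
  rdd.persist(StorageLevel.MEMORY_AND_DISK)
  rdd.count()
  val it = rdd.toLocalIterator
  new Iterator[T] {
    def hasNext: Boolean = {
      val more = it.hasNext
      if (!more) rdd.unpersist()  // release the cache once everything is consumed
      more
    }
    def next(): T = it.next()
  }
}
{noformat}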




[jira] [Comment Edited] (SPARK-27025) Speed up toLocalIterator

2019-03-02 Thread Erik van Oosten (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782322#comment-16782322 ]

Erik van Oosten edited comment on SPARK-27025 at 3/2/19 8:43 AM:
-----------------------------------------------------------------

The point is to _not_ fetch proactively.

I have a program in which several steps need to be executed before anything can be transferred to the driver. So why can't the executors start executing immediately, and only transfer the results to the driver when they are ready?


was (Author: erikvanoosten):
I have a program in which several steps need to be executed before anything can be transferred to the driver. So why can't the executors start executing immediately, and only transfer the results to the driver when they are ready?




[jira] [Commented] (SPARK-27025) Speed up toLocalIterator

2019-03-02 Thread Erik van Oosten (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782322#comment-16782322 ]

Erik van Oosten commented on SPARK-27025:
-----------------------------------------

I have a program in which several steps need to be executed before anything can be transferred to the driver. So why can't the executors start executing immediately, and only transfer the results to the driver when they are ready?




[jira] [Created] (SPARK-27025) Speed up toLocalIterator

2019-03-01 Thread Erik van Oosten (JIRA)
Erik van Oosten created SPARK-27025:
------------------------------------

 Summary: Speed up toLocalIterator
 Key: SPARK-27025
 URL: https://issues.apache.org/jira/browse/SPARK-27025
 Project: Spark
  Issue Type: Wish
  Components: Spark Core
Affects Versions: 2.3.3
Reporter: Erik van Oosten


Method {{toLocalIterator}} fetches the partitions to the driver one by one. However, as far as I can see, any required computation for the yet-to-be-fetched partitions is not kicked off until that partition is fetched. Effectively, only one partition is being computed at a time.

Desired behavior: immediately start the calculation of all partitions while retaining the download-one-partition-at-a-time behavior.
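
To make the behaviour concrete, a small assumed reproduction (not Spark test code): with 4 partitions and 1 second of work per element, the plain toLocalIterator loop below takes roughly as many times longer as there are partitions, because each partition's job only starts when the iterator reaches it, whereas the cache/count variant computes everything in one parallel job first.

{noformat}
import org.apache.spark.sql.SparkSession

object ToLocalIteratorTiming {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[4]").appName("SPARK-27025-demo").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 8, numSlices = 4).map { i =>
      Thread.sleep(1000)  // stand-in for the expensive computation
      i * i
    }

    val start = System.nanoTime()
    rdd.toLocalIterator.foreach(_ => ())  // one job per partition, executed one after another
    println(f"toLocalIterator: ${(System.nanoTime() - start) / 1e9}%.1f s")

    // For comparison: compute all partitions in one parallel job first,
    // then only transfer the cached partitions one by one.
    rdd.cache()
    val start2 = System.nanoTime()
    rdd.count()
    rdd.toLocalIterator.foreach(_ => ())
    println(f"cache/count/toLocalIterator: ${(System.nanoTime() - start2) / 1e9}%.1f s")
    rdd.unpersist()

    spark.stop()
  }
}
{noformat}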






[jira] [Created] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)
Erik van Oosten created SPARK-6878:
-----------------------------------

 Summary: Sum on empty RDD fails with exception
 Key: SPARK-6878
 URL: https://issues.apache.org/jira/browse/SPARK-6878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Erik van Oosten
Priority: Minor


{{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.

A simple fix is to replace

{noformat}
class DoubleRDDFunctions {
  def sum(): Double = self.reduce(_ + _)
{noformat} 

with:

{noformat}
class DoubleRDDFunctions {
  def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
{noformat}
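
For illustration, the difference on an empty RDD (a sketch, assuming a local {{SparkContext}} named {{sc}}):

{noformat}
val empty = sc.parallelize(Seq.empty[Double])

// Current behaviour: the reduce-based sum has no starting value, so it throws
// java.lang.UnsupportedOperationException: empty collection
// empty.sum()

// With the aggregate-based definition the zero value 0.0 is used for
// empty partitions and for the final merge, so the result is simply 0:
val total = empty.aggregate(0.0)(_ + _, _ + _)
assert(total == 0.0)
{noformat}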







[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492302#comment-14492302 ]

Erik van Oosten commented on SPARK-6878:
----------------------------------------

Ah, yes. I now see that fold also first reduces per partition.
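
For anyone reading along, a quick illustration of why {{fold}} is also safe here (a sketch, assuming a {{SparkContext}} named {{sc}}): the zero value is applied within every partition and again when the per-partition results are merged, so an empty RDD simply yields the zero value.

{noformat}
val empty = sc.parallelize(Seq.empty[Double])

// fold seeds every partition (and the final merge) with 0.0,
// so unlike reduce it does not fail on an empty RDD:
assert(empty.fold(0.0)(_ + _) == 0.0)
{noformat}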




[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492336#comment-14492336 ]

Erik van Oosten commented on SPARK-6878:
----------------------------------------

Pull request: https://github.com/apache/spark/pull/5489




[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492282#comment-14492282 ]

Erik van Oosten commented on SPARK-6878:
----------------------------------------

The answer is only defined because the RDD is an {{RDD[Double]}} :)

Sure, I'll make a PR.




[jira] [Updated] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik van Oosten updated SPARK-6878:
-----------------------------------
Flags: Patch
