That was (a) fuzzy and (b) insufficient: one can certainly use foreach on DStream RDDs (inside foreachRDD); it works, as an empirical observation.
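To illustrate that observation, here is a minimal pure-Python sketch of the invocation pattern, with no Spark required. MockDStream and MockRDD are illustrative stand-ins I've made up for this example, not Spark's actual classes: foreachRDD hands you each micro-batch RDD, and foreach can then be called on that RDD.

```python
# Illustrative mocks of the DStream/RDD invocation pattern; these are
# NOT Spark's implementation, just stand-ins showing the call shape.
class MockRDD:
    def __init__(self, data):
        self.data = data

    def foreach(self, f):
        # f runs once per record in the RDD
        for record in self.data:
            f(record)

class MockDStream:
    def __init__(self, batches):
        self.batches = batches  # one MockRDD per micro-batch

    def foreachRDD(self, f):
        # f runs once per micro-batch RDD the stream produces
        for rdd in self.batches:
            f(rdd)

seen = []
stream = MockDStream([MockRDD([1, 2]), MockRDD([3])])
# foreach applied to each RDD that the DStream yields
stream.foreachRDD(lambda rdd: rdd.foreach(seen.append))
```

With real Spark the shape is the same: `dstream.foreachRDD(lambda rdd: rdd.foreach(handler))`.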
As another empirical observation: foreachPartition results in one instance of the lambda/closure per partition, so when publishing to output systems such as message brokers, databases, and file systems, it increases the level of parallelism of your output processing and lets per-partition setup (e.g. a connection) be reused across all records in the partition.

As an architect I deal with gazillions of products and don't have time to read the source code of all of them to make up for documentation deficiencies. On the other hand, I believe you have been involved in writing some of the code, so please either answer this question properly or enhance the product documentation for that area of the system.

From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday, July 8, 2015 2:52 PM
To: dgoldenberg; user@spark.apache.org
Subject: Re: foreachRDD vs. forearchPartition ?

These are quite different operations. One operates on the RDDs in a DStream and one operates on the partitions of an RDD. They are not alternatives.

On Wed, Jul 8, 2015, 2:43 PM dgoldenberg <dgoldenberg...@gmail.com> wrote:

Is there a set of best practices for when to use foreachPartition vs. foreachRDD? Is it generally true that using foreachPartition avoids some of the over-the-network data-shuffling overhead? When would I definitely want to use one method vs. the other?

Thanks.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
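The per-partition output pattern described above can be sketched with the same kind of mock, again pure Python rather than Spark's implementation: foreach invokes the function once per element, while foreachPartition invokes it once per partition with an iterator, so anything opened inside the function (here a hypothetical "connection" counter) is created once per partition instead of once per record.

```python
# Illustrative mock, not Spark's API: contrasts per-element vs
# per-partition invocation of a user function.
class MockRDD:
    def __init__(self, partitions):
        self.partitions = partitions  # list of lists

    def foreach(self, f):
        # f runs once per element, across all partitions
        for part in self.partitions:
            for record in part:
                f(record)

    def foreachPartition(self, f):
        # f runs once per partition, receiving an iterator of its records
        for part in self.partitions:
            f(iter(part))

element_calls = []       # one entry per foreach invocation
connections_opened = 0   # incremented once per partition
records_sent = 0

rdd = MockRDD([[1, 2, 3], [4, 5], [6]])

# foreach: the function fires once per record (6 times here)
rdd.foreach(element_calls.append)

def publish_partition(records):
    global connections_opened, records_sent
    # one "connection" per partition, reused for all its records
    connections_opened += 1
    for r in records:
        records_sent += 1

# foreachPartition: 3 partitions -> 3 connections, 6 records sent
rdd.foreachPartition(publish_partition)
```

This is why the connection-per-partition idiom is the usual choice for pushing stream output to brokers or databases: the setup cost is paid per partition, not per record.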