You'll benefit from viewing Matei's talk at Yahoo on Spark internals and how
it optimizes the execution of iterative jobs.
The short answer is:
1. Spark doesn't materialize an RDD when you define an iteration; it lazily
captures the transformation functions in the RDD (only the function and its
closure are recorded; no data operation actually happens).
2. When you finally execute an action that causes effects (saving to disk,
collecting on the master, etc.), Spark views the DAG of execution and
optimizes what it can reason about: eliminating intermediate states,
performing multiple transformations in one task, and leveraging partitioning
where available, among others (see the sketch below).
Bottom line: it doesn't matter how many RDDs you have in your DAG chain, as
long as Spark can optimize the functions in that DAG to materialize as little
as possible on its way to the final output.
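
In practice this means an iterative job can simply rebind a new RDD each
pass; each "new" RDD is just lineage metadata until an action runs. A rough
sketch of that pattern follows. The Assigned record and updateAssignments
function are placeholders, and the persist/unpersist handling is one common
way to bound memory across iterations, not something prescribed in this
thread:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    object IterativeSketch {
      // Placeholder record: a point plus its current cluster assignment.
      case class Assigned(vector: Array[Double], cluster: Int, distance: Double)

      // Placeholder for one pass of a (hypothetical) clustering update.
      def updateAssignments(data: RDD[Assigned],
                            centers: Array[Array[Double]]): RDD[Assigned] =
        data.map { a =>
          // ... recompute the nearest center; shown as identity for brevity
          a
        }

      def iterate(initial: RDD[Assigned],
                  centers: Array[Array[Double]],
                  passes: Int): RDD[Assigned] = {
        var current = initial.persist(StorageLevel.MEMORY_ONLY)
        for (_ <- 1 to passes) {
          val next = updateAssignments(current, centers)
            .persist(StorageLevel.MEMORY_ONLY)
          next.count()        // an action, so `next` is actually materialized
          current.unpersist() // drop the previous iteration's cached data
          current = next      // rebinding is cheap: only lineage metadata changes
        }
        current
      }
    }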

Regards
Mayur
 On 06-Dec-2014 6:12 pm, "Ron Ayoub" <ronalday...@live.com> wrote:

> This is from a separate thread with a different title.
>
> Why can't you modify the actual contents of an RDD using forEach? It
> appears to be working for me. What I'm doing is changing cluster
> assignments and distances per data item for each iteration of the
> clustering algorithm. The clustering algorithm is massive and iterates
> thousands of times. As I understand it now, you are supposed to create new
> RDDs on each pass. This is a hierarchical k-means that I'm doing, and hence
> it consists of many small iterations rather than a few large ones.
>
> So I understand the restriction that operations used when aggregating,
> reducing, etc. need to be associative. However, forEach operates on a single
> item. Given that Spark is advertised as being great for iterative
> algorithms since it operates in-memory, how can it be good to create
> thousands upon thousands of RDDs during the course of an iterative
> algorithm? Does Spark have some kind of trick, like reuse behind the
> scenes (fully persistent data objects or whatever)? How can it possibly be
> efficient for 'iterative' algorithms when it is creating so many RDDs as
> opposed to one?
>
> Or is the answer that I should keep doing what I'm doing because it is
> working, even though it is not theoretically sound or aligned with
> functional ideas? I personally just want it to be fast and to be able to
> operate on up to 500 million data items.
>
