Ron,
“Appears to be working” might be true as long as there are no failures. On large 
datasets processed across a large number of machines, failures of several 
types (server, network, disk, etc.) can happen. When they do, Spark will not 
“know” that you changed the RDD in place, and a retried task may recompute a 
partition of the RDD from its lineage, on top of data you have already mutated. 
Retries require idempotency, and that is difficult without immutability. I 
believe this is one of the primary reasons RDDs are immutable in Spark 
(mutability isn't even an option worth considering). In general, mutating state 
in a distributed system is a hard problem. It can be solved (e.g., in NoSQL or 
NewSQL databases), but Spark is not a transactional data store.
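
To make that concrete, here is a minimal Scala sketch of the immutable pattern; 
the Item record and the nearest() helper are made-up names for illustration, 
not your actual code. The idea is to derive a new RDD with map() rather than 
mutate records inside foreach(), so a retried task recomputes exactly the same 
result:

import org.apache.spark.rdd.RDD

// Hypothetical record: an item, its current cluster id, and its distance
// to that cluster's centroid.
case class Item(vector: Array[Double], cluster: Int, dist: Double)

// Plain Euclidean distance to the nearest centroid (illustrative only).
def nearest(v: Array[Double], centroids: Array[Array[Double]]): (Int, Double) = {
  val dists = centroids.map { c =>
    math.sqrt(v.zip(c).map { case (a, b) => (a - b) * (a - b) }.sum)
  }
  val i = dists.indexOf(dists.min)
  (i, dists(i))
}

// Mutating items inside foreach assumes a partition is never recomputed:
//   items.foreach { it => it.cluster = ... }   // unsafe if the task is retried
// The idempotent alternative is to derive a new RDD each iteration:
def reassign(items: RDD[Item], centroids: Array[Array[Double]]): RDD[Item] =
  items.map { it =>
    val (c, d) = nearest(it.vector, centroids)
    it.copy(cluster = c, dist = d)
  }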

If you are building an iterative machine learning algorithm, which usually has 
a “reduce” step at the end of every iteration, then lazy evaluation is unlikely 
to buy you much, because every iteration forces materialization anyway. On the 
other hand, if these intermediate RDDs stay in the young generation of the JVM 
heap [I am not sure if RDD cache management somehow changes this, so I could be 
wrong], they are garbage collected quickly and with very little overhead.
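
For what it's worth, the driver loop for such an algorithm usually looks 
something like the sketch below (reusing the hypothetical Item and reassign 
from above); the cache()/unpersist() calls are just one reasonable way to keep 
only the latest RDD resident, not a prescription:

import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on Spark 1.x)
import org.apache.spark.rdd.RDD

def run(items: RDD[Item],
        initialCentroids: Array[Array[Double]],
        maxIters: Int): Array[Array[Double]] = {
  var centroids = initialCentroids
  var current   = items.cache()

  for (_ <- 1 to maxIters) {
    val next = reassign(current, centroids).cache()

    // The per-iteration "reduce": recompute centroids as per-cluster means.
    // This action is what actually materializes the new RDD.
    centroids = next
      .map(it => (it.cluster, (it.vector, 1L)))
      .reduceByKey { case ((s1, n1), (s2, n2)) =>
        (s1.zip(s2).map { case (a, b) => a + b }, n1 + n2)
      }
      .collect()
      .sortBy(_._1)
      .map { case (_, (sum, n)) => sum.map(_ / n) }

    current.unpersist()   // the superseded RDD becomes ordinary garbage
    current = next
  }
  centroids
}

Each pass creates a new RDD object on the driver, but that object is only a 
small piece of bookkeeping once its cached blocks are released.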

This is the price of scaling out :-)
        
Hope this helps,
Mohit.

> On Dec 6, 2014, at 5:02 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
> 
> You'll benefit from watching Matei's talk at Yahoo on Spark internals and how it 
> optimizes execution of iterative jobs.
> The simple answer is:
> 1. Spark doesn't materialize an RDD when you do an iteration; it lazily captures 
> the transformation functions in the RDD (only the function and its closure; no 
> data operation actually happens).
> 2. When you finally execute an action and want to cause effects (save to disk, 
> collect on the master, etc.), it looks at the DAG of execution and optimizes what 
> it can reason about (eliminating intermediate state, performing multiple 
> transformations in one task, and leveraging partitioning where available, among 
> other things).
> Bottom line: it doesn't matter how many RDDs you have in your DAG chain, as long 
> as Spark can optimize the functions in that DAG to create minimal 
> materialization on its way to the final output. 
> 
> Regards
> Mayur
> On 06-Dec-2014 6:12 pm, "Ron Ayoub" <ronalday...@live.com 
> <mailto:ronalday...@live.com>> wrote:
> This is from a separate thread with a differently named title. 
> 
> Why can't you modify the actual contents of an RDD using forEach? It appears 
> to be working for me. What I'm doing is changing cluster assignments and 
> distances per data item on each iteration of the clustering algorithm. The 
> clustering algorithm is massive and iterates thousands of times. As I 
> understand it now, you are supposed to create new RDDs on each pass. This is 
> a hierarchical k-means that I'm doing, and hence it consists of many small 
> iterations rather than a few large ones.
> 
> So I understand the restriction on why operations, when aggregating and 
> reducing etc., need to be associative. However, forEach operates on a single 
> item. So given that Spark is advertised as being great for iterative 
> algorithms since it operates in-memory, how can it be good to create 
> thousands upon thousands of RDDs during the course of an iterative algorithm? 
> Does Spark have some kind of trick, like reuse behind the scenes - fully 
> persistent data objects or whatever? How can it possibly be efficient for 
> 'iterative' algorithms when it is creating so many RDDs as opposed to one? 
> 
> Or is the answer that I should keep doing what I'm doing because it is 
> working, even though it is not theoretically sound or aligned with functional 
> ideas? I personally just want it to be fast and to be able to operate on up 
> to 500 million data items. 
