This is from a separate thread with a different title.
Why can't you modify the actual contents of an RDD using foreach? It appears to 
be working for me: on each iteration of the clustering algorithm I update the 
cluster assignment and distance of every data item in place. The algorithm is 
large and iterates thousands of times. As I understand it now, you are supposed 
to create a new RDD on each pass. What I'm writing is a hierarchical k-means, 
so it consists of many small iterations rather than a few large ones.
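
To make the question concrete, here is a stripped-down sketch of the pattern I 
mean. The names (Point, MutateInPlace), the toy data, and the fixed centers are 
made up purely for illustration; this is not my actual code.

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative data item: mutable cluster assignment and distance.
case class Point(features: Array[Double],
                 var cluster: Int = -1,
                 var distance: Double = Double.MaxValue)

object MutateInPlace {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("mutate-in-place").setMaster("local[*]"))

    // Tiny made-up data set; the real one is hundreds of millions of items.
    val points = sc.parallelize(Seq(
      Point(Array(0.0, 0.1)), Point(Array(0.2, 0.0)),
      Point(Array(5.0, 5.1)), Point(Array(5.2, 4.9)))).cache()

    // Fixed toy centers; the real code recomputes them each pass.
    val centers = Array(Array(0.0, 0.0), Array(5.0, 5.0))

    for (_ <- 1 to 10) {
      val bc = sc.broadcast(centers)
      // Update cluster/distance *in place* on each cached element via foreach.
      // This only "sticks" because the cached, deserialized objects happen to
      // be reused between passes; nothing in the RDD contract guarantees it.
      points.foreach { p =>
        val d = bc.value.map(c =>
          math.sqrt(c.zip(p.features).map { case (a, b) => (a - b) * (a - b) }.sum))
        p.cluster = d.indexOf(d.min)
        p.distance = d.min
      }
      bc.unpersist()
    }
    sc.stop()
  }
}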
I understand why the operations used in aggregating, reducing, and so on need 
to be associative. foreach, however, operates on a single item at a time. Given 
that Spark is advertised as great for iterative algorithms because it works 
in memory, how can it be good to create thousands upon thousands of RDDs over 
the course of an iterative algorithm? Does Spark do something clever behind the 
scenes, such as structural reuse along the lines of fully persistent data 
structures? How can it possibly be efficient for iterative algorithms when it 
creates so many RDDs instead of one?
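
For contrast, the "create a new RDD each pass" style that I understand to be 
the intended usage would look roughly like the sketch below. Again the names 
and toy data are purely illustrative, and the cache/count/unpersist sequence is 
just one way I imagine keeping memory bounded, not something I've confirmed is 
the recommended pattern.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Immutable per-item state: each iteration derives a new RDD from the old one.
case class Assigned(features: Array[Double], cluster: Int, distance: Double)

object NewRddPerIteration {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("new-rdd-per-iteration").setMaster("local[*]"))

    // Toy data again, standing in for the real data set.
    val raw = sc.parallelize(Seq(
      Array(0.0, 0.1), Array(0.2, 0.0), Array(5.0, 5.1), Array(5.2, 4.9)))

    val centers = Array(Array(0.0, 0.0), Array(5.0, 5.0))
    var assigned: RDD[Assigned] =
      raw.map(f => Assigned(f, -1, Double.MaxValue)).cache()

    for (_ <- 1 to 10) {
      val bc = sc.broadcast(centers)
      val next = assigned.map { a =>
        val d = bc.value.map(c =>
          math.sqrt(c.zip(a.features).map { case (x, y) => (x - y) * (x - y) }.sum))
        a.copy(cluster = d.indexOf(d.min), distance = d.min)
      }.cache()
      next.count()          // materialize the new RDD before dropping the old one
      assigned.unpersist()  // release the previous copy so memory use stays bounded
      assigned = next
      // (the real algorithm would recompute the centers from `next` here)
      bc.unpersist()
    }
    sc.stop()
  }
}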
Or is the answer that I should keep doing what I'm doing, since it works, even 
though it isn't theoretically sound or aligned with functional ideas? 
Personally I just want it to be fast and able to handle up to 500 million data 
items.
