I'm still wrapping my head around the fact that the data backing an RDD is immutable, since an RDD may need to be reconstructed from its lineage at any point. In the context of clustering there are many iterations in which an RDD effectively needs to change (cluster assignments, for instance) based on a broadcast variable holding a list of centroids, which are objects that in turn contain a list of features. Immutability is all well and good for the purpose of being able to replay a lineage. But during each iteration the RDD goes through many transformations that depend on that broadcast variable of centroids, and the centroids themselves are mutable. How would Spark replay the lineage in that case? Does a dependency on mutable variables break the whole lineage mechanism?
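
For concreteness, here's roughly the pattern I have in mind: a minimal k-means-style sketch (the data and names are made up, and I may be getting the idiom wrong). My understanding is that each iteration creates a new broadcast variable and a new RDD, and nothing already broadcast is ever mutated afterward, so the lineage of any given RDD still refers only to immutable inputs:

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastLineageSketch {
      def squaredDist(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("broadcast-lineage").setMaster("local[*]"))

        // Each point is a feature vector.
        val points = sc.parallelize(Seq(
          Array(1.0, 1.0), Array(1.5, 2.0), Array(8.0, 8.0), Array(9.0, 9.5)
        )).cache()

        // Driver-side state; reassigned each iteration, never mutated in place.
        var centroids = Array(Array(1.0, 1.0), Array(9.0, 9.0))

        for (i <- 0 until 10) {
          // A *new* broadcast variable per iteration; once created, its value
          // never changes, so an RDD derived from it can always be replayed.
          val bc = sc.broadcast(centroids)

          // assignments is a new RDD whose lineage records "map over `points`
          // using this iteration's broadcast": a deterministic function of
          // immutable inputs.
          val assignments = points.map { p =>
            val cs = bc.value
            val nearest = cs.indices.minBy(j => squaredDist(p, cs(j)))
            (nearest, (p, 1L))
          }

          // Recompute centroids on the driver (sketch ignores empty clusters);
          // the old broadcast is left untouched.
          centroids = assignments
            .reduceByKey { case ((a, na), (b, nb)) =>
              (a.zip(b).map { case (x, y) => x + y }, na + nb)
            }
            .map { case (_, (sum, n)) => sum.map(_ / n) }
            .collect()
        }

        println(centroids.map(_.mkString("(", ", ", ")")).mkString(" "))
        sc.stop()
      }
    }

If that's the right way to structure it, then I suppose the "mutability" lives only on the driver, and each RDD's lineage pins down a specific immutable broadcast value. But I'd appreciate confirmation that this is actually how it works.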
Any help appreciated. Just trying to wrap my head around using Spark correctly. I will say there does seem to be a common misconception that Spark RDDs are in-memory arrays, but perhaps that's for a reason. Perhaps in some cases an option for mutability, accepting that a failure would mean recomputing from scratch rather than replaying a lineage, is exactly what's needed for a one-off algorithm that doesn't necessarily need resiliency. Just a thought.