I guess a major problem with this is that you lose fault tolerance.
You have no way of recreating the local state of the mutable RDD if a
partition is lost.
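
To make that concrete, here is a rough sketch (sc is the usual shell
SparkContext; the path and variable names are placeholders, not from this
thread). Every RDD carries the lineage of transformations that produced it,
and that lineage is what Spark replays to rebuild a lost partition:

    // Each transformation records lineage; Spark can recompute a lost
    // partition of `parsed` by replaying textFile -> map. A mutated-in-place
    // RDD would have no such recipe to replay.
    val points = sc.textFile("hdfs://...")                 // placeholder path
    val parsed = points.map(_.split(",").map(_.toDouble))
    println(parsed.toDebugString)                          // prints the lineage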

Why would you need thousands of RDDs for k-means? It's a few per iteration.
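
For example, one iteration might look roughly like this (a sketch only:
`points` is assumed to be a cached RDD[Array[Double]], and `initialCentroids`,
`maxIterations`, and `closestCentroid` are hypothetical placeholders):

    // Each pass derives a handful of new RDDs (map, reduceByKey, mapValues)
    // from the same cached points RDD -- a few per iteration, not thousands.
    var centroids: Array[Array[Double]] = initialCentroids
    for (_ <- 1 to maxIterations) {
      val newCentroids = points
        .map(p => (closestCentroid(p, centroids), (p, 1)))
        .reduceByKey { case ((s1, n1), (s2, n2)) =>
          (s1.zip(s2).map { case (a, b) => a + b }, n1 + n2)
        }
        .mapValues { case (sum, n) => sum.map(_ / n) }
        .collectAsMap()
      centroids = centroids.indices
        .map(j => newCentroids.getOrElse(j, centroids(j))).toArray
    }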

An RDD is more bookkeeping than data structure, in itself. RDDs don't
inherently take up resources unless you mark them to be persisted.
You're paying the cost of copying objects to create one RDD from the
next, but that's mostly it.
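
As a sketch (reusing `parsed` from above; names are placeholders): defining
RDDs just adds nodes to the lineage graph, and memory is only used for RDDs
you explicitly persist, once an action forces computation:

    // Cheap bookkeeping: these lines only build the lineage graph,
    // nothing is computed or stored yet.
    val doubled  = parsed.map(_.map(_ * 2))
    val filtered = doubled.filter(_.nonEmpty)
    // Only a persisted RDD occupies memory, and only after an action runs.
    filtered.cache()
    println(filtered.count())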

On Sat, Dec 6, 2014 at 6:28 AM, Ron Ayoub <ronalday...@live.com> wrote:
> With that said, and given the iterative algorithms that Spark is
> advertised for, isn't this a bit of an unnecessary restriction? I don't
> see where the problem is. For instance, it is clear that when aggregating
> you need operations to be associative because of the way they are divided
> and combined. But since forEach works on an individual item, the same
> problem doesn't exist.
>
> As an example, during a k-means algorithm you have to continually update
> cluster assignments per data item, along with perhaps distance from the
> centroid. So if you can't update items in place, you have to literally
> create thousands upon thousands of RDDs. Does Spark have some kind of
> trick, like reuse behind the scenes - fully persistent data objects or
> whatever? How can it possibly be efficient for 'iterative' algorithms
> when it is creating so many RDDs as opposed to one?
