Hierarchical K-means requires a massive number of iterations whereas flat
K-means does not, but I've found flat clustering to be generally useless since
in most UIs it is nice to be able to drill down into more and more specific
clusters. If you have 100 million documents and your branching factor is 8
(8-secting K-means), then you will be picking a cluster to split and iterating
thousands of times. Each split takes maybe 6 or 7 iterations to settle on new
cluster assignments, and there will ultimately be somewhere between 5,000 and
50,000 splits depending on the split criterion, cluster variances, etc.
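
To make the scale concrete, the splitting loop I have in mind looks roughly
like the sketch below. This is only an illustration, not my actual code: it
leans on MLlib's KMeans for the flat step, the leaf-size threshold and the
class/method names are made up, and persisting/checkpointing of intermediate
RDDs is left out to keep it short.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class EightSectingSketch {

    static final int BRANCHING = 8;            // "8-secting" K-means
    static final int ITERS_PER_SPLIT = 7;      // roughly 6-7 passes per split
    static final long MAX_LEAF_SIZE = 2_000L;  // hypothetical split criterion

    // Repeatedly pick a cluster to split and run flat K-means (k = 8) on it.
    static List<JavaRDD<Vector>> cluster(JavaRDD<Vector> allDocs) {
        Deque<JavaRDD<Vector>> toSplit = new ArrayDeque<>();
        List<JavaRDD<Vector>> leaves = new ArrayList<>();
        toSplit.push(allDocs);

        while (!toSplit.isEmpty()) {
            JavaRDD<Vector> current = toSplit.pop();
            if (current.count() <= MAX_LEAF_SIZE) {
                leaves.add(current);           // small enough: stop splitting
                continue;
            }
            // One split = one flat K-means run over just this cluster's docs.
            KMeansModel model =
                KMeans.train(current.rdd(), BRANCHING, ITERS_PER_SPLIT);
            for (int c = 0; c < BRANCHING; c++) {
                final int child = c;
                toSplit.push(current.filter(v -> model.predict(v) == child));
            }
        }
        // In practice you would persist()/checkpoint() along the way so the
        // filter lineage doesn't get recomputed from scratch; omitted here.
        return leaves;
    }
}

With 100 million documents and leaf sizes in the low thousands, that loop works
out to thousands of KMeans.train calls, each of which is itself iterative,
which is where the iteration counts above come from.
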
In this case fault tolerance doesn't matter. I've found that the distributed
aspect of RDDs is what I'm looking for; I don't particularly care about or need
the resilience part. It is a one-off algorithm that can simply be run again if
something goes wrong, and once the clustering is produced, Spark's job is done.
But anyway, iterative algorithms are the very thing Spark is advertised for.
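
For concreteness, here is roughly what a single assignment/update pass looks
like when you build new RDDs on each pass instead of updating items in place.
Again, just a sketch assuming Java 8 lambdas; the helper names (onePass,
nearest) are mine and not from any real API.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class KMeansPassSketch {

    // Hypothetical helper: index of the centroid closest to p (squared dist).
    static int nearest(double[] p, List<double[]> centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
            double[] c = centroids.get(i);
            double d = 0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - c[j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
    }

    // One pass: builds new RDDs from `points` rather than mutating anything,
    // and returns the next set of centroids.
    static List<double[]> onePass(JavaSparkContext sc,
                                  JavaRDD<double[]> points,
                                  List<double[]> centroids) {
        Broadcast<List<double[]>> bc = sc.broadcast(centroids);

        // (vector sum, count) per cluster. These intermediate RDDs are mostly
        // lineage bookkeeping; nothing is cached unless explicitly persisted.
        JavaPairRDD<Integer, Tuple2<double[], Long>> sums = points
            .mapToPair(p -> new Tuple2<>(nearest(p, bc.value()),
                                         new Tuple2<>(p, 1L)))
            .reduceByKey((a, b) -> {
                double[] s = new double[a._1().length];
                for (int i = 0; i < s.length; i++) s[i] = a._1()[i] + b._1()[i];
                return new Tuple2<>(s, a._2() + b._2());
            });

        Map<Integer, Tuple2<double[], Long>> agg = sums.collectAsMap();
        List<double[]> next = new ArrayList<>(centroids);
        agg.forEach((cluster, sumAndCount) -> {
            double[] mean = new double[sumAndCount._1().length];
            for (int i = 0; i < mean.length; i++) {
                mean[i] = sumAndCount._1()[i] / sumAndCount._2();
            }
            next.set(cluster, mean);
        });
        return next;
    }
}

Running something like this 6 or 7 times per split, across thousands of
splits, is where the RDD count in my earlier question comes from.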

> From: so...@cloudera.com
> Date: Sat, 6 Dec 2014 06:39:10 -0600
> Subject: Re: Java RDD Union
> To: ronalday...@live.com
> CC: user@spark.apache.org
> 
> I guess a major problem with this is that you lose fault tolerance.
> You have no way of recreating the local state of the mutable RDD if a
> partition is lost.
> 
> Why would you need thousands of RDDs for kmeans? It's a few per iteration.
> 
> An RDD is more bookkeeping than data structure, itself. They don't
> inherently take up resources, unless you mark them to be persisted.
> You're paying the cost of copying objects to create one RDD from the
> next, but that's mostly it.
> 
> On Sat, Dec 6, 2014 at 6:28 AM, Ron Ayoub <ronalday...@live.com> wrote:
> > With that said, and given the nature of the iterative algorithms that
> > Spark is advertised for, isn't this a bit of an unnecessary restriction?
> > I don't see where the problem is. For instance, it is clear that when
> > aggregating you need operations to be associative because of the way they
> > are divided and combined. But since forEach works on an individual item,
> > the same problem doesn't exist.
> >
> > As an example, during a k-means algorithm you have to continually update
> > cluster assignments per data item, along with perhaps the distance from
> > the centroid. So if you can't update items in place, you have to literally
> > create thousands upon thousands of RDDs. Does Spark have some kind of
> > trick, like reuse behind the scenes - fully persistent data objects or
> > whatever? How can it possibly be efficient for 'iterative' algorithms when
> > it is creating so many RDDs as opposed to one?
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
                                          
