Very interesting, Barrie. Also interesting is the possibility of running fuzzy k-means clustering on the sketch produced by streaming k-means. That gives you single-pass fuzzy k-means.
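
To make the idea concrete, here is a minimal sketch in plain Python. It is not Mahout code and does not use any Mahout API. It assumes the streaming pass has already reduced the data to a small list of (centroid, weight) pairs, where the weight is the number of original points a centroid stands for. The function names, the fuzziness parameter m, and the fixed iteration count are all illustrative assumptions. The only change from ordinary fuzzy k-means is that each sketch point's contribution is multiplied by its weight.

```python
import math


def memberships(point, centers, m=2.0):
    """Fuzzy membership of one point in each cluster; the values sum to 1."""
    dists = [math.dist(point, c) for c in centers]
    # A point sitting exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    inverse = [d ** -exponent for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]


def weighted_fuzzy_kmeans(sketch, centers, m=2.0, iterations=25):
    """Run fuzzy k-means over a weighted sketch.

    sketch  -- list of (centroid, weight) pairs (assumed output of the streaming pass)
    centers -- initial cluster centers, one coordinate list per cluster
    """
    dim = len(centers[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in centers]
        totals = [0.0] * len(centers)
        for point, weight in sketch:
            for j, u in enumerate(memberships(point, centers, m)):
                # The sketch weight scales the usual u**m coefficient.
                coeff = weight * u ** m
                totals[j] += coeff
                for d in range(dim):
                    sums[j][d] += coeff * point[d]
        # Keep the old center if a cluster received no mass at all.
        centers = [
            [s / totals[j] for s in sums[j]] if totals[j] > 0 else list(centers[j])
            for j in range(len(centers))
        ]
    return centers


# Hypothetical sketch: four weighted centroids forming two groups.
sketch = [((0.0, 0.0), 10), ((0.0, 1.0), 10), ((10.0, 0.0), 5), ((10.0, 1.0), 5)]
final_centers = weighted_fuzzy_kmeans(sketch, [[1.0, 0.5], [9.0, 0.5]])
```

Since the sketch is tiny compared with the original data, these iterations run in memory and the full data set is read only once, during the streaming pass.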

On Thu, Aug 22, 2013 at 7:55 AM, B Kersbergen <kersberg...@gmail.com> wrote:

> Hi Ted,
>
> The streaming k-means in Mahout is very sweet, but I need fuzzy k-means.
> Converting Mahout's seed into a distributed algorithm allowed me to start
> fuzzy clustering gigabytes of data in a few seconds instead of hours.
> Maybe this is something other Mahout users also find interesting.
>
> You can find my changes here:
> https://github.com/bkersbergen/mahout
>
> Kind regards,
> Barrie Kersbergen
>
> 2013/8/15 Ted Dunning <ted.dunn...@gmail.com>
>
> > Look at the streaming k-means implementation. This heinous seeding
> > algorithm goes away entirely.
> >
> > Sent from my iPhone
> >
> > On Aug 14, 2013, at 13:35, B Kersbergen <kersberg...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > When (f)kmeans clustering 'large' or 'big' data sets with 'k' specified,
> > > depending on the characteristics of my dataset it takes about 0.5 to 12
> > > hours before my Mahout job is submitted to my Hadoop cluster.
> > > The Mahout source code shows that the big dataset is downloaded to my
> > > local machine (over wifi, running in vagrant) and centroids are sampled
> > > in a single thread and pushed to HDFS.
> > > To benefit from MapReduce and data locality, I've created a
> > > RandomSeedGeneratorDriver and integrated this in the MapReduce version
> > > of (f)kmeans clustering.
> > > This version does the sampling in a few minutes on a small Hadoop
> > > cluster.
> > >
> > > If you like, I would be happy to share my code.
> > >
> > > There are several ways to implement this and perhaps you don't favor
> > > its current implementation. I'd be happy to discuss this and of course
> > > make changes.
> > >
> > > Kind regards,
> > > Barrie Kersbergen