Very interesting, Barrie.

Also interesting is the possibility of running fuzzy k-means clustering on
the sketch produced by streaming k-means.  That gives you single-pass
fuzzy k-means.
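
To make the idea concrete, here is a minimal sketch in plain Python, not
Mahout code. It assumes the streaming pass has already produced the sketch
as a list of (centroid, weight) pairs. The names memberships and
weighted_fuzzy_kmeans, the fuzziness m=2.0, and the fixed iteration count
are illustrative assumptions, not anything taken from Mahout.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def memberships(point, centers, m=2.0):
    """Fuzzy membership of one point in each center; the values sum to 1."""
    d = [sq_dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs only to those.
    zeros = [i for i, di in enumerate(d) if di == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]
    # Fuzzy c-means rule written for squared distances:
    # u_i is proportional to d_i ** (-1 / (m - 1)).
    inv = [di ** (-1.0 / (m - 1.0)) for di in d]
    total = sum(inv)
    return [v / total for v in inv]


def weighted_fuzzy_kmeans(sketch, k, m=2.0, iterations=20, seed=0):
    """Fuzzy k-means over a sketch given as (centroid, weight) pairs.

    Assumption: each weight is the number of original points that the
    streaming pass folded into that sketch centroid.
    """
    rng = random.Random(seed)
    centers = [list(c) for c, _ in rng.sample(sketch, k)]
    dim = len(centers[0])
    for _ in range(iterations):
        num = [[0.0] * dim for _ in range(k)]
        den = [0.0] * k
        for point, weight in sketch:
            u = memberships(point, centers, m)
            for j in range(k):
                # A sketch centroid counts as `weight` original points.
                w = weight * u[j] ** m
                den[j] += w
                for t in range(dim):
                    num[j][t] += w * point[t]
        # Keep the old center if nothing was assigned to it at all.
        centers = [
            [num[j][t] / den[j] for t in range(dim)] if den[j] > 0 else centers[j]
            for j in range(k)
        ]
    return centers
```

The only difference from ordinary fuzzy k-means is the weight factor in the
update, so the second stage only ever touches the sketch and the original
data is read once.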


On Thu, Aug 22, 2013 at 7:55 AM, B Kersbergen <kersberg...@gmail.com> wrote:

> Hi Ted,
>
> The streaming k-means in Mahout is very sweet, but I need fuzzy k-means.
> Converting Mahout's seeding into a distributed algorithm allowed me to start
> fuzzy clustering gigabytes of data in a few seconds instead of hours.
> Maybe this is something other Mahout users would also find interesting.
>
> You can find my changes here:
> https://github.com/bkersbergen/mahout
>
> Kind regards,
> Barrie Kersbergen
>
>
>
> 2013/8/15 Ted Dunning <ted.dunn...@gmail.com>
>
> > Look at the streaming k-means implementation.  This heinous seeding
> > algorithm goes away entirely.
> >
> > Sent from my iPhone
> >
> > On Aug 14, 2013, at 13:35, B Kersbergen <kersberg...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > When (f)kmeans clustering 'large' or 'big' datasets with 'k' specified,
> > > depending on the characteristics of my dataset it takes about 0.5 to 12
> > > hours before my Mahout job is submitted to my Hadoop cluster.
> > > The Mahout source code shows that the big dataset is downloaded to my local
> > > machine (over wifi, running in Vagrant) and centroids are sampled in a
> > > single thread and pushed to HDFS.
> > > To benefit from MapReduce and data locality, I've created a
> > > RandomSeedGeneratorDriver and integrated it into the MapReduce version of
> > > (f)kmeans clustering.
> > > This version does the sampling in a few minutes on a small Hadoop cluster.
> > >
> > > If you like, I would be happy to share my code.
> > >
> > > There are several ways to implement this and perhaps you don't favor its
> > > current implementation. I'd be happy to discuss this and of course make
> > > changes.
> > >
> > > Kind regards,
> > > Barrie Kersbergen
> >
>
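
The distributed seeding described in the original message can be sketched
roughly as follows. This is plain Python standing in for the map and reduce
steps, and it is one possible approach, not necessarily what
RandomSeedGeneratorDriver does: each partition tags its records with a
random key and keeps the k smallest, and a single merge step keeps the k
smallest overall. The names sample_partition and merge_samples are
hypothetical.

```python
import heapq
import random


def sample_partition(records, k, rng):
    """Map side: tag each record with a random key, keep the k smallest."""
    tagged = ((rng.random(), rec) for rec in records)
    return heapq.nsmallest(k, tagged, key=lambda t: t[0])


def merge_samples(partials, k):
    """Reduce side: the k smallest keys overall are the chosen seeds."""
    merged = heapq.nsmallest(
        k,
        (item for part in partials for item in part),
        key=lambda t: t[0],
    )
    return [rec for _, rec in merged]
```

Because every record gets an independent random key, the k smallest keys
overall form a uniform sample without replacement, and each partition only
needs to ship at most k records to the merge step.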
