Re: Contributing to MLlib: Proposal for Clustering Algorithms

Hector Yee Tue, 08 Jul 2014 14:07:46 -0700

No was thinking more top-down:

assuming a distributed kmeans system already existing, recursively apply
the kmeans algorithm on data already partitioned by the previous level of
kmeans.


I haven't been much of a fan of bottom up approaches like HAC mainly
because they assume there is already a distance metric for items to items.
This makes it hard to cluster new content. The distances between sibling
clusters is also hard to compute (if you have thrown away the similarity
matrix), do you count paths to same parent node if you are computing
distances between items in two adjacent nodes for example. It is also a bit
harder to distribute the computation for bottom up approaches as you have
to already find the nearest neighbor to an item to begin the process.


On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling <[email protected]> wrote:

> The scikit-learn implementation may be of interest:
>
>
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
>
> It's a bottom up approach.  The pair of clusters for merging are
> chosen to minimize variance.
>
> Their code is under a BSD license so it can be used as a template.
>
> Is something like that you were thinking Hector?
>
> On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
> > sure. more interesting problem here is choosing k at each level. Kernel
> > methods seem to be most promising.
> >
> >
> > On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <[email protected]> wrote:
> >
> >> No idea, never looked it up. Always just implemented it as doing k-means
> >> again on each cluster.
> >>
> >> FWIW standard k-means with euclidean distance has problems too with some
> >> dimensionality reduction methods. Swapping out the distance metric with
> >> negative dot or cosine may help.
> >>
> >> Other more useful clustering would be hierarchical SVD. The reason why I
> >> like hierarchical clustering is it makes for faster inference especially
> >> over billions of users.
> >>
> >>
> >> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <[email protected]>
> >> wrote:
> >>
> >> > Hector, could you share the references for hierarchical K-means?
> thanks.
> >> >
> >> >
> >> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <[email protected]>
> wrote:
> >> >
> >> > > I would say for bigdata applications the most useful would be
> >> > hierarchical
> >> > > k-means with back tracking and the ability to support k nearest
> >> > centroids.
> >> > >
> >> > >
> >> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <[email protected]>
> >> wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > MLlib currently has one clustering algorithm implementation,
> KMeans.
> >> > > > It would benefit from having implementations of other clustering
> >> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> >> > > > Clustering, and Affinity Propagation.
> >> > > >
> >> > > > I recently submitted a PR [1] for a MiniBatch KMeans
> implementation,
> >> > > > and I saw an email on this list about interest in implementing
> Fuzzy
> >> > > > C-Means.
> >> > > >
> >> > > > Based on Sean Owen's review of my MiniBatch KMeans code, it became
> >> > > > apparent that before I implement more clustering algorithms, it
> would
> >> > > > be useful to hammer out a framework to reduce code duplication and
> >> > > > implement a consistent API.
> >> > > >
> >> > > > I'd like to gauge the interest and goals of the MLlib community:
> >> > > >
> >> > > > 1. Are you interested in having more clustering algorithms
> available?
> >> > > >
> >> > > > 2. Is the community interested in specifying a common framework?
> >> > > >
> >> > > > Thanks!
> >> > > > RJ
> >> > > >
> >> > > > [1] - https://github.com/apache/spark/pull/1248
> >> > > >
> >> > > >
> >> > > > --
> >> > > > em [email protected]
> >> > > > c 954.496.2314
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Yee Yang Li Hector <http://google.com/+HectorYee>
> >> > > *google.com/+HectorYee <http://google.com/+HectorYee>*
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Yee Yang Li Hector <http://google.com/+HectorYee>
> >> *google.com/+HectorYee <http://google.com/+HectorYee>*
> >>
>
>
>
> --
> em [email protected]
> c 954.496.2314
>



-- 
Yee Yang Li Hector <http://google.com/+HectorYee>
*google.com/+HectorYee <http://google.com/+HectorYee>*

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Reply via email to