Yeah, if one were to replace the objective function in a decision tree with minimizing the variance of the leaf nodes, it would be a hierarchical clusterer.
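For concreteness, here is a minimal single-machine sketch of that idea in plain Scala (this is not MLlib's DecisionTree API; the exhaustive 1-D threshold search, the names, and the stopping rule are all just illustrative):

object VarianceSplit {
  type Point = Array[Double]

  // Sum of squared deviations from the mean, over all dimensions.
  def sse(points: Seq[Point]): Double = {
    if (points.isEmpty) return 0.0
    val dim  = points.head.length
    val mean = Array.tabulate(dim)(d => points.map(_(d)).sum / points.size)
    points.map(p => (0 until dim).map(d => math.pow(p(d) - mean(d), 2)).sum).sum
  }

  // Try every (feature, threshold) pair and keep the split that
  // minimizes the summed variance (SSE) of the two children.
  def bestSplit(points: Seq[Point]): Option[(Int, Double)] = {
    if (points.isEmpty) return None
    val candidates = for {
      d <- 0 until points.head.length
      // exclude the max value so both sides stay non-empty
      t <- points.map(_(d)).distinct.sorted.init
    } yield {
      val (left, right) = points.partition(_(d) <= t)
      ((d, t), sse(left) + sse(right))
    }
    if (candidates.isEmpty) None else Some(candidates.minBy(_._2)._1)
  }

  // Recurse until a leaf is small enough; each leaf is one cluster and
  // the recursion tree itself is the hierarchy.
  def cluster(points: Seq[Point], minLeaf: Int = 5): Seq[Seq[Point]] =
    bestSplit(points) match {
      case Some((d, t)) if points.size > minLeaf =>
        val (left, right) = points.partition(_(d) <= t)
        cluster(left, minLeaf) ++ cluster(right, minLeaf)
      case _ => Seq(points)
    }
}

It's exactly a regression tree's variance-reduction split, just with no target variable.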
On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

> If you're thinking along these lines, have a look at the DecisionTree
> implementation in MLlib. It uses the same idea and is optimized to
> prevent multiple passes over the data by computing several splits at
> each level of tree building. The tradeoff is increased model state and
> computation per pass over the data, but fewer total passes and
> hopefully lower communication overheads than, say, shuffling data
> around that belongs to one cluster or another. Something like that
> could work here as well.
>
> I'm not super-familiar with hierarchical K-Means, so perhaps there's a
> more efficient way to implement it, though.
>
> On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee <hector....@gmail.com> wrote:
>
> > No, was thinking more top-down:
> >
> > assuming a distributed kmeans system already existing, recursively
> > apply the kmeans algorithm on data already partitioned by the
> > previous level of kmeans.
> >
> > I haven't been much of a fan of bottom-up approaches like HAC,
> > mainly because they assume there is already a distance metric from
> > items to items. This makes it hard to cluster new content. The
> > distances between sibling clusters are also hard to compute (if you
> > have thrown away the similarity matrix): do you count paths to the
> > same parent node when computing distances between items in two
> > adjacent nodes, for example? It is also a bit harder to distribute
> > the computation for bottom-up approaches, as you have to already
> > find the nearest neighbor to an item to begin the process.
> >
> > On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling <rnowl...@gmail.com> wrote:
> >
> > > The scikit-learn implementation may be of interest:
> > >
> > > http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
> > >
> > > It's a bottom-up approach. The pair of clusters to merge is chosen
> > > to minimize variance.
> > >
> > > Their code is under a BSD license, so it can be used as a template.
> > >
> > > Is something like that what you were thinking, Hector?
> > >
> > > On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> > >
> > > > Sure. The more interesting problem here is choosing k at each
> > > > level. Kernel methods seem to be most promising.
> > > >
> > > > On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <hector....@gmail.com> wrote:
> > > >
> > > > > No idea, never looked it up. Always just implemented it as
> > > > > doing k-means again on each cluster.
> > > > >
> > > > > FWIW standard k-means with Euclidean distance has problems too
> > > > > with some dimensionality reduction methods. Swapping out the
> > > > > distance metric with negative dot or cosine may help.
> > > > >
> > > > > Another useful clustering method would be hierarchical SVD. The
> > > > > reason why I like hierarchical clustering is it makes for
> > > > > faster inference, especially over billions of users.
> > > > >
> > > > > On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> > > > >
> > > > > > Hector, could you share the references for hierarchical
> > > > > > K-means? Thanks.
> > > > > >
> > > > > > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <hector....@gmail.com> wrote:
> > > > > >
> > > > > > > I would say for big data applications the most useful would
> > > > > > > be hierarchical k-means with backtracking and the ability
> > > > > > > to support k nearest centroids.
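A rough sketch of this top-down scheme against MLlib's KMeans, for concreteness (the fixed branching factor k, the depth cutoff, and the Node shape are arbitrary choices here, and a real implementation would cache each subset rather than re-filtering the parent RDD):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object TopDownKMeans {
  // One node of the cluster tree: a centroid plus optional refinements.
  case class Node(center: Vector, children: Seq[Node])

  // Run k-means, then recurse on the points assigned to each centroid.
  def build(data: RDD[Vector], k: Int, depth: Int): Seq[Node] = {
    val model = KMeans.train(data, k, 20) // 20 = maxIterations
    model.clusterCenters.zipWithIndex.map { case (center, i) =>
      val subset = data.filter(v => model.predict(v) == i)
      val children =
        if (depth <= 1 || subset.count() <= k) Seq.empty[Node]
        else build(subset, k, depth - 1)
      Node(center, children)
    }.toSeq
  }
}

The filter-per-cluster recursion is obviously pass-heavy; batching all the splits at a given level into a single pass over the data, as Evan describes for DecisionTree above, would be the natural optimization.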
> > > > > > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowl...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > MLlib currently has one clustering algorithm
> > > > > > > > implementation, KMeans. It would benefit from having
> > > > > > > > implementations of other clustering algorithms such as
> > > > > > > > MiniBatch KMeans, Fuzzy C-Means, Hierarchical Clustering,
> > > > > > > > and Affinity Propagation.
> > > > > > > >
> > > > > > > > I recently submitted a PR [1] for a MiniBatch KMeans
> > > > > > > > implementation, and I saw an email on this list about
> > > > > > > > interest in implementing Fuzzy C-Means.
> > > > > > > >
> > > > > > > > Based on Sean Owen's review of my MiniBatch KMeans code,
> > > > > > > > it became apparent that before I implement more
> > > > > > > > clustering algorithms, it would be useful to hammer out
> > > > > > > > a framework to reduce code duplication and implement a
> > > > > > > > consistent API.
> > > > > > > >
> > > > > > > > I'd like to gauge the interest and goals of the MLlib
> > > > > > > > community:
> > > > > > > >
> > > > > > > > 1. Are you interested in having more clustering
> > > > > > > > algorithms available?
> > > > > > > >
> > > > > > > > 2. Is the community interested in specifying a common
> > > > > > > > framework?
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > RJ
> > > > > > > >
> > > > > > > > [1] - https://github.com/apache/spark/pull/1248
> > > > > > > >
> > > > > > > > --
> > > > > > > > em rnowl...@gmail.com
> > > > > > > > c 954.496.2314

--
Yee Yang Li Hector <http://google.com/+HectorYee>
*google.com/+HectorYee <http://google.com/+HectorYee>*