Cool, seems like a good initiative. Adding a couple of extra high-quality clustering implementations will be great.
I'd say it would make the most sense to submit a PR for the standardised API first, agree on it with everyone, and then build on it for the specific implementations.

— Sent from Mailbox

On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling <rnowl...@gmail.com> wrote:
> Thanks everyone for the input.
>
> So it seems what people want is:
> * Implement MiniBatch KMeans and Hierarchical KMeans (divide-and-conquer
>   approach; look at the DecisionTree implementation as a reference)
> * Restructure the 3 KMeans clustering algorithm implementations to
>   prevent code duplication and conform to a consistent API where possible
>
> If this is correct, I'll start work on that. How would it be best to
> structure it? Should I submit separate JIRAs / PRs for the refactoring of
> the current code, MiniBatch KMeans, and Hierarchical KMeans, or keep my
> current JIRA and PR for MiniBatch KMeans open and submit a second JIRA
> and PR for Hierarchical KMeans that builds on top of it?
>
> Thanks!
>
> On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee <hector....@gmail.com> wrote:
>> Yeah, if one were to replace the objective function in decision tree
>> with minimizing the variance of the leaf nodes, it would be a
>> hierarchical clusterer.
>>
>> On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>> If you're thinking along these lines, have a look at the DecisionTree
>>> implementation in MLlib. It uses the same idea and is optimized to
>>> prevent multiple passes over the data by computing several splits at
>>> each level of tree building. The tradeoff is increased model state and
>>> computation per pass over the data, but fewer total passes and
>>> hopefully lower communication overheads than, say, shuffling data
>>> around that belongs to one cluster or another. Something like that
>>> could work here as well.
>>>
>>> I'm not super-familiar with hierarchical K-Means, so perhaps there's
>>> a more efficient way to implement it, though.
>>>
>>> On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee <hector....@gmail.com> wrote:
>>>> No, I was thinking more top-down:
>>>>
>>>> Assuming a distributed k-means system already exists, recursively
>>>> apply the k-means algorithm to data already partitioned by the
>>>> previous level of k-means.
>>>>
>>>> I haven't been much of a fan of bottom-up approaches like HAC, mainly
>>>> because they assume there is already an item-to-item distance metric.
>>>> This makes it hard to cluster new content. The distances between
>>>> sibling clusters are also hard to compute (if you have thrown away
>>>> the similarity matrix): do you count paths to the same parent node
>>>> when computing distances between items in two adjacent nodes, for
>>>> example? It is also a bit harder to distribute the computation for
>>>> bottom-up approaches, as you already have to find the nearest
>>>> neighbor of an item to begin the process.
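A minimal sketch of the top-down recursion described above, written spark-shell style against the MLlib API; ClusterNode, the fixed 20 iterations, and the depth cutoff are illustrative assumptions, and small or empty clusters are ignored for brevity:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Hypothetical tree node; not part of MLlib.
    case class ClusterNode(center: Vector, children: Seq[ClusterNode])

    // Run k-means, then recurse on the subset of points assigned to
    // each center: one level of the tree per call.
    def hierarchicalKMeans(data: RDD[Vector], k: Int, depth: Int): Seq[ClusterNode] = {
      val model = KMeans.train(data, k, 20) // 20 iterations: illustrative
      model.clusterCenters.zipWithIndex.map { case (center, i) =>
        val children =
          if (depth <= 1) Seq.empty[ClusterNode]
          else {
            val subset = data.filter(v => model.predict(v) == i).cache()
            hierarchicalKMeans(subset, k, depth - 1)
          }
        ClusterNode(center, children)
      }.toSeq
    }

As Evan notes, a production version would likely mirror DecisionTree and grow every node of a level in a single pass over the data, trading extra model state for fewer scans, rather than recursing cluster by cluster.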
>>>> On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling <rnowl...@gmail.com> wrote:
>>>>> The scikit-learn implementation may be of interest:
>>>>>
>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
>>>>>
>>>>> It's a bottom-up approach. The pair of clusters to merge is chosen
>>>>> to minimize variance.
>>>>>
>>>>> Their code is under a BSD license, so it can be used as a template.
>>>>>
>>>>> Is something like that what you were thinking, Hector?
>>>>>
>>>>> On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>>> Sure. The more interesting problem here is choosing k at each
>>>>>> level. Kernel methods seem to be the most promising.
>>>>>>
>>>>>> On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <hector....@gmail.com> wrote:
>>>>>>> No idea, never looked it up. I've always just implemented it by
>>>>>>> doing k-means again on each cluster.
>>>>>>>
>>>>>>> FWIW, standard k-means with Euclidean distance also has problems
>>>>>>> with some dimensionality-reduction methods. Swapping out the
>>>>>>> distance metric for negative dot product or cosine may help.
>>>>>>>
>>>>>>> Another useful approach would be hierarchical SVD. The reason I
>>>>>>> like hierarchical clustering is that it makes for faster
>>>>>>> inference, especially over billions of users.
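A sketch of the metric swap Hector mentions: cosine distance as a drop-in alternative to squared Euclidean in the assignment step. The function is illustrative, not MLlib code:

    import org.apache.spark.mllib.linalg.Vector

    // Cosine distance: 0 for parallel vectors, up to 2 for opposite
    // ones. Assumes neither vector has zero norm.
    def cosineDistance(a: Vector, b: Vector): Double = {
      val x = a.toArray
      val y = b.toArray
      val dot = x.zip(y).map { case (p, q) => p * q }.sum
      val norms = math.sqrt(x.map(v => v * v).sum) *
                  math.sqrt(y.map(v => v * v).sum)
      1.0 - dot / norms
    }

Note that under cosine the plain cluster mean is no longer the optimal center; spherical k-means re-normalizes centers after each update, so a pluggable metric touches the update step as well as the assignment step.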
>>>>>>> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>>>>> Hector, could you share the references for hierarchical K-means?
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <hector....@gmail.com> wrote:
>>>>>>>>> I would say that for big-data applications the most useful would
>>>>>>>>> be hierarchical k-means with backtracking and the ability to
>>>>>>>>> support the k nearest centroids.
>>>>>>>>>
>>>>>>>>> On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowl...@gmail.com> wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> MLlib currently has one clustering algorithm implementation,
>>>>>>>>>> KMeans. It would benefit from having implementations of other
>>>>>>>>>> clustering algorithms such as MiniBatch KMeans, Fuzzy C-Means,
>>>>>>>>>> Hierarchical Clustering, and Affinity Propagation.
>>>>>>>>>>
>>>>>>>>>> I recently submitted a PR [1] for a MiniBatch KMeans
>>>>>>>>>> implementation, and I saw an email on this list about interest
>>>>>>>>>> in implementing Fuzzy C-Means.
>>>>>>>>>>
>>>>>>>>>> Based on Sean Owen's review of my MiniBatch KMeans code, it
>>>>>>>>>> became apparent that before I implement more clustering
>>>>>>>>>> algorithms, it would be useful to hammer out a framework to
>>>>>>>>>> reduce code duplication and implement a consistent API.
>>>>>>>>>>
>>>>>>>>>> I'd like to gauge the interest and goals of the MLlib community:
>>>>>>>>>>
>>>>>>>>>> 1. Are you interested in having more clustering algorithms
>>>>>>>>>> available?
>>>>>>>>>>
>>>>>>>>>> 2. Is the community interested in specifying a common framework?
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> RJ
>>>>>>>>>>
>>>>>>>>>> [1] - https://github.com/apache/spark/pull/1248
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> em rnowl...@gmail.com
>>>>>>>>>> c 954.496.2314
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Yee Yang Li Hector <http://google.com/+HectorYee>
>>>>>>>>> *google.com/+HectorYee <http://google.com/+HectorYee>*
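To make the "common framework" question concrete, here is one possible shape for a shared clustering API; the trait and method names are purely illustrative, not what the standardisation PR would necessarily propose:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // A fitted model only needs to assign a cluster id to a point...
    trait ClusteringModel extends Serializable {
      def predict(point: Vector): Int
    }

    // ...and an algorithm turns an RDD of feature vectors into a model.
    trait ClusteringAlgorithm[M <: ClusteringModel] {
      def run(data: RDD[Vector]): M
    }

KMeans, MiniBatch KMeans, and a hierarchical variant could then share input validation, initialization, and test scaffolding behind a single surface, addressing the code duplication raised in review.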