Might be worth checking out scikit-learn and Mahout to get some broad ideas.

—
Sent from Mailbox
On Thu, Jul 10, 2014 at 4:25 PM, RJ Nowling <rnowl...@gmail.com> wrote:

> I went ahead and created JIRAs.
>
> JIRA for Hierarchical Clustering:
> https://issues.apache.org/jira/browse/SPARK-2429
>
> JIRA for Standardized Clustering APIs:
> https://issues.apache.org/jira/browse/SPARK-2430
>
> Before submitting a PR for the standardized API, I want to implement a
> few clustering algorithms myself to get a good "feel" for how to
> structure such an API. If others with more experience want to dive into
> designing the API, though, that would allow us to get moving more
> quickly.
>
> On Wed, Jul 9, 2014 at 8:39 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>
>> Cool, seems like a good initiative. Adding a couple of extra
>> high-quality clustering implementations will be great.
>>
>> I'd say it would make most sense to submit a PR for the standardized
>> API first, agree on that with everyone, and then build on it for the
>> specific implementations.
>>
>> —
>> Sent from Mailbox
>>
>> On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling <rnowl...@gmail.com> wrote:
>>
>>> Thanks everyone for the input.
>>>
>>> So it seems what people want is:
>>>
>>> * Implement MiniBatch KMeans and Hierarchical KMeans (divide-and-conquer
>>>   approach; look at the DecisionTree implementation as a reference)
>>> * Restructure the three k-means clustering implementations to prevent
>>>   code duplication and conform to a consistent API where possible
>>>
>>> If this is correct, I'll start work on that. How would it be best to
>>> structure it? Should I submit separate JIRAs / PRs for the refactoring
>>> of the current code, MiniBatch KMeans, and Hierarchical KMeans, or
>>> keep my current JIRA and PR for MiniBatch KMeans open and submit a
>>> second JIRA and PR for Hierarchical KMeans that builds on top of it?
>>>
>>> Thanks!
>>>
>>> On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee <hector....@gmail.com> wrote:
>>>
>>>> Yeah, if one were to replace the objective function in decision tree
>>>> with minimizing the variance of the leaf nodes, it would be a
>>>> hierarchical clusterer.
>>>>
>>>> On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>>>
>>>>> If you're thinking along these lines, have a look at the DecisionTree
>>>>> implementation in MLlib. It uses the same idea and is optimized to
>>>>> prevent multiple passes over the data by computing several splits at
>>>>> each level of tree building. The tradeoff is increased model state
>>>>> and computation per pass over the data, but fewer total passes and
>>>>> hopefully lower communication overheads than, say, shuffling data
>>>>> around that belongs to one cluster or another. Something like that
>>>>> could work here as well.
>>>>>
>>>>> I'm not super-familiar with hierarchical k-means, so perhaps there's
>>>>> a more efficient way to implement it, though.
>>>>>
>>>>> On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee <hector....@gmail.com> wrote:
>>>>>
>>>>>> No, I was thinking more top-down: assuming a distributed k-means
>>>>>> system already exists, recursively apply the k-means algorithm to
>>>>>> data already partitioned by the previous level of k-means.
>>>>>>
>>>>>> I haven't been much of a fan of bottom-up approaches like HAC,
>>>>>> mainly because they assume there is already an item-to-item
>>>>>> distance metric. This makes it hard to cluster new content. The
>>>>>> distances between sibling clusters are also hard to compute (if
>>>>>> you have thrown away the similarity matrix): do you count paths to
>>>>>> the same parent node when computing distances between items in two
>>>>>> adjacent nodes, for example? It is also a bit harder to distribute
>>>>>> the computation for bottom-up approaches, as you have to already
>>>>>> find the nearest neighbor of an item to begin the process.
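To make the top-down scheme in Hector's message above concrete, here is a
minimal sketch that uses MLlib's existing KMeans as the per-level
clusterer. The names ClusterNode and hierarchicalKMeans are hypothetical,
not an existing API, and a real implementation would stop splitting
clusters that fall below a minimum size:

    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Hypothetical tree node: the k-means model fit at this level,
    // plus one subtree per cluster (empty at the leaves).
    case class ClusterNode(model: KMeansModel, children: Seq[ClusterNode])

    def hierarchicalKMeans(data: RDD[Vector], k: Int, depth: Int,
                           maxIterations: Int = 20): ClusterNode = {
      val model = KMeans.train(data, k, maxIterations)
      val children =
        if (depth <= 1) Seq.empty[ClusterNode]
        else (0 until k).map { c =>
          // Recurse only on the points the parent assigned to cluster c.
          val subset = data.filter(p => model.predict(p) == c).cache()
          val child = hierarchicalKMeans(subset, k, depth - 1, maxIterations)
          subset.unpersist()
          child
        }
      ClusterNode(model, children)
    }

Inference then walks the tree, comparing a point against k centroids per
level instead of against every leaf, which is where the speedup Hector
mentions for billions of users comes from.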
>>>>>> On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling <rnowl...@gmail.com> wrote:
>>>>>>
>>>>>>> The scikit-learn implementation may be of interest:
>>>>>>>
>>>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
>>>>>>>
>>>>>>> It's a bottom-up approach. The pair of clusters to merge is chosen
>>>>>>> to minimize variance.
>>>>>>>
>>>>>>> Their code is under a BSD license, so it can be used as a template.
>>>>>>>
>>>>>>> Is something like that what you were thinking, Hector?
>>>>>>>
>>>>>>> On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Sure. The more interesting problem here is choosing k at each
>>>>>>>> level. Kernel methods seem to be the most promising.
>>>>>>>>
>>>>>>>> On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <hector....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> No idea, never looked it up. Always just implemented it as doing
>>>>>>>>> k-means again on each cluster.
>>>>>>>>>
>>>>>>>>> FWIW, standard k-means with Euclidean distance has problems too
>>>>>>>>> with some dimensionality-reduction methods. Swapping out the
>>>>>>>>> distance metric for negative dot product or cosine may help.
>>>>>>>>>
>>>>>>>>> Another useful form of clustering would be hierarchical SVD. The
>>>>>>>>> reason I like hierarchical clustering is that it makes for
>>>>>>>>> faster inference, especially over billions of users.
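On the metric swap Hector mentions: MLlib's KMeans is hard-wired to
squared Euclidean distance, so a pluggable metric would have to be part
of any standardized clustering API. A minimal sketch of the idea in
plain Scala (illustrative helpers only, not existing MLlib code):

    // A distance is any function where smaller means closer.
    type Distance = (Array[Double], Array[Double]) => Double

    def dot(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => x * y }.sum

    // Negative dot product: for length-normalized vectors this ranks
    // centers identically to cosine distance, and is cheaper.
    val negativeDot: Distance = (a, b) => -dot(a, b)

    // Cosine distance: 1 - cos(angle between a and b).
    val cosine: Distance = (a, b) =>
      1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

    // The assignment step, parameterized by the metric.
    def nearestCenter(p: Array[Double], centers: Array[Array[Double]],
                      dist: Distance): Int =
      centers.indices.minBy(i => dist(p, centers(i)))

(Zero-length vectors would need guarding in the cosine case.)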
>>>>>>>>> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hector, could you share the references for hierarchical
>>>>>>>>>> k-means? Thanks.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <hector....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I would say for big-data applications the most useful would be
>>>>>>>>>>> hierarchical k-means with backtracking and the ability to
>>>>>>>>>>> support the k nearest centroids.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowl...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> MLlib currently has one clustering algorithm implementation,
>>>>>>>>>>>> KMeans. It would benefit from having implementations of other
>>>>>>>>>>>> clustering algorithms such as MiniBatch KMeans, Fuzzy
>>>>>>>>>>>> C-Means, Hierarchical Clustering, and Affinity Propagation.
>>>>>>>>>>>>
>>>>>>>>>>>> I recently submitted a PR [1] for a MiniBatch KMeans
>>>>>>>>>>>> implementation, and I saw an email on this list about
>>>>>>>>>>>> interest in implementing Fuzzy C-Means.
>>>>>>>>>>>>
>>>>>>>>>>>> Based on Sean Owen's review of my MiniBatch KMeans code, it
>>>>>>>>>>>> became apparent that before I implement more clustering
>>>>>>>>>>>> algorithms, it would be useful to hammer out a framework to
>>>>>>>>>>>> reduce code duplication and implement a consistent API.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd like to gauge the interest and goals of the MLlib
>>>>>>>>>>>> community:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Are you interested in having more clustering algorithms
>>>>>>>>>>>> available?
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Is the community interested in specifying a common
>>>>>>>>>>>> framework?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> RJ
>>>>>>>>>>>>
>>>>>>>>>>>> [1] - https://github.com/apache/spark/pull/1248
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> em rnowl...@gmail.com
>>>>>>>>>>>> c 954.496.2314
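For context on the MiniBatch KMeans PR referenced above: the update rule
usually attributed to mini-batch k-means (Sculley, "Web-Scale K-Means
Clustering", WWW 2010) samples a batch, assigns each point to its
nearest center, and then pulls each center toward its assigned points
with a per-center learning rate that decays as the center absorbs more
points. A rough single-machine sketch of one step, assuming squared
Euclidean distance (illustrative only, not the PR's code):

    import scala.util.Random

    // One mini-batch k-means step (after Sculley, WWW 2010).
    // counts(i) = number of points center i has absorbed so far.
    def miniBatchStep(data: IndexedSeq[Array[Double]],
                      centers: Array[Array[Double]],
                      counts: Array[Long],
                      batchSize: Int, rng: Random): Unit = {
      def sqDist(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      // Sample the batch uniformly at random (with replacement).
      val batch = IndexedSeq.fill(batchSize)(data(rng.nextInt(data.size)))
      // Assign the whole batch first, against a fixed snapshot of centers.
      val assigned =
        batch.map(p => centers.indices.minBy(i => sqDist(p, centers(i))))
      for ((p, c) <- batch.zip(assigned)) {
        counts(c) += 1
        val eta = 1.0 / counts(c) // learning rate decays per center
        val center = centers(c)
        for (j <- center.indices)
          center(j) = (1.0 - eta) * center(j) + eta * p(j)
      }
    }

A distributed version would presumably sample the batch from an RDD and
aggregate per-center sums before applying the same update.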