Cool, seems like a good initiative. Adding a couple of extra high-quality clustering implementations will be great.
I'd say it would make the most sense to submit a PR for the standardised API first, agree on it with everyone, and then build on it for the specific implementations.

— Sent from Mailbox

On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling <rnowl...@gmail.com> wrote:
> Thanks everyone for the input.
>
> So it seems what people want is:
> * Implement MiniBatch KMeans and Hierarchical KMeans (divide-and-conquer
>   approach; look at the DecisionTree implementation as a reference)
> * Restructure the 3 KMeans clustering algorithm implementations to
>   prevent code duplication and conform to a consistent API where possible
>
> If this is correct, I'll start work on that. How would it be best to
> structure it? Should I submit separate JIRAs / PRs for the refactoring of
> the current code, MiniBatch KMeans, and Hierarchical KMeans, or keep my
> current JIRA and PR for MiniBatch KMeans open and submit a second JIRA
> and PR for Hierarchical KMeans that builds on top of it?
>
> Thanks!
>
> On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee <hector....@gmail.com> wrote:
>> Yeah, if one were to replace the objective function in decision tree
>> with minimizing the variance of the leaf nodes, it would be a
>> hierarchical clusterer.
>>
>> On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>> If you're thinking along these lines, have a look at the DecisionTree
>>> implementation in MLlib. It uses the same idea and is optimized to
>>> prevent multiple passes over the data by computing several splits at
>>> each level of tree building. The tradeoff is increased model state and
>>> computation per pass over the data, but fewer total passes and
>>> hopefully lower communication overheads than, say, shuffling data
>>> around that belongs to one cluster or another. Something like that
>>> could work here as well.
>>>
>>> I'm not super-familiar with hierarchical K-Means, so perhaps there's
>>> a more efficient way to implement it, though.
>>>
>>> On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee <hector....@gmail.com> wrote:
>>>> No, I was thinking more top-down:
>>>>
>>>> Assuming a distributed k-means system already exists, recursively
>>>> apply the k-means algorithm to data already partitioned by the
>>>> previous level of k-means.
>>>>
>>>> I haven't been much of a fan of bottom-up approaches like HAC, mainly
>>>> because they assume there is already an item-to-item distance metric.
>>>> This makes it hard to cluster new content. The distances between
>>>> sibling clusters are also hard to compute (if you have thrown away
>>>> the similarity matrix): do you count paths to the same parent node
>>>> when computing distances between items in two adjacent nodes, for
>>>> example? It is also a bit harder to distribute the computation for
>>>> bottom-up approaches, as you already have to find the nearest
>>>> neighbor of an item to begin the process.
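A minimal sketch of the top-down recursion described above, written spark-shell style against the MLlib API; ClusterNode, the fixed 20 iterations, and the depth cutoff are illustrative assumptions, and small or empty clusters are ignored for brevity:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Hypothetical tree node; not part of MLlib.
    case class ClusterNode(center: Vector, children: Seq[ClusterNode])

    // Run k-means, then recurse on the subset of points assigned to
    // each center: one level of the tree per call.
    def hierarchicalKMeans(data: RDD[Vector], k: Int, depth: Int): Seq[ClusterNode] = {
      val model = KMeans.train(data, k, 20) // 20 iterations: illustrative
      model.clusterCenters.zipWithIndex.map { case (center, i) =>
        val children =
          if (depth <= 1) Seq.empty[ClusterNode]
          else {
            val subset = data.filter(v => model.predict(v) == i).cache()
            hierarchicalKMeans(subset, k, depth - 1)
          }
        ClusterNode(center, children)
      }.toSeq
    }

As Evan notes, a production version would likely mirror DecisionTree and grow every node of a level in a single pass over the data, trading extra model state for fewer scans, rather than recursing cluster by cluster.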
>>>> On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling <rnowl...@gmail.com> wrote:
>>>>> The scikit-learn implementation may be of interest:
>>>>>
>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
>>>>>
>>>>> It's a bottom-up approach. The pair of clusters to merge is chosen
>>>>> to minimize variance.
>>>>>
>>>>> Their code is under a BSD license, so it can be used as a template.
>>>>>
>>>>> Is something like that what you were thinking, Hector?
>>>>>
>>>>> On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>>> Sure. The more interesting problem here is choosing k at each
>>>>>> level. Kernel methods seem to be the most promising.
>>>>>>
>>>>>> On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <hector....@gmail.com> wrote:
>>>>>>> No idea, never looked it up. I've always just implemented it by
>>>>>>> doing k-means again on each cluster.
>>>>>>>
>>>>>>> FWIW, standard k-means with Euclidean distance also has problems
>>>>>>> with some dimensionality-reduction methods. Swapping out the
>>>>>>> distance metric for negative dot product or cosine may help.
>>>>>>>
>>>>>>> Another useful approach would be hierarchical SVD. The reason I
>>>>>>> like hierarchical clustering is that it makes for faster
>>>>>>> inference, especially over billions of users.
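A sketch of the metric swap Hector mentions: cosine distance as a drop-in alternative to squared Euclidean in the assignment step. The function is illustrative, not MLlib code:

    import org.apache.spark.mllib.linalg.Vector

    // Cosine distance: 0 for parallel vectors, up to 2 for opposite
    // ones. Assumes neither vector has zero norm.
    def cosineDistance(a: Vector, b: Vector): Double = {
      val x = a.toArray
      val y = b.toArray
      val dot = x.zip(y).map { case (p, q) => p * q }.sum
      val norms = math.sqrt(x.map(v => v * v).sum) *
                  math.sqrt(y.map(v => v * v).sum)
      1.0 - dot / norms
    }

Note that under cosine the plain cluster mean is no longer the optimal center; spherical k-means re-normalizes centers after each update, so a pluggable metric touches the update step as well as the assignment step.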
>>>>>>> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>>>>> Hector, could you share the references for hierarchical K-means?
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <hector....@gmail.com> wrote:
>>>>>>>>> I would say that for big-data applications the most useful would
>>>>>>>>> be hierarchical k-means with backtracking and the ability to
>>>>>>>>> support the k nearest centroids.
>>>>>>>>>
>>>>>>>>> On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowl...@gmail.com> wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> MLlib currently has one clustering algorithm implementation,
>>>>>>>>>> KMeans. It would benefit from having implementations of other
>>>>>>>>>> clustering algorithms such as MiniBatch KMeans, Fuzzy C-Means,
>>>>>>>>>> Hierarchical Clustering, and Affinity Propagation.
>>>>>>>>>>
>>>>>>>>>> I recently submitted a PR [1] for a MiniBatch KMeans
>>>>>>>>>> implementation, and I saw an email on this list about interest
>>>>>>>>>> in implementing Fuzzy C-Means.
>>>>>>>>>>
>>>>>>>>>> Based on Sean Owen's review of my MiniBatch KMeans code, it
>>>>>>>>>> became apparent that before I implement more clustering
>>>>>>>>>> algorithms, it would be useful to hammer out a framework to
>>>>>>>>>> reduce code duplication and implement a consistent API.
>>>>>>>>>>
>>>>>>>>>> I'd like to gauge the interest and goals of the MLlib community:
>>>>>>>>>>
>>>>>>>>>> 1. Are you interested in having more clustering algorithms
>>>>>>>>>> available?
>>>>>>>>>>
>>>>>>>>>> 2. Is the community interested in specifying a common framework?
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> RJ
>>>>>>>>>>
>>>>>>>>>> [1] - https://github.com/apache/spark/pull/1248
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> em rnowl...@gmail.com
>>>>>>>>>> c 954.496.2314
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Yee Yang Li Hector <http://google.com/+HectorYee>
>>>>>>>>> *google.com/+HectorYee <http://google.com/+HectorYee>*
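To make the "common framework" question concrete, here is one possible shape for a shared clustering API; the trait and method names are purely illustrative, not what the standardisation PR would necessarily propose:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // A fitted model only needs to assign a cluster id to a point...
    trait ClusteringModel extends Serializable {
      def predict(point: Vector): Int
    }

    // ...and an algorithm turns an RDD of feature vectors into a model.
    trait ClusteringAlgorithm[M <: ClusteringModel] {
      def run(data: RDD[Vector]): M
    }

KMeans, MiniBatch KMeans, and a hierarchical variant could then share input validation, initialization, and test scaffolding behind a single surface, addressing the code duplication raised in review.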