Might be worth checking out scikit-learn and Mahout to get some broad ideas.

—
Sent from Mailbox
On Thu, Jul 10, 2014 at 4:25 PM, RJ Nowling <rnowl...@gmail.com> wrote:

> I went ahead and created JIRAs.
>
> JIRA for Hierarchical Clustering:
> https://issues.apache.org/jira/browse/SPARK-2429
>
> JIRA for Standardized Clustering APIs:
> https://issues.apache.org/jira/browse/SPARK-2430
>
> Before submitting a PR for the standardized API, I want to implement a
> few clustering algorithms myself to get a good "feel" for how to
> structure such an API. If others with more experience want to dive into
> designing the API, though, that would allow us to get moving more
> quickly.
>
> On Wed, Jul 9, 2014 at 8:39 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>
>> Cool, seems like a good initiative. Adding a couple of extra
>> high-quality clustering implementations will be great.
>>
>> I'd say it would make most sense to submit a PR for the standardized
>> API first, agree on that with everyone, and then build on it for the
>> specific implementations.
>>
>> —
>> Sent from Mailbox
>>
>> On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling <rnowl...@gmail.com> wrote:
>>
>>> Thanks everyone for the input.
>>>
>>> So it seems what people want is:
>>>
>>> * Implement MiniBatch KMeans and Hierarchical KMeans (divide-and-conquer
>>>   approach; look at the DecisionTree implementation as a reference)
>>> * Restructure the three k-means clustering implementations to prevent
>>>   code duplication and conform to a consistent API where possible
>>>
>>> If this is correct, I'll start work on that. How would it be best to
>>> structure it? Should I submit separate JIRAs / PRs for the refactoring
>>> of the current code, MiniBatch KMeans, and Hierarchical KMeans, or
>>> keep my current JIRA and PR for MiniBatch KMeans open and submit a
>>> second JIRA and PR for Hierarchical KMeans that builds on top of it?
>>>
>>> Thanks!
>>>
>>> On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee <hector....@gmail.com> wrote:
>>>
>>>> Yeah, if one were to replace the objective function in decision tree
>>>> with minimizing the variance of the leaf nodes, it would be a
>>>> hierarchical clusterer.
>>>>
>>>> On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>>>
>>>>> If you're thinking along these lines, have a look at the DecisionTree
>>>>> implementation in MLlib. It uses the same idea and is optimized to
>>>>> prevent multiple passes over the data by computing several splits at
>>>>> each level of tree building. The tradeoff is increased model state
>>>>> and computation per pass over the data, but fewer total passes and
>>>>> hopefully lower communication overheads than, say, shuffling data
>>>>> around that belongs to one cluster or another. Something like that
>>>>> could work here as well.
>>>>>
>>>>> I'm not super-familiar with hierarchical k-means, so perhaps there's
>>>>> a more efficient way to implement it, though.
>>>>>
>>>>> On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee <hector....@gmail.com> wrote:
>>>>>
>>>>>> No, I was thinking more top-down: assuming a distributed k-means
>>>>>> system already exists, recursively apply the k-means algorithm to
>>>>>> data already partitioned by the previous level of k-means.
>>>>>>
>>>>>> I haven't been much of a fan of bottom-up approaches like HAC,
>>>>>> mainly because they assume there is already an item-to-item
>>>>>> distance metric. This makes it hard to cluster new content. The
>>>>>> distances between sibling clusters are also hard to compute (if
>>>>>> you have thrown away the similarity matrix): do you count paths to
>>>>>> the same parent node when computing distances between items in two
>>>>>> adjacent nodes, for example? It is also a bit harder to distribute
>>>>>> the computation for bottom-up approaches, as you have to already
>>>>>> find the nearest neighbor of an item to begin the process.
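To make the top-down scheme in Hector's message above concrete, here is a
minimal sketch that uses MLlib's existing KMeans as the per-level
clusterer. The names ClusterNode and hierarchicalKMeans are hypothetical,
not an existing API, and a real implementation would stop splitting
clusters that fall below a minimum size:

    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Hypothetical tree node: the k-means model fit at this level,
    // plus one subtree per cluster (empty at the leaves).
    case class ClusterNode(model: KMeansModel, children: Seq[ClusterNode])

    def hierarchicalKMeans(data: RDD[Vector], k: Int, depth: Int,
                           maxIterations: Int = 20): ClusterNode = {
      val model = KMeans.train(data, k, maxIterations)
      val children =
        if (depth <= 1) Seq.empty[ClusterNode]
        else (0 until k).map { c =>
          // Recurse only on the points the parent assigned to cluster c.
          val subset = data.filter(p => model.predict(p) == c).cache()
          val child = hierarchicalKMeans(subset, k, depth - 1, maxIterations)
          subset.unpersist()
          child
        }
      ClusterNode(model, children)
    }

Inference then walks the tree, comparing a point against k centroids per
level instead of against every leaf, which is where the speedup Hector
mentions for billions of users comes from.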
>>>>>> On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling <rnowl...@gmail.com> wrote:
>>>>>>
>>>>>>> The scikit-learn implementation may be of interest:
>>>>>>>
>>>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
>>>>>>>
>>>>>>> It's a bottom-up approach. The pair of clusters to merge is chosen
>>>>>>> to minimize variance.
>>>>>>>
>>>>>>> Their code is under a BSD license, so it can be used as a template.
>>>>>>>
>>>>>>> Is something like that what you were thinking, Hector?
>>>>>>>
>>>>>>> On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Sure. The more interesting problem here is choosing k at each
>>>>>>>> level. Kernel methods seem to be the most promising.
>>>>>>>>
>>>>>>>> On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <hector....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> No idea, never looked it up. Always just implemented it as doing
>>>>>>>>> k-means again on each cluster.
>>>>>>>>>
>>>>>>>>> FWIW, standard k-means with Euclidean distance has problems too
>>>>>>>>> with some dimensionality-reduction methods. Swapping out the
>>>>>>>>> distance metric for negative dot product or cosine may help.
>>>>>>>>>
>>>>>>>>> Another useful form of clustering would be hierarchical SVD. The
>>>>>>>>> reason I like hierarchical clustering is that it makes for
>>>>>>>>> faster inference, especially over billions of users.
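On the metric swap Hector mentions: MLlib's KMeans is hard-wired to
squared Euclidean distance, so a pluggable metric would have to be part
of any standardized clustering API. A minimal sketch of the idea in
plain Scala (illustrative helpers only, not existing MLlib code):

    // A distance is any function where smaller means closer.
    type Distance = (Array[Double], Array[Double]) => Double

    def dot(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => x * y }.sum

    // Negative dot product: for length-normalized vectors this ranks
    // centers identically to cosine distance, and is cheaper.
    val negativeDot: Distance = (a, b) => -dot(a, b)

    // Cosine distance: 1 - cos(angle between a and b).
    val cosine: Distance = (a, b) =>
      1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

    // The assignment step, parameterized by the metric.
    def nearestCenter(p: Array[Double], centers: Array[Array[Double]],
                      dist: Distance): Int =
      centers.indices.minBy(i => dist(p, centers(i)))

(Zero-length vectors would need guarding in the cosine case.)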
>>>>>>>>> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hector, could you share the references for hierarchical
>>>>>>>>>> k-means? Thanks.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <hector....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I would say for big-data applications the most useful would be
>>>>>>>>>>> hierarchical k-means with backtracking and the ability to
>>>>>>>>>>> support the k nearest centroids.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowl...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> MLlib currently has one clustering algorithm implementation,
>>>>>>>>>>>> KMeans. It would benefit from having implementations of other
>>>>>>>>>>>> clustering algorithms such as MiniBatch KMeans, Fuzzy
>>>>>>>>>>>> C-Means, Hierarchical Clustering, and Affinity Propagation.
>>>>>>>>>>>>
>>>>>>>>>>>> I recently submitted a PR [1] for a MiniBatch KMeans
>>>>>>>>>>>> implementation, and I saw an email on this list about
>>>>>>>>>>>> interest in implementing Fuzzy C-Means.
>>>>>>>>>>>>
>>>>>>>>>>>> Based on Sean Owen's review of my MiniBatch KMeans code, it
>>>>>>>>>>>> became apparent that before I implement more clustering
>>>>>>>>>>>> algorithms, it would be useful to hammer out a framework to
>>>>>>>>>>>> reduce code duplication and implement a consistent API.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd like to gauge the interest and goals of the MLlib
>>>>>>>>>>>> community:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Are you interested in having more clustering algorithms
>>>>>>>>>>>> available?
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Is the community interested in specifying a common
>>>>>>>>>>>> framework?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> RJ
>>>>>>>>>>>>
>>>>>>>>>>>> [1] - https://github.com/apache/spark/pull/1248
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> em rnowl...@gmail.com
>>>>>>>>>>>> c 954.496.2314
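For context on the MiniBatch KMeans PR referenced above: the update rule
usually attributed to mini-batch k-means (Sculley, "Web-Scale K-Means
Clustering", WWW 2010) samples a batch, assigns each point to its
nearest center, and then pulls each center toward its assigned points
with a per-center learning rate that decays as the center absorbs more
points. A rough single-machine sketch of one step, assuming squared
Euclidean distance (illustrative only, not the PR's code):

    import scala.util.Random

    // One mini-batch k-means step (after Sculley, WWW 2010).
    // counts(i) = number of points center i has absorbed so far.
    def miniBatchStep(data: IndexedSeq[Array[Double]],
                      centers: Array[Array[Double]],
                      counts: Array[Long],
                      batchSize: Int, rng: Random): Unit = {
      def sqDist(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      // Sample the batch uniformly at random (with replacement).
      val batch = IndexedSeq.fill(batchSize)(data(rng.nextInt(data.size)))
      // Assign the whole batch first, against a fixed snapshot of centers.
      val assigned =
        batch.map(p => centers.indices.minBy(i => sqDist(p, centers(i))))
      for ((p, c) <- batch.zip(assigned)) {
        counts(c) += 1
        val eta = 1.0 / counts(c) // learning rate decays per center
        val center = centers(c)
        for (j <- center.indices)
          center(j) = (1.0 - eta) * center(j) + eta * p(j)
      }
    }

A distributed version would presumably sample the batch from an RDD and
aggregate per-center sums before applying the same update.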