Re: Any plans for new clustering algorithms?

Xiangrui Meng Mon, 21 Apr 2014 10:55:01 -0700

+1 on Sean's comment. MLlib covers the basic algorithms but we
definitely need to spend more time on how to make the design scalable.
For example, think about the current "ProblemWithAlgorithm" naming scheme.
That being said, new algorithms are welcome. I hope they will be
well established and well understood by users. They shouldn't be
research algorithms tuned to work well with a particular dataset but
not tested widely. See the change log from Mahout:


===
The following algorithms that were marked deprecated in 0.8 have been
removed in 0.9:

From Clustering:
  Switched LDA implementation from using Gibbs Sampling to Collapsed
Variational Bayes (CVB)
Meanshift
MinHash - removed due to poor performance, lack of support and lack of usage

From Classification (both are sequential implementations)
Winnow - lack of actual usage and support
Perceptron - lack of actual usage and support

Collaborative Filtering
    SlopeOne implementations in
org.apache.mahout.cf.taste.hadoop.slopeone and
org.apache.mahout.cf.taste.impl.recommender.slopeone
    Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
    TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender

Mahout Math
    Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
===

In MLlib, we should include algorithms that users know how to use and
that we can support, rather than letting algorithms come and go.

My $0.02,
Xiangrui

On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <so...@cloudera.com> wrote:
> On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <p...@mult.ifario.us> wrote:
>> - MLlib as Mahout.next would be unfortunate.  There are some gems in
>> Mahout, but there are also lots of rocks.  Setting a minimal bar of
>> working, correctly implemented, and documented requires a surprising amount
>> of work.
>
> As someone with first-hand knowledge, this is correct. To Sang's
> question, I can't see value in 'porting' Mahout since it is based on a
> quite different paradigm. About the only part that translates is the
> algorithm concept itself.
>
> This is also the cautionary tale. The contents of the project have
> ended up being a number of "drive-by" contributions of implementations
> that, while individually perhaps brilliant (perhaps), didn't
> necessarily match any other implementation in structure, input/output,
> libraries used. The implementations were often a touch academic. The
> result was hard to document, maintain, evolve or use.
>
> Far more of the structure of the MLlib implementations is consistent
> by virtue of being built around Spark core already. That's great.
>
> One can't wait to completely build the foundation before building any
> implementations. To me, the existing implementations are almost
> exactly the basics I would choose. They cover the bases and will
> exercise the abstractions and structure. So that's also great IMHO.
