Re: Any plans for new clustering algorithms?

Sandy Ryza Mon, 21 Apr 2014 11:00:03 -0700

If it's not done already, would it make sense to codify this philosophy
somewhere?  I imagine this won't be the first time this discussion comes
up, and it would be nice to have a doc to point to.  I'd be happy to take a
stab at this.



On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng <[email protected]> wrote:

> +1 on Sean's comment. MLlib covers the basic algorithms but we
> definitely need to spend more time on how to make the design scalable.
> For example, think about current "ProblemWithAlgorithm" naming scheme.
> That being said, new algorithms are welcomed. I wish they are
> well-established and well-understood by users. They shouldn't be
> research algorithms tuned to work well with a particular dataset but
> not tested widely. You see the change log from Mahout:
>
> ===
> The following algorithms that were marked deprecated in 0.8 have been
> removed in 0.9:
>
> From Clustering:
>   Switched LDA implementation from using Gibbs Sampling to Collapsed
> Variational Bayes (CVB)
> Meanshift
> MinHash - removed due to poor performance, lack of support and lack of
> usage
>
> From Classification (both are sequential implementations)
> Winnow - lack of actual usage and support
> Perceptron - lack of actual usage and support
>
> Collaborative Filtering
>     SlopeOne implementations in
> org.apache.mahout.cf.taste.hadoop.slopeone and
> org.apache.mahout.cf.taste.impl.recommender.slopeone
>     Distributed pseudo recommender in
> org.apache.mahout.cf.taste.hadoop.pseudo
>     TreeClusteringRecommender in
> org.apache.mahout.cf.taste.impl.recommender
>
> Mahout Math
>     Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
> ===
>
> In MLlib, we should include the algorithms users know how to use and
> we can provide support rather than letting algorithms come and go.
>
> My $0.02,
> Xiangrui
>
> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <[email protected]> wrote:
> > On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <[email protected]> wrote:
> >> - MLlib as Mahout.next would be a unfortunate.  There are some gems in
> >> Mahout, but there are also lots of rocks.  Setting a minimal bar of
> >> working, correctly implemented, and documented requires a surprising
> amount
> >> of work.
> >
> > As someone with first-hand knowledge, this is correct. To Sang's
> > question, I can't see value in 'porting' Mahout since it is based on a
> > quite different paradigm. About the only part that translates is the
> > algorithm concept itself.
> >
> > This is also the cautionary tale. The contents of the project have
> > ended up being a number of "drive-by" contributions of implementations
> > that, while individually perhaps brilliant (perhaps), didn't
> > necessarily match any other implementation in structure, input/output,
> > libraries used. The implementations were often a touch academic. The
> > result was hard to document, maintain, evolve or use.
> >
> > Far more of the structure of the MLlib implementations are consistent
> > by virtue of being built around Spark core already. That's great.
> >
> > One can't wait to completely build the foundation before building any
> > implementations. To me, the existing implementations are almost
> > exactly the basics I would choose. They cover the bases and will
> > exercise the abstractions and structure. So that's also great IMHO.
>

Re: Any plans for new clustering algorithms?

Reply via email to