Re: Any plans for new clustering algorithms?

Nick Pentreath Mon, 21 Apr 2014 13:42:31 -0700

I'd say a section in the "how to contribute" page would be a good place to put 
this.


In general I'd say that the criteria for inclusion of an algorithm are that it 
should be high quality, widely known, used and accepted (with citations and 
concrete use cases as evidence), scalable and parallelizable, well documented, 
and with a reasonable expectation of dev support.

Sent from my iPhone

> On 21 Apr 2014, at 19:59, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> 
> If it's not done already, would it make sense to codify this philosophy
> somewhere?  I imagine this won't be the first time this discussion comes
> up, and it would be nice to have a doc to point to.  I'd be happy to take a
> stab at this.
> 
> 
>> On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng <men...@gmail.com> wrote:
>> 
>> +1 on Sean's comment. MLlib covers the basic algorithms but we
>> definitely need to spend more time on how to make the design scalable.
>> For example, think about the current "ProblemWithAlgorithm" naming scheme.
>> That being said, new algorithms are welcome. I hope they are
>> well-established and well-understood by users. They shouldn't be
>> research algorithms tuned to work well with a particular dataset but
>> not tested widely. Consider the change log from Mahout:
>> 
>> ===
>> The following algorithms that were marked deprecated in 0.8 have been
>> removed in 0.9:
>> 
>> From Clustering:
>>  Switched LDA implementation from using Gibbs Sampling to Collapsed
>> Variational Bayes (CVB)
>> Meanshift
>> MinHash - removed due to poor performance, lack of support and lack of
>> usage
>> 
>> From Classification (both are sequential implementations)
>> Winnow - lack of actual usage and support
>> Perceptron - lack of actual usage and support
>> 
>> Collaborative Filtering
>>    SlopeOne implementations in
>> org.apache.mahout.cf.taste.hadoop.slopeone and
>> org.apache.mahout.cf.taste.impl.recommender.slopeone
>>    Distributed pseudo recommender in
>> org.apache.mahout.cf.taste.hadoop.pseudo
>>    TreeClusteringRecommender in
>> org.apache.mahout.cf.taste.impl.recommender
>> 
>> Mahout Math
>>    Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
>> ===
>> 
>> In MLlib, we should include the algorithms that users know how to use and
>> that we can support, rather than letting algorithms come and go.
>> 
>> My $0.02,
>> Xiangrui
>> 
>>> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <so...@cloudera.com> wrote:
>>>> On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <p...@mult.ifario.us> wrote:
>>>> - MLlib as Mahout.next would be unfortunate.  There are some gems in
>>>> Mahout, but there are also lots of rocks.  Setting a minimal bar of
>>>> working, correctly implemented, and documented requires a surprising
>>>> amount of work.
>>> 
>>> As someone with first-hand knowledge, this is correct. To Sang's
>>> question, I can't see value in 'porting' Mahout since it is based on a
>>> quite different paradigm. About the only part that translates is the
>>> algorithm concept itself.
>>> 
>>> This is also the cautionary tale. The contents of the project have
>>> ended up being a number of "drive-by" contributions of implementations
>>> that, while individually perhaps brilliant (perhaps), didn't
>>> necessarily match any other implementation in structure, input/output,
>>> libraries used. The implementations were often a touch academic. The
>>> result was hard to document, maintain, evolve or use.
>>> 
>>> Far more of the structure of the MLlib implementations is consistent
>>> by virtue of being built around Spark core already. That's great.
>>> 
>>> One can't wait to completely build the foundation before building any
>>> implementations. To me, the existing implementations are almost
>>> exactly the basics I would choose. They cover the bases and will
>>> exercise the abstractions and structure. So that's also great IMHO.
>>
