Re: Any plans for new clustering algorithms?

Xiangrui Meng Mon, 21 Apr 2014 15:20:07 -0700

I couldn't agree more. Could you add a section about
"how and what to contribute" to MLlib's guide? -Xiangrui


On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
<nick.pentre...@gmail.com> wrote:
> I'd say a section in the "how to contribute" page would be a good place to 
> put this.
>
> In general I'd say that the criteria for inclusion of an algorithm are that it
> should be high quality, widely known, used and accepted (citations and
> concrete use cases as examples of this), scalable and parallelizable, well
> documented and with reasonable expectation of dev support.
>
> Sent from my iPhone
>
>> On 21 Apr 2014, at 19:59, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>
>> If it's not done already, would it make sense to codify this philosophy
>> somewhere?  I imagine this won't be the first time this discussion comes
>> up, and it would be nice to have a doc to point to.  I'd be happy to take a
>> stab at this.
>>
>>
>>> On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>
>>> +1 on Sean's comment. MLlib covers the basic algorithms but we
>>> definitely need to spend more time on how to make the design scalable.
>>> For example, think about the current "ProblemWithAlgorithm" naming scheme.
>>> That being said, new algorithms are welcome. I hope they are
>>> well-established and well-understood by users. They shouldn't be
>>> research algorithms tuned to work well with a particular dataset but
>>> not tested widely. See the change log from Mahout:
>>>
>>> ===
>>> The following algorithms that were marked deprecated in 0.8 have been
>>> removed in 0.9:
>>>
>>> From Clustering:
>>>  Switched LDA implementation from using Gibbs Sampling to Collapsed
>>> Variational Bayes (CVB)
>>> Meanshift
>>> MinHash - removed due to poor performance, lack of support and lack of
>>> usage
>>>
>>> From Classification (both are sequential implementations)
>>> Winnow - lack of actual usage and support
>>> Perceptron - lack of actual usage and support
>>>
>>> Collaborative Filtering
>>>    SlopeOne implementations in
>>> org.apache.mahout.cf.taste.hadoop.slopeone and
>>> org.apache.mahout.cf.taste.impl.recommender.slopeone
>>>    Distributed pseudo recommender in
>>> org.apache.mahout.cf.taste.hadoop.pseudo
>>>    TreeClusteringRecommender in
>>> org.apache.mahout.cf.taste.impl.recommender
>>>
>>> Mahout Math
>>>    Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
>>> ===
>>>
>>> In MLlib, we should include algorithms that users know how to use and
>>> that we can support, rather than letting algorithms come and go.
>>>
>>> My $0.02,
>>> Xiangrui
>>>
>>>> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>> On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <p...@mult.ifario.us> wrote:
>>>>> - MLlib as Mahout.next would be unfortunate.  There are some gems in
>>>>> Mahout, but there are also lots of rocks.  Setting a minimal bar of
>>>>> working, correctly implemented, and documented requires a surprising
>>>>> amount of work.
>>>>
>>>> As someone with first-hand knowledge, this is correct. To Sang's
>>>> question, I can't see value in 'porting' Mahout since it is based on a
>>>> quite different paradigm. About the only part that translates is the
>>>> algorithm concept itself.
>>>>
>>>> This is also the cautionary tale. The contents of the project have
>>>> ended up being a number of "drive-by" contributions of implementations
>>>> that, while individually perhaps brilliant (perhaps), didn't
>>>> necessarily match any other implementation in structure, input/output,
>>>> libraries used. The implementations were often a touch academic. The
>>>> result was hard to document, maintain, evolve or use.
>>>>
>>>> Far more of the structure of the MLlib implementations is consistent
>>>> by virtue of being built around Spark core already. That's great.
>>>>
>>>> One can't wait to completely build the foundation before building any
>>>> implementations. To me, the existing implementations are almost
>>>> exactly the basics I would choose. They cover the bases and will
>>>> exercise the abstractions and structure. So that's also great IMHO.
>>>
