Re: Any plans for new clustering algorithms?

Matei Zaharia Mon, 21 Apr 2014 19:25:53 -0700

The wiki is actually maintained separately in 
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We restricted 
editing of the wiki because bots would automatically add stuff. I’ve given you 
permissions now.


Matei

On Apr 21, 2014, at 6:22 PM, Nan Zhu <zhunanmcg...@gmail.com> wrote:

> I thought those are files of spark.apache.org? 
> 
> -- 
> Nan Zhu
> 
> 
> On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote:
> 
>> The markdown files are under spark/docs. You can submit a PR for
>> changes. -Xiangrui
>> 
>> On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza <sandy.r...@cloudera.com 
>> (mailto:sandy.r...@cloudera.com)> wrote:
>>> How do I get permissions to edit the wiki?
>>> 
>>> 
>>> On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng <men...@gmail.com 
>>> (mailto:men...@gmail.com)> wrote:
>>> 
>>>> Cannot agree more with your words. Could you add one section about
>>>> "how and what to contribute" to MLlib's guide? -Xiangrui
>>>> 
>>>> On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
>>>> <nick.pentre...@gmail.com (mailto:nick.pentre...@gmail.com)> wrote:
>>>>> I'd say a section in the "how to contribute" page would be a good place
>>>>> to put this.
>>>>> 
>>>>> In general I'd say that the criteria for inclusion of an algorithm are that it
>>>>> should be high quality, widely known, used and accepted (citations and
>>>>> concrete use cases as examples of this), scalable and parallelizable, well
>>>>> documented, and with a reasonable expectation of dev support.
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>>> On 21 Apr 2014, at 19:59, Sandy Ryza <sandy.r...@cloudera.com 
>>>>>> (mailto:sandy.r...@cloudera.com)> wrote:
>>>>>> 
>>>>>> If it's not done already, would it make sense to codify this philosophy
>>>>>> somewhere? I imagine this won't be the first time this discussion comes
>>>>>> up, and it would be nice to have a doc to point to. I'd be happy to
>>>>>> take a stab at this.
>>>>>> 
>>>>>> 
>>>>>>> On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng <men...@gmail.com
>>>>>>> (mailto:men...@gmail.com)> wrote:
>>>>>>> 
>>>>>>> +1 on Sean's comment. MLlib covers the basic algorithms but we
>>>>>>> definitely need to spend more time on how to make the design scalable.
>>>>>>> For example, think about the current "ProblemWithAlgorithm" naming scheme.
>>>>>>> That being said, new algorithms are welcome. I would like them to be
>>>>>>> well established and well understood by users. They shouldn't be
>>>>>>> research algorithms tuned to work well with a particular dataset but
>>>>>>> not tested widely. See the change log from Mahout:
>>>>>>> 
>>>>>>> ===
>>>>>>> The following algorithms that were marked deprecated in 0.8 have been
>>>>>>> removed in 0.9:
>>>>>>> 
>>>>>>> From Clustering:
>>>>>>> Switched LDA implementation from using Gibbs Sampling to Collapsed
>>>>>>> Variational Bayes (CVB)
>>>>>>> Meanshift
>>>>>>> MinHash - removed due to poor performance, lack of support and lack of
>>>>>>> usage
>>>>>>> 
>>>>>>> From Classification (both are sequential implementations)
>>>>>>> Winnow - lack of actual usage and support
>>>>>>> Perceptron - lack of actual usage and support
>>>>>>> 
>>>>>>> Collaborative Filtering
>>>>>>> SlopeOne implementations in
>>>>>>> org.apache.mahout.cf.taste.hadoop.slopeone and
>>>>>>> org.apache.mahout.cf.taste.impl.recommender.slopeone
>>>>>>> Distributed pseudo recommender in
>>>>>>> org.apache.mahout.cf.taste.hadoop.pseudo
>>>>>>> TreeClusteringRecommender in
>>>>>>> org.apache.mahout.cf.taste.impl.recommender
>>>>>>> 
>>>>>>> Mahout Math
>>>>>>> Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
>>>>>>> ===
>>>>>>> 
>>>>>>> In MLlib, we should include the algorithms that users know how to use and
>>>>>>> that we can support, rather than letting algorithms come and go.
>>>>>>> 
>>>>>>> My $0.02,
>>>>>>> Xiangrui
>>>>>>> 
>>>>>>>> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <so...@cloudera.com
>>>>>>>> (mailto:so...@cloudera.com)> wrote:
>>>>>>>>> On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <p...@mult.ifario.us
>>>>>>>>> (mailto:p...@mult.ifario.us)> wrote:
>>>>>>>>> - MLlib as Mahout.next would be unfortunate. There are some gems in
>>>>>>>>> Mahout, but there are also lots of rocks. Setting a minimal bar of
>>>>>>>>> working, correctly implemented, and documented requires a surprising
>>>>>>>>> amount of work.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> As someone with first-hand knowledge, this is correct. To Sang's
>>>>>>>> question, I can't see value in 'porting' Mahout since it is based on a
>>>>>>>> quite different paradigm. About the only part that translates is the
>>>>>>>> algorithm concept itself.
>>>>>>>> 
>>>>>>>> This is also the cautionary tale. The contents of the project have
>>>>>>>> ended up being a number of "drive-by" contributions of implementations
>>>>>>>> that, while individually perhaps brilliant (perhaps), didn't
>>>>>>>> necessarily match any other implementation in structure, input/output,
>>>>>>>> libraries used. The implementations were often a touch academic. The
>>>>>>>> result was hard to document, maintain, evolve or use.
>>>>>>>> 
>>>>>>>> Far more of the structure of the MLlib implementations is consistent
>>>>>>>> by virtue of being built around Spark core already. That's great.
>>>>>>>> 
>>>>>>>> One can't wait to completely build the foundation before building any
>>>>>>>> implementations. To me, the existing implementations are almost
>>>>>>>> exactly the basics I would choose. They cover the bases and will
>>>>>>>> exercise the abstractions and structure. So that's also great IMHO.
