Re: Any plans for new clustering algorithms?

Nan Zhu Mon, 21 Apr 2014 18:17:30 -0700

I thought those are files of spark.apache.org? 

-- 
Nan Zhu



On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote:

> The markdown files are under spark/docs. You can submit a PR for
> changes. -Xiangrui
> 
> On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza wrote:
> > How do I get permissions to edit the wiki?
> > 
> > 
> > On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng wrote:
> > 
> > > Cannot agree more with your words. Could you add one section about
> > > "how and what to contribute" to MLlib's guide? -Xiangrui
> > > 
> > > On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath wrote:
> > > > I'd say a section in the "how to contribute" page would be a good place
> > > > to put this.
> > > > 
> > > > In general I'd say that the criteria for inclusion of an algorithm are that
> > > > it should be high quality, widely known, used and accepted (citations and
> > > > concrete use cases as examples of this), scalable and parallelizable, well
> > > > documented, and with a reasonable expectation of dev support.
> > > > 
> > > > Sent from my iPhone
> > > > 
> > > > > On 21 Apr 2014, at 19:59, Sandy Ryza wrote:
> > > > > 
> > > > > If it's not done already, would it make sense to codify this philosophy
> > > > > somewhere? I imagine this won't be the first time this discussion comes
> > > > > up, and it would be nice to have a doc to point to. I'd be happy to take a
> > > > > stab at this.
> > > > > 
> > > > > 
> > > > > > On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng wrote:
> > > > > > 
> > > > > > +1 on Sean's comment. MLlib covers the basic algorithms but we
> > > > > > definitely need to spend more time on how to make the design scalable.
> > > > > > For example, think about the current "ProblemWithAlgorithm" naming scheme.
> > > > > > That being said, new algorithms are welcome. I would like them to be
> > > > > > well-established and well-understood by users. They shouldn't be
> > > > > > research algorithms tuned to work well with a particular dataset but
> > > > > > not tested widely. See the change log from Mahout:
> > > > > > 
> > > > > > ===
> > > > > > The following algorithms that were marked deprecated in 0.8 have been
> > > > > > removed in 0.9:
> > > > > > 
> > > > > > From Clustering:
> > > > > > Switched LDA implementation from using Gibbs Sampling to Collapsed
> > > > > > Variational Bayes (CVB)
> > > > > > Meanshift
> > > > > > MinHash - removed due to poor performance, lack of support and lack of usage
> > > > > > 
> > > > > > From Classification (both are sequential implementations)
> > > > > > Winnow - lack of actual usage and support
> > > > > > Perceptron - lack of actual usage and support
> > > > > > 
> > > > > > Collaborative Filtering
> > > > > > SlopeOne implementations in
> > > > > > org.apache.mahout.cf.taste.hadoop.slopeone and
> > > > > > org.apache.mahout.cf.taste.impl.recommender.slopeone
> > > > > > Distributed pseudo recommender in
> > > > > > org.apache.mahout.cf.taste.hadoop.pseudo
> > > > > > TreeClusteringRecommender in
> > > > > > org.apache.mahout.cf.taste.impl.recommender
> > > > > > 
> > > > > > Mahout Math
> > > > > > Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
> > > > > > ===
> > > > > > 
> > > > > > In MLlib, we should include the algorithms that users know how to use and
> > > > > > that we can support, rather than letting algorithms come and go.
> > > > > > 
> > > > > > My $0.02,
> > > > > > Xiangrui
> > > > > > 
> > > > > > > On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen wrote:
> > > > > > > > On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown wrote:
> > > > > > > > - MLlib as Mahout.next would be unfortunate. There are some gems in
> > > > > > > > Mahout, but there are also lots of rocks. Setting a minimal bar of
> > > > > > > > working, correctly implemented, and documented requires a surprising
> > > > > > > > amount of work.
> > > > > > > 
> > > > > > > 
> > > > > > > As someone with first-hand knowledge, this is correct. To Sang's
> > > > > > > question, I can't see value in 'porting' Mahout since it is based on a
> > > > > > > quite different paradigm. About the only part that translates is the
> > > > > > > algorithm concept itself.
> > > > > > > 
> > > > > > > This is also the cautionary tale. The contents of the project have
> > > > > > > ended up being a number of "drive-by" contributions of implementations
> > > > > > > that, while individually perhaps brilliant (perhaps), didn't
> > > > > > > necessarily match any other implementation in structure, input/output,
> > > > > > > libraries used. The implementations were often a touch academic. The
> > > > > > > result was hard to document, maintain, evolve or use.
> > > > > > > 
> > > > > > > Far more of the structure of the MLlib implementations is consistent
> > > > > > > by virtue of being built around Spark core already. That's great.
> > > > > > > 
> > > > > > > One can't wait to completely build the foundation before building any
> > > > > > > implementations. To me, the existing implementations are almost
> > > > > > > exactly the basics I would choose. They cover the bases and will
> > > > > > > exercise the abstractions and structure. So that's also great IMHO.
> > > > > > > 
