I thought those are files of spark.apache.org?

--
Nan Zhu

On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote:
> The markdown files are under spark/docs. You can submit a PR for
> changes. -Xiangrui
>
> On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> > How do I get permissions to edit the wiki?
> >
> > On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng <men...@gmail.com> wrote:
> > > Cannot agree more with your words. Could you add one section about
> > > "how and what to contribute" to MLlib's guide? -Xiangrui
> > >
> > > On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> > > > I'd say a section in the "how to contribute" page would be a good
> > > > place to put this.
> > > >
> > > > In general I'd say that the criteria for inclusion of an algorithm
> > > > is it should be high quality, widely known, used and accepted
> > > > (citations and concrete use cases as examples of this), scalable
> > > > and parallelizable, well documented and with reasonable expectation
> > > > of dev support
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On 21 Apr 2014, at 19:59, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> > > > >
> > > > > If it's not done already, would it make sense to codify this
> > > > > philosophy somewhere? I imagine this won't be the first time this
> > > > > discussion comes up, and it would be nice to have a doc to point
> > > > > to. I'd be happy to take a stab at this.
> > > > >
> > > > > On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng <men...@gmail.com> wrote:
> > > > > >
> > > > > > +1 on Sean's comment. MLlib covers the basic algorithms but we
> > > > > > definitely need to spend more time on how to make the design
> > > > > > scalable.
> > > > > > For example, think about current "ProblemWithAlgorithm" naming
> > > > > > scheme. That being said, new algorithms are welcomed. I wish they
> > > > > > are well-established and well-understood by users. They shouldn't
> > > > > > be research algorithms tuned to work well with a particular
> > > > > > dataset but not tested widely. You see the change log from
> > > > > > Mahout:
> > > > > >
> > > > > > ===
> > > > > > The following algorithms that were marked deprecated in 0.8 have
> > > > > > been removed in 0.9:
> > > > > >
> > > > > > From Clustering:
> > > > > > Switched LDA implementation from using Gibbs Sampling to Collapsed Variational Bayes (CVB)
> > > > > > Meanshift
> > > > > > MinHash - removed due to poor performance, lack of support and lack of usage
> > > > > >
> > > > > > From Classification (both are sequential implementations)
> > > > > > Winnow - lack of actual usage and support
> > > > > > Perceptron - lack of actual usage and support
> > > > > >
> > > > > > Collaborative Filtering
> > > > > > SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and org.apache.mahout.cf.taste.impl.recommender.slopeone
> > > > > > Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
> > > > > > TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender
> > > > > >
> > > > > > Mahout Math
> > > > > > Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
> > > > > > ===
> > > > > >
> > > > > > In MLlib, we should include the algorithms users know how to use
> > > > > > and we can provide support rather than letting algorithms come and
> > > > > > go.
> > > > > >
> > > > > > My $0.02,
> > > > > > Xiangrui
> > > > > >
> > > > > > On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <so...@cloudera.com> wrote:
> > > > > > > On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <p...@mult.ifario.us> wrote:
> > > > > > > > - MLlib as Mahout.next would be unfortunate. There are some
> > > > > > > > gems in Mahout, but there are also lots of rocks. Setting a
> > > > > > > > minimal bar of working, correctly implemented, and documented
> > > > > > > > requires a surprising amount of work.
> > > > > > >
> > > > > > > As someone with first-hand knowledge, this is correct. To Sang's
> > > > > > > question, I can't see value in 'porting' Mahout since it is
> > > > > > > based on a quite different paradigm. About the only part that
> > > > > > > translates is the algorithm concept itself.
> > > > > > >
> > > > > > > This is also the cautionary tale. The contents of the project
> > > > > > > have ended up being a number of "drive-by" contributions of
> > > > > > > implementations that, while individually perhaps brilliant
> > > > > > > (perhaps), didn't necessarily match any other implementation in
> > > > > > > structure, input/output, libraries used. The implementations
> > > > > > > were often a touch academic. The result was hard to document,
> > > > > > > maintain, evolve or use.
> > > > > > >
> > > > > > > Far more of the structure of the MLlib implementations are
> > > > > > > consistent by virtue of being built around Spark core already.
> > > > > > > That's great.
> > > > > > >
> > > > > > > One can't wait to completely build the foundation before
> > > > > > > building any implementations. To me, the existing
> > > > > > > implementations are almost exactly the basics I would choose.
> > > > > > > They cover the bases and will exercise the abstractions and
> > > > > > > structure. So that's also great IMHO.