How do I get permissions to edit the wiki?
On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng <men...@gmail.com> wrote:
> Cannot agree more with your words. Could you add a section about
> "how and what to contribute" to MLlib's guide? -Xiangrui
>
> On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
> <nick.pentre...@gmail.com> wrote:
> > I'd say a section in the "how to contribute" page would be a good place
> > to put this.
> >
> > In general, I'd say the criteria for inclusion of an algorithm are that
> > it should be high quality; widely known, used, and accepted (with
> > citations and concrete use cases as evidence of this); scalable and
> > parallelizable; well documented; and come with a reasonable expectation
> > of dev support.
> >
> > Sent from my iPhone
> >
> >> On 21 Apr 2014, at 19:59, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> >>
> >> If it's not done already, would it make sense to codify this philosophy
> >> somewhere? I imagine this won't be the first time this discussion comes
> >> up, and it would be nice to have a doc to point to. I'd be happy to
> >> take a stab at this.
> >>
> >>> On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng <men...@gmail.com> wrote:
> >>>
> >>> +1 on Sean's comment. MLlib covers the basic algorithms, but we
> >>> definitely need to spend more time on how to make the design scalable.
> >>> For example, think about the current "ProblemWithAlgorithm" naming
> >>> scheme. That being said, new algorithms are welcome. I hope they are
> >>> well-established and well-understood by users. They shouldn't be
> >>> research algorithms tuned to work well on a particular dataset but not
> >>> tested widely. You can see this in the change log from Mahout:
> >>>
> >>> ===
> >>> The following algorithms that were marked deprecated in 0.8 have been
> >>> removed in 0.9:
> >>>
> >>> From Clustering:
> >>> Switched LDA implementation from using Gibbs Sampling to Collapsed
> >>> Variational Bayes (CVB)
> >>> Meanshift
> >>> MinHash - removed due to poor performance, lack of support and lack
> >>> of usage
> >>>
> >>> From Classification (both are sequential implementations):
> >>> Winnow - lack of actual usage and support
> >>> Perceptron - lack of actual usage and support
> >>>
> >>> Collaborative Filtering:
> >>> SlopeOne implementations in
> >>> org.apache.mahout.cf.taste.hadoop.slopeone and
> >>> org.apache.mahout.cf.taste.impl.recommender.slopeone
> >>> Distributed pseudo recommender in
> >>> org.apache.mahout.cf.taste.hadoop.pseudo
> >>> TreeClusteringRecommender in
> >>> org.apache.mahout.cf.taste.impl.recommender
> >>>
> >>> Mahout Math:
> >>> Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
> >>> ===
> >>>
> >>> In MLlib, we should include algorithms that users know how to use and
> >>> that we can provide support for, rather than letting algorithms come
> >>> and go.
> >>>
> >>> My $0.02,
> >>> Xiangrui
> >>>
> >>>> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <so...@cloudera.com> wrote:
> >>>>> On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <p...@mult.ifario.us> wrote:
> >>>>> - MLlib as Mahout.next would be unfortunate. There are some gems in
> >>>>> Mahout, but there are also lots of rocks. Setting a minimal bar of
> >>>>> working, correctly implemented, and documented requires a surprising
> >>>>> amount of work.
> >>>>
> >>>> As someone with first-hand knowledge, this is correct. To Sang's
> >>>> question, I can't see value in 'porting' Mahout since it is based on
> >>>> a quite different paradigm. About the only part that translates is
> >>>> the algorithm concept itself.
> >>>>
> >>>> This is also the cautionary tale. The contents of the project have
> >>>> ended up being a number of "drive-by" contributions of
> >>>> implementations that, while perhaps individually brilliant, didn't
> >>>> necessarily match any other implementation in structure,
> >>>> input/output, or libraries used. The implementations were often a
> >>>> touch academic. The result was hard to document, maintain, evolve,
> >>>> or use.
> >>>>
> >>>> Far more of the structure of the MLlib implementations is consistent
> >>>> by virtue of being built around Spark core already. That's great.
> >>>>
> >>>> One can't wait to completely build the foundation before building
> >>>> any implementations. To me, the existing implementations are almost
> >>>> exactly the basics I would choose. They cover the bases and will
> >>>> exercise the abstractions and structure. So that's also great IMHO.
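
[Editor's note: for readers unfamiliar with the "ProblemWithAlgorithm" naming scheme Xiangrui refers to, the following is a minimal, illustrative Scala sketch in the style of the Spark 1.x MLlib entry points of that era. The object name NamingSchemeSketch and the input file path are hypothetical; LogisticRegressionWithSGD, SVMWithSGD, and MLUtils.loadLibSVMFile are the real 1.x APIs whose naming couples the problem to the solver.]

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD, SVMWithSGD}
import org.apache.spark.mllib.util.MLUtils

object NamingSchemeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "problem-with-algorithm-sketch")

    // Hypothetical path to a LIBSVM-format training file.
    val training = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

    // The class name fuses the problem with the solver:
    // LogisticRegression (problem) + WithSGD (algorithm).
    val lrModel = LogisticRegressionWithSGD.train(training, 100)

    // Same pattern for linear SVMs: SVM (problem) + WithSGD (algorithm).
    // This coupling is the design-scalability concern raised in the thread.
    val svmModel = SVMWithSGD.train(training, 100)

    sc.stop()
  }
}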