If it's not done already, would it make sense to codify this philosophy somewhere? I imagine this won't be the first time this discussion comes up, and it would be nice to have a doc to point to. I'd be happy to take a stab at this.
On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng <men...@gmail.com> wrote:
> +1 on Sean's comment. MLlib covers the basic algorithms but we
> definitely need to spend more time on how to make the design scalable.
> For example, think about current "ProblemWithAlgorithm" naming scheme.
> That being said, new algorithms are welcomed. I wish they are
> well-established and well-understood by users. They shouldn't be
> research algorithms tuned to work well with a particular dataset but
> not tested widely. You see the change log from Mahout:
>
> ===
> The following algorithms that were marked deprecated in 0.8 have been
> removed in 0.9:
>
> From Clustering:
> Switched LDA implementation from using Gibbs Sampling to Collapsed
> Variational Bayes (CVB)
> Meanshift
> MinHash - removed due to poor performance, lack of support and lack of
> usage
>
> From Classification (both are sequential implementations)
> Winnow - lack of actual usage and support
> Perceptron - lack of actual usage and support
>
> Collaborative Filtering
> SlopeOne implementations in
> org.apache.mahout.cf.taste.hadoop.slopeone and
> org.apache.mahout.cf.taste.impl.recommender.slopeone
> Distributed pseudo recommender in
> org.apache.mahout.cf.taste.hadoop.pseudo
> TreeClusteringRecommender in
> org.apache.mahout.cf.taste.impl.recommender
>
> Mahout Math
> Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
> ===
>
> In MLlib, we should include the algorithms users know how to use and
> we can provide support rather than letting algorithms come and go.
>
> My $0.02,
> Xiangrui
>
> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <so...@cloudera.com> wrote:
> > On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <p...@mult.ifario.us> wrote:
> >> - MLlib as Mahout.next would be a unfortunate. There are some gems in
> >> Mahout, but there are also lots of rocks. Setting a minimal bar of
> >> working, correctly implemented, and documented requires a surprising
> >> amount of work.
> >
> > As someone with first-hand knowledge, this is correct. To Sang's
> > question, I can't see value in 'porting' Mahout since it is based on a
> > quite different paradigm. About the only part that translates is the
> > algorithm concept itself.
> >
> > This is also the cautionary tale. The contents of the project have
> > ended up being a number of "drive-by" contributions of implementations
> > that, while individually perhaps brilliant (perhaps), didn't
> > necessarily match any other implementation in structure, input/output,
> > libraries used. The implementations were often a touch academic. The
> > result was hard to document, maintain, evolve or use.
> >
> > Far more of the structure of the MLlib implementations are consistent
> > by virtue of being built around Spark core already. That's great.
> >
> > One can't wait to completely build the foundation before building any
> > implementations. To me, the existing implementations are almost
> > exactly the basics I would choose. They cover the bases and will
> > exercise the abstractions and structure. So that's also great IMHO.