The wiki is actually maintained separately in https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We restricted editing of the wiki because bots would automatically add stuff. I’ve given you permissions now.

Matei

On Apr 21, 2014, at 6:22 PM, Nan Zhu <zhunanmcg...@gmail.com> wrote:

> I thought those are files of spark.apache.org?
>
> --
> Nan Zhu
>
>
> On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote:
>
>> The markdown files are under spark/docs. You can submit a PR for
>> changes. -Xiangrui
>>
>> On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza <sandy.r...@cloudera.com
>> (mailto:sandy.r...@cloudera.com)> wrote:
>>> How do I get permissions to edit the wiki?
>>>
>>>
>>> On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng <men...@gmail.com
>>> (mailto:men...@gmail.com)> wrote:
>>>
>>>> Cannot agree more with your words. Could you add one section about
>>>> "how and what to contribute" to MLlib's guide? -Xiangrui
>>>>
>>>> On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
>>>> <nick.pentre...@gmail.com (mailto:nick.pentre...@gmail.com)> wrote:
>>>>> I'd say a section in the "how to contribute" page would be a good place
>>>>> to put this.
>>>>>
>>>>> In general I'd say that the criteria for inclusion of an algorithm is it
>>>>> should be high quality, widely known, used and accepted (citations and
>>>>> concrete use cases as examples of this), scalable and parallelizable, well
>>>>> documented and with reasonable expectation of dev support
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On 21 Apr 2014, at 19:59, Sandy Ryza <sandy.r...@cloudera.com
>>>>>> (mailto:sandy.r...@cloudera.com)> wrote:
>>>>>>
>>>>>> If it's not done already, would it make sense to codify this philosophy
>>>>>> somewhere? I imagine this won't be the first time this discussion comes
>>>>>> up, and it would be nice to have a doc to point to. I'd be happy to
>>>>>> take a stab at this.
>>>>>>
>>>>>>
>>>>>>> On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng <men...@gmail.com
>>>>>>> (mailto:men...@gmail.com)> wrote:
>>>>>>>
>>>>>>> +1 on Sean's comment. MLlib covers the basic algorithms but we
>>>>>>> definitely need to spend more time on how to make the design scalable.
>>>>>>> For example, think about current "ProblemWithAlgorithm" naming scheme.
>>>>>>> That being said, new algorithms are welcomed. I wish they are
>>>>>>> well-established and well-understood by users. They shouldn't be
>>>>>>> research algorithms tuned to work well with a particular dataset but
>>>>>>> not tested widely. You see the change log from Mahout:
>>>>>>>
>>>>>>> ===
>>>>>>> The following algorithms that were marked deprecated in 0.8 have been
>>>>>>> removed in 0.9:
>>>>>>>
>>>>>>> From Clustering:
>>>>>>> Switched LDA implementation from using Gibbs Sampling to Collapsed
>>>>>>> Variational Bayes (CVB)
>>>>>>> Meanshift
>>>>>>> MinHash - removed due to poor performance, lack of support and lack of
>>>>>>> usage
>>>>>>>
>>>>>>> From Classification (both are sequential implementations)
>>>>>>> Winnow - lack of actual usage and support
>>>>>>> Perceptron - lack of actual usage and support
>>>>>>>
>>>>>>> Collaborative Filtering
>>>>>>> SlopeOne implementations in
>>>>>>> org.apache.mahout.cf.taste.hadoop.slopeone and
>>>>>>> org.apache.mahout.cf.taste.impl.recommender.slopeone
>>>>>>> Distributed pseudo recommender in
>>>>>>> org.apache.mahout.cf.taste.hadoop.pseudo
>>>>>>> TreeClusteringRecommender in
>>>>>>> org.apache.mahout.cf.taste.impl.recommender
>>>>>>>
>>>>>>> Mahout Math
>>>>>>> Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
>>>>>>> ===
>>>>>>>
>>>>>>> In MLlib, we should include the algorithms users know how to use and
>>>>>>> we can provide support rather than letting algorithms come and go.
>>>>>>>
>>>>>>> My $0.02,
>>>>>>> Xiangrui
>>>>>>>
>>>>>>>> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <so...@cloudera.com
>>>>>>>> (mailto:so...@cloudera.com)> wrote:
>>>>>>>>> On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <p...@mult.ifario.us
>>>>>>>>> (mailto:p...@mult.ifario.us)> wrote:
>>>>>>>>> - MLlib as Mahout.next would be a unfortunate. There are some gems in
>>>>>>>>> Mahout, but there are also lots of rocks. Setting a minimal bar of
>>>>>>>>> working, correctly implemented, and documented requires a surprising
>>>>>>>>> amount of work.
>>>>>>>>
>>>>>>>>
>>>>>>>> As someone with first-hand knowledge, this is correct. To Sang's
>>>>>>>> question, I can't see value in 'porting' Mahout since it is based on a
>>>>>>>> quite different paradigm. About the only part that translates is the
>>>>>>>> algorithm concept itself.
>>>>>>>>
>>>>>>>> This is also the cautionary tale. The contents of the project have
>>>>>>>> ended up being a number of "drive-by" contributions of implementations
>>>>>>>> that, while individually perhaps brilliant (perhaps), didn't
>>>>>>>> necessarily match any other implementation in structure, input/output,
>>>>>>>> libraries used. The implementations were often a touch academic. The
>>>>>>>> result was hard to document, maintain, evolve or use.
>>>>>>>>
>>>>>>>> Far more of the structure of the MLlib implementations are consistent
>>>>>>>> by virtue of being built around Spark core already. That's great.
>>>>>>>>
>>>>>>>> One can't wait to completely build the foundation before building any
>>>>>>>> implementations. To me, the existing implementations are almost
>>>>>>>> exactly the basics I would choose. They cover the bases and will
>>>>>>>> exercise the abstractions and structure. So that's also great IMHO.