Re: Any plans for new clustering algorithms?

Sandy Ryza Tue, 22 Apr 2014 10:13:19 -0700

Thanks Matei.  I added a section to the "How to contribute" page.


On Mon, Apr 21, 2014 at 7:25 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> The wiki is actually maintained separately in
> https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We
> restricted editing of the wiki because bots would automatically add stuff.
> I've given you permissions now.
>
> Matei
>
> On Apr 21, 2014, at 6:22 PM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
>
> > I thought those are files of spark.apache.org?
> >
> > --
> > Nan Zhu
> >
> >
> > On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote:
> >
> >> The markdown files are under spark/docs. You can submit a PR for
> >> changes. -Xiangrui
> >>
> >> On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> >>> How do I get permissions to edit the wiki?
> >>>
> >>>
> >>> On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >>>
> >>>> Cannot agree more with your words. Could you add one section about
> >>>> "how and what to contribute" to MLlib's guide? -Xiangrui
> >>>>
> >>>> On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
> >>>> <nick.pentre...@gmail.com> wrote:
> >>>>> I'd say a section in the "how to contribute" page would be a good
> >>>>> place to put this.
> >>>>>
> >>>>> In general I'd say that the criteria for inclusion of an algorithm
> >>>>> are that it should be high quality, widely known, used and accepted
> >>>>> (citations and concrete use cases as examples of this), scalable and
> >>>>> parallelizable, well documented, and with a reasonable expectation of
> >>>>> dev support.
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>>> On 21 Apr 2014, at 19:59, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> >>>>>>
> >>>>>> If it's not done already, would it make sense to codify this
> >>>>>> philosophy somewhere? I imagine this won't be the first time this
> >>>>>> discussion comes up, and it would be nice to have a doc to point to.
> >>>>>> I'd be happy to take a stab at this.
> >>>>>>
> >>>>>>
> >>>>>>> On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng <men...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> +1 on Sean's comment. MLlib covers the basic algorithms, but we
> >>>>>>> definitely need to spend more time on how to make the design
> >>>>>>> scalable. For example, think about the current
> >>>>>>> "ProblemWithAlgorithm" naming scheme. That being said, new
> >>>>>>> algorithms are welcome. I hope they are well-established and
> >>>>>>> well-understood by users. They shouldn't be research algorithms
> >>>>>>> tuned to work well with a particular dataset but not tested widely.
> >>>>>>> See the change log from Mahout:
> >>>>>>>
> >>>>>>> ===
> >>>>>>> The following algorithms that were marked deprecated in 0.8 have
> >>>>>>> been removed in 0.9:
> >>>>>>>
> >>>>>>> From Clustering:
> >>>>>>> Switched LDA implementation from using Gibbs Sampling to Collapsed
> >>>>>>> Variational Bayes (CVB)
> >>>>>>> Meanshift
> >>>>>>> MinHash - removed due to poor performance, lack of support and
> >>>>>>> lack of usage
> >>>>>>>
> >>>>>>> From Classification (both are sequential implementations)
> >>>>>>> Winnow - lack of actual usage and support
> >>>>>>> Perceptron - lack of actual usage and support
> >>>>>>>
> >>>>>>> Collaborative Filtering
> >>>>>>> SlopeOne implementations in
> >>>>>>> org.apache.mahout.cf.taste.hadoop.slopeone and
> >>>>>>> org.apache.mahout.cf.taste.impl.recommender.slopeone
> >>>>>>> Distributed pseudo recommender in
> >>>>>>> org.apache.mahout.cf.taste.hadoop.pseudo
> >>>>>>> TreeClusteringRecommender in
> >>>>>>> org.apache.mahout.cf.taste.impl.recommender
> >>>>>>>
> >>>>>>> Mahout Math
> >>>>>>> Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
> >>>>>>> ===
> >>>>>>>
> >>>>>>> In MLlib, we should include the algorithms users know how to use
> >>>>>>> and that we can support, rather than letting algorithms come and go.
> >>>>>>>
> >>>>>>> My $0.02,
> >>>>>>> Xiangrui
> >>>>>>>
> >>>>>>>> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <so...@cloudera.com> wrote:
> >>>>>>>>> On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <p...@mult.ifario.us> wrote:
> >>>>>>>>> - MLlib as Mahout.next would be unfortunate. There are some gems
> >>>>>>>>> in Mahout, but there are also lots of rocks. Setting a minimal bar
> >>>>>>>>> of working, correctly implemented, and documented requires a
> >>>>>>>>> surprising amount of work.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> As someone with first-hand knowledge, this is correct. To Sang's
> >>>>>>>> question, I can't see value in 'porting' Mahout since it is based
> >>>>>>>> on a quite different paradigm. About the only part that translates
> >>>>>>>> is the algorithm concept itself.
> >>>>>>>>
> >>>>>>>> This is also the cautionary tale. The contents of the project have
> >>>>>>>> ended up being a number of "drive-by" contributions of
> >>>>>>>> implementations that, while individually perhaps brilliant
> >>>>>>>> (perhaps), didn't necessarily match any other implementation in
> >>>>>>>> structure, input/output, libraries used. The implementations were
> >>>>>>>> often a touch academic. The result was hard to document, maintain,
> >>>>>>>> evolve or use.
> >>>>>>>>
> >>>>>>>> Far more of the structure of the MLlib implementations is
> >>>>>>>> consistent by virtue of being built around Spark core already.
> >>>>>>>> That's great.
> >>>>>>>>
> >>>>>>>> One can't wait to completely build the foundation before building
> >>>>>>>> any implementations. To me, the existing implementations are almost
> >>>>>>>> exactly the basics I would choose. They cover the bases and will
> >>>>>>>> exercise the abstractions and structure. So that's also great IMHO.
> >>>>>>>>
