Re: Any plans for new clustering algorithms?

Xiangrui Meng Mon, 21 Apr 2014 18:11:13 -0700

The markdown files are under spark/docs. You can submit a PR for
changes. -Xiangrui


On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza <[email protected]> wrote:
> How do I get permissions to edit the wiki?
>
>
> On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng <[email protected]> wrote:
>
>> Cannot agree more with your words. Could you add one section about
>> "how and what to contribute" to MLlib's guide? -Xiangrui
>>
>> On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
>> <[email protected]> wrote:
>> > I'd say a section in the "how to contribute" page would be a good place
>> > to put this.
>> >
>> > In general I'd say that the criteria for inclusion of an algorithm are that it
>> > should be high quality, widely known, used and accepted (citations and
>> > concrete use cases as examples of this), scalable and parallelizable, well
>> > documented, and with a reasonable expectation of dev support.
>> >
>> > Sent from my iPhone
>> >
>> >> On 21 Apr 2014, at 19:59, Sandy Ryza <[email protected]> wrote:
>> >>
>> >> If it's not done already, would it make sense to codify this philosophy
>> >> somewhere?  I imagine this won't be the first time this discussion comes
>> >> up, and it would be nice to have a doc to point to.  I'd be happy to
>> >> take a stab at this.
>> >>
>> >>
>> >>> On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng <[email protected]> wrote:
>> >>>
>> >>> +1 on Sean's comment. MLlib covers the basic algorithms but we
>> >>> definitely need to spend more time on how to make the design scalable.
>> >>> For example, think about the current "ProblemWithAlgorithm" naming scheme.
>> >>> That being said, new algorithms are welcome. I hope they are
>> >>> well-established and well-understood by users. They shouldn't be
>> >>> research algorithms tuned to work well with a particular dataset but
>> >>> not tested widely. See the change log from Mahout:
>> >>>
>> >>> ===
>> >>> The following algorithms that were marked deprecated in 0.8 have been
>> >>> removed in 0.9:
>> >>>
>> >>> From Clustering:
>> >>>  Switched LDA implementation from using Gibbs Sampling to Collapsed
>> >>> Variational Bayes (CVB)
>> >>> Meanshift
>> >>> MinHash - removed due to poor performance, lack of support and lack of
>> >>> usage
>> >>>
>> >>> From Classification (both are sequential implementations)
>> >>> Winnow - lack of actual usage and support
>> >>> Perceptron - lack of actual usage and support
>> >>>
>> >>> Collaborative Filtering
>> >>>    SlopeOne implementations in
>> >>> org.apache.mahout.cf.taste.hadoop.slopeone and
>> >>> org.apache.mahout.cf.taste.impl.recommender.slopeone
>> >>>    Distributed pseudo recommender in
>> >>> org.apache.mahout.cf.taste.hadoop.pseudo
>> >>>    TreeClusteringRecommender in
>> >>> org.apache.mahout.cf.taste.impl.recommender
>> >>>
>> >>> Mahout Math
>> >>>    Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
>> >>> ===
>> >>>
>> >>> In MLlib, we should include the algorithms users know how to use and
>> >>> that we can support, rather than letting algorithms come and go.
>> >>>
>> >>> My $0.02,
>> >>> Xiangrui
>> >>>
>> >>>> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <[email protected]> wrote:
>> >>>>> On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <[email protected]> wrote:
>> >>>>> - MLlib as Mahout.next would be unfortunate.  There are some gems in
>> >>>>> Mahout, but there are also lots of rocks.  Setting a minimal bar of
>> >>>>> working, correctly implemented, and documented requires a surprising
>> >>>>> amount of work.
>> >>>>
>> >>>> As someone with first-hand knowledge, this is correct. To Sang's
>> >>>> question, I can't see value in 'porting' Mahout since it is based on a
>> >>>> quite different paradigm. About the only part that translates is the
>> >>>> algorithm concept itself.
>> >>>>
>> >>>> This is also the cautionary tale. The contents of the project have
>> >>>> ended up being a number of "drive-by" contributions of implementations
>> >>>> that, while individually perhaps brilliant (perhaps), didn't
>> >>>> necessarily match any other implementation in structure, input/output,
>> >>>> libraries used. The implementations were often a touch academic. The
>> >>>> result was hard to document, maintain, evolve or use.
>> >>>>
>> >>>> Far more of the structure of the MLlib implementations is consistent
>> >>>> by virtue of being built around Spark core already. That's great.
>> >>>>
>> >>>> One can't wait to completely build the foundation before building any
>> >>>> implementations. To me, the existing implementations are almost
>> >>>> exactly the basics I would choose. They cover the bases and will
>> >>>> exercise the abstractions and structure. So that's also great IMHO.
>> >>>
>>
