Re: Any plans for new clustering algorithms?

2014-04-22 Thread Sandy Ryza
Thanks Matei. I added a section to the How to contribute page.


On Mon, Apr 21, 2014 at 7:25 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 The wiki is actually maintained separately in
 https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We
 restricted editing of the wiki because bots would automatically add stuff.
 I've given you permissions now.

 Matei

 On Apr 21, 2014, at 6:22 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

  I thought those are files of spark.apache.org?
 
  --
  Nan Zhu
 
 
  On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote:
 
  The markdown files are under spark/docs. You can submit a PR for
  changes. -Xiangrui
 
   On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
  How do I get permissions to edit the wiki?
 
 
   On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com wrote:
 
  Cannot agree more with your words. Could you add one section about
  how and what to contribute to MLlib's guide? -Xiangrui
 
   On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath nick.pentre...@gmail.com wrote:
   I'd say a section in the how to contribute page would be a good place
   to put this.
 
  In general I'd say that the criteria for inclusion of an algorithm
 is it
  should be high quality, widely known, used and accepted (citations and
  concrete use cases as examples of this), scalable and parallelizable,
 well
  documented and with reasonable expectation of dev support
 
  Sent from my iPhone
 
   On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com wrote:
 
   If it's not done already, would it make sense to codify this philosophy
   somewhere? I imagine this won't be the first time this discussion comes
   up, and it would be nice to have a doc to point to. I'd be happy to take a
   stab at this.
 
 
   On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng men...@gmail.com wrote:
 
  +1 on Sean's comment. MLlib covers the basic algorithms but we
  definitely need to spend more time on how to make the design
 scalable.
  For example, think about current ProblemWithAlgorithm naming
 scheme.
  That being said, new algorithms are welcomed. I wish they are
  well-established and well-understood by users. They shouldn't be
  research algorithms tuned to work well with a particular dataset
 but
  not tested widely. You see the change log from Mahout:
 
  ===
  The following algorithms that were marked deprecated in 0.8 have
 been
  removed in 0.9:
 
  From Clustering:
  Switched LDA implementation from using Gibbs Sampling to Collapsed
  Variational Bayes (CVB)
  Meanshift
  MinHash - removed due to poor performance, lack of support and
 lack of
  usage
 
  From Classification (both are sequential implementations)
  Winnow - lack of actual usage and support
  Perceptron - lack of actual usage and support
 
  Collaborative Filtering
  SlopeOne implementations in
  org.apache.mahout.cf.taste.hadoop.slopeone and
  org.apache.mahout.cf.taste.impl.recommender.slopeone
  Distributed pseudo recommender in
  org.apache.mahout.cf.taste.hadoop.pseudo
  TreeClusteringRecommender in
  org.apache.mahout.cf.taste.impl.recommender
 
  Mahout Math
  Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
  ===
 
  In MLlib, we should include the algorithms users know how to use
 and
  we can provide support rather than letting algorithms come and go.
 
  My $0.02,
  Xiangrui
 
   On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen so...@cloudera.com wrote:
   On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us wrote:
   - MLlib as Mahout.next would be a unfortunate. There are some gems in
   Mahout, but there are also lots of rocks. Setting a minimal bar of
   working, correctly implemented, and documented requires a surprising amount
   of work.
 
 
  As someone with first-hand knowledge, this is correct. To Sang's
  question, I can't see value in 'porting' Mahout since it is based
 on a
  quite different paradigm. About the only part that translates is
 the
  algorithm concept itself.
 
  This is also the cautionary tale. The contents of the project have
  ended up being a number of drive-by contributions of
 implementations
  that, while individually perhaps brilliant (perhaps), didn't
  necessarily match any other implementation in structure,
 input/output,
  libraries used. The implementations were often a touch academic.
 The
  result was hard to document, maintain, evolve or use.
 
  Far more of the structure of the MLlib implementations are
 consistent
  by virtue of being built around Spark core already. That's great.
 
  One can't wait to completely build the foundation before building
 any
  implementations. To me, the existing implementations are almost
   exactly the basics I would choose.

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Evan R. Sparks
While DBSCAN and others would be welcome contributions, I couldn't agree
more with Sean.




On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen so...@cloudera.com wrote:

 Nobody asked me, and this is a comment on a broader question, not this
 one, but:

 In light of a number of recent items about adding more algorithms,
 I'll say that I personally think an explosion of algorithms should
 come after the MLlib core is more fully baked. I'm thinking of
 finishing out the changes to vectors and matrices, for example. Things
 are going to change significantly in the short term as people use the
 algorithms and see how well the abstractions do or don't work. I've
 seen another similar project suffer mightily from too many algorithms
 too early, so maybe I'm just paranoid.

 Anyway, long-term, I think lots of good algorithms is a right and
 proper goal for MLlib, myself. Consistent approaches, representations
 and APIs will make or break MLlib much more than having or not having
 a particular algorithm. With the plumbing in place, writing the algo
 is the fun easy part.
 --
 Sean Owen | Director, Data Science | London


 On Mon, Apr 21, 2014 at 4:39 PM, Aliaksei Litouka
 aliaksei.lito...@gmail.com wrote:
  Hi, Spark developers.
  Are there any plans for implementing new clustering algorithms in MLLib? As
  far as I understand, current version of Spark ships with only one
  clustering algorithm - K-Means. I want to contribute to Spark and I'm
  thinking of adding more clustering algorithms - maybe
  DBSCAN (http://en.wikipedia.org/wiki/DBSCAN).
  I can start working on it. Does anyone want to join me?
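
For context, the one clustering algorithm mentioned above is MLlib's K-Means.
A minimal usage sketch against the Spark 1.0-era Scala API follows; the input
path and parameter values here are invented for illustration, and exact
signatures may differ in other Spark versions.

   import org.apache.spark.{SparkConf, SparkContext}
   import org.apache.spark.mllib.clustering.KMeans
   import org.apache.spark.mllib.linalg.Vectors

   object KMeansSketch {
     def main(args: Array[String]): Unit = {
       val sc = new SparkContext(
         new SparkConf().setAppName("kmeans-sketch").setMaster("local[*]"))

       // Each input line is assumed to hold whitespace-separated doubles.
       val points = sc.textFile("data/kmeans_data.txt")
         .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
         .cache()

       val k = 2
       val maxIterations = 20
       val model = KMeans.train(points, k, maxIterations)

       // Within-cluster sum of squared errors, the usual K-Means objective.
       println("WSSSE = " + model.computeCost(points))
       model.clusterCenters.foreach(println)

       sc.stop()
     }
   }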



Re: Any plans for new clustering algorithms?

2014-04-21 Thread Paul Brown
I agree that it will be good to see more algorithms added to the MLlib
universe, although this does bring to mind a couple of comments:

- MLlib as Mahout.next would be unfortunate.  There are some gems in
Mahout, but there are also lots of rocks.  Setting a minimal bar of
working, correctly implemented, and documented requires a surprising amount
of work.

- Not getting any signal out of your data with an algorithm like K-means
implies one of the following: (1) there is no signal in your data, (2) you
should try tuning the algorithm differently, (3) you're using K-means
wrong, (4) you should try preparing the data differently, (5) all of the
above, or (6) none of the above.

My $0.02.
-- Paul


—
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen so...@cloudera.com wrote:

 Nobody asked me, and this is a comment on a broader question, not this
 one, but:

 In light of a number of recent items about adding more algorithms,
 I'll say that I personally think an explosion of algorithms should
 come after the MLlib core is more fully baked. I'm thinking of
 finishing out the changes to vectors and matrices, for example. Things
 are going to change significantly in the short term as people use the
 algorithms and see how well the abstractions do or don't work. I've
 seen another similar project suffer mightily from too many algorithms
 too early, so maybe I'm just paranoid.

 Anyway, long-term, I think lots of good algorithms is a right and
 proper goal for MLlib, myself. Consistent approaches, representations
 and APIs will make or break MLlib much more than having or not having
 a particular algorithm. With the plumbing in place, writing the algo
 is the fun easy part.
 --
 Sean Owen | Director, Data Science | London


 On Mon, Apr 21, 2014 at 4:39 PM, Aliaksei Litouka
 aliaksei.lito...@gmail.com wrote:
  Hi, Spark developers.
  Are there any plans for implementing new clustering algorithms in MLLib? As
  far as I understand, current version of Spark ships with only one
  clustering algorithm - K-Means. I want to contribute to Spark and I'm
  thinking of adding more clustering algorithms - maybe
  DBSCAN (http://en.wikipedia.org/wiki/DBSCAN).
  I can start working on it. Does anyone want to join me?



Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sean Owen
On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us wrote:
 - MLlib as Mahout.next would be a unfortunate.  There are some gems in
 Mahout, but there are also lots of rocks.  Setting a minimal bar of
 working, correctly implemented, and documented requires a surprising amount
 of work.

As someone with first-hand knowledge, this is correct. To Sang's
question, I can't see value in 'porting' Mahout since it is based on a
quite different paradigm. About the only part that translates is the
algorithm concept itself.

This is also the cautionary tale. The contents of the project have
ended up being a number of drive-by contributions of implementations
that, while individually perhaps brilliant (perhaps), didn't
necessarily match any other implementation in structure, input/output,
libraries used. The implementations were often a touch academic. The
result was hard to document, maintain, evolve or use.

Far more of the structure of the MLlib implementations are consistent
by virtue of being built around Spark core already. That's great.

One can't wait to completely build the foundation before building any
implementations. To me, the existing implementations are almost
exactly the basics I would choose. They cover the bases and will
exercise the abstractions and structure. So that's also great IMHO.
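
To make the "consistent approaches, representations and APIs" point concrete,
here is a purely hypothetical sketch (not actual MLlib code) of the kind of
shared contract being argued for: every clustering algorithm trains against
RDD[Vector] and returns a model with the same prediction surface, so adding
DBSCAN or anything else would not invent a new calling convention.

   import org.apache.spark.rdd.RDD
   import org.apache.spark.mllib.linalg.Vector

   // Hypothetical: a common result type for all clustering algorithms.
   trait ClusteringModel extends Serializable {
     def predict(point: Vector): Int                // cluster index for one point
     def predict(points: RDD[Vector]): RDD[Int] =   // same semantics in bulk
       points.map(p => predict(p))
   }

   // Hypothetical: a common entry point, regardless of the algorithm inside.
   trait ClusteringAlgorithm[M <: ClusteringModel] {
     def run(data: RDD[Vector]): M
   }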


Re: Any plans for new clustering algorithms?

2014-04-21 Thread Xiangrui Meng
+1 on Sean's comment. MLlib covers the basic algorithms but we
definitely need to spend more time on how to make the design scalable.
For example, think about the current ProblemWithAlgorithm naming scheme.
That being said, new algorithms are welcome. I hope they are
well-established and well-understood by users. They shouldn't be
research algorithms tuned to work well with a particular dataset but
not tested widely. You can see this in the change log from Mahout:

===
The following algorithms that were marked deprecated in 0.8 have been
removed in 0.9:

From Clustering:
  Switched LDA implementation from using Gibbs Sampling to Collapsed
Variational Bayes (CVB)
Meanshift
MinHash - removed due to poor performance, lack of support and lack of usage

From Classification (both are sequential implementations)
Winnow - lack of actual usage and support
Perceptron - lack of actual usage and support

Collaborative Filtering
SlopeOne implementations in
org.apache.mahout.cf.taste.hadoop.slopeone and
org.apache.mahout.cf.taste.impl.recommender.slopeone
Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender

Mahout Math
Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
===

In MLlib, we should include the algorithms users know how to use and
we can provide support rather than letting algorithms come and go.

My $0.02,
Xiangrui

On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen so...@cloudera.com wrote:
 On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us wrote:
 - MLlib as Mahout.next would be a unfortunate.  There are some gems in
 Mahout, but there are also lots of rocks.  Setting a minimal bar of
 working, correctly implemented, and documented requires a surprising amount
 of work.

 As someone with first-hand knowledge, this is correct. To Sang's
 question, I can't see value in 'porting' Mahout since it is based on a
 quite different paradigm. About the only part that translates is the
 algorithm concept itself.

 This is also the cautionary tale. The contents of the project have
 ended up being a number of drive-by contributions of implementations
 that, while individually perhaps brilliant (perhaps), didn't
 necessarily match any other implementation in structure, input/output,
 libraries used. The implementations were often a touch academic. The
 result was hard to document, maintain, evolve or use.

 Far more of the structure of the MLlib implementations are consistent
 by virtue of being built around Spark core already. That's great.

 One can't wait to completely build the foundation before building any
 implementations. To me, the existing implementations are almost
 exactly the basics I would choose. They cover the bases and will
 exercise the abstractions and structure. So that's also great IMHO.
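
For readers outside MLlib, the ProblemWithAlgorithm naming Xiangrui refers to
is the style of entry point whose class name bakes the solver into the problem
(along the lines of LogisticRegressionWithSGD). The sketch below is purely
illustrative -- none of these classes are real MLlib API -- and only contrasts
that coupled style with a decoupled one where the optimizer is plugged in:

   // Coupled naming: every new optimizer multiplies the number of public
   // classes, one per (problem, solver) pair.
   object RidgeRegressionWithSGD   { def train(data: Seq[Double]): String = "ridge via SGD" }
   object RidgeRegressionWithLBFGS { def train(data: Seq[Double]): String = "ridge via L-BFGS" }

   // Decoupled alternative: one problem class, the solver passed as a strategy.
   trait Optimizer { def name: String }
   object SGD   extends Optimizer { val name = "SGD" }
   object LBFGS extends Optimizer { val name = "L-BFGS" }

   class RidgeRegression(opt: Optimizer) {
     def train(data: Seq[Double]): String = "ridge via " + opt.name
   }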


Re: Any plans for new clustering algorithms?

2014-04-21 Thread Nick Pentreath
I'd say a section in the how to contribute page would be a good place to put 
this.

In general I'd say that the criteria for inclusion of an algorithm is it should 
be high quality, widely known, used and accepted (citations and concrete use 
cases as examples of this), scalable and parallelizable, well documented and 
with reasonable expectation of dev support

Sent from my iPhone

 On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com wrote:
 
 If it's not done already, would it make sense to codify this philosophy
 somewhere?  I imagine this won't be the first time this discussion comes
 up, and it would be nice to have a doc to point to.  I'd be happy to take a
 stab at this.
 
 
 On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng men...@gmail.com wrote:
 
 +1 on Sean's comment. MLlib covers the basic algorithms but we
 definitely need to spend more time on how to make the design scalable.
 For example, think about current ProblemWithAlgorithm naming scheme.
 That being said, new algorithms are welcomed. I wish they are
 well-established and well-understood by users. They shouldn't be
 research algorithms tuned to work well with a particular dataset but
 not tested widely. You see the change log from Mahout:
 
 ===
 The following algorithms that were marked deprecated in 0.8 have been
 removed in 0.9:
 
 From Clustering:
  Switched LDA implementation from using Gibbs Sampling to Collapsed
 Variational Bayes (CVB)
 Meanshift
 MinHash - removed due to poor performance, lack of support and lack of
 usage
 
 From Classification (both are sequential implementations)
 Winnow - lack of actual usage and support
 Perceptron - lack of actual usage and support
 
 Collaborative Filtering
SlopeOne implementations in
 org.apache.mahout.cf.taste.hadoop.slopeone and
 org.apache.mahout.cf.taste.impl.recommender.slopeone
Distributed pseudo recommender in
 org.apache.mahout.cf.taste.hadoop.pseudo
TreeClusteringRecommender in
 org.apache.mahout.cf.taste.impl.recommender
 
 Mahout Math
Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
 ===
 
 In MLlib, we should include the algorithms users know how to use and
 we can provide support rather than letting algorithms come and go.
 
 My $0.02,
 Xiangrui
 
 On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen so...@cloudera.com wrote:
 On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us wrote:
 - MLlib as Mahout.next would be a unfortunate.  There are some gems in
 Mahout, but there are also lots of rocks.  Setting a minimal bar of
 working, correctly implemented, and documented requires a surprising
 amount
 of work.
 
 As someone with first-hand knowledge, this is correct. To Sang's
 question, I can't see value in 'porting' Mahout since it is based on a
 quite different paradigm. About the only part that translates is the
 algorithm concept itself.
 
 This is also the cautionary tale. The contents of the project have
 ended up being a number of drive-by contributions of implementations
 that, while individually perhaps brilliant (perhaps), didn't
 necessarily match any other implementation in structure, input/output,
 libraries used. The implementations were often a touch academic. The
 result was hard to document, maintain, evolve or use.
 
 Far more of the structure of the MLlib implementations are consistent
 by virtue of being built around Spark core already. That's great.
 
 One can't wait to completely build the foundation before building any
 implementations. To me, the existing implementations are almost
 exactly the basics I would choose. They cover the bases and will
 exercise the abstractions and structure. So that's also great IMHO.
 


Re: Any plans for new clustering algorithms?

2014-04-21 Thread Xiangrui Meng
Cannot agree more with your words. Could you add one section about
how and what to contribute to MLlib's guide? -Xiangrui

On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
nick.pentre...@gmail.com wrote:
 I'd say a section in the how to contribute page would be a good place to 
 put this.

 In general I'd say that the criteria for inclusion of an algorithm is it 
 should be high quality, widely known, used and accepted (citations and 
 concrete use cases as examples of this), scalable and parallelizable, well 
 documented and with reasonable expectation of dev support

 Sent from my iPhone

 On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com wrote:

 If it's not done already, would it make sense to codify this philosophy
 somewhere?  I imagine this won't be the first time this discussion comes
 up, and it would be nice to have a doc to point to.  I'd be happy to take a
 stab at this.


 On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng men...@gmail.com wrote:

 +1 on Sean's comment. MLlib covers the basic algorithms but we
 definitely need to spend more time on how to make the design scalable.
 For example, think about current ProblemWithAlgorithm naming scheme.
 That being said, new algorithms are welcomed. I wish they are
 well-established and well-understood by users. They shouldn't be
 research algorithms tuned to work well with a particular dataset but
 not tested widely. You see the change log from Mahout:

 ===
 The following algorithms that were marked deprecated in 0.8 have been
 removed in 0.9:

 From Clustering:
  Switched LDA implementation from using Gibbs Sampling to Collapsed
 Variational Bayes (CVB)
 Meanshift
 MinHash - removed due to poor performance, lack of support and lack of
 usage

 From Classification (both are sequential implementations)
 Winnow - lack of actual usage and support
 Perceptron - lack of actual usage and support

 Collaborative Filtering
SlopeOne implementations in
 org.apache.mahout.cf.taste.hadoop.slopeone and
 org.apache.mahout.cf.taste.impl.recommender.slopeone
Distributed pseudo recommender in
 org.apache.mahout.cf.taste.hadoop.pseudo
TreeClusteringRecommender in
 org.apache.mahout.cf.taste.impl.recommender

 Mahout Math
Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
 ===

 In MLlib, we should include the algorithms users know how to use and
 we can provide support rather than letting algorithms come and go.

 My $0.02,
 Xiangrui

 On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen so...@cloudera.com wrote:
 On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us wrote:
 - MLlib as Mahout.next would be a unfortunate.  There are some gems in
 Mahout, but there are also lots of rocks.  Setting a minimal bar of
 working, correctly implemented, and documented requires a surprising
 amount
 of work.

 As someone with first-hand knowledge, this is correct. To Sang's
 question, I can't see value in 'porting' Mahout since it is based on a
 quite different paradigm. About the only part that translates is the
 algorithm concept itself.

 This is also the cautionary tale. The contents of the project have
 ended up being a number of drive-by contributions of implementations
 that, while individually perhaps brilliant (perhaps), didn't
 necessarily match any other implementation in structure, input/output,
 libraries used. The implementations were often a touch academic. The
 result was hard to document, maintain, evolve or use.

 Far more of the structure of the MLlib implementations are consistent
 by virtue of being built around Spark core already. That's great.

 One can't wait to completely build the foundation before building any
 implementations. To me, the existing implementations are almost
 exactly the basics I would choose. They cover the bases and will
 exercise the abstractions and structure. So that's also great IMHO.



Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sandy Ryza
How do I get permissions to edit the wiki?


On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com wrote:

 Cannot agree more with your words. Could you add one section about
 how and what to contribute to MLlib's guide? -Xiangrui

 On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
 nick.pentre...@gmail.com wrote:
  I'd say a section in the how to contribute page would be a good place
 to put this.
 
  In general I'd say that the criteria for inclusion of an algorithm is it
 should be high quality, widely known, used and accepted (citations and
 concrete use cases as examples of this), scalable and parallelizable, well
 documented and with reasonable expectation of dev support
 
  Sent from my iPhone
 
  On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com wrote:
 
  If it's not done already, would it make sense to codify this philosophy
  somewhere?  I imagine this won't be the first time this discussion comes
  up, and it would be nice to have a doc to point to.  I'd be happy to
 take a
  stab at this.
 
 
  On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng men...@gmail.com
 wrote:
 
  +1 on Sean's comment. MLlib covers the basic algorithms but we
  definitely need to spend more time on how to make the design scalable.
  For example, think about current ProblemWithAlgorithm naming scheme.
  That being said, new algorithms are welcomed. I wish they are
  well-established and well-understood by users. They shouldn't be
  research algorithms tuned to work well with a particular dataset but
  not tested widely. You see the change log from Mahout:
 
  ===
  The following algorithms that were marked deprecated in 0.8 have been
  removed in 0.9:
 
  From Clustering:
   Switched LDA implementation from using Gibbs Sampling to Collapsed
  Variational Bayes (CVB)
  Meanshift
  MinHash - removed due to poor performance, lack of support and lack of
  usage
 
  From Classification (both are sequential implementations)
  Winnow - lack of actual usage and support
  Perceptron - lack of actual usage and support
 
  Collaborative Filtering
 SlopeOne implementations in
  org.apache.mahout.cf.taste.hadoop.slopeone and
  org.apache.mahout.cf.taste.impl.recommender.slopeone
 Distributed pseudo recommender in
  org.apache.mahout.cf.taste.hadoop.pseudo
 TreeClusteringRecommender in
  org.apache.mahout.cf.taste.impl.recommender
 
  Mahout Math
 Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
  ===
 
  In MLlib, we should include the algorithms users know how to use and
  we can provide support rather than letting algorithms come and go.
 
  My $0.02,
  Xiangrui
 
  On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen so...@cloudera.com
 wrote:
  On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us
 wrote:
  - MLlib as Mahout.next would be a unfortunate.  There are some gems
 in
  Mahout, but there are also lots of rocks.  Setting a minimal bar of
  working, correctly implemented, and documented requires a surprising
  amount
  of work.
 
  As someone with first-hand knowledge, this is correct. To Sang's
  question, I can't see value in 'porting' Mahout since it is based on a
  quite different paradigm. About the only part that translates is the
  algorithm concept itself.
 
  This is also the cautionary tale. The contents of the project have
  ended up being a number of drive-by contributions of implementations
  that, while individually perhaps brilliant (perhaps), didn't
  necessarily match any other implementation in structure, input/output,
  libraries used. The implementations were often a touch academic. The
  result was hard to document, maintain, evolve or use.
 
  Far more of the structure of the MLlib implementations are consistent
  by virtue of being built around Spark core already. That's great.
 
  One can't wait to completely build the foundation before building any
  implementations. To me, the existing implementations are almost
  exactly the basics I would choose. They cover the bases and will
  exercise the abstractions and structure. So that's also great IMHO.
 



Re: Any plans for new clustering algorithms?

2014-04-21 Thread Xiangrui Meng
The markdown files are under spark/docs. You can submit a PR for
changes. -Xiangrui

On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 How do I get permissions to edit the wiki?


 On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com wrote:

 Cannot agree more with your words. Could you add one section about
 how and what to contribute to MLlib's guide? -Xiangrui

 On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
 nick.pentre...@gmail.com wrote:
  I'd say a section in the how to contribute page would be a good place
 to put this.
 
  In general I'd say that the criteria for inclusion of an algorithm is it
 should be high quality, widely known, used and accepted (citations and
 concrete use cases as examples of this), scalable and parallelizable, well
 documented and with reasonable expectation of dev support
 
  Sent from my iPhone
 
  On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com wrote:
 
  If it's not done already, would it make sense to codify this philosophy
  somewhere?  I imagine this won't be the first time this discussion comes
  up, and it would be nice to have a doc to point to.  I'd be happy to
 take a
  stab at this.
 
 
  On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng men...@gmail.com
 wrote:
 
  +1 on Sean's comment. MLlib covers the basic algorithms but we
  definitely need to spend more time on how to make the design scalable.
  For example, think about current ProblemWithAlgorithm naming scheme.
  That being said, new algorithms are welcomed. I wish they are
  well-established and well-understood by users. They shouldn't be
  research algorithms tuned to work well with a particular dataset but
  not tested widely. You see the change log from Mahout:
 
  ===
  The following algorithms that were marked deprecated in 0.8 have been
  removed in 0.9:
 
  From Clustering:
   Switched LDA implementation from using Gibbs Sampling to Collapsed
  Variational Bayes (CVB)
  Meanshift
  MinHash - removed due to poor performance, lack of support and lack of
  usage
 
  From Classification (both are sequential implementations)
  Winnow - lack of actual usage and support
  Perceptron - lack of actual usage and support
 
  Collaborative Filtering
 SlopeOne implementations in
  org.apache.mahout.cf.taste.hadoop.slopeone and
  org.apache.mahout.cf.taste.impl.recommender.slopeone
 Distributed pseudo recommender in
  org.apache.mahout.cf.taste.hadoop.pseudo
 TreeClusteringRecommender in
  org.apache.mahout.cf.taste.impl.recommender
 
  Mahout Math
 Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
  ===
 
  In MLlib, we should include the algorithms users know how to use and
  we can provide support rather than letting algorithms come and go.
 
  My $0.02,
  Xiangrui
 
  On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen so...@cloudera.com
 wrote:
  On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us
 wrote:
  - MLlib as Mahout.next would be a unfortunate.  There are some gems
 in
  Mahout, but there are also lots of rocks.  Setting a minimal bar of
  working, correctly implemented, and documented requires a surprising
  amount
  of work.
 
  As someone with first-hand knowledge, this is correct. To Sang's
  question, I can't see value in 'porting' Mahout since it is based on a
  quite different paradigm. About the only part that translates is the
  algorithm concept itself.
 
  This is also the cautionary tale. The contents of the project have
  ended up being a number of drive-by contributions of implementations
  that, while individually perhaps brilliant (perhaps), didn't
  necessarily match any other implementation in structure, input/output,
  libraries used. The implementations were often a touch academic. The
  result was hard to document, maintain, evolve or use.
 
  Far more of the structure of the MLlib implementations are consistent
  by virtue of being built around Spark core already. That's great.
 
  One can't wait to completely build the foundation before building any
  implementations. To me, the existing implementations are almost
  exactly the basics I would choose. They cover the bases and will
  exercise the abstractions and structure. So that's also great IMHO.
 



Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sandy Ryza
I thought this might be a good thing to add to the wiki's How to contribute
page (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark),
as it's not tied to a release.


On Mon, Apr 21, 2014 at 6:09 PM, Xiangrui Meng men...@gmail.com wrote:

 The markdown files are under spark/docs. You can submit a PR for
 changes. -Xiangrui

 On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
  How do I get permissions to edit the wiki?
 
 
  On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com wrote:
 
  Cannot agree more with your words. Could you add one section about
  how and what to contribute to MLlib's guide? -Xiangrui
 
  On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
  nick.pentre...@gmail.com wrote:
   I'd say a section in the how to contribute page would be a good
 place
  to put this.
  
   In general I'd say that the criteria for inclusion of an algorithm is
 it
  should be high quality, widely known, used and accepted (citations and
  concrete use cases as examples of this), scalable and parallelizable,
 well
  documented and with reasonable expectation of dev support
  
   Sent from my iPhone
  
   On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com
 wrote:
  
   If it's not done already, would it make sense to codify this
 philosophy
   somewhere?  I imagine this won't be the first time this discussion
 comes
   up, and it would be nice to have a doc to point to.  I'd be happy to
  take a
   stab at this.
  
  
   On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng men...@gmail.com
  wrote:
  
   +1 on Sean's comment. MLlib covers the basic algorithms but we
   definitely need to spend more time on how to make the design
 scalable.
   For example, think about current ProblemWithAlgorithm naming
 scheme.
   That being said, new algorithms are welcomed. I wish they are
   well-established and well-understood by users. They shouldn't be
   research algorithms tuned to work well with a particular dataset but
   not tested widely. You see the change log from Mahout:
  
   ===
   The following algorithms that were marked deprecated in 0.8 have
 been
   removed in 0.9:
  
   From Clustering:
Switched LDA implementation from using Gibbs Sampling to Collapsed
   Variational Bayes (CVB)
   Meanshift
   MinHash - removed due to poor performance, lack of support and lack
 of
   usage
  
   From Classification (both are sequential implementations)
   Winnow - lack of actual usage and support
   Perceptron - lack of actual usage and support
  
   Collaborative Filtering
  SlopeOne implementations in
   org.apache.mahout.cf.taste.hadoop.slopeone and
   org.apache.mahout.cf.taste.impl.recommender.slopeone
  Distributed pseudo recommender in
   org.apache.mahout.cf.taste.hadoop.pseudo
  TreeClusteringRecommender in
   org.apache.mahout.cf.taste.impl.recommender
  
   Mahout Math
  Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
   ===
  
   In MLlib, we should include the algorithms users know how to use and
   we can provide support rather than letting algorithms come and go.
  
   My $0.02,
   Xiangrui
  
   On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen so...@cloudera.com
  wrote:
   On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us
  wrote:
   - MLlib as Mahout.next would be a unfortunate.  There are some
 gems
  in
   Mahout, but there are also lots of rocks.  Setting a minimal bar
 of
   working, correctly implemented, and documented requires a
 surprising
   amount
   of work.
  
   As someone with first-hand knowledge, this is correct. To Sang's
   question, I can't see value in 'porting' Mahout since it is based
 on a
   quite different paradigm. About the only part that translates is
 the
   algorithm concept itself.
  
   This is also the cautionary tale. The contents of the project have
   ended up being a number of drive-by contributions of
 implementations
   that, while individually perhaps brilliant (perhaps), didn't
   necessarily match any other implementation in structure,
 input/output,
   libraries used. The implementations were often a touch academic.
 The
   result was hard to document, maintain, evolve or use.
  
   Far more of the structure of the MLlib implementations are
 consistent
   by virtue of being built around Spark core already. That's great.
  
   One can't wait to completely build the foundation before building
 any
   implementations. To me, the existing implementations are almost
   exactly the basics I would choose. They cover the bases and will
   exercise the abstractions and structure. So that's also great IMHO.
  
 



Re: Any plans for new clustering algorithms?

2014-04-21 Thread Nan Zhu
I thought those are files of spark.apache.org? 

-- 
Nan Zhu


On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote:

 The markdown files are under spark/docs. You can submit a PR for
 changes. -Xiangrui
 
 On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
  How do I get permissions to edit the wiki?
  
  
   On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com wrote:
  
   Cannot agree more with your words. Could you add one section about
   how and what to contribute to MLlib's guide? -Xiangrui
   
    On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath nick.pentre...@gmail.com wrote:
     I'd say a section in the how to contribute page would be a good place
     to put this.

In general I'd say that the criteria for inclusion of an algorithm is it
   should be high quality, widely known, used and accepted (citations and
   concrete use cases as examples of this), scalable and parallelizable, well
   documented and with reasonable expectation of dev support

Sent from my iPhone

  On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com wrote:
 
  If it's not done already, would it make sense to codify this philosophy
  somewhere? I imagine this won't be the first time this discussion comes
  up, and it would be nice to have a doc to point to. I'd be happy to take a
  stab at this.
 
 
   On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng men...@gmail.com wrote:
  
  +1 on Sean's comment. MLlib covers the basic algorithms but we
  definitely need to spend more time on how to make the design 
  scalable.
  For example, think about current ProblemWithAlgorithm naming 
  scheme.
  That being said, new algorithms are welcomed. I wish they are
  well-established and well-understood by users. They shouldn't be
  research algorithms tuned to work well with a particular dataset but
  not tested widely. You see the change log from Mahout:
  
  ===
  The following algorithms that were marked deprecated in 0.8 have 
  been
  removed in 0.9:
  
  From Clustering:
  Switched LDA implementation from using Gibbs Sampling to Collapsed
  Variational Bayes (CVB)
  Meanshift
  MinHash - removed due to poor performance, lack of support and lack 
  of
  usage
  
  From Classification (both are sequential implementations)
  Winnow - lack of actual usage and support
  Perceptron - lack of actual usage and support
  
  Collaborative Filtering
  SlopeOne implementations in
  org.apache.mahout.cf.taste.hadoop.slopeone and
  org.apache.mahout.cf.taste.impl.recommender.slopeone
  Distributed pseudo recommender in
  org.apache.mahout.cf.taste.hadoop.pseudo
  TreeClusteringRecommender in
  org.apache.mahout.cf.taste.impl.recommender
  
  Mahout Math
  Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
  ===
  
  In MLlib, we should include the algorithms users know how to use and
  we can provide support rather than letting algorithms come and go.
  
  My $0.02,
  Xiangrui
  
    On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen so...@cloudera.com wrote:
     On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us wrote:
     - MLlib as Mahout.next would be a unfortunate. There are some gems in
     Mahout, but there are also lots of rocks. Setting a minimal bar of
     working, correctly implemented, and documented requires a surprising amount
     of work.
   
   
   As someone with first-hand knowledge, this is correct. To Sang's
   question, I can't see value in 'porting' Mahout since it is based 
   on a
   quite different paradigm. About the only part that translates is 
   the
   algorithm concept itself.
   
   This is also the cautionary tale. The contents of the project have
   ended up being a number of drive-by contributions of 
   implementations
   that, while individually perhaps brilliant (perhaps), didn't
   necessarily match any other implementation in structure, 
   input/output,
   libraries used. The implementations were often a touch academic. 
   The
   result was hard to document, maintain, evolve or use.
   
   Far more of the structure of the MLlib implementations are 
   consistent
   by virtue of being built around Spark core already. That's great.
   
    One can't wait to completely build the foundation before building any implementations.

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Matei Zaharia
The wiki is actually maintained separately in 
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We restricted 
editing of the wiki because bots would automatically add stuff. I’ve given you 
permissions now.

Matei

On Apr 21, 2014, at 6:22 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

 I thought those are files of spark.apache.org? 
 
 -- 
 Nan Zhu
 
 
 On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote:
 
 The markdown files are under spark/docs. You can submit a PR for
 changes. -Xiangrui
 
 On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 How do I get permissions to edit the wiki?
 
 
 On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com wrote:
 
 Cannot agree more with your words. Could you add one section about
 how and what to contribute to MLlib's guide? -Xiangrui
 
 On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath nick.pentre...@gmail.com wrote:
 I'd say a section in the how to contribute page would be a good place
 to put this.
 
 In general I'd say that the criteria for inclusion of an algorithm is it
 should be high quality, widely known, used and accepted (citations and
 concrete use cases as examples of this), scalable and parallelizable, well
 documented and with reasonable expectation of dev support
 
 Sent from my iPhone
 
 On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com wrote:
 
 If it's not done already, would it make sense to codify this philosophy
 somewhere? I imagine this won't be the first time this discussion comes
 up, and it would be nice to have a doc to point to. I'd be happy to take a
 stab at this.
 
 
 On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng men...@gmail.com wrote:
 
 +1 on Sean's comment. MLlib covers the basic algorithms but we
 definitely need to spend more time on how to make the design scalable.
 For example, think about current ProblemWithAlgorithm naming scheme.
 That being said, new algorithms are welcomed. I wish they are
 well-established and well-understood by users. They shouldn't be
 research algorithms tuned to work well with a particular dataset but
 not tested widely. You see the change log from Mahout:
 
 ===
 The following algorithms that were marked deprecated in 0.8 have been
 removed in 0.9:
 
 From Clustering:
 Switched LDA implementation from using Gibbs Sampling to Collapsed
 Variational Bayes (CVB)
 Meanshift
 MinHash - removed due to poor performance, lack of support and lack of
 usage
 
 From Classification (both are sequential implementations)
 Winnow - lack of actual usage and support
 Perceptron - lack of actual usage and support
 
 Collaborative Filtering
 SlopeOne implementations in
 org.apache.mahout.cf.taste.hadoop.slopeone and
 org.apache.mahout.cf.taste.impl.recommender.slopeone
 Distributed pseudo recommender in
 org.apache.mahout.cf.taste.hadoop.pseudo
 TreeClusteringRecommender in
 org.apache.mahout.cf.taste.impl.recommender
 
 Mahout Math
 Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
 ===
 
 In MLlib, we should include the algorithms users know how to use and
 we can provide support rather than letting algorithms come and go.
 
 My $0.02,
 Xiangrui
 
 On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen so...@cloudera.com wrote:
 On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us wrote:
 - MLlib as Mahout.next would be a unfortunate. There are some gems in
 Mahout, but there are also lots of rocks. Setting a minimal bar of
 working, correctly implemented, and documented requires a surprising amount
 of work.
 
 
 As someone with first-hand knowledge, this is correct. To Sang's
 question, I can't see value in 'porting' Mahout since it is based on a
 quite different paradigm. About the only part that translates is the
 algorithm concept itself.
 
 This is also the cautionary tale. The contents of the project have
 ended up being a number of drive-by contributions of implementations
 that, while individually perhaps brilliant (perhaps), didn't
 necessarily match any other implementation in structure, input/output,
 libraries used. The implementations were often a touch academic. The
 result was hard to document, maintain, evolve or use.
 
 Far more of the structure of the MLlib implementations are consistent
 by virtue of being built around Spark core already. That's great.
 
 One can't wait to completely build the foundation before building any
 implementations. To me, the existing implementations are almost
 exactly the basics I would choose. They cover the bases and will
 exercise the abstractions and structure. So that's also great IMHO.