Re: Any plans for new clustering algorithms?
Thanks, Matei. I added a section to the How to contribute page.

On Mon, Apr 21, 2014 at 7:25 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote:

The wiki is actually maintained separately at https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We restricted editing of the wiki because bots would automatically add stuff. I've given you permissions now.

Matei
Re: Any plans for new clustering algorithms?
While DBSCAN and others would be welcome contributions, I couldn't agree more with Sean.

On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen (so...@cloudera.com) wrote:

Nobody asked me, and this is a comment on a broader question, not this one, but: in light of a number of recent items about adding more algorithms, I'll say that I personally think an explosion of algorithms should come after the MLlib core is more fully baked. I'm thinking of finishing out the changes to vectors and matrices, for example. Things are going to change significantly in the short term as people use the algorithms and see how well the abstractions do or don't work. I've seen another similar project suffer mightily from too many algorithms too early, so maybe I'm just paranoid.

Anyway, long-term, I think lots of good algorithms is a right and proper goal for MLlib, myself. Consistent approaches, representations and APIs will make or break MLlib much more than having or not having a particular algorithm. With the plumbing in place, writing the algo is the fun easy part.

-- Sean Owen | Director, Data Science | London

On Mon, Apr 21, 2014 at 4:39 PM, Aliaksei Litouka (aliaksei.lito...@gmail.com) wrote:

Hi, Spark developers. Are there any plans for implementing new clustering algorithms in MLlib? As far as I understand, the current version of Spark ships with only one clustering algorithm - K-Means. I want to contribute to Spark and I'm thinking of adding more clustering algorithms - maybe DBSCAN (http://en.wikipedia.org/wiki/DBSCAN). I can start working on it. Does anyone want to join me?
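For context on the algorithm being proposed: DBSCAN is a density-based method that, unlike K-Means, needs no preset cluster count and labels sparse points as noise. A minimal single-machine sketch, for illustration only (the names `dbscan`, `eps`, and `min_pts` are our own; an actual MLlib contribution would need a parallel, Spark-based design):

```python
# Minimal single-machine DBSCAN sketch -- illustrative only, not MLlib code.
from math import dist  # Python 3.8+

def dbscan(points, eps, min_pts):
    """Return a cluster label per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        nbrs = region(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE          # may be relabeled as a border point later
            continue
        labels[i] = cluster            # i is a core point: grow a new cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins the cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            jn = region(j)
            if len(jn) >= min_pts:     # j is also a core point: keep expanding
                queue.extend(jn)
        cluster += 1
    return labels
```

With eps=2 and min_pts=3, two tight groups of points come back with distinct labels while an isolated point is marked -1; this naive version recomputes neighborhoods in O(n^2), which is exactly why a scalable design matters for MLlib.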
Re: Any plans for new clustering algorithms?
I agree that it will be good to see more algorithms added to the MLlib universe, although this does bring to mind a couple of comments:

- MLlib as Mahout.next would be unfortunate. There are some gems in Mahout, but there are also lots of rocks. Setting a minimal bar of working, correctly implemented, and documented requires a surprising amount of work.

- Not getting any signal out of your data with an algorithm like K-means implies one of the following: (1) there is no signal in your data, (2) you should try tuning the algorithm differently, (3) you're using K-means wrong, (4) you should try preparing the data differently, (5) all of the above, or (6) none of the above.

My $0.02.

-- Paul — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
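Paul's points (2) and (4), tuning and data preparation, are easy to experiment with on a toy scale. A plain Lloyd's-iteration sketch, our own throwaway code rather than the MLlib implementation, deterministic given the initial centers:

```python
# Plain Lloyd's algorithm for K-means -- a throwaway illustration, not MLlib code.
# Deterministic given the initial centers, so results are reproducible.

def kmeans(points, centers, iters=20):
    """Refine the given initial centers; returns the final centers."""
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            best = min(range(len(centers)),
                       key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[best].append(p)
        # Update step: move each center to the mean of its assigned points.
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else ctr
                   for cl, ctr in zip(clusters, centers)]
    return centers
```

"Preparing the data differently" usually means standardizing features first, since raw Euclidean distance lets a large-scale feature drown out the others; "tuning differently" covers both the choice of k and the initialization.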
Re: Any plans for new clustering algorithms?
On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown (p...@mult.ifario.us) wrote:

- MLlib as Mahout.next would be unfortunate. There are some gems in Mahout, but there are also lots of rocks. Setting a minimal bar of working, correctly implemented, and documented requires a surprising amount of work.

As someone with first-hand knowledge, this is correct. To Sang's question, I can't see value in 'porting' Mahout since it is based on a quite different paradigm. About the only part that translates is the algorithm concept itself.

This is also the cautionary tale. The contents of the project ended up being a number of drive-by contributions of implementations that, while individually perhaps brilliant, didn't necessarily match any other implementation in structure, input/output, or libraries used. The implementations were often a touch academic. The result was hard to document, maintain, evolve, or use.

Far more of the structure of the MLlib implementations is consistent by virtue of being built around Spark core already. That's great. One can't wait to completely build the foundation before building any implementations. To me, the existing implementations are almost exactly the basics I would choose. They cover the bases and will exercise the abstractions and structure. So that's also great IMHO.
Re: Any plans for new clustering algorithms?
+1 on Sean's comment. MLlib covers the basic algorithms, but we definitely need to spend more time on making the design scalable. For example, think about the current ProblemWithAlgorithm naming scheme.

That being said, new algorithms are welcome. I wish them to be well-established and well-understood by users. They shouldn't be research algorithms tuned to work well on a particular dataset but not widely tested. See the change log from Mahout:

===
The following algorithms that were marked deprecated in 0.8 have been removed in 0.9:

From Clustering:
- Switched LDA implementation from using Gibbs Sampling to Collapsed Variational Bayes (CVB)
- Meanshift
- MinHash - removed due to poor performance, lack of support and lack of usage

From Classification (both are sequential implementations):
- Winnow - lack of actual usage and support
- Perceptron - lack of actual usage and support

Collaborative Filtering:
- SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and org.apache.mahout.cf.taste.impl.recommender.slopeone
- Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
- TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender

Mahout Math:
- Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
===

In MLlib, we should include the algorithms users know how to use and that we can support, rather than letting algorithms come and go.

My $0.02,
Xiangrui
Re: Any plans for new clustering algorithms?
I'd say a section in the how to contribute page would be a good place to put this. In general, I'd say the criteria for inclusion of an algorithm are that it should be: high quality; widely known, used, and accepted (citations and concrete use cases as evidence of this); scalable and parallelizable; well documented; and carrying a reasonable expectation of dev support.

Sent from my iPhone

On 21 Apr 2014, at 19:59, Sandy Ryza (sandy.r...@cloudera.com) wrote:

If it's not done already, would it make sense to codify this philosophy somewhere? I imagine this won't be the first time this discussion comes up, and it would be nice to have a doc to point to. I'd be happy to take a stab at this.
Re: Any plans for new clustering algorithms?
Cannot agree more with your words. Could you add one section about how and what to contribute to MLlib's guide?

-Xiangrui
Re: Any plans for new clustering algorithms?
How do I get permissions to edit the wiki?
Re: Any plans for new clustering algorithms?
The markdown files are under spark/docs. You can submit a PR for changes.

-Xiangrui
Re: Any plans for new clustering algorithms?
I thought this might be a good thing to add to the wiki's How to contribute page (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark), as it's not tied to a release.
Re: Any plans for new clustering algorithms?
I thought those are files of spark.apache.org? -- Nan Zhu

On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote:

The markdown files are under spark/docs. You can submit a PR for changes. -Xiangrui
Re: Any plans for new clustering algorithms?
The wiki is actually maintained separately in https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We restricted editing of the wiki because bots would automatically add stuff. I've given you permissions now. Matei

On Apr 21, 2014, at 6:22 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

I thought those are files of spark.apache.org? -- Nan Zhu