Re: Contributing to MLlib on GLM
Hi Gang,

No admin is looking at our patch :( Do you have any suggestions so that our patch can get noticed by an admin?

Best regards,
Xiaokai

On Mon, Jun 30, 2014 at 8:18 PM, Gang Bai [via Apache Spark Developers List] wrote:

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-on-GLM-tp7033p7197.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: Contributing to MLlib on GLM
Thanks Xiaokai,

I've created a pull request to merge the features in my PR into your repo. Please review it here: https://github.com/xwei-datageek/spark/pull/2

As for GLMs, here at Sina we are solving the problem of predicting the number of visitors who read a particular news article or watch an online sports live stream in a given period. I'm trying to improve the prediction results by tuning features and incorporating new models, so I'll try Gamma regression later. Thanks for the implementation.

Cheers,
-Gang

On Jun 29, 2014, at 8:17 AM, xwei weixiao...@gmail.com wrote:

Hi Gang,

No worries! I agree LBFGS would converge faster, and your test suite is more comprehensive. I'd like to merge my branch with yours.

I also agree with your viewpoint on the redundancy issue. Different GLMs usually differ only in the gradient calculation, but the regression.scala files are quite similar. For example, linearRegressionSGD, logisticRegressionSGD, RidgeRegressionSGD, and poissonRegressionSGD all share quite a bit of common code in their class implementations. Since such redundancy is already there in the legacy code, simply merging Poisson and Gamma does not seem to help much, so I suggest we leave them as separate classes for the time being.

Best regards,
Xiaokai

On Jun 27, 2014, at 6:45 PM, Gang Bai [via Apache Spark Developers List] wrote:
Re: Contributing to MLlib on GLM
Hi Xiaokai,

My bad. I didn't notice this before I created another PR for Poisson regression; the mails were buried in junk by the corp mail master. Also, thanks for considering my comments and advice in your PR. Adding my two cents here:

* PoissonRegressionModel and GammaRegressionModel have the same fields and prediction method. Shall we use one class instead of two redundant ones? Say, a LogLinearModel.
* The LBFGS optimizer takes fewer iterations and converges better than SGD. I implemented two GeneralizedLinearAlgorithm classes, using LBFGS and SGD respectively. You may take a look into it. If it's OK with you, I'd be happy to send a PR to your branch.
* In addition to the generated test data, we may use some real-world data for testing. In my implementation, I added the test data from https://onlinecourses.science.psu.edu/stat504/node/223. Please check my test suite.

-Gang

Sent from my iPad

On June 27, 2014, at 6:03 PM, xwei [hidden email] wrote:
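Gang's LogLinearModel idea can be sketched roughly as below. This is a hypothetical illustration, not MLlib code: plain arrays stand in for MLlib's Vector type, and the class and method names are made up for the example. The point is that Poisson and Gamma regression with a log link make identical point predictions, exp(w . x + intercept), so a single model class could serve both.

```scala
// Hypothetical sketch of the shared model class proposed above.
// Poisson and Gamma regression with a log link predict identically,
// exp(w . x + intercept), so one class can back both. Plain arrays
// stand in for MLlib's Vector type; names are illustrative only.
class LogLinearModel(val weights: Array[Double], val intercept: Double) {
  def predict(features: Array[Double]): Double = {
    // Linear part: w . x + intercept
    val margin = features.zip(weights).map { case (f, w) => f * w }.sum + intercept
    // Apply the inverse of the log link
    math.exp(margin)
  }
}
```

The concrete Poisson and Gamma model classes would then only differ in how their weights are trained, not in how they predict.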
Re: Contributing to MLlib on GLM
Yes, that's what we did: we added two gradient functions to Gradient.scala and created PoissonRegression and GammaRegression using those gradients. We made a PR for this.
Re: Contributing to MLlib on GLM
Well, as you said, MLlib already supports GLMs in a sense, except that it only supports two link functions: identity (linear regression) and logit (logistic regression). It should not be too hard to add other link functions; all you have to do is add a different gradient function for Poisson, Gamma, etc. Look at Gradient.scala in mllib.

On Tue, Jun 17, 2014 at 5:00 PM, Xiaokai Wei x...@palantir.com wrote:

Hi,

I am an intern at PalantirTech and we are building some stuff on top of MLlib. In particular, GLM is of great interest to us. Though GeneralizedLinearModel in MLlib 1.0.0 covers some important GLMs such as logistic regression and linear regression, other important GLMs like Poisson regression are still missing. I am curious whether anyone is already working on other GLMs (e.g. Poisson, Gamma). If not, we would like to contribute them to MLlib. Is adding more GLMs on the MLlib roadmap?

Sincerely,
Xiaokai
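For the Poisson case, the gradient in question is easy to state. The following is a minimal standalone sketch, not actual Gradient.scala code (MLlib's Gradient API has its own Vector types and compute signature, which this does not reproduce): with a log link, the per-example negative log-likelihood, dropping the constant log(y!) term, is L(w) = exp(w.x) - y * (w.x), and its gradient with respect to w is (exp(w.x) - y) * x.

```scala
// Standalone sketch of a Poisson regression gradient with a log link.
// The object and method names are illustrative, not MLlib's Gradient API.
// Per-example loss (up to the constant log(y!)):  exp(w.x) - y * (w.x)
// Gradient with respect to w:                     (exp(w.x) - y) * x
object PoissonGradient {
  // Returns (loss, gradient) for a single labeled example.
  def compute(x: Array[Double], y: Double, w: Array[Double]): (Double, Array[Double]) = {
    val margin = x.zip(w).map { case (xi, wi) => xi * wi }.sum  // w.x
    val mu = math.exp(margin)                                   // predicted mean
    val loss = mu - y * margin
    val grad = x.map(_ * (mu - y))
    (loss, grad)
  }
}
```

A Gamma gradient would follow the same pattern with its own likelihood; only this per-example computation differs between the two regressions.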
Re: Contributing to MLlib on GLM
Hi Xiaokai,

I think MLlib is definitely interested in supporting additional GLMs. I'm not aware of anybody working on this at the moment.

-Sandy

On Tue, Jun 17, 2014 at 5:00 PM, Xiaokai Wei x...@palantir.com wrote:
Re: Contributing to MLlib on GLM
Hi Xiaokai,

Also take a look through Xiangrui's slides from Hadoop Summit a few weeks back: http://www.slideshare.net/xrmeng/m-llib-hadoopsummit

The roadmap starting at slide 51 will probably be interesting to you.

Andrew

On Tue, Jun 17, 2014 at 7:37 PM, Sandy Ryza sandy.r...@cloudera.com wrote: