Re: FW: MLLIB (Spark) Question.
+cc user@spark.apache.org. Replies inline.

On Tue, Jun 16, 2015 at 2:31 PM, Dhar Sauptik (CR/RTC1.3-NA) Sauptik.Dhar wrote:

> Hi DB,
>
> Thank you for the reply. That explains a lot. I did, however, have a few points regarding this:
>
> 1. Just to help with the debate about not regularizing the b parameter: a standard treatment argues against regularizing b. See pg. 64, para. 1 of http://statweb.stanford.edu/~tibs/ElemStatLearn/

Agreed. We were just worried that it would change existing behavior, but we do have a PR that changes the behavior to the standard one: https://github.com/apache/spark/pull/6386

> 2. Further, does the regularization of b also apply to the SGD implementation? Currently the SGD and BFGS implementations give different results (and neither matches the IRLS algorithm). Are SGD and BFGS implemented with different loss functions? Can you please share your thoughts on this?

In the SGD implementation, we don't standardize the dataset before training. As a result, columns with low standard deviation are penalized more, and columns with high standard deviation are penalized less. Standardization also improves the rate of convergence, so most packages standardize the data implicitly, obtain the weights in the standardized space, and transform them back to the original space, making the whole process transparent to users. To summarize the three implementations:

1) LORWithSGD: no standardization, and the intercept is penalized.
2) LORWithLBFGS: standardization, but the intercept is still penalized.
3) New LOR implementation: standardization, without penalizing the intercept.

As a result, only the new implementation in Spark ML handles everything correctly, and we have tests verifying that the results match R.
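To make the three code paths concrete, here is a minimal PySpark sketch against the Spark 1.4 APIs discussed above ("training" is an assumed RDD of LabeledPoint, and "df" an assumed DataFrame with "label" and "features" columns):

    from pyspark.mllib.classification import (LogisticRegressionWithSGD,
                                              LogisticRegressionWithLBFGS)
    from pyspark.ml.classification import LogisticRegression

    # 1) LORWithSGD: no internal standardization; the intercept is penalized.
    m1 = LogisticRegressionWithSGD.train(training, regParam=0.1,
                                         regType="l2", intercept=True)

    # 2) LORWithLBFGS: standardizes internally, but still penalizes the intercept.
    m2 = LogisticRegressionWithLBFGS.train(training, regParam=0.1,
                                           regType="l2", intercept=True)

    # 3) New spark.ml API: standardizes internally and leaves the intercept
    #    unpenalized; elasticNetParam=0.0 selects a pure L2 penalty.
    lr = LogisticRegression(regParam=0.1, elasticNetParam=0.0)
    m3 = lr.fit(df)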
> @Naveen: Please feel free to add to or comment on the above points as you see necessary.
>
> Thanks,
> Sauptik.

-----Original Message-----
From: DB Tsai
Sent: Tuesday, June 16, 2015 2:08 PM
To: Ramakrishnan Naveen (CR/RTC1.3-NA)
Cc: Dhar Sauptik (CR/RTC1.3-NA)
Subject: Re: FW: MLLIB (Spark) Question.

Hey,

In the LORWithLBFGS API you are using, the intercept is regularized, while the other implementations don't regularize the intercept. That's why you see the difference. The intercept should not be regularized, so we fixed this in the new Spark ML API in Spark 1.4. Since not regularizing the intercept in the old API would change its behavior, we are still debating whether to change the old version.

See the following code for a full running example in Spark 1.4:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionExample.scala

And also check out my talk at Spark Summit:
http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Mon, Jun 15, 2015 at 11:58 AM, Ramakrishnan Naveen (CR/RTC1.3-NA) Naveen.Ramakrishnan wrote:
> Hi DB,
>
> Hope you are doing well! One of my colleagues, Sauptik, is working with MLlib and the logistic regression based on LBFGS, and is having trouble reproducing the same results as MATLAB. Please see below for details. I did take a look into this, but there also seems to be a discrepancy between the SGD and LBFGS logistic regression implementations in MLlib. We have attached all the code for your analysis (it's in PySpark, though). Let us know if you have any questions or concerns. We would very much appreciate your help whenever you get a chance.
>
> Best,
> Naveen.

________________________________
From: Dhar Sauptik (CR/RTC1.3-NA)
Sent: Thursday, June 11, 2015 6:03 PM
To: Ramakrishnan Naveen (CR/RTC1.3-NA)
Subject: MLLIB (Spark) Question.

Hi Naveen,

I am writing this owing to some MLlib issues I found while using logistic regression. Basically, I am trying to test the stability of the L1/L2-regularized logistic regression using SGD and BFGS. Unfortunately, I am unable to confirm the correctness of the algorithms. For comparison, I implemented the L2-regularized logistic regression algorithm (using the IRLS algorithm, pg. 121) from the book http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf. Unfortunately, the solutions don't match. For example, using the publicly available data (diabetes.csv) for L2-regularized logistic regression (with lambda = 0.1), we get:

MATLAB (IRLS):
w = [0.29429347080, 0.550681766045083, 0.0396336870148899, 0.0641285712055971, 0.101238592147879, 0.261153541551578, 0.178686710290069]
b = -0.347396594061553

MLLIB (SGD):
(weights=[0.352873922589, 0.420391294105, 0.0100571908041, 0.150724951988, 0.238536959009, 0.220329295188, 0.269139932714], intercept=-0.0074992664631)

MLLIB (LBFGS):
(weights=[0.787850211605, 1.964589985, -0.209348425939, 0.0278848173986, 0.12729017522, 1.58954647312, 0.692671824394
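For reference, the IRLS baseline being compared against above can be sketched in a few lines of NumPy. This is a sketch under the ESL conventions (binary labels y in {0, 1}, unpenalized intercept b, penalty (lambda/2)*||w||^2); note that regularization scaling conventions differ across packages (MLlib, for instance, averages the loss over the n examples), so lambda values are not directly comparable between implementations:

    import numpy as np

    def irls_logistic(X, y, lam=0.1, iters=50):
        """L2-regularized logistic regression via IRLS (Newton's method),
        leaving the intercept unpenalized. X: (n, d) array, y: {0,1} labels."""
        n, d = X.shape
        Xb = np.hstack([np.ones((n, 1)), X])      # prepend intercept column
        beta = np.zeros(d + 1)
        R = lam * np.eye(d + 1)
        R[0, 0] = 0.0                             # do not penalize b
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-Xb.dot(beta)))
            W = p * (1.0 - p)                     # IRLS weights
            H = Xb.T.dot(W[:, None] * Xb) + R     # penalized Hessian
            g = Xb.T.dot(y - p) - R.dot(beta)     # penalized gradient
            beta = beta + np.linalg.solve(H, g)   # Newton update
        return beta[0], beta[1:]                  # b, w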
Re: FW: MLLIB (Spark) Question.
Hi Dhar,

For standardization, we can effectively disable it by using a different regularization on each component. That way, we solve the same problem but get a better rate of convergence. This is one of the features I will implement.

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Tue, Jun 16, 2015 at 8:34 PM, Dhar Sauptik (CR/RTC1.3-NA) sauptik.d...@us.bosch.com wrote:
> Hi DB,
>
> Thank you for the reply. The answers make sense. I have just one more point to add: it may be better not to standardize the data implicitly. Agreed that a number of algorithms benefit from such standardization, but for many applications with contextual information it may not be desirable. Users can always perform the standardization themselves. That's just a suggestion, though.
>
> Again, thank you for the clarification.
>
> Thanks,
> Sauptik.
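The equivalence DB describes is easiest to see in the ridge-regression case, where the solution is closed-form. A small NumPy sketch (an illustration of the principle, not MLlib's code; scaling only, ignoring centering for simplicity): solving in standardized space with the per-component penalty lambda_j = lambda / sigma_j^2 and mapping the weights back via w_j = v_j / sigma_j recovers exactly the unstandardized solution.

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(100, 3) * np.array([1.0, 10.0, 0.1])  # wildly different scales
    y = rng.randn(100)
    lam = 0.1
    sigma = X.std(axis=0)

    # Ridge on the raw features: argmin ||y - Xw||^2 + lam * ||w||^2
    w_raw = np.linalg.solve(X.T.dot(X) + lam * np.eye(3), X.T.dot(y))

    # Same problem in standardized space, with per-component penalties
    # lam_j = lam / sigma_j^2, then mapped back via w_j = v_j / sigma_j.
    Z = X / sigma
    v = np.linalg.solve(Z.T.dot(Z) + np.diag(lam / sigma**2), Z.T.dot(y))
    w_back = v / sigma

    print(np.allclose(w_raw, w_back))  # True: identical solutions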