Re: FW: MLLIB (Spark) Question.

2015-06-16 Thread DB Tsai
+cc user@spark.apache.org

Reply inline.

On Tue, Jun 16, 2015 at 2:31 PM, Dhar Sauptik (CR/RTC1.3-NA)
Sauptik.Dhar wrote:
 Hi DB,

 Thank you for the reply. That explains a lot.

 I however had a few points regarding this:-

 1. Just to help with the debate on not regularizing the b parameter: a
 standard reference argues against regularizing the b parameter. See Pg.
 64, para. 1: http://statweb.stanford.edu/~tibs/ElemStatLearn/


Agreed. We were just worried that it would change existing behavior, but
we actually have a PR to change the behavior to the standard one:
https://github.com/apache/spark/pull/6386
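
For concreteness, the convention in question (a sketch, not Spark's exact
internal objective) leaves the intercept out of the penalty. For
L2-regularized logistic regression with labels y_i in {-1, +1}:

\[
\min_{w,\,b}\ \frac{1}{n}\sum_{i=1}^{n}\log\bigl(1+\exp(-y_i(w^\top x_i + b))\bigr)\;+\;\lambda\lVert w\rVert_2^2
\]

The penalty term covers only the feature weights w; the intercept b is
fitted freely.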

 2. Further, is the regularization of b also applied in the SGD
 implementation? Currently the SGD and LBFGS implementations give different
 results (and neither matches the IRLS algorithm). Are SGD and LBFGS
 implemented with different loss functions? Can you please share your
 thoughts on this?


In the SGD implementation, we don't standardize the dataset before
training. As a result, columns with a low standard deviation are
penalized more, and columns with a high standard deviation are
penalized less. Standardization also improves the rate of convergence.
For these reasons, most packages standardize the data implicitly,
compute the weights in the standardized space, and transform them back
to the original space, so the whole process is transparent to users.
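
As an illustration of that last step (a minimal sketch, not Spark's
internal code), weights fitted on features scaled as (x_j - mu_j) / sigma_j
can be mapped back to the original space like this:

// Hypothetical helper: map weights fitted in the standardized feature space
// back to the original space. Assumes each feature was scaled as
// (x_j - mu_j) / sigma_j; constant columns (sigma_j == 0) get weight 0.
def toOriginalSpace(
    weightsStd: Array[Double],
    interceptStd: Double,
    mu: Array[Double],
    sigma: Array[Double]): (Array[Double], Double) = {
  val weights = weightsStd.zip(sigma).map { case (w, s) =>
    if (s == 0.0) 0.0 else w / s
  }
  // The intercept absorbs the shift introduced by centering each column.
  val intercept = interceptStd - weights.zip(mu).map { case (w, m) => w * m }.sum
  (weights, intercept)
}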

1) LORWithSGD: no standardization, and the intercept is penalized.
2) LORWithLBFGS: with standardization, but the intercept is still penalized.
3) New LOR implementation: with standardization, and without penalizing the
intercept.

As a result, only the new implementation in Spark ML handles
everything correctly. We have tests to verify that the results match
R.


 @Naveen: Please feel free to add/comment on the above points as you see 
 necessary.

 Thanks,
 Sauptik.

 -Original Message-
 From: DB Tsai
 Sent: Tuesday, June 16, 2015 2:08 PM
 To: Ramakrishnan Naveen (CR/RTC1.3-NA)
 Cc: Dhar Sauptik (CR/RTC1.3-NA)
 Subject: Re: FW: MLLIB (Spark) Question.

 Hey,

 In the LORWithLBFGS API you use, the intercept is regularized, while
 the other implementations don't regularize the intercept. That's why you
 see the difference.

 The intercept should not be regularized, so we fixed this in the new Spark
 ML API in Spark 1.4. Since not regularizing the intercept in the old API
 would change its behavior, we are still debating whether to make that
 change there.

 See the following code for a full running example in Spark 1.4:
 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionExample.scala
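
 For a flavor of that example (a minimal sketch assuming a DataFrame named
 "training" with "label" and "features" columns; see the link above for the
 authoritative version):

 import org.apache.spark.ml.classification.LogisticRegression

 // Sketch of the new spark.ml API in Spark 1.4 (not the full example).
 val lr = new LogisticRegression()
   .setRegParam(0.1)        // overall regularization strength (lambda)
   .setElasticNetParam(0.0) // 0.0 = pure L2, 1.0 = pure L1
   .setMaxIter(100)
   .setFitIntercept(true)   // the intercept is fitted but not penalized

 val model = lr.fit(training)
 // In Spark 1.4 the fitted coefficients are exposed as "weights"
 // (renamed to "coefficients" in later releases).
 println(s"Weights: ${model.weights}  Intercept: ${model.intercept}")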

 Also check out my talk at Spark Summit:
 http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit


 Sincerely,

 DB Tsai
 --
 Blog: https://www.dbtsai.com
 PGP Key ID: 0xAF08DF8D


 On Mon, Jun 15, 2015 at 11:58 AM, Ramakrishnan Naveen (CR/RTC1.3-NA)
 Naveen.Ramakrishnan wrote:
 Hi DB,
 Hope you are doing well! One of my colleagues, Sauptik, is working with
 MLlib and the LBFGS-based logistic regression, and is having trouble
 reproducing the same results as Matlab. Please see below for
 details. I did take a look into this, but it seems there is also a discrepancy
 between the SGD and LBFGS logistic regression implementations in MLlib.
 We have attached all the code for your analysis – it’s in PySpark, though.
 Let us know if you have any questions or concerns. We would very much
 appreciate your help whenever you get a chance.

 Best,
 Naveen.

 _
 From: Dhar Sauptik (CR/RTC1.3-NA)
 Sent: Thursday, June 11, 2015 6:03 PM
 To: Ramakrishnan Naveen (CR/RTC1.3-NA)
 Subject: MLLIB (Spark) Question.


 Hi Naveen,

 I am writing this owing to some MLlib issues I found while using logistic
 regression. Basically, I am trying to test the stability of the L1/L2-regularized
 logistic regression using SGD and LBFGS. Unfortunately, I am unable to confirm
 the correctness of the algorithms. For comparison, I implemented the
 L2-regularized logistic regression algorithm (using the IRLS algorithm, Pg. 121) from the
 book http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf.
 Unfortunately, the solutions don’t match:-
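
 For reference, here is a minimal sketch in Scala/Breeze (not the attached
 MATLAB code) of a ridge-penalized IRLS/Newton update of the kind being
 compared against, assuming 0/1 labels, a trailing column of ones for the
 intercept, and no penalty on that intercept:

 import breeze.linalg.{DenseMatrix, DenseVector, diag}
 import breeze.numerics.sigmoid

 // Ridge-penalized IRLS (Newton) for binary logistic regression.
 def irls(x: DenseMatrix[Double], y: DenseVector[Double],
          lambda: Double, iters: Int = 25): DenseVector[Double] = {
   val d = x.cols
   var beta = DenseVector.zeros[Double](d)
   val r = DenseVector.ones[Double](d) // per-coefficient penalty indicator
   r(d - 1) = 0.0                      // do not penalize the intercept
   for (_ <- 0 until iters) {
     val p = sigmoid(x * beta)            // predicted probabilities
     val w = p.map(pi => pi * (1.0 - pi)) // IRLS weights p(1 - p)
     val hessian  = x.t * (diag(w) * x) + diag(r * lambda)
     val gradient = x.t * (y - p) - (r *:* beta) * lambda
     beta = beta + (hessian \ gradient)   // Newton step
   }
   beta
 }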

 For example:-

 Using the publicly available data (diabetes.csv) for L2-regularized logistic
 regression (with lambda = 0.1) we get:

 Solutions

 MATLAB CODE (IRLS):-

 w = 0.29429347080
 0.550681766045083
 0.0396336870148899
 0.0641285712055971
 0.101238592147879
 0.261153541551578
 0.178686710290069

 b=  -0.347396594061553


 MLLIB (SGD):-
 (weights=[0.352873922589,0.420391294105,0.0100571908041,0.150724951988,0.238536959009,0.220329295188,0.269139932714],
 intercept=-0.0074992664631)


 MLLIB(LBFGS):-
 (weights=[0.787850211605,1.964589985,-0.209348425939,0.0278848173986,0.12729017522,1.58954647312,0.692671824394

Re: FW: MLLIB (Spark) Question.

2015-06-16 Thread DB Tsai
Hi Dhar,

Regarding standardization, we can effectively disable it by using a
different regularization on each component. That way, we are solving the
same problem but with a better rate of convergence. This is one of the
features I will implement.
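
For intuition, the L2 case of that trick (a sketch of the equivalence, not
the exact implementation): if feature j is scaled by 1/\sigma_j, a
coefficient w'_j in the scaled space corresponds to w_j = w'_j/\sigma_j in
the original space, so

\[
\lambda\sum_j w_j^2 \;=\; \sum_j \frac{\lambda}{\sigma_j^{2}}\,(w'_j)^{2}.
\]

Training on the scaled features with per-component penalties
\lambda_j = \lambda/\sigma_j^2 therefore reaches the same optimum as the
unscaled problem, while keeping the conditioning benefits of standardization.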

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Tue, Jun 16, 2015 at 8:34 PM, Dhar Sauptik (CR/RTC1.3-NA)
sauptik.d...@us.bosch.com wrote:
 Hi DB,

 Thank you for the reply. The answers make sense. I have just one more
 point to add.

 Note that it may be better to not implicitly standardize the data. Agreed 
 that a number of algorithms benefit from such standardization, but for many 
 applications with contextual information such standardization may not be 
 desirable.
 Users can always perform the standardization themselves.

 However, that's just a suggestion. Again, thank you for the clarification.

 Thanks,
 Sauptik.

