+cc user@spark.apache.org

Reply inline.
On Tue, Jun 16, 2015 at 2:31 PM, Dhar Sauptik (CR/RTC1.3-NA)
<Sauptik.Dhar> wrote:
> Hi DB,
>
> Thank you for the reply. That explains a lot.
>
> I however had a few points regarding this:
>
> 1. Just to help with the debate about not regularizing the b parameter:
> the standard implementation argues against regularizing b. See p. 64,
> para. 1: http://statweb.stanford.edu/~tibs/ElemStatLearn/

Agreed. Our only worry is that it will change the behavior of the existing
API, but we actually have a PR to change the behavior to the standard one:
https://github.com/apache/spark/pull/6386

> 2. Further, is the regularization of b also applicable to the SGD
> implementation? Currently the SGD and BFGS implementations give different
> results (and neither implementation matches the IRLS algorithm). Are SGD
> and BFGS implemented for different loss functions? Can you please share
> your thoughts on this?

In the SGD implementation, we don't "standardize" the dataset before
training. As a result, columns with a low standard deviation are penalized
more, and columns with a high standard deviation are penalized less.
Standardization also helps the rate of convergence. For these reasons, most
packages "standardize" the data implicitly, obtain the weights in the
"standardized" space, and transform them back to the original space, so the
whole process is transparent to users.

1) LORWithSGD: no standardization, and penalizes the intercept.
2) LORWithLBFGS: standardization, but penalizes the intercept.
3) New LOR implementation: standardization, without penalizing the
intercept.

As a result, only the new implementation in Spark ML handles everything
correctly, and we have tests verifying that the results match R.
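To make this concrete, here is a minimal PySpark sketch of the new Spark ML
API in 1.4 (the full Scala version is the LogisticRegressionExample linked
further down in the thread). The DataFrame `training`, with "label" and
"features" columns, is an assumption for illustration and is not part of the
original thread:

    from pyspark.ml.classification import LogisticRegression

    # `training` is an assumed DataFrame with a "label" column (0.0/1.0)
    # and a "features" vector column, e.g. built with VectorAssembler.
    lr = LogisticRegression(
        maxIter=100,
        regParam=0.1,         # regularization strength (lambda)
        elasticNetParam=0.0)  # 0.0 = pure L2 penalty, 1.0 = pure L1
    model = lr.fit(training)

    # Weights are reported in the original (unstandardized) feature space;
    # the internal standardization described above is transparent here.
    print(model.weights, model.intercept)

One caveat when comparing against other packages: how regParam maps onto
their lambda depends on whether the loss is averaged over the number of
examples, so coefficients may differ by that scaling even when both tools
are correct.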
> @Naveen: Please feel free to add/comment on the above points as you see
> necessary.
>
> Thanks,
> Sauptik.
>
> -----Original Message-----
> From: DB Tsai
> Sent: Tuesday, June 16, 2015 2:08 PM
> To: Ramakrishnan Naveen (CR/RTC1.3-NA)
> Cc: Dhar Sauptik (CR/RTC1.3-NA)
> Subject: Re: FW: MLLIB (Spark) Question.
>
> Hey,
>
> In the LORWithLBFGS API you use, the intercept is regularized, while the
> other implementations don't regularize the intercept. That's why you see
> the difference.
>
> The intercept should not be regularized, so we fixed this in the new
> Spark ML API in Spark 1.4. Since not regularizing the intercept would
> change the behavior of the old API, we are still debating whether to
> change it there.
>
> See the following code for a full running example in Spark 1.4:
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionExample.scala
>
> Also check out my talk at Spark Summit:
> http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
> On Mon, Jun 15, 2015 at 11:58 AM, Ramakrishnan Naveen (CR/RTC1.3-NA)
> <Naveen.Ramakrishnan> wrote:
>> Hi DB,
>> Hope you are doing well! One of my colleagues, Sauptik, is working with
>> MLlib and the logistic regression based on LBFGS, and is having trouble
>> reproducing the same results as MATLAB. Please see below for details. I
>> took a look into this, but it seems there is also a discrepancy between
>> the SGD and LBFGS logistic regression implementations in MLlib. We have
>> attached all the code for your analysis; it's in PySpark, though.
>> Let us know if you have any questions or concerns. We would very much
>> appreciate your help whenever you get a chance.
>>
>> Best,
>> Naveen.
>>
>> _____________________________________________
>> From: Dhar Sauptik (CR/RTC1.3-NA)
>> Sent: Thursday, June 11, 2015 6:03 PM
>> To: Ramakrishnan Naveen (CR/RTC1.3-NA)
>> Subject: MLLIB (Spark) Question.
>>
>> Hi Naveen,
>>
>> I am writing this owing to some MLlib issues I found while using
>> logistic regression. Basically, I am trying to test the stability of the
>> L1/L2-regularized logistic regression using SGD and BFGS, and
>> unfortunately I am unable to confirm the correctness of the algorithms.
>> For comparison, I implemented the L2-regularized logistic regression
>> algorithm (using the IRLS algorithm, p. 121) from the book
>> http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
>> . Unfortunately, the solutions don't match.
>>
>> For example, using the publicly available data (diabetes.csv) for
>> L2-regularized logistic regression (with lambda = 0.1) we get the
>> following solutions:
>>
>> MATLAB CODE (IRLS):
>>
>> w = 0.294293470805555
>> 0.550681766045083
>> 0.0396336870148899
>> 0.0641285712055971
>> 0.101238592147879
>> 0.261153541551578
>> 0.178686710290069
>>
>> b = -0.347396594061553
>>
>> MLLIB (SGD):
>> (weights=[0.352873922589,0.420391294105,0.0100571908041,0.150724951988,0.238536959009,0.220329295188,0.269139932714],
>> intercept=-0.00749988882664631)
>>
>> MLLIB (LBFGS):
>> (weights=[0.787850211605,1.964589985,-0.209348425939,0.0278848173986,0.12729017522,1.58954647312,0.692671824394],
>> intercept=-0.027401869113912316)
>>
>> All the code is attached to the email.
>>
>> Apparently the solutions are quite far from optimal (and even from each
>> other)! Can you please check with DB Tsai on the reasons for such
>> differences? Note that all the additional parameters are described in
>> the source code.
>>
>> Thanks,
>> Best regards,
>>
>> Sauptik Dhar, Ph.D.
>> CR/RTC1.3-NA

Sincerely,

DB Tsai
----------------------------------------------------------
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
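For reference, the IRLS baseline discussed in the original message can be
sketched in a few lines of NumPy. This is an illustrative reconstruction of
L2-regularized logistic regression fit by Newton/IRLS (ESL, p. 121), not the
MATLAB code attached to the thread; the function name and the choice to
leave the intercept unpenalized (per the ESL reference cited above) are
assumptions:

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def irls_logistic_l2(X, y, lam=0.1, n_iter=25, tol=1e-8):
        """Minimize -loglik + (lam/2)*||w||^2 by Newton/IRLS.
        X: (n, d) feature matrix, y: (n,) labels in {0, 1}.
        The intercept (column 0 below) is NOT penalized."""
        n, d = X.shape
        Xa = np.hstack([np.ones((n, 1)), X])   # prepend intercept column
        beta = np.zeros(d + 1)
        R = np.eye(d + 1)
        R[0, 0] = 0.0                          # exclude intercept from penalty
        for _ in range(n_iter):
            p = sigmoid(Xa @ beta)             # current probabilities
            grad = Xa.T @ (p - y) + lam * (R @ beta)
            W = p * (1.0 - p)                  # IRLS working weights
            H = (Xa.T * W) @ Xa + lam * R      # Hessian: X'WX + lam*R
            step = np.linalg.solve(H, grad)    # Newton step
            beta -= step
            if np.linalg.norm(step) < tol:
                break
        return beta[1:], beta[0]               # (weights w, intercept b)

Note that this objective is not averaged over the n examples, whereas
MLlib's regParam applies to an averaged loss, so lam here and regParam there
are generally not directly interchangeable.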