+cc user@spark.apache.org

Reply inline.
On Tue, Jun 16, 2015 at 2:31 PM, Dhar Sauptik (CR/RTC1.3-NA)
<Sauptik.Dhar> wrote:
> Hi DB,
>
> Thank you for the reply. That explains a lot.
>
> I however had a few points regarding this:
>
> 1. Just to help with the debate about not regularizing the b parameter:
> the standard implementation argues against regularizing b. See p. 64,
> para. 1: http://statweb.stanford.edu/~tibs/ElemStatLearn/

Agreed. Our only worry is that it will change the behavior of the existing
API, but we actually have a PR to change the behavior to the standard one:
https://github.com/apache/spark/pull/6386

> 2. Further, is the regularization of b also applicable to the SGD
> implementation? Currently the SGD and BFGS implementations give different
> results (and neither implementation matches the IRLS algorithm). Are SGD
> and BFGS implemented for different loss functions? Can you please share
> your thoughts on this?

In the SGD implementation, we don't "standardize" the dataset before
training. As a result, columns with a low standard deviation are penalized
more, and columns with a high standard deviation are penalized less.
Standardization also helps the rate of convergence. For these reasons, most
packages "standardize" the data implicitly, obtain the weights in the
"standardized" space, and transform them back to the original space, so the
whole process is transparent to users.

1) LORWithSGD: no standardization, and penalizes the intercept.
2) LORWithLBFGS: standardization, but penalizes the intercept.
3) New LOR implementation: standardization, without penalizing the
intercept.

As a result, only the new implementation in Spark ML handles everything
correctly, and we have tests verifying that the results match R.
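To make this concrete, here is a minimal PySpark sketch of the new Spark ML
API in 1.4 (the full Scala version is the LogisticRegressionExample linked
further down in the thread). The DataFrame `training`, with "label" and
"features" columns, is an assumption for illustration and is not part of the
original thread:

    from pyspark.ml.classification import LogisticRegression

    # `training` is an assumed DataFrame with a "label" column (0.0/1.0)
    # and a "features" vector column, e.g. built with VectorAssembler.
    lr = LogisticRegression(
        maxIter=100,
        regParam=0.1,         # regularization strength (lambda)
        elasticNetParam=0.0)  # 0.0 = pure L2 penalty, 1.0 = pure L1
    model = lr.fit(training)

    # Weights are reported in the original (unstandardized) feature space;
    # the internal standardization described above is transparent here.
    print(model.weights, model.intercept)

One caveat when comparing against other packages: how regParam maps onto
their lambda depends on whether the loss is averaged over the number of
examples, so coefficients may differ by that scaling even when both tools
are correct.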
> @Naveen: Please feel free to add/comment on the above points as you see
> necessary.
>
> Thanks,
> Sauptik.
>
> -----Original Message-----
> From: DB Tsai
> Sent: Tuesday, June 16, 2015 2:08 PM
> To: Ramakrishnan Naveen (CR/RTC1.3-NA)
> Cc: Dhar Sauptik (CR/RTC1.3-NA)
> Subject: Re: FW: MLLIB (Spark) Question.
>
> Hey,
>
> In the LORWithLBFGS API you use, the intercept is regularized, while the
> other implementations don't regularize the intercept. That's why you see
> the difference.
>
> The intercept should not be regularized, so we fixed this in the new
> Spark ML API in Spark 1.4. Since not regularizing the intercept would
> change the behavior of the old API, we are still debating whether to
> change it there.
>
> See the following code for a full running example in Spark 1.4:
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionExample.scala
>
> Also check out my talk at Spark Summit:
> http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
> On Mon, Jun 15, 2015 at 11:58 AM, Ramakrishnan Naveen (CR/RTC1.3-NA)
> <Naveen.Ramakrishnan> wrote:
>> Hi DB,
>> Hope you are doing well! One of my colleagues, Sauptik, is working with
>> MLlib and the logistic regression based on LBFGS, and is having trouble
>> reproducing the same results as MATLAB. Please see below for details. I
>> took a look into this, but it seems there is also a discrepancy between
>> the SGD and LBFGS logistic regression implementations in MLlib. We have
>> attached all the code for your analysis; it's in PySpark, though.
>> Let us know if you have any questions or concerns. We would very much
>> appreciate your help whenever you get a chance.
>>
>> Best,
>> Naveen.
>>
>> _____________________________________________
>> From: Dhar Sauptik (CR/RTC1.3-NA)
>> Sent: Thursday, June 11, 2015 6:03 PM
>> To: Ramakrishnan Naveen (CR/RTC1.3-NA)
>> Subject: MLLIB (Spark) Question.
>>
>> Hi Naveen,
>>
>> I am writing this owing to some MLlib issues I found while using
>> logistic regression. Basically, I am trying to test the stability of the
>> L1/L2-regularized logistic regression using SGD and BFGS, and
>> unfortunately I am unable to confirm the correctness of the algorithms.
>> For comparison, I implemented the L2-regularized logistic regression
>> algorithm (using the IRLS algorithm, p. 121) from the book
>> http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
>> . Unfortunately, the solutions don't match.
>>
>> For example, using the publicly available data (diabetes.csv) for
>> L2-regularized logistic regression (with lambda = 0.1) we get the
>> following solutions:
>>
>> MATLAB CODE (IRLS):
>>
>> w = 0.294293470805555
>> 0.550681766045083
>> 0.0396336870148899
>> 0.0641285712055971
>> 0.101238592147879
>> 0.261153541551578
>> 0.178686710290069
>>
>> b = -0.347396594061553
>>
>> MLLIB (SGD):
>> (weights=[0.352873922589,0.420391294105,0.0100571908041,0.150724951988,0.238536959009,0.220329295188,0.269139932714],
>> intercept=-0.00749988882664631)
>>
>> MLLIB (LBFGS):
>> (weights=[0.787850211605,1.964589985,-0.209348425939,0.0278848173986,0.12729017522,1.58954647312,0.692671824394],
>> intercept=-0.027401869113912316)
>>
>> All the code is attached to the email.
>>
>> Apparently the solutions are quite far from optimal (and even from each
>> other)! Can you please check with DB Tsai on the reasons for such
>> differences? Note that all the additional parameters are described in
>> the source code.
>>
>> Thanks,
>> Best regards,
>>
>> Sauptik Dhar, Ph.D.
>> CR/RTC1.3-NA

Sincerely,

DB Tsai
----------------------------------------------------------
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
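For reference, the IRLS baseline discussed in the original message can be
sketched in a few lines of NumPy. This is an illustrative reconstruction of
L2-regularized logistic regression fit by Newton/IRLS (ESL, p. 121), not the
MATLAB code attached to the thread; the function name and the choice to
leave the intercept unpenalized (per the ESL reference cited above) are
assumptions:

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def irls_logistic_l2(X, y, lam=0.1, n_iter=25, tol=1e-8):
        """Minimize -loglik + (lam/2)*||w||^2 by Newton/IRLS.
        X: (n, d) feature matrix, y: (n,) labels in {0, 1}.
        The intercept (column 0 below) is NOT penalized."""
        n, d = X.shape
        Xa = np.hstack([np.ones((n, 1)), X])   # prepend intercept column
        beta = np.zeros(d + 1)
        R = np.eye(d + 1)
        R[0, 0] = 0.0                          # exclude intercept from penalty
        for _ in range(n_iter):
            p = sigmoid(Xa @ beta)             # current probabilities
            grad = Xa.T @ (p - y) + lam * (R @ beta)
            W = p * (1.0 - p)                  # IRLS working weights
            H = (Xa.T * W) @ Xa + lam * R      # Hessian: X'WX + lam*R
            step = np.linalg.solve(H, grad)    # Newton step
            beta -= step
            if np.linalg.norm(step) < tol:
                break
        return beta[1:], beta[0]               # (weights w, intercept b)

Note that this objective is not averaged over the n examples, whereas
MLlib's regParam applies to an averaged loss, so lam here and regParam there
are generally not directly interchangeable.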