Re: FW: MLLIB (Spark) Question.

2015-06-16 Thread DB Tsai
+cc user@spark.apache.org

Reply inline.

On Tue, Jun 16, 2015 at 2:31 PM, Dhar Sauptik (CR/RTC1.3-NA)
Sauptik.Dhar wrote:
 Hi DB,

 Thank you for the reply. That explains a lot.

 I however had a few points regarding this:-

 1. Just to help with the debate on not regularizing the b parameter: a
 standard reference argues against regularizing the b parameter. See Pg.
 64, para. 1: http://statweb.stanford.edu/~tibs/ElemStatLearn/


Agreed. We were just worried that it would change existing behavior, but
we actually have a PR to change the behavior to the standard one:
https://github.com/apache/spark/pull/6386
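
For concreteness, the convention in question (a sketch, not Spark's exact
internal objective) leaves the intercept out of the penalty. For
L2-regularized logistic regression with labels y_i in {-1, +1}:

\[
\min_{w,\,b}\ \frac{1}{n}\sum_{i=1}^{n}\log\bigl(1+\exp(-y_i(w^\top x_i + b))\bigr)\;+\;\lambda\lVert w\rVert_2^2
\]

The penalty term covers only the feature weights w; the intercept b is
fitted freely.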

 2. Further, is the regularization of b also applied in the SGD
 implementation? Currently the SGD and LBFGS implementations give different
 results (and neither matches the IRLS algorithm). Are SGD and LBFGS
 implemented with different loss functions? Can you please share your
 thoughts on this?


In the SGD implementation, we don't standardize the dataset before
training. As a result, columns with a low standard deviation are
penalized more, and columns with a high standard deviation are
penalized less. Standardization also improves the rate of convergence.
For these reasons, most packages standardize the data implicitly,
compute the weights in the standardized space, and transform them back
to the original space, so the whole process is transparent to users.
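
As an illustration of that last step (a minimal sketch, not Spark's
internal code), weights fitted on features scaled as (x_j - mu_j) / sigma_j
can be mapped back to the original space like this:

// Hypothetical helper: map weights fitted in the standardized feature space
// back to the original space. Assumes each feature was scaled as
// (x_j - mu_j) / sigma_j; constant columns (sigma_j == 0) get weight 0.
def toOriginalSpace(
    weightsStd: Array[Double],
    interceptStd: Double,
    mu: Array[Double],
    sigma: Array[Double]): (Array[Double], Double) = {
  val weights = weightsStd.zip(sigma).map { case (w, s) =>
    if (s == 0.0) 0.0 else w / s
  }
  // The intercept absorbs the shift introduced by centering each column.
  val intercept = interceptStd - weights.zip(mu).map { case (w, m) => w * m }.sum
  (weights, intercept)
}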

1) LORWithSGD: no standardization, and the intercept is penalized.
2) LORWithLBFGS: with standardization, but the intercept is still penalized.
3) New LOR implementation: with standardization, and without penalizing the
intercept.

As a result, only the new implementation in Spark ML handles
everything correctly. We have tests to verify that the results match
R.


 @Naveen: Please feel free to add/comment on the above points as you see 
 necessary.

 Thanks,
 Sauptik.

 -Original Message-
 From: DB Tsai
 Sent: Tuesday, June 16, 2015 2:08 PM
 To: Ramakrishnan Naveen (CR/RTC1.3-NA)
 Cc: Dhar Sauptik (CR/RTC1.3-NA)
 Subject: Re: FW: MLLIB (Spark) Question.

 Hey,

 In the LORWithLBFGS API you use, the intercept is regularized, while
 the other implementations don't regularize the intercept. That's why you
 see the difference.

 The intercept should not be regularized, so we fixed this in the new Spark
 ML API in Spark 1.4. Since not regularizing the intercept in the old API
 would change its behavior, we are still debating whether to make that
 change there.

 See the following code for a full running example in Spark 1.4:
 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionExample.scala
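
 For a flavor of that example (a minimal sketch assuming a DataFrame named
 "training" with "label" and "features" columns; see the link above for the
 authoritative version):

 import org.apache.spark.ml.classification.LogisticRegression

 // Sketch of the new spark.ml API in Spark 1.4 (not the full example).
 val lr = new LogisticRegression()
   .setRegParam(0.1)        // overall regularization strength (lambda)
   .setElasticNetParam(0.0) // 0.0 = pure L2, 1.0 = pure L1
   .setMaxIter(100)
   .setFitIntercept(true)   // the intercept is fitted but not penalized

 val model = lr.fit(training)
 // In Spark 1.4 the fitted coefficients are exposed as "weights"
 // (renamed to "coefficients" in later releases).
 println(s"Weights: ${model.weights}  Intercept: ${model.intercept}")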

 Also check out my talk at Spark Summit:
 http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit


 Sincerely,

 DB Tsai
 --
 Blog: https://www.dbtsai.com
 PGP Key ID: 0xAF08DF8D


 On Mon, Jun 15, 2015 at 11:58 AM, Ramakrishnan Naveen (CR/RTC1.3-NA)
 Naveen.Ramakrishnan wrote:
 Hi DB,
 Hope you are doing well! One of my colleagues, Sauptik, is working with
 MLlib and the LBFGS-based logistic regression, and is having trouble
 reproducing the same results as Matlab. Please see below for
 details. I did take a look into this, but it seems there is also a discrepancy
 between the SGD and LBFGS logistic regression implementations in MLlib.
 We have attached all the code for your analysis – it’s in PySpark, though.
 Let us know if you have any questions or concerns. We would very much
 appreciate your help whenever you get a chance.

 Best,
 Naveen.

 _
 From: Dhar Sauptik (CR/RTC1.3-NA)
 Sent: Thursday, June 11, 2015 6:03 PM
 To: Ramakrishnan Naveen (CR/RTC1.3-NA)
 Subject: MLLIB (Spark) Question.


 Hi Naveen,

 I am writing this owing to some MLlib issues I found while using logistic
 regression. Basically, I am trying to test the stability of the L1/L2-regularized
 logistic regression using SGD and LBFGS. Unfortunately, I am unable to confirm
 the correctness of the algorithms. For comparison, I implemented the
 L2-regularized logistic regression algorithm (using the IRLS algorithm, Pg. 121) from the
 book http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf.
 Unfortunately, the solutions don’t match:-
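
 For reference, here is a minimal sketch in Scala/Breeze (not the attached
 MATLAB code) of a ridge-penalized IRLS/Newton update of the kind being
 compared against, assuming 0/1 labels, a trailing column of ones for the
 intercept, and no penalty on that intercept:

 import breeze.linalg.{DenseMatrix, DenseVector, diag}
 import breeze.numerics.sigmoid

 // Ridge-penalized IRLS (Newton) for binary logistic regression.
 def irls(x: DenseMatrix[Double], y: DenseVector[Double],
          lambda: Double, iters: Int = 25): DenseVector[Double] = {
   val d = x.cols
   var beta = DenseVector.zeros[Double](d)
   val r = DenseVector.ones[Double](d) // per-coefficient penalty indicator
   r(d - 1) = 0.0                      // do not penalize the intercept
   for (_ <- 0 until iters) {
     val p = sigmoid(x * beta)            // predicted probabilities
     val w = p.map(pi => pi * (1.0 - pi)) // IRLS weights p(1 - p)
     val hessian  = x.t * (diag(w) * x) + diag(r * lambda)
     val gradient = x.t * (y - p) - (r *:* beta) * lambda
     beta = beta + (hessian \ gradient)   // Newton step
   }
   beta
 }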

 For example:-

 Using the publicly available data (diabetes.csv) for L2-regularized logistic
 regression (with lambda = 0.1) we get:

 Solutions

 MATLAB CODE (IRLS):-

 w = 0.29429347080
 0.550681766045083
 0.0396336870148899
 0.0641285712055971
 0.101238592147879
 0.261153541551578
 0.178686710290069

 b=  -0.347396594061553


 MLLIB (SGD):-
 (weights=[0.352873922589,0.420391294105,0.0100571908041,0.150724951988,0.238536959009,0.220329295188,0.269139932714],
 intercept=-0.0074992664631)


 MLLIB(LBFGS):-
 (weights=[0.787850211605,1.964589985,-0.209348425939,0.0278848173986,0.12729017522,1.58954647312,0.692671824394

Re: FW: MLLIB (Spark) Question.

2015-06-16 Thread DB Tsai
Hi Dhar,

Regarding standardization, we can effectively disable it by using a
different regularization on each component. That way, we are solving the
same problem but with a better rate of convergence. This is one of the
features I will implement.
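
For intuition, the L2 case of that trick (a sketch of the equivalence, not
the exact implementation): if feature j is scaled by 1/\sigma_j, a
coefficient w'_j in the scaled space corresponds to w_j = w'_j/\sigma_j in
the original space, so

\[
\lambda\sum_j w_j^2 \;=\; \sum_j \frac{\lambda}{\sigma_j^{2}}\,(w'_j)^{2}.
\]

Training on the scaled features with per-component penalties
\lambda_j = \lambda/\sigma_j^2 therefore reaches the same optimum as the
unscaled problem, while keeping the conditioning benefits of standardization.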

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Tue, Jun 16, 2015 at 8:34 PM, Dhar Sauptik (CR/RTC1.3-NA)
sauptik.d...@us.bosch.com wrote:
 Hi DB,

 Thank you for the reply. The answers make sense. I have just one more
 point to add.

 Note that it may be better to not implicitly standardize the data. Agreed 
 that a number of algorithms benefit from such standardization, but for many 
 applications with contextual information such standardization may not be 
 desirable.
 Users can always perform the standardization themselves.

 However, that's just a suggestion. Again, thank you for the clarification.

 Thanks,
 Sauptik.

