[ https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291378#comment-17291378 ]

zhengruifeng edited comment on SPARK-34448 at 2/26/21, 4:29 AM:
----------------------------------------------------------------

[~srowen] [~weichenxu123]  [~ykerzhner]

My findings so far:

1, as to the param {{standardization}}: its name and doc are misleading. No matter 
whether it is true (the default) or false, LR always `standardizes` the input 
vectors in a special way (x => x / std(x)), but the transformed vectors are not 
centered;

2, for the Scala test suite above, I logged the internal gradient and model 
(intercept & coef) at each iteration. I checked the objective function and the 
gradient, and they seem to be calculated correctly;

3, for the case with {{const_feature}} (0.9 & 1.0) above, the mean & std of the 
three input features are:
{code:java}
featuresMean: [0.4999142959117828,1.4847274177074965,0.9899999976158129]
featuresStd: [0.28501348037270735,0.28375633081273305,0.03000002215257344]{code}
Note that {{const_feature}} (its std is 0.03) will be scaled to (30.0 & 33.3).
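
A minimal sketch of what this scale-only transformation does to {{const_feature}}, using the mean/std values from the log above, with classic centered standardization shown for contrast (plain Scala, illustration only):
{code:java}
// Values taken from the featuresMean/featuresStd logs above.
val x    = Array(0.9, 1.0)        // the two values const_feature takes
val mean = 0.9899999976158129
val std  = 0.03000002215257344

// What LR does internally: scale only, no centering.
val scaledOnly = x.map(_ / std)
println(scaledOnly.mkString(", "))   // ~30.0, ~33.3 -- large, far from 0

// Classic standardization: center first, then scale.
val centered = x.map(v => (v - mean) / std)
println(centered.mkString(", "))     // ~-3.0, ~0.33 -- small, around 0
{code}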

 

*I suspect that the underlying solvers (OWLQN/LBFGS/LBFGSB) cannot handle a 
feature with such large (>30) values.*

3.1, Since the std vector affects both the internal scaling and the 
regularization, I disabled regularization by setting {{regParam}} to 0.0 to see 
whether the scaling alone matters.

With the *LBFGS* solver the issue also exists; the solution with 
{{const_feature}} is:
{code:java}
Coefficients: [0.29713531586902586,0.1928976631256973,-0.44332696536594945]
Intercept: -3.548585606117963 {code}

Then I manually set the std vector to all ones:
{code:java}
val featuresStd = Array.fill(featuresMean.length)(1.0){code}
With that change the optimization procedure behaves as expected, and the 
solution is:
{code:java}
Coefficients: [0.298868144564205,0.20101389459979044,0.008381706578824933]
Intercept: -4.009204134794202 {code}
 

3.2, here I reset {{regParam}} to 0.5; with the *OWLQN* solver, the solution 
with an all-ones std vector is:
{code:java}
Coefficients: [0.296817926857017,0.19312282148846005,-0.17682584221569103]
Intercept: -3.8124413640824466 {code}
 

Compared to the previous solution (with the actual std vector):
{code:java}
Coefficients: [0.2997261304455311,0.18830032771483074,-0.44301560942213103]
Intercept: -3.5428941035683303 {code}
I think the new solution with the unit std vector fits better.
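
For reference, a sketch of the 3.2 setup (assumption: Spark ML picks the OWLQN solver internally when an L1 penalty is active, so {{elasticNetParam}} > 0 is used here to trigger it):
{code:java}
// Sketch: reuses `train` from above. OWLQN is selected internally by Spark
// when regParam > 0 and elasticNetParam > 0 (an L1 component is present).
val lrL1 = new LogisticRegression()
  .setRegParam(0.5)          // the regParam used in experiment 3.2
  .setElasticNetParam(1.0)   // pure L1 penalty
val modelL1 = lrL1.fit(train)
{code}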

 

To summarize, I guess the internal standardization should center the vectors in 
some way to match the existing solvers.

 

TODO:

1, I will refer to other implementations to see how standardization is done 
there;

2, I will continue on this issue to see what happens if the vectors are 
centered;

3, This issue may also exist in LiR/SVC/etc.; I will check in the future;

 

 


> Binary logistic regression incorrectly computes the intercept and 
> coefficients when data is not centered
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34448
>                 URL: https://issues.apache.org/jira/browse/SPARK-34448
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.4.5, 3.0.0
>            Reporter: Yakov Kerzhner
>            Priority: Major
>              Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the 
> bug, as well as the output of the code and some commentary:
> [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary 
> logistic regression contains a bug that pulls the intercept value towards the 
> log(odds) of the target data.  This is mathematically only correct when the 
> data comes from distributions with zero means.  In general, this gives 
> incorrect intercept values, and consequently incorrect coefficients as well.
> As I am not so familiar with the Spark code base, I have not been able to 
> find this bug within the Spark code itself.  A hint to this bug is here: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> Based on the code, I don't believe that the features have zero means at this 
> point, and so this heuristic is incorrect.  But an incorrect starting point 
> does not explain this bug.  The minimizer should drift to the correct place.  
> I was not able to find the code of the actual objective function that is 
> being minimized.


