[ https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291559#comment-17291559 ]

zhengruifeng commented on SPARK-34448:
--------------------------------------

1. I made a simple 
impl (https://github.com/apache/spark/pull/31657/commits/49141bbb178ac28af3263efa31299f8eb835830b) 
that internally centers the vectors, and the solution then looks correct:
{code:java}
Coefficients: [0.29886424895473795,0.20097637066670226,0.0081964409252861]
Intercept: -4.0089605363236664 {code}
Moreover, it converges much faster.

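That centering should leave the optimal coefficients unchanged can be sanity-checked outside Spark. The sketch below (plain numpy with my own illustrative names, not the PR's code) fits the same unregularized logistic model by Newton's method on raw and on centered features: since X·w + b is identical to (X - mu)·w + (b + w·mu), the exact optimum keeps the coefficients and only shifts the intercept by w·mu.

```python
import numpy as np

def fit_logistic_newton(X, y, iters=50):
    """Fit unregularized binary logistic regression by Newton's method (IRLS)."""
    n, d = X.shape
    Z = np.hstack([X, np.ones((n, 1))])         # append an intercept column
    theta = np.zeros(d + 1)                     # [coefficients..., intercept]
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(Z @ theta)))  # predicted probabilities
        grad = Z.T @ (p - y) / n
        W = p * (1.0 - p)                       # per-sample Hessian weights
        H = (Z * W[:, None]).T @ Z / n
        theta = theta - np.linalg.solve(H, grad)
    return theta

rng = np.random.default_rng(42)
n = 2000
X = rng.normal(loc=5.0, scale=1.0, size=(n, 2))   # deliberately non-centered
true_w, true_b = np.array([0.3, 0.2]), -2.5
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_w + true_b)))).astype(float)

mu = X.mean(axis=0)
theta_raw = fit_logistic_newton(X, y)
theta_cen = fit_logistic_newton(X - mu, y)

# At the exact optimum: identical coefficients, intercept shifted by w . mu.
same_coef = np.allclose(theta_raw[:-1], theta_cen[:-1], atol=1e-6)
shifted_b = np.isclose(theta_cen[-1], theta_raw[-1] + theta_raw[:-1] @ mu, atol=1e-6)
print(same_coef, shifted_b)
```

So if centering changes the fitted coefficients in practice, that points to the uncentered solver stopping short of the optimum rather than to a genuinely different solution.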
2. However, if we center the vectors, a lot of (>24) existing test suites 
fail. It seems the existing scaling was designed this way on purpose.

3. The existing scaling (x/std_x) was added in 
https://issues.apache.org/jira/browse/SPARK-7262 and aimed to keep in line 
with {{glmnet}}, but I am not familiar with {{glmnet}}.

In sklearn, {{linear_model.LogisticRegression}} does not standardize input 
vectors, while other linear models (e.g. {{linear_model.ElasticNet}}) will 
'subtract the mean and divide by the l2-norm'.

> Binary logistic regression incorrectly computes the intercept and 
> coefficients when data is not centered
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34448
>                 URL: https://issues.apache.org/jira/browse/SPARK-34448
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.4.5, 3.0.0
>            Reporter: Yakov Kerzhner
>            Priority: Major
>              Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the 
> bug, as well as the output of the code and some commentary:
> [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary 
> logistic regression contains a bug that pulls the intercept value towards the 
> log(odds) of the target data.  This is mathematically only correct when the 
> data comes from distributions with zero means.  In general, this gives 
> incorrect intercept values, and consequently incorrect coefficients as well.
> As I am not so familiar with the Spark code base, I have not been able to 
> find this bug within the Spark code itself.  A hint to this bug is here: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> Based on the code, I don't believe that the features have zero means at this 
> point, so this heuristic is incorrect.  But an incorrect starting point by 
> itself does not explain this bug: the minimizer should still drift to the 
> correct place.  I was not able to find the code of the actual objective 
> function that is being minimized.
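The initialization heuristic the report points at can be checked in isolation (a numpy sketch with my own names, not Spark code): holding all coefficients at zero, the intercept minimizing the mean logistic loss is exactly the label log-odds, no matter what the features are. That is why it is a sound starting point only when the features are (approximately) mean-centered.

```python
import numpy as np

y = np.array([0, 0, 0, 1, 1, 1, 1, 1], dtype=float)  # toy labels, 5/8 positive
p = y.mean()
b0 = np.log(p / (1.0 - p))   # the log-odds intercept initialization

# With all coefficients at zero, the mean logistic loss is
#   mean(log(1 + exp(b)) - y*b),
# whose derivative sigmoid(b) - mean(y) vanishes exactly at b0. The features
# never enter, so b0 is the optimum of the coefficient-free problem only, not
# of the full problem unless the features are centered.
sigmoid_b0 = 1.0 / (1.0 + np.exp(-b0))
print(np.isclose(sigmoid_b0, p))  # → True
```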



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
