[ https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291559#comment-17291559 ]
zhengruifeng commented on SPARK-34448:
--------------------------------------

1. I just made a simple implementation (https://github.com/apache/spark/pull/31657/commits/49141bbb178ac28af3263efa31299f8eb835830b) that internally centers the vectors, and the solution then looks correct:
{code:java}
Coefficients: [0.29886424895473795,0.20097637066670226,0.0081964409252861]
Intercept: -4.0089605363236664
{code}
Moreover, it converges much faster.

2. However, if we center the vectors, many (>24) existing test suites fail. The existing scaling appears to be intentional.

3. The existing scaling (x/std_x) was added in https://issues.apache.org/jira/browse/SPARK-7262 and aimed to keep in line with {{glmnet}}, but I am not familiar with {{glmnet}}. In sklearn, {{linear_model.LogisticRegression}} does not standardize input vectors, while some other linear models (e.g. {{linear_model.ElasticNet}}) standardize the input by 'subtracting the mean and dividing by the l2-norm'.

> Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered
> --------------------------------------------------------------------------------------------------------
>
>          Key: SPARK-34448
>          URL: https://issues.apache.org/jira/browse/SPARK-34448
>      Project: Spark
>   Issue Type: Bug
>   Components: ML, MLlib
> Affects Versions: 2.4.5, 3.0.0
>     Reporter: Yakov Kerzhner
>     Priority: Major
>       Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the bug, as well as the output of the code and some commentary:
> [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary logistic regression contains a bug that pulls the intercept value towards the log(odds) of the target data. This is mathematically correct only when the data comes from distributions with zero means. In general, this gives incorrect intercept values, and consequently incorrect coefficients as well.
> As I am not so familiar with the Spark code base, I have not been able to find this bug within the Spark code itself. A hint to this bug is here:
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> Based on the code, I don't believe that the features have zero means at this point, so this heuristic is incorrect. But an incorrect starting point alone does not explain this bug; the minimizer should still drift to the correct place. I was not able to find the code of the actual objective function that is being minimized.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
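The centering trick mentioned in point 1 of the comment can be illustrated outside Spark. Below is a minimal pure-Python sketch (not Spark code; the toy data and the `fit_logistic` helper are hypothetical, invented for illustration): fit a logistic regression on mean-centered features, which is better conditioned, then shift the intercept back by w·mean so that predictions in the original, uncentered feature space are unchanged.

```python
import math

def fit_logistic(xs, ys, lr=0.1, iters=5000):
    """Plain gradient-descent fit of (w, b) for P(y=1|x) = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Toy data whose single feature has a large non-zero mean
# (hypothetical values, not taken from the JIRA gist).
xs = [9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

# Fit on mean-centered features (well conditioned, converges quickly)...
mean = sum(xs) / len(xs)
w, b_centered = fit_logistic([x - mean for x in xs], ys)

# ...then shift the intercept back to the original feature space:
# w*(x - mean) + b_centered == w*x + (b_centered - w*mean) for every x.
b = b_centered - w * mean
print("coefficient:", w)
print("intercept:  ", b)
```

This sketches why the reparameterization is safe: the model in centered coordinates and the shifted model in original coordinates produce identical predictions, even though the centered-space intercept sits near the log(odds) while the original-space intercept does not.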