[ https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen updated SPARK-34448:
---------------------------------
    Priority: Major  (was: Critical)

> Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34448
>                 URL: https://issues.apache.org/jira/browse/SPARK-34448
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.4.5, 3.0.0
>            Reporter: Yakov Kerzhner
>            Priority: Major
>              Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the
> bug, as well as the output of the code and some commentary:
> [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
>
> To summarize: under certain conditions, the minimization that fits a binary
> logistic regression contains a bug that pulls the intercept value towards the
> log(odds) of the target data. This is mathematically only correct when the
> data comes from distributions with zero means. In general, this gives
> incorrect intercept values, and consequently incorrect coefficients as well.
>
> As I am not so familiar with the spark code base, I have not been able to
> find this bug within the spark code itself. A hint to this bug is here:
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> based on the code, I don't believe that the features have zero means at this
> point, and so this heuristic is incorrect. But an incorrect starting point
> does not explain this bug. The minimizer should drift to the correct place.
> I was not able to find the code of the actual objective function that is
> being minimized.
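
To illustrate the claim in the report that the log(odds) of the labels is only a sensible intercept when the features are centered, here is a minimal standalone sketch. It is not Spark code and does not reproduce the reported bug. It fits an unregularized one-feature logistic regression with a plain Newton iteration (an assumption made for brevity, not Spark's optimizer) on synthetic data. All names (`fit_logistic`, `TRUE_B`, `TRUE_W`) and the data parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, iters=25):
    """Unregularized one-feature logistic regression via Newton's method.

    Returns (intercept, coefficient). Illustrative only; not Spark's solver.
    """
    b = w = 0.0
    for _ in range(iters):
        g_b = g_w = h_bb = h_bw = h_ww = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b + w * x)
            r = p - y            # per-sample gradient factor
            s = p * (1.0 - p)    # per-sample Hessian weight
            g_b += r
            g_w += r * x
            h_bb += s
            h_bw += s * x
            h_ww += s * x * x
        # Newton step: solve the 2x2 system H * step = g in closed form.
        det = h_bb * h_ww - h_bw * h_bw
        b -= (h_ww * g_b - h_bw * g_w) / det
        w -= (h_bb * g_w - h_bw * g_b) / det
    return b, w


# Synthetic data whose single feature has a clearly non-zero mean (assumed values).
rng = random.Random(0)
TRUE_B, TRUE_W = -4.0, 1.0
xs = [rng.gauss(5.0, 1.0) for _ in range(20000)]
ys = [1.0 if rng.random() < sigmoid(TRUE_B + TRUE_W * x) else 0.0 for x in xs]

mean_x = sum(xs) / len(xs)
pos_rate = sum(ys) / len(ys)
log_odds = math.log(pos_rate / (1.0 - pos_rate))

b_raw, w_raw = fit_logistic(xs, ys)                          # uncentered feature
b_cen, w_cen = fit_logistic([x - mean_x for x in xs], ys)    # centered feature

print("log(odds) of labels:      ", log_odds)
print("intercept, raw feature:   ", b_raw)
print("intercept, centered:      ", b_cen)
print("b_raw + w_raw * mean(x):  ", b_raw + w_raw * mean_x)
```

On data like this, the intercept fitted on the raw feature should sit near the true value and far from the label log(odds), while the intercept fitted on the centered feature should sit much closer to the log(odds) without being exactly equal to it. The two fits describe the same model: the coefficient is unchanged and the intercepts differ by `w * mean(x)`. This is why a log(odds) starting value is a reasonable guess only in centered feature space, and why a correct minimizer should still move away from it on uncentered data.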