Yakov Kerzhner created SPARK-34448:
--------------------------------------

             Summary: Under certain conditions the binary logistic regression 
incorrectly computes the intercept and coefficients
                 Key: SPARK-34448
                 URL: https://issues.apache.org/jira/browse/SPARK-34448
             Project: Spark
          Issue Type: Bug
          Components: ML, MLlib
    Affects Versions: 3.0.0, 2.4.5
            Reporter: Yakov Kerzhner


I have written up a fairly detailed gist that includes code to reproduce the 
bug, as well as the output of the code and some commentary:
[https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]

To summarize: under certain conditions, the minimization that fits a binary 
logistic regression pulls the intercept towards the log(odds) of the target 
data.  This is only mathematically correct when the features come from 
distributions with zero means.  In general, it gives incorrect intercept 
values, and consequently incorrect coefficients as well.

As I am not very familiar with the Spark code base, I have not been able to 
locate this bug within the Spark code itself.  A hint is here: 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
Based on the code, I don't believe that the features have zero means at this 
point, and so this heuristic is incorrect.  But an incorrect starting point 
alone does not explain the bug, since the minimizer should still drift to the 
correct place.  I was not able to find the code of the actual objective 
function that is being minimized.
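
To illustrate the point about the intercept and the log(odds), here is a 
minimal sketch in plain Python.  It is not Spark code and is not taken from 
the gist: the helper fit_logistic, the synthetic data, and the constants 
TRUE_INTERCEPT, TRUE_COEF and FEATURE_MEAN are all assumptions made up for 
this illustration.  It fits an unregularized one-feature logistic regression 
with Newton's method, once on a feature with a nonzero mean and once on the 
same feature after centering, and compares each fitted intercept with the 
log(odds) of the labels.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, iters=25):
    """Unregularized one-feature logistic regression via Newton's method.

    Hypothetical helper for illustration only; this is not Spark's optimizer.
    Returns (intercept, coefficient).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            # Gradient of the negative log-likelihood.
            g0 += p - y
            g1 += (p - y) * x
            # Hessian of the negative log-likelihood.
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton update: solve the 2x2 system H * step = gradient.
        b0 -= (h11 * g0 - h01 * g1) / det
        b1 -= (h00 * g1 - h01 * g0) / det
    return b0, b1


# Synthetic data; all constants below are assumptions for this sketch.
random.seed(0)
TRUE_INTERCEPT, TRUE_COEF, FEATURE_MEAN = -1.5, 0.5, 4.0
n = 5000
xs = [random.gauss(FEATURE_MEAN, 1.0) for _ in range(n)]
ys = [1.0 if random.random() < sigmoid(TRUE_INTERCEPT + TRUE_COEF * x) else 0.0
      for x in xs]

# log(odds) of the target data.
pos_rate = sum(ys) / n
log_odds = math.log(pos_rate / (1.0 - pos_rate))

# Fit on the raw feature (nonzero mean) and on the centered feature.
b0_raw, b1_raw = fit_logistic(xs, ys)
x_mean = sum(xs) / n
b0_centered, b1_centered = fit_logistic([x - x_mean for x in xs], ys)

print("log(odds) of labels       :", log_odds)
print("intercept, raw feature    :", b0_raw)
print("intercept, centered feat. :", b0_centered)
print("coefficient, raw/centered :", b1_raw, b1_centered)
```

In this sketch the intercept fitted on the raw feature should land near 
TRUE_INTERCEPT and far from the log(odds), while the intercept fitted on the 
centered feature should land close to the log(odds).  A fit that is pulled 
towards the log(odds) is therefore only approximately reasonable when the 
features have zero mean.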



