[jira] [Commented] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered

2021-03-01 Thread Yakov Kerzhner (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17293027#comment-17293027
 ] 

Yakov Kerzhner commented on SPARK-34448:


I will try to do a code review, but will focus on the comments so that people 
who see this in the future will understand what is happening.

> Binary logistic regression incorrectly computes the intercept and 
> coefficients when data is not centered
> 
>
> Key: SPARK-34448
> URL: https://issues.apache.org/jira/browse/SPARK-34448
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yakov Kerzhner
>Priority: Major
>  Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the 
> bug, as well as the output of the code and some commentary:
> [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary 
> logistic regression contains a bug that pulls the intercept value towards the 
> log(odds) of the target data.  This is mathematically only correct when the 
> data comes from distributions with zero means.  In general, this gives 
> incorrect intercept values, and consequently incorrect coefficients as well.
> As I am not very familiar with the Spark code base, I have not been able to 
> find this bug within the Spark code itself.  A hint to this bug is here: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> Based on the code, I don't believe that the features have zero means at this 
> point, and so this heuristic is incorrect.  But an incorrect starting point 
> does not explain this bug.  The minimizer should drift to the correct place.  
> I was not able to find the code of the actual objective function that is 
> being minimized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered

2021-03-01 Thread Yakov Kerzhner (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17293017#comment-17293017
 ] 

Yakov Kerzhner commented on SPARK-34448:


I took a look over the weekend.  It seems good, and roughly matches what I did 
in my test example, where I centered the data before fitting.  Unfortunately, 
I am not well versed in Scala, so actually reviewing the code is a bit hard.  
I appreciate the printouts for the test case in the PR, and I now understand 
why Spark was returning the log(odds) for the intercept.  Dividing a 
non-centered vector by a small standard deviation creates a vector with very 
large entries that looks roughly like a constant vector.  When the minimizer 
computes the gradient, it assigns far more weight to this big vector than to 
the intercept, because the magnitude matters more than the fact that the 
vector isn't exactly constant.  When the optimizer then moves in the direction 
given by the gradient, it finds that the value of the objective function 
actually increased (because the big vector isn't exactly constant), and it 
backtracks several times.  By the time it has backtracked enough to get a 
lower value of the objective function, the movement of the intercept is nearly 
0.  So essentially, the intercept never moves during the entire calibration.  
This is also why the fit takes so much longer (because of all the 
backtracking).  Once the data is centered, the gradient entries for the 
intercept become dominant compared to those for the nearly constant vector, so 
the minimizer begins adjusting the intercept and moves it to the correct spot.
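
A minimal sketch of the gradient imbalance described above, in plain Python.  
Everything in it is an assumption made for illustration: the toy data, the 
one-feature model, the evaluation point (w = 0, b = 0) and the helper names 
are hypothetical and are not taken from the Spark code or from the gist.  It 
only compares the size of the coefficient entry of the gradient with the size 
of the intercept entry, once for a feature that is divided by its standard 
deviation without centering and once for a fully standardized feature.

```python
import math
from statistics import pstdev


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def gradient(w, b, xs, ys):
    # Gradient (d/dw, d/db) of the mean log-loss for p = sigmoid(w*x + b).
    residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = sum(r * x for r, x in zip(residuals, xs)) / len(xs)
    grad_b = sum(residuals) / len(xs)
    return grad_w, grad_b


# Hypothetical toy feature: large mean, small spread.
raw = [99.8, 99.9, 100.0, 100.1, 100.2, 99.9, 100.1, 100.0]
ys = [0, 1, 0, 0, 1, 0, 0, 0]

mu = sum(raw) / len(raw)
sd = pstdev(raw)
scaled_only = [x / sd for x in raw]          # divided by std dev, not centered
standardized = [(x - mu) / sd for x in raw]  # centered, then divided


def weight_to_intercept_ratio(xs):
    # |d/dw| / |d/db| at the illustrative point w = 0, b = 0.
    grad_w, grad_b = gradient(0.0, 0.0, xs, ys)
    return abs(grad_w) / abs(grad_b)


ratio_scaled_only = weight_to_intercept_ratio(scaled_only)
ratio_standardized = weight_to_intercept_ratio(standardized)
print("scaled only :", ratio_scaled_only)
print("standardized:", ratio_standardized)
```

On this toy data the coefficient entry is hundreds of times larger than the 
intercept entry when the feature is only scaled, and smaller than the 
intercept entry once the feature is centered.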




[jira] [Commented] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered

2021-02-26 Thread Yakov Kerzhner (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291757#comment-17291757
 ] 

Yakov Kerzhner commented on SPARK-34448:


The faster convergence when using standardization that includes centering makes 
sense, as you can guess the value of the intercept ahead of time (it should 
equal the log(odds)).  What I still don't understand is why, when the data is 
not centered, the intercept after the minimization is almost exactly equal to 
the log(odds).  This seems extremely strange to me, and I can't find a 
mathematical reason for it.  In the original test example, could you print out 
x, f(x), and grad(x) as the minimizer moves from (0, 0, 0, log(odds)) to the 
minimum?
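
The kind of trace requested above can be sketched as follows.  This is an 
illustration under stated assumptions, not Spark's optimizer: it uses plain 
gradient descent with a backtracking (Armijo) line search on a one-feature 
model, and the toy data, the function names and the constants (initial step 
1.0, Armijo factor 1e-4, cap of 60 halvings) are all hypothetical.  It starts 
from (0, log(odds)) and records the point, f, the gradient and the number of 
halvings at each iteration.

```python
import math
from statistics import pstdev


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def loss_and_grad(w, b, xs, ys):
    # Mean log-loss and its gradient for p = sigmoid(w*x + b).
    n = len(xs)
    loss = gw = gb = 0.0
    for x, y in zip(xs, ys):
        t = w * x + b
        # log(1 + exp(t)) - y*t, written so that exp never overflows.
        loss += max(t, 0.0) + math.log1p(math.exp(-abs(t))) - y * t
        r = sigmoid(t) - y
        gw += r * x
        gb += r
    return loss / n, gw / n, gb / n


def trace_descent(xs, ys, w, b, iters=5):
    # Returns one row (w, b, f, grad_w, grad_b, halvings) per iteration.
    rows = []
    for _ in range(iters):
        f, gw, gb = loss_and_grad(w, b, xs, ys)
        step, halvings = 1.0, 0
        while True:
            nw, nb = w - step * gw, b - step * gb
            nf, _, _ = loss_and_grad(nw, nb, xs, ys)
            # Armijo sufficient-decrease test; give up after 60 halvings.
            if nf <= f - 1e-4 * step * (gw * gw + gb * gb) or halvings >= 60:
                break
            step *= 0.5
            halvings += 1
        rows.append((w, b, f, gw, gb, halvings))
        w, b = nw, nb
    return rows


# Hypothetical toy feature: large mean, small spread.
raw = [99.8, 99.9, 100.0, 100.1, 100.2, 99.9, 100.1, 100.0]
ys = [0, 0, 0, 1, 1, 0, 1, 0]
mu, sd = sum(raw) / len(raw), pstdev(raw)
p = sum(ys) / len(ys)
log_odds = math.log(p / (1.0 - p))

uncentered_trace = trace_descent([x / sd for x in raw], ys, 0.0, log_odds)
centered_trace = trace_descent([(x - mu) / sd for x in raw], ys, 0.0, log_odds)

for name, rows in (("uncentered", uncentered_trace), ("centered", centered_trace)):
    for w, b, f, gw, gb, halvings in rows:
        print(name, "x=", (w, b), "f=", f, "grad=", (gw, gb), "halvings=", halvings)
```

On this toy data the first step needs many halvings when the feature is not 
centered and none when it is centered.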




[jira] [Comment Edited] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered

2021-02-23 Thread Yakov Kerzhner (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289113#comment-17289113
 ] 

Yakov Kerzhner edited comment on SPARK-34448 at 2/23/21, 2:54 PM:
--

As I said in the description, I do not believe that the starting point should 
cause this bug; the minimizer should still drift to the proper minimum.  What I 
said is that using the log(odds) as the starting point suggests that whoever 
wrote the code believed that the intercept should be close to the log(odds), 
which is only true if the data is centered.  If I had to guess, I would guess 
that there is something in the objective function that pulls the intercept 
towards the log(odds).  This would be a bug, as the log(odds) is a good 
approximation for the intercept if and only if the data is centered.  For 
non-centered data, it is completely wrong to have the intercept equal (or be 
close to) the log(odds).  My test shows precisely this: when the data is not 
centered, Spark still returns an intercept equal to the log(odds) (test 2.b, 
Intercept: -3.5428941035683303, log(odds): -3.542495168380248, correct 
intercept: -4).  Indeed, even for centered data (test 1.b), it returns an 
intercept almost equal to the log(odds) (test 1.b, log(odds): 
-3.9876303002978997, Intercept: -3.987260922443554, correct intercept: -4).  So 
we need to dig into the objective function and check whether somewhere in there 
is a term that penalizes the intercept for moving away from the log(odds).  If 
there is nothing of this sort, then stepping through the minimization process 
should give some clues as to why the intercept isn't budging from the initial 
value given.
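
The claim that the log(odds) approximates the intercept only for centered data 
can be motivated from the first-order condition on the intercept.  The 
notation below is introduced here for illustration and does not come from the 
Spark code: L is the mean log-loss without a penalty on the intercept b, w is 
the coefficient vector, sigma is the logistic function, and bars denote sample 
means.  The last line is only a first-order approximation.

```latex
\begin{align*}
\frac{\partial L}{\partial b}
  &= \frac{1}{n}\sum_{i=1}^{n}\bigl(\sigma(w^\top x_i + b) - y_i\bigr) = 0
  \quad\Longrightarrow\quad
  \frac{1}{n}\sum_{i=1}^{n}\sigma(w^\top x_i + b) = \bar{y}, \\
w = 0
  &\quad\Longrightarrow\quad
  b = \log\frac{\bar{y}}{1-\bar{y}}, \\
w \neq 0
  &\quad\Longrightarrow\quad
  \sigma(w^\top \bar{x} + b) \approx \bar{y}
  \quad\Longrightarrow\quad
  b \approx \log\frac{\bar{y}}{1-\bar{y}} - w^\top \bar{x}.
\end{align*}
```

Under this approximation the intercept is close to the log(odds) when the 
feature means are zero, and is shifted by w^T x-bar otherwise.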





[jira] [Commented] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered

2021-02-23 Thread Yakov Kerzhner (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289113#comment-17289113
 ] 

Yakov Kerzhner commented on SPARK-34448:


As I said in the description, I do not believe that the starting point should 
cause this bug; the minimizer should still drift to the proper minimum.  What I 
said is that using the log(odds) as the starting point suggests that whoever 
wrote the code believed that the intercept should be close to the log(odds), 
which is only true if the data is centered.  If I had to guess, I would guess 
that there is something in the objective function that pulls the intercept 
towards the log(odds).  This would be a bug, as the log(odds) is a good 
approximation for the intercept if and only if the data is centered.  For 
non-centered data, it is completely wrong to have the intercept equal (or be 
close to) the log(odds).  My test shows precisely this: when the data is not 
centered, Spark still returns an intercept equal to the log(odds) (test 2.b, 
Intercept: -3.5428941035683303, log(odds): -3.542495168380248, correct 
intercept: -4).  Indeed, even for centered data (test 1.b), it returns an 
intercept almost equal to the log(odds) (test 1.b, log(odds): 
-3.9876303002978997, Intercept: -3.987260922443554, correct intercept: -4).  So 
we need to dig into the objective function and check whether somewhere in there 
is a term that penalizes the intercept for moving away from the log(odds).




[jira] [Updated] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered

2021-02-17 Thread Yakov Kerzhner (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yakov Kerzhner updated SPARK-34448:
---
Summary: Binary logistic regression incorrectly computes the intercept and 
coefficients when data is not centered  (was: Under certain conditions the 
binary logistic regression incorrectly computes the intercept and coefficients)

> Binary logistic regression incorrectly computes the intercept and 
> coefficients when data is not centered
> 
>
> Key: SPARK-34448
> URL: https://issues.apache.org/jira/browse/SPARK-34448
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yakov Kerzhner
>Priority: Critical
>  Labels: correctness



[jira] [Updated] (SPARK-34448) Under certain conditions the binary logistic regression incorrectly computes the intercept and coefficients

2021-02-16 Thread Yakov Kerzhner (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yakov Kerzhner updated SPARK-34448:
---
Labels: correctness  (was: )

> Under certain conditions the binary logistic regression incorrectly computes 
> the intercept and coefficients
> ---
>
> Key: SPARK-34448
> URL: https://issues.apache.org/jira/browse/SPARK-34448
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yakov Kerzhner
>Priority: Critical
>  Labels: correctness



[jira] [Created] (SPARK-34448) Under certain conditions the binary logistic regression incorrectly computes the intercept and coefficients

2021-02-16 Thread Yakov Kerzhner (Jira)
Yakov Kerzhner created SPARK-34448:
--

 Summary: Under certain conditions the binary logistic regression 
incorrectly computes the intercept and coefficients
 Key: SPARK-34448
 URL: https://issues.apache.org/jira/browse/SPARK-34448
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 3.0.0, 2.4.5
Reporter: Yakov Kerzhner


I have written up a fairly detailed gist that includes code to reproduce the 
bug, as well as the output of the code and some commentary:
[https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
To summarize: under certain conditions, the minimization that fits a binary 
logistic regression contains a bug that pulls the intercept value towards the 
log(odds) of the target data.  This is mathematically only correct when the 
data comes from distributions with zero means.  In general, this gives 
incorrect intercept values, and consequently incorrect coefficients as well.
As I am not very familiar with the Spark code base, I have not been able to 
find this bug within the Spark code itself.  A hint to this bug is here: 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
Based on the code, I don't believe that the features have zero means at this 
point, and so this heuristic is incorrect.  But an incorrect starting point 
does not explain this bug.  The minimizer should drift to the correct place.  I 
was not able to find the code of the actual objective function that is being 
minimized.
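
To make the zero-mean condition concrete, here is a small sketch in plain 
Python.  It is not the code from the gist, and the feature values, the 
coefficient of 1.0 and the helper name are hypothetical; only the true 
intercept of -4 mirrors the tests.  It computes the log(odds) of the expected 
positive rate for a known model, once with a feature whose mean is zero and 
once with the same feature shifted by 0.5.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def marginal_log_odds(xs, intercept, coef):
    # log(odds) of the average probability P(y = 1) over the given feature values.
    p = sum(sigmoid(intercept + coef * x) for x in xs) / len(xs)
    return math.log(p / (1.0 - p))


true_intercept, true_coef = -4.0, 1.0      # assumed ground-truth model
centered = [-0.2, -0.1, 0.0, 0.1, 0.2]     # feature with zero mean
shifted = [x + 0.5 for x in centered]      # same feature, mean 0.5

log_odds_centered = marginal_log_odds(centered, true_intercept, true_coef)
log_odds_shifted = marginal_log_odds(shifted, true_intercept, true_coef)
print("centered:", log_odds_centered)
print("shifted :", log_odds_shifted)
```

With the zero-mean feature the log(odds) stays close to the true intercept of 
-4; with the shifted feature it moves to roughly -3.5, so an intercept equal to 
the log(odds) would be off by about the coefficient times the feature mean.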


