[ https://issues.apache.org/jira/browse/SPARK-22879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16302809#comment-16302809 ]

Adrien Lavoillotte commented on SPARK-22879:
--------------------------------------------

Computing the probability from the raw prediction and comparing it to the 
threshold every time would solve this issue in all cases. It is a bit more 
computationally expensive if you left {{rawPredictionCol}} set but unset 
{{probabilityCol}}.
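To make the inconsistency concrete, here is a minimal standalone sketch (plain Python, not Spark's actual code; the two helpers below only mirror the decision rules of {{probability2prediction}} and {{raw2prediction}}). When the threshold is set to exactly a record's probability, the round-trip {{math.log(t / (1.0 - t))}} can land just below the original margin, so the two rules disagree:

```python
import math

def probability2prediction(proba1, threshold):
    # Mirror of the probability-based rule: predict 1 only if the
    # class-1 probability strictly exceeds the threshold.
    return 1 if proba1 > threshold else 0

def raw2prediction(margin, raw_threshold):
    # Mirror of the raw-score rule: compare the margin (rawPrediction(1))
    # against rawThreshold = logit(threshold).
    return 1 if margin > raw_threshold else 0

def find_disagreement(steps=2000):
    # Scan margins; for each, set the threshold to exactly the record's
    # probability (as happens when picking a threshold from
    # BinaryClassificationMetrics#thresholds) and compare both rules.
    for i in range(1, steps):
        margin = i / 1000.0                       # rawPrediction(1)
        proba = 1.0 / (1.0 + math.exp(-margin))   # sigmoid(margin)
        threshold = proba                         # threshold == proba exactly
        raw_threshold = math.log(threshold / (1.0 - threshold))
        if probability2prediction(proba, threshold) != raw2prediction(margin, raw_threshold):
            return margin, raw_threshold
    return None
```

{{find_disagreement()}} returns the first margin at which the two rules disagree (the probability rule predicts 0, the raw rule predicts 1), or {{None}} if the scan finds no such case.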

A middle ground would be to switch the order of the tests: use the 
probability column first, and fall back to the raw prediction column only if 
the probability column is unset. This would solve the issue in most cases 
(including the default case, IIRC) with virtually no drawbacks, except that 
one corner case would still exhibit the bug (leaving the raw predictions in 
but removing the probabilities). 
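The reordered check could look roughly like this (a hypothetical sketch: the function name and return values are illustrative stand-ins for the dispatch inside Spark's {{transform}}, not its actual internals):

```python
def prediction_source(probability_col, raw_prediction_col):
    # Proposed ordering: prefer the probability column when it is set,
    # so the consistent "proba > threshold" rule is used whenever possible.
    if probability_col:
        return "probability2prediction"   # consistent rule: proba > threshold
    if raw_prediction_col:
        return "raw2prediction"           # remaining corner case: raw-margin rule
    return "predict"                      # neither set: recompute from features
```

With Spark's defaults (both columns set), this picks the probability-based rule; only the raw-predictions-without-probabilities configuration would still hit the raw-margin comparison.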

> LogisticRegression inconsistent prediction when proba == threshold
> ------------------------------------------------------------------
>
>                 Key: SPARK-22879
>                 URL: https://issues.apache.org/jira/browse/SPARK-22879
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 1.6.3
>            Reporter: Adrien Lavoillotte
>            Priority: Minor
>
> I'm using {{org.apache.spark.ml.classification.LogisticRegression}} for 
> binary classification.
> If I predict on a record that yields exactly the probability of the 
> threshold, then the result of {{transform}} is different depending on whether 
> the {{rawPredictionCol}} param is empty on the model or not.
> If it is empty then, like most ML tools I've seen, it correctly predicts 0, 
> the rule being {{ if (proba > threshold) then 1 else 0 }} (implemented in 
> {{probability2prediction}}).
> If however {{rawPredictionCol}} is set (the default), then it avoids 
> recomputation by calling {{raw2prediction}}, and the rule becomes {{if 
> (rawPrediction(1) > rawThreshold) 1 else 0}}. Due to floating-point rounding, 
> the {{rawThreshold = math.log(t / (1.0 - t))}} ends up ever-so-slightly below 
> {{rawPrediction(1)}}, so it predicts 1.
> The use case is that I choose the threshold amongst 
> {{BinaryClassificationMetrics#thresholds}}, so I get one that corresponds to 
> the probability for one or more of my test set's records. Re-scoring that 
> record or one that yields the same probability exhibits this behaviour.
> I tested this on Spark 1.6, but the code involved seems to be similar in 
> Spark 2.2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
