[ 
https://issues.apache.org/jira/browse/SPARK-22879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16301801#comment-16301801
 ] 

Sean Owen commented on SPARK-22879:
-----------------------------------

Yes, these are algebraically the same but not numerically identical, due to 
roundoff. I'd argue the right answer is 'false' in your example, because that 
is the comparison against the value the user supplied. Yes, it should be 
consistent, but I don't know whether this can be avoided without dropping the 
'raw' comparison altogether, and that comparison exists for performance reasons.

In general this seems like an extreme corner case, but I see you're trying to 
exactly reproduce certain comparisons. What about rounding the result to your 
nearest test set value, to satisfy your particular use case?

Is there any way to keep the speedup (a win with no downside in almost all 
cases) without this behavior?

> LogisticRegression inconsistent prediction when proba == threshold
> ------------------------------------------------------------------
>
>                 Key: SPARK-22879
>                 URL: https://issues.apache.org/jira/browse/SPARK-22879
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 1.6.3
>            Reporter: Adrien Lavoillotte
>            Priority: Minor
>
> I'm using {{org.apache.spark.ml.classification.LogisticRegression}} for 
> binary classification.
> If I predict on a record that yields exactly the probability of the 
> threshold, then the result of {{transform}} is different depending on whether 
> the {{rawPredictionCol}} param is empty on the model or not.
> If it is empty, like most ML tools I've seen, it correctly predicts 0, the rule 
> being {{ if (proba > threshold) then 1 else 0 }} (implemented in 
> {{probability2prediction}}).
> If however {{rawPredictionCol}} is set (default), then it avoids 
> recomputation by calling {{raw2prediction}}, and the rule becomes {{if 
> (rawPrediction(1) > rawThreshold) 1 else 0}}. The {{rawThreshold = math.log(t 
> / (1.0 - t))}} is ever-so-slightly below the {{rawPrediction(1)}}, so it 
> predicts 1.
> The use case is that I choose the threshold amongst 
> {{BinaryClassificationMetrics#thresholds}}, so I get one that corresponds to 
> the probability for one or more of my test set's records. Re-scoring that 
> record or one that yields the same probability exhibits this behaviour.
> Tested this on Spark 1.6 but the code involved seems to be similar on Spark 
> 2.2.
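
The two rules quoted in the description can be compared outside Spark. The 
following is a minimal sketch, not Spark's actual code: the object and method 
names (ThresholdRoundoff, predictFromProbability, predictFromRaw, 
disagreements) and the sample margins are hypothetical, and it assumes the 
probability is the logistic function of the raw margin. It sets the threshold 
equal to each record's own probability, as in the reported use case, and 
counts the margins for which the two rules return different predictions.

```scala
object ThresholdRoundoff {
  // Logistic function: maps a raw margin to a probability.
  def sigmoid(margin: Double): Double = 1.0 / (1.0 + math.exp(-margin))

  // Probability-path rule from the description: if (proba > threshold) 1 else 0.
  def predictFromProbability(proba: Double, threshold: Double): Double =
    if (proba > threshold) 1.0 else 0.0

  // Raw-path rule from the description, with rawThreshold = log(t / (1 - t)).
  def predictFromRaw(margin: Double, threshold: Double): Double = {
    val rawThreshold = math.log(threshold / (1.0 - threshold))
    if (margin > rawThreshold) 1.0 else 0.0
  }

  // Returns the margins for which the two rules disagree when the threshold
  // is chosen to be exactly that record's probability.
  def disagreements(margins: Seq[Double]): Seq[Double] =
    margins.filter { m =>
      val t = sigmoid(m) // threshold equal to the record's own probability
      predictFromProbability(t, t) != predictFromRaw(m, t)
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample of raw margins from -3.00 to 3.00 in steps of 0.01.
    val margins = (-300 to 300).map(_ / 100.0)
    val bad = disagreements(margins)
    println(s"${bad.size} of ${margins.size} margins give different predictions")
  }
}
```

The probability path always returns 0 here, since proba > proba is never true. 
Any margin reported by disagreements is therefore a case where the 
log(t / (1 - t)) round trip landed slightly below the original margin, which 
is the behaviour described above.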



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
