[ https://issues.apache.org/jira/browse/SPARK-22879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16302809#comment-16302809 ]
Adrien Lavoillotte commented on SPARK-22879:
--------------------------------------------

Comparing the probability every time (computed from the raw prediction) would solve this issue in all cases. It is a bit more computationally expensive if you left {{rawPredictionCol}} set but unset {{probabilityCol}}.

A middle ground would be to switch the order of the tests: use the probability column first, and fall back to the raw prediction column only if the probability column is unset. This would solve the issue in most cases (including the default case, IIRC) with virtually no drawbacks, except that one corner case would still exhibit the bug (if you leave in the raw predictions but remove the probabilities).

> LogisticRegression inconsistent prediction when proba == threshold
> ------------------------------------------------------------------
>
>                 Key: SPARK-22879
>                 URL: https://issues.apache.org/jira/browse/SPARK-22879
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 1.6.3
>            Reporter: Adrien Lavoillotte
>            Priority: Minor
>
> I'm using {{org.apache.spark.ml.classification.LogisticRegression}} for
> binary classification.
> If I predict on a record that yields exactly the probability of the
> threshold, then the result of {{transform}} is different depending on whether
> the {{rawPredictionCol}} param is empty on the model or not.
> If it is empty, like most ML tools I've seen, it correctly predicts 0, the rule
> being {{if (proba > threshold) then 1 else 0}} (implemented in
> {{probability2prediction}}).
> If however {{rawPredictionCol}} is set (default), then it avoids
> recomputation by calling {{raw2prediction}}, and the rule becomes {{if
> (rawPrediction(1) > rawThreshold) 1 else 0}}. The {{rawThreshold = math.log(t
> / (1.0 - t))}} is ever-so-slightly below the {{rawPrediction(1)}}, so it
> predicts 1.
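
To make the mismatch concrete, here is a minimal standalone sketch in Scala. It is not Spark's actual code: the object name {{ThresholdTieSketch}} and all of its functions are hypothetical and only mirror the two rules quoted above, plus the probability-first ordering proposed in the comment. It scans a grid of margins, sets the threshold to the exact probability each margin yields, and counts how often the two rules disagree. Whether and where they disagree depends on floating-point rounding, so no particular count is claimed here.

```scala
// Hypothetical standalone sketch; names are illustrative, not Spark's API.
object ThresholdTieSketch {

  /** Logistic function mapping a raw margin to a probability. */
  def sigmoid(margin: Double): Double = 1.0 / (1.0 + math.exp(-margin))

  /** Rule on the probability: only strictly above the threshold predicts 1. */
  def predictFromProbability(probability: Double, threshold: Double): Int =
    if (probability > threshold) 1 else 0

  /** Rule on the raw margin, with the threshold mapped to log-odds first. */
  def predictFromRaw(margin: Double, threshold: Double): Int = {
    val rawThreshold = math.log(threshold / (1.0 - threshold))
    if (margin > rawThreshold) 1 else 0
  }

  /**
   * Ordering proposed in the comment: use the probability when it is
   * available (Some), and fall back to the raw margin only when it is not.
   */
  def predictProbabilityFirst(
      margin: Double,
      probability: Option[Double],
      threshold: Double): Int =
    probability match {
      case Some(p) => predictFromProbability(p, threshold)
      case None    => predictFromRaw(margin, threshold)
    }

  def main(args: Array[String]): Unit = {
    // Margins from -4.00 to 4.00 in steps of 0.01 (assumed grid for illustration).
    val margins = (-400 to 400).map(_ / 100.0)

    // For each margin, pick the threshold equal to its own probability,
    // as happens when the threshold comes from scored test records.
    val disagreements = margins.filter { m =>
      val t = sigmoid(m)
      predictFromProbability(t, t) != predictFromRaw(m, t)
    }

    println(s"${disagreements.size} of ${margins.size} margins disagree " +
      "when threshold == probability")
    disagreements.take(5).foreach { m =>
      val t = sigmoid(m)
      println(s"margin=$m threshold=$t rawThreshold=${math.log(t / (1.0 - t))}")
    }
  }
}
```

The probability rule always returns 0 on an exact tie, since {{p > p}} is false. Any margin listed by the sketch is one where the round trip through {{math.log(t / (1.0 - t))}} landed just below the original margin, which is the situation described in the issue.
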
> The use case is that I choose the threshold amongst
> {{BinaryClassificationMetrics#thresholds}}, so I get one that corresponds to
> the probability for one or more of my test set's records. Re-scoring that
> record or one that yields the same probability exhibits this behaviour.
> Tested this on Spark 1.6 but the code involved seems to be similar on Spark
> 2.2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org