Github user MLnick commented on the issue: https://github.com/apache/spark/pull/14949

The original JIRA [SPARK-8069](https://issues.apache.org/jira/browse/SPARK-8069) refers to https://cran.r-project.org/web/packages/randomForest/randomForest.pdf. That R package calls it "cutoff", though it does indeed seem to act more like a "weight" or "scaling". I can't say I've come across it before, and this appears to be the only package that does it this way (at least that I've been able to find from some quick searching). I haven't found any theoretical background for it either. In any case, now that we have it, I think it's probably best to keep it as is.

However, it appears that our implementation here is flawed: in the original R code, the `cutoff` vector must be > 0 everywhere and its sum must lie in (0, 1] -- see https://github.com/cran/randomForest/blob/9208176df98d561aba6dae239472be8b124e2631/R/predict.randomForest.R#L47. If we're going to base something on another impl, we should probably actually follow it. So:

* If `sum(thresholds)` > 1 or < 0, throw an error
* If any entry in `thresholds` is not > 0, throw an error

I believe this takes care of the edge cases, since no threshold can be `0` or `1`. The tie-breaking element is taken care of by `Vector.argmax` (if p/t is the same for 2 or more classes, ties will effectively be broken by class index order).

I don't like returning `NaN`. Since the R impl is actually scaling things rather than "cutting off" or "thresholding", it should always return a prediction, and I think we should too.
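To make the proposal concrete, here is a minimal Python sketch of the validation rules and the scaled-argmax prediction described above. The function names (`validate_thresholds`, `predict_with_thresholds`) are hypothetical, not the actual Spark API; the sketch only illustrates the semantics being proposed.

```python
def validate_thresholds(thresholds):
    # Proposed checks, mirroring R randomForest's cutoff validation:
    # every entry strictly positive, and the sum must not exceed 1.
    if any(t <= 0 for t in thresholds):
        raise ValueError("each threshold must be > 0")
    s = sum(thresholds)
    if s > 1 or s < 0:
        raise ValueError("sum(thresholds) must lie in (0, 1]")

def predict_with_thresholds(probabilities, thresholds):
    # Scale each class probability by its threshold (p / t) and take the
    # argmax. On a tie, the lowest class index wins, matching the
    # tie-breaking behavior of Vector.argmax. A prediction is always
    # returned -- never NaN -- since this is scaling, not thresholding.
    validate_thresholds(thresholds)
    scaled = [p / t for p, t in zip(probabilities, thresholds)]
    best = 0
    for i, v in enumerate(scaled):
        if v > scaled[best]:
            best = i
    return best
```

For example, with probabilities `[0.3, 0.3, 0.4]` and thresholds `[0.5, 0.25, 0.25]`, the scaled values are `[0.6, 1.2, 1.6]` and class 2 is predicted; with all scaled values equal, class 0 wins by index order.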