Github user smurching commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    The failing SparkR test (which compares `RandomForest` predictions to hardcoded values) is failing not because of a correctness issue but (AFAICT) because of an implementation change in best-split selection.
    
    In this PR we recompute the parent node's impurity stats when considering each split for a feature, instead of computing them once per feature (you can see this by comparing `RandomForest.calculateImpurityStats` in Spark master with `ImpurityUtils.calculateImpurityStats` in this PR).
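    To make the distinction concrete, here is a rough, hypothetical sketch (plain Gini impurity on per-class counts, not the actual `RandomForest`/`ImpurityUtils` code) contrasting a parent impurity computed once per feature with one re-derived for every candidate split:

```scala
// Hypothetical sketch only; a simplified stand-in for the impurity-stats logic.
object ParentImpuritySketch {

  // Gini impurity from per-class counts.
  def gini(counts: Array[Double]): Double = {
    val total = counts.sum
    if (total == 0.0) 0.0
    else 1.0 - counts.map(c => (c / total) * (c / total)).sum
  }

  // "Once per feature": the parent impurity is computed a single time and
  // reused for every candidate split of the feature.
  def gainsCachedParent(parentCounts: Array[Double],
                        splits: Seq[(Array[Double], Array[Double])]): Seq[Double] = {
    val parentImpurity = gini(parentCounts)
    val total = parentCounts.sum
    splits.map { case (left, right) =>
      parentImpurity - (left.sum / total) * gini(left) - (right.sum / total) * gini(right)
    }
  }

  // "Per split": the parent counts (and their impurity) are rebuilt from the
  // left/right child stats for every candidate split. Mathematically the same,
  // but the floating-point accumulation order now depends on the split.
  def gainsRecomputedParent(splits: Seq[(Array[Double], Array[Double])]): Seq[Double] = {
    splits.map { case (left, right) =>
      val parent = left.zip(right).map { case (l, r) => l + r }
      val total = parent.sum
      gini(parent) - (left.sum / total) * gini(left) - (right.sum / total) * gini(right)
    }
  }
}
```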
    
    Repeatedly computing the parent impurity stats yields slightly different values at each iteration due to Double precision limitations. This in turn can cause different splits to be selected: e.g. if two splits have mathematically equal gains, Double precision limitations can make one split's computed gain slightly higher or lower than the other's, influencing tiebreaking.
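    As a minimal, self-contained illustration of that effect (not the Spark code itself): accumulating the same statistics in a different order can change the last bits of the result, which is enough to flip a strict comparison between two gains that are mathematically equal.

```scala
// Minimal sketch: the same mathematical sum, accumulated in two different
// orders, produces two different Doubles, so a strict ">" between two
// "equal" gains can go either way depending on how the stats were combined.
object PrecisionTiebreak extends App {
  val stats = Array(0.1, 0.2, 0.3)

  val leftToRight = stats.foldLeft(0.0)(_ + _)          // (0.1 + 0.2) + 0.3
  val rightToLeft = stats.reverse.foldLeft(0.0)(_ + _)  // (0.3 + 0.2) + 0.1

  println(leftToRight)               // 0.6000000000000001
  println(rightToLeft)               // 0.6
  println(leftToRight > rightToLeft) // true: one "tied" split now looks strictly better
}
```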

