Joseph K. Bradley created SPARK-10788:
-----------------------------------------

Summary: Decision Tree duplicates bins for unordered categorical features
Key: SPARK-10788
URL: https://issues.apache.org/jira/browse/SPARK-10788
Project: Spark
Issue Type: Improvement
Components: ML
Reporter: Joseph K. Bradley

Decision trees in spark.ml (RandomForest.scala) effectively create a second copy of each split for unordered categorical features. E.g., if there are 3 categories A, B, C, then we should consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B

Currently, we also consider the 3 flipped splits:
* B, C vs. A
* C vs. A, B
* B vs. A, C

This means we communicate twice as much data as needed for these features.

We should eliminate these duplicate splits within the spark.ml implementation since the spark.mllib implementation will be removed before long (and will instead call into spark.ml).
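
To make the counting concrete, here is a minimal Python sketch of the two enumerations. It is not the code in RandomForest.scala; the function names `unique_splits` and `all_ordered_splits` are hypothetical and exist only for this illustration. It assumes the category labels are distinct. Pinning the first category to the left side is one simple way to ensure a split and its flipped copy are never both generated, which gives 2^(k-1) - 1 splits for k categories instead of 2^k - 2.

```python
from itertools import combinations


def unique_splits(categories):
    """Enumerate each unordered two-way partition exactly once.

    Hypothetical helper for illustration only. The first category is
    pinned to the left side, so a split and its mirror image cannot
    both appear. Assumes category labels are distinct.
    """
    cats = list(categories)
    if len(cats) < 2:
        return []
    anchor, rest = cats[0], cats[1:]
    splits = []
    # r < len(rest) keeps at least one category on the right side.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = (anchor,) + extra
            right = tuple(c for c in rest if c not in extra)
            splits.append((left, right))
    return splits


def all_ordered_splits(categories):
    """Enumerate every (left, right) pair, flipped copies included.

    Hypothetical helper that mirrors the duplicated behaviour
    described in this issue.
    """
    cats = list(categories)
    splits = []
    for r in range(1, len(cats)):
        for left in combinations(cats, r):
            right = tuple(c for c in cats if c not in left)
            splits.append((left, right))
    return splits


if __name__ == "__main__":
    for left, right in unique_splits(["A", "B", "C"]):
        print(", ".join(left), "vs.", ", ".join(right))
    print(len(unique_splits(["A", "B", "C"])), "unique splits")
    print(len(all_ordered_splits(["A", "B", "C"])), "splits with flipped copies")
```

For the three categories in the example, the first function yields the 3 splits listed as desired, and the second yields all 6, so any per-split statistics aggregated for the second set are twice as large as needed.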