Joseph K. Bradley created SPARK-10788:
-----------------------------------------

             Summary: Decision Tree duplicates bins for unordered categorical features
                 Key: SPARK-10788
                 URL: https://issues.apache.org/jira/browse/SPARK-10788
             Project: Spark
          Issue Type: Improvement
          Components: ML
            Reporter: Joseph K. Bradley


Decision trees in spark.ml (RandomForest.scala) effectively create a second 
copy of each split for unordered categorical features. E.g., if there are 
3 categories A, B, C, then we should consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B

Currently, we also consider the 3 flipped splits:
* B, C vs. A
* C vs. A, B
* B vs. A, C

This means we communicate twice as much data as needed for these features.
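
To illustrate the duplication, here is a minimal Python sketch. It is not the spark.ml code, and the function name `unordered_splits` and its `dedupe` flag are hypothetical. It assumes one simple way to avoid flipped copies: always keep the first category on the left side, which leaves 2^(M-1) - 1 distinct splits for M categories instead of 2^M - 2.

```python
from itertools import combinations


def unordered_splits(categories, dedupe=True):
    """Enumerate binary splits (left, right) of unordered categories.

    Hypothetical helper for illustration only. With dedupe=True, the first
    category is pinned to the left side, so a split and its flipped copy
    are never both produced.
    """
    cats = list(categories)
    splits = []
    # Left side sizes 1 .. M-1, so both sides are always non-empty.
    for size in range(1, len(cats)):
        for left in combinations(cats, size):
            if dedupe and cats[0] not in left:
                continue  # this is the flipped copy of a split already kept
            right = tuple(c for c in cats if c not in left)
            splits.append((left, right))
    return splits


unique = unordered_splits(["A", "B", "C"])
with_flips = unordered_splits(["A", "B", "C"], dedupe=False)
print(len(unique), len(with_flips))
```

For the three categories above, the deduplicated list is A vs. B, C; A, B vs. C; and A, C vs. B, while the version with flips has twice as many entries.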

We should eliminate these duplicate splits within the spark.ml implementation, 
since the spark.mllib implementation will be removed before long (spark.mllib 
will instead call into spark.ml).


