[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940316#comment-14940316 ]
Seth Hendrickson commented on SPARK-10788: ------------------------------------------ Yes, much clearer. I can work on this task. > Decision Tree duplicates bins for unordered categorical features > ---------------------------------------------------------------- > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML > Reporter: Joseph K. Bradley > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org