[ 
https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10788:
--------------------------------------
    Description: 
Decision trees in spark.ml (RandomForest.scala) communicate twice as much data 
as needed for unordered categorical features.  Here's an example.

Say there are 3 categories A, B, C.  We consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B

Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 
= 6).  However, we could instead collect statistics for the 3 subsets on the 
left-hand side of the 3 possible splits: A and A,B and A,C.  If we also have 
stats for the entire node, then we can compute the stats for the 3 subsets on 
the right-hand side of the splits. In pseudomath: {{stats(B,C) = stats(A,B,C) - 
stats(A)}}.

We should eliminate these extra bins within the spark.ml implementation since 
the spark.mllib implementation will be removed before long (and will instead 
call into spark.ml).

  was:
Decision trees in spark.ml (RandomForest.scala) effectively creates a second 
copy of each split. E.g., if there are 3 categories A, B, C, then we should 
consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B

Currently, we also consider the 3 flipped splits:
* B,C vs. A
* C vs. A, B
* B vs. A, C

This means we communicate twice as much data as needed for these features.

We should eliminate these duplicate splits within the spark.ml implementation 
since the spark.mllib implementation will be removed before long (and will 
instead call into spark.ml).


> Decision Tree duplicates bins for unordered categorical features
> ----------------------------------------------------------------
>
>                 Key: SPARK-10788
>                 URL: https://issues.apache.org/jira/browse/SPARK-10788
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> Decision trees in spark.ml (RandomForest.scala) communicate twice as much 
> data as needed for unordered categorical features.  Here's an example.
> Say there are 3 categories A, B, C.  We consider 3 splits:
> * A vs. B, C
> * A, B vs. C
> * A, C vs. B
> Currently, we collect statistics for each of the 6 subsets of categories (3 * 
> 2 = 6).  However, we could instead collect statistics for the 3 subsets on 
> the left-hand side of the 3 possible splits: A and A,B and A,C.  If we also 
> have stats for the entire node, then we can compute the stats for the 3 
> subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = 
> stats(A,B,C) - stats(A)}}.
> We should eliminate these extra bins within the spark.ml implementation since 
> the spark.mllib implementation will be removed before long (and will instead 
> call into spark.ml).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to