[jira] [Comment Edited] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features

2017-04-11 Thread 颜发才


[ 
https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965244#comment-15965244
 ] 

Yan Facai (颜发才) edited comment on SPARK-10788 at 4/12/17 1:35 AM:
--

[~josephkb] Since categories A, B and C are independent, why not collect 
statistics per category only? I mean 1 bin per category, instead of 1 bin per 
split. 

Splits are computed in the last step, in `binsToBestSplit`, so the 
communication cost would be N bins, one per category.
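
To make the proposal concrete, here is a minimal sketch, not Spark's actual `binsToBestSplit`. It assumes each category's statistics are a plain list of per-label counts; the names `category_stats`, `unordered_splits` and `stats_for_splits` are hypothetical. Only one bin per category is collected, and the left and right statistics of every candidate split are derived afterwards by summation and subtraction.

```python
from itertools import combinations

# Hypothetical per-category label counts at one node:
# category -> [count of label 0, count of label 1]
category_stats = {"A": [4, 1], "B": [2, 3], "C": [1, 5]}


def add(u, v):
    """Element-wise sum of two count vectors."""
    return [a + b for a, b in zip(u, v)]


def unordered_splits(categories):
    """Enumerate the left-hand sides of all unordered splits.

    The first category is pinned to the left side so that each split
    is listed once; the subset containing every category is skipped.
    """
    first, rest = categories[0], categories[1:]
    lefts = []
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            lefts.append((first,) + extra)
    return lefts


def stats_for_splits(category_stats):
    """Derive (left, right) stats for each split from per-category bins."""
    categories = sorted(category_stats)
    width = len(category_stats[categories[0]])
    total = [0] * width
    for c in categories:
        total = add(total, category_stats[c])
    result = {}
    for left in unordered_splits(categories):
        left_stats = [0] * width
        for c in left:
            left_stats = add(left_stats, category_stats[c])
        right_stats = [t - l for t, l in zip(total, left_stats)]
        result[left] = (left_stats, right_stats)
    return result


splits = stats_for_splits(category_stats)
```

With three categories this communicates 3 bins and reconstructs the splits A vs. B, C; A, B vs. C; and A, C vs. B locally.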


was (Author: facai):
[~josephkb] As categories A, B and C are independent, why not collect 
statistics only for category? Splits are calculated in the last step in 
`binsToBestSplit`. So communication cost is N bins.

> Decision Tree duplicates bins for unordered categorical features
> 
>
> Key: SPARK-10788
> URL: https://issues.apache.org/jira/browse/SPARK-10788
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.0.0
>
>
> Decision trees in spark.ml (RandomForest.scala) communicate twice as much 
> data as needed for unordered categorical features.  Here's an example.
> Say there are 3 categories A, B, C.  We consider 3 splits:
> * A vs. B, C
> * A, B vs. C
> * A, C vs. B
> Currently, we collect statistics for each of the 6 subsets of categories (3 * 
> 2 = 6).  However, we could instead collect statistics for the 3 subsets on 
> the left-hand side of the 3 possible splits: A and A,B and A,C.  If we also 
> have stats for the entire node, then we can compute the stats for the 3 
> subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = 
> stats(A,B,C) - stats(A)}}.
> We should eliminate these extra bins within the spark.ml implementation since 
> the spark.mllib implementation will be removed before long (and will instead 
> call into spark.ml).
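
The subtraction described in the issue can be sketched as follows. This is an illustration under assumed data, not the spark.ml code: statistics are modelled as lists of per-label counts, and `node_total`, `left_bins` and `right_side_stats` are hypothetical names. Only the 3 left-hand bins and the node total are collected; the 3 right-hand bins are computed from them.

```python
def right_side_stats(node_total, left_stats):
    """stats(right) = stats(node) - stats(left), element-wise."""
    return [t - l for t, l in zip(node_total, left_stats)]


# Hypothetical label counts for the whole node (categories A, B, C together).
node_total = [7, 9]

# Collected bins: only the left-hand side of each of the 3 splits.
left_bins = {
    ("A",): [4, 1],
    ("A", "B"): [6, 4],
    ("A", "C"): [5, 6],
}

# Right-hand sides are derived rather than collected.
right_bins = {
    left: right_side_stats(node_total, stats)
    for left, stats in left_bins.items()
}
```

Here `right_bins[("A",)]` plays the role of `stats(B,C)` in the pseudomath above.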



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features

2015-10-01 Thread Seth Hendrickson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940316#comment-14940316
 ] 

Seth Hendrickson edited comment on SPARK-10788 at 10/1/15 8:04 PM:
---

Yes, much clearer, thanks! I can work on this task.


was (Author: sethah):
Yes, much clearer. I can work on this task.




