[jira] [Commented] (SPARK-3383) DecisionTree aggregate size could be smaller

2017-11-06 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240284#comment-16240284
 ] 

Yan Facai (颜发才) commented on SPARK-3383:


[~WeichenXu123] Good work! I'd like to take a look when time allows. In any 
case, I believe unordered features can benefit a lot from the PR.

> DecisionTree aggregate size could be smaller
> 
>
> Key: SPARK-3383
> URL: https://issues.apache.org/jira/browse/SPARK-3383
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Storage and communication optimization:
> DecisionTree aggregate statistics could store less data (described below).  
> The savings would be significant for datasets with many low-arity categorical 
> features (binary features, or unordered categorical features).  Savings would 
> be negligible for continuous features.
> DecisionTree stores a vector of sufficient statistics for each (node, feature, 
> bin).  We could store 1 fewer bin per (node, feature): For a given (node, 
> feature), if we store these vectors for all but the last bin, and also store 
> the total statistics for each node, then we could compute the statistics for 
> the last bin.  For binary and unordered categorical features, this would cut 
> in half the number of bins to store and communicate.
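The last-bin reconstruction described above can be sketched outside of Spark. This is illustrative Python with NumPy, not the MLlib implementation; the names `compress` and `decompress` are invented here:

```python
import numpy as np

def compress(bin_stats):
    """Drop the last bin's statistics vector for a given (node, feature).

    bin_stats: array of shape (num_bins, stat_size), one sufficient-statistics
    vector per bin. Returns the first num_bins - 1 vectors plus the node total
    (which in practice would be stored once per node, shared across features).
    """
    node_total = bin_stats.sum(axis=0)
    return bin_stats[:-1], node_total

def decompress(stored_bins, node_total):
    """Recover the full per-bin statistics: the last bin is the node total
    minus the sum of the stored bins."""
    last_bin = node_total - stored_bins.sum(axis=0)
    return np.vstack([stored_bins, last_bin])

# For a binary feature (2 bins), only 1 bin vector is stored per
# (node, feature), halving the per-feature data as the description says.
stats = np.array([[3.0, 1.0], [5.0, 2.0]])  # 2 bins, 2 stats each
stored, total = compress(stats)
assert np.array_equal(decompress(stored, total), stats)
```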



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3383) DecisionTree aggregate size could be smaller

2017-11-06 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240106#comment-16240106
 ] 

Weichen Xu commented on SPARK-3383:
---

[~facai] Oh, I did not notice you had commented here. I think the idea you 
mentioned above is exactly what is done in my PR 
https://github.com/apache/spark/pull/19666
Would you mind helping to review it? Thanks!








[jira] [Commented] (SPARK-3383) DecisionTree aggregate size could be smaller

2017-04-11 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965263#comment-15965263
 ] 

Yan Facai (颜发才) commented on SPARK-3383:


How about this idea? 

1. Use `bin` to represent a value: a quantized value for a continuous feature, 
or a category for a discrete feature. In DTStatsAggregator, collect only the 
stats of bins. At this stage all operations are the same, whether the feature 
is continuous or discrete.

2. In `binsToBestSplit`, 
+ A continuous or ordered discrete feature with N bins yields N - 1 
splits. 
+ An unordered discrete feature with N bins yields all possible 
combinations, namely 2^(N-1) - 1 splits.

3. In `binsToBestSplit`, collect all splits and calculate their impurity, then 
sort the splits and pick the best one.
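The split counts in step 2 can be sketched as follows. This is illustrative Python, not Spark code; the helper names are invented, and the unordered case fixes bin 0 on the left side to avoid counting each split twice (once mirrored):

```python
from itertools import combinations

def continuous_splits(num_bins):
    """An ordered feature with N bins yields N - 1 threshold splits:
    split i sends bins 0..i to the left child and the rest to the right."""
    return [tuple(range(i + 1)) for i in range(num_bins - 1)]

def unordered_splits(num_bins):
    """An unordered categorical feature with N bins yields every proper
    subset assignment up to left/right symmetry: 2^(N-1) - 1 splits.
    Pinning bin 0 to the left side removes the mirror duplicates."""
    rest = range(1, num_bins)
    splits = []
    for k in range(num_bins - 1):          # sizes of the rest of the left set
        for combo in combinations(rest, k):
            splits.append((0,) + combo)
    return splits

assert len(continuous_splits(4)) == 3           # N - 1
assert len(unordered_splits(4)) == 2 ** 3 - 1   # 2^(N-1) - 1
```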







[jira] [Commented] (SPARK-3383) DecisionTree aggregate size could be smaller

2017-04-10 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963741#comment-15963741
 ] 

Yan Facai (颜发才) commented on SPARK-3383:


I think the task contains two subtasks:

1. Separate `split` from `bin`: currently, for each unordered categorical 
feature there is 1 bin per split. That is, for N categories the communication 
cost is 2^(N-1) - 1 bins. However, if we only collect stats for each category 
and construct the splits at the end (namely, 1 bin per category), the 
communication cost is N bins.

2. As said in the Description, store all but the last bin, and also store the 
total statistics for each node. The communication cost will then be N - 1 bins.
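The per-(node, feature) costs of the two subtasks can be compared with a small calculator. This is an illustrative helper, not Spark code; the scheme names are invented here:

```python
def bins_communicated(num_categories, scheme):
    """Per-(node, feature) bin count for an unordered categorical feature
    under the three schemes discussed in this thread."""
    n = num_categories
    if scheme == "bin_per_split":      # current: one bin per candidate split
        return 2 ** (n - 1) - 1
    if scheme == "bin_per_category":   # subtask 1: stats per category only
        return n
    if scheme == "drop_last_bin":      # subtask 2: reconstruct the last bin
        return n - 1
    raise ValueError(f"unknown scheme: {scheme}")

# For 10 categories: 511 vs 10 vs 9 bins communicated.
assert bins_communicated(10, "bin_per_split") == 511
assert bins_communicated(10, "bin_per_category") == 10
assert bins_communicated(10, "drop_last_bin") == 9
```

The gap grows exponentially with arity, which is why the savings matter most for low-arity unordered features that are numerous, and for binary features where the bin count is halved.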

I have a question:
1. Why are unordered features only allowed in multiclass classification?




