[jira] [Commented] (HIVEMALL-245) Refactor RandomForest for Sparse Data handling

ASF GitHub Bot (JIRA) Tue, 06 Aug 2019 01:06:09 -0700


    [ https://issues.apache.org/jira/browse/HIVEMALL-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900743#comment-16900743 ]


ASF GitHub Bot commented on HIVEMALL-245:
-----------------------------------------

myui commented on issue #198: [WIP][HIVEMALL-245] Refactor RandomForest for Sparse Data handling
URL: https://github.com/apache/incubator-hivemall/pull/198#issuecomment-518488243
 
 
   TODO
   - [x] prune redundant branches
   - [x] reduce memory usage by using RoaringBitmap for AttributeType
   - [ ] support prediction tracing 
   https://issues.apache.org/jira/browse/HIVEMALL-171
   - [x] refactor RegressionTree as well
   - [ ] introduce more [sophisticated post pruning](https://en.wikipedia.org/wiki/Decision_tree_pruning#Reduced_error_pruning)
   - [ ] support the default value for missing values
   - [ ] Fix split handling of sparse numeric values
     - problem: no split occurs when a column has only a single distinct value
       - `if (x <= 1.0) { ... } else { ... }` never splits when sparse x is always 1.0
     - if a column has only a single value, then treat it as a nominal value (?)
   - [ ] use a more robust RNG in [feature sampling](https://github.com/apache/incubator-hivemall/blob/master/core/src/main/java/hivemall/smile/classification/DecisionTree.java#L636)
   ```
   Reservoir sampling returns a poorly shuffled result for a small stream, but
   that is accepted in the Best splitter (?) => more randomness might be
   required (replace java.util.Random?)
   ```
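
To illustrate the note above, here is a minimal sketch of reservoir sampling (Algorithm R) with an injected RNG. It is not the code behind the linked `DecisionTree.java` line; the class and method names are hypothetical, and `SecureRandom` is only one possible replacement for `java.util.Random`. It shows why a small stream comes back barely shuffled: when the stream length equals the sample size, the reservoir is returned in its original order.

```java
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Random;

/** Hypothetical helper, not Hivemall code: samples k feature indices out of n. */
final class FeatureSamplingSketch {

    /**
     * Reservoir sampling (Algorithm R): returns k distinct indices drawn from [0, n).
     * The RNG is injected so java.util.Random can be swapped for another source.
     */
    static int[] sampleFeatures(int n, int k, Random rng) {
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("require 0 <= k <= n");
        }
        int[] reservoir = new int[k];
        for (int i = 0; i < k; i++) {
            reservoir[i] = i; // fill the reservoir with the first k indices
        }
        for (int i = k; i < n; i++) {
            int j = rng.nextInt(i + 1); // uniform in [0, i]
            if (j < k) {
                reservoir[j] = i; // index i enters with probability k / (i + 1)
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // With n == k the replacement loop never runs, so the order is untouched.
        System.out.println(Arrays.toString(sampleFeatures(4, 4, new Random(42L)))); // [0, 1, 2, 3]

        // SecureRandom extends Random, so it can be passed in unchanged.
        System.out.println(Arrays.toString(sampleFeatures(10, 3, new SecureRandom())));
    }
}
```

The selected set is still a uniform sample; only the order within the reservoir is biased, which matters only if a caller relies on that order.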
   - [ ] optimize [split](https://github.com/scikit-learn/scikit-learn/blob/4de404d46d24805ff48ad255ec3169a5155986f0/sklearn/tree/_tree.pyx#L224)
  
   ```
   min_sample_leaf >= 2 requires min_sample_split >= 4.
   So, a split only happens when an intermediate node has >= 2 *
   min_sample_leaf samples.
   E.g., min_sample_leaf = 2 replaces min_sample_split = 3 with min_sample_split = 4.
   ```
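
The rule in the note above can be sketched as a small guard. The class, method, and parameter names below are hypothetical and are not taken from Hivemall or scikit-learn; the only assumption is that a node needs at least `2 * minSamplesLeaf` samples so that both children can satisfy the leaf minimum.

```java
/** Hypothetical helper, not Hivemall or scikit-learn code. */
final class SplitGuardSketch {

    /** Raises minSamplesSplit so that both children can hold minSamplesLeaf samples. */
    static int effectiveMinSplit(int minSamplesSplit, int minSamplesLeaf) {
        return Math.max(minSamplesSplit, 2 * minSamplesLeaf);
    }

    /** A node is only evaluated for a split when it has enough samples. */
    static boolean canSplit(int nodeSamples, int minSamplesSplit, int minSamplesLeaf) {
        return nodeSamples >= effectiveMinSplit(minSamplesSplit, minSamplesLeaf);
    }

    public static void main(String[] args) {
        System.out.println(effectiveMinSplit(3, 2)); // 4
        System.out.println(canSplit(3, 3, 2));       // false
        System.out.println(canSplit(4, 3, 2));       // true
    }
}
```

Checking this before the split search lets a node with too few samples become a leaf without scanning any feature.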
   - [ ] memoize [a redundant split computation](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_splitter.pyx#L1241)
   ```
   Skip the CPU intensive evaluation of the impurity criterion for features 
that were already detected as constant (hence not suitable for good splitting) 
by ancestor nodes and save the information on newly discovered constant 
features to spare computation on descendant nodes.
   ```
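
A minimal sketch of the memoization described above, assuming a dense `double[][]` feature matrix and a `BitSet` of features already known to be constant. All names are hypothetical; this is neither the scikit-learn splitter nor Hivemall code.

```java
import java.util.BitSet;

/** Hypothetical sketch, not Hivemall or scikit-learn code. */
final class ConstantFeatureSketch {

    /** Returns true when column `feature` has one distinct value among `rows`. */
    static boolean isConstant(double[][] x, int[] rows, int feature) {
        for (int r : rows) {
            if (x[r][feature] != x[rows[0]][feature]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns the features still worth evaluating at this node. Features flagged by
     * ancestors are skipped without rescanning; newly found constant features are
     * added to `knownConstant` so that descendants inherit them.
     */
    static BitSet candidateFeatures(double[][] x, int[] rows, BitSet knownConstant) {
        int numFeatures = x[0].length;
        BitSet candidates = new BitSet(numFeatures);
        for (int f = 0; f < numFeatures; f++) {
            if (knownConstant.get(f)) {
                continue; // constant in an ancestor implies constant in every subset
            }
            if (isConstant(x, rows, f)) {
                knownConstant.set(f); // remember for descendant nodes
            } else {
                candidates.set(f); // the impurity criterion is evaluated only for these
            }
        }
        return candidates;
    }
}
```

Each child should receive its own copy, e.g. `(BitSet) knownConstant.clone()`, because a feature that is constant in one child may still vary in its sibling.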
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Refactor RandomForest for Sparse Data handling
> ----------------------------------------------
>
>                 Key: HIVEMALL-245
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-245
>             Project: Hivemall
>          Issue Type: Improvement
>    Affects Versions: 0.5.2
>            Reporter: Makoto Yui
>            Assignee: Makoto Yui
>            Priority: Major
>             Fix For: 0.6.0
>
>
> * Fix attribute to use RoaringBitmap instead of AttributeType[]
> * Support pruning of redundant decision tree nodes
> * Support the default value for missing values
> * Fix split handling of sparse numeric values
> ** problem: no split occurs when a column has only a single distinct value
> *** if (x <= 1.0) { ... } else { ... } never splits when sparse x is always 1.0
> ** if a column has only a single value, then treat it as a nominal value (?)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
