Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/142#discussion_r123622385
  
    --- Diff: src/modules/recursive_partitioning/DT_impl.hpp ---
    @@ -486,8 +485,18 @@ DecisionTree<Container>::expand(const Accumulator &state,
                 Index stats_i = static_cast<Index>(state.stats_lookup(i));
                 assert(stats_i >= 0);
     
    -            // 1. Set the prediction for current node from stats of all rows
    -            predictions.row(current) = state.node_stats.row(stats_i);
    +            if (statCount(predictions.row(current)) !=
    +                    statCount(state.node_stats.row(stats_i))){
    +                // Predictions for each node is set by its parent using stats
    +                // recorded while training parent node. These stats do not include
    +                // rows that had a NULL value for the primary split feature.
    +                // The NULL count is included in the 'node_stats' while training
    +                // current node. Further, presence of NULL rows indicate that
    +                // stats used for deciding 'children_wont_split' are inaccurate.
    +                // Hence avoid using the flag to decide termination.
    +                predictions.row(current) = state.node_stats.row(stats_i);
    +                children_wont_split = false;
    +            }
    --- End diff --
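
The check in the diff can be read in isolation as the sketch below. It is a minimal illustration, not the MADlib implementation: `StatsRow`, `rowCount`, and `refreshPrediction` are hypothetical names, and storing the row count in the last entry of a stats row is an assumption made only to keep the example self-contained.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one row of per-node statistics.
// Assumption: the last entry holds the number of rows counted for the node.
using StatsRow = std::vector<double>;

// Hypothetical analogue of statCount(): reads the row count from a stats row.
inline double rowCount(const StatsRow &stats) {
    assert(!stats.empty());
    return stats.back();
}

// If the prediction set by the parent was computed from a different number
// of rows than the node's own stats (for example because NULL rows were
// left out), overwrite the prediction and clear the early-termination flag.
// Returns true when the prediction was refreshed.
inline bool refreshPrediction(StatsRow &prediction,
                              const StatsRow &node_stats,
                              bool &children_wont_split) {
    if (rowCount(prediction) != rowCount(node_stats)) {
        prediction = node_stats;
        children_wont_split = false;
        return true;
    }
    return false;
}
```

When the counts already agree, the function leaves both the prediction and the flag untouched, which mirrors the fact that the diff only assigns inside the `if`.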
    
    - `children_wont_split` is **one** of the factors that determine whether training should stop after the current iteration. `children_wont_split=true` implies training stops; `children_wont_split=false` implies other flags determine termination.
    - Lines 516-547 find the best feature to split on and are necessary, independent of both `children_wont_split` and the result of line 490.
    
    I could exchange sections 1 and 2 since they're independent, if that helps in reading the code.
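
The flag semantics described in the first bullet can be sketched as follows. This is only an illustration under stated assumptions: `TerminationFlags`, `trainingShouldStop`, and the two extra flags `max_depth_reached` and `no_split_found` are hypothetical placeholders for the "other flags"; only `children_wont_split` comes from the discussion.

```cpp
#include <cassert>

// Hypothetical bundle of stopping conditions. Only children_wont_split is
// named in the discussion; the other two members are assumed placeholders.
struct TerminationFlags {
    bool children_wont_split;
    bool max_depth_reached;  // assumed, for illustration only
    bool no_split_found;     // assumed, for illustration only
};

// children_wont_split == true is sufficient to stop training.
// When it is false, the decision falls through to the remaining flags.
inline bool trainingShouldStop(const TerminationFlags &f) {
    if (f.children_wont_split) {
        return true;
    }
    return f.max_depth_reached || f.no_split_found;
}
```

Under this reading, setting `children_wont_split = false` does not force training to continue; it only hands the decision to the remaining conditions.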

