[ 
https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14115088#comment-14115088
 ] 

Qiping Li commented on SPARK-3272:
----------------------------------

Hi Joseph, sorry for the late reply. I still think we should store the number of
instances for the left and right children, because whether a node is a leaf is
determined by whether the best split sends enough instances to both the left
and right children.
Even if a node has enough instances, it should still be a leaf if its best
split doesn't satisfy the min instances requirement.
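
To illustrate the rule above, here is a minimal sketch in Python. It is not the
MLlib code; SplitStats, is_valid_split and min_instances_per_node are
hypothetical names chosen for this example. It only shows why the per-child
counts are needed: the node total alone cannot tell us whether the split is
acceptable.

```python
from dataclasses import dataclass


@dataclass
class SplitStats:
    """Hypothetical per-split record holding the child instance counts."""
    left_count: int
    right_count: int


def is_valid_split(stats: SplitStats, min_instances_per_node: int) -> bool:
    # A split is acceptable only if BOTH children receive enough instances.
    return (stats.left_count >= min_instances_per_node
            and stats.right_count >= min_instances_per_node)


# Both nodes hold 100 instances, but only one of them can be split.
lopsided = SplitStats(left_count=99, right_count=1)
balanced = SplitStats(left_count=60, right_count=40)

lopsided_ok = is_valid_split(lopsided, min_instances_per_node=5)
balanced_ok = is_valid_split(balanced, min_instances_per_node=5)
```

With a minimum of 5 instances per node, the lopsided split is rejected, so that
node stays a leaf even though it holds 100 instances.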

As for the invalid information gain value, it is just a constant that denotes
a split that makes no sense because it doesn't satisfy the min info gain or min
instances per node requirements. I think there should be a specific value for
this: an invalid split should be marked as invalid so the main loop knows not
to pick it, even though we can still calculate an info gain for it.
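
A rough sketch of the sentinel idea, again in Python and with hypothetical
names (INVALID_GAIN, gain_or_invalid, pick_best_split). Using negative infinity
as the constant is an assumption made for this example, not necessarily the
value MLlib would use.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical sentinel: lower than any real gain, so it can never win.
INVALID_GAIN = -math.inf


def gain_or_invalid(raw_gain: float, left_count: int, right_count: int,
                    min_info_gain: float, min_instances_per_node: int) -> float:
    # Mark the split invalid if either child is too small ...
    if left_count < min_instances_per_node or right_count < min_instances_per_node:
        return INVALID_GAIN
    # ... or if the gain itself is below the threshold.
    if raw_gain < min_info_gain:
        return INVALID_GAIN
    return raw_gain


def pick_best_split(candidates: List[Tuple[float, int, int]],
                    min_info_gain: float,
                    min_instances_per_node: int) -> Optional[int]:
    """Return the index of the best valid split, or None if the node is a leaf.

    Each candidate is (raw_gain, left_count, right_count).
    """
    best_index, best_gain = None, INVALID_GAIN
    for i, (raw_gain, left, right) in enumerate(candidates):
        gain = gain_or_invalid(raw_gain, left, right,
                               min_info_gain, min_instances_per_node)
        if gain > best_gain:  # an invalid split never passes this check
            best_index, best_gain = i, gain
    return best_index


candidates = [(0.9, 99, 1), (0.3, 60, 40), (0.0, 50, 50)]
best = pick_best_split(candidates, min_info_gain=0.1, min_instances_per_node=5)
```

Here the first candidate has the highest raw gain but is marked invalid because
its right child is too small, so the loop picks the second one. If every
candidate is invalid, the function returns None and the node becomes a leaf.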

> Calculate prediction for nodes separately from calculating information gain 
> for splits in decision tree
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3272
>                 URL: https://issues.apache.org/jira/browse/SPARK-3272
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Qiping Li
>             Fix For: 1.1.0
>
>
> In the current implementation, the prediction for a node is calculated along
> with the information gain stats for each possible split. However, the value
> to predict for a specific node is the same no matter what the splits are.
> To save computation, we can calculate the prediction first and then
> calculate the information gain stats for each split.
> This is also necessary if we want to support the minimum instances per node
> parameter ([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]),
> because when no split satisfies the minimum instances requirement, we
> don't use the information gain of any split. There should still be a way to
> get the prediction value.
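
The separation described in the issue could look roughly like the Python sketch
below. It assumes a classification tree with Gini impurity and majority-class
prediction; all function names are hypothetical and it is not the MLlib
implementation.

```python
from typing import Dict, List, Tuple

Counts = Dict[int, int]  # label -> number of instances with that label


def predict(label_counts: Counts) -> int:
    # The prediction depends only on the node's own labels (majority class).
    return max(label_counts, key=lambda label: label_counts[label])


def gini(counts: Counts) -> float:
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def info_gain(parent: Counts, left: Counts, right: Counts) -> float:
    n = sum(parent.values())
    n_left, n_right = sum(left.values()), sum(right.values())
    return gini(parent) - (n_left / n) * gini(left) - (n_right / n) * gini(right)


def evaluate_node(parent: Counts,
                  splits: List[Tuple[Counts, Counts]]) -> Tuple[int, List[float]]:
    prediction = predict(parent)  # computed once, before looking at any split
    gains = [info_gain(parent, left, right) for left, right in splits]
    return prediction, gains


parent = {0: 6, 1: 4}
splits = [({0: 6}, {1: 4}), ({0: 3, 1: 2}, {0: 3, 1: 2})]
prediction, gains = evaluate_node(parent, splits)
```

Because the prediction is computed before the loop over splits, it is still
available when the list of usable splits is empty.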


