[ 
https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114591#comment-14114591
 ] 

Qiping Li commented on SPARK-3272:
----------------------------------

Hi Joseph, thanks for your comment. I don't think the number of instances can 
be checked in the train() method, because at that point we don't know the 
number of instances in the leftSplit or rightSplit. For each split, we can only 
get information from InformationGainStats, which doesn't contain the number of 
instances. In my implementation of SPARK-2207, the check is done in 
calculateGainForSplit: when the check fails, it returns an invalid information 
gain, and the calculation of the predict value may be skipped in that case. 

Maybe we can include the number of instances for leftSplit and rightSplit in 
the information gain stats and calculate the predict value whether or not the 
check passes. Either approach is fine with me.
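
To make the second option concrete, here is a rough sketch, not the actual 
MLlib code. GainStats is a hypothetical stand-in for InformationGainStats 
extended with instance counts, and pick_best_split and min_instances_per_node 
are names assumed for illustration only. With the counts carried in the stats, 
the caller can apply the minimum instances check itself:

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical stand-in for InformationGainStats, extended with the
# number of instances that fall into each child of the split.
@dataclass
class GainStats:
    gain: float
    left_count: int
    right_count: int


def pick_best_split(candidates: List[GainStats],
                    min_instances_per_node: int) -> Optional[GainStats]:
    """Return the highest-gain split whose children both meet the minimum.

    Returns None when no candidate satisfies the requirement.
    """
    valid = [s for s in candidates
             if s.left_count >= min_instances_per_node
             and s.right_count >= min_instances_per_node]
    return max(valid, key=lambda s: s.gain, default=None)


candidates = [GainStats(0.40, 1, 9), GainStats(0.25, 4, 6)]
best = pick_best_split(candidates, min_instances_per_node=3)
```

In this sketch the split with the higher gain is rejected because its left 
child holds only one instance, so the caller falls back to the second split. 
If no split qualifies, the result is None and the node would become a leaf.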

> Calculate prediction for nodes separately from calculating information gain 
> for splits in decision tree
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3272
>                 URL: https://issues.apache.org/jira/browse/SPARK-3272
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Qiping Li
>             Fix For: 1.1.0
>
>
> In the current implementation, the prediction for a node is calculated along 
> with the information gain stats for each possible split. The value to 
> predict for a specific node does not depend on the splits.
> To save computation, we can calculate the prediction first and then 
> calculate information gain stats for each split.
> This is also necessary if we want to support a minimum instances per node 
> parameter ([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]), 
> because when no split satisfies the minimum instances requirement, we 
> don't use the information gain of any split. There should still be a way to 
> get the prediction value.
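
The idea in the description can be illustrated with a small sketch. This is an 
assumption-laden toy, not the MLlib implementation: it uses a majority-vote 
prediction and Gini impurity, and predict_for_node, gini, and gain_for_split 
are hypothetical names. The prediction is computed once from the node's labels, 
so it is still available when every split fails the minimum instances check:

```python
from collections import Counter


def predict_for_node(labels):
    """Majority label at the node; does not depend on any split."""
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gain_for_split(labels, left_idx, min_instances_per_node):
    """Gini gain of a split, or None when a child is too small."""
    left_set = set(left_idx)
    left = [l for i, l in enumerate(labels) if i in left_set]
    right = [l for i, l in enumerate(labels) if i not in left_set]
    if (len(left) < min_instances_per_node
            or len(right) < min_instances_per_node):
        return None  # invalid split: no usable information gain
    n = len(labels)
    return (gini(labels)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


labels = [1, 1, 1, 0]
prediction = predict_for_node(labels)  # computed once, before any split
gains = [gain_for_split(labels, [3], 2), gain_for_split(labels, [0], 2)]
```

Here both candidate splits leave a child with a single instance, so neither 
yields a usable gain, yet the node still has a prediction.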



--
This message was sent by Atlassian JIRA
(v6.2#6252)
