[ https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114591#comment-14114591 ]
Qiping Li commented on SPARK-3272:
----------------------------------

Hi Joseph, thanks for your comment. I don't think the check on the number of instances can be done in the train() method, because there we don't know the number of instances in the leftSplit or rightSplit. For each split, the only information available is InformationGainStats, which doesn't contain the number of instances.

In my implementation of SPARK-2207, the check is done in calculateGainForSplit: when the check fails, it returns an invalid information gain, and the calculation of the predicted value may be skipped in that case. Alternatively, we could include the number of instances for leftSplit and rightSplit in the information gain stats and calculate the predicted value whether or not the check passes.

Either approach is fine with me.

> Calculate prediction for nodes separately from calculating information gain for splits in decision tree
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3272
>                 URL: https://issues.apache.org/jira/browse/SPARK-3272
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Qiping Li
>             Fix For: 1.1.0
>
>
> In the current implementation, the prediction for a node is calculated along with the information gain stats for each possible split. The value to predict for a specific node is fixed, no matter what the splits are. To save computation, we can calculate the prediction first and then calculate the information gain stats for each split.
> This is also necessary if we want to support a minimum instances per node parameter ([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]), because when no split satisfies the minimum instances requirement, we don't use the information gain of any split. There should still be a way to get the prediction value.
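To make the second option in the comment above concrete, here is a rough standalone sketch in Python. It is not MLlib code: `GainStats`, `gini`, `predict_for_node`, `gain_for_split`, `best_split` and `min_instances_per_node` are all hypothetical names chosen for illustration, and Gini impurity with a majority-vote prediction is just one assumed setup. The point is only that the prediction is computed once per node from its labels, while each split's stats carry the left and right instance counts so the minimum-instances check can happen outside the gain calculation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class GainStats:
    """Hypothetical stand-in for InformationGainStats, extended with counts."""
    gain: float
    left_count: int
    right_count: int


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a collection of class labels (0.0 if empty)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def predict_for_node(labels: Sequence[int]) -> int:
    """Majority label of a non-empty node; does not depend on any split."""
    return Counter(labels).most_common(1)[0][0]


def gain_for_split(left: Sequence[int], right: Sequence[int]) -> GainStats:
    """Impurity reduction for one candidate split, plus child sizes."""
    n = len(left) + len(right)
    parent = gini(list(left) + list(right))
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return GainStats(parent - weighted, len(left), len(right))


def best_split(
    candidates: List[Tuple[Sequence[int], Sequence[int]]],
    min_instances_per_node: int,
) -> Optional[Tuple[int, GainStats]]:
    """Best (index, stats) among splits whose children are both large enough.

    Returns None when every candidate fails the minimum-instances check.
    """
    best: Optional[Tuple[int, GainStats]] = None
    for i, (left, right) in enumerate(candidates):
        stats = gain_for_split(left, right)
        if (stats.left_count < min_instances_per_node
                or stats.right_count < min_instances_per_node):
            continue  # split rejected, but the node prediction is unaffected
        if best is None or stats.gain > best[1].gain:
            best = (i, stats)
    return best


if __name__ == "__main__":
    labels = [0, 0, 0, 1]
    candidates = [([0], [0, 0, 1]), ([0, 0], [0, 1]), ([0, 0, 0], [1])]
    print(predict_for_node(labels))   # prediction is available up front
    print(best_split(candidates, 3))  # no split qualifies, so the node is a leaf
```

With this shape, a node for which `best_split` returns `None` can still become a leaf using the prediction computed beforehand, which is the situation the issue description is concerned with.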