[ https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14115088#comment-14115088 ]
Qiping Li commented on SPARK-3272:
----------------------------------

Hi Joseph,

Sorry for the late reply. I still think we should store the number of instances for the left and right children, because whether a node is a leaf is determined by whether the best split sends enough instances to both the left and the right child. Even if a node has enough instances, it should still be a leaf if the best split doesn't satisfy the min instances requirement.

As for the invalid information gain value, it is just a constant that denotes that a split makes no sense because it doesn't satisfy the min info gain or min instances per node requirements. I think there should be a specific value for this: an invalid split should be marked as invalid so that the main loop knows not to pick it, even though we can calculate an info gain for it.

> Calculate prediction for nodes separately from calculating information gain
> for splits in decision tree
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3272
>                 URL: https://issues.apache.org/jira/browse/SPARK-3272
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Qiping Li
>             Fix For: 1.1.0
>
>
> In the current implementation, the prediction for a node is calculated along
> with the information gain stats for each possible split. The value to
> predict for a specific node is determined no matter what the splits are.
> To save computation, we can calculate the prediction first and then
> calculate the information gain stats for each split.
> This is also necessary if we want to support the minimum instances per node
> parameter ([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]),
> because when no split satisfies the minimum instances requirement, we
> don't use the information gain of any split. There should be a way to get the
> prediction value.
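The leaf rule and the invalid-gain constant described in the comment above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the MLlib implementation: the names SplitStats, INVALID_GAIN, effective_gain, choose_split, and node_prediction are hypothetical, the sentinel is assumed to be negative infinity, and the prediction is assumed to be a simple majority label for classification.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical sentinel (an assumption, not the MLlib constant): any split
# carrying this gain is treated as invalid and is never selected.
INVALID_GAIN = float("-inf")


@dataclass
class SplitStats:
    """Per-split statistics, including the child instance counts."""
    gain: float        # information gain computed for this split
    left_count: int    # instances routed to the left child
    right_count: int   # instances routed to the right child


def effective_gain(s: SplitStats, min_instances_per_node: int,
                   min_info_gain: float) -> float:
    """Return the split's gain, or INVALID_GAIN if it violates a constraint."""
    if (s.left_count < min_instances_per_node
            or s.right_count < min_instances_per_node):
        return INVALID_GAIN
    if s.gain < min_info_gain:
        return INVALID_GAIN
    return s.gain


def choose_split(candidates: List[SplitStats], min_instances_per_node: int,
                 min_info_gain: float) -> Optional[SplitStats]:
    """Pick the valid split with the highest gain; None means 'make a leaf'."""
    best, best_gain = None, INVALID_GAIN
    for s in candidates:
        g = effective_gain(s, min_instances_per_node, min_info_gain)
        if g > best_gain:  # an invalid split can never pass this test
            best, best_gain = s, g
    return best


def node_prediction(label_counts: Dict[int, int]) -> int:
    """Majority label for the node, computed independently of any split."""
    return max(label_counts, key=label_counts.get)
```

Because node_prediction depends only on the node's own label counts, a node whose candidate splits are all invalid (choose_split returns None) still has a prediction available, which is the situation the issue description raises.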