[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328592#comment-15328592 ]
Manoj Kumar commented on SPARK-3155:
------------------------------------

I would like to add support for pruning DecisionTrees as part of my internship.

Some API-related questions:

Support for DecisionTree pruning in R is done in this way:

    prune(fit, cp=)

A straightforward starting point would be:

    model.prune(validationData, errorTol=)

where model is a fitted DecisionTreeRegressionModel, and pruning would stop when the improvement in error is not above a certain tolerance.

Does that sound like a good idea?

> Support DecisionTree pruning
> ----------------------------
>
>                 Key: SPARK-3155
>                 URL: https://issues.apache.org/jira/browse/SPARK-3155
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision trees. A smart implementation can prune the tree during training in order to avoid training parts of the tree which would eventually be pruned anyway. DecisionTree does not currently support pruning.
> Pruning: A “pruning” of a tree is a subtree with the same root node, but with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for each validation example. This allows one to compute the validation error made at each node in the tree (based on the predictions computed in step (2)).
> (4) For each pair of leaves with the same parent, compare the total error on the validation set made by the leaves’ predictions with the error made by the parent’s predictions. Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the validation set made by each node as it is trained.
> Whenever two children increase the validation error, they are pruned, and no more training is required on that branch.
> It is common to use about 1/3 of the data for pruning. Note that pruning is important when using a tree directly for prediction. It is less important when combining trees via ensemble methods.
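
For illustration only, here is a minimal Python sketch of the naive procedure in steps (1) to (4) of the issue description, combined with the error tolerance proposed in the comment. It is not Spark code and does not use any MLlib API: the Node class, the prune function, the error_tol parameter, and the example tree are all hypothetical names made up for this sketch. It assumes squared error as the validation metric and reads the tolerance as "keep a pair of leaves only if they reduce validation error by more than error_tol", which is one possible interpretation of the proposal.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Row = Tuple[Sequence[float], float]  # (features, label)


@dataclass
class Node:
    """Hypothetical regression-tree node.

    `prediction` is stored on internal nodes as well as leaves (step 2).
    """
    prediction: float
    feature: int = 0
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def squared_error(pred: float, rows: List[Row]) -> float:
    """Total squared error of a constant prediction on the given rows."""
    return sum((y - pred) ** 2 for _, y in rows)


def prune(node: Node, validation: List[Row], error_tol: float = 0.0) -> float:
    """Prune bottom-up in place; return the validation error of the result.

    A pair of sibling leaves is removed when it does not improve on the
    parent's validation error by more than `error_tol` (ties are pruned).
    """
    if node.is_leaf():
        return squared_error(node.prediction, validation)

    # Step 3: route validation rows down the tree.
    left_rows = [r for r in validation if r[0][node.feature] <= node.threshold]
    right_rows = [r for r in validation if r[0][node.feature] > node.threshold]
    children_error = (prune(node.left, left_rows, error_tol)
                      + prune(node.right, right_rows, error_tol))
    parent_error = squared_error(node.prediction, validation)

    # Step 4: only a pair of leaves with the same parent is a candidate.
    both_leaves = node.left.is_leaf() and node.right.is_leaf()
    if both_leaves and parent_error - children_error <= error_tol:
        node.left = node.right = None
        return parent_error
    return children_error


def example_tree() -> Node:
    """A small hand-built tree whose right branch overfits."""
    overfit = Node(prediction=2.0, feature=0, threshold=0.8,
                   left=Node(prediction=1.9), right=Node(prediction=5.0))
    return Node(prediction=1.0, feature=0, threshold=0.5,
                left=Node(prediction=0.0), right=overfit)


validation_rows: List[Row] = [([0.2], 0.0), ([0.6], 2.0), ([0.9], 2.0)]
```

Calling prune(example_tree(), validation_rows) collapses the overfit right branch into a single leaf and keeps the root split, because the root's two children reduce the validation error. A large error_tol collapses the root as well. Because leaves are only removed in sibling pairs, a subtree is never replaced unless its own children have already been pruned down to leaves, which matches step (4) rather than more aggressive subtree replacement.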