[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328487#comment-15328487
 ] 

Manoj Kumar commented on SPARK-3155:
------------------------------------

I would like to add support for pruning DecisionTrees as part of my internship.

Some API related questions:

Support for DecisionTree pruning in R is done in this way:

prune(fit, cp=)

A very straightforward extension would be to start would be to:

model.prune(validationData, errorTol=)

where model is a fit DecisionTreeRegressionModel would stop pruning when the 
improvement in error is not above a certain tolerance. Does that sound like a 
good idea?


> Support DecisionTree pruning
> ----------------------------
>
>                 Key: SPARK-3155
>                 URL: https://issues.apache.org/jira/browse/SPARK-3155
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyways.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leafs with the same parent, compare the total error on 
> the validation set made by the leafs’ predictions with the error made by the 
> parent’s predictions.  Remove the leafs if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to