On 6-May-2004, [EMAIL PROTECTED] wrote:

> > I recommend first developing a single-tree model, which is excellent for
> > getting a visual picture of the model and looking for significant
> > variables and interactions. Then, for significantly increased accuracy,
> > I would build a TreeBoost model consisting of a series of boosted trees.
> > TreeBoost typically has comparable accuracy to neural networks.
>
> I have to chime in here and say that I find this assertion quite
> surprising. My admittedly limited impression of NN so far is that when
> put to the test of generalization/replication, they do no better than
> classical algorithms, save the very occasional hidden non-linearity or
> two. I honestly can't imagine how a tree technique isn't subject to the
> same limitations regarding inference as traditional models--what you
> suggest sounds to me like a perfect recipe for a model that is
> well-fitted to the sample, but has little chance of having much to do
> with the population from which the sample is drawn. Just because the
> technique doesn't use traditional concepts such as standard errors
> doesn't mean that they don't apply. I'd be happy to be proved wrong.
> Do you have any data on the success of what you suggest?
Classical decision trees are appreciated because (1) the model they present is clear, (2) they can be understood by people who are not mathematically inclined, and (3) scoring can be done easily from the tree diagram without a computer. In many cases they also provide very good predictive accuracy. However, it is recognized both inside and outside the decision-tree community that other techniques, such as neural networks, can provide better accuracy for a significant class of problems.

To address this limitation, decision-tree theorists developed "ensemble" tree models, which combine multiple trees to improve accuracy. Here is a rough outline of the developmental history:

1. The first approach was called "bagging". It generated many trees, each grown on a bootstrap sample of the rows (drawn at random with replacement). The resulting trees "voted" on the best classification, or averaged their predictions for regression problems. Bagging significantly reduced the variance in the predicted values but did not significantly increase the predictive accuracy.

2. The AdaBoost boosting method was applied to decision trees. I won't go into the details, but AdaBoost builds a series of trees, with each new tree giving extra weight to the cases that the earlier trees in the series classified poorly. It is common for the series to consist of hundreds of small trees. AdaBoost was a major advancement that significantly increased predictive accuracy.

3. The TreeBoost method was developed. TreeBoost is similar to AdaBoost in the sense that a series of trees is created, but it has several significant differences: (1) TreeBoost is designed specifically for boosting trees, whereas AdaBoost is a more general boosting technique; (2) TreeBoost uses a different formulation for the values that are fed forward into the next tree: each new tree is fit to the residual errors left by the series so far; (3) TreeBoost fits each tree to a random subset of the rows, which introduces a stochastic component into the analysis. The result is a tremendous improvement in predictive accuracy over single-tree models and a significant jump over AdaBoost.

4. Random Forests were developed. Random Forests are similar to bagging in the sense that many trees are grown in parallel on bootstrap samples and they "vote" on the outcome, but in addition only a random subset of the predictor variables is considered when each split is made. For reasons that aren't fully understood, adding this randomization significantly increases the predictive accuracy.

So the current state of the art in decision-tree models is TreeBoost models and Random Forest models. One method works better for some cases and the other for different cases, but the results tend to be comparable. However, the disadvantage of both of these methods (and of neural networks) is that you don't have a simple model that you can visualize the way you can with a single decision tree. (A rough code sketch comparing these methods follows.)
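[Editorial illustration, not part of Phil's original post: the sketch below uses scikit-learn, a modern library rather than the DTREG software described above, to compare a single tree with the ensemble methods just listed on a synthetic dataset. The dataset, the parameter settings, and the use of GradientBoostingClassifier with subsample < 1 as a stand-in for TreeBoost (i.e. Friedman's stochastic gradient boosting) are assumptions made for the demonstration.]

# Minimal sketch, assuming scikit-learn; the data and settings below are
# illustrative, not taken from the post.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)

# Synthetic classification data standing in for a real modeling problem.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)

models = {
    # A single classical tree: easy to draw and to score by hand.
    "single tree":       DecisionTreeClassifier(max_depth=5, random_state=0),
    # 1. Bagging: many trees grown on bootstrap samples; predictions are
    #    voted/averaged.
    "bagging":           BaggingClassifier(n_estimators=200, random_state=0),
    # 2. AdaBoost: a sequential series of trees, each re-weighting the cases
    #    the earlier trees got wrong.
    "AdaBoost":          AdaBoostClassifier(n_estimators=200, random_state=0),
    # 3. Stochastic gradient boosting (the TreeBoost formulation): each small
    #    tree is fit to the residuals of the series so far, using a random
    #    subsample of the rows.
    "gradient boosting": GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                                    subsample=0.5, random_state=0),
    # 4. Random Forest: bagging plus a random subset of predictors at each split.
    "random forest":     RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print("%-18s %.3f" % (name, scores.mean()))

On data like this the ensembles usually beat the single tree by a clear margin, with the boosted and random-forest models landing close to one another, which is consistent with the comparison above; the exact numbers will vary with the data and the settings.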
There is a moderate amount of published research comparing single-tree models, AdaBoost, TreeBoost, and Random Forests, but there isn't much comparing them with neural networks. I posted a message last week asking for data that had been fitted using neural networks so that I could model it with TreeBoost, but I haven't received any replies suitable for a good comparison. Another person posted a link to one article that does make a comparison with neural networks. That report was produced by people who favor neural networks, and it shows NN edging out boosted trees in the majority of the cases, but the results were very close. (I don't have the URL of that paper handy, but I will look it up tomorrow and post another message.)

On the other hand, Leo Breiman, who is one of the most distinguished researchers in the field of decision trees, asserts that Random Forests have unparalleled predictive accuracy and are superior to all other methods, including NN.

--
Phil Sherrod
(phil.sherrod 'at' sandh.com)
http://www.dtreg.com  (decision tree modeling)
http://www.nlreg.com  (nonlinear regression)
