On 6-May-2004, [EMAIL PROTECTED] wrote:

> > I recommend first developing a single-tree model which is excellent for
> > getting a visual picture of the model and looking for significant variables
> > and interactions.  Then, for significantly increased accuracy, I would
> > build
> > a TreeBoost model consisting of a series of boosted trees.  TreeBoost
> > typically
> > has comparable accuracy to neural networks.
>
>
> I have to chime in here and say that I find this assertion quite
> surprising.  My admittedly limited impression of NN so far is that when
> put to the test of generalization/replication, they do no better than
> classical algorithms, save the very occasional hidden non-linearity or
> two.  I honestly can't imagine how a tree technique isn't subject to the
> same limitations regarding inference as traditional models--what you
> suggest sounds to me like a perfect recipe for a model that is
> well-fitted to the sample, but has little chance of having much to do
> with the population from which the sample is drawn.  Just because the
> technique doesn't use traditional concepts such as standard errors
> doesn't mean that they don't apply.  I'd be happy to be proved wrong.
> Do you have any data on the success of what you suggest?

Classical decision trees are appreciated because (1) they present a clear,
visual model, (2) they can be understood by people who are not mathematically
inclined, and (3) scoring can be done easily from the tree diagram without
requiring a computer.  In many cases they also provide very good predictive
accuracy.  However, it is recognized both inside and outside the decision-tree
community that other techniques such as neural networks can provide better
accuracy for a significant class of problems.
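
To make that clarity concrete, here is a minimal sketch of fitting a single
classification tree and printing it as a set of nested rules.  This is purely
my own illustration, not how DTREG works; it assumes the scikit-learn Python
library and its bundled iris data, and the depth limit is an arbitrary choice.

    # Minimal sketch (illustration only): fit one small classification tree
    # and print it as nested if/else rules that could be scored by hand.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    X, y = iris.data, iris.target

    # A shallow tree keeps the printed diagram small enough to read.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X, y)

    print(export_text(tree, feature_names=iris.feature_names))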

To address this limitation, decision-tree researchers developed what are called
"ensemble" tree models, which combine multiple trees to improve accuracy.
Here is a rough outline of the developmental history (a short code sketch
illustrating these approaches follows the list):

1. The first approach was called "bagging" (bootstrap aggregation).  It
generated many trees, each grown on a bootstrap sample of the data, that is,
rows selected at random with replacement.  The resulting trees "voted" on the
best classification or averaged their predictions for regression problems.
Bagging significantly reduced the variance in the predicted values but did not
significantly increase the predictive accuracy.

2. The AdaBoost boosting method was applied to decision trees.  I won't go into
the details, but AdaBoost builds a series of trees in which the training cases
are re-weighted at each step, so each new tree concentrates on the cases the
earlier trees handled poorly.  It is common for the series to consist of
hundreds of small trees.  AdaBoost was a major advance that significantly
increased predictive accuracy.

3. The TreeBoost method was developed.  TreeBoost is similar to AdaBoost in the
sense that a series of trees is created, but it has several significant
differences: (1) TreeBoost is designed specifically for boosting trees, whereas
AdaBoost is a more general boosting technique; (2) TreeBoost uses a different
formulation, based on residuals, for computing the values that are fed forward
into the next tree; (3) TreeBoost trains each tree on a random subsample of the
rows, which introduces a stochastic component into the analysis.  The result is
a tremendous improvement in predictive accuracy over single-tree models and a
significant improvement over AdaBoost.

4. Random Forests were developed.  Random Forests are similar to bagging in the
sense that many trees are grown in parallel and they "vote" on the outcome.
As in bagging, the rows for each tree are selected at random with replacement,
but in addition a random subset of the predictor variables is considered at
each split.  For reasons that aren't fully understood, adding this extra
randomization significantly increases the predictive accuracy.
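
To make the outline above concrete, here is a rough sketch of how the four
approaches can be tried side by side.  This is my own illustration, not code
from DTREG or from any of the papers mentioned; it assumes the scikit-learn
Python library, and since TreeBoost itself is DTREG's implementation of
Friedman's stochastic gradient boosting, GradientBoostingClassifier with a
subsample fraction below 1 stands in for it here.  The data set and every
parameter value are arbitrary choices, not recommendations.

    # Rough sketch (illustration only) of the four ensemble approaches
    # described above, tried side by side on a synthetic data set.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                                  GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    models = {
        # Baseline: a single tree, easy to draw and to score by hand.
        "single tree": DecisionTreeClassifier(random_state=0),
        # 1. Bagging: each tree is grown on a bootstrap sample (rows drawn
        #    with replacement) and the trees vote.
        "bagging": BaggingClassifier(n_estimators=100, random_state=0),
        # 2. AdaBoost: a long series of small trees, each re-weighting the
        #    cases the earlier trees got wrong.
        "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
        # 3. Stochastic gradient boosting (the idea behind TreeBoost):
        #    each tree is fit to the current residuals using a random
        #    subsample of the rows.
        "gradient boosting": GradientBoostingClassifier(
            n_estimators=100, subsample=0.5, random_state=0),
        # 4. Random Forest: bagging plus a random subset of the predictors
        #    considered at each split.
        "random forest": RandomForestClassifier(n_estimators=100,
                                                random_state=0),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print("%-18s accuracy = %.3f" % (name, scores.mean()))

On most tabular data the ensemble methods typically beat the single tree by a
clear margin, which is the pattern the outline above describes.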

So the current state of the art in decision-tree modeling is TreeBoost and
Random Forest models.  One method works better on some problems and the other
on others, but the results tend to be comparable.  However, the disadvantage of
both of these methods (and of neural networks) is that you don't have a simple
model that you can visualize the way you can a single decision tree.

There is a moderate amount of published research comparing single-tree models,
AdaBoost, TreeBoost and Random Forests, but there isn't much comparing them
with neural networks.  I posted a message last week asking for data sets that
had been fitted with neural networks so that I could model them with TreeBoost,
but I haven't received any replies suitable for a good comparison.  Another
person posted a link to one article that does compare boosted trees with neural
networks.  That report was produced by people who favor neural networks, and it
showed NN edging out boosted trees in the majority of cases, but the results
were very close.  (I don't have the URL of that paper handy, but I will look it
up tomorrow and post another message.)  On the other hand, Leo Breiman, one of
the most distinguished researchers in the field of decision trees, asserts that
Random Forests have unparalleled predictive accuracy and are superior to all
other methods, including neural networks.

-- 
Phil Sherrod
(phil.sherrod 'at' sandh.com)
http://www.dtreg.com  (decision tree modeling)
http://www.nlreg.com  (nonlinear regression)