Rich Ulrich <[EMAIL PROTECTED]> wrote:
> 
> Here is some more of what Gene posted --
> "Of 20 variables measured to account for the variability in species 
> richness, total deposition of inorganic N (Ndep, kg N ha^-1 y^-1) was
> the most important predictor, explaining more than half of the
> variation in the number of species per quadrat (Fig. 2A and Eq. 1)....
> "  After accounting for N deposition, mean annual precipitation (MAP,
> mm) explained an additional 8% of variability in species richness. A
> further 5% was explained by the A horizon soil pH (Top pH, Fig. 2B)
> and 3% by altitude (Alt, m). In total, 70% of the variability in
> species richness could be explained by these four variables: ... "
> 
> 
> Stepwise, I have said before, can give you a shorter list
> of variables when you have a list where everything matters.
> 
> Especially, it can give you the *first*  variable, if one of them
> stands out from the others.  In the above, Deposition does
> account for a huge share of variance;  what is unstated (here,
> at least) is whether any of the other (presumably correlated)
> measures were anywhere close to that fraction, univariate.
> 
> Stepwise is *famous* for being really lousy at giving you
> the number two and three and four when the relative shares 
> of Variance are (for instance) 54, 8, 5, and 3.  If they were
> searching  for 'explanation'  rather than a shorter prediction
> equation, then the authors stumbled badly -- if the stepwise
> result is all they relied on.  Again, I have not seen the paper,
> so I want my aspersions  to be read as being somewhat 
> hypothetical, or as being cast against the worst-case scenario.


I think my primary objection to Gene's argument was that he seemed to
suggest that stepwise was superior because it's objective (I hope I did
not misunderstand).  It is "objective" in the sense that the machine is
making the decisions, but there is every reason to believe that the
machine is likely not making the _correct_ decisions, so the objectivity
doesn't really help.

Interestingly, one of Steyerberg's recent papers shows that even
saturating the model beyond the usual N:predictors ratio will give better
estimates than stepwise (again, I am referring to the naive stepwise that
almost everyone seems to use, with no correction for all the candidate df
used in the search).  Apparently, the newer approaches, like Tibshirani's
lasso method, do better, but then again, people may not like the fact that
their nice models disappear when the appropriate corrections are
implemented.
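
To make the "no correction for the candidate df" point concrete, here is
a minimal simulation sketch. It is my own illustration, not taken from
Steyerberg's paper: the sample size, the number of candidates, and the
function names are all assumptions, and the selection rule is a
simplified greedy forward search that adds a fixed number of variables
rather than using the F-to-enter or p-value rules that packages apply.
The outcome is pure noise, so any R^2 reflects only chance. With four
pre-specified predictors and n = 50, the expected in-sample R^2 under
the null is about 4/49; the stepwise pick from 20 candidates should come
out noticeably higher.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an OLS fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / tss

def forward_stepwise(X, y, n_keep):
    """Greedy forward selection: each step adds the column that raises R^2 most.
    Simplification (assumption): stops after n_keep steps, no entry test."""
    selected = []
    for _ in range(n_keep):
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        best = max(remaining, key=lambda j: r_squared(X[:, selected + [j]], y))
        selected.append(best)
    return selected

def simulate(n_sims=200, n=50, p=20, n_keep=4, seed=0):
    """Mean in-sample R^2 for stepwise-chosen vs. pre-specified predictors
    when y is unrelated to every candidate."""
    rng = np.random.default_rng(seed)
    step_r2, fixed_r2 = [], []
    for _ in range(n_sims):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)  # pure noise: no predictor truly matters
        chosen = forward_stepwise(X, y, n_keep)
        step_r2.append(r_squared(X[:, chosen], y))
        fixed_r2.append(r_squared(X[:, :n_keep], y))  # first four, fixed in advance
    return float(np.mean(step_r2)), float(np.mean(fixed_r2))

if __name__ == "__main__":
    mean_step, mean_fixed = simulate()
    print(f"mean R^2, stepwise-selected : {mean_step:.3f}")
    print(f"mean R^2, pre-specified     : {mean_fixed:.3f}")
```

The gap between the two averages is the optimism that the search itself
creates; the usual reported R^2 and p-values for the final model take no
account of it.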

From what I understand of the literature, naive stepwise just doesn't do
what we hope unless the sample size is huge and/or the model is relatively
simple.  But if one has such a sample and parsimony is the aim, it still
seems better to estimate the model with all the candidate predictors, and
remove them "by hand" based on the cost/benefit of removing the variable 
from the prediction.
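
One way to tabulate that cost/benefit, sketched under my own
assumptions: fit the full model once, then record how much R^2 is lost
when each predictor is dropped on its own. The data are simulated, the
names x0..x4 are hypothetical, and the R^2 drop is just one possible
measure of "cost"; measurement expense or subject-matter relevance would
enter the decision by hand.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an OLS fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / tss

def drop_one_costs(X, y, names):
    """Return the full-model R^2 and, per predictor, the R^2 lost by removing it."""
    full = r_squared(X, y)
    costs = {}
    for j, name in enumerate(names):
        reduced = np.delete(X, j, axis=1)  # full model minus column j
        costs[name] = full - r_squared(reduced, y)
    return full, costs

def make_example(n=200, seed=0):
    """Simulated data (assumption): only x0 and x1 truly affect y."""
    rng = np.random.default_rng(seed)
    names = ["x0", "x1", "x2", "x3", "x4"]
    X = rng.standard_normal((n, len(names)))
    y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n)
    return X, y, names

if __name__ == "__main__":
    X, y, names = make_example()
    full, costs = drop_one_costs(X, y, names)
    print(f"full-model R^2: {full:.3f}")
    for name, cost in sorted(costs.items(), key=lambda kv: kv[1]):
        print(f"  drop {name}: R^2 falls by {cost:.4f}")
```

The analyst, not the machine, then decides which of the cheap-to-drop
variables actually go, and the full-model estimates remain available as
the reference.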

Mike Babyak
=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at:
.                  http://jse.stat.ncsu.edu/                    .
=================================================================