On Mon, 29 Mar 2004 09:31:11 -0500, Bruce Weaver <[EMAIL PROTECTED]>
wrote:

> Rich Ulrich wrote:
[snip, post and my answer]
> > 
> > That was good advice, above, except for the short word
> > about cross-validation, which was (in my opinion) optimistic. 
> > Where stepwise is not a good idea, cross-validation cannot 
> > do much to salvage it.  
> 
> ---- snip -----

BW > 
> Can you elaborate, Rich?  If stepwise selection produced a 
> bad model, the cross-validation R-squared will be 
> substantially lower than R-squared for the original model, 
> will it not?
> 
> Of course, if variables A and B are highly collinear, you 
> still have the problem of letting an algorithm choose 
> between them rather than making the choice yourself.  Are 
> there other issues?

Yes, there are other issues for the kinds of data I am
accustomed to.  To take your example: for collinear A and B,
the *sensible*  thing, often, is to use the composite in the
first place.
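A minimal sketch of that point (my own construction; the names
trait, A, B, and the noise levels are all made up for illustration):
two collinear measures of one underlying trait, where the simple
composite is a perfectly sensible predictor on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
trait = rng.normal(size=n)                   # the underlying construct
A = trait + rng.normal(scale=0.5, size=n)    # two noisy, collinear measures
B = trait + rng.normal(scale=0.5, size=n)
y = trait + rng.normal(size=n)               # criterion driven by the trait

composite = (A + B) / 2                      # score the composite up front

def r(x, z):
    return np.corrcoef(x, z)[0, 1]

print(f"corr(A, B)         = {r(A, B):+.2f}")   # A and B are highly collinear
print(f"corr(A, y)         = {r(A, y):+.2f}")
print(f"corr(composite, y) = {r(composite, y):+.2f}")
```

In runs like this the composite tends to track y at least as well as
either measure alone, and there is no arbitrary algorithmic choice
between A and B to defend afterward.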

The example below describes how to set up data where
the wrong variables will be selected  *and*  will look good
on replication.

======= copied from a reply I made in November, 2001 -
 ... 
At least in the social sciences, not all variables are equally 
likely to work as predictors, and distinctions *can*  be drawn 
in advance.  (1) Experts select some individual variables.
(2) Experts know that there ought to be one composite 
variable to represent one syndrome or one hypothesis,
whether it is known to work or not;  typically, ignorance is
not total, and it is a big waste of statistical power to ignore
what is known.   

[ sidebar.   DNA:  genes, proteins, etc., offer a new venue 
for statistics, one where there are thousands of variables 
that *do*  conceal important, new relations.  These don't
fit the social science model very readily.  Computerized
data collection provides thousands or millions of data
points elsewhere; again, the familiar model does not hold.
I can't know what "data mining"  techniques will work 20
years from now.  But ignoring what-is-known will always
entail a waste of statistical power. Today, it is still relevant: 
usually. ]

Consider a prediction problem with 15 variables, where 10 
ought to be scored as a composite (so I assume, or declare
with omniscience), and 5 are theoretically irrelevant:  these 
were the 'best univariate predictors'  out of a hundred or so.  

[ 2004 note:  creaming off good p-values from multiple, unlikely
tests -- that is the main way I know of to 'capitalize on
chance'  to the point where you can hardly hope to recover.]
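A hedged illustration of that note (my own construction, not data
from this thread): screen several hundred pure-noise predictors
against y on the *full* sample, keep the winner, and then try a
naive split-half 'cross-validation' inside the same data set.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
y = rng.normal(size=n)
X = rng.normal(size=(n, 500))          # 500 candidate predictors, all pure noise

# univariate screening on the full sample: cream off the best r
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r_all = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
best = int(np.argmax(np.abs(r_all)))
print(f"winner, screened on all data: r = {r_all[best]:+.2f}")

# split-half check of the winner, inside the same data
half = n // 2
r1 = np.corrcoef(X[:half, best], y[:half])[0, 1]
r2 = np.corrcoef(X[half:, best], y[half:])[0, 1]
print(f"first half:  r = {r1:+.2f}")
print(f"second half: r = {r2:+.2f}")
```

Because the winner was chosen using all n cases, both halves tend
to inherit the spurious correlation; only data that played no part
in the screening can expose it.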

When you put the 15 variables into a stepwise selection, 
the equation enters one or two 'real' predictors, and then
two or three of the random, irrelevant predictors.  

The "real" predictors that would serve excellently in
a composite commit fratricide against each other's
partial correlations.  Thus, the "best subset"  that emerges
preferentially includes the random predictors:
because they are *not*  intercorrelated, they do not 
commit fratricide.  Thus it happens that,  *in the social
sciences*, the *bad*  variables are preferred by best-subset
techniques -- in one conventional model.
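A rough simulation of the scenario above (my own construction; the
forward-selection routine is a toy version of stepwise, not any
particular package's implementation, and all the names and sample
sizes are made up): 10 collinear 'real' measures that ought to be a
composite, plus 5 pure-noise variables pre-screened as the best
univariate correlates out of thousands.

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_real = 100, 10
trait = rng.normal(size=n)
y = trait + rng.normal(size=n)

# 10 collinear 'real' measures of the trait (should be one composite)
real = trait[:, None] + rng.normal(size=(n, n_real))

# thousands of pure-noise candidates; keep the 5 best univariate correlates
cand = rng.normal(size=(n, 5000))
Cc = cand - cand.mean(axis=0)
yc = y - y.mean()
r_cand = (Cc.T @ yc) / (np.linalg.norm(Cc, axis=0) * np.linalg.norm(yc))
noise = cand[:, np.argsort(-np.abs(r_cand))[:5]]

X = np.hstack([real, noise])   # columns 0-9 are real, 10-14 are screened noise

def forward_select(X, y, k):
    """Toy forward stepwise: greedily add the variable whose residualized
    values correlate most strongly with the current residual of y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    chosen, pool = [], list(range(X.shape[1]))
    for _ in range(k):
        if chosen:
            Q, _ = np.linalg.qr(Xc[:, chosen])      # project out chosen columns
            resid = yc - Q @ (Q.T @ yc)
            Xr = Xc - Q @ (Q.T @ Xc)
        else:
            resid, Xr = yc, Xc
        scores = [abs(Xr[:, j] @ resid) /
                  (np.linalg.norm(Xr[:, j]) * np.linalg.norm(resid) + 1e-12)
                  for j in pool]
        best = pool[int(np.argmax(scores))]
        chosen.append(best)
        pool.remove(best)
    return chosen

picked = forward_select(X, y, 6)
print(['real' if j < n_real else 'NOISE' for j in picked])
```

The first pick is typically a real measure, but once one or two real
variables are in, the rest of the collinear block has little partial
correlation left, while the pre-screened noise variables, being
uncorrelated with everything, keep theirs intact and enter the model.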

MORE GENERALLY --
Robust models start out with good variables.  
 - Variable distributions with outliers are suspicious.
 - Variables with no face-validity are highly suspect.

The good book that I browsed on data mining offered 
the advice that intensive data preparation is the single,
inviolable  requirement for any type of 'mining.'  A variable
needs to be measured with decent reliability, and it should
also be considered with special regard to what is 
being predicted.  
[ ... ]
=====





> 
> Cheers,
> Bruce

http://www.pitt.edu/~wpilib/index.html
 - I need a new job, after March 31.  Openings? -
=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at:
.                  http://jse.stat.ncsu.edu/                    .
=================================================================
