On Mon, 29 Mar 2004 09:31:11 -0500, Bruce Weaver <[EMAIL PROTECTED]> wrote:
> Rich Ulrich wrote:  [snip, post and my answer]
> >
> > That was good advice, above, except for the short word
> > about cross-validation, which was (in my opinion) optimistic.
> > Where stepwise is not a good idea, cross-validation cannot
> > do much to salvage it.
>
> ---- snip ----- BW
>
> Can you elaborate, Rich?  If stepwise selection produced a
> bad model, the cross-validation R-squared will be
> substantially lower than the R-squared for the original
> model, will it not?
>
> Of course, if variables A and B are highly collinear, you
> still have the problem of letting an algorithm choose
> between them rather than making the choice yourself.  Are
> there other issues?

Yes, there are other issues, at least for the data I have been
accustomed to.  To take your example: for collinear A and B, the
*sensible* thing, often, is to use a composite in the first place.
The example below describes how to set up data where the wrong
variables will be selected *and* will look good on replication.

======= copied from a reply I made in November, 2001 =======

...  At least in the social sciences, not all variables are equally
likely to work as predictors, and distinctions *can* be drawn in
advance.  (1) Experts select some individual variables.  (2) Experts
know that there ought to be one composite variable to represent one
syndrome or one hypothesis, whether it is known to work or not.
Typically, ignorance is not total, and it is a big waste of
statistical power to ignore what is known.

[ Sidebar.  DNA -- genes, proteins, and so on -- offers a new venue
for statistics, one where there are thousands of variables that *do*
conceal important, new relations.  These do not fit the
social-science model very readily.  Computerized data collection
provides thousands or millions of data points elsewhere; again, the
familiar model does not hold.  I can't know what 'data mining'
techniques will work 20 years from now, but ignoring what is known
will always entail a waste of statistical power.  Today, the
familiar model is still relevant -- usually. ]

Consider a prediction problem with 15 candidate variables, where 10
ought to be scored as a composite (so I assume, or declare with
omniscience), and 5 are theoretically irrelevant: those 5 were the
'best univariate predictors' out of a hundred or so.

[ 2004 note: creaming off good p-values from multiple, unlikely
tests -- that is the main way I know of to 'capitalize on chance'
to the point where you can hardly hope to recover. ]

When you put the 15 variables into a stepwise selection, the
equation enters one or two 'real' predictors, and then two or three
of the random, irrelevant predictors.  The 'real' predictors that
would serve excellently in a composite commit fratricide against
each other's partial correlations.  The 'best subset' that emerges
therefore preferentially includes the random predictors: because
they are *not* intercorrelated, they do not commit fratricide.

Thus it happens that, *in the social sciences*, the *bad* variables
are preferred by best-subset techniques -- at least in one
conventional model.

MORE GENERALLY -- robust models start out with good variables.
 - Variable distributions with outliers are suspicious.
 - Variables with no face validity are highly suspect.

The good book that I browsed on data mining offered the advice that
intensive data preparation is the single, inviolable requirement for
any type of 'mining.'  A variable needs measurement that is decent
in reliability, but it can also be considered with special regard to
what is being predicted.  [ ... ]

=====
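Since people sometimes ask what this looks like in practice, here is
a small simulation sketch of the setup above.  It is my own
illustration, nothing canonical (assuming Python with numpy, scipy,
and scikit-learn; the variable names and the crude forward-selection
loop are mine): ten collinear 'real' predictors sharing one latent
factor, plus five noise variables creamed off a pool of 100 by
univariate screening on the full sample.

  import numpy as np
  from scipy import stats
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import cross_val_score

  rng = np.random.default_rng(0)
  n = 200

  # Outcome driven by one latent factor.
  factor = rng.normal(size=n)
  y = factor + rng.normal(size=n)

  # Ten collinear 'real' predictors: each is the factor plus noise.
  real = factor[:, None] + rng.normal(scale=1.5, size=(n, 10))

  # One hundred pure-noise candidates; keep the 5 with the best
  # univariate p-values against y -- 'creaming off chance'.
  pool = rng.normal(size=(n, 100))
  pvals = [stats.pearsonr(pool[:, j], y)[1] for j in range(100)]
  noise = pool[:, np.argsort(pvals)[:5]]

  X = np.hstack([real, noise])          # the 15 candidates
  names = ['real%d' % i for i in range(10)] + \
          ['noise%d' % i for i in range(5)]

  # Crude forward stepwise: at each step, add whichever variable
  # most improves in-sample R-squared.  It typically enters one or
  # two 'real' predictors, then the screened noise variables, since
  # the 'real' ones cancel each other's partial correlations.
  chosen = []
  for _ in range(5):
      best_j, best_r2 = None, -np.inf
      for j in range(15):
          if j in chosen:
              continue
          cols = chosen + [j]
          r2 = LinearRegression().fit(X[:, cols], y) \
                                 .score(X[:, cols], y)
          if r2 > best_r2:
              best_j, best_r2 = j, r2
      chosen.append(best_j)
  print('stepwise picked:', [names[j] for j in chosen])

  # Naive 5-fold CV of the selected subset tends to look
  # respectable, because the univariate screening already used
  # every observation -- the selection bias leaks into every fold.
  cv_r2 = cross_val_score(LinearRegression(), X[:, chosen], y, cv=5)
  print('naive CV R-squared: %.3f' % cv_r2.mean())

  # A genuinely fresh sample tells the real story.  The 'noise'
  # columns are re-drawn, since they were pure noise to begin with.
  factor2 = rng.normal(size=n)
  y2 = factor2 + rng.normal(size=n)
  real2 = factor2[:, None] + rng.normal(scale=1.5, size=(n, 10))
  X2 = np.hstack([real2, rng.normal(size=(n, 5))])
  model = LinearRegression().fit(X[:, chosen], y)
  print('fresh-sample R-squared, stepwise model: %.3f'
        % model.score(X2[:, chosen], y2))

  # The simple composite of the ten 'real' predictors replicates.
  composite2 = real2.mean(axis=1)
  print('fresh-sample r, composite: %.3f'
        % stats.pearsonr(composite2, y2)[0])

The last lines are the honest check: on a genuinely fresh sample,
the composite of the ten 'real' predictors holds up, while the model
built on the screened noise variables falls off.  That is the sense
in which cross-validation, run after the creaming has already
happened, cannot salvage the selection.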
> Cheers,
> Bruce

--
http://www.pitt.edu/~wpilib/index.html
I need a new job, after March 31.  Openings?
