But let's be clear here, folks. Ben's comment is apropos: "'As many variables as samples' is particularly scary."
(Aside -- how much scarier then are -omics analyses in which the number of variables is thousands of times the number of samples?)

Sensible penalization (it's usually not too sensitive to the details) is only another way of obtaining a parsimonious model with good (in the sense of minimizing overall prediction error: bias + variance) prediction properties. Alas, this is often not what scientists want: they use variable selection to find the "right" covariates, the "most important" variables affecting the response. But this is beyond the power of empirical modeling here: "as many variables as samples" almost guarantees that there will be many different and even nonoverlapping subsets of variables that are, within statistical noise, equally "optimal" predictors. That is, variable selection in such circumstances is just a pretty sophisticated random number generator -- ergo Frank's Draconian warnings.

Penalization produces better prediction engines with better properties, but it cannot overcome the "as many variables as samples" problem either. Entropy rules. If what is sought is a way to determine the "truly important" variables, then the study must be designed to provide the information to do so. You don't get something for nothing.

Cheers,

Bert Gunter
Genentech Nonclinical Biostatistics

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Frank E Harrell Jr
Sent: Wednesday, September 02, 2009 9:07 PM
To: annie Zhang
Cc: r-help@r-project.org
Subject: Re: [R] variable selection in logistic

annie Zhang wrote:
> Hi, Frank,
>
> You mean the backward and forward stepwise selection is bad? You also
> suggest the penalized logistic regression is the best choice? Is there
> any function to do it as well as selecting the best penalty?
>
> Annie

All variable selection is bad unless it's in the context of penalization.
You'll need penalized logistic regression, not necessarily with variable selection: for example, a quadratic penalty as in a case study in my book, or an L1 penalty (lasso) using other packages.

Frank

> On Wed, Sep 2, 2009 at 7:41 PM, Frank E Harrell Jr
> <f.harr...@vanderbilt.edu> wrote:
>
>     David Winsemius wrote:
>
>         On Sep 2, 2009, at 9:36 PM, annie Zhang wrote:
>
>             Hi, R users,
>
>             What may be the best function in R to do variable selection
>             in logistic regression?
>
>         PhD theses, and books by famous statisticians have been pursuing
>         the answer to that question for decades.
>
>             I have the same number of variables as the number of samples,
>             and I want to select the best variables for prediction. Is
>             there any function doing forward selection followed by
>             backward elimination in stepwise logistic regression?
>
>         You should probably be reading up on penalized regression
>         methods. The stepwise procedures reporting unadjusted
>         "significance" made available by SAS and SPSS to the unwary
>         neophyte user have very poor statistical properties.
>
>         --
>
>         David Winsemius, MD
>
>     Amen to that.
>
>     Annie, resist the temptation. These methods bite.
>
>     Frank
>
>         Heritage Laboratories
>         West Hartford, CT
>
>         ______________________________________________
>         R-help@r-project.org mailing list
>         https://stat.ethz.ch/mailman/listinfo/r-help
>         PLEASE do read the posting guide
>         http://www.R-project.org/posting-guide.html
>         and provide commented, minimal, self-contained, reproducible code.
>
> --
> Frank E Harrell Jr   Professor and Chair   School of Medicine
> Department of Biostatistics   Vanderbilt University

--
Frank E Harrell Jr   Professor and Chair   School of Medicine
Department of Biostatistics   Vanderbilt University

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
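
Editorial sketch (not part of the original thread): the quadratic penalty Frank mentions can be illustrated with a small simulation. This is a minimal sketch in standard-library Python rather than R, and it is not the method from Frank's book or from any R package. The function names (`make_data`, `fit_ridge_logistic`, `coef_norm`), the simulated data, the two penalty values, and the plain gradient-descent fit are all assumptions chosen for illustration. It sets up the thread's "as many variables as samples" case (20 of each) and fits the same logistic model under a weak and a strong quadratic penalty.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_data(n, p, seed=0):
    """Simulate n samples of p Gaussian predictors (hypothetical data).

    Only the first two predictors influence the binary response.
    """
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [1 if row[0] + row[1] + rng.gauss(0.0, 1.0) > 0 else 0 for row in X]
    return X, y


def fit_ridge_logistic(X, y, lam, steps=500, lr=0.05):
    """Gradient descent on mean log-loss + (lam / 2) * sum(beta_j ** 2).

    The intercept b0 is left unpenalized. Step count and learning rate
    are illustrative choices, not tuned values.
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(steps):
        g0 = 0.0
        grad = [lam * b for b in beta]  # gradient of the quadratic penalty
        for xi, yi in zip(X, y):
            resid = sigmoid(b0 + sum(b * x for b, x in zip(beta, xi))) - yi
            g0 += resid / n
            for j in range(p):
                grad[j] += resid * xi[j] / n
        b0 -= lr * g0
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return b0, beta


def coef_norm(beta):
    """Euclidean norm of the penalized coefficients."""
    return math.sqrt(sum(b * b for b in beta))


# As many variables as samples, as in the original question.
X, y = make_data(n=20, p=20)
for lam in (0.01, 1.0):
    _, beta = fit_ridge_logistic(X, y, lam)
    print(f"lambda={lam}: ||beta|| = {coef_norm(beta):.3f}")
```

The stronger penalty shrinks the coefficient vector toward zero, which is what stabilizes prediction. The sketch does not choose the penalty (that would need something like cross-validation), and, in line with Bert's point, shrinkage by itself does not reveal which variables are "truly important".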