Frank may be too modest to suggest it, but a great place to start that reading is Chapter 4 of his book "Regression Modeling Strategies".
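For anyone who wants a concrete picture of the quadratic (ridge) penalty mentioned in the thread quoted below before opening the book, here is a minimal sketch. It is not taken from the book or from any R package: it is plain Python, the data are simulated with 35 samples and 35 predictors, the function names (fit_ridge_logistic, simulate, l2_norm) are made up for illustration, and simple gradient descent stands in for a proper fitting routine. The penalty values are arbitrary; in practice the penalty would have to be chosen by something like cross-validation.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_ridge_logistic(X, y, lam, lr=0.05, n_iter=500):
    """Minimise mean log-loss + (lam / 2) * ||beta||^2 by gradient descent.

    Hypothetical helper for illustration; the intercept is not penalised.
    """
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # residuals: fitted probability minus observed 0/1 outcome
        resid = [
            sigmoid(intercept + sum(b * x for b, x in zip(beta, row))) - yi
            for row, yi in zip(X, y)
        ]
        # gradient of the penalised objective with respect to each coefficient
        grad = [
            sum(r * row[j] for r, row in zip(resid, X)) / n + lam * beta[j]
            for j in range(p)
        ]
        intercept -= lr * sum(resid) / n
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return intercept, beta


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def simulate(n=35, p=35, seed=1):
    """Simulated data (an assumption): only the first three predictors matter."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [1 if rng.random() < sigmoid(row[0] + row[1] - row[2]) else 0 for row in X]
    return X, y


X, y = simulate()
_, beta_weak = fit_ridge_logistic(X, y, lam=0.01)    # almost no penalty
_, beta_strong = fit_ridge_logistic(X, y, lam=10.0)  # heavy penalty

print("coefficient norm, weak penalty:  ", l2_norm(beta_weak))
print("coefficient norm, strong penalty:", l2_norm(beta_strong))
```

The heavier penalty shrinks all 35 coefficients toward zero rather than picking a subset, which is the sense in which penalization "not necessarily with variable selection" is meant in the thread. It does nothing to identify which predictors are "truly important".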
On Sep 3, 2009, at 1:45 PM, Frank E Harrell Jr wrote:

> You'll need to do a huge amount of background reading first. These
> stepwise options do not incorporate penalization.
>
> Frank
>
> annie Zhang wrote:
>> Hi, Frank,
>> If I want to do prediction as well as to select important
>> predictors, which may be the best function to use when I have 35
>> samples and 35 predictors (penalized logistic with variable
>> selection)? I saw there is a 'fastbw' function in the Design
>> package. And there is a 'step.plr' function in the 'stepPlr' package.
>> Thank you,
>> Annie
>>
>> On Thu, Sep 3, 2009 at 10:11 AM, Frank E Harrell Jr
>> <f.harr...@vanderbilt.edu> wrote:
>>
>>   annie Zhang wrote:
>>
>>     Thank you for all your replies.
>>     Actually, as Bert said, besides prediction, I also need variable
>>     selection (I need to know which variables are important). As for
>>     the sample size and number of variables, both are small, around
>>     35. How can I get accurate prediction as well as good
>>     predictors?
>>     Annie
>>
>>   It is next to impossible to find a unique list of 'important'
>>   variables without having 50 times as many subjects as potential
>>   predictors, unless your signal:noise ratio is stunning.
>>   Frank
>>
>>     On Thu, Sep 3, 2009 at 8:28 AM, Bert Gunter
>>     <gunter.ber...@gene.com> wrote:
>>
>>       But let's be clear here folks:
>>
>>       Ben's comment is apropos: "'As many variables as samples' is
>>       particularly scary."
>>
>>       (Aside -- how much scarier then are -omics analyses in which
>>       the number of variables is thousands of times the number of
>>       samples?)
>>
>>       Sensible penalization (it's usually not too sensitive to the
>>       details) is only another way of obtaining a parsimonious model
>>       with good (in the sense of minimizing overall prediction
>>       error: bias + variance) prediction properties.
>>       Alas, this is often not what scientists want: they use variable
>>       selection to find the "right" covariates, the "most important"
>>       variables affecting the response. But this is beyond the power
>>       of empirical modeling here: "as many variables as samples"
>>       almost guarantees that there will be many different and even
>>       nonoverlapping subsets of variables that are, within
>>       statistical noise, equally "optimal" predictors. That is,
>>       variable selection in such circumstances is just a pretty
>>       sophisticated random number generator -- ergo Frank's Draconian
>>       warnings. Penalization produces better prediction engines with
>>       better properties, but it cannot overcome the "as many
>>       variables as samples" problem either. Entropy rules. If what is
>>       sought is a way to determine the "truly important" variables,
>>       then the study must be designed to provide the information to
>>       do so. You don't get something for nothing.
>>
>>       Cheers,
>>       Bert Gunter
>>       Genentech Nonclinical Biostatistics
>>
>>       -----Original Message-----
>>       From: r-help-boun...@r-project.org
>>       [mailto:r-help-boun...@r-project.org] On Behalf Of
>>       Frank E Harrell Jr
>>       Sent: Wednesday, September 02, 2009 9:07 PM
>>       To: annie Zhang
>>       Cc: r-help@r-project.org
>>       Subject: Re: [R] variable selection in logistic
>>
>>       annie Zhang wrote:
>>       > Hi, Frank,
>>       >
>>       > You mean the backward and forward stepwise selection is bad?
>>       > You also suggest the penalized logistic regression is the
>>       > best choice? Is there any function to do it as well as
>>       > selecting the best penalty?
>>       >
>>       > Annie
>>
>>       All variable selection is bad unless it's in the context of
>>       penalization. You'll need penalized logistic regression, not
>>       necessarily with variable selection, for example a quadratic
>>       penalty as in a case study in my book, or an L1 penalty (lasso)
>>       using other packages.
>>
>>       Frank
>>
>>       > On Wed, Sep 2, 2009 at 7:41 PM, Frank E Harrell Jr
>>       > <f.harr...@vanderbilt.edu> wrote:
>>       >
>>       >   David Winsemius wrote:
>>       >
>>       >     On Sep 2, 2009, at 9:36 PM, annie Zhang wrote:
>>       >
>>       >       Hi, R users,
>>       >
>>       >       What may be the best function in R to do variable
>>       >       selection in logistic regression?
>>       >
>>       >     PhD theses, and books by famous statisticians have been
>>       >     pursuing the answer to that question for decades.
>>       >
>>       >       I have the same number of variables as the number of
>>       >       samples, and I want to select the best variables for
>>       >       prediction. Is there any function doing forward
>>       >       selection followed by backward elimination in stepwise
>>       >       logistic regression?
>>       >
>>       >     You should probably be reading up on penalized
>>       >     regression methods. The stepwise procedures reporting
>>       >     unadjusted "significance" made available by SAS and SPSS
>>       >     to the unwary neophyte user have very poor statistical
>>       >     properties.
>>       >
>>       >     --
>>       >     David Winsemius, MD
>>       >
>>       >   Amen to that.
>>       >
>>       >   Annie, resist the temptation. These methods bite.
>>       >
>>       >   Frank
>>       >
>>       >     Heritage Laboratories
>>       >     West Hartford, CT
>>       >
>>       >     ______________________________________________
>>       >     R-help@r-project.org mailing list
>>       >     https://stat.ethz.ch/mailman/listinfo/r-help
>>       >     PLEASE do read the posting guide
>>       >     http://www.R-project.org/posting-guide.html
>>       >     and provide commented, minimal, self-contained,
>>       >     reproducible code.
>>       >
>>       >   --
>>       >   Frank E Harrell Jr   Professor and Chair   School of Medicine
>>       >   Department of Biostatistics   Vanderbilt University

Don McKenzie
Research Ecologist
Pacific Wildland Fire Sciences Lab
US Forest Service

Affiliate Professor
College of Forest Resources and CSES Climate Impacts Group
University of Washington

phone: 206-732-7824
cell: 206-321-5966
d...@u.washington.edu