Re: [R] variable selection in logistic

Don McKenzie Thu, 03 Sep 2009 13:51:07 -0700

Frank may be too modest to suggest it, but a great place to start that
reading is chapter 4 of his book "Regression Modeling Strategies".
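
For anyone finding this thread in the archives, here is a minimal base-R
sketch of the kind of quadratic (ridge) penalty Frank mentions. It rests on
assumptions: the data are simulated, and the function name ridge_logistic,
the penalty values, and the two-signal setup are all made up for
illustration. It is not the case study from the book, nor the method of any
package named in this thread. It only shows a penalty shrinking the slopes
while every predictor stays in the model.

```r
# Sketch only: ridge-penalized logistic regression fitted with optim().
# ridge_logistic, lambda and the simulated data are hypothetical names/values.

set.seed(1)
n <- 35; p <- 35                          # as many variables as samples
x <- scale(matrix(rnorm(n * p), n, p))    # standardized predictors
beta_true <- c(1.5, -1.5, rep(0, p - 2))  # assumed: only two carry signal
y <- rbinom(n, 1, plogis(drop(x %*% beta_true)))

# Numerically stable log(1 + exp(eta))
softplus <- function(eta) pmax(eta, 0) + log1p(exp(-abs(eta)))

ridge_logistic <- function(x, y, lambda) {
  # Penalized negative log-likelihood; the intercept par[1] is not penalized.
  negll <- function(par) {
    eta <- par[1] + drop(x %*% par[-1])
    -sum(y * eta - softplus(eta)) + 0.5 * lambda * sum(par[-1]^2)
  }
  # Analytic gradient of negll
  grad <- function(par) {
    eta <- par[1] + drop(x %*% par[-1])
    r <- plogis(eta) - y
    c(sum(r), drop(crossprod(x, r)) + lambda * par[-1])
  }
  fit <- optim(rep(0, ncol(x) + 1), negll, grad, method = "BFGS",
               control = list(maxit = 500))
  fit$par                                 # intercept first, then p slopes
}

fit_weak   <- ridge_logistic(x, y, lambda = 1)
fit_strong <- ridge_logistic(x, y, lambda = 10)

# A stronger penalty shrinks the slopes toward zero; none is dropped.
c(weak = sum(fit_weak[-1]^2), strong = sum(fit_strong[-1]^2))
```

Nothing in this sketch picks out "important" variables, and the penalty
would still have to be chosen somehow, for example by cross-validation,
which is itself noisy with 35 samples.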


On Sep 3, 2009, at 1:45 PM, Frank E Harrell Jr wrote:

> You'll need to do a huge amount of background reading first.  These  
> stepwise options do not incorporate penalization.
>
> Frank
>
> annie Zhang wrote:
>> Hi, Frank,
>>  If I want to do prediction as well as to select important  
>> predictors, which may be the best function to use when I have 35  
>> samples and 35 predictors (penalized logistic with variable  
>> selection)? I saw there is a 'fastbw' function in the Design  
>> package. And there is a 'step.plr' function in the 'stepPlr' package.
>>  Thank you,
>>  Annie
>> On Thu, Sep 3, 2009 at 10:11 AM, Frank E Harrell Jr
>> <f.harr...@vanderbilt.edu> wrote:
>>     annie Zhang wrote:
>>         Thank you for all your replies.
>>         Actually, as Bert said, besides prediction, I also need variable
>>         selection (I need to know which variables are important). As far
>>         as the sample size and number of variables, both of them are
>>         small, around 35. How can I get accurate prediction as well as
>>         good predictors?
>>         Annie
>>     It is next to impossible to find a unique list of 'important'
>>     variables without having 50 times as many subjects as potential
>>     predictors, unless your signal:noise ratio is stunning.
>>     Frank
>>         On Thu, Sep 3, 2009 at 8:28 AM, Bert Gunter
>>         <gunter.ber...@gene.com> wrote:
>>            But let's be clear here folks:
>>            Ben's comment is apropos: ""As many variables as samples" is
>>            particularly scary."
>>            (Aside -- how much scarier then are -omics analyses in which
>>            the number of variables is thousands of times the number of
>>            samples?)
>>            Sensible penalization (it's usually not too sensitive to the
>>            details) is only another way of obtaining a parsimonious model
>>            with good (in the sense of minimizing overall prediction
>>            error: bias + variance) prediction properties. Alas, this is
>>            often not what scientists want: they use variable selection to
>>            find the "right" covariates, the "most important" variables
>>            affecting the response. But this is beyond the power of
>>            empirical modeling here: "as many variables as samples" almost
>>            guarantees that there will be many different and even
>>            nonoverlapping subsets of variables that are, within
>>            statistical noise, equally "optimal" predictors. That is,
>>            variable selection in such circumstances is just a pretty
>>            sophisticated random number generator -- ergo Frank's
>>            Draconian warnings. Penalization produces better prediction
>>            engines with better properties, but it cannot overcome the "as
>>            many variables as samples" problem either. Entropy rules. If
>>            what is sought is a way to determine the "truly important"
>>            variables, then the study must be designed to provide the
>>            information to do so. You don't get something for nothing.
>>            Cheers,
>>            Bert Gunter
>>            Genentech Nonclinical Biostatistics
>>            -----Original Message-----
>>            From: r-help-boun...@r-project.org
>>            [mailto:r-help-boun...@r-project.org] On Behalf Of
>>            Frank E Harrell Jr
>>            Sent: Wednesday, September 02, 2009 9:07 PM
>>            To: annie Zhang
>>            Cc: r-help@r-project.org
>>            Subject: Re: [R] variable selection in logistic
>>            annie Zhang wrote:
>>             > Hi, Frank,
>>             >
>>             > You mean the backward and forward stepwise selection is
>>             > bad? You also suggest the penalized logistic regression is
>>             > the best choice? Is there any function to do it as well as
>>             > selecting the best penalty?
>>             >
>>             > Annie
>>            All variable selection is bad unless it's in the context of
>>            penalization. You'll need penalized logistic regression, not
>>            necessarily with variable selection, for example a quadratic
>>            penalty as in a case study in my book, or an L1 penalty
>>            (lasso) using other packages.
>>            Frank
>>             >
>>             > On Wed, Sep 2, 2009 at 7:41 PM, Frank E Harrell Jr
>>             > <f.harr...@vanderbilt.edu> wrote:
>>             >
>>             >     David Winsemius wrote:
>>             >
>>             >
>>             >         On Sep 2, 2009, at 9:36 PM, annie Zhang wrote:
>>             >
>>             >             Hi, R users,
>>             >
>>             >             What may be the best function in R to do
>>             >             variable selection in logistic regression?
>>             >
>>             >
>>             >         PhD theses and books by famous statisticians have
>>             >         been pursuing the answer to that question for
>>             >         decades.
>>             >
>>             >             I have the same number of variables as the
>>             >             number of samples, and I want to select the
>>             >             best variables for prediction. Is there any
>>             >             function doing forward selection followed by
>>             >             backward elimination in stepwise logistic
>>             >             regression?
>>             >
>>             >
>>             >         You should probably be reading up on penalized
>>             >         regression methods. The stepwise procedures
>>             >         reporting unadjusted "significance" made available
>>             >         by SAS and SPSS to the unwary neophyte user have
>>             >         very poor statistical properties.
>>             >
>>             >         --
>>             >
>>             >         David Winsemius, MD
>>             >
>>             >
>>             >     Amen to that.
>>             >
>>             >     Annie, resist the temptation.  These methods bite.
>>             >
>>             >     Frank
>>             >
>>             >
>>             >         Heritage Laboratories
>>             >         West Hartford, CT
>>             >
>>             >         ______________________________________________
>>             >         R-help@r-project.org mailing list
>>             >         https://stat.ethz.ch/mailman/listinfo/r-help
>>             >         PLEASE do read the posting guide
>>             >         http://www.R-project.org/posting-guide.html
>>             >         and provide commented, minimal, self-contained,
>>             >         reproducible code.
>>             >
>>             >
>>             >
>>             >     --
>>             >     Frank E Harrell Jr   Professor and Chair
>>             >     School of Medicine, Department of Biostatistics
>>             >     Vanderbilt University
>>             >
>>             >




Don McKenzie
Research Ecologist
Pacific Wildland Fire Sciences Lab
US Forest Service

Affiliate Professor
College of Forest Resources and CSES Climate Impacts Group
University of Washington

phone: 206-732-7824
cell: 206-321-5966
d...@u.washington.edu





