Re: [R] logistic regression with 50 variables

Marc Schwartz Mon, 14 Jun 2010 08:22:53 -0700

Joris,

There are two separate issues here:


1. Can you consider an LR model with 50 covariates?

2. Should you have 50 covariates in your LR model?


The answer to 1 is certainly yes, given what I noted below as a general working
framework. I have personally been involved with the development and validation
of LR models with ~35 covariates, albeit with notably larger datasets than the
one discussed below, because the models are used for prediction. In fact, the
current incarnations of those same models, now 15 years later, appear to have
more than 40 covariates and are quite stable. The interpretation of the models
by both statisticians and clinicians is relatively straightforward.

The answer to 2 gets into the subject matter that you raise, which is to
consider other factors beyond the initial rules of thumb for minimum sample
size. These include reasonable data reduction methods, collinearity, subject
matter expertise, sparse data, etc.

The issues raised in number 2 are discussed in the two references that I noted.

Two additional references that might be helpful here on the first point are:

P. Peduzzi, J. Concato, E. Kemper, T. R. Holford, and A. R. Feinstein. A 
simulation study of the number of events per variable in logistic regression 
analysis. J Clin Epi, 49:1373–1379, 1996. 

E. Vittinghoff and C. E. McCulloch. Relaxing the rule of ten events per
variable in logistic and Cox regression. Am J Epi, 165:710–718, 2007.
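
As a rough illustration of the events-per-variable arithmetic behind the rule
of thumb quoted further down (10-20 events per covariate degree of freedom,
with the number of events taken as the count of the less frequent outcome),
here is a minimal Python sketch. The function names, the mix of variable types
and the 1200/600 outcome split are hypothetical and not from the thread; only
the 1800 records, the 50 variables and the 10-20 range come from the
discussion.

```python
def covariate_df(n_continuous, n_binary, factor_levels):
    """Model-matrix columns (intercept excluded): 1 per continuous or
    binary variable, K - 1 per K-level factor."""
    return n_continuous + n_binary + sum(k - 1 for k in factor_levels)


def n_events(count_outcome_0, count_outcome_1):
    """'Events' for the rule of thumb: the count of the less frequent outcome."""
    return min(count_outcome_0, count_outcome_1)


def events_per_df(events, df):
    """Events available per covariate degree of freedom."""
    return events / df


def required_events(df, low=10, high=20):
    """Range of events suggested by a low-high events-per-df rule of thumb."""
    return df * low, df * high


# Hypothetical make-up of the 50 variables: all continuous or binary.
df = covariate_df(n_continuous=40, n_binary=10, factor_levels=[])

# Hypothetical outcome split of the 1800 records.
events = n_events(1200, 600)

print("covariate df:", df)
print("events:", events)
print("events per df:", events_per_df(events, df))
print("events suggested by the 10-20 rule:", required_events(df))

# With 1800 records the less frequent outcome can have at most 900 events,
# which caps the achievable events per df for this dataset.
print("maximum possible events per df:", events_per_df(1800 // 2, df))
```

Replacing some of the binary variables with K-level factors raises the degrees
of freedom and lowers the events per degree of freedom accordingly, as do any
interaction terms.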


Regards,

Marc

On Jun 14, 2010, at 8:38 AM, Joris Meys wrote:

> Hi,
> 
> Marc's explanation is valid to a certain extent, but I don't agree with
> his conclusion. I'd like to point out "the curse of
> dimensionality" (Hughes effect), which starts to play a role rather quickly.
> 
> The curse of dimensionality is easily demonstrated by looking at the
> proximity between your datapoints. Say we scale the interval in one
> dimension to be 1 unit. If you have 20 evenly spaced observations, the
> distance between the observations is 0.05 units. To have a proximity
> like that in a 2-dimensional space, you need 20^2 = 400 observations. In
> a 10-dimensional space this becomes 20^10 ~ 10^13 datapoints. The
> distance between your observations is important, as a sparse dataset
> will definitely make your model misbehave.
> 
> Even with about 35 samples per variable, using 50 independent
> variables will yield a highly unstable model, as your dataspace is
> about as sparse as it can get. On top of that, interpreting a model
> with 50 variables is close to impossible, and I haven't even started
> on interactions. No point in trying, I'd say. If you really need all
> that information, you might want to take a look at some dimension
> reduction methods first.
> 
> Cheers
> Joris
> 
> On Mon, Jun 14, 2010 at 2:55 PM, Marc Schwartz <marc_schwa...@me.com> wrote:
>> On Jun 13, 2010, at 10:20 PM, array chip wrote:
>> 
>>> Hi, this is not an R technical question per se. I know there are many
>>> excellent statisticians on this list, so here is my question: I have a
>>> dataset with ~1800 observations and 50 independent variables, so there are
>>> about 35 samples per variable. Is it wise to build a multiple logistic
>>> model with 50 independent variables, and will it be stable? Are there any
>>> problems with this approach? Thanks
>>> 
>>> John
>> 
>> 
>> The general rule of thumb is to have 10-20 'events' per covariate degree of 
>> freedom. Frank has suggested that in some cases that number should be as 
>> high as 25.
>> 
>> The number of events is the count of the less frequent of the two possible
>> outcomes for your binary dependent variable.
>>
>> Covariate degrees of freedom refers to the number of columns in the model
>> matrix, not counting the intercept. Continuous variables are 1, binary
>> factors are 1, K-level factors are K - 1.
>> 
>> So if out of your 1800 records, you have at least 500 to 1000 events, 
>> depending upon how many of your 50 variables are K-level factors and whether 
>> or not you need to consider interactions, you may be OK. Better if towards 
>> the high end of that range, especially if the model is for prediction versus 
>> explanation.
>> 
>> Two excellent references would be Frank's book:
>>
>> http://www.amazon.com/Regression-Modeling-Strategies-Frank-Harrell/dp/0387952322/
>>
>> and Steyerberg's book:
>>
>> http://www.amazon.com/Clinical-Prediction-Models-Development-Validation/dp/038777243X/
>> 
>> to assist in providing guidance for model building/validation techniques.
>> 
>> HTH,
>> 
>> Marc Schwartz
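
The proximity arithmetic in Joris's message quoted above can be checked with a
short Python sketch. It assumes, as he does, a unit interval per dimension and
20 evenly spaced points per axis (a spacing of 0.05 units); the function names
are hypothetical. It only illustrates the grid-counting argument and says
nothing by itself about how a particular logistic model will behave.

```python
def points_for_spacing(points_per_axis, dims):
    """Grid points needed to keep the same per-axis spacing in `dims` dimensions."""
    return points_per_axis ** dims


def equivalent_points_per_axis(n_obs, dims):
    """Points per axis that `n_obs` observations would give on a regular grid."""
    return n_obs ** (1.0 / dims)


# Observations needed to keep 20 points per axis as dimensions are added.
for dims in (1, 2, 10, 50):
    print(dims, "dimensions:", points_for_spacing(20, dims), "points")

# The reverse view for the dataset in the thread: 1800 observations spread
# over a regular grid in 50 dimensions.
print("points per axis:", equivalent_points_per_axis(1800, 50))
```

The second calculation is the same argument read backwards: it asks how finely
a fixed number of observations could cover each axis if they were laid out on
a regular grid.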

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
