On Jul 31, 2012, at 10:35 AM, M Pomati <marco.pom...@bristol.ac.uk> wrote:

> 
> 
> Does anyone know of any X^2 tests to compare the fit of logistic models 
> which factor out the sample size? I'm dealing with a very large sample and 
> I fear the significant X^2 test I get when adding a variable to the model 
> is simply a result of the sample size (>200,000 cases).
> 
> I'd rather use the whole dataset instead of taking (small) random samples 
> as it is highly skewed. I've seen things like Phi and Cramer's V for 
> crosstabs but I'm not sure whether they have been used before on logistic 
> regression, if there are better ones and if there are any packages.
> 
> 
> Many thanks
> 
> Marco



Sounds like you are bordering on some type of stepwise approach to including or 
not including covariates in the model. You can search the list archives for a 
myriad of discussions as to why that is a poor approach.

You have the luxury of a large sample. You also have the challenge of 
interpreting covariates that appear to be statistically significant, but may 
have a rather small *effect size* in context. That is where subject matter 
experts need to provide input as to interpretation of the contextual 
significance of the variable, as opposed to the statistical significance of 
that same variable.
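
To make the distinction concrete, here is a minimal sketch in Python (standard library only) rather than anything from the original thread. It uses a hypothetical 2x2 table, with event rates of 10.5% vs 10.0% chosen purely for illustration, and scales it from 2,000 to 200,000 cases. The Pearson X^2 grows in proportion to n, while phi (the square root of X^2/n) and the odds ratio do not move. The function names and all counts are assumptions for the example, not data from the question.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def describe(scale):
    """Summarise a hypothetical 2x2 table whose cell counts are multiplied by `scale`."""
    # Hypothetical counts: 10.5% events in one group, 10.0% in the other.
    a, b = 105 * scale, 895 * scale  # group 1: events, non-events
    c, d = 100 * scale, 900 * scale  # group 2: events, non-events
    n = a + b + c + d
    chi2 = chi_square_2x2(a, b, c, d)
    # Upper tail of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    phi = math.sqrt(chi2 / n)          # effect size, free of n
    odds_ratio = (a * d) / (b * c)     # also free of n
    return n, chi2, p_value, phi, odds_ratio


for scale in (1, 10, 100):
    n, chi2, p_value, phi, odds_ratio = describe(scale)
    print(f"n={n:>7}  X^2={chi2:8.3f}  p={p_value:.4f}  "
          f"phi={phi:.4f}  OR={odds_ratio:.3f}")
```

The same proportions go from clearly non-significant to highly significant as n grows, yet the effect size is identical in every row. Whether an odds ratio of that size matters is the subject matter question.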

A general approach is to pre-specify your model based upon rather simple 
considerations. Also, you need to determine if your goal for the model is 
prediction or explanation.

What is the incidence of your 'event' in the sample? If it is, say, 10%, then 
you should have around 20,000 events. The rule of thumb for logistic regression 
is to have around 20 events per covariate degree of freedom (df) to minimize 
the risk of over-fitting the model to your dataset. A continuous covariate is 
1 df, a k-level factor is k-1 df. So with 20,000 events, your model could 
feasibly have 1,000 covariate df. I am guessing that you don't have that much 
independent data to begin with.
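
The arithmetic above can be sketched as follows, in Python with hypothetical function names. One assumption goes beyond what is stated above: the sketch takes the smaller of the event and non-event counts as the limiting sample size, which only matters when the event is the more common outcome. The 20-events-per-df figure is a rule of thumb, not a hard limit.

```python
def max_covariate_df(n_cases, event_rate, events_per_df=20):
    """Rough upper bound on covariate df under an events-per-df rule of thumb."""
    events = round(n_cases * event_rate)
    non_events = n_cases - events
    # Assumption: the rarer outcome is what limits model complexity.
    limiting = min(events, non_events)
    return limiting // events_per_df


def model_df(n_continuous, factor_levels):
    """Covariate df: 1 per continuous covariate, k-1 per k-level factor."""
    return n_continuous + sum(k - 1 for k in factor_levels)


# The example from the text: 200,000 cases with a 10% event rate.
budget = max_covariate_df(200_000, 0.10)

# A hypothetical model: 3 continuous covariates, one 4-level and one 2-level factor.
used = model_df(3, [4, 2])

print(f"df budget: {budget}, df used by the hypothetical model: {used}")
```

A pre-specified model with a handful of covariates sits far inside that budget, which is why over-fitting is unlikely to be the main concern here.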

So, pre-specify your model on the full dataset and stick with it. Interact with 
subject matter experts on the interpretation of the model.

BTW, this question is really about statistical modeling generally, not 
R-specific. Such queries are best posed to general statistical lists/forums 
such as Stack Exchange. I would also point you to Frank Harrell's book, 
Regression Modeling Strategies.

Regards,

Marc Schwartz

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
