[R] Logistic regression X^2 test with large sample size (fwd)

2012-07-31 Thread M Pomati


Does anyone know of any X^2 tests to compare the fit of logistic models 
which factor out the sample size? I'm dealing with a very large sample and 
I fear the significant X^2 test I get when adding a variable to the model 
is simply a result of the sample size (200,000 cases).

I'd rather use the whole dataset instead of taking (small) random samples 
as it is highly skewed. I've seen things like Phi and Cramer's V for 
crosstabs but I'm not sure whether they have been used before on logistic 
regression, if there are better ones and if there are any packages.


Many thanks

Marco



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic regression X^2 test with large sample size (fwd)

2012-07-31 Thread Marc Schwartz
On Jul 31, 2012, at 10:35 AM, M Pomati marco.pom...@bristol.ac.uk wrote:

 
 
> Does anyone know of any X^2 tests to compare the fit of logistic models
> which factor out the sample size? I'm dealing with a very large sample and
> I fear the significant X^2 test I get when adding a variable to the model
> is simply a result of the sample size (200,000 cases).
>
> I'd rather use the whole dataset instead of taking (small) random samples
> as it is highly skewed. I've seen things like Phi and Cramer's V for
> crosstabs but I'm not sure whether they have been used before on logistic
> regression, if there are better ones and if there are any packages.
>
> Many thanks
>
> Marco



Sounds like you are bordering on some type of stepwise approach to including or 
not including covariates in the model. You can search the list archives for a 
myriad of discussions as to why that is a poor approach.

You have the luxury of a large sample. You also have the challenge of 
interpreting covariates that appear to be statistically significant, but may 
have a rather small *effect size* in context. That is where subject matter 
experts need to provide input as to interpretation of the contextual 
significance of the variable, as opposed to the statistical significance of 
that same variable.
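
To make the distinction concrete, here is a minimal Python sketch (not part of the original exchange; the 2x2 counts are invented purely for illustration). It computes the Pearson X^2 statistic and the phi coefficient mentioned in the original post for the same cell proportions at two sample sizes.

```python
import math

def chi2_and_phi(a, b, c, d):
    """Pearson X^2 (1 df, no continuity correction), phi, and p-value
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    phi = math.sqrt(chi2 / n)
    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, phi, p_value

# Hypothetical counts: rows = covariate absent/present, columns = no event/event.
small = (450, 50, 440, 60)               # n = 1,000; event rates 10% vs 12%
large = tuple(200 * x for x in small)    # same proportions, n = 200,000

chi2_small, phi_small, p_small = chi2_and_phi(*small)
chi2_large, phi_large, p_large = chi2_and_phi(*large)

for label, (chi2, phi, p) in (("n=1,000", (chi2_small, phi_small, p_small)),
                              ("n=200,000", (chi2_large, phi_large, p_large))):
    print(f"{label}: X^2 = {chi2:.2f}, phi = {phi:.4f}, p = {p:.3g}")
```

With identical proportions, X^2 at n = 200,000 is 200 times its value at n = 1,000, while phi does not change. The test becomes "significant" only because of n; the effect size is the same small number in both cases.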

A general approach is to simply pre-specify your model based upon rather 
simple considerations. Also, you need to determine if your goal for the model 
is prediction or explanation. 

What is the incidence of your 'event' in the sample? If it is say 10%, then you 
should have around 20,000 events. The rule of thumb for logistic regression is 
to have around 20 events per covariate degree of freedom (df) to minimize the 
risk of over-fitting the model to your dataset. A continuous covariate is 1 df, 
a k-level factor is k-1 df. So with 20,000 events, your model could feasibly 
have 1,000 covariate df's. I am guessing that you don't have that much 
independent data to begin with.
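
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the 10% incidence is the guess made above, and the model specification (5 continuous covariates and three factors) is hypothetical. It also assumes the event is the rarer of the two outcomes.

```python
def covariate_df(continuous, factor_levels):
    """Total covariate df: 1 per continuous covariate, k - 1 per k-level factor."""
    return continuous + sum(k - 1 for k in factor_levels)

def df_budget(n_cases, incidence, events_per_df=20):
    """Rule-of-thumb maximum covariate df, given the expected number of events."""
    events = round(n_cases * incidence)
    return events // events_per_df

# 200,000 cases at an assumed 10% incidence: 20,000 events / 20 = 1,000 df.
budget = df_budget(200_000, 0.10)

# Hypothetical model: 5 continuous covariates plus factors with 3, 4 and 6 levels.
used = covariate_df(5, [3, 4, 6])   # 5 + 2 + 3 + 5 = 15 df

print(f"budget = {budget} df, model uses {used} df, within budget: {used <= budget}")
```

A model of that size uses a small fraction of the budget, which is the point: with this many events, over-fitting from the number of covariates is unlikely to be the main concern.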

So, pre-specify your model on the full dataset and stick with it. Interact with 
subject matter experts on the interpretation of the model.

BTW, this question is really about statistical modeling generally, not really R 
specific. Such queries are best posed to general statistical lists/forums such 
as Stack Exchange. I would also point you to Frank Harrell's book, Regression 
Modeling Strategies.

Regards,

Marc Schwartz



Re: [R] Logistic regression X^2 test with large sample size (fwd)

2012-07-31 Thread M Pomati
Marc, thank you very much for your help.
I've posted it on

http://math.stackexchange.com/questions/177252/x2-tests-to-compare-the-fit-of-large-samples-logistic-models

and added details.

Many thanks

Marco

--
M Pomati
University of Bristol
School for Policy Studies
8 Priory Road
Office:10B
Bristol BS8 1TZ, UK
http://www.bristol.ac.uk/sps/research/centres/poverty

 


Re: [R] Logistic regression X^2 test with large sample size (fwd)

2012-07-31 Thread David Winsemius


On Jul 31, 2012, at 10:25 AM, M Pomati wrote:


> Marc, thank you very much for your help.
> I've posted it on
>
> http://math.stackexchange.com/questions/177252/x2-tests-to-compare-the-fit-of-large-samples-logistic-models
>
> and added details.


I think you might have gotten a more statistically knowledgeable  
audience at:


http://stats.stackexchange.com/

(And I suggested to the moderators at math-SE that it be migrated.)

--
David.




David Winsemius, MD
Alameda, CA, USA
