[R] Overdispersion with binomial distribution

2009-02-16 Thread Jessica L Hite/hitejl/O/VCU


I am attempting to run a glm with a binomial model to analyze proportion
data.
I have been following Crawley's book closely and am wondering if there is
an accepted standard for how much is too much overdispersion? (e.g. change
in AIC has an accepted standard of 2).

In the example, he fits several models, binomial and quasibinomial and then
accepts the quasibinomial.
The output for residual deviance and df does not change between these two
models, however, he accepts the quasibinomial.
Is there a different calculation that I missed that actually provides an
overdispersion factor (as in SAS?)

My code and output are below, given the example in the book, these data are
WAY overdispersed .do I mention this and go on or does this signal the
need to try a different model? If so, any suggestions on the type of
distribution (gamma or negative binomial ?)?

attach(Clutch2)
 y<-cbind(Total,Size-Total)
glm1<-glm(y~Pred,"binomial")
summary(glm1)

Call:
glm(formula = y ~ Pred, family = "binomial")

Deviance Residuals:
Min   1Q   Median   3Q  Max
-9.1022  -2.7899  -0.4781   2.6058   8.4852

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.350950.06612  20.433  < 2e-16 ***
PredF   -0.348110.11719  -2.970  0.00297 **
PredSN  -3.291560.10691 -30.788  < 2e-16 ***
PredW   -1.464510.12787 -11.453  < 2e-16 ***
PredWF  -0.564120.13178  -4.281 1.86e-05 ***
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2815.5  on 108  degrees of freedom
Residual deviance: 1323.5  on 104  degrees of freedom
  (3 observations deleted due to missingness)
AIC: 1585.2

Number of Fisher Scoring iterations: 5


 the output for residual deviance and df does not change even when I
use quasibinomial, is this ok?  #


 library(MASS)
> glm2<-glm(y~Pred,"quasibinomial")
> summary(glm2)

Call:
glm(formula = y ~ Pred, family = "quasibinomial")

Deviance Residuals:
Min   1Q   Median   3Q  Max
-9.1022  -2.7899  -0.4781   2.6058   8.4852

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.3510 0.2398   5.633 1.52e-07 ***
PredF-0.3481 0.4251  -0.819  0.41471
PredSN   -3.2916 0.3878  -8.488 1.56e-13 ***
PredW-1.4645 0.4638  -3.157  0.00208 **
PredWF   -0.5641 0.4780  -1.180  0.24063
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

(Dispersion parameter for quasibinomial family taken to be 13.15786)

Null deviance: 2815.5  on 108  degrees of freedom
Residual deviance: 1323.5  on 104  degrees of freedom
  (3 observations deleted due to missingness)
AIC: NA

Number of Fisher Scoring iterations: 5

> anova(glm2,test="F")
Analysis of Deviance Table

Model: quasibinomial, link: logit

Response: y

Terms added sequentially (first to last)


  Df Deviance Resid. Df Resid. Dev  F   Pr(>F)
NULL108 2815.5
Pred   4   1492.0   104 1323.5 28.349 6.28e-16 ***
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
> model1<-update(glm2,~.-Pred)
> anova(glm2,model1,test="F")
Analysis of Deviance Table

Model 1: y ~ Pred
Model 2: y ~ 1
  Resid. Df Resid. Dev  Df Deviance  F   Pr(>F)
1   104 1323.5
2   108 2815.5  -4  -1492.0 28.349 6.28e-16 ***
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
> coef(glm2)
(Intercept)   PredF  PredSN   PredW  PredWF
  1.3509550  -0.3481096  -3.2915601  -1.4645097  -0.5641223


Thanks
Jessica
hit...@vcu.edu

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Overdispersion with binomial distribution

2009-02-17 Thread Ben Bolker
Jessica L Hite/hitejl/O/VCU  vcu.edu> writes:

> I am attempting to run a glm with a binomial model to analyze proportion
> data.
> I have been following Crawley's book closely and am wondering if there is
> an accepted standard for how much is too much overdispersion? (e.g. change
> in AIC has an accepted standard of 2).

  In principle, in the null case (i.e. data are really binomial)
the deviance is  chi-squared distributed with the df equal
to the residual df.

  For example:

example(glm)
deviance(glm.D93) ## 5.13
summary(glm.D93)$dispersion ## 1 (by definition)
dfr <- df.residual(glm.D93)
deviance(glm.D93)/dfr ## 1.28
d2 <- sum(residuals(glm.D93,"pearson")^2) ## 5.17
(disp2 <- d2/dfr)  ## 1.293

gg2 <- update(glm.D93,family=quasipoisson)
summary(gg2)$dispersion  ## 1.293, same as above

pchisq(d2,df=dfr,lower.tail=FALSE)

all.equal(coef(glm.D93),coef(gg2)) ## TRUE

se1 <- coef(summary(glm.D93))[,"Std. Error"]
se2 <- coef(summary(gg2))[,"Std. Error"]
se2/se1

# (Intercept)outcome2outcome3  treatment2  treatment3 
#   1.1372341.1372341.1372341.1372341.137234 

sqrt(disp2)
# [1] 1.137234

> My code and output are below, given the example in the book, these data are
> WAY overdispersed .do I mention this and go on or does this signal the
> need to try a different model? If so, any suggestions on the type of
> distribution (gamma or negative binomial ?)?

  Way overdispersed may indicate model lack of fit.  Have
you examined residuals/data for outliers etc.?  

  quasibinomial should be fine, or you can try beta-binomial
(see the aod package) ...


> attach(Clutch2)
>  y<-cbind(Total,Size-Total)
> glm1<-glm(y~Pred,"binomial")
> summary(glm1)
> 
> Call:
> glm(formula = y ~ Pred, family = "binomial")
> 
> Deviance Residuals:
> Min   1Q   Median   3Q  Max
> -9.1022  -2.7899  -0.4781   2.6058   8.4852
> 
> Coefficients:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept)  1.350950.06612  20.433  < 2e-16 ***
> PredF   -0.348110.11719  -2.970  0.00297 **
> PredSN  -3.291560.10691 -30.788  < 2e-16 ***
> PredW   -1.464510.12787 -11.453  < 2e-16 ***
> PredWF  -0.564120.13178  -4.281 1.86e-05 ***
> ---
>  the output for residual deviance and df does not change even when I
> use quasibinomial, is this ok?  #

  That's as expected.
 
>  library(MASS)

  you don't really need MASS for quasibinomial.

> > glm2<-glm(y~Pred,"quasibinomial")
> > summary(glm2)
> 
> Call:
> glm(formula = y ~ Pred, family = "quasibinomial")
> 
> Deviance Residuals:
> Min   1Q   Median   3Q  Max
> -9.1022  -2.7899  -0.4781   2.6058   8.4852
> 
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept)   1.3510 0.2398   5.633 1.52e-07 ***
> PredF-0.3481 0.4251  -0.819  0.41471
> PredSN   -3.2916 0.3878  -8.488 1.56e-13 ***
> PredW-1.4645 0.4638  -3.157  0.00208 **
> PredWF   -0.5641 0.4780  -1.180  0.24063
> ---
> Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
> 
> (Dispersion parameter for quasibinomial family taken to be 13.15786)
> 
> Null deviance: 2815.5  on 108  degrees of freedom
> Residual deviance: 1323.5  on 104  degrees of freedom
>   (3 observations deleted due to missingness)
> AIC: NA
> 
> Number of Fisher Scoring iterations: 5
> 
> > anova(glm2,test="F")
> Analysis of Deviance Table
> 
> Model: quasibinomial, link: logit
> 
> Response: y
> 
> Terms added sequentially (first to last)
> 
>   Df Deviance Resid. Df Resid. Dev  F   Pr(>F)
> NULL108 2815.5
> Pred   4   1492.0   104 1323.5 28.349 6.28e-16 ***
> ---
> Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
> > model1<-update(glm2,~.-Pred)
> > anova(glm2,model1,test="F")
> Analysis of Deviance Table
> 
> Model 1: y ~ Pred
> Model 2: y ~ 1
>   Resid. Df Resid. Dev  Df Deviance  F   Pr(>F)
> 1   104 1323.5
> 2   108 2815.5  -4  -1492.0 28.349 6.28e-16 ***
> ---
> Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
> > coef(glm2)
> (Intercept)   PredF  PredSN   PredW  PredWF
>   1.3509550  -0.3481096  -3.2915601  -1.4645097  -0.5641223
> 
> Thanks
> Jessica
> hitejl  vcu.edu

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Overdispersion with binomial distribution

2009-02-17 Thread Prof Brian Ripley

On Tue, 17 Feb 2009, Ben Bolker wrote:


Jessica L Hite/hitejl/O/VCU  vcu.edu> writes:


I am attempting to run a glm with a binomial model to analyze proportion
data.
I have been following Crawley's book closely and am wondering if there is
an accepted standard for how much is too much overdispersion? (e.g. change
in AIC has an accepted standard of 2).



 In principle, in the null case (i.e. data are really binomial)
the deviance is  chi-squared distributed with the df equal
to the residual df.


*Approximately*, provided the expected counts are not near or below 
one.  See MASS §7.5 for an analysis of the size of the approximation 
errors (which can be large and in both directions).


Given that I once had a consulting job where the over-dispersion was 
causing something close ot panic and was entirely illusory, the lack 
of the 'approximately' can have quite serious consequences.



 For example:

example(glm)
deviance(glm.D93) ## 5.13
summary(glm.D93)$dispersion ## 1 (by definition)
dfr <- df.residual(glm.D93)
deviance(glm.D93)/dfr ## 1.28
d2 <- sum(residuals(glm.D93,"pearson")^2) ## 5.17
(disp2 <- d2/dfr)  ## 1.293

gg2 <- update(glm.D93,family=quasipoisson)
summary(gg2)$dispersion  ## 1.293, same as above

pchisq(d2,df=dfr,lower.tail=FALSE)

all.equal(coef(glm.D93),coef(gg2)) ## TRUE

se1 <- coef(summary(glm.D93))[,"Std. Error"]
se2 <- coef(summary(gg2))[,"Std. Error"]
se2/se1

# (Intercept)outcome2outcome3  treatment2  treatment3
#   1.1372341.1372341.1372341.1372341.137234

sqrt(disp2)
# [1] 1.137234


My code and output are below, given the example in the book, these data are
WAY overdispersed .do I mention this and go on or does this signal the
need to try a different model? If so, any suggestions on the type of
distribution (gamma or negative binomial ?)?


 Way overdispersed may indicate model lack of fit.  Have
you examined residuals/data for outliers etc.?

 quasibinomial should be fine, or you can try beta-binomial
(see the aod package) ...



attach(Clutch2)
 y<-cbind(Total,Size-Total)
glm1<-glm(y~Pred,"binomial")
summary(glm1)

Call:
glm(formula = y ~ Pred, family = "binomial")

Deviance Residuals:
Min   1Q   Median   3Q  Max
-9.1022  -2.7899  -0.4781   2.6058   8.4852

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.350950.06612  20.433  < 2e-16 ***
PredF   -0.348110.11719  -2.970  0.00297 **
PredSN  -3.291560.10691 -30.788  < 2e-16 ***
PredW   -1.464510.12787 -11.453  < 2e-16 ***
PredWF  -0.564120.13178  -4.281 1.86e-05 ***
---
 the output for residual deviance and df does not change even when I
use quasibinomial, is this ok?  #


 That's as expected.


 library(MASS)


 you don't really need MASS for quasibinomial.


glm2<-glm(y~Pred,"quasibinomial")
summary(glm2)


Call:
glm(formula = y ~ Pred, family = "quasibinomial")

Deviance Residuals:
Min   1Q   Median   3Q  Max
-9.1022  -2.7899  -0.4781   2.6058   8.4852

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.3510 0.2398   5.633 1.52e-07 ***
PredF-0.3481 0.4251  -0.819  0.41471
PredSN   -3.2916 0.3878  -8.488 1.56e-13 ***
PredW-1.4645 0.4638  -3.157  0.00208 **
PredWF   -0.5641 0.4780  -1.180  0.24063
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

(Dispersion parameter for quasibinomial family taken to be 13.15786)

Null deviance: 2815.5  on 108  degrees of freedom
Residual deviance: 1323.5  on 104  degrees of freedom
  (3 observations deleted due to missingness)
AIC: NA

Number of Fisher Scoring iterations: 5


anova(glm2,test="F")

Analysis of Deviance Table

Model: quasibinomial, link: logit

Response: y

Terms added sequentially (first to last)

  Df Deviance Resid. Df Resid. Dev  F   Pr(>F)
NULL108 2815.5
Pred   4   1492.0   104 1323.5 28.349 6.28e-16 ***
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

model1<-update(glm2,~.-Pred)
anova(glm2,model1,test="F")

Analysis of Deviance Table

Model 1: y ~ Pred
Model 2: y ~ 1
  Resid. Df Resid. Dev  Df Deviance  F   Pr(>F)
1   104 1323.5
2   108 2815.5  -4  -1492.0 28.349 6.28e-16 ***
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

coef(glm2)

(Intercept)   PredF  PredSN   PredW  PredWF
  1.3509550  -0.3481096  -3.2915601  -1.4645097  -0.5641223

Thanks
Jessica
hitejl  vcu.edu



--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http:

Re: [R] Overdispersion with binomial distribution

2009-02-17 Thread Ben Bolker
 Thanks for the clarification.
  I actually had MASS open to that page while
I was composing my reply but forgot to mention
it (trying to do too many things at once) ...

  Ben Bolker

Prof Brian Ripley wrote:
> On Tue, 17 Feb 2009, Ben Bolker wrote:
> 
>> Jessica L Hite/hitejl/O/VCU  vcu.edu> writes:
>>
>>> I am attempting to run a glm with a binomial model to analyze proportion
>>> data.
>>> I have been following Crawley's book closely and am wondering if there is
>>> an accepted standard for how much is too much overdispersion? (e.g. change
>>> in AIC has an accepted standard of 2).
> 
>>  In principle, in the null case (i.e. data are really binomial)
>> the deviance is  chi-squared distributed with the df equal
>> to the residual df.
> 
> *Approximately*, provided the expected counts are not near or below 
> one.  See MASS §7.5 for an analysis of the size of the approximation 
> errors (which can be large and in both directions).
> 
> Given that I once had a consulting job where the over-dispersion was 
> causing something close ot panic and was entirely illusory, the lack 
> of the 'approximately' can have quite serious consequences.
> 
>>  For example:
>>
>> example(glm)
>> deviance(glm.D93) ## 5.13
>> summary(glm.D93)$dispersion ## 1 (by definition)
>> dfr <- df.residual(glm.D93)
>> deviance(glm.D93)/dfr ## 1.28
>> d2 <- sum(residuals(glm.D93,"pearson")^2) ## 5.17
>> (disp2 <- d2/dfr)  ## 1.293
>>
>> gg2 <- update(glm.D93,family=quasipoisson)
>> summary(gg2)$dispersion  ## 1.293, same as above
>>
>> pchisq(d2,df=dfr,lower.tail=FALSE)
>>
>> all.equal(coef(glm.D93),coef(gg2)) ## TRUE
>>
>> se1 <- coef(summary(glm.D93))[,"Std. Error"]
>> se2 <- coef(summary(gg2))[,"Std. Error"]
>> se2/se1
>>
>> # (Intercept)outcome2outcome3  treatment2  treatment3
>> #   1.1372341.1372341.1372341.1372341.137234
>>
>> sqrt(disp2)
>> # [1] 1.137234
>>
>>> My code and output are below, given the example in the book, these data are
>>> WAY overdispersed .do I mention this and go on or does this signal the
>>> need to try a different model? If so, any suggestions on the type of
>>> distribution (gamma or negative binomial ?)?
>>  Way overdispersed may indicate model lack of fit.  Have
>> you examined residuals/data for outliers etc.?
>>
>>  quasibinomial should be fine, or you can try beta-binomial
>> (see the aod package) ...
>>
>>
>>> attach(Clutch2)
>>>  y<-cbind(Total,Size-Total)
>>> glm1<-glm(y~Pred,"binomial")
>>> summary(glm1)
>>>
>>> Call:
>>> glm(formula = y ~ Pred, family = "binomial")
>>>
>>> Deviance Residuals:
>>> Min   1Q   Median   3Q  Max
>>> -9.1022  -2.7899  -0.4781   2.6058   8.4852
>>>
>>> Coefficients:
>>> Estimate Std. Error z value Pr(>|z|)
>>> (Intercept)  1.350950.06612  20.433  < 2e-16 ***
>>> PredF   -0.348110.11719  -2.970  0.00297 **
>>> PredSN  -3.291560.10691 -30.788  < 2e-16 ***
>>> PredW   -1.464510.12787 -11.453  < 2e-16 ***
>>> PredWF  -0.564120.13178  -4.281 1.86e-05 ***
>>> ---
>>>  the output for residual deviance and df does not change even when I
>>> use quasibinomial, is this ok?  #
>>  That's as expected.
>>
>>>  library(MASS)
>>  you don't really need MASS for quasibinomial.
>>
 glm2<-glm(y~Pred,"quasibinomial")
 summary(glm2)
>>> Call:
>>> glm(formula = y ~ Pred, family = "quasibinomial")
>>>
>>> Deviance Residuals:
>>> Min   1Q   Median   3Q  Max
>>> -9.1022  -2.7899  -0.4781   2.6058   8.4852
>>>
>>> Coefficients:
>>> Estimate Std. Error t value Pr(>|t|)
>>> (Intercept)   1.3510 0.2398   5.633 1.52e-07 ***
>>> PredF-0.3481 0.4251  -0.819  0.41471
>>> PredSN   -3.2916 0.3878  -8.488 1.56e-13 ***
>>> PredW-1.4645 0.4638  -3.157  0.00208 **
>>> PredWF   -0.5641 0.4780  -1.180  0.24063
>>> ---
>>> Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
>>>
>>> (Dispersion parameter for quasibinomial family taken to be 13.15786)
>>>
>>> Null deviance: 2815.5  on 108  degrees of freedom
>>> Residual deviance: 1323.5  on 104  degrees of freedom
>>>   (3 observations deleted due to missingness)
>>> AIC: NA
>>>
>>> Number of Fisher Scoring iterations: 5
>>>
 anova(glm2,test="F")
>>> Analysis of Deviance Table
>>>
>>> Model: quasibinomial, link: logit
>>>
>>> Response: y
>>>
>>> Terms added sequentially (first to last)
>>>
>>>   Df Deviance Resid. Df Resid. Dev  F   Pr(>F)
>>> NULL108 2815.5
>>> Pred   4   1492.0   104 1323.5 28.349 6.28e-16 ***
>>> ---
>>> Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
 model1<-update(glm2,~.-Pred)
 anova(glm2,model1,test="F")
>>> Analysis of Deviance Table
>>>
>>> Model 1: y ~ Pred
>>> Model 2: y ~ 1
>>>   Resid. Df Resid. Dev  Df Deviance  F   Pr(>F)
>>> 1   104 1323.5
>>> 2   108 2815.5  -4  -1492.0 28.349 6.28e-16 ***
>>> ---
>>> Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.0