This is a very good question.  You have spotted something that not many people 
see and it is important.

The bland assertion that the "deviance can be used as a test of fit" can be 
seriously misleading.

For this data the response is clearly binary, "Admitted" (success) or 
"Rejected" (failure) and the other two factors are explanatory variables.  

Any binomial model can be fitted by expressing the data as a binary response.  
In this case there are

> sum(UCBAd$Freq)
[1] 4526
> 

4526 trials, corresponding to the individual applicants for admission.  We can 
expand the data frame right out to this level and fit the model with the data 
in that form, and in this case the weights will be the default, ie. all 1's.

We can also *group* the data into subsets which are homogeneous with respect to 
the explanatory variables.  

The most succinct grouping would be into 12 groups corresponding to the 2 x 6 
distinct classes defined by the two explanatory variables.  In this case you 
would specify the response either as a two-column matrix of successes/failures, 
or as a proportion with the totals for each of the 12 cases as weights.

Another succince grouping is into 2 x 2 x 6 classes as you do in your example.  
In this case your response is the factor and the weights are the frequencies.

For all three cases a) the estimates will be the same, and so the predictions 
will be identical, b) the deviance will also be the same, but c) the degrees of 
freedom attributed to the deviance will be different.

The reason for c) is, as you have intuited, the saturated model is different.  
Literally, the saturated model is a model with one mean parameter for each 
value taken as an observation when the model is fitted.  So the saturated model 
is *not* invariant with respect to grouping.

Let's look at two of these cases computationally:


> UCB_Expanded <- UCBAd[rep(1:24, UCBAd$Freq), 1:3] ## expand the data frame
> dim(UCB_Expanded)
[1] 4526    3

>  ### now fit your model

> m1 <- glm(Admit ~ Gender + Dept, binomial, UCBAd, weights = Freq)

>  ### and the same model using the binary data form

> m2 <- glm(Admit ~ Gender + Dept, binomial, UCB_Expanded)

>  ### as predicted, the coefficients are identical (up to round off)

> coef(m1)
 (Intercept) GenderFemale        DeptB        DeptC        DeptD        DeptE 
 -0.58205140  -0.09987009   0.04339793   1.26259802   1.29460647   1.73930574 
       DeptF 
  3.30648006 
> coef(m2)
 (Intercept) GenderFemale        DeptB        DeptC        DeptD        DeptE 
 -0.58205140  -0.09987009   0.04339793   1.26259802   1.29460647   1.73930574 
       DeptF 
  3.30648006 

>   ### and so are the deviances, perhaps surprisingly:

> deviance(m1)
[1] 5187.488
> deviance(m2)
[1] 5187.488

>   ### but the degrees of freedom attributed to the deviance are different!

> m1$df.resid
[1] 17
> m2$df.resid
[1] 4519
> 
 
If you were to fit the model in the most succinct form, with 12 relative 
frequencies, then you would get the same deviance again, but the degrees of 
freedom would be only 5.

So you need to be very careful in taking the deviance, even in binomial models, 
as a test of fit.  The way the data are grouped is relevant.

If you have two fixed models, e.g. Admit ~ Gender, and Admit ~ Gender + Dept, 
then

The estimated coefficients, and their standard errors, vcov matrix,
The deviance, and so
*Differences* in deviances and
*Differences* in degrees of freedom

will be the same however the data are grouped, and so the usual tests and CI 
processes go ahead fine.

But the deviance itself can be misleading as a test of fit, since the outer 
hypothesis, the saturated model, is not fixed and depends on grouping.  For the 
ungrouped binary case it is *usually* misleading when taken simply at face 
value as chi-squared distributed under the null hypothesis.

I think there is a book or two around that discusses this issue, but probably 
not well enough.

Bill Venables.


-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Giovanni Petris
Sent: Thursday, 17 February 2011 7:58 AM
To: r-help@r-project.org
Subject: [R] Saturated model in binomial glm


Hi all,

Could somebody be so kind to explain to me what is the saturated model
on which deviance and degrees of freedom are calculated when fitting a
binomial glm?

Everything makes sense if I fit the model using as response a vector of
proportions or a two-column matrix. But when the response is a factor
and counts are specified via the "weights" argument, I am kind of lost
as far as what is the implied saturated model. 

Here is a simple example, based on the UCBAdmissions data.

> UCBAd <- as.data.frame(UCBAdmissions)
> UCBAd <- glm(Admit ~ Gender + Dept, family = binomial,
+                         weights = Freq, data = UCBAd)
> UCBAd$deviance
[1] 5187.488
> UCBAd$df.residual
[1] 17

I can see that the data frame UCBAd has 24 rows and using 1+1+5
parameters to fit the model leaves me with 17 degrees of freedom. 

What is not clear to me is what is the saturated model? 

Is it the model that fits a probability zero to each row corresponding
to failures and a probability one to each row corresponding to
successes? If this is so, it seems to me that looking at the deviance as
a goodness-of-fit statistic does not make much sense in this case. Am I
missing something?

Thank you in advance,
Giovanni


-- 

Giovanni Petris  <gpet...@uark.edu>
Associate Professor
Department of Mathematical Sciences
University of Arkansas - Fayetteville, AR 72701
Ph: (479) 575-6324, 575-8630 (fax)
http://definetti.uark.edu/~gpetris/

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to