Dear Bill, Thank you very much for your careful discussion of the issue.
It is not surprising that the deviance is the same whether you fit the model using a factor response with weights or individual 0/1 responses. I think this happens because the fitted probabilities in the saturated models are zero or one (depending on the response being a failure or a success) in both cases. The degrees of freedom are clearly different. However, if you consider as response the two-column matrix of successes and failures then you have a different deviance, with different degrees of freedom: > covariates <- subset(UCBAd, subset = Admit == "Admitted", + select = -c(Admit, Freq)) > AdmitReject <- as.matrix(cbind(subset(UCBAd, subset = Admit == "Rejected", + select = Freq), + subset(UCBAd, subset = Admit == "Admitted", + select = Freq))) > UCBAd_matrix <- glm(AdmitReject ~ Gender + Dept, family = binomial, + data = covariates) ## matrix as response > UCBAd_matrix$deviance [1] 20.20428 > UCBAd_matrix$df.residual [1] 5 > coef(UCBAd_matrix) (Intercept) GenderFemale DeptB DeptC DeptD DeptE -0.58205140 -0.09987009 0.04339793 1.26259802 1.29460647 1.73930574 DeptF 3.30648006 In this case the fitted probabilities in the saturated model are the observed proportions. The fitted coefficients are clearly the same. The deviance (and degrees of freedom) resulting from this way of fitting the model _can_ be used as a goodness-of-fit statistic, based on standard arguments, while the deviance (and degrees of freedom) resulting from fitting the model as individual 0/1 responses _can't_. What about deviance and degrees of freedom that I get using the "factor with weights" approach? The number of parameters in the saturated model does not increase with the number of observations, as it happens in the individual 0/1 responses case, but the fitted probabilities in the saturated model are fixed to zero and one. I guess it is not clear to me that the standard likelihood ratio test asymptotic theory holds in this case. Could anybody clarify? Thank you in advance, Giovanni On Thu, 2011-02-17 at 10:50 +1100, bill.venab...@csiro.au wrote: > This is a very good question. You have spotted something that not > many people see and it is important. > > The bland assertion that the "deviance can be used as a test of fit" > can be seriously misleading. > > For this data the response is clearly binary, "Admitted" (success) or > "Rejected" (failure) and the other two factors are explanatory > variables. > > Any binomial model can be fitted by expressing the data as a binary > response. In this case there are > > > sum(UCBAd$Freq) > [1] 4526 > > > > 4526 trials, corresponding to the individual applicants for admission. > We can expand the data frame right out to this level and fit the model > with the data in that form, and in this case the weights will be the > default, ie. all 1's. > > We can also *group* the data into subsets which are homogeneous with > respect to the explanatory variables. > > The most succinct grouping would be into 12 groups corresponding to > the 2 x 6 distinct classes defined by the two explanatory variables. > In this case you would specify the response either as a two-column > matrix of successes/failures, or as a proportion with the totals for > each of the 12 cases as weights. > > Another succince grouping is into 2 x 2 x 6 classes as you do in your > example. In this case your response is the factor and the weights are > the frequencies. > > For all three cases a) the estimates will be the same, and so the > predictions will be identical, b) the deviance will also be the same, > but c) the degrees of freedom attributed to the deviance will be > different. > > The reason for c) is, as you have intuited, the saturated model is > different. Literally, the saturated model is a model with one mean > parameter for each value taken as an observation when the model is > fitted. So the saturated model is *not* invariant with respect to > grouping. > > Let's look at two of these cases computationally: > > > > UCB_Expanded <- UCBAd[rep(1:24, UCBAd$Freq), 1:3] ## expand the data frame > > dim(UCB_Expanded) > [1] 4526 3 > > > ### now fit your model > > > m1 <- glm(Admit ~ Gender + Dept, binomial, UCBAd, weights = Freq) > > > ### and the same model using the binary data form > > > m2 <- glm(Admit ~ Gender + Dept, binomial, UCB_Expanded) > > > ### as predicted, the coefficients are identical (up to round off) > > > coef(m1) > (Intercept) GenderFemale DeptB DeptC DeptD DeptE > -0.58205140 -0.09987009 0.04339793 1.26259802 1.29460647 1.73930574 > DeptF > 3.30648006 > > coef(m2) > (Intercept) GenderFemale DeptB DeptC DeptD DeptE > -0.58205140 -0.09987009 0.04339793 1.26259802 1.29460647 1.73930574 > DeptF > 3.30648006 > > > ### and so are the deviances, perhaps surprisingly: > > > deviance(m1) > [1] 5187.488 > > deviance(m2) > [1] 5187.488 > > > ### but the degrees of freedom attributed to the deviance are different! > > > m1$df.resid > [1] 17 > > m2$df.resid > [1] 4519 > > > > If you were to fit the model in the most succinct form, with 12 > relative frequencies, then you would get the same deviance again, but > the degrees of freedom would be only 5. > > So you need to be very careful in taking the deviance, even in > binomial models, as a test of fit. The way the data are grouped is > relevant. > > If you have two fixed models, e.g. Admit ~ Gender, and Admit ~ Gender + Dept, > then > > The estimated coefficients, and their standard errors, vcov matrix, > The deviance, and so > *Differences* in deviances and > *Differences* in degrees of freedom > > will be the same however the data are grouped, and so the usual tests > and CI processes go ahead fine. > > But the deviance itself can be misleading as a test of fit, since the > outer hypothesis, the saturated model, is not fixed and depends on > grouping. For the ungrouped binary case it is *usually* misleading > when taken simply at face value as chi-squared distributed under the > null hypothesis. > > I think there is a book or two around that discusses this issue, but > probably not well enough. > > Bill Venables. > > > -----Original Message----- > From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On > Behalf Of Giovanni Petris > Sent: Thursday, 17 February 2011 7:58 AM > To: r-help@r-project.org > Subject: [R] Saturated model in binomial glm > > > Hi all, > > Could somebody be so kind to explain to me what is the saturated model > on which deviance and degrees of freedom are calculated when fitting a > binomial glm? > > Everything makes sense if I fit the model using as response a vector of > proportions or a two-column matrix. But when the response is a factor > and counts are specified via the "weights" argument, I am kind of lost > as far as what is the implied saturated model. > > Here is a simple example, based on the UCBAdmissions data. > > > UCBAd <- as.data.frame(UCBAdmissions) > > UCBAd <- glm(Admit ~ Gender + Dept, family = binomial, > + weights = Freq, data = UCBAd) > > UCBAd$deviance > [1] 5187.488 > > UCBAd$df.residual > [1] 17 > > I can see that the data frame UCBAd has 24 rows and using 1+1+5 > parameters to fit the model leaves me with 17 degrees of freedom. > > What is not clear to me is what is the saturated model? > > Is it the model that fits a probability zero to each row corresponding > to failures and a probability one to each row corresponding to > successes? If this is so, it seems to me that looking at the deviance as > a goodness-of-fit statistic does not make much sense in this case. Am I > missing something? > > Thank you in advance, > Giovanni > > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.