Dear Milan and Steven, At the risk of muddying the water further, I think that the potential confusion here is that Poisson GLMs are applied in two formally equivalent but substantively different situations: (1) where the counts are cells in a contingency table, in which case the Poisson GLM is used to fit an equivalent loglinear association model to the table; and (2) where the counts are observations on a non-negative integer response -- what's often called "Poisson regression." In the first case, but not the second, it makes sense to think of the sum of the counts as the natural sample size. I don't think that one can expect GLM software to distinguish these cases.
Best, John > -----Original Message----- > From: r-devel-boun...@r-project.org [mailto:r-devel-bounces@r- > project.org] On Behalf Of Milan Bouchet-Valat > Sent: Wednesday, February 27, 2013 5:59 PM > To: Steven McKinney > Cc: r-devel > Subject: Re: [Rd] nobs() with glm(family="poisson") > > Le mercredi 27 février 2013 à 14:26 -0800, Steven McKinney a écrit : > > > > > -----Original Message----- > > > From: r-devel-boun...@r-project.org [mailto:r-devel-bounces@r- > project.org] > > > On Behalf Of Milan Bouchet-Valat > > > Sent: February-27-13 12:56 PM > > > To: peter dalgaard > > > Cc: r-devel > > > Subject: Re: [Rd] nobs() with glm(family="poisson") > > > > > > Thanks for the (critical, indeed) answer! > > > > > > Le mercredi 27 février 2013 à 20:48 +0100, peter dalgaard a écrit : > > > > On Feb 27, 2013, at 19:46 , Milan Bouchet-Valat wrote: > > > > > > > > > I cannot believes nobody cares about this -- or I'm completely > wrong > > > and > > > > > in that case everybody should rush to put the shame on me... :- > p > > > > > > > > Well, nobs() is the number of observations. If you have 5 Poisson > > > > distributed counts, you have 5 observations. > > > Well, say that to the statistical offices that spend millions to > survey > > > thousands of people with correct (but complex) sampling designs, > they'll > > > be happy to know that the collected data only provides an > information > > > equivalent to 5 independent outcomes. ;-) > > > > Milan: > > > > It seems to me you are mixing up Binomial and Poisson situations, > > and not assessing independence appropriately. > > > > The above example discusses Bernoulli outcomes which are sometimes > > aggregated into Binomial "cases" depending on the study design. > > Now if the survey samples people in the same household or even > > neighbourhood, those Bernoulli outcomes will not be independent > > (hence clustered survey techniques) and summing the Binomial > > denominators would not be appropriate, for the survey analysis or > > for BIC calculations. The "n" in the BIC calculation should > > reflect independent observations. If you knock on the same > > door 1000 times and ask the person who they will vote for, > > you do not have 1000 independent observations, even though > > your Binomial denominator is 1000. > My intention was not to introduce the issue of survey designs into the > discussion, but merely to make the point that in surveys, counts are > usually *to some extent at least* independent observations, even when > clustering is present, and that the fact that different people are > asked > and that each answer costs money is the best indication of that. > Anyway, > BIC does not apply if we are not assuming that the data comes from a > simple random sample, so let's leave this complication aside. > > > The example you show from ?glm is a Poisson example showing > > 9 independent Poisson counts. If I count the number of cars > > passing through an intersection during non-overlapping > > one minute intervals (say 9 such intervals), then the number > > of observations I have is the number of non-overlapping > > one minute interval car count totals (e.g. the nine counts > > c(18, 17, 15, 20, 10, 20, 25, 13, 12)), not the number of > > cars I saw in total. > Interesting. Indeed in the observation setting you describe, 9 is > AFAICT > the correct number of observations. Is this kind of data commonly > fitted > using glm()? > > Do you happen to possess a copy of the book where the ?glm example > comes > from? There are not many of them here in France so I cannot consult it > easily. It seems to me that in the context of a randomized controlled > trial, the number of independent observations is the number of > subjects, > not the number of groups. And thus, BIC() would still return a wrong > value for the ?glm example. > > > A piece of software that adds things up can not know the > > context from which the numbers were derived, so you have to > > figure out the level of independence appropriate to your > > study design and work out the BIC count accordingly. > This is a strong argument indeed. It would mean that BIC() is at best a > function of very limited use, or even a dangerous one, unless one can > safely assume that the case were the number of observations equals the > number of rows in the data is by far the most common one. I am biased > due to my use of log-linear models, but I doubt this is the case. Is it > (I might perfectly be wrong)? > > > Raftery alludes to this in a preceding section: > > > > "When the data have been collected using a complex survey > > design with resulting weights, it is not yet clear what n > > should be, and this issue awaits further study. However, > > it seems reasonable that if the model is based on an > > assumption of simple random sampling but the sampling > > design is less efficient, then n should be reduced to > > reflect the efficiency of the sampling design relative to > > simple random sampling." > I think Raftery had in mind surveys in which the assumption of > independence between observations (counts, not rows) does not hold, but > where it is still the reference from which the sample deviates (lower > "efficiency"). In this case, the number of cells/rows by no means a > good > measure of the number of observations either -- but as I said BIC is > usually considered as not defined in this case. > > > Thanks for sharing your remarks. > > ______________________________________________ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel