[R] Overdispersion in a GLM binomial model
Hello,

For each pair of voters, the share of concurring votes (i.e. yes-yes and no-no) in their total votes is modelled as a function of their ideological distance (an index, continuous on [1,2]). I show by other means that the votes are typically highly positively correlated (with an average c = 0.6). This is because voters sit together and discuss the issue before taking a vote, but also because they share common ideologies.

The coefficient is significant, its sign is correct, and the fit looks good: R-sq.(adj) = 0.866. BUT there seems to be massive overdispersion: Deviance explained = 39.3%, Residual deviance: 3874.0 on 102 degrees of freedom. AND the residual-fitted plot shows heteroscedasticity.

The overdispersion cannot be remedied by regressing on LOG(index), or by using the quasibinomial family with a scale parameter for the variance. The estimated dispersion parameter for the quasibinomial family is large: 37.34917.

QUESTION: Is there overdispersion? Can overdispersion be due to correlation between the votes? What can be done?

The data are attached below: v1 = concurring votes, v2 = dissenting votes, idist = the index.
Thanks,
Serguei

DATA:
v1,v2,idist
376,40,1.125
328,88,1.375
367,49,1.145
372,44,1.125
273,143,1
325,91,1.125
375,41,1.125
357,59,1.375
751,359,1.885
816,294,1
752,358,1.885
829,281,1.3
857,253,1.05
759,351,1.07
848,262,1.135
803,307,1.385
555,555,1.885
346,70,1.5
381,35,1.27
398,18,1.25
289,127,1.125
1003,107,1
580,530,1.585
628,482,1.835
502,608,1.955
745,365,1.75
343,73,1.25
407,9,1.25
373,43,1.5
587,96,1.205
507,176,1.11
528,155,1.06
473,210,1.43
436,247,1.475
585,98,1.145
541,142,1.225
425,258,1.315
540,570,1.885
975,135,1.3
959,151,1.05
973,137,1.07
772,338,1.135
879,231,1.385
327,89,1.23
332,84,1.25
331,85,1.375
339,77,1.25
345,71,1.25
353,63,1
373,43,1.02
266,150,1.145
318,98,1.02
384,32,1.02
346,70,1.23
519,164,1.315
512,171,1.265
481,202,1.635
446,237,1.68
613,70,1.35
553,130,1.43
435,248,1.52
291,125,1.125
345,71,1
397,19,1
357,59,1.25
338,78,1.125
286,130,1.125
326,90,1.375
588,95,1.05
597,86,1.32
564,119,1.365
537,146,1.035
445,238,1.115
559,124,1.205
565,545,1.585
613,497,1.835
485,625,1.955
736,374,1.75
583,527,1.5
954,156,1.25
972,138,1.37
557,126,1.415
540,143,1.085
811,299,1.165
560,123,1.255
846,264,1.085
928,182,1.12
819,291,1.085
872,238,1.335
602,81,1.045
497,186,1.285
745,365,1.205
599,84,1.115
834,276,1.455
468,215,1.33
360,323,1.25
640,43,1.16
541,142,1.08
461,222,1.17
355,328,1.09
729,381,1.25
338,78,1
354,62,1.25
366,50,1.25

[[alternative HTML version deleted]]

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
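The dispersion figure quoted in the question can be reproduced in outline as the Pearson chi-square divided by the residual degrees of freedom. The sketch below is only illustrative: it uses the first ten rows of the posted data to stay short (so its numbers will not match the 37.3 reported for the full data set), and it assumes a plain logit-linear model in idist, which may not be the model the poster actually fitted.

```r
# First ten rows of the posted data, used only to keep the sketch short.
votes <- read.csv(text = "v1,v2,idist
376,40,1.125
328,88,1.375
367,49,1.145
372,44,1.125
273,143,1
325,91,1.125
375,41,1.125
357,59,1.375
751,359,1.885
816,294,1")

# Binomial GLM: concurring vs dissenting counts as a function of idist
# (assumed model form, not necessarily the poster's).
fit <- glm(cbind(v1, v2) ~ idist, family = binomial, data = votes)

# Pearson chi-square divided by the residual degrees of freedom.
pearson_disp <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

# The quasibinomial fit reports the same quantity as its dispersion.
qfit <- glm(cbind(v1, v2) ~ idist, family = quasibinomial, data = votes)
quasi_disp <- summary(qfit)$dispersion

print(c(pearson = pearson_disp, quasi = quasi_disp))
```

A value far above 1 says the counts vary more than a binomial model with independent votes allows, which is what positive correlation between votes within a pair would produce.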
Re: [R] overdispersion
I would say rather that for binary data (binomial data with n=1) it is not possible to detect overdispersion from examination of the Pearson chi-square or the deviance. Overdispersion may be, and often is, nevertheless present. I am arguing that overdispersion is properly regarded as a function of the variance-covariance structure, not as a function of the sample data.

The variance of a two-point distribution is a known function of the mean, providing that independence and identity of distribution can be assumed, or providing that the correlation structure is otherwise known and the mean is constant. That proviso is crucial!

If there is some sort of grouping, it may be appropriate to aggregate data over the groups, yielding data that have a binomial form with n > 1. Over-dispersion can now be detected from the Pearson chi-square or from the deviance.

Note that the quasi models assume that the multiplier for the binomial or other variance is constant with p; that may or may not be realistic. Generalized linear mixed models make their own different assumptions about how the variance changes as a function of p; again these may or may not be realistic.

It is then the error structure that is crucial. To the extent that it distracts from careful thinking about that structure, the term overdispersion is unsatisfactory.

There's no obvious way that I can see to supply glm() with an estimate of the dispersion that has been derived independently of the current analysis. Especially in the binary case, this would sometimes be useful.

John Maindonald
email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473  fax: +61 2 (6125)5549
Centre for Mathematics & Its Applications, Room 1194, John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
On 12 Jan 2007, at 10:00 PM, [EMAIL PROTECTED] wrote:

> From: Peter Dalgaard [EMAIL PROTECTED]
> Date: 12 January 2007 5:04:26 AM
> To: evaiannario [EMAIL PROTECTED]
> Cc: r-help@stat.math.ethz.ch
> Subject: Re: [R] overdispersion
>
> evaiannario wrote:
>> How can I eliminate the overdispersion for binary data apart from the use of the quasibinomial?
>
> There is no such thing as overdispersion for binary data. (The variance of a two-point distribution is a known function of the mean.) If what you want to do is include random effects of some sort of grouping then you might look into generalized linear mixed models via lmer() from the lme4 package or glmmPQL from MASS.
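John Maindonald's point about aggregation can be illustrated with simulated data. The sketch below assumes a hypothetical setup (200 groups of 20 binary observations, a group-level covariate x, and a normal random shift on the logit scale); none of these values come from the thread. It shows the Pearson dispersion staying near 1 for the ungrouped binary fit and rising well above 1 once the data are aggregated by group. As far as I know, summary.glm() in current R also accepts a dispersion argument, which is one way to apply an externally derived dispersion to the standard errors; the last lines use it that way.

```r
set.seed(1)

n_groups <- 200   # hypothetical number of groups
m        <- 20    # hypothetical number of binary observations per group

# Group-level covariate plus a group-level random shift on the logit scale;
# the shared shift makes observations within a group positively correlated.
x <- runif(n_groups)
u <- rnorm(n_groups, sd = 1)
p <- plogis(-0.5 + x + u)

binary <- data.frame(
  group = rep(seq_len(n_groups), each = m),
  x     = rep(x, each = m)
)
binary$y <- rbinom(nrow(binary), size = 1, prob = rep(p, each = m))

# Pearson chi-square divided by residual degrees of freedom.
pearson_disp <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

# Ungrouped (n = 1) fit: the statistic is uninformative about the clustering.
fit_binary  <- glm(y ~ x, family = binomial, data = binary)
disp_binary <- pearson_disp(fit_binary)

# Aggregate to one row per group: successes and failures out of m.
grouped      <- aggregate(y ~ group + x, data = binary, FUN = sum)
grouped$fail <- m - grouped$y
fit_grouped  <- glm(cbind(y, fail) ~ x, family = binomial, data = grouped)
disp_grouped <- pearson_disp(fit_grouped)

print(c(binary = disp_binary, grouped = disp_grouped))

# Apply the group-level dispersion estimate to the binary fit's standard errors.
se_default  <- summary(fit_binary)$coefficients[, "Std. Error"]
se_adjusted <- summary(fit_binary, dispersion = disp_grouped)$coefficients[, "Std. Error"]
print(cbind(se_default, se_adjusted))
```

Whether a single multiplier is an adequate description of the error structure is exactly the question raised in the message; the sketch only shows where the multiplier becomes visible.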
Re: [R] overdispersion
John Maindonald wrote:
> I would say rather that for binary data (binomial data with n=1) it is not possible to detect overdispersion from examination of the Pearson chi-square or the deviance. Overdispersion may be, and often is, nevertheless present. I am arguing that overdispersion is properly regarded as a function of the variance-covariance structure, not as a function of the sample data. The variance of a two-point distribution is a known function of the mean, providing that independence and identity of distribution can be assumed, or providing that the correlation structure is otherwise known and the mean is constant. That proviso is crucial!

I don't really disagree, of course. I was mainly being provocative. However, these models play tricks on our intuition. When people speak of overdispersion, they usually imply just what you said: independent data with the correct mean, but somehow a different variance, which is a mathematical impossibility for binary data.

One particular thing to notice is that if the individual means are heterogeneous but sampled independently from the same underlying distribution, you still end up with a marginal binomial distribution. If they are not sampled independently, then you get departures from the binomial, but it may well be in the direction of underdispersion. For an extreme case, take a sample of 50 men and 50 women and count the number of people with breasts. (If you do the same thing with a random sample of 100 _people_, you get the binomial distribution again. Unless you're counting the number of breasts...)

> If there is some sort of grouping, it may be appropriate to aggregate data over the groups, yielding data that have a binomial form with n > 1. Over-dispersion can now be detected from the Pearson chi-square or from the deviance. Note that the quasi models assume that the multiplier for the binomial or other variance is constant with p; that may or may not be realistic.
> Generalized linear mixed models make their own different assumptions about how the variance changes as a function of p; again these may or may not be realistic. It is then the error structure that is crucial. To the extent that it distracts from careful thinking about that structure, the term overdispersion is unsatisfactory. There's no obvious way that I can see to supply glm() with an estimate of the dispersion that has been derived independently of the current analysis. Especially in the binary case, this would sometimes be useful.
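Peter Dalgaard's remark in this message, that heterogeneous means give a binomial count under random sampling but underdispersion under stratified sampling, can be checked with a small simulation. The probabilities 0.1 and 0.9 below are hypothetical stand-ins for his extreme 0/1 example, chosen so that the stratified variance is small but not exactly zero.

```r
set.seed(2)

n_rep   <- 5000   # number of simulated samples
p_men   <- 0.1    # hypothetical success probability in group 1
p_women <- 0.9    # hypothetical success probability in group 2

# (a) Random sample of 100 people: group membership is itself random,
#     so each person is a success with marginal probability 0.5.
count_random <- replicate(n_rep, {
  is_woman <- rbinom(100, 1, 0.5)
  sum(rbinom(100, 1, ifelse(is_woman == 1, p_women, p_men)))
})

# (b) Stratified sample: exactly 50 from each group.
count_strat <- replicate(n_rep, {
  rbinom(1, 50, p_men) + rbinom(1, 50, p_women)
})

# Variance of a Binomial(100, 0.5) count, for reference.
binom_var <- 100 * 0.5 * 0.5

print(c(random     = var(count_random),
        stratified = var(count_strat),
        binomial   = binom_var))
```

Under these assumptions the random-sample variance should sit near the binomial value of 25, while the stratified variance should be near 2 * 50 * 0.1 * 0.9 = 9, i.e. underdispersed relative to the binomial.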
[R] overdispersion
How can I eliminate the overdispersion for binary data, apart from the use of the quasibinomial? Please help me.

Eva Iannario
Re: [R] overdispersion
evaiannario wrote:
> How can I eliminate the overdispersion for binary data apart from the use of the quasibinomial?

There is no such thing as overdispersion for binary data. (The variance of a two-point distribution is a known function of the mean.) If what you want to do is include random effects of some sort of grouping then you might look into generalized linear mixed models via lmer() from the lme4 package or glmmPQL from MASS.
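As a rough illustration of the mixed-model suggestion, the sketch below fits a random-intercept logistic model with glmmPQL() from MASS on simulated binary data. The data-generating values (100 groups of 10, slope 0.8, random-intercept standard deviation 1) are hypothetical. Note that in current lme4 releases the GLMM fitter is glmer() rather than lmer(); glmmPQL() is used here only because MASS ships with standard R installations.

```r
library(MASS)   # provides glmmPQL(), which fits via the nlme package

set.seed(3)

n_groups <- 100   # hypothetical number of groups
m        <- 10    # hypothetical number of binary observations per group

d <- data.frame(
  group = factor(rep(seq_len(n_groups), each = m)),
  x     = rnorm(n_groups * m)
)

# One random intercept per group on the logit scale.
u   <- rnorm(n_groups, sd = 1)
d$y <- rbinom(nrow(d), 1, plogis(-0.3 + 0.8 * d$x + u[as.integer(d$group)]))

# Logistic regression with a random intercept for each group,
# fitted by penalized quasi-likelihood.
fit <- glmmPQL(y ~ x, random = ~ 1 | group, family = binomial,
               data = d, verbose = FALSE)

print(nlme::fixef(fit))     # fixed-effect estimates
print(nlme::VarCorr(fit))   # random-intercept and residual variance components
```

Penalized quasi-likelihood is known to be biased for binary data with small groups, so the estimates here should be read as indicative rather than as a recommended analysis.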