Re: [R] mgcv: distribution of dev with Poisson data
Greg, The deviance being chi^2 distributed on the residual degrees of freedom works in some cases (mostly where the response itself can be reasonably approximated as Gaussian), but rather poorly in others (noteably low count data). This is what you are seeing in your simulations - in the first case the Poisson mean is relatively high - so the normal approximation to the Poisson is not too bad and the deviance is approx. chi.sq on residual df. In the second case the Poisson mean is low, there are lots of zeroes, and the approximation breaks down. So yes, the Pearson estimator is probably a better bet in the latter case. best, Simom ps. Note that the chi squared approximation for the *difference* in deviance between two nested models does not suffer from this problem. On 04/02/14 09:25, Greg Dropkin wrote: mgcv: distribution of dev hi I can't tell if this is a simple error. I'm puzzled by the distribution of dev when fitting a gam to Poisson generated data. I expected dev to be approximately chi-squared on residual d.f., i.e. about 1000 in each case below. In particular, the low values in the 3rd and 4th versions would suggest scale 1, yet the data is Poisson generated. The problem isn't caused by insufficient k values in the smoother. Does this mean that with sparse data the distribution of dev is no longer approx chi-sq on residual df? Does it mean the scale estimate when fitting quasipoisson should be the Pearson version? thanks greg library(mgcv) set.seed(1) x1-runif(1000) linp-2+exp(-2*x1)*sin(8*x1) sum(exp(linp)) dev1-dev2-sums-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linp)) sums[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1[j]=m1$dev dev2[j]=m2$dev } mean(sums) sd(sums) mean(dev1) sd(dev1) mean(dev2) sd(dev2) #dev slighly higher than expected linpa--1+exp(-2*x1)*sin(8*x1) sum(exp(linpa)) dev1a-dev2a-sumsa-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpa)) sumsa[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1a[j]=m1$dev dev2a[j]=m2$dev } mean(sumsa) sd(sumsa) mean(dev1a) sd(dev1a) mean(dev2a) sd(dev2a) #dev a bit lower than expected linpb--1.5+exp(-2*x1)*sin(8*x1) sum(exp(linpb)) dev1b-dev2b-sumsb-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpb)) sumsb[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1b[j]=m1$dev dev2b[j]=m2$dev } mean(sumsb) sd(sumsb) mean(dev1b) sd(dev1b) mean(dev2b) sd(dev2b) #dev much lower than expected m1q-gam(y~s(x1),family=quasipoisson,scale=-1) m1q$scale m1q$dev/(1000-sum(m1q$edf)) #scale estimate 1 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf)) #Pearson estimate of scale closer, but also 1 linpc--2+exp(-2*x1)*sin(8*x1) sum(exp(linpc)) dev1c-dev2c-sumsc-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpc)) sumsc[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1c[j]=m1$dev dev2c[j]=m2$dev } mean(sumsc) sd(sumsc) mean(dev1c) sd(dev1c) mean(dev2c) sd(dev2c) #dev much lower than expected m1q-gam(y~s(x1),family=quasipoisson,scale=-1) m1q$scale m1q$sig2 m1q$dev/(1000-sum(m1q$edf)) #scale estimate 1 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf)) #Pearson estimate of scale ok -- Simon Wood, Mathematical Science, University of Bath BA2 7AY UK +44 (0)1225 386603 http://people.bath.ac.uk/sw283 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] mgcv: distribution of dev with Poisson data
Hi Greg, Yes, this sounds right - with quasipoisson gam uses `extended quasi-likelihood' (see McCullagh and Nelder's GLM book) to allow estimation of the scale parameter along with the smoothing parameters via (RE)ML, and it could well be that this gives a biased scale estimate with low counts (although the shape of the mean variance relationship is unaffected by this). It might well be more sensible to default to a Pearson estimate of the scale parameter for 'gam' ('bam' and 'gamm' already do this), to avoid this issue with quasipoisson and low counts. Your email reminded me that someone had reported rather too high p-values for quasipoisson with these sort of data, and looking back at my list of open issues I see that `someone' was you as well! It's possible that moving to a Pearson estimator of the scale will solve that problem too. Thanks for this... very helpful. best, Simon On 05/02/14 12:56, Greg Dropkin wrote: thanks Simon also, it appears at least with ML that the default scale estimate with quasipoisson (i.e. using dev) is the scale which minimises the ML value of the fitted model. So it is the best model but doesn't actually give the correct mean-variance relation. Is that right? thanks again Greg Greg, The deviance being chi^2 distributed on the residual degrees of freedom works in some cases (mostly where the response itself can be reasonably approximated as Gaussian), but rather poorly in others (noteably low count data). This is what you are seeing in your simulations - in the first case the Poisson mean is relatively high - so the normal approximation to the Poisson is not too bad and the deviance is approx. chi.sq on residual df. In the second case the Poisson mean is low, there are lots of zeroes, and the approximation breaks down. So yes, the Pearson estimator is probably a better bet in the latter case. best, Simom ps. Note that the chi squared approximation for the *difference* in deviance between two nested models does not suffer from this problem. On 04/02/14 09:25, Greg Dropkin wrote: mgcv: distribution of dev hi I can't tell if this is a simple error. I'm puzzled by the distribution of dev when fitting a gam to Poisson generated data. I expected dev to be approximately chi-squared on residual d.f., i.e. about 1000 in each case below. In particular, the low values in the 3rd and 4th versions would suggest scale 1, yet the data is Poisson generated. The problem isn't caused by insufficient k values in the smoother. Does this mean that with sparse data the distribution of dev is no longer approx chi-sq on residual df? Does it mean the scale estimate when fitting quasipoisson should be the Pearson version? thanks greg library(mgcv) set.seed(1) x1-runif(1000) linp-2+exp(-2*x1)*sin(8*x1) sum(exp(linp)) dev1-dev2-sums-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linp)) sums[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1[j]=m1$dev dev2[j]=m2$dev } mean(sums) sd(sums) mean(dev1) sd(dev1) mean(dev2) sd(dev2) #dev slighly higher than expected linpa--1+exp(-2*x1)*sin(8*x1) sum(exp(linpa)) dev1a-dev2a-sumsa-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpa)) sumsa[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1a[j]=m1$dev dev2a[j]=m2$dev } mean(sumsa) sd(sumsa) mean(dev1a) sd(dev1a) mean(dev2a) sd(dev2a) #dev a bit lower than expected linpb--1.5+exp(-2*x1)*sin(8*x1) sum(exp(linpb)) dev1b-dev2b-sumsb-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpb)) sumsb[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1b[j]=m1$dev dev2b[j]=m2$dev } mean(sumsb) sd(sumsb) mean(dev1b) sd(dev1b) mean(dev2b) sd(dev2b) #dev much lower than expected m1q-gam(y~s(x1),family=quasipoisson,scale=-1) m1q$scale m1q$dev/(1000-sum(m1q$edf)) #scale estimate 1 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf)) #Pearson estimate of scale closer, but also 1 linpc--2+exp(-2*x1)*sin(8*x1) sum(exp(linpc)) dev1c-dev2c-sumsc-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpc)) sumsc[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1c[j]=m1$dev dev2c[j]=m2$dev } mean(sumsc) sd(sumsc) mean(dev1c) sd(dev1c) mean(dev2c) sd(dev2c) #dev much lower than expected m1q-gam(y~s(x1),family=quasipoisson,scale=-1) m1q$scale m1q$sig2 m1q$dev/(1000-sum(m1q$edf)) #scale estimate 1 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf)) #Pearson estimate of scale ok -- Simon Wood, Mathematical Science, University of Bath BA2 7AY UK +44 (0)1225 386603 http://people.bath.ac.uk/sw283 -- Simon Wood, Mathematical Science, University of Bath BA2 7AY UK +44 (0)1225 386603 http://people.bath.ac.uk/sw283 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented,
Re: [R] mgcv: distribution of dev with Poisson data
thanks Simon also, it appears at least with ML that the default scale estimate with quasipoisson (i.e. using dev) is the scale which minimises the ML value of the fitted model. So it is the best model but doesn't actually give the correct mean-variance relation. Is that right? thanks again Greg Greg, The deviance being chi^2 distributed on the residual degrees of freedom works in some cases (mostly where the response itself can be reasonably approximated as Gaussian), but rather poorly in others (noteably low count data). This is what you are seeing in your simulations - in the first case the Poisson mean is relatively high - so the normal approximation to the Poisson is not too bad and the deviance is approx. chi.sq on residual df. In the second case the Poisson mean is low, there are lots of zeroes, and the approximation breaks down. So yes, the Pearson estimator is probably a better bet in the latter case. best, Simom ps. Note that the chi squared approximation for the *difference* in deviance between two nested models does not suffer from this problem. On 04/02/14 09:25, Greg Dropkin wrote: mgcv: distribution of dev hi I can't tell if this is a simple error. I'm puzzled by the distribution of dev when fitting a gam to Poisson generated data. I expected dev to be approximately chi-squared on residual d.f., i.e. about 1000 in each case below. In particular, the low values in the 3rd and 4th versions would suggest scale 1, yet the data is Poisson generated. The problem isn't caused by insufficient k values in the smoother. Does this mean that with sparse data the distribution of dev is no longer approx chi-sq on residual df? Does it mean the scale estimate when fitting quasipoisson should be the Pearson version? thanks greg library(mgcv) set.seed(1) x1-runif(1000) linp-2+exp(-2*x1)*sin(8*x1) sum(exp(linp)) dev1-dev2-sums-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linp)) sums[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1[j]=m1$dev dev2[j]=m2$dev } mean(sums) sd(sums) mean(dev1) sd(dev1) mean(dev2) sd(dev2) #dev slighly higher than expected linpa--1+exp(-2*x1)*sin(8*x1) sum(exp(linpa)) dev1a-dev2a-sumsa-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpa)) sumsa[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1a[j]=m1$dev dev2a[j]=m2$dev } mean(sumsa) sd(sumsa) mean(dev1a) sd(dev1a) mean(dev2a) sd(dev2a) #dev a bit lower than expected linpb--1.5+exp(-2*x1)*sin(8*x1) sum(exp(linpb)) dev1b-dev2b-sumsb-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpb)) sumsb[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1b[j]=m1$dev dev2b[j]=m2$dev } mean(sumsb) sd(sumsb) mean(dev1b) sd(dev1b) mean(dev2b) sd(dev2b) #dev much lower than expected m1q-gam(y~s(x1),family=quasipoisson,scale=-1) m1q$scale m1q$dev/(1000-sum(m1q$edf)) #scale estimate 1 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf)) #Pearson estimate of scale closer, but also 1 linpc--2+exp(-2*x1)*sin(8*x1) sum(exp(linpc)) dev1c-dev2c-sumsc-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpc)) sumsc[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1c[j]=m1$dev dev2c[j]=m2$dev } mean(sumsc) sd(sumsc) mean(dev1c) sd(dev1c) mean(dev2c) sd(dev2c) #dev much lower than expected m1q-gam(y~s(x1),family=quasipoisson,scale=-1) m1q$scale m1q$sig2 m1q$dev/(1000-sum(m1q$edf)) #scale estimate 1 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf)) #Pearson estimate of scale ok -- Simon Wood, Mathematical Science, University of Bath BA2 7AY UK +44 (0)1225 386603 http://people.bath.ac.uk/sw283 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] mgcv: distribution of dev with Poisson data
On 05/02/2014 12:56, Greg Dropkin wrote: thanks Simon also, it appears at least with ML that the default scale estimate with quasipoisson (i.e. using dev) is the scale which minimises the ML value of the fitted model. So it is the best model but doesn't actually give the correct mean-variance relation. Is that right? Well, estimates do not usually give the correct value But ML is biased in many cases, sometimes badly so here, and most people prefer the Pearson estimator of the dispersion. This is McCullagn Nelder's book, even the first edition, but without much evidence: Venables Ripley (2002, §7.5) has plots to back up their hunch. thanks again Greg Greg, The deviance being chi^2 distributed on the residual degrees of freedom works in some cases (mostly where the response itself can be reasonably approximated as Gaussian), but rather poorly in others (noteably low count data). This is what you are seeing in your simulations - in the first case the Poisson mean is relatively high - so the normal approximation to the Poisson is not too bad and the deviance is approx. chi.sq on residual df. In the second case the Poisson mean is low, there are lots of zeroes, and the approximation breaks down. So yes, the Pearson estimator is probably a better bet in the latter case. best, Simom ps. Note that the chi squared approximation for the *difference* in deviance between two nested models does not suffer from this problem. On 04/02/14 09:25, Greg Dropkin wrote: mgcv: distribution of dev hi I can't tell if this is a simple error. I'm puzzled by the distribution of dev when fitting a gam to Poisson generated data. I expected dev to be approximately chi-squared on residual d.f., i.e. about 1000 in each case below. In particular, the low values in the 3rd and 4th versions would suggest scale 1, yet the data is Poisson generated. The problem isn't caused by insufficient k values in the smoother. Does this mean that with sparse data the distribution of dev is no longer approx chi-sq on residual df? Does it mean the scale estimate when fitting quasipoisson should be the Pearson version? thanks greg library(mgcv) set.seed(1) x1-runif(1000) linp-2+exp(-2*x1)*sin(8*x1) sum(exp(linp)) dev1-dev2-sums-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linp)) sums[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1[j]=m1$dev dev2[j]=m2$dev } mean(sums) sd(sums) mean(dev1) sd(dev1) mean(dev2) sd(dev2) #dev slighly higher than expected linpa--1+exp(-2*x1)*sin(8*x1) sum(exp(linpa)) dev1a-dev2a-sumsa-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpa)) sumsa[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1a[j]=m1$dev dev2a[j]=m2$dev } mean(sumsa) sd(sumsa) mean(dev1a) sd(dev1a) mean(dev2a) sd(dev2a) #dev a bit lower than expected linpb--1.5+exp(-2*x1)*sin(8*x1) sum(exp(linpb)) dev1b-dev2b-sumsb-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpb)) sumsb[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1b[j]=m1$dev dev2b[j]=m2$dev } mean(sumsb) sd(sumsb) mean(dev1b) sd(dev1b) mean(dev2b) sd(dev2b) #dev much lower than expected m1q-gam(y~s(x1),family=quasipoisson,scale=-1) m1q$scale m1q$dev/(1000-sum(m1q$edf)) #scale estimate 1 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf)) #Pearson estimate of scale closer, but also 1 linpc--2+exp(-2*x1)*sin(8*x1) sum(exp(linpc)) dev1c-dev2c-sumsc-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpc)) sumsc[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1c[j]=m1$dev dev2c[j]=m2$dev } mean(sumsc) sd(sumsc) mean(dev1c) sd(dev1c) mean(dev2c) sd(dev2c) #dev much lower than expected m1q-gam(y~s(x1),family=quasipoisson,scale=-1) m1q$scale m1q$sig2 m1q$dev/(1000-sum(m1q$edf)) #scale estimate 1 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf)) #Pearson estimate of scale ok -- Simon Wood, Mathematical Science, University of Bath BA2 7AY UK +44 (0)1225 386603 http://people.bath.ac.uk/sw283 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Brian D. Ripley, rip...@stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UKFax: +44 1865 272595 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] mgcv: distribution of dev with Poisson data
hi Simon yes, I also got the right shape of the mean-variance relation but the wrong estimate of the parameter. thanks very much Greg Hi Greg, Yes, this sounds right - with quasipoisson gam uses `extended quasi-likelihood' (see McCullagh and Nelder's GLM book) to allow estimation of the scale parameter along with the smoothing parameters via (RE)ML, and it could well be that this gives a biased scale estimate with low counts (although the shape of the mean variance relationship is unaffected by this). It might well be more sensible to default to a Pearson estimate of the scale parameter for 'gam' ('bam' and 'gamm' already do this), to avoid this issue with quasipoisson and low counts. Your email reminded me that someone had reported rather too high p-values for quasipoisson with these sort of data, and looking back at my list of open issues I see that `someone' was you as well! It's possible that moving to a Pearson estimator of the scale will solve that problem too. Thanks for this... very helpful. best, Simon On 05/02/14 12:56, Greg Dropkin wrote: thanks Simon also, it appears at least with ML that the default scale estimate with quasipoisson (i.e. using dev) is the scale which minimises the ML value of the fitted model. So it is the best model but doesn't actually give the correct mean-variance relation. Is that right? thanks again Greg Greg, The deviance being chi^2 distributed on the residual degrees of freedom works in some cases (mostly where the response itself can be reasonably approximated as Gaussian), but rather poorly in others (noteably low count data). This is what you are seeing in your simulations - in the first case the Poisson mean is relatively high - so the normal approximation to the Poisson is not too bad and the deviance is approx. chi.sq on residual df. In the second case the Poisson mean is low, there are lots of zeroes, and the approximation breaks down. So yes, the Pearson estimator is probably a better bet in the latter case. best, Simom ps. Note that the chi squared approximation for the *difference* in deviance between two nested models does not suffer from this problem. On 04/02/14 09:25, Greg Dropkin wrote: mgcv: distribution of dev hi I can't tell if this is a simple error. I'm puzzled by the distribution of dev when fitting a gam to Poisson generated data. I expected dev to be approximately chi-squared on residual d.f., i.e. about 1000 in each case below. In particular, the low values in the 3rd and 4th versions would suggest scale 1, yet the data is Poisson generated. The problem isn't caused by insufficient k values in the smoother. Does this mean that with sparse data the distribution of dev is no longer approx chi-sq on residual df? Does it mean the scale estimate when fitting quasipoisson should be the Pearson version? thanks greg library(mgcv) set.seed(1) x1-runif(1000) linp-2+exp(-2*x1)*sin(8*x1) sum(exp(linp)) dev1-dev2-sums-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linp)) sums[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1[j]=m1$dev dev2[j]=m2$dev } mean(sums) sd(sums) mean(dev1) sd(dev1) mean(dev2) sd(dev2) #dev slighly higher than expected linpa--1+exp(-2*x1)*sin(8*x1) sum(exp(linpa)) dev1a-dev2a-sumsa-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpa)) sumsa[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1a[j]=m1$dev dev2a[j]=m2$dev } mean(sumsa) sd(sumsa) mean(dev1a) sd(dev1a) mean(dev2a) sd(dev2a) #dev a bit lower than expected linpb--1.5+exp(-2*x1)*sin(8*x1) sum(exp(linpb)) dev1b-dev2b-sumsb-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpb)) sumsb[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1b[j]=m1$dev dev2b[j]=m2$dev } mean(sumsb) sd(sumsb) mean(dev1b) sd(dev1b) mean(dev2b) sd(dev2b) #dev much lower than expected m1q-gam(y~s(x1),family=quasipoisson,scale=-1) m1q$scale m1q$dev/(1000-sum(m1q$edf)) #scale estimate 1 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf)) #Pearson estimate of scale closer, but also 1 linpc--2+exp(-2*x1)*sin(8*x1) sum(exp(linpc)) dev1c-dev2c-sumsc-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpc)) sumsc[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1c[j]=m1$dev dev2c[j]=m2$dev } mean(sumsc) sd(sumsc) mean(dev1c) sd(dev1c) mean(dev2c) sd(dev2c) #dev much lower than expected m1q-gam(y~s(x1),family=quasipoisson,scale=-1) m1q$scale m1q$sig2 m1q$dev/(1000-sum(m1q$edf)) #scale estimate 1 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf)) #Pearson estimate of scale ok -- Simon Wood, Mathematical Science, University of Bath BA2 7AY UK +44 (0)1225 386603 http://people.bath.ac.uk/sw283 -- Simon Wood, Mathematical Science, University of Bath BA2 7AY UK +44 (0)1225
Re: [R] mgcv: distribution of dev with Poisson data
Dear Prof Ripley yes, but if the estimate is biased it's good to know what the bias is. The problem illustrated in the simulations has nothing to do with ML, though, as the default fitting method in mgcv when scale is unknown is GCV and that is what was used, by default, here. The point about ML was that I'd looked separately at what happens when scale varies in fitting a quasipoisson model, and found that ML was minimised (in an example) at the scale estimate (i.e. the default estimate using dev). A very brief look at the code made that seem plausible as it does involve the derivatives with respect to scale. In my limited experience ML is actually better behaved than GCV or REML, giving more conservative estimates and tighter CIs which behave better under simulation. re the Pearson estimates being preferable, that is also what Simon Wood's book says on p172, another reason why I was puzzled that the default estimate used dev instead. thanks Greg On 05/02/2014 12:56, Greg Dropkin wrote: thanks Simon also, it appears at least with ML that the default scale estimate with quasipoisson (i.e. using dev) is the scale which minimises the ML value of the fitted model. So it is the best model but doesn't actually give the correct mean-variance relation. Is that right? Well, estimates do not usually give the correct value But ML is biased in many cases, sometimes badly so here, and most people prefer the Pearson estimator of the dispersion. This is McCullagn Nelder's book, even the first edition, but without much evidence: Venables Ripley (2002, §7.5) has plots to back up their hunch. thanks again Greg Greg, The deviance being chi^2 distributed on the residual degrees of freedom works in some cases (mostly where the response itself can be reasonably approximated as Gaussian), but rather poorly in others (noteably low count data). This is what you are seeing in your simulations - in the first case the Poisson mean is relatively high - so the normal approximation to the Poisson is not too bad and the deviance is approx. chi.sq on residual df. In the second case the Poisson mean is low, there are lots of zeroes, and the approximation breaks down. So yes, the Pearson estimator is probably a better bet in the latter case. best, Simom ps. Note that the chi squared approximation for the *difference* in deviance between two nested models does not suffer from this problem. On 04/02/14 09:25, Greg Dropkin wrote: mgcv: distribution of dev hi I can't tell if this is a simple error. I'm puzzled by the distribution of dev when fitting a gam to Poisson generated data. I expected dev to be approximately chi-squared on residual d.f., i.e. about 1000 in each case below. In particular, the low values in the 3rd and 4th versions would suggest scale 1, yet the data is Poisson generated. The problem isn't caused by insufficient k values in the smoother. Does this mean that with sparse data the distribution of dev is no longer approx chi-sq on residual df? Does it mean the scale estimate when fitting quasipoisson should be the Pearson version? thanks greg library(mgcv) set.seed(1) x1-runif(1000) linp-2+exp(-2*x1)*sin(8*x1) sum(exp(linp)) dev1-dev2-sums-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linp)) sums[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1[j]=m1$dev dev2[j]=m2$dev } mean(sums) sd(sums) mean(dev1) sd(dev1) mean(dev2) sd(dev2) #dev slighly higher than expected linpa--1+exp(-2*x1)*sin(8*x1) sum(exp(linpa)) dev1a-dev2a-sumsa-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpa)) sumsa[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1a[j]=m1$dev dev2a[j]=m2$dev } mean(sumsa) sd(sumsa) mean(dev1a) sd(dev1a) mean(dev2a) sd(dev2a) #dev a bit lower than expected linpb--1.5+exp(-2*x1)*sin(8*x1) sum(exp(linpb)) dev1b-dev2b-sumsb-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpb)) sumsb[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1b[j]=m1$dev dev2b[j]=m2$dev } mean(sumsb) sd(sumsb) mean(dev1b) sd(dev1b) mean(dev2b) sd(dev2b) #dev much lower than expected m1q-gam(y~s(x1),family=quasipoisson,scale=-1) m1q$scale m1q$dev/(1000-sum(m1q$edf)) #scale estimate 1 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf)) #Pearson estimate of scale closer, but also 1 linpc--2+exp(-2*x1)*sin(8*x1) sum(exp(linpc)) dev1c-dev2c-sumsc-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpc)) sumsc[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1c[j]=m1$dev dev2c[j]=m2$dev } mean(sumsc) sd(sumsc) mean(dev1c) sd(dev1c) mean(dev2c) sd(dev2c) #dev much lower than expected m1q-gam(y~s(x1),family=quasipoisson,scale=-1) m1q$scale m1q$sig2 m1q$dev/(1000-sum(m1q$edf)) #scale estimate 1 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf)) #Pearson estimate of scale ok -- Simon Wood,
[R] mgcv: distribution of dev with Poisson data
mgcv: distribution of dev hi I can't tell if this is a simple error. I'm puzzled by the distribution of dev when fitting a gam to Poisson generated data. I expected dev to be approximately chi-squared on residual d.f., i.e. about 1000 in each case below. In particular, the low values in the 3rd and 4th versions would suggest scale 1, yet the data is Poisson generated. The problem isn't caused by insufficient k values in the smoother. Does this mean that with sparse data the distribution of dev is no longer approx chi-sq on residual df? Does it mean the scale estimate when fitting quasipoisson should be the Pearson version? thanks greg library(mgcv) set.seed(1) x1-runif(1000) linp-2+exp(-2*x1)*sin(8*x1) sum(exp(linp)) dev1-dev2-sums-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linp)) sums[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1[j]=m1$dev dev2[j]=m2$dev } mean(sums) sd(sums) mean(dev1) sd(dev1) mean(dev2) sd(dev2) #dev slighly higher than expected linpa--1+exp(-2*x1)*sin(8*x1) sum(exp(linpa)) dev1a-dev2a-sumsa-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpa)) sumsa[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1a[j]=m1$dev dev2a[j]=m2$dev } mean(sumsa) sd(sumsa) mean(dev1a) sd(dev1a) mean(dev2a) sd(dev2a) #dev a bit lower than expected linpb--1.5+exp(-2*x1)*sin(8*x1) sum(exp(linpb)) dev1b-dev2b-sumsb-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpb)) sumsb[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1b[j]=m1$dev dev2b[j]=m2$dev } mean(sumsb) sd(sumsb) mean(dev1b) sd(dev1b) mean(dev2b) sd(dev2b) #dev much lower than expected m1q-gam(y~s(x1),family=quasipoisson,scale=-1) m1q$scale m1q$dev/(1000-sum(m1q$edf)) #scale estimate 1 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf)) #Pearson estimate of scale closer, but also 1 linpc--2+exp(-2*x1)*sin(8*x1) sum(exp(linpc)) dev1c-dev2c-sumsc-rep(0,20) for (j in 1:20) { y-rpois(1000,exp(linpc)) sumsc[j]-sum(y) m1-gam(y~s(x1),family=poisson) m2-gam(y~s(x1,k=20),family=poisson) dev1c[j]=m1$dev dev2c[j]=m2$dev } mean(sumsc) sd(sumsc) mean(dev1c) sd(dev1c) mean(dev2c) sd(dev2c) #dev much lower than expected m1q-gam(y~s(x1),family=quasipoisson,scale=-1) m1q$scale m1q$sig2 m1q$dev/(1000-sum(m1q$edf)) #scale estimate 1 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf)) #Pearson estimate of scale ok __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.