Re: [R] mgcv: distribution of dev with Poisson data

2014-02-05 Thread Simon Wood

Greg,

The deviance being chi^2 distributed on the residual degrees of freedom 
works in some cases (mostly where the response itself can be reasonably 
approximated as Gaussian), but rather poorly in others (noteably low 
count data). This is what you are seeing in your simulations - in the 
first case the Poisson mean is relatively high - so the  normal 
approximation to the Poisson is not too bad and the deviance is approx. 
chi.sq on residual df. In the second case the Poisson mean is low, there 
are lots of zeroes, and the approximation breaks down. So yes, the 
Pearson estimator is probably a better bet in the latter case.


best,
Simom

ps. Note that the chi squared approximation for the *difference* in 
deviance between two nested models does not suffer from this problem.



On 04/02/14 09:25, Greg Dropkin wrote:

mgcv: distribution of dev

hi

I can't tell if this is a simple error.
I'm puzzled by the distribution of dev when fitting a gam to Poisson
generated data.
I expected dev to be approximately chi-squared on residual d.f., i.e.
about 1000 in each case below.
In particular, the low values in the 3rd and 4th versions would suggest
scale  1, yet the data is Poisson generated.
The problem isn't caused by insufficient k values in the smoother.
Does this mean that with sparse data the distribution of dev is no longer
approx chi-sq on residual df?
Does it mean the scale estimate when fitting quasipoisson should be the
Pearson version?

thanks

greg

library(mgcv)
set.seed(1)
x1-runif(1000)
linp-2+exp(-2*x1)*sin(8*x1)
sum(exp(linp))
dev1-dev2-sums-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linp))
sums[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1[j]=m1$dev
dev2[j]=m2$dev
}
mean(sums)
sd(sums)
mean(dev1)
sd(dev1)
mean(dev2)
sd(dev2)

#dev slighly higher than expected

linpa--1+exp(-2*x1)*sin(8*x1)
sum(exp(linpa))
dev1a-dev2a-sumsa-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linpa))
sumsa[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1a[j]=m1$dev
dev2a[j]=m2$dev
}
mean(sumsa)
sd(sumsa)
mean(dev1a)
sd(dev1a)
mean(dev2a)
sd(dev2a)

#dev a bit lower than expected

linpb--1.5+exp(-2*x1)*sin(8*x1)
sum(exp(linpb))
dev1b-dev2b-sumsb-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linpb))
sumsb[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1b[j]=m1$dev
dev2b[j]=m2$dev
}
mean(sumsb)
sd(sumsb)
mean(dev1b)
sd(dev1b)
mean(dev2b)
sd(dev2b)

#dev much lower than expected

m1q-gam(y~s(x1),family=quasipoisson,scale=-1)
m1q$scale
m1q$dev/(1000-sum(m1q$edf))

#scale estimate  1

sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))

#Pearson estimate of scale closer, but also  1


linpc--2+exp(-2*x1)*sin(8*x1)
sum(exp(linpc))
dev1c-dev2c-sumsc-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linpc))
sumsc[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1c[j]=m1$dev
dev2c[j]=m2$dev
}
mean(sumsc)
sd(sumsc)
mean(dev1c)
sd(dev1c)
mean(dev2c)
sd(dev2c)

#dev much lower than expected

m1q-gam(y~s(x1),family=quasipoisson,scale=-1)
m1q$scale
m1q$sig2
m1q$dev/(1000-sum(m1q$edf))

#scale estimate  1

sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))

#Pearson estimate of scale ok





--
Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
+44 (0)1225 386603   http://people.bath.ac.uk/sw283

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] mgcv: distribution of dev with Poisson data

2014-02-05 Thread Simon Wood

Hi Greg,

Yes, this sounds right - with quasipoisson gam uses `extended 
quasi-likelihood' (see McCullagh and Nelder's GLM book) to allow 
estimation of the scale parameter along with the smoothing parameters 
via (RE)ML, and it could well be that this gives a biased scale estimate 
with low counts (although the shape of the mean variance relationship is 
unaffected by this).


It might well be more sensible to default to a Pearson estimate of the 
scale parameter for 'gam' ('bam' and 'gamm' already do this), to avoid 
this issue with quasipoisson and low counts. Your email reminded me that 
someone had reported rather too high p-values for quasipoisson with 
these sort of data, and looking back at my list of open issues I see 
that `someone' was you as well! It's possible that moving to a Pearson 
estimator of the scale will solve that problem too.


Thanks for this... very helpful.

best,
Simon


On 05/02/14 12:56, Greg Dropkin wrote:

thanks Simon

also, it appears at least with ML that the default scale estimate with
quasipoisson (i.e. using dev) is the scale which minimises the ML value of
the fitted model. So it is the best model but doesn't actually give the
correct mean-variance relation. Is that right?

thanks again

Greg


Greg,

The deviance being chi^2 distributed on the residual degrees of freedom
works in some cases (mostly where the response itself can be reasonably
approximated as Gaussian), but rather poorly in others (noteably low
count data). This is what you are seeing in your simulations - in the
first case the Poisson mean is relatively high - so the  normal
approximation to the Poisson is not too bad and the deviance is approx.
chi.sq on residual df. In the second case the Poisson mean is low, there
are lots of zeroes, and the approximation breaks down. So yes, the
Pearson estimator is probably a better bet in the latter case.

best,
Simom

ps. Note that the chi squared approximation for the *difference* in
deviance between two nested models does not suffer from this problem.


On 04/02/14 09:25, Greg Dropkin wrote:

mgcv: distribution of dev

hi

I can't tell if this is a simple error.
I'm puzzled by the distribution of dev when fitting a gam to Poisson
generated data.
I expected dev to be approximately chi-squared on residual d.f., i.e.
about 1000 in each case below.
In particular, the low values in the 3rd and 4th versions would suggest
scale  1, yet the data is Poisson generated.
The problem isn't caused by insufficient k values in the smoother.
Does this mean that with sparse data the distribution of dev is no
longer
approx chi-sq on residual df?
Does it mean the scale estimate when fitting quasipoisson should be the
Pearson version?

thanks

greg

library(mgcv)
set.seed(1)
x1-runif(1000)
linp-2+exp(-2*x1)*sin(8*x1)
sum(exp(linp))
dev1-dev2-sums-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linp))
sums[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1[j]=m1$dev
dev2[j]=m2$dev
}
mean(sums)
sd(sums)
mean(dev1)
sd(dev1)
mean(dev2)
sd(dev2)

#dev slighly higher than expected

linpa--1+exp(-2*x1)*sin(8*x1)
sum(exp(linpa))
dev1a-dev2a-sumsa-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linpa))
sumsa[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1a[j]=m1$dev
dev2a[j]=m2$dev
}
mean(sumsa)
sd(sumsa)
mean(dev1a)
sd(dev1a)
mean(dev2a)
sd(dev2a)

#dev a bit lower than expected

linpb--1.5+exp(-2*x1)*sin(8*x1)
sum(exp(linpb))
dev1b-dev2b-sumsb-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linpb))
sumsb[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1b[j]=m1$dev
dev2b[j]=m2$dev
}
mean(sumsb)
sd(sumsb)
mean(dev1b)
sd(dev1b)
mean(dev2b)
sd(dev2b)

#dev much lower than expected

m1q-gam(y~s(x1),family=quasipoisson,scale=-1)
m1q$scale
m1q$dev/(1000-sum(m1q$edf))

#scale estimate  1

sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))

#Pearson estimate of scale closer, but also  1


linpc--2+exp(-2*x1)*sin(8*x1)
sum(exp(linpc))
dev1c-dev2c-sumsc-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linpc))
sumsc[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1c[j]=m1$dev
dev2c[j]=m2$dev
}
mean(sumsc)
sd(sumsc)
mean(dev1c)
sd(dev1c)
mean(dev2c)
sd(dev2c)

#dev much lower than expected

m1q-gam(y~s(x1),family=quasipoisson,scale=-1)
m1q$scale
m1q$sig2
m1q$dev/(1000-sum(m1q$edf))

#scale estimate  1

sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))

#Pearson estimate of scale ok





--
Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
+44 (0)1225 386603   http://people.bath.ac.uk/sw283








--
Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
+44 (0)1225 386603   http://people.bath.ac.uk/sw283

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, 

Re: [R] mgcv: distribution of dev with Poisson data

2014-02-05 Thread Greg Dropkin
thanks Simon

also, it appears at least with ML that the default scale estimate with
quasipoisson (i.e. using dev) is the scale which minimises the ML value of
the fitted model. So it is the best model but doesn't actually give the
correct mean-variance relation. Is that right?

thanks again

Greg

 Greg,

 The deviance being chi^2 distributed on the residual degrees of freedom
 works in some cases (mostly where the response itself can be reasonably
 approximated as Gaussian), but rather poorly in others (noteably low
 count data). This is what you are seeing in your simulations - in the
 first case the Poisson mean is relatively high - so the  normal
 approximation to the Poisson is not too bad and the deviance is approx.
 chi.sq on residual df. In the second case the Poisson mean is low, there
 are lots of zeroes, and the approximation breaks down. So yes, the
 Pearson estimator is probably a better bet in the latter case.

 best,
 Simom

 ps. Note that the chi squared approximation for the *difference* in
 deviance between two nested models does not suffer from this problem.


 On 04/02/14 09:25, Greg Dropkin wrote:
 mgcv: distribution of dev

 hi

 I can't tell if this is a simple error.
 I'm puzzled by the distribution of dev when fitting a gam to Poisson
 generated data.
 I expected dev to be approximately chi-squared on residual d.f., i.e.
 about 1000 in each case below.
 In particular, the low values in the 3rd and 4th versions would suggest
 scale  1, yet the data is Poisson generated.
 The problem isn't caused by insufficient k values in the smoother.
 Does this mean that with sparse data the distribution of dev is no
 longer
 approx chi-sq on residual df?
 Does it mean the scale estimate when fitting quasipoisson should be the
 Pearson version?

 thanks

 greg

 library(mgcv)
 set.seed(1)
 x1-runif(1000)
 linp-2+exp(-2*x1)*sin(8*x1)
 sum(exp(linp))
 dev1-dev2-sums-rep(0,20)
 for (j in 1:20)
 {
 y-rpois(1000,exp(linp))
 sums[j]-sum(y)
 m1-gam(y~s(x1),family=poisson)
 m2-gam(y~s(x1,k=20),family=poisson)
 dev1[j]=m1$dev
 dev2[j]=m2$dev
 }
 mean(sums)
 sd(sums)
 mean(dev1)
 sd(dev1)
 mean(dev2)
 sd(dev2)

 #dev slighly higher than expected

 linpa--1+exp(-2*x1)*sin(8*x1)
 sum(exp(linpa))
 dev1a-dev2a-sumsa-rep(0,20)
 for (j in 1:20)
 {
 y-rpois(1000,exp(linpa))
 sumsa[j]-sum(y)
 m1-gam(y~s(x1),family=poisson)
 m2-gam(y~s(x1,k=20),family=poisson)
 dev1a[j]=m1$dev
 dev2a[j]=m2$dev
 }
 mean(sumsa)
 sd(sumsa)
 mean(dev1a)
 sd(dev1a)
 mean(dev2a)
 sd(dev2a)

 #dev a bit lower than expected

 linpb--1.5+exp(-2*x1)*sin(8*x1)
 sum(exp(linpb))
 dev1b-dev2b-sumsb-rep(0,20)
 for (j in 1:20)
 {
 y-rpois(1000,exp(linpb))
 sumsb[j]-sum(y)
 m1-gam(y~s(x1),family=poisson)
 m2-gam(y~s(x1,k=20),family=poisson)
 dev1b[j]=m1$dev
 dev2b[j]=m2$dev
 }
 mean(sumsb)
 sd(sumsb)
 mean(dev1b)
 sd(dev1b)
 mean(dev2b)
 sd(dev2b)

 #dev much lower than expected

 m1q-gam(y~s(x1),family=quasipoisson,scale=-1)
 m1q$scale
 m1q$dev/(1000-sum(m1q$edf))

 #scale estimate  1

 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))

 #Pearson estimate of scale closer, but also  1


 linpc--2+exp(-2*x1)*sin(8*x1)
 sum(exp(linpc))
 dev1c-dev2c-sumsc-rep(0,20)
 for (j in 1:20)
 {
 y-rpois(1000,exp(linpc))
 sumsc[j]-sum(y)
 m1-gam(y~s(x1),family=poisson)
 m2-gam(y~s(x1,k=20),family=poisson)
 dev1c[j]=m1$dev
 dev2c[j]=m2$dev
 }
 mean(sumsc)
 sd(sumsc)
 mean(dev1c)
 sd(dev1c)
 mean(dev2c)
 sd(dev2c)

 #dev much lower than expected

 m1q-gam(y~s(x1),family=quasipoisson,scale=-1)
 m1q$scale
 m1q$sig2
 m1q$dev/(1000-sum(m1q$edf))

 #scale estimate  1

 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))

 #Pearson estimate of scale ok




 --
 Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
 +44 (0)1225 386603   http://people.bath.ac.uk/sw283



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] mgcv: distribution of dev with Poisson data

2014-02-05 Thread Prof Brian Ripley

On 05/02/2014 12:56, Greg Dropkin wrote:

thanks Simon

also, it appears at least with ML that the default scale estimate with
quasipoisson (i.e. using dev) is the scale which minimises the ML value of
the fitted model. So it is the best model but doesn't actually give the
correct mean-variance relation. Is that right?


Well, estimates do not usually give the correct value 

But ML is biased in many cases, sometimes badly so here, and most people 
prefer the Pearson estimator of the dispersion.  This is McCullagn  
Nelder's book, even the first edition, but without much evidence: 
Venables  Ripley (2002, §7.5) has plots to back up their hunch.




thanks again

Greg


Greg,

The deviance being chi^2 distributed on the residual degrees of freedom
works in some cases (mostly where the response itself can be reasonably
approximated as Gaussian), but rather poorly in others (noteably low
count data). This is what you are seeing in your simulations - in the
first case the Poisson mean is relatively high - so the  normal
approximation to the Poisson is not too bad and the deviance is approx.
chi.sq on residual df. In the second case the Poisson mean is low, there
are lots of zeroes, and the approximation breaks down. So yes, the
Pearson estimator is probably a better bet in the latter case.

best,
Simom

ps. Note that the chi squared approximation for the *difference* in
deviance between two nested models does not suffer from this problem.


On 04/02/14 09:25, Greg Dropkin wrote:

mgcv: distribution of dev

hi

I can't tell if this is a simple error.
I'm puzzled by the distribution of dev when fitting a gam to Poisson
generated data.
I expected dev to be approximately chi-squared on residual d.f., i.e.
about 1000 in each case below.
In particular, the low values in the 3rd and 4th versions would suggest
scale  1, yet the data is Poisson generated.
The problem isn't caused by insufficient k values in the smoother.
Does this mean that with sparse data the distribution of dev is no
longer
approx chi-sq on residual df?
Does it mean the scale estimate when fitting quasipoisson should be the
Pearson version?

thanks

greg

library(mgcv)
set.seed(1)
x1-runif(1000)
linp-2+exp(-2*x1)*sin(8*x1)
sum(exp(linp))
dev1-dev2-sums-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linp))
sums[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1[j]=m1$dev
dev2[j]=m2$dev
}
mean(sums)
sd(sums)
mean(dev1)
sd(dev1)
mean(dev2)
sd(dev2)

#dev slighly higher than expected

linpa--1+exp(-2*x1)*sin(8*x1)
sum(exp(linpa))
dev1a-dev2a-sumsa-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linpa))
sumsa[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1a[j]=m1$dev
dev2a[j]=m2$dev
}
mean(sumsa)
sd(sumsa)
mean(dev1a)
sd(dev1a)
mean(dev2a)
sd(dev2a)

#dev a bit lower than expected

linpb--1.5+exp(-2*x1)*sin(8*x1)
sum(exp(linpb))
dev1b-dev2b-sumsb-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linpb))
sumsb[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1b[j]=m1$dev
dev2b[j]=m2$dev
}
mean(sumsb)
sd(sumsb)
mean(dev1b)
sd(dev1b)
mean(dev2b)
sd(dev2b)

#dev much lower than expected

m1q-gam(y~s(x1),family=quasipoisson,scale=-1)
m1q$scale
m1q$dev/(1000-sum(m1q$edf))

#scale estimate  1

sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))

#Pearson estimate of scale closer, but also  1


linpc--2+exp(-2*x1)*sin(8*x1)
sum(exp(linpc))
dev1c-dev2c-sumsc-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linpc))
sumsc[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1c[j]=m1$dev
dev2c[j]=m2$dev
}
mean(sumsc)
sd(sumsc)
mean(dev1c)
sd(dev1c)
mean(dev2c)
sd(dev2c)

#dev much lower than expected

m1q-gam(y~s(x1),family=quasipoisson,scale=-1)
m1q$scale
m1q$sig2
m1q$dev/(1000-sum(m1q$edf))

#scale estimate  1

sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))

#Pearson estimate of scale ok





--
Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
+44 (0)1225 386603   http://people.bath.ac.uk/sw283




__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] mgcv: distribution of dev with Poisson data

2014-02-05 Thread Greg Dropkin
hi Simon

yes, I also got the right shape of the mean-variance relation but the
wrong estimate of the parameter.

thanks very much

Greg

 Hi Greg,

 Yes, this sounds right - with quasipoisson gam uses `extended
 quasi-likelihood' (see McCullagh and Nelder's GLM book) to allow
 estimation of the scale parameter along with the smoothing parameters
 via (RE)ML, and it could well be that this gives a biased scale estimate
 with low counts (although the shape of the mean variance relationship is
 unaffected by this).

 It might well be more sensible to default to a Pearson estimate of the
 scale parameter for 'gam' ('bam' and 'gamm' already do this), to avoid
 this issue with quasipoisson and low counts. Your email reminded me that
 someone had reported rather too high p-values for quasipoisson with
 these sort of data, and looking back at my list of open issues I see
 that `someone' was you as well! It's possible that moving to a Pearson
 estimator of the scale will solve that problem too.

 Thanks for this... very helpful.

 best,
 Simon


 On 05/02/14 12:56, Greg Dropkin wrote:
 thanks Simon

 also, it appears at least with ML that the default scale estimate with
 quasipoisson (i.e. using dev) is the scale which minimises the ML value
 of
 the fitted model. So it is the best model but doesn't actually give
 the
 correct mean-variance relation. Is that right?

 thanks again

 Greg

 Greg,

 The deviance being chi^2 distributed on the residual degrees of freedom
 works in some cases (mostly where the response itself can be reasonably
 approximated as Gaussian), but rather poorly in others (noteably low
 count data). This is what you are seeing in your simulations - in the
 first case the Poisson mean is relatively high - so the  normal
 approximation to the Poisson is not too bad and the deviance is approx.
 chi.sq on residual df. In the second case the Poisson mean is low,
 there
 are lots of zeroes, and the approximation breaks down. So yes, the
 Pearson estimator is probably a better bet in the latter case.

 best,
 Simom

 ps. Note that the chi squared approximation for the *difference* in
 deviance between two nested models does not suffer from this problem.


 On 04/02/14 09:25, Greg Dropkin wrote:
 mgcv: distribution of dev

 hi

 I can't tell if this is a simple error.
 I'm puzzled by the distribution of dev when fitting a gam to Poisson
 generated data.
 I expected dev to be approximately chi-squared on residual d.f., i.e.
 about 1000 in each case below.
 In particular, the low values in the 3rd and 4th versions would
 suggest
 scale  1, yet the data is Poisson generated.
 The problem isn't caused by insufficient k values in the smoother.
 Does this mean that with sparse data the distribution of dev is no
 longer
 approx chi-sq on residual df?
 Does it mean the scale estimate when fitting quasipoisson should be
 the
 Pearson version?

 thanks

 greg

 library(mgcv)
 set.seed(1)
 x1-runif(1000)
 linp-2+exp(-2*x1)*sin(8*x1)
 sum(exp(linp))
 dev1-dev2-sums-rep(0,20)
 for (j in 1:20)
 {
 y-rpois(1000,exp(linp))
 sums[j]-sum(y)
 m1-gam(y~s(x1),family=poisson)
 m2-gam(y~s(x1,k=20),family=poisson)
 dev1[j]=m1$dev
 dev2[j]=m2$dev
 }
 mean(sums)
 sd(sums)
 mean(dev1)
 sd(dev1)
 mean(dev2)
 sd(dev2)

 #dev slighly higher than expected

 linpa--1+exp(-2*x1)*sin(8*x1)
 sum(exp(linpa))
 dev1a-dev2a-sumsa-rep(0,20)
 for (j in 1:20)
 {
 y-rpois(1000,exp(linpa))
 sumsa[j]-sum(y)
 m1-gam(y~s(x1),family=poisson)
 m2-gam(y~s(x1,k=20),family=poisson)
 dev1a[j]=m1$dev
 dev2a[j]=m2$dev
 }
 mean(sumsa)
 sd(sumsa)
 mean(dev1a)
 sd(dev1a)
 mean(dev2a)
 sd(dev2a)

 #dev a bit lower than expected

 linpb--1.5+exp(-2*x1)*sin(8*x1)
 sum(exp(linpb))
 dev1b-dev2b-sumsb-rep(0,20)
 for (j in 1:20)
 {
 y-rpois(1000,exp(linpb))
 sumsb[j]-sum(y)
 m1-gam(y~s(x1),family=poisson)
 m2-gam(y~s(x1,k=20),family=poisson)
 dev1b[j]=m1$dev
 dev2b[j]=m2$dev
 }
 mean(sumsb)
 sd(sumsb)
 mean(dev1b)
 sd(dev1b)
 mean(dev2b)
 sd(dev2b)

 #dev much lower than expected

 m1q-gam(y~s(x1),family=quasipoisson,scale=-1)
 m1q$scale
 m1q$dev/(1000-sum(m1q$edf))

 #scale estimate  1

 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))

 #Pearson estimate of scale closer, but also  1


 linpc--2+exp(-2*x1)*sin(8*x1)
 sum(exp(linpc))
 dev1c-dev2c-sumsc-rep(0,20)
 for (j in 1:20)
 {
 y-rpois(1000,exp(linpc))
 sumsc[j]-sum(y)
 m1-gam(y~s(x1),family=poisson)
 m2-gam(y~s(x1,k=20),family=poisson)
 dev1c[j]=m1$dev
 dev2c[j]=m2$dev
 }
 mean(sumsc)
 sd(sumsc)
 mean(dev1c)
 sd(dev1c)
 mean(dev2c)
 sd(dev2c)

 #dev much lower than expected

 m1q-gam(y~s(x1),family=quasipoisson,scale=-1)
 m1q$scale
 m1q$sig2
 m1q$dev/(1000-sum(m1q$edf))

 #scale estimate  1

 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))

 #Pearson estimate of scale ok




 --
 Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
 +44 (0)1225 386603   http://people.bath.ac.uk/sw283






 --
 Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
 +44 (0)1225 

Re: [R] mgcv: distribution of dev with Poisson data

2014-02-05 Thread Greg Dropkin
Dear Prof Ripley

yes, but if the estimate is biased it's good to know what the bias is.

The problem illustrated in the simulations has nothing to do with ML,
though, as the default fitting method in mgcv when scale is unknown is
GCV and that is what was used, by default, here.

The point about ML was that I'd looked separately at what happens when
scale varies in fitting a quasipoisson model, and found that ML was
minimised (in an example) at the scale estimate (i.e. the default estimate
using dev). A very brief look at the code made that seem plausible as it
does involve the derivatives with respect to scale.

In my limited experience ML is actually better behaved than GCV or REML,
giving more conservative estimates and tighter CIs which behave better
under simulation.

re the Pearson estimates being preferable, that is also what Simon Wood's
book says on p172, another reason why I was puzzled that the default
estimate used dev instead.

thanks

Greg

 On 05/02/2014 12:56, Greg Dropkin wrote:
 thanks Simon
 also, it appears at least with ML that the default scale estimate with
quasipoisson (i.e. using dev) is the scale which minimises the ML value
of
 the fitted model. So it is the best model but doesn't actually give the
 correct mean-variance relation. Is that right?

 Well, estimates do not usually give the correct value 

 But ML is biased in many cases, sometimes badly so here, and most people
prefer the Pearson estimator of the dispersion.  This is McCullagn 
Nelder's book, even the first edition, but without much evidence:
Venables  Ripley (2002, §7.5) has plots to back up their hunch.

 thanks again
 Greg
 Greg,
 The deviance being chi^2 distributed on the residual degrees of
freedom
 works in some cases (mostly where the response itself can be
reasonably
 approximated as Gaussian), but rather poorly in others (noteably low
count data). This is what you are seeing in your simulations - in the
first case the Poisson mean is relatively high - so the  normal
approximation to the Poisson is not too bad and the deviance is
approx.
 chi.sq on residual df. In the second case the Poisson mean is low, there
 are lots of zeroes, and the approximation breaks down. So yes, the
Pearson estimator is probably a better bet in the latter case. best,
 Simom
 ps. Note that the chi squared approximation for the *difference* in
deviance between two nested models does not suffer from this problem.
On 04/02/14 09:25, Greg Dropkin wrote:
 mgcv: distribution of dev
 hi
 I can't tell if this is a simple error.
 I'm puzzled by the distribution of dev when fitting a gam to Poisson
generated data.
 I expected dev to be approximately chi-squared on residual d.f., i.e.
about 1000 in each case below.
 In particular, the low values in the 3rd and 4th versions would suggest
 scale  1, yet the data is Poisson generated.
 The problem isn't caused by insufficient k values in the smoother.
Does this mean that with sparse data the distribution of dev is no
longer
 approx chi-sq on residual df?
 Does it mean the scale estimate when fitting quasipoisson should be the
 Pearson version?
 thanks
 greg
 library(mgcv)
 set.seed(1)
 x1-runif(1000)
 linp-2+exp(-2*x1)*sin(8*x1)
 sum(exp(linp))
 dev1-dev2-sums-rep(0,20)
 for (j in 1:20)
 {
 y-rpois(1000,exp(linp))
 sums[j]-sum(y)
 m1-gam(y~s(x1),family=poisson)
 m2-gam(y~s(x1,k=20),family=poisson)
 dev1[j]=m1$dev
 dev2[j]=m2$dev
 }
 mean(sums)
 sd(sums)
 mean(dev1)
 sd(dev1)
 mean(dev2)
 sd(dev2)
 #dev slighly higher than expected
 linpa--1+exp(-2*x1)*sin(8*x1)
 sum(exp(linpa))
 dev1a-dev2a-sumsa-rep(0,20)
 for (j in 1:20)
 {
 y-rpois(1000,exp(linpa))
 sumsa[j]-sum(y)
 m1-gam(y~s(x1),family=poisson)
 m2-gam(y~s(x1,k=20),family=poisson)
 dev1a[j]=m1$dev
 dev2a[j]=m2$dev
 }
 mean(sumsa)
 sd(sumsa)
 mean(dev1a)
 sd(dev1a)
 mean(dev2a)
 sd(dev2a)
 #dev a bit lower than expected
 linpb--1.5+exp(-2*x1)*sin(8*x1)
 sum(exp(linpb))
 dev1b-dev2b-sumsb-rep(0,20)
 for (j in 1:20)
 {
 y-rpois(1000,exp(linpb))
 sumsb[j]-sum(y)
 m1-gam(y~s(x1),family=poisson)
 m2-gam(y~s(x1,k=20),family=poisson)
 dev1b[j]=m1$dev
 dev2b[j]=m2$dev
 }
 mean(sumsb)
 sd(sumsb)
 mean(dev1b)
 sd(dev1b)
 mean(dev2b)
 sd(dev2b)
 #dev much lower than expected
 m1q-gam(y~s(x1),family=quasipoisson,scale=-1)
 m1q$scale
 m1q$dev/(1000-sum(m1q$edf))
 #scale estimate  1
 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))
 #Pearson estimate of scale closer, but also  1
 linpc--2+exp(-2*x1)*sin(8*x1)
 sum(exp(linpc))
 dev1c-dev2c-sumsc-rep(0,20)
 for (j in 1:20)
 {
 y-rpois(1000,exp(linpc))
 sumsc[j]-sum(y)
 m1-gam(y~s(x1),family=poisson)
 m2-gam(y~s(x1,k=20),family=poisson)
 dev1c[j]=m1$dev
 dev2c[j]=m2$dev
 }
 mean(sumsc)
 sd(sumsc)
 mean(dev1c)
 sd(dev1c)
 mean(dev2c)
 sd(dev2c)
 #dev much lower than expected
 m1q-gam(y~s(x1),family=quasipoisson,scale=-1)
 m1q$scale
 m1q$sig2
 m1q$dev/(1000-sum(m1q$edf))
 #scale estimate  1
 sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))
 #Pearson estimate of scale ok
 --
 Simon Wood, 

[R] mgcv: distribution of dev with Poisson data

2014-02-04 Thread Greg Dropkin
mgcv: distribution of dev

hi

I can't tell if this is a simple error.
I'm puzzled by the distribution of dev when fitting a gam to Poisson
generated data.
I expected dev to be approximately chi-squared on residual d.f., i.e.
about 1000 in each case below.
In particular, the low values in the 3rd and 4th versions would suggest
scale  1, yet the data is Poisson generated.
The problem isn't caused by insufficient k values in the smoother.
Does this mean that with sparse data the distribution of dev is no longer
approx chi-sq on residual df?
Does it mean the scale estimate when fitting quasipoisson should be the
Pearson version?

thanks

greg

library(mgcv)
set.seed(1)
x1-runif(1000)
linp-2+exp(-2*x1)*sin(8*x1)
sum(exp(linp))
dev1-dev2-sums-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linp))
sums[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1[j]=m1$dev
dev2[j]=m2$dev
}
mean(sums)
sd(sums)
mean(dev1)
sd(dev1)
mean(dev2)
sd(dev2)

#dev slighly higher than expected

linpa--1+exp(-2*x1)*sin(8*x1)
sum(exp(linpa))
dev1a-dev2a-sumsa-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linpa))
sumsa[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1a[j]=m1$dev
dev2a[j]=m2$dev
}
mean(sumsa)
sd(sumsa)
mean(dev1a)
sd(dev1a)
mean(dev2a)
sd(dev2a)

#dev a bit lower than expected

linpb--1.5+exp(-2*x1)*sin(8*x1)
sum(exp(linpb))
dev1b-dev2b-sumsb-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linpb))
sumsb[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1b[j]=m1$dev
dev2b[j]=m2$dev
}
mean(sumsb)
sd(sumsb)
mean(dev1b)
sd(dev1b)
mean(dev2b)
sd(dev2b)

#dev much lower than expected

m1q-gam(y~s(x1),family=quasipoisson,scale=-1)
m1q$scale
m1q$dev/(1000-sum(m1q$edf))

#scale estimate  1

sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))

#Pearson estimate of scale closer, but also  1


linpc--2+exp(-2*x1)*sin(8*x1)
sum(exp(linpc))
dev1c-dev2c-sumsc-rep(0,20)
for (j in 1:20)
{
y-rpois(1000,exp(linpc))
sumsc[j]-sum(y)
m1-gam(y~s(x1),family=poisson)
m2-gam(y~s(x1,k=20),family=poisson)
dev1c[j]=m1$dev
dev2c[j]=m2$dev
}
mean(sumsc)
sd(sumsc)
mean(dev1c)
sd(dev1c)
mean(dev2c)
sd(dev2c)

#dev much lower than expected

m1q-gam(y~s(x1),family=quasipoisson,scale=-1)
m1q$scale
m1q$sig2
m1q$dev/(1000-sum(m1q$edf))

#scale estimate  1

sum((y-fitted(m1q))^2/fitted(m1q))/(1000-sum(m1q$edf))

#Pearson estimate of scale ok

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.