Re: [R] bigglm() results different from glm()

Thomas Lumley Tue, 17 Mar 2009 10:31:29 -0700


Yes, the slow convergence is easier to get with the log link.  Overshooting the 
correct coefficient vector has more dramatic effects on the fitted values and 
weights, and in this example the starting value of (0,0) is a long way from the 
truth.   The same sort of thing happens in the Cox model, where there are real 
data sets that will cause numeric overflow in a careless implementation.


It might be worth trying to guess better starting values: saving an iteration 
or two would be useful with large data sets.

     -thomas


On Tue, 17 Mar 2009, John Fox wrote:

Dear Francisco,

I was able to duplicate the problem that you reported, and in addition
discovered that the problem seems to be peculiar to the poisson family.
lm(y~ttment, data=dat) and biglm(y~ttment, data=dat) produce identical
results, as do glm(y~ttment, data=dat) and bigglm(y~ttment, data=dat).
Another example, with the binomial family:

set.seed(12345)
dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)),

ttment=gl(2,50000))

dat$y0 <- ifelse(dat$y > 12, 1, 0)
m1b <- glm(y0~ttment, data=dat, family=binomial)
m1bigb <- bigglm(y0~ttment , data=dat, family=binomial())
coef(m1b)

(Intercept)     ttment2
  -1.33508     2.34263

coef(m1bigb)

(Intercept)     ttment2
  -1.33508     2.34263

deviance(m1b)

[1] 109244

deviance(m1bigb)

[1] 109244

I'm copying this message to Thomas Lumley, who's the author and maintainer
of the biglm package.

Regards,
John

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]

On

Behalf Of Francisco J. Zagmutt
Sent: March-16-09 10:26 PM
To: r-h...@stat.math.ethz.ch
Subject: [R] bigglm() results different from glm()

Dear all,

I am using the bigglm package to fit a few GLM's to a large dataset (3
million rows, 6 columns).  While trying to fit a Poisson GLM I noticed
that the coefficient estimates were very different from what I obtained
when estimating the model on a smaller dataset using glm(), I wrote a
very basic toy example to compare the results of bigglm() against a
glm() call.  Consider the following code:

> require(biglm)
> options(digits=6, scipen=3, contrasts = c("contr.treatment",
"contr.poly"))
> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)),
ttment=gl(2,50000))
> m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
> m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"))
> summary(m1)

<snipped output for this email>
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.30305    0.00141    1629   <2e-16 ***
ttment2      0.40429    0.00183     221   <2e-16 ***

     Null deviance: 151889  on 99999  degrees of freedom
Residual deviance: 101848  on 99998  degrees of freedom
AIC: 533152

> summary(m1big)
Large data regression model: bigglm(y ~ ttment, data = dat, family =
poisson(link = "log"))
Sample size =  100000
              Coef  (95%   CI)    SE p
(Intercept) 2.651 2.650 2.653 0.001 0
ttment2     4.346 4.344 4.348 0.001 0

> m1big$deviance
[1] 287158986

Notice that the coefficients and deviance are quite different in the
model estimated using bigglm(). If I change the chunk to
seq(1000,10000,1000) the estimates remain the same.

Can someone help me understand what is causing these differences?

Here is my version info:

> version
                _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          8.1
year           2008
month          12
day            22
svn rev        47281
language       R
version.string R version 2.8.1 (2008-12-22)

Many thanks in advance for your help,

Francisco

--
Francisco J. Zagmutt
Vose Consulting
2891 20th Street
Boulder, CO, 80304
USA
www.voseconsulting.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide

http://www.R-project.org/posting-guide.html

and provide commented, minimal, self-contained, reproducible code.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Thomas Lumley                   Assoc. Professor, Biostatistics
tlum...@u.washington.edu        University of Washington, Seattle

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] bigglm() results different from glm()

Reply via email to