How many rows does xx have? Looking at your example for chunksize 10000: you initially fit the first 10000 observations, then the seq() produces just the single value 10000, which means the update is done on rows 10001 through 20000. If xx has only 10000 rows, this should give at least one error. If xx has 20000 or more rows, then only chunksize 10000 will ever see the 20000th value; the other chunksizes will use less of the data.
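To see the indexing problem concretely, here is a small sketch (using the chunksize-10000 case from the example) showing which rows the loop actually asks for:

```r
chunksize <- 10000
# The loop index comes from seq(chunksize, 10000, chunksize),
# which for chunksize = 10000 is just the single value 10000:
seq(chunksize, 10000, chunksize)

# The update then subsets xx[(i + 1):(i + chunksize), ], i.e.:
i <- 10000
range((i + 1):(i + chunksize))
# rows 10001 through 20000 -- past the end of a 10000-row xx
```

So with a 10000-row xx, every chunk size except the degenerate cases ends up requesting rows that do not exist.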
Also, looking at the help for update.biglm, the 2nd argument is "moredata", not "data". So if the code below is the code that you actually ran, then the new data chunks are going into the "..." argument (and being ignored, as that is there for future expansion and does nothing yet) and the "moredata" argument is left empty, which should also give an error.

For the code below, the model is only being fit to the initial chunk and never updated, so with different chunk sizes there are different amounts of data per model. You can check this by doing summary(fit) and looking at the sample size in the 2nd line.

It is easier for us to help you if you provide code that can be run by copying and pasting (we don't have xx, so we can't just run the code below; you could include a line to randomly generate an xx, or a link to where a copy of xx can be downloaded). It also helps if you mention any errors or warnings that you receive in the process of running your code.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

From: utkarshsinghal [mailto:utkarsh.sing...@global-analytics.com]
Sent: Tuesday, July 07, 2009 12:10 AM
To: Greg Snow
Cc: Thomas Lumley; r help
Subject: Re: [R] bigglm() results different from glm()+Another question

Trust me, it is the same total data I am using; even the chunk sizes are all equal. I also cross-checked by manually creating the chunks and updating as in the example given on the biglm help page.

> ?biglm

Regards
Utkarsh

Greg Snow wrote:

Are you sure that you are fitting all the models on the same total data? At first glance it looks like you may be including more data for some of the chunk sizes, or producing an error that update does not know how to deal with.

--
Gregory (Greg) L. Snow Ph.D.
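A corrected version of the loop, as a sketch (it assumes xx is a data frame with exactly 10000 rows and columns y, x1, x2, x3, per the example), passing each new chunk through the moredata argument:

```r
library(biglm)

chunksize <- 500
n <- 10000  # assumed number of rows in xx

# Fit on the first chunk only.
fit <- biglm(y ~ x1 + x2 + x3, data = xx[1:chunksize, ])

# Update with each remaining chunk. Note the argument is 'moredata',
# not 'data' -- with data= the chunk falls into '...' and is ignored.
# The loop stops at n - chunksize so the last chunk ends exactly at row n.
for (i in seq(chunksize, n - chunksize, by = chunksize)) {
  fit <- update(fit, moredata = xx[(i + 1):(i + chunksize), ])
}

summary(fit)  # the 2nd line of the summary shows the sample size
```

With this loop every row 1..n is used exactly once, regardless of chunk size (assuming chunksize divides n evenly).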
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org<mailto:greg.s...@imail.org>
801.408.8111

From: utkarshsinghal [mailto:utkarsh.sing...@global-analytics.com]
Sent: Monday, July 06, 2009 8:58 AM
To: Thomas Lumley; Greg Snow
Cc: r help
Subject: Re: [R] bigglm() results different from glm()+Another question

The AIC of the biglm models is highly dependent on the chunk size selected (example provided below). This I can somewhat expect, because the model error will increase with the number of chunks. It would be helpful if you could give your opinion on comparing different models in such cases:

* Can I compare two models fitted with different chunk sizes, or should I always use the same chunk size?
* Although I am not going to use AIC at all in my model selection, I think any other model parameters will also vary in the same way. Am I right?
* What would be the ideal chunk size? Should it be the maximum size that R and my system's RAM can handle?

Any comments will be helpful.

Example of AIC variation with chunksize. I ran the following code on my data, which has 10000 observations and 3 independent variables:

> chunksize = 500
> fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> for(i in seq(chunksize,10000,chunksize)) fit=update(fit, data=xx[(i+1):(i+chunksize),])
> AIC(fit)
[1] 30647.79

Here are the AICs for other chunk sizes:

chunksize       AIC
500        30647.79
1000       29647.79
2000       27647.79
2500       26647.79
5000       21647.79
10000      11647.79

Regards
Utkarsh

utkarshsinghal wrote:

Thank you Mr. Lumley and Mr. Greg. That was helpful.

Regards
Utkarsh

Thomas Lumley wrote:

On Fri, 3 Jul 2009, utkarshsinghal wrote:

Hi Sir,

Thanks for making the package available to us. I am facing a few problems; perhaps you can give some hints:

Problem 1: The model summary and residual deviance matched (in the mail below), but I didn't understand why the AIC is still different.
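Since we don't have xx, here is a self-contained sketch of the comparison (the simulated data, coefficients, and seed are all hypothetical stand-ins for the poster's 10000-row, 3-variable data set). With the update done via moredata, every chunk size sees the same rows, so the fitted model should no longer depend on the chunk size:

```r
library(biglm)

# Hypothetical stand-in for the poster's data: 10000 observations,
# 3 independent variables.
set.seed(42)
n  <- 10000
xx <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
xx$y <- 1 + 2 * xx$x1 - xx$x2 + 0.5 * xx$x3 + rnorm(n)

aic_for_chunksize <- function(chunksize) {
  fit <- biglm(y ~ x1 + x2 + x3, data = xx[1:chunksize, ])
  # Remaining chunk start positions (none if the first chunk was all of xx).
  starts <- if (chunksize < n) seq(chunksize + 1, n, by = chunksize) else integer(0)
  for (s in starts) {
    fit <- update(fit, moredata = xx[s:min(s + chunksize - 1, n), ])
  }
  AIC(fit)
}

# All chunk sizes now use the same 10000 rows, so the AICs should agree.
sapply(c(500, 1000, 2000, 2500, 5000, 10000), aic_for_chunksize)
```

If the AICs from this loop still differed across chunk sizes, that would point to a genuine problem rather than the indexing/argument issues discussed above.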
> AIC(m1)
[1] 532965
> AIC(m1big_longer)
[1] 101442.9

That's because AIC.default uses the unnormalized log-likelihood and AIC.biglm uses the deviance. Only differences in AIC between models are meaningful, not individual values.

Problem 2: The chunksize argument is there in bigglm but not in biglm; consequently, update.biglm is there, but not update.bigglm. Is my observation correct? If yes, why this difference?

Because update.bigglm is impossible. Fitting a glm requires iteration, which means that it requires multiple passes through the data. Fitting a linear model requires only a single pass. update.biglm can take a fitted or partially fitted biglm and add more data. To do the same thing for a bigglm, you would need to start over again from the beginning of the data set.

To fit a glm, you need to specify a data source that bigglm() can iterate over. You do this with a function that can be called repeatedly to return the next chunk of data.

-thomas

Thomas Lumley
Assoc. Professor, Biostatistics
tlum...@u.washington.edu<mailto:tlum...@u.washington.edu>
University of Washington, Seattle

I don't know why the AIC is different, but remember that there are multiple definitions of AIC (generally differing in the constant added), so it may just be a difference in the constant, or it could be that you have not fit the whole dataset (based on your other question).

For an lm model, biglm only needs to make a single pass through the data. This was the first function written for the package, and the update mechanism was an easy way to write it (and it still works well). The bigglm function came later, and the models other than Gaussian require multiple passes through the data, so instead of the update mechanism that biglm uses, bigglm requires the data argument to be a function that returns the next chunk of data and can restart to the beginning of the dataset.
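The restartable data source described above can be sketched roughly as follows (a sketch only; make_chunk_reader is a hypothetical helper, and it assumes xx is an in-memory data frame with columns y, x1, x2, x3). The function bigglm() receives is called with reset = TRUE to rewind for the next pass and with reset = FALSE to fetch the next chunk, returning NULL when the data are exhausted:

```r
library(biglm)

# A closure over a row position, so repeated calls walk through xx
# one chunk at a time and reset = TRUE rewinds to the start.
make_chunk_reader <- function(df, chunksize = 1000) {
  pos <- 0
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0
      return(NULL)
    }
    if (pos >= nrow(df)) return(NULL)  # end of a pass
    rows <- (pos + 1):min(pos + chunksize, nrow(df))
    pos <<- max(rows)
    df[rows, ]
  }
}

# bigglm() iterates over the data function, making as many full
# passes as the IRLS fit needs (gaussian() here is just an example).
fit <- bigglm(y ~ x1 + x2 + x3,
              data = make_chunk_reader(xx, chunksize = 1000),
              family = gaussian())
```

The closure-over-position pattern is what lets bigglm make multiple passes: each reset = TRUE call rewinds the reader without reloading anything.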
Also note that the bigglm function usually does only a few passes through the data; usually this is good enough, but in some cases you may need to increase the number of passes (see the maxit argument).

Hope this helps,