Hi all,

I wonder if you can help me.

THE PROBLEM

I want to train and test a GLM on some large datasets. I am running into 
problems when I join my tables into one flat table to feed into the GLM, 
because the result is far too big for the memory on my computer.

THREE TABLES - Pipes, Weekly Weather data, Bursts

I have three tables, which are all related to each other.


(1)  Pipe cohorts (114,000 rows) with a range of explanatory variables.
     Linking fields: (A) pipe cohort ID, (B) weathercell_ID

(2)  Weekly weather data (explanatory), covering 12 years, i.e. 624 weeks for
     each pipe cohort.
     Linking fields: (C) week, (B) weathercell_ID

(3)  Bursts (40,000 bursts).
     Linking fields: (A) pipe cohort ID, (C) week

Effectively, the combination of tables (1) and (2) makes up the population.  
Table (3) contains the events, or failures.

JOINING THE THREE TABLES

Previously I had far fewer pipe cohort rows.

What I have been doing until now is joining the (1) pipe cohort data to the 
(2) weekly weather data.
This repeats the pipe cohort data for each week of the 12 years, which now 
makes a very long table: 624 weeks x 114,000 rows = about 71 million rows.
I would then join the (3) burst data to that, to see how many bursts there 
were that week, on that pipe cohort.
This made a large, flat file which I could feed into the GLM.
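For concreteness, the current approach is roughly the following (the column 
names are made up for illustration, not my real variables; I am assuming the 
bursts table has already been aggregated to a count per cohort and week, and 
the Poisson family is just a placeholder):

## Repeat each pipe cohort for every week of weather in its weather cell:
flat <- merge(pipes, weather, by = "weathercell_id")      # ~71 million rows

## Attach the burst counts; cohort-weeks with no burst get 0:
flat <- merge(flat, bursts, by = c("cohort_id", "week"), all.x = TRUE)
flat$n_bursts[is.na(flat$n_bursts)] <- 0

## Fit the GLM on the flat table:
fit <- glm(n_bursts ~ age + material + rainfall + temperature,
           family = poisson, data = flat)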

This worked OK when there were not so many pipe cohorts, but now that there 
are 114,000 rows, joining the data tables produces a MASSIVE table (many, many 
GB) which kills my computer.

RELATIONAL DATABASE APPROACH?

I am thinking it would be better to have a relational database structure 
where, for each data point (row), or chunk of rows, being brought into the 
biglm model, it takes the three tables and looks up the appropriate values 
each time using the defined join fields (A, B and C), feeds that into the 
model, then goes back and looks up the next point.
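Something like this is what I am imagining: bigglm() from the biglm package 
accepts a function that hands back the data one chunk at a time, so each 
chunk's flat rows could be built on the fly by joining only a block of pipe 
cohorts. Untested sketch, with made-up column names and formula:

library(biglm)

## Returns a function of the kind bigglm() expects: reset = TRUE starts over,
## reset = FALSE returns the next chunk of joined rows, or NULL when the
## cohorts are exhausted.
make_chunk_fun <- function(pipes, weather, bursts, block = 2000) {
  ids <- unique(pipes$cohort_id)
  i <- 0
  function(reset = FALSE) {
    if (reset) { i <<- 0; return(NULL) }
    lo <- i * block + 1
    if (lo > length(ids)) return(NULL)
    hi <- min(lo + block - 1, length(ids))
    i <<- i + 1
    p <- pipes[pipes$cohort_id %in% ids[lo:hi], ]
    chunk <- merge(p, weather, by = "weathercell_id")
    chunk <- merge(chunk, bursts, by = c("cohort_id", "week"), all.x = TRUE)
    chunk$n_bursts[is.na(chunk$n_bursts)] <- 0
    chunk
  }
}

fit <- bigglm(n_bursts ~ age + material + rainfall + temperature,
              data = make_chunk_fun(pipes, weather, bursts),
              family = poisson())

(One thing I have read about bigglm is that factor columns need the same 
levels in every chunk, so presumably any factors would have to be set up with 
explicit levels before chunking.)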

ADVICE?

How would you approach this problem?

I have the data prepared in the three tables.
I need to fit lots of models to see which variables give me the best AIC 
(output: lots of model fits), then predict bursts using the best model and 
the available (1) pipe and (2) weather data.
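For the model-search step I was picturing a loop over candidate formulas, 
something like the below (again untested; I am assuming AIC() has a method 
for these fits, and if not, deviance() plus a parameter count would do):

formulas <- list(
  n_bursts ~ age + rainfall,
  n_bursts ~ age + material + rainfall,
  n_bursts ~ age + material + rainfall + temperature
)

fits <- lapply(formulas, function(f)
  bigglm(f, data = make_chunk_fun(pipes, weather, bursts), family = poisson()))

## Compare the candidate models:
sapply(fits, AIC)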

Would you use the biglm package, linked to an SQLite database? (Or do 
something completely different?)
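What I had in mind for the SQLite route is roughly the following (untested 
sketch; table, column and file names are made up). The three tables go into 
the database once, and the chunk function asks SQLite to do the three-way 
join one slice at a time:

library(DBI)
library(RSQLite)
library(biglm)

con <- dbConnect(SQLite(), "pipes.sqlite")
dbWriteTable(con, "pipes",   pipes)
dbWriteTable(con, "weather", weather)
dbWriteTable(con, "bursts",  bursts)

## The join happens inside SQLite; LIMIT/OFFSET pulls it back in slices:
sql <- "SELECT p.*, w.week, w.rainfall, w.temperature,
               COALESCE(b.n_bursts, 0) AS n_bursts
        FROM pipes p
        JOIN weather w ON w.weathercell_id = p.weathercell_id
        LEFT JOIN bursts b ON b.cohort_id = p.cohort_id AND b.week = w.week
        ORDER BY p.cohort_id, w.week
        LIMIT ? OFFSET ?"

make_db_chunks <- function(con, chunksize = 500000) {
  offset <- 0
  function(reset = FALSE) {
    if (reset) { offset <<- 0; return(NULL) }
    chunk <- dbGetQuery(con, sql, params = list(chunksize, offset))
    if (nrow(chunk) == 0) return(NULL)
    offset <<- offset + chunksize
    chunk
  }
}

fit <- bigglm(n_bursts ~ age + material + rainfall + temperature,
              data = make_db_chunks(con), family = poisson())

I realise OFFSET paging may get slow deep into a 71-million-row join, so 
perhaps slicing by cohort ID in the WHERE clause would be smarter; that is 
partly what I am asking.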

Many thanks,

Tim


