On Thu, Aug 16, 2007 at 03:24:08PM -0500, Alp ATICI wrote:
> I'd like to fit linear models on very large datasets. My data frames
> are about 2000000 rows x 200 columns of doubles and I am using a 64-bit
> build of R. I've googled about this extensively and went over the
> "R Data Import/Export" guide. My primary issue is that although my data
> represented in ascii form is 4Gb in size (and therefore much smaller in
> binary), R consumes about 12Gb of virtual memory.
One option is simply to buy more memory, which might work for you in this case but does not scale to larger problems. I'm not sure how to make R itself handle a dataset this size more gracefully, but you may be able to use random sampling to help.

Read the data from MySQL, selecting a random 10% subset; that should use about 1.2 Gb. Fit the model to this subset, then repeat the procedure 100 times with independent samples. You have now effectively bootstrapped the coefficients of your model, and could use the average value and standard deviation of the coefficients across the resamples as your coefficient estimates and standard errors.

Since swapping to disk is typically 1000 times or more slower than working in RAM, this process might take a tenth of the time or less compared with letting the R process thrash its swap.

It's a thought; I'm not sure how well it works.

--
Daniel Lakeland
[EMAIL PROTECTED]
http://www.street-artists.org/~dlakelan
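P.S. Here is a rough, untested sketch of what I have in mind, assuming the data sit in a MySQL table; the connection details, the table name 'bigdata', and the response column 'y' are placeholders you would replace with your own:

library(DBI)
library(RMySQL)   # assumes the RMySQL driver is installed

## placeholder connection details -- substitute your own
con <- dbConnect(MySQL(), dbname = "mydb", user = "me", password = "secret")

n_reps <- 100
fits <- replicate(n_reps, {
    ## pull a ~10% random subset server-side, so only the subset
    ## (roughly 1.2 Gb by your estimate) ever comes into R
    d <- dbGetQuery(con, "SELECT * FROM bigdata WHERE RAND() < 0.10")
    ## 'y' stands in for your response; '.' uses the remaining columns
    coef(lm(y ~ ., data = d))
})

dbDisconnect(con)

## average coefficient and its spread across the resamples
estimate  <- rowMeans(fits)
std.error <- apply(fits, 1, sd)
cbind(estimate, std.error)

The WHERE RAND() < 0.10 clause pushes the subsetting onto the database, so each repeat is an independent sample and R never has to hold the full dataset at once.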