Hello everyone,

I would like to get some advice on using R with some really large datasets.

I'm using R 1.8.1 on RH9 Linux for research involving a lot of numerical data. The datasets 
total around 200Mb (as shown by memory.size). During my data manipulation, the system 
memory usage grew to 1.5Gb, which caused a lot of swapping on my 1Gb PC. 
This is just a small-scale experiment; the full-scale one will use data 30 times 
as large (on a 4Gb machine). I can see that I'll need to deal with this memory usage 
problem very soon.
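
For reference, this is roughly how I have been checking memory so far (just a sketch; 
the object names are placeholders for my real data tables):

    ## rough memory accounting (object names are placeholders for my tables)
    sizes <- sapply(ls(), function(x) object.size(get(x)))
    sum(sizes) / 1024^2    # total size of the workspace objects in Mb
    gc()                   # what R has actually allocated / freed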

I notice that R keeps all datasets in memory at all times. I wonder whether there is 
any way to instruct R to push some of the less-frequently-used data tables out of main 
memory, so as to free up memory for those that are actively in use. It would be even 
better if R could keep only part of a table in memory, loading that part only when it 
is needed. Using save & load could help, but I wonder whether R is intelligent enough 
to do this by itself, so that I don't need to keep track of memory usage at all times.
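
What I have in mind with save & load is roughly the pattern below (a sketch only; 
'big.table' and the file name are placeholders), which I would rather not have to 
manage by hand:

    ## manually pushing a less-used table out to disk and bringing it back
    save(big.table, file = "big_table.RData")   # write the table to disk
    rm(big.table)                               # drop it from the workspace
    gc()                                        # let R reclaim the memory
    ## ... later, when the table is needed again ...
    load("big_table.RData")                     # restores 'big.table'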

Another thought is to use a 64-bit machine (AMD64). I see there is a pre-compiled R 
for Fedora Linux on AMD64. Does anyone know whether this version of R runs as 64-bit? 
If so, will R be able to go beyond the 32-bit 4Gb memory limit?
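
If I understand correctly, something like the following should show whether a given 
build is 64-bit (assuming .Machine reports the pointer size on that platform):

    ## quick check of whether an R build is 64-bit
    8 * .Machine$sizeof.pointer   # 64 on a 64-bit build, 32 on a 32-bit one
    R.version$arch                # e.g. "x86_64" on AMD64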

Also, from the manual, I see that the RPgSQL package (for PostgreSQL databases) 
supports a "proxy data frame" feature. Does anyone have experience with this? Can 
proxy data frames handle memory efficiently for very large datasets? Say, if I have a 
6Gb database table defined as a proxy data frame, will R & RPgSQL be able to handle it 
with just 4Gb of memory?
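
From my reading of the RPgSQL documentation, the usage would be roughly as below 
(function and argument names are as I understand them from the docs, so I may well 
have some of them wrong):

    ## sketch of what I think a proxy data frame would look like in RPgSQL
    library(RPgSQL)
    db.connect(dbname = "mydb")    # connect to the PostgreSQL database
    bind.db.proxy("big.table")     # 'big.table' now behaves like a data frame
    ## hopefully only the requested rows are pulled into memory:
    first.rows <- big.table[1:1000, ]
    db.disconnect()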

Any comments would be appreciated. Many thanks.

Sunny Ho
(Hong Kong University of Science & Technology)
