Hello everyone,
I recently made a 64-bit build of R-2.2.1 under Solaris 9 using gcc 3.4.2.
The server has 12 GB of memory, 6 SPARC CPUs, and plenty of swap space. I was
the only user at the time of the following experiment.
I wanted to benchmark R's ability to read large data files, using a data set
of 2 million records with 65 variables per record. All but two of the
variables are character; the other two are numeric. The whole data set is
about 600 MB stored as a plain ASCII file.
The following code was used in the benchmarking runs:
## var1 and var2 are numeric; var3 through var65 are character
chars <- rep(list(""), 63)
names(chars) <- paste("var", 3:65, sep = "")
fields <- c(list(var1 = 0, var2 = 0), chars)
A <- scan("test.dat", skip = 1, sep = ",", what = fields, nmax = XXXXX,
          quiet = FALSE)
summary(A)
where XXXXX = 1000000 or 2000000
I made two runs, with nmax=1000000 and nmax=2000000 respectively. The first
run completed successfully in about an hour of CPU time, but its memory usage
exceeded 2.2 GB, roughly 7 times the ~300 MB of data actually read. The
second run aborted when memory usage reached 4 GB, with the error message
"vector memory exhausted (limit reached?)".
Three questions:
1) Why was so much memory and CPU time consumed to read 300 MB of data? Since
almost all of the variables are character, I expected close to a 1:1 mapping
between size on disk and size in memory. (A sketch of the per-string overhead
I suspect is involved follows these questions.)
2) Since this is a 64-bit build, I expected it to handle more than the 600 MB
of data I used. What does the error message mean? I don't believe any
vector's length exceeded the theoretical limit of 2^31 - 1 elements. (A quick
check of that limit is also below.)
3) The original file was compressed, and I had to uncompress it before the
experiment. Is there a way to read compressed files directly in R? (A sketch
of what I am hoping exists is below.)
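On question 1, a minimal sketch of the overhead I suspect: on a 64-bit build,
each element of a character vector costs an 8-byte pointer plus a per-string
header on top of the characters themselves. The exact sizes are my
assumptions and will vary by R version and platform:

## One million distinct short strings: roughly 8 MB as raw ASCII on disk
x <- paste("id", 1:1000000, sep = "")
## The reported size is several times the raw text, since each element
## also pays for a pointer and a per-string header
round(as.numeric(object.size(x)) / 2^20)  # size in MB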
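On question 2, my understanding is that even on a 64-bit build a single
vector is capped at .Machine$integer.max elements, which is easy to check:

.Machine$integer.max   # 2147483647: the maximum length of one R vector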
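On question 3, a sketch of what I am hoping works: passing a gzfile()
connection straight to scan(), so the file never has to be uncompressed on
disk. The .gz filename here is made up, and fields and XXXXX are reused from
the code above:

## Read directly from a gzip-compressed file via a connection
con <- gzfile("test.dat.gz")
A <- scan(con, skip = 1, sep = ",", what = fields, nmax = XXXXX,
          quiet = FALSE)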
Thanks so much for your help.
Min