As part of a continuing thread on the cost of loading large amounts of data into R,
"Vadim Ogranovich" <[EMAIL PROTECTED]> wrote:

    R's IO is indeed 20 - 50 times slower than that of equivalent
    C code no matter what you do, which has been a pain for some of us.

I wondered to myself just how bad R is at reading, when it is given
a fair chance.  So I performed an experiment.

My machine (according to "Workstation Info") is a SunBlade 100 with
640MB of physical memory running SunOS 5.9 Generic.  According to
fpversion, this is an Ultra2e with the CPU clock running at 500MHz and
the main memory clock running at 84MHz (wow, slow memory).  R.version is

    platform sparc-sun-solaris2.9
    arch     sparc
    os       solaris2.9
    system   sparc, solaris2.9
    status
    major    1
    minor    9.0
    year     2004
    month    04
    day      12
    language R

and although this is a 64-bit machine, it's a 32-bit installation of R.

The experiment was this:

(1) I wrote a C program that generated 12500 rows of 800 columns; the
    numbers were integers 0..999,999,999 generated using drand48().
    These numbers were written using printf().  It is possible to do
    quite a bit better by avoiding printf(), but that would ruin the
    spirit of the comparison, which is to see what can be done with
    *straightforward* code using *existing* library functions.

    21.7 user + 0.9 system = 22.6 cpu seconds; 109 real seconds.

    The sizes were chosen to get 100MB; the actual size was
    12500 (lines) 10000000 (words) 100012500 (bytes)

(2) I wrote a C program that read these numbers using scanf("%d");
    it "knew" there were 800 numbers per row and 12500 rows in all.
    Again, it is possible to do better by avoiding scanf(), but the
    point is to look at *straightforward* code.

    18.4 user + 0.6 system = 19.0 cpu seconds; 100 real seconds.

(3) I started R, played around a bit doing other things, then issued
    this command:

    > system.time(xx <- read.table("/tmp/big.dat", header=FALSE, quote="",
    +     row.names=NULL, colClasses=rep("numeric",800), nrows=12500,
    +     comment.char=""))

    So how long _did_ it take to read 100MB on this machine?
    71.4 user + 2.2 system = 73.5 cpu seconds; 353 real seconds.

The result: the R/C ratio was less than 4, whether you measure cpu time
or real time.  It certainly wasn't anywhere near 20-50 times slower.

Of course, *binary* I/O in C *would* be quite a bit faster:

(1') generate same integers but write a row at a time using fwrite():
     5 seconds cpu, 25 seconds real; 40 MB.

(2') read same integers a row at a time using fread():
     0.26 seconds cpu, 1 second real.

This would appear to more than justify "20-50 times slower", but reading
binary data and reading data in a textual representation are different
things; "less than 4 times slower" is the fairer measure.  However, it
does emphasise the usefulness of problem-specific bulk reading techniques.

I thought I'd give you another R measurement:

    > system.time(xx <- read.table("/tmp/big.dat", header=FALSE))

But I got sick of waiting for it, and killed it after
843 cpu seconds, 3075 real seconds.  Without knowing how far it had got,
one can say no more than that this is at least 10 times slower than
the more informed call to read.table.

What this tells me is that if you know something about the data that
you _could_ tell read.table about, you do yourself no favour by keeping
read.table in the dark.  All those options are there for a reason, and
it *will* pay to use them.
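
The C programs behind steps (1), (2), (1') and (2') are not included in
the message.  For readers who want to repeat the comparison, here is a
minimal sketch of what they might look like.  It is an illustration
under stated assumptions, not the code that produced the timings above:
the function names (write_text, read_text, write_binary, read_binary)
and the seed parameter are hypothetical, the text format is plain "%d"
separated by spaces (the exact printf() format used originally is not
given, so file sizes will differ), and the data goes to a named file
through fprintf()/fscanf() rather than through stdout/stdin.

```c
#define _XOPEN_SOURCE 600 /* exposes drand48()/srand48() under strict modes */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* One pseudo-random integer in 0..999,999,999, as described in step (1). */
static int next_value(void)
{
    return (int)(drand48() * 1e9);
}

/* Step (1), sketched: write rows x cols integers as text, one row per line.
   The seed argument is an assumption, added so that the text and binary
   writers can emit the same values.  Returns 0 on success, -1 on error. */
int write_text(const char *path, int rows, int cols, long seed)
{
    assert(rows > 0 && cols > 0);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    srand48(seed);
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            fprintf(f, "%d%c", next_value(), c == cols - 1 ? '\n' : ' ');
    return fclose(f) == 0 ? 0 : -1;
}

/* Step (2), sketched: read the values back with fscanf("%d"), "knowing"
   the shape in advance.  The values are summed into *sum so the work is
   observable.  Returns the number of values read, or -1 on open failure. */
long read_text(const char *path, int rows, int cols, long long *sum)
{
    assert(rows > 0 && cols > 0);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    long n = 0;
    int x;
    *sum = 0;
    while (n < (long)rows * cols && fscanf(f, "%d", &x) == 1) {
        *sum += x;
        n++;
    }
    fclose(f);
    return n;
}

/* Step (1'), sketched: same values, written one row at a time with fwrite().
   With a 4-byte int, 12500 x 800 values come to the 40 MB quoted above. */
int write_binary(const char *path, int rows, int cols, long seed)
{
    assert(rows > 0 && cols > 0);
    FILE *f = fopen(path, "wb");
    int *row = malloc((size_t)cols * sizeof *row);
    if (f == NULL || row == NULL) {
        if (f != NULL)
            fclose(f);
        free(row);
        return -1;
    }
    srand48(seed);
    int ok = 1;
    for (int r = 0; r < rows && ok; r++) {
        for (int c = 0; c < cols; c++)
            row[c] = next_value();
        ok = fwrite(row, sizeof *row, (size_t)cols, f) == (size_t)cols;
    }
    free(row);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Step (2'), sketched: read one row at a time with fread().
   Returns the number of values read, or -1 on open/allocation failure. */
long read_binary(const char *path, int rows, int cols, long long *sum)
{
    assert(rows > 0 && cols > 0);
    FILE *f = fopen(path, "rb");
    int *row = malloc((size_t)cols * sizeof *row);
    if (f == NULL || row == NULL) {
        if (f != NULL)
            fclose(f);
        free(row);
        return -1;
    }
    long n = 0;
    *sum = 0;
    for (int r = 0; r < rows; r++) {
        if (fread(row, sizeof *row, (size_t)cols, f) != (size_t)cols)
            break;
        for (int c = 0; c < cols; c++)
            *sum += row[c];
        n += cols;
    }
    free(row);
    fclose(f);
    return n;
}
```

A small driver would call each function with rows=12500 and cols=800
and time the calls separately.  Passing the same seed to both writers
makes the text and binary files hold the same integers, so the two sums
returned by the readers can be compared as a sanity check.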