Richard,

Thank you for the analysis. I don't think there is an inconsistency between the factor of 4 you found in your example and the 20-50 I found in my data. I suspect the major cause of the difference lies in the structure of your data set. Specifically, your test data set differs from mine in two respects:

* you have fewer lines, but each line contains many more fields (12500 x 800 in your case vs. 3.8M x 10 in mine);
* all of your data fields are doubles, whereas I have a mixture of doubles and strings.

I posted a more technical message to r-devel where I discussed possible reasons for the I/O slowness. One of them is that R is slow at creating strings. So if you try to read your data as strings, with colClasses=rep("character", 800), I'd guess you will see very different timings. Even a simple reshaping of your matrix, say to (12500*80) rows by 10 columns, will considerably worsen the timing.
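A rough sketch of the comparison I have in mind, reusing your own read.table options (the file name is hypothetical, and I have not run this on your machine):

    ## (a) all-numeric read, as in your experiment
    system.time(xx <- read.table("big.dat", header = FALSE, quote = "",
                                 row.names = NULL,
                                 colClasses = rep("numeric", 800),
                                 nrows = 12500, comment.char = ""))

    ## (b) the same file read as strings; if string creation is the
    ## bottleneck, this should be markedly slower than (a)
    system.time(yy <- read.table("big.dat", header = FALSE, quote = "",
                                 row.names = NULL,
                                 colClasses = rep("character", 800),
                                 nrows = 12500, comment.char = ""))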
Please let me know the results if you try any of the above. In my message to r-devel you may also find some timings that support my estimates.

Thanks,
Vadim

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Richard A. O'Keefe
> Sent: Thursday, July 01, 2004 5:22 PM
> To: [EMAIL PROTECTED]
> Subject: RE: [R] naive question
>
> As part of a continuing thread on the cost of loading large amounts
> of data into R, "Vadim Ogranovich" <[EMAIL PROTECTED]> wrote:
>
>     R's IO is indeed 20 - 50 times slower than that of equivalent C
>     code no matter what you do, which has been a pain for some of us.
>
> I wondered to myself just how bad R is at reading, when it is given a
> fair chance. So I performed an experiment.
>
> My machine (according to "Workstation Info") is a SunBlade 100 with
> 640MB of physical memory running SunOS 5.9 Generic; according to
> fpversion it is an Ultra2e with the CPU clock running at 500MHz and
> the main memory clock running at 84MHz (wow, slow memory). R.version is
>
>     platform  sparc-sun-solaris2.9
>     arch      sparc
>     os        solaris2.9
>     system    sparc, solaris2.9
>     status
>     major     1
>     minor     9.0
>     year      2004
>     month     04
>     day       12
>     language  R
>
> and although this is a 64-bit machine, it's a 32-bit installation of R.
>
> The experiment was this:
>
> (1) I wrote a C program that generated 12500 rows of 800 columns; the
>     numbers were integers in 0..999,999,999 generated using drand48().
>     These numbers were written using printf(). It is possible to do
>     quite a bit better by avoiding printf(), but that would ruin the
>     spirit of the comparison, which is to see what can be done with
>     *straightforward* code using *existing* library functions.
>
>     21.7 user + 0.9 system = 22.6 cpu seconds; 109 real seconds.
>
>     The sizes were chosen to get 100MB; the actual size was
>     12500 (lines) 10000000 (words) 100012500 (bytes).
>
> (2) I wrote a C program that read these numbers using scanf("%d"); it
>     "knew" there were 800 numbers per row and 12500 rows in all.
>     Again, it is possible to do better by avoiding scanf(), but the
>     point is to look at *straightforward* code.
>
>     18.4 user + 0.6 system = 19.0 cpu seconds; 100 real seconds.
>
> (3) I started R, played around a bit doing other things, then issued
>     this command:
>
>     system.time(xx <- read.table("/tmp/big.dat", header=FALSE,
>         quote="", row.names=NULL, colClasses=rep("numeric",800),
>         nrows=12500, comment.char=""))
>
>     So how long _did_ it take to read 100MB on this machine?
>
>     71.4 user + 2.2 system = 73.5 cpu seconds; 353 real seconds.
>
> The result: the R/C ratio was less than 4, whether you measure cpu
> time or real time. It certainly wasn't anywhere near 20-50 times
> slower.
>
> Of course, *binary* I/O in C *would* be quite a bit faster:
>
> (1') Generate the same integers, but write a row at a time using
>      fwrite(): 5 seconds cpu, 25 seconds real; 40 MB.
>
> (2') Read the same integers a row at a time using fread():
>      0.26 seconds cpu, 1 second real.
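[For comparison, the same bulk-reading idea is available from within R via readBin(); a rough sketch, assuming the 40MB file written by fwrite() above holds native-endian 4-byte integers at a hypothetical path /tmp/big.bin:

    con <- file("/tmp/big.bin", "rb")
    x <- readBin(con, what = integer(), n = 12500 * 800, size = 4)
    close(con)
    xx <- matrix(x, nrow = 12500, ncol = 800, byrow = TRUE)  # back to 12500 x 800

This is an illustration of the technique, not part of Richard's experiment.]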
> This would appear to more than justify "20-50 times slower", but
> reading binary data and reading data in a textual representation are
> different things; "less than 4 times slower" is the fairer measure.
> However, it does emphasise the usefulness of problem-specific bulk
> reading techniques.
>
> I thought I'd give you another R measurement:
>
>     system.time(xx <- read.table("/tmp/big.dat", header=FALSE))
>
> But I got sick of waiting for it and killed it after 843 cpu seconds,
> 3075 real seconds. Without knowing how far it had got, one can say no
> more than that this is at least 10 times slower than the more
> informed call to read.table.
>
> What this tells me is that if you know something about the data that
> you _could_ tell read.table about, you do yourself no favour by
> keeping read.table in the dark. All those options are there for a
> reason, and it *will* pay to use them.

______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html