Two things that can be done with R alone are to read the first n lines of a file into n strings with readLines(), and to scan() in a block of the file after skipping a number of lines.
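Something like this, say (a sketch; "mydata.dat" and the line counts are made up):

  ## first 10 lines of the file as a character vector
  head10 <- readLines("mydata.dat", n = 10)

  ## skip 1000 lines, then scan in the next 500 as whole lines
  block <- scan("mydata.dat", what = "", sep = "\n",
                skip = 1000, nlines = 500)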
I will probably use Fortran to extract subsets of the file, as I need Fortran for other things that I am planning to do with the file.
I may also play a bit with readLines() and writeLines() inside loops to see whether I can build up my random subsets of the file this way.
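Something along these lines, perhaps (an untested sketch; the file names, chunk size and 1% sampling rate are all invented):

  ## copy a random ~1% of lines to a new file, reading in chunks
  ## so the whole file is never in memory at once
  infile  <- file("mydata.dat", open = "r")
  outfile <- file("subset.dat", open = "w")
  repeat {
      chunk <- readLines(infile, n = 1000)    # next 1000 lines
      if (length(chunk) == 0) break           # end of file
      keep <- runif(length(chunk)) < 0.01     # keep each line w.p. 0.01
      if (any(keep)) writeLines(chunk[keep], outfile)
  }
  close(infile); close(outfile)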
BTW, I now estimate the file at about 100,000 lines, so indeed it is not all that large!
Murray Jorgensen
Prof Brian Ripley wrote:
On Mon, 25 Aug 2003, Murray Jorgensen wrote:
At 08:12 25/08/2003 +0100, Prof Brian Ripley wrote:
I think that is only a medium-sized file.
"Large" for my purposes means "more than I really want to read into memory" which in turn means "takes more than 30s". I'm at home now and the file isn't so I'm not sure if the file is large or not.
More responses interspersed below. BTW, I forgot to mention that I'm using Windows, and so do not have nice Unix tools readily available.
But you do, thanks to me, as you need them to install R packages.
On Mon, 25 Aug 2003, Murray Jorgensen wrote:
I'm wondering if anyone has written some functions or code for handling very large files in R. I am working with a data file that is 41 variables times who knows how many observations, making up 27MB altogether.
The sort of thing that I am thinking of having R do is
- count the number of lines in a file
You can do that without reading the file into memory: use
system(paste("wc -l", filename))
Don't think that I can do that in Windows XL.
I presume you mean Windows XP? Of course you can, and wc.exe is in Rtools.zip!
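(For reference, a pure-R alternative that avoids shelling out entirely is count.fields(), which returns one number per line of the file; this is just a sketch, and "mydata.dat" is an invented name:

  ## pure-R line count: one field count per line, so length() = lines
  ## comment.char = "" so lines starting with '#' are counted too
  length(count.fields("mydata.dat", blank.lines.skip = FALSE,
                      comment.char = ""))

It reads through the whole file, but holds only one integer per line in memory.)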
or read in blocks of lines via a connection
But that does sound promising!
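For example (a sketch; the file name and block size are invented), successive read.table() calls on an open connection each pick up where the previous one stopped:

  con  <- file("mydata.dat", open = "r")
  blk1 <- read.table(con, nrows = 5000)    # rows 1-5000 as a data frame
  blk2 <- read.table(con, nrows = 5000)    # rows 5001-10000
  close(con)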
- form a data frame by selecting all cases whose line numbers are in a supplied vector (which could be used to extract random subfiles of particular sizes)
R should handle that easily in today's memory sizes. Buy some more RAM if you don't already have 1/2Gb. As others have said, for a really large file, use an RDBMS to do the selection for you.
It's just that R is so good at reading in initial segments of a file that I can't believe it can't be effective in reading more general (pre-specified) subsets.
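A sketch of that idea with connections alone (untested; the file name, chunk size and whitespace-separated format are assumptions, and the counts echo the ~100,000-line estimate above):

  wanted <- sort(sample(100000, 5000))   # line numbers to extract
  con  <- file("mydata.dat", open = "r")
  kept <- character(0)
  done <- 0                              # lines read so far
  repeat {
      chunk <- readLines(con, n = 10000)
      if (length(chunk) == 0) break
      hit  <- wanted[wanted > done & wanted <= done + length(chunk)]
      kept <- c(kept, chunk[hit - done]) # keep the wanted lines
      done <- done + length(chunk)
  }
  close(con)
  tc <- textConnection(kept)             # parse the kept lines
  subdf <- read.table(tc)                # into a 41-column data frame
  close(tc)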
--
Dr Murray Jorgensen      http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: [EMAIL PROTECTED]                                Fax 7 838 4155
Phone  +64 7 838 4773 wk    +64 7 849 6486 home    Mobile 021 1395 862
