Two things that can be done with R alone are to read the first n lines of a file into n strings with readLines(), and to scan() in a block of the file after skipping a number of lines.
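Something like this, say (a sketch; "mydata.dat" and the line counts are made up):

  ## first 10 lines of the file as a character vector
  head10 <- readLines("mydata.dat", n = 10)

  ## skip 1000 lines, then scan in the next 500 as whole lines
  block <- scan("mydata.dat", what = "", sep = "\n",
                skip = 1000, nlines = 500)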
I will probably use Fortran to extract subsets of the file, as I need Fortran for other things that I am planning to do with the file.
I may also play a bit with readLines() and writeLines() inside loops to see whether I can build up my random subsets of the file this way.
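Something along these lines, perhaps (an untested sketch; the file names, chunk size and 1% sampling rate are all invented):

  ## copy a random ~1% of lines to a new file, reading in chunks
  ## so the whole file is never in memory at once
  infile  <- file("mydata.dat", open = "r")
  outfile <- file("subset.dat", open = "w")
  repeat {
      chunk <- readLines(infile, n = 1000)    # next 1000 lines
      if (length(chunk) == 0) break           # end of file
      keep <- runif(length(chunk)) < 0.01     # keep each line w.p. 0.01
      if (any(keep)) writeLines(chunk[keep], outfile)
  }
  close(infile); close(outfile)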
BTW, I now estimate the file at about 100,000 lines, so indeed it is not all that large!
Murray Jorgensen
Prof Brian Ripley wrote:
On Mon, 25 Aug 2003, Murray Jorgensen wrote:
At 08:12 25/08/2003 +0100, Prof Brian Ripley wrote:
I think that is only a medium-sized file.
"Large" for my purposes means "more than I really want to read into memory" which in turn means "takes more than 30s". I'm at home now and the file isn't so I'm not sure if the file is large or not.
More responses interspersed below. BTW, I forgot to mention that I'm using Windows, and so do not have nice Unix tools readily available.
But you do, thanks to me, as you need them to install R packages.
On Mon, 25 Aug 2003, Murray Jorgensen wrote:
I'm wondering if anyone has written some functions or code for handling very large files in R. I am working with a data file that is 41 variables times who knows how many observations, making up 27MB altogether.
The sort of thing that I am thinking of having R do is
- count the number of lines in a file
You can do that without reading the file into memory: use
system(paste("wc -l", filename))
Don't think that I can do that in Windows XL.
I presume you mean Windows XP? Of course you can, and wc.exe is in Rtools.zip!
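(For reference, a pure-R alternative that avoids shelling out entirely is count.fields(), which returns one number per line of the file; this is just a sketch, and "mydata.dat" is an invented name:

  ## pure-R line count: one field count per line, so length() = lines
  ## comment.char = "" so lines starting with '#' are counted too
  length(count.fields("mydata.dat", blank.lines.skip = FALSE,
                      comment.char = ""))

It reads through the whole file, but holds only one integer per line in memory.)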
or read in blocks of lines via a connection
But that does sound promising!
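For example (a sketch; the file name and block size are invented), successive read.table() calls on an open connection each pick up where the previous one stopped:

  con  <- file("mydata.dat", open = "r")
  blk1 <- read.table(con, nrows = 5000)    # rows 1-5000 as a data frame
  blk2 <- read.table(con, nrows = 5000)    # rows 5001-10000
  close(con)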
- form a data frame by selecting all cases whose line numbers are in a supplied vector (which could be used to extract random subfiles of particular sizes)
R should handle that easily in today's memory sizes. Buy some more RAM if you don't already have 1/2Gb. As others have said, for a really large file, use an RDBMS to do the selection for you.
It's just that R is so good at reading in initial segments of a file that I can't believe it can't be effective in reading more general (pre-specified) subsets.
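A sketch of that idea with connections alone (untested; the file name, chunk size and whitespace-separated format are assumptions, and the counts echo the ~100,000-line estimate above):

  wanted <- sort(sample(100000, 5000))   # line numbers to extract
  con  <- file("mydata.dat", open = "r")
  kept <- character(0)
  done <- 0                              # lines read so far
  repeat {
      chunk <- readLines(con, n = 10000)
      if (length(chunk) == 0) break
      hit  <- wanted[wanted > done & wanted <= done + length(chunk)]
      kept <- c(kept, chunk[hit - done]) # keep the wanted lines
      done <- done + length(chunk)
  }
  close(con)
  tc <- textConnection(kept)             # parse the kept lines
  subdf <- read.table(tc)                # into a 41-column data frame
  close(tc)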
--
Dr Murray Jorgensen      http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: [EMAIL PROTECTED]                                Fax 7 838 4155
Phone  +64 7 838 4773 wk    +64 7 849 6486 home    Mobile 021 1395 862
