I had a file with 200,000 lines in it, and it took 1 second to select 3000 sample lines out of it. One approach is to use a connection so that the file stays open, and then just 'skip' to the next record to read:

> input <- file("/tempxx.txt", "r")
> sel <- 3000
> remaining <- 200000
> # get the record numbers to select
> recs <- sort(sample(1:remaining, sel))
> # compute number to skip on each read; account for the record just read
> # (start from 0 so that the first read skips recs[1] - 1 lines)
> skip <- diff(c(0, recs)) - 1
> # allocate my data
> mysel <- vector('character', sel)
> system.time({
+     for (i in 1:sel){
+         mysel[i] <- scan(input, what="", sep="\n", skip=skip[i], n=1, quiet=TRUE)
+     }
+ })
[1] 0.97 0.02 1.00   NA   NA
> # release the connection when done
> close(input)
>

On 2/2/07, juli g. pausas <[EMAIL PROTECTED]> wrote:
> Hi all,
> I have a large file (1.8 GB) with 900,000 lines that I would like to read.
> Each line is a string characters. Specifically I would like to randomly
> select 3000 lines. For smaller files, what I'm doing is:
>
> trs <- scan("myfile", what= character(), sep = "\n")
> trs<- trs[sample(length(trs), 3000)]
>
> And this works OK; however my computer seems not able to handle the 1.8 G
> file.
> I thought of an alternative way that not require to read the whole file:
>
> sel <- sample(1:900000, 3000)
> for (i in 1:3000) {
> un <- scan("myfile", what= character(), sep = "\n", skip=sel[i], nlines=1)
> write(un, "myfile_short", append=TRUE)
> }
>
> This works on my computer; however it is extremely slow; it read one line
> each time. It is been running for 25 hours and I think it has done less than
> half of the file (Yes, probably I do not have a very good computer and I'm
> working under Windows ...).
> So my question is: do you know any other faster way to do this?
> Thanks in advance
>
> Juli
>
> --
> http://www.ceam.es/pausas
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
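
For what it's worth, the open-connection idea can also be wrapped in a small function. This is only a sketch under some assumptions: the name sample_lines and its arguments are made up for illustration, n_total is assumed to be no larger than the real number of lines in the file, and it steps over the gaps with readLines() rather than scan(skip=), which avoids scan's default of silently dropping blank lines.

```r
# Hypothetical helper, not part of the original post: draw n_sel random
# lines from a text file of n_total lines in a single pass.
sample_lines <- function(path, n_sel, n_total) {
  stopifnot(n_sel <= n_total)
  con <- file(path, "r")
  on.exit(close(con), add = TRUE)      # close the connection even on error

  # sorted record numbers, so the file is only ever read forwards
  recs <- sort(sample.int(n_total, n_sel))
  # lines to step over before each wanted record; starting from 0 makes
  # the first gap equal to recs[1] - 1
  gaps <- diff(c(0L, recs)) - 1L

  out <- character(n_sel)
  for (i in seq_len(n_sel)) {
    if (gaps[i] > 0L) readLines(con, n = gaps[i])   # read and discard
    out[i] <- readLines(con, n = 1L)                # the wanted record
  }
  list(recs = recs, lines = out)
}

# Small self-check on a temporary file (assumed data, for illustration)
tmp <- tempfile()
writeLines(paste("line", 1:1000), tmp)
res <- sample_lines(tmp, 10, 1000)
```

Returning the record numbers alongside the lines makes it easy to verify afterwards that each selected line really is the one that was asked for.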