> From: Marc Schwartz
>
> On Mon, 2004-12-06 at 12:26 -0500, Liaw, Andy wrote:
> > Marc alerted me off-list that count.fields() might spend time
> > delimiting fields, which is not needed for the purpose of counting
> > lines, and suggested using sep="\n" as a possible way to make it
> > more efficient.  (Thanks, Marc!)
> >
> > Here are some tests on a file with 14337 lines and 8900 fields
> > (space delimited).
> >
> > > system.time(n <- length(count.fields("hcv.ap")), gcFirst=TRUE)
> > [1] 48.86 0.24 49.30 0.00 0.00
> > > system.time(n <- length(count.fields("hcv.ap", sep="\n")), gcFirst=TRUE)
> > [1] 42.19 0.26 42.60 0.00 0.00
>
> Andy,
>
> I suspect that the relatively modest gain here is the result of
> count.fields() still scanning the input buffer for the delimiting
> character, even though it occurs only once per line when using the
> newline character. Thus, the overhead is not reduced substantially.
>
> A scan of the source code for the .Internal function would confirm
> that.
>
> Thanks for testing this.
>
> As both you and Thomas mention, 'wc' is clearly the fastest way to
> go, based upon your additional figures.
>
> Best regards,
>
> Marc
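For reference, the 'wc' route mentioned above can be wrapped from R
along these lines. This is only a sketch, assuming a Unix-like system
with 'wc' on the PATH; countLinesWC is an illustrative name, not a
function from this thread:

countLinesWC <- function(file) {
    ## 'wc -l' prints "<count> <filename>"; keep the first field
    out <- system(paste("wc -l", shQuote(file)), intern = TRUE)
    as.numeric(strsplit(sub("^ +", "", out), " +")[[1]][1])
}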
Marc,

I wrote the following function to read the file in chunks:

countLines <- function(file, chunk=1e3) {
    f <- file(file, "r")
    on.exit(close(f))
    nLines <- 0
    while ((n <- length(readLines(f, chunk))) > 0)
        nLines <- nLines + n
    nLines
}

To my surprise:

> system.time(n4 <- countLines("hcv.ap"), gcFirst=TRUE)
[1] 35.24 0.26 35.53 0.00 0.00
> system.time(n4 <- countLines("hcv.ap", 1), gcFirst=TRUE)
[1] 36.10 0.32 36.43 0.00 0.00

There is almost no time penalty in reading one line at a time. One
does save quite a bit of memory that way, though.

Cheers,
Andy
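A related way to trim the overhead further, sketched here only as an
idea in the same spirit (countNewlines is an illustrative name, not a
function from this thread): read the file in raw chunks with readBin()
and count newline bytes directly, skipping the character conversion
that readLines() performs.

countNewlines <- function(file, chunk = 2^16) {
    con <- file(file, "rb")
    on.exit(close(con))
    n <- 0
    ## count 0x0a bytes; note this misses a final line that lacks a
    ## trailing newline, which readLines() would still count
    while (length(bytes <- readBin(con, "raw", chunk)) > 0)
        n <- n + sum(bytes == as.raw(10L))
    n
}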