> From: Liaw, Andy
>
> > From: Marc Schwartz
> >
> > On Mon, 2004-12-06 at 22:12 +0800, Hu Chen wrote:
> > > hi all
> > > If I wanna get the total number of lines in a big file without reading
> > > the file's content into R as matrix or data frame, any methods or
> > > functions?
> > > thanks in advance.
> > > Regards
> >
> > See ?readLines
> >
> > You can use:
> >
> > length(readLines("FileName"))
> >
> > to get the number of lines read.
> >
> > HTH,
> >
> > Marc Schwartz
>
> On a system equipped with `wc' (*nix or Windows with such utilities
> installed and on PATH) I would use that. Otherwise
> length(count.fields())
> might be a good choice.
>
> Cheers,
> Andy

Marc alerted me off-list that count.fields() might spend time delimiting
fields, which is not needed for the purpose of counting lines, and
suggested using sep="\n" as a possible way to make it more efficient.
(Thanks, Marc!)

Here are some tests on a file with 14337 lines and 8900 fields (space
delimited).

> system.time(n <- length(count.fields("hcv.ap")), gcFirst=TRUE)
[1] 48.86  0.24 49.30  0.00  0.00
> system.time(n <- length(count.fields("hcv.ap", sep="\n")), gcFirst=TRUE)
[1] 42.19  0.26 42.60  0.00  0.00
> n
[1] 14337
> system.time(n2 <- length(readLines("hcv.ap")), gcFirst=TRUE)
[1] 37.77  0.56 38.35  0.00  0.00
> n2
[1] 14337
> system.time(n3 <- scan(pipe("wc -l hcv.ap"), what=list(0, NULL))[[1]], gcFirst=T)
Read 1 records
[1] 0.00 0.00 0.33 0.08 0.25
> n3
[1] 14337

My only concern with the readLines() approach is that it still needs to
read the entire file into memory (if I'm not mistaken), which may not be
desirable:

> system.time(obj <- readLines("hcv.ap"), gcFirst=TRUE)
[1] 36.72  0.48 37.24  0.00  0.00
> object.size(obj)/1024^2
[1] 244.6308

So it took 244+ MB just to store the text read in. I would use a loop
and read the file in small chunks if I really wanted to do it in R.

Cheers,
Andy
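
P.S. For what it's worth, the chunked reading mentioned above might look
something like the sketch below. It is untested on the file used in the
timings, and count_lines and chunk_size are names made up for
illustration, not functions or arguments in base R. It relies only on
file(), readLines() with its n argument, and close(). Note that it
counts a final line without a trailing newline, which `wc -l' would not,
so the two can differ by one on such files.

```r
# Count the lines of a text file by reading it in fixed-size chunks,
# so that at most `chunk_size` lines are held in memory at once.
# `count_lines` and `chunk_size` are illustrative names (assumptions).
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")   # open once; later reads resume where
  on.exit(close(con))             # the previous one stopped
  n <- 0                          # double, to avoid integer overflow
  repeat {
    # Read up to chunk_size lines; warn = FALSE silences the warning
    # for an incomplete final line.
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break  # zero lines returned: end of file
    n <- n + length(chunk)
  }
  n
}

# Usage, assuming a file named "hcv.ap" in the working directory:
# n4 <- count_lines("hcv.ap")
```

A larger chunk_size should mean fewer calls to readLines() at the cost
of more memory per call; I have not measured where the balance lies.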