Re: [R] scan() vs readChar() speed
Thanks; I did not notice an appreciable difference between scan() and
scan(what=double()) in this example. Adding to my confusion, I noted a
strange and apparently systematic discrepancy between the timing results
when the code is run within R.app, within emacs, or from a terminal. Any
idea what might be causing this?

Thanks,

baptiste

On 2 April 2012 11:04, Duncan Murdoch wrote:
> On 12-04-01 2:58 AM, baptiste auguie wrote:
>>
>> Dear list,
>>
>> I am trying to find a fast solution to read moderately large (1 -- 10
>> million entries) text files containing only tab-delimited numeric
>> values. My test file is the following,
>>
>> nr <- 1000
>> nc <- 5000
>>
>> m <- matrix(round(rnorm(nr*nc),3),nr=nr)
>> write.table(m, file = "a.txt", append=FALSE,
>>             row.names = FALSE, col.names = FALSE)
>>
>>
>> scan() is faster than read.table(), as expected, but still quite slow
>> compared to Matlab for example. Based on archived discussions on this
>> list and Stack Overflow, I tried readChar(); it's really fast.
>> However, it returns a long character string, where I really want
>> numeric values. I can use as.numeric(strsplit()), but to my complete
>> surprise it is faster to run scan() on this text string. Consider the
>> following comparison (I use the command line wc to optimize the memory
>> allocation),
>
>
> Tell it the types of the columns, and it will go a bit faster.
>
> Duncan Murdoch
>
>>
>> load_file1 <- function(f){
>>   ## ask wc the number of words
>>   n <- scan(textConnection(system(paste("wc -w ", f), intern=TRUE)),
>>             what=list(integer(), character()), quiet=TRUE)[[1]]
>>   all <- scan(f, nmax=n, quiet=TRUE)
>>   invisible(all)
>> }
>>
>> load_file2 <- function(f){
>>   ## ask wc the number of characters
>>   n <- scan(textConnection(system(paste("wc -m ", f), intern=TRUE)),
>>             what=list(integer(), character()), quiet=TRUE)[[1]]
>>   tc <- textConnection(readChar(f, n))
>>   all <- scan(tc, quiet=TRUE, multi.line = FALSE)
>>   close(tc)
>>   invisible(all)
>> }
>>
>>
>> system.time(a <- load_file1("a.txt"))
>> ##   user  system elapsed
>> ##  7.805   0.138   8.026
>> system.time(b <- load_file2("a.txt"))
>> ##   user  system elapsed
>> ##  2.182   0.301   2.538
>> all.equal(a, b)
>> ##> [1] TRUE
>>
>>
>> Could someone explain to me why it is faster to scan a textConnection
>> than the original file? Have I missed a better solution?
>>
>> Thanks,
>>
>> baptiste
>>
>> sessionInfo()
>> R version 2.15.0 RC (2012-03-29 r58868)
>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
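
A note on the scan() vs scan(what=double()) comparison at the top of this reply: the documented default for scan()'s `what` argument is double(), so the two calls request the same parsing and similar timings are what one would expect. The sketch below is only an illustration on a small temporary file (a stand-in for a.txt, not the file from the thread); the variable names are made up for the example.

```r
## Illustrative check, not part of the original thread.
## Show the default value of scan()'s `what` argument.
print(formals(scan)$what)

## Small stand-in for the a.txt test file (20 x 50 instead of 1000 x 5000)
tmp <- tempfile(fileext = ".txt")
m <- matrix(round(rnorm(20 * 50), 3), nrow = 20)
write.table(m, file = tmp, row.names = FALSE, col.names = FALSE)

a <- scan(tmp, quiet = TRUE)                   # default `what`
b <- scan(tmp, what = double(), quiet = TRUE)  # explicit type
same <- identical(a, b)                        # both calls parse the file the same way
print(same)

unlink(tmp)
```

This says nothing about the timing differences between R.app, emacs and a terminal; it only covers why adding what=double() does not change the result.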
Re: [R] scan() vs readChar() speed
On 12-04-01 2:58 AM, baptiste auguie wrote:
> Dear list,
>
> I am trying to find a fast solution to read moderately large (1 -- 10
> million entries) text files containing only tab-delimited numeric
> values. My test file is the following,
>
> nr <- 1000
> nc <- 5000
>
> m <- matrix(round(rnorm(nr*nc),3),nr=nr)
> write.table(m, file = "a.txt", append=FALSE,
>             row.names = FALSE, col.names = FALSE)
>
>
> scan() is faster than read.table(), as expected, but still quite slow
> compared to Matlab for example. Based on archived discussions on this
> list and Stack Overflow, I tried readChar(); it's really fast.
> However, it returns a long character string, where I really want
> numeric values. I can use as.numeric(strsplit()), but to my complete
> surprise it is faster to run scan() on this text string. Consider the
> following comparison (I use the command line wc to optimize the memory
> allocation),

Tell it the types of the columns, and it will go a bit faster.

Duncan Murdoch

> load_file1 <- function(f){
>   ## ask wc the number of words
>   n <- scan(textConnection(system(paste("wc -w ", f), intern=TRUE)),
>             what=list(integer(), character()), quiet=TRUE)[[1]]
>   all <- scan(f, nmax=n, quiet=TRUE)
>   invisible(all)
> }
>
> load_file2 <- function(f){
>   ## ask wc the number of characters
>   n <- scan(textConnection(system(paste("wc -m ", f), intern=TRUE)),
>             what=list(integer(), character()), quiet=TRUE)[[1]]
>   tc <- textConnection(readChar(f, n))
>   all <- scan(tc, quiet=TRUE, multi.line = FALSE)
>   close(tc)
>   invisible(all)
> }
>
>
> system.time(a <- load_file1("a.txt"))
> ##   user  system elapsed
> ##  7.805   0.138   8.026
> system.time(b <- load_file2("a.txt"))
> ##   user  system elapsed
> ##  2.182   0.301   2.538
> all.equal(a, b)
> ##> [1] TRUE
>
>
> Could someone explain to me why it is faster to scan a textConnection
> than the original file? Have I missed a better solution?
>
> Thanks,
>
> baptiste
>
> sessionInfo()
> R version 2.15.0 RC (2012-03-29 r58868)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
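
The advice to declare the column types can be sketched as follows. This is only an illustration on a small temporary file with made-up dimensions, not a benchmark from the thread: read.table() takes the types through `colClasses` (with `nrows` as an optional size hint), and scan() takes them through `what`, where a list gives one type per field.

```r
## Illustrative sketch, not part of the original thread.
set.seed(1)
nr <- 20
nc <- 50
tmp <- tempfile(fileext = ".txt")
m <- matrix(round(rnorm(nr * nc), 3), nrow = nr)
write.table(m, file = tmp, row.names = FALSE, col.names = FALSE)

## Without type information, read.table() has to work out each column's type
d1 <- read.table(tmp)

## With colClasses the types are declared up front; nrows is a size hint
d2 <- read.table(tmp, colClasses = rep("numeric", nc), nrows = nr)

## scan() accepts a list in `what`: one double() per field, giving one vector per column
s <- scan(tmp, what = rep(list(double()), nc), quiet = TRUE)

print(all.equal(d1, d2))                       # same data either way
print(all.equal(unname(as.matrix(d2)), m))     # round trip matches the matrix written
print(all.equal(s[[1]], m[, 1]))               # first field is the first column

unlink(tmp)
```

How much time the declaration saves depends on the file and the R version, so it is worth timing on the real data with system.time().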
[R] scan() vs readChar() speed
Dear list,

I am trying to find a fast solution to read moderately large (1 -- 10
million entries) text files containing only tab-delimited numeric
values. My test file is the following,

nr <- 1000
nc <- 5000

m <- matrix(round(rnorm(nr*nc),3),nr=nr)
write.table(m, file = "a.txt", append=FALSE,
            row.names = FALSE, col.names = FALSE)


scan() is faster than read.table(), as expected, but still quite slow
compared to Matlab for example. Based on archived discussions on this
list and Stack Overflow, I tried readChar(); it's really fast.
However, it returns a long character string, where I really want
numeric values. I can use as.numeric(strsplit()), but to my complete
surprise it is faster to run scan() on this text string. Consider the
following comparison (I use the command line wc to optimize the memory
allocation),

load_file1 <- function(f){
  ## ask wc the number of words
  n <- scan(textConnection(system(paste("wc -w ", f), intern=TRUE)),
            what=list(integer(), character()), quiet=TRUE)[[1]]
  all <- scan(f, nmax=n, quiet=TRUE)
  invisible(all)
}

load_file2 <- function(f){
  ## ask wc the number of characters
  n <- scan(textConnection(system(paste("wc -m ", f), intern=TRUE)),
            what=list(integer(), character()), quiet=TRUE)[[1]]
  tc <- textConnection(readChar(f, n))
  all <- scan(tc, quiet=TRUE, multi.line = FALSE)
  close(tc)
  invisible(all)
}


system.time(a <- load_file1("a.txt"))
##   user  system elapsed
##  7.805   0.138   8.026
system.time(b <- load_file2("a.txt"))
##   user  system elapsed
##  2.182   0.301   2.538
all.equal(a, b)
## > [1] TRUE


Could someone explain to me why it is faster to scan a textConnection
than the original file? Have I missed a better solution?

Thanks,

baptiste

sessionInfo()
R version 2.15.0 RC (2012-03-29 r58868)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] C

attached base packages:
[1] stats graphics grDevices utils datasets methods base
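
A possible variant of load_file2 that avoids the external call to wc is sketched below. The name load_file3 is hypothetical and the approach is not from the thread: it takes the size from file.info() and reads that many bytes with readChar(). It assumes a single-byte encoding, as in the C locale shown in the sessionInfo() above, so that the byte count equals the character count. It has not been timed against the two functions in the post.

```r
## Hypothetical variant of load_file2 (not from the original post).
## Assumes a single-byte encoding, so bytes == characters.
load_file3 <- function(f) {
  n <- file.info(f)$size                           # file size in bytes, no call to wc
  txt <- readChar(f, nchars = n, useBytes = TRUE)  # whole file as one string
  scan(text = txt, quiet = TRUE)                   # parse the string as doubles
}

## Small stand-in for the a.txt test file
tmp <- tempfile(fileext = ".txt")
m <- matrix(round(rnorm(20 * 50), 3), nrow = 20)
write.table(m, file = tmp, row.names = FALSE, col.names = FALSE)

x <- load_file3(tmp)
y <- scan(tmp, quiet = TRUE)   # reference: scan() straight from the file
print(all.equal(x, y))

unlink(tmp)
```

Using file.info() keeps the function portable to systems without wc; whether it is any faster than load_file2 would need checking with system.time() on the full-size file.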