On 12-04-01 2:58 AM, baptiste auguie wrote:
Dear list,

I am trying to find a fast solution to read moderately large (1 -- 10
million entries) text files containing only tab-delimited numeric
values. My test file is the following,

nr <- 1000
nc <- 5000

m <- matrix(round(rnorm(nr * nc), 3), nrow = nr)
write.table(m, file = "a.txt", append = FALSE,
            row.names = FALSE, col.names = FALSE)


scan() is faster than read.table(), as expected, but still quite slow
compared to Matlab for example. Based on archived discussions on this
list and Stack Overflow, I tried readChar(); it's really fast.
However, it returns a long character string, where I really want
numeric values. I can use as.numeric(strsplit()), but to my complete
surprise it is faster to run scan() on this text string. Consider the
following comparison (I use the command-line tool wc to size the reads
in advance),

Tell it the types of the columns, and it will go a bit faster.
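
A minimal sketch of what declaring the types could look like, assuming an all-numeric file like the one above. The temporary file and the small dimensions are stand-ins chosen for illustration. Note that scan() already defaults to what = double(), so the gain is likely to be larger for read.table(), where colClasses avoids type guessing.

```r
# Small stand-in for the test file (dimensions reduced for illustration)
nr <- 100
nc <- 50
m <- matrix(round(rnorm(nr * nc), 3), nrow = nr)
f <- tempfile(fileext = ".txt")
write.table(m, file = f, row.names = FALSE, col.names = FALSE)

# read.table(): declare every column numeric (colClasses is recycled),
# give the row count, and switch off comment scanning
d <- read.table(f, colClasses = "numeric", nrows = nr, comment.char = "")

# scan(): what = double() is already the default; shown explicitly here
v <- scan(f, what = double(), nmax = nr * nc, quiet = TRUE)
m2 <- matrix(v, nrow = nr, byrow = TRUE)  # the file is written row by row
```

Whether this helps noticeably depends on the file and the R version, so it is worth timing on the real data.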

Duncan Murdoch


load_file1 <- function(f) {
  ## ask wc the number of words
  n <- scan(textConnection(system(paste("wc -w ", f), intern = TRUE)),
            what = list(integer(), character()), quiet = TRUE)[[1]]
  all <- scan(f, nmax = n, quiet = TRUE)
  invisible(all)
}

load_file2 <- function(f) {
  ## ask wc the number of characters
  n <- scan(textConnection(system(paste("wc -m ", f), intern = TRUE)),
            what = list(integer(), character()), quiet = TRUE)[[1]]
  tc <- textConnection(readChar(f, n))
  all <- scan(tc, quiet = TRUE, multi.line = FALSE)
  close(tc)
  invisible(all)
}


system.time(a <- load_file1("a.txt"))
##  user  system elapsed
## 7.805   0.138   8.026
system.time(b <- load_file2("a.txt"))
##  user  system elapsed
## 2.182   0.301   2.538
all.equal(a, b)
## [1] TRUE
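
For comparison, the as.numeric(strsplit()) route mentioned above could be sketched roughly as follows. The name load_file3 is hypothetical, and file.info() is used instead of wc as an assumption to keep the sketch portable. It also assumes a single-byte encoding, so that the size in bytes equals the number of characters.

```r
# Hypothetical third variant: read the whole file as one string, split it
# on whitespace, and convert the pieces to numeric.
load_file3 <- function(f) {
  n <- file.info(f)$size                   # file size in bytes
  txt <- readChar(f, n, useBytes = TRUE)   # whole file as a single string
  fields <- strsplit(txt, "[ \t\r\n]+")[[1]]
  fields <- fields[nzchar(fields)]         # drop an empty leading field, if any
  as.numeric(fields)
}

# Small self-contained check against scan() on a temporary file
f <- tempfile(fileext = ".txt")
m <- matrix(round(rnorm(20 * 30), 3), nrow = 20)
write.table(m, file = f, row.names = FALSE, col.names = FALSE)
all.equal(load_file3(f), scan(f, quiet = TRUE))
```

Wrapping the call in system.time() on the full a.txt gives a timing that can be set against the two above.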


Could someone explain to me why it is faster to scan a textConnection
than the original file? Have I missed a better solution?

Thanks,

baptiste

sessionInfo()
R version 2.15.0 RC (2012-03-29 r58868)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
