Dear R experts: apologies for all my speed and memory questions. I have a bet with my coauthors that I can make R reasonably efficient through R-appropriate programming techniques. This is not just for kicks, but for work. For benchmarking, my [3-year-old] Mac Pro has 2.8GHz Xeons, 16GB of RAM, and R 2.13.1.
Right now, it seems that split() is why I am losing my bet. (split() is an integral component of the *apply() functions and of by(), so I need split() to be fast; its resulting list can then be fed, e.g., to mclapply().) I made up an example to illustrate my ills:

  library(data.table)
  N <- 1000
  T <- N*10
  d <- data.table(data.frame(key=rep(1:T, rep(N,T)), val=rnorm(N*T)))
  setkey(d, "key")
  gc()  ## force a garbage collection
  cat("N=", N, ". Size of d=", object.size(d)/1024/1024, "MB\n")
  print(system.time( s <- split(d, d$key) ))

My ordered input data table (or data frame; it makes no difference) is 114MB in size and takes about a second to create. split() only needs to reshape it, yet this simple operation takes almost 5 minutes on my computer, and it gets worse as the data set grows. Am I doing something wrong? Is there an alternative to split()?

sincerely,

/iaw

----
Ivo Welch (ivo.we...@gmail.com)
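PS: in case it clarifies what I am after, here is the kind of alternative I am hoping exists: keeping the per-group computation inside data.table itself via its by= argument, so the big list of pieces is never materialized. This is only a minimal sketch (mean() stands in for whatever per-group function is actually needed, and I have not yet verified the timing on my machine):

  library(data.table)
  N <- 1000
  T <- N*10
  d <- data.table(key=rep(1:T, rep(N,T)), val=rnorm(N*T))
  setkey(d, "key")
  ## grouped computation via by=, so the 10,000-piece list that
  ## split() would build is never materialized
  print(system.time( res <- d[, list(m=mean(val)), by="key"] ))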