Not sure if there is a question in here somewhere? But if I can point out an observation: if you are doing summary calculations across the rows like this, my guess is that using a data.table (data.frame) structure for that will really bite you, because this operation on a data.table/data.frame is expensive;
x <- dt[i,] However it's much faster with a matrix. It doesn't seem like you're doing anything with this dataset that takes advantage of data.table's quick grouping/indexing mojo, so why store it in data.table at all? Witness: R> library(data.table) R> m <- matrix(rnorm(1e6), nrow=10) R> d <- as.data.table(m) R> idxs <- sample(1:nrow(m), 500, replace=TRUE) R> system.time(for (i in idxs) x <- m[i,]) user system elapsed 0.497 0.169 0.670 R> system.time(for (i in idxs) x <- d[i,]) ## I killed it after waiting for 14 seconds -steve On Thu, Feb 5, 2015 at 11:48 AM, Camilo Mora <cm...@dal.ca> wrote: > In total I found 8 different way to calculate quantile in very a large > data.table. I share below their performances for future reference. Tests 1, 7 > and 8 were the fastest I found. > > Best, > > Camilo > > library(data.table) > v <- data.table(x=runif(10000),x2 = runif(10000), > x3=runif(10000),x4=runif(10000)) > > #fastest > Sys.time()->StartTEST1 > t(v[, apply(v,1,quantile,probs =c(.1,.9,.5),na.rm=TRUE)] ) > Sys.time()->EndTEST1 > > Sys.time()->StartTEST2 > v[, quantile(.SD,probs =c(.1,.9,.5)), by = 1:nrow(v)] > Sys.time()->EndTEST2 > > Sys.time()->StartTEST3 > v[, c("L","H","M"):=quantile(.SD,probs =c(.1,.9,.5)), by = 1:nrow(v)] > Sys.time()->EndTEST3 > v > v[, c("L","H","M"):=NULL] > > v[,Names:=rownames(v)] > setkey(v,Names) > > Sys.time()->StartTEST4 > v[, c("L","H","M"):=quantile(.SD,probs =c(.1,.9,.5)), by = Names] > Sys.time()->EndTEST4 > v > v[, c("L","H","M"):=NULL] > > > Sys.time()->StartTEST5 > v[, as.list(quantile(.SD,c(.1,.90,.5),na.rm=TRUE)), by=Names] > Sys.time()->EndTEST5 > > > Sys.time()->StartTEST6 > v[, as.list(quantile(.SD,c(.1,.90,.5),na.rm=TRUE)), by=Names,.SDcols=1:4] > Sys.time()->EndTEST6 > > > Sys.time()->StartTEST7 > v[, as.list(quantile(c(x , x2, x3, x4 > ),c(.1,.90,.5),na.rm=TRUE)), by=Names] > Sys.time()->EndTEST7 > > > # melting the database and doing quantily by summary. This is the second > fastest, which is ironic given that the database has to be melted first > library(reshape2) > Sys.time()->StartTEST8 > vs<-melt(v) > vs[, as.list(quantile(value,c(.1,.90,.5),na.rm=TRUE)), by=Names] > Sys.time()->EndTEST8 > > > EndTEST1-StartTEST1 > EndTEST2-StartTEST2 > EndTEST3-StartTEST3 > EndTEST4-StartTEST4 > EndTEST5-StartTEST5 > EndTEST6-StartTEST6 > EndTEST7-StartTEST7 > EndTEST8-StartTEST8 > > > ______________________________________________ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. -- Steve Lianoglou Computational Biologist Genentech ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.