[R] Fastest way to calculate quantile in large data.table
In total I found 8 different way to calculate quantile in very a large data.table. I share below their performances for future reference. Tests 1, 7 and 8 were the fastest I found. Best, Camilo library(data.table) v - data.table(x=runif(1),x2 = runif(1), x3=runif(1),x4=runif(1)) #fastest Sys.time()-StartTEST1 t(v[, apply(v,1,quantile,probs =c(.1,.9,.5),na.rm=TRUE)] ) Sys.time()-EndTEST1 Sys.time()-StartTEST2 v[, quantile(.SD,probs =c(.1,.9,.5)), by = 1:nrow(v)] Sys.time()-EndTEST2 Sys.time()-StartTEST3 v[, c(L,H,M):=quantile(.SD,probs =c(.1,.9,.5)), by = 1:nrow(v)] Sys.time()-EndTEST3 v v[, c(L,H,M):=NULL] v[,Names:=rownames(v)] setkey(v,Names) Sys.time()-StartTEST4 v[, c(L,H,M):=quantile(.SD,probs =c(.1,.9,.5)), by = Names] Sys.time()-EndTEST4 v v[, c(L,H,M):=NULL] Sys.time()-StartTEST5 v[, as.list(quantile(.SD,c(.1,.90,.5),na.rm=TRUE)), by=Names] Sys.time()-EndTEST5 Sys.time()-StartTEST6 v[, as.list(quantile(.SD,c(.1,.90,.5),na.rm=TRUE)), by=Names,.SDcols=1:4] Sys.time()-EndTEST6 Sys.time()-StartTEST7 v[, as.list(quantile(c(x , x2,x3,x4 ),c(.1,.90,.5),na.rm=TRUE)), by=Names] Sys.time()-EndTEST7 # melting the database and doing quantily by summary. This is the second fastest, which is ironic given that the database has to be melted first library(reshape2) Sys.time()-StartTEST8 vs-melt(v) vs[, as.list(quantile(value,c(.1,.90,.5),na.rm=TRUE)), by=Names] Sys.time()-EndTEST8 EndTEST1-StartTEST1 EndTEST2-StartTEST2 EndTEST3-StartTEST3 EndTEST4-StartTEST4 EndTEST5-StartTEST5 EndTEST6-StartTEST6 EndTEST7-StartTEST7 EndTEST8-StartTEST8 __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Fastest way to calculate quantile in large data.table
Not sure if there is a question in here somewhere? But if I can point out an observation: if you are doing summary calculations across the rows like this, my guess is that using a data.table (data.frame) structure for that will really bite you, because this operation on a data.table/data.frame is expensive; x - dt[i,] However it's much faster with a matrix. It doesn't seem like you're doing anything with this dataset that takes advantage of data.table's quick grouping/indexing mojo, so why store it in data.table at all? Witness: R library(data.table) R m - matrix(rnorm(1e6), nrow=10) R d - as.data.table(m) R idxs - sample(1:nrow(m), 500, replace=TRUE) R system.time(for (i in idxs) x - m[i,]) user system elapsed 0.497 0.169 0.670 R system.time(for (i in idxs) x - d[i,]) ## I killed it after waiting for 14 seconds -steve On Thu, Feb 5, 2015 at 11:48 AM, Camilo Mora cm...@dal.ca wrote: In total I found 8 different way to calculate quantile in very a large data.table. I share below their performances for future reference. Tests 1, 7 and 8 were the fastest I found. Best, Camilo library(data.table) v - data.table(x=runif(1),x2 = runif(1), x3=runif(1),x4=runif(1)) #fastest Sys.time()-StartTEST1 t(v[, apply(v,1,quantile,probs =c(.1,.9,.5),na.rm=TRUE)] ) Sys.time()-EndTEST1 Sys.time()-StartTEST2 v[, quantile(.SD,probs =c(.1,.9,.5)), by = 1:nrow(v)] Sys.time()-EndTEST2 Sys.time()-StartTEST3 v[, c(L,H,M):=quantile(.SD,probs =c(.1,.9,.5)), by = 1:nrow(v)] Sys.time()-EndTEST3 v v[, c(L,H,M):=NULL] v[,Names:=rownames(v)] setkey(v,Names) Sys.time()-StartTEST4 v[, c(L,H,M):=quantile(.SD,probs =c(.1,.9,.5)), by = Names] Sys.time()-EndTEST4 v v[, c(L,H,M):=NULL] Sys.time()-StartTEST5 v[, as.list(quantile(.SD,c(.1,.90,.5),na.rm=TRUE)), by=Names] Sys.time()-EndTEST5 Sys.time()-StartTEST6 v[, as.list(quantile(.SD,c(.1,.90,.5),na.rm=TRUE)), by=Names,.SDcols=1:4] Sys.time()-EndTEST6 Sys.time()-StartTEST7 v[, as.list(quantile(c(x , x2,x3,x4 ),c(.1,.90,.5),na.rm=TRUE)), by=Names] Sys.time()-EndTEST7 # melting the database and doing quantily by summary. This is the second fastest, which is ironic given that the database has to be melted first library(reshape2) Sys.time()-StartTEST8 vs-melt(v) vs[, as.list(quantile(value,c(.1,.90,.5),na.rm=TRUE)), by=Names] Sys.time()-EndTEST8 EndTEST1-StartTEST1 EndTEST2-StartTEST2 EndTEST3-StartTEST3 EndTEST4-StartTEST4 EndTEST5-StartTEST5 EndTEST6-StartTEST6 EndTEST7-StartTEST7 EndTEST8-StartTEST8 __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Steve Lianoglou Computational Biologist Genentech __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Fastest way to calculate quantile in large data.table
Hi everyone, I have a data.table with 200 columns and few million rows and am trying to calculate the .1 and .9 quantiles for each row across all 200 columns. I have found different ways to do this, all with different performances. The examples I used are below. I wonder whether there is a faster way to do this? Thanks and best, Camilo library(data.table) v - data.table(x=runif(1),x2 = runif(1), x3=runif(1),x4=runif(1)) v[,Names:=rownames(v)] #test 1 using .SD but not .SDcols Sys.time()-StartTEST1 v[, as.list(quantile(.SD,c(.1,.90),na.rm=TRUE)), by=Names] Sys.time()-EndTEST1 #test 2 using .SD and .SDcols Sys.time()-StartTEST2 v[, as.list(quantile(.SD,c(.1,.90),na.rm=TRUE)), by=Names,.SDcols=1:4] Sys.time()-EndTEST2 #test 3 using colnames directly. This is the fastest I found Sys.time()-StartTEST3 v[, as.list(quantile(c(x , x2,x3,x4 ),c(.1,.90),na.rm=TRUE)), by=Names] Sys.time()-EndTEST3 # melting the database and doing quantile by summary. This is the second fastest, which is ironic given that the database has to be melted first library(reshape2) Sys.time()-StartTEST4 vs-melt(v) vs[, as.list(quantile(value,c(.1,.90),na.rm=TRUE)), by=Names] Sys.time()-EndTEST4 EndTEST1-StartTEST1 EndTEST2-StartTEST2 EndTEST3-StartTEST3 EndTEST4-StartTEST4 [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.