[R] Fastest way to calculate quantile in large data.table

2015-02-05 Thread Camilo Mora
In total I found 8 different ways to calculate quantiles in a very large 
data.table. I share their performance below for future reference. Tests 1, 7 
and 8 were the fastest I found.

Best,

Camilo

library(data.table)
v <- data.table(x = runif(1), x2 = runif(1), x3 = runif(1), x4 = runif(1))

#fastest
Sys.time() -> StartTEST1
t(v[, apply(v, 1, quantile, probs = c(.1, .9, .5), na.rm = TRUE)])
Sys.time() -> EndTEST1

Sys.time() -> StartTEST2
v[, quantile(unlist(.SD), probs = c(.1, .9, .5)), by = 1:nrow(v)]
Sys.time() -> EndTEST2

Sys.time() -> StartTEST3
v[, c("L", "H", "M") := quantile(unlist(.SD), probs = c(.1, .9, .5)), by = 1:nrow(v)]
Sys.time() -> EndTEST3
v
v[, c("L", "H", "M") := NULL]

v[,Names:=rownames(v)]
setkey(v,Names)

Sys.time() -> StartTEST4
v[, c("L", "H", "M") := quantile(unlist(.SD), probs = c(.1, .9, .5)), by = Names]
Sys.time() -> EndTEST4
v
v[, c("L", "H", "M") := NULL]


Sys.time() -> StartTEST5
v[, as.list(quantile(unlist(.SD), c(.1, .90, .5), na.rm = TRUE)), by = Names]
Sys.time() -> EndTEST5


Sys.time() -> StartTEST6
v[, as.list(quantile(unlist(.SD), c(.1, .90, .5), na.rm = TRUE)), by = Names, .SDcols = 1:4]
Sys.time() -> EndTEST6


Sys.time() -> StartTEST7
v[, as.list(quantile(c(x, x2, x3, x4), c(.1, .90, .5), na.rm = TRUE)), by = Names]
Sys.time() -> EndTEST7


# Test 8: melt the table and compute quantiles by group. This is the second
# fastest, which is ironic given that the table has to be melted first.
library(reshape2)
Sys.time() -> StartTEST8
vs <- melt(v)
vs[, as.list(quantile(value, c(.1, .90, .5), na.rm = TRUE)), by = Names]
Sys.time() -> EndTEST8


EndTEST1-StartTEST1
EndTEST2-StartTEST2
EndTEST3-StartTEST3
EndTEST4-StartTEST4
EndTEST5-StartTEST5
EndTEST6-StartTEST6
EndTEST7-StartTEST7
EndTEST8-StartTEST8
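As an alternative to the Sys.time() bookkeeping above, system.time() times a single expression directly and returns user/system/elapsed seconds. A minimal sketch of the Test 1 approach on a small toy table (the 1000-row size is illustrative only, not the size benchmarked here):

```r
library(data.table)

# Toy table with four numeric columns (size chosen for illustration).
n <- 1000
v <- data.table(x = runif(n), x2 = runif(n), x3 = runif(n), x4 = runif(n))

# Test 1 style: row-wise quantiles via apply(); t() turns the 3 x n result
# back into one row per original row.
timing <- system.time(
  rq <- t(apply(v, 1, quantile, probs = c(.1, .9, .5), na.rm = TRUE))
)
dim(rq)            # n rows, one column per requested probability
timing["elapsed"]  # seconds for the whole expression
```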


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Fastest way to calculate quantile in large data.table

2015-02-05 Thread Steve Lianoglou
Not sure if there is a question in here somewhere?

But if I can point out an observation: if you are doing summary
calculations across the rows like this, my guess is that using a
data.table (data.frame) structure for that will really bite you,
because this operation on a data.table/data.frame is expensive:

  x <- dt[i, ]

However, it's much faster with a matrix. It doesn't seem like you're
doing anything with this dataset that takes advantage of data.table's
quick grouping/indexing mojo, so why store it in a data.table at all?

Witness:

R> library(data.table)
R> m <- matrix(rnorm(1e6), nrow = 10)
R> d <- as.data.table(m)
R> idxs <- sample(1:nrow(m), 500, replace = TRUE)

R> system.time(for (i in idxs) x <- m[i, ])
   user  system elapsed
  0.497   0.169   0.670

R> system.time(for (i in idxs) x <- d[i, ])
## I killed it after waiting for 14 seconds
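Steve's observation carries over to the row-quantile task itself: extract the numeric columns into a matrix once, then do the row-wise work on the matrix. A sketch of that idea (plain apply() is used here; a dedicated routine such as matrixStats::rowQuantiles, if that package is available, should be faster still):

```r
library(data.table)

# Toy data: 1000 rows, 4 numeric columns, stored as a data.table.
d <- as.data.table(matrix(rnorm(4000), ncol = 4))

# One-time conversion to a matrix, then row-wise quantiles on the matrix.
m <- as.matrix(d)
rq <- t(apply(m, 1, quantile, probs = c(.1, .9), na.rm = TRUE))
head(rq)
```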

-steve

On Thu, Feb 5, 2015 at 11:48 AM, Camilo Mora cm...@dal.ca wrote:
 [quoted message trimmed]


-- 
Steve Lianoglou
Computational Biologist
Genentech



[R] Fastest way to calculate quantile in large data.table

2015-02-04 Thread Camilo Mora
Hi everyone,

I have a data.table with 200 columns and a few million rows, and I am trying 
to calculate the .1 and .9 quantiles for each row across all 200 columns.

I have found several ways to do this, all with different performance. The 
examples I used are below. I wonder whether there is a faster way to do this?

Thanks and best,

Camilo


library(data.table)
v <- data.table(x = runif(1), x2 = runif(1), x3 = runif(1), x4 = runif(1))
v[, Names := rownames(v)]

#test 1 using .SD but not .SDcols
Sys.time() -> StartTEST1
v[, as.list(quantile(unlist(.SD), c(.1, .90), na.rm = TRUE)), by = Names]
Sys.time() -> EndTEST1

#test 2 using .SD and .SDcols
Sys.time() -> StartTEST2
v[, as.list(quantile(unlist(.SD), c(.1, .90), na.rm = TRUE)), by = Names, .SDcols = 1:4]
Sys.time() -> EndTEST2

#test 3 using colnames directly. This is the fastest I found
Sys.time() -> StartTEST3
v[, as.list(quantile(c(x, x2, x3, x4), c(.1, .90), na.rm = TRUE)), by = Names]
Sys.time() -> EndTEST3

# Test 4: melt the table and compute quantiles by group. This is the second
# fastest, which is ironic given that the table has to be melted first.
library(reshape2)
Sys.time() -> StartTEST4
vs <- melt(v)
vs[, as.list(quantile(value, c(.1, .90), na.rm = TRUE)), by = Names]
Sys.time() -> EndTEST4


EndTEST1-StartTEST1
EndTEST2-StartTEST2
EndTEST3-StartTEST3
EndTEST4-StartTEST4
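As an aside, data.table ships its own melt() method, so test 4 can be written without the reshape2 dependency; this also keeps the molten result a data.table, so the by= grouping works directly. A sketch under that assumption, on a small toy table (the 1000-row size is illustrative):

```r
library(data.table)

n <- 1000
v <- data.table(x = runif(n), x2 = runif(n), x3 = runif(n), x4 = runif(n))
v[, Names := rownames(v)]

# data.table's own melt(): the id column stays, the rest is stacked into
# a 'value' column, one row per (Names, variable) pair.
vs <- melt(v, id.vars = "Names")
res <- vs[, as.list(quantile(value, c(.1, .90), na.rm = TRUE)), by = Names]
head(res)
```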

