Re: [R] speed of a vector operation question
Thank you all very much for your time and suggestions. The link to stackoverflow was very helpful. Here are some timings in case someone wants to know. (I noticed that microbenchmark results vary, depending on how many functions one tries to benchmark at a time. However, the min stays about the same) # just to refresh, most of the code is from stackoverflow link provided by Martin Morgan : http://stackoverflow.com/questions/16213029/more-efficient- strategy-for-which-or-match f0 - function(v) length(which(v 0)) f1 - function(v) sum(v 0) f2 - function(v) which.min(v 0) - 1L f3 - function(x) { # binary search implemented in R imin - 1L imax - length(x) while (imax = imin) { imid - as.integer(imin + (imax - imin) / 2) if (x[imid] = 0) imax - imid - 1L else imin - imid + 1L } imax } f3.c - cmpfun(f3) # pre-compiled # binary search in C f4 - cfunction(c(x = numeric), int imin = 0, imax = Rf_length(x) - 1, imid; while (imax = imin) { imid = imin + (imax - imin) / 2; if (REAL(x)[imid] = 0) imax = imid - 1; else imin = imid + 1; } return ScalarInteger(imax + 1); ) # this one is separate suggestion by William Dunlap : f5 - function(v) { tabulate(findInterval(v, c(-Inf, 0, 1, Inf)))[1] } vec - c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6)) # the identity of results was verified microbenchmark(f1(vec), f2(vec), f3(vec), f3.c(vec), f4(vec), f5(vec)) Unit: microseconds expr min lqmedian uq max neval f1(vec) 17054.233 17831.1385 18514.305 19512.4705 54603.435 100 f2(vec) 23624.353 25026.4265 26034.785 29322.1150 60014.458 100 f3(vec)76.90293.2340 111.834 116.8370 129.888 100 f3.c(vec)21.88330.753037.75754.125062.939 100 f4(vec) 6.57510.588530.38931.938537.610 100 f5(vec) 35365.088 36767.6175 38317.103 40671.2000 69209.425 100 So, i'll try to go with the inline binary search and see if I can precompile complex conditions. Thank you, again, for your help! Mikhail. On Friday, April 26, 2013 20:52:27 Suzen, Mehmet wrote: Hello Mikhail, I could suggest you to use ff package for fast access to large data structures: http://cran.r-project.org/web/packages/ff/index.html http://wsopuppenkiste.wiso.uni-goettingen.de/ff/ff_1.0/inst/doc/ff.pdf Best Mehmet On 26 April 2013 18:12, Mikhail Umorin mike...@gmail.com wrote: Hello, I am dealing with numeric vectors 10^5 to 10^6 elements long. The values are sorted (with duplicates) in the vector (v). I am obtaining the length of vectors such as (v c) or (v c1 v c2), where c, c1, c2 are some scalar variables. What is the most efficient way to do this? I am using sum(v c) since TRUE's are 1's and FALSE's are 0's. This seems to me more efficient than length(which(v c)), but, please, correct me if I'm wrong. So, is there anything faster than what I already use? I'm running R 2.14.2 on Linux kernel 3.4.34. I appreciate your time, Mikhail [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] speed of a vector operation question
I think the sum way is the best. On Fri, Apr 26, 2013 at 9:12 AM, Mikhail Umorin mike...@gmail.com wrote: Hello, I am dealing with numeric vectors 10^5 to 10^6 elements long. The values are sorted (with duplicates) in the vector (v). I am obtaining the length of vectors such as (v c) or (v c1 v c2), where c, c1, c2 are some scalar variables. What is the most efficient way to do this? I am using sum(v c) since TRUE's are 1's and FALSE's are 0's. This seems to me more efficient than length(which(v c)), but, please, correct me if I'm wrong. So, is there anything faster than what I already use? I'm running R 2.14.2 on Linux kernel 3.4.34. I appreciate your time, Mikhail [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] speed of a vector operation question
I think the sum way is the best. On my Linux machine running R-3.0.0 the sum way is slightly faster: x - rexp(1e6, 2) system.time(for(i in 1:100)sum(x.3 x.5)) user system elapsed 4.664 0.340 5.018 system.time(for(i in 1:100)length(which(x.3 x.5))) user system elapsed 5.017 0.160 5.186 If you are doing many of these counts on the same dataset you can save time by using functions like cut(), table(), ecdf(), and findInterval(). E.g., system.time(r1 - vapply(seq(0,1,by=1/128)[-1], function(i)sum(x(i-1/128) x=i), FUN.VALUE=0L)) user system elapsed 5.332 0.568 5.909 system.time(r2 - table(cut(x, seq(0,1,by=1/128 user system elapsed 0.500 0.008 0.511 all.equal(as.vector(r1), as.vector(r2)) [1] TRUE You should do the timings yourself, as the relative speeds will depend on the version or dialect of the R interpreter and how it was compiled. E.g., with the current development version of 'TIBCO Enterprise Runtime for R' (aka 'TERR') on this same 8-core Linux box the sum way is considerably faster then the length(which) way: x - rexp(1e6, 2) system.time(for(i in 1:100)sum(x.3 x.5)) user system elapsed 1.870.030.48 system.time(for(i in 1:100)length(which(x.3 x.5))) user system elapsed 3.210.040.83 system.time(r1 - vapply(seq(0,1,by=1/128)[-1], function(i)sum(x(i-1/128) x=i), FUN.VALUE=0L)) user system elapsed 2.190.040.56 system.time(r2 - table(cut(x, seq(0,1,by=1/128 user system elapsed 0.270.010.13 all.equal(as.vector(r1), as.vector(r2)) [1] TRUE Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of lcn Sent: Friday, April 26, 2013 12:09 PM To: Mikhail Umorin Cc: r-help@r-project.org Subject: Re: [R] speed of a vector operation question I think the sum way is the best. On Fri, Apr 26, 2013 at 9:12 AM, Mikhail Umorin mike...@gmail.com wrote: Hello, I am dealing with numeric vectors 10^5 to 10^6 elements long. The values are sorted (with duplicates) in the vector (v). I am obtaining the length of vectors such as (v c) or (v c1 v c2), where c, c1, c2 are some scalar variables. What is the most efficient way to do this? I am using sum(v c) since TRUE's are 1's and FALSE's are 0's. This seems to me more efficient than length(which(v c)), but, please, correct me if I'm wrong. So, is there anything faster than what I already use? I'm running R 2.14.2 on Linux kernel 3.4.34. I appreciate your time, Mikhail [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] speed of a vector operation question
A very similar question was asked on StackOverflow (by Mikhail? and then I guess the answers there were somehow not satisfactory...) http://stackoverflow.com/questions/16213029/more-efficient-strategy-for-which-or-match where it turns out that a binary search (implemented in R) on the sorted vector is much faster than sum, etc. I guess because it's log N without copying. The more complicated condition x .3 x .5 could be satisfied with multiple calls to the search. Martin On 04/26/2013 01:20 PM, William Dunlap wrote: I think the sum way is the best. On my Linux machine running R-3.0.0 the sum way is slightly faster: x - rexp(1e6, 2) system.time(for(i in 1:100)sum(x.3 x.5)) user system elapsed 4.664 0.340 5.018 system.time(for(i in 1:100)length(which(x.3 x.5))) user system elapsed 5.017 0.160 5.186 If you are doing many of these counts on the same dataset you can save time by using functions like cut(), table(), ecdf(), and findInterval(). E.g., system.time(r1 - vapply(seq(0,1,by=1/128)[-1], function(i)sum(x(i-1/128) x=i), FUN.VALUE=0L)) user system elapsed 5.332 0.568 5.909 system.time(r2 - table(cut(x, seq(0,1,by=1/128 user system elapsed 0.500 0.008 0.511 all.equal(as.vector(r1), as.vector(r2)) [1] TRUE You should do the timings yourself, as the relative speeds will depend on the version or dialect of the R interpreter and how it was compiled. E.g., with the current development version of 'TIBCO Enterprise Runtime for R' (aka 'TERR') on this same 8-core Linux box the sum way is considerably faster then the length(which) way: x - rexp(1e6, 2) system.time(for(i in 1:100)sum(x.3 x.5)) user system elapsed 1.870.030.48 system.time(for(i in 1:100)length(which(x.3 x.5))) user system elapsed 3.210.040.83 system.time(r1 - vapply(seq(0,1,by=1/128)[-1], function(i)sum(x(i-1/128) x=i), FUN.VALUE=0L)) user system elapsed 2.190.040.56 system.time(r2 - table(cut(x, seq(0,1,by=1/128 user system elapsed 0.270.010.13 all.equal(as.vector(r1), as.vector(r2)) [1] TRUE Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of lcn Sent: Friday, April 26, 2013 12:09 PM To: Mikhail Umorin Cc: r-help@r-project.org Subject: Re: [R] speed of a vector operation question I think the sum way is the best. On Fri, Apr 26, 2013 at 9:12 AM, Mikhail Umorin mike...@gmail.com wrote: Hello, I am dealing with numeric vectors 10^5 to 10^6 elements long. The values are sorted (with duplicates) in the vector (v). I am obtaining the length of vectors such as (v c) or (v c1 v c2), where c, c1, c2 are some scalar variables. What is the most efficient way to do this? I am using sum(v c) since TRUE's are 1's and FALSE's are 0's. This seems to me more efficient than length(which(v c)), but, please, correct me if I'm wrong. So, is there anything faster than what I already use? I'm running R 2.14.2 on Linux kernel 3.4.34. I appreciate your time, Mikhail [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] speed of a vector operation question
R's findInterval can also take advantage of a sorted x vector. E.g., in R-3.0.0 on the same 8-core Linux box: x - rexp(1e6, 2) system.time(for(i in 1:100)tabulate(findInterval(x, c(-Inf, .3, .5, Inf)))[2]) user system elapsed 2.444 0.000 2.446 xs - sort(x) system.time(for(i in 1:100)tabulate(findInterval(xs, c(-Inf, .3, .5, Inf)))[2]) user system elapsed 1.472 0.000 1.475 tabulate(findInterval(xs, c(-Inf, .3, .5, Inf)))[2] [1] 180636 sum( xs .3 xs = .5 ) [1] 180636 Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com -Original Message- From: Martin Morgan [mailto:mtmor...@fhcrc.org] Sent: Friday, April 26, 2013 1:33 PM To: William Dunlap Cc: lcn; Mikhail Umorin; r-help@r-project.org Subject: Re: [R] speed of a vector operation question A very similar question was asked on StackOverflow (by Mikhail? and then I guess the answers there were somehow not satisfactory...) http://stackoverflow.com/questions/16213029/more-efficient-strategy-for-which-or- match where it turns out that a binary search (implemented in R) on the sorted vector is much faster than sum, etc. I guess because it's log N without copying. The more complicated condition x .3 x .5 could be satisfied with multiple calls to the search. Martin On 04/26/2013 01:20 PM, William Dunlap wrote: I think the sum way is the best. On my Linux machine running R-3.0.0 the sum way is slightly faster: x - rexp(1e6, 2) system.time(for(i in 1:100)sum(x.3 x.5)) user system elapsed 4.664 0.340 5.018 system.time(for(i in 1:100)length(which(x.3 x.5))) user system elapsed 5.017 0.160 5.186 If you are doing many of these counts on the same dataset you can save time by using functions like cut(), table(), ecdf(), and findInterval(). E.g., system.time(r1 - vapply(seq(0,1,by=1/128)[-1], function(i)sum(x(i-1/128) x=i), FUN.VALUE=0L)) user system elapsed 5.332 0.568 5.909 system.time(r2 - table(cut(x, seq(0,1,by=1/128 user system elapsed 0.500 0.008 0.511 all.equal(as.vector(r1), as.vector(r2)) [1] TRUE You should do the timings yourself, as the relative speeds will depend on the version or dialect of the R interpreter and how it was compiled. E.g., with the current development version of 'TIBCO Enterprise Runtime for R' (aka 'TERR') on this same 8-core Linux box the sum way is considerably faster then the length(which) way: x - rexp(1e6, 2) system.time(for(i in 1:100)sum(x.3 x.5)) user system elapsed 1.870.030.48 system.time(for(i in 1:100)length(which(x.3 x.5))) user system elapsed 3.210.040.83 system.time(r1 - vapply(seq(0,1,by=1/128)[-1], function(i)sum(x(i-1/128) x=i), FUN.VALUE=0L)) user system elapsed 2.190.040.56 system.time(r2 - table(cut(x, seq(0,1,by=1/128 user system elapsed 0.270.010.13 all.equal(as.vector(r1), as.vector(r2)) [1] TRUE Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of lcn Sent: Friday, April 26, 2013 12:09 PM To: Mikhail Umorin Cc: r-help@r-project.org Subject: Re: [R] speed of a vector operation question I think the sum way is the best. On Fri, Apr 26, 2013 at 9:12 AM, Mikhail Umorin mike...@gmail.com wrote: Hello, I am dealing with numeric vectors 10^5 to 10^6 elements long. The values are sorted (with duplicates) in the vector (v). I am obtaining the length of vectors such as (v c) or (v c1 v c2), where c, c1, c2 are some scalar variables. What is the most efficient way to do this? I am using sum(v c) since TRUE's are 1's and FALSE's are 0's. This seems to me more efficient than length(which(v c)), but, please, correct me if I'm wrong. So, is there anything faster than what I already use? I'm running R 2.14.2 on Linux kernel 3.4.34. I appreciate your time, Mikhail [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read