Re: [R] speed of a vector operation question

2013-04-29 Thread Mikhail Umorin

Thank you all very much for your time and suggestions. The link to 
stackoverflow was very helpful. Here are some timings in case someone wants to 
know. (I noticed that microbenchmark results vary, depending on how many 
functions one tries to benchmark at a time. However, the min stays about the 
same)

# just to refresh, most of the code is from stackoverflow link provided by 
Martin Morgan  : http://stackoverflow.com/questions/16213029/more-efficient-
strategy-for-which-or-match

f0 - function(v) length(which(v  0))

f1 - function(v) sum(v  0)

f2 - function(v) which.min(v  0) - 1L

f3 - function(x) { # binary search implemented in R
imin - 1L
imax - length(x)
while (imax = imin) {
imid - as.integer(imin + (imax - imin) / 2)
if (x[imid] = 0)
imax - imid - 1L
else
imin - imid + 1L
}
imax
}

f3.c - cmpfun(f3) # pre-compiled

# binary search in C
f4 - cfunction(c(x = numeric),  
int imin = 0, imax = Rf_length(x) - 1, imid;
while (imax = imin) {
imid = imin + (imax - imin) / 2;
if (REAL(x)[imid] = 0)
imax = imid - 1;
else
imin = imid + 1;
}
return ScalarInteger(imax + 1);
)

# this one is separate suggestion by William Dunlap :
f5 - function(v) {
  tabulate(findInterval(v, c(-Inf, 0, 1, Inf)))[1]
}

vec - c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
# the identity of results was verified

microbenchmark(f1(vec), f2(vec), f3(vec), f3.c(vec), f4(vec), f5(vec))
Unit: microseconds
  expr   min lqmedian uq   max neval
   f1(vec) 17054.233 17831.1385 18514.305 19512.4705 54603.435   100
   f2(vec) 23624.353 25026.4265 26034.785 29322.1150 60014.458   100
   f3(vec)76.90293.2340   111.834   116.8370   129.888   100
 f3.c(vec)21.88330.753037.75754.125062.939   100
   f4(vec) 6.57510.588530.38931.938537.610   100
   f5(vec) 35365.088 36767.6175 38317.103 40671.2000 69209.425   100


So, i'll try to go with the inline binary search and see if I can precompile 
complex conditions.

Thank you, again, for your help!

Mikhail.




On Friday, April 26, 2013 20:52:27 Suzen, Mehmet wrote:
 Hello Mikhail,
 
 I could suggest you to use ff package for fast access to large data
 structures:
 
 http://cran.r-project.org/web/packages/ff/index.html
 http://wsopuppenkiste.wiso.uni-goettingen.de/ff/ff_1.0/inst/doc/ff.pdf
 
 Best
 
 Mehmet
 
 On 26 April 2013 18:12, Mikhail Umorin mike...@gmail.com wrote:
  Hello,
  
  I am dealing with numeric vectors 10^5 to 10^6 elements long. The values
  are sorted (with duplicates) in the vector (v). I am obtaining the length
  of vectors such as (v  c) or (v  c1  v  c2), where c, c1, c2 are some
  scalar variables. What is the most efficient way to do this?
  
  I am using sum(v  c) since TRUE's are 1's and FALSE's are 0's. This seems
  to me more efficient than length(which(v  c)), but, please, correct me
  if I'm wrong. So, is there anything faster than what I already use?
  
  I'm running R 2.14.2 on Linux kernel 3.4.34.
  
  I appreciate your time,
  
  Mikhail
  
  [[alternative HTML version deleted]]
  
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide
  http://www.R-project.org/posting-guide.html and provide commented,
  minimal, self-contained, reproducible code.
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed of a vector operation question

2013-04-26 Thread lcn
I think the sum way is the best.


On Fri, Apr 26, 2013 at 9:12 AM, Mikhail Umorin mike...@gmail.com wrote:

 Hello,

 I am dealing with numeric vectors 10^5 to 10^6 elements long. The values
 are
 sorted (with duplicates) in the vector (v). I am obtaining the length of
 vectors such as (v  c) or (v  c1  v  c2), where c, c1, c2 are some
 scalar
 variables. What is the most efficient way to do this?

 I am using sum(v  c) since TRUE's are 1's and FALSE's are 0's. This seems
 to
 me more efficient than length(which(v  c)), but, please, correct me if I'm
 wrong. So, is there anything faster than what I already use?

 I'm running R 2.14.2 on Linux kernel 3.4.34.

 I appreciate your time,

 Mikhail
 [[alternative HTML version deleted]]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed of a vector operation question

2013-04-26 Thread William Dunlap

 I think the sum way is the best.

On my Linux machine running R-3.0.0 the sum way is slightly faster:
   x - rexp(1e6, 2)
   system.time(for(i in 1:100)sum(x.3  x.5))
 user  system elapsed
4.664   0.340   5.018
   system.time(for(i in 1:100)length(which(x.3  x.5)))
 user  system elapsed
5.017   0.160   5.186

If you are doing many of these counts on the same dataset you
can save time by using functions like cut(), table(), ecdf(), and
findInterval().  E.g.,
 system.time(r1 - vapply(seq(0,1,by=1/128)[-1], function(i)sum(x(i-1/128)  
 x=i), FUN.VALUE=0L))
   user  system elapsed
  5.332   0.568   5.909
 system.time(r2 - table(cut(x, seq(0,1,by=1/128
   user  system elapsed
  0.500   0.008   0.511
 all.equal(as.vector(r1), as.vector(r2))
[1] TRUE

You should do the timings yourself, as the relative speeds will depend
on the version or dialect of  the R interpreter and how it was compiled.
E.g., with the current development version of 'TIBCO Enterprise Runtime for R' 
(aka 'TERR')
on this same 8-core Linux box the sum way is considerably faster then
the length(which) way:
   x - rexp(1e6, 2)
   system.time(for(i in 1:100)sum(x.3  x.5))
 user  system elapsed
 1.870.030.48
   system.time(for(i in 1:100)length(which(x.3  x.5)))
 user  system elapsed
 3.210.040.83
   system.time(r1 - vapply(seq(0,1,by=1/128)[-1], function(i)sum(x(i-1/128) 
 x=i), FUN.VALUE=0L))
 user  system elapsed
 2.190.040.56
   system.time(r2 - table(cut(x, seq(0,1,by=1/128
 user  system elapsed
 0.270.010.13
   all.equal(as.vector(r1), as.vector(r2))
  [1] TRUE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


 -Original Message-
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
 Behalf
 Of lcn
 Sent: Friday, April 26, 2013 12:09 PM
 To: Mikhail Umorin
 Cc: r-help@r-project.org
 Subject: Re: [R] speed of a vector operation question
 
 I think the sum way is the best.
 
 
 On Fri, Apr 26, 2013 at 9:12 AM, Mikhail Umorin mike...@gmail.com wrote:
 
  Hello,
 
  I am dealing with numeric vectors 10^5 to 10^6 elements long. The values
  are
  sorted (with duplicates) in the vector (v). I am obtaining the length of
  vectors such as (v  c) or (v  c1  v  c2), where c, c1, c2 are some
  scalar
  variables. What is the most efficient way to do this?
 
  I am using sum(v  c) since TRUE's are 1's and FALSE's are 0's. This seems
  to
  me more efficient than length(which(v  c)), but, please, correct me if I'm
  wrong. So, is there anything faster than what I already use?
 
  I'm running R 2.14.2 on Linux kernel 3.4.34.
 
  I appreciate your time,
 
  Mikhail
  [[alternative HTML version deleted]]
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide
  http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 
 
   [[alternative HTML version deleted]]
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed of a vector operation question

2013-04-26 Thread Martin Morgan
A very similar question was asked on StackOverflow (by Mikhail? and then I guess 
the answers there were somehow not satisfactory...)



http://stackoverflow.com/questions/16213029/more-efficient-strategy-for-which-or-match

where it turns out that a binary search (implemented in R) on the sorted vector 
is much faster than sum, etc. I guess because it's log N without copying. The 
more complicated condition x  .3  x  .5 could be satisfied with multiple 
calls to the search.


Martin

On 04/26/2013 01:20 PM, William Dunlap wrote:



I think the sum way is the best.


On my Linux machine running R-3.0.0 the sum way is slightly faster:
x - rexp(1e6, 2)
system.time(for(i in 1:100)sum(x.3  x.5))
  user  system elapsed
 4.664   0.340   5.018
system.time(for(i in 1:100)length(which(x.3  x.5)))
  user  system elapsed
 5.017   0.160   5.186

If you are doing many of these counts on the same dataset you
can save time by using functions like cut(), table(), ecdf(), and
findInterval().  E.g.,

system.time(r1 - vapply(seq(0,1,by=1/128)[-1], function(i)sum(x(i-1/128)  
x=i), FUN.VALUE=0L))

user  system elapsed
   5.332   0.568   5.909

system.time(r2 - table(cut(x, seq(0,1,by=1/128

user  system elapsed
   0.500   0.008   0.511

all.equal(as.vector(r1), as.vector(r2))

[1] TRUE

You should do the timings yourself, as the relative speeds will depend
on the version or dialect of  the R interpreter and how it was compiled.
E.g., with the current development version of 'TIBCO Enterprise Runtime for R' 
(aka 'TERR')
on this same 8-core Linux box the sum way is considerably faster then
the length(which) way:
x - rexp(1e6, 2)
system.time(for(i in 1:100)sum(x.3  x.5))
  user  system elapsed
  1.870.030.48
system.time(for(i in 1:100)length(which(x.3  x.5)))
  user  system elapsed
  3.210.040.83
system.time(r1 - vapply(seq(0,1,by=1/128)[-1], function(i)sum(x(i-1/128)  
x=i), FUN.VALUE=0L))
  user  system elapsed
  2.190.040.56
system.time(r2 - table(cut(x, seq(0,1,by=1/128
  user  system elapsed
  0.270.010.13
all.equal(as.vector(r1), as.vector(r2))
   [1] TRUE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com



-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf
Of lcn
Sent: Friday, April 26, 2013 12:09 PM
To: Mikhail Umorin
Cc: r-help@r-project.org
Subject: Re: [R] speed of a vector operation question

I think the sum way is the best.


On Fri, Apr 26, 2013 at 9:12 AM, Mikhail Umorin mike...@gmail.com wrote:


Hello,

I am dealing with numeric vectors 10^5 to 10^6 elements long. The values
are
sorted (with duplicates) in the vector (v). I am obtaining the length of
vectors such as (v  c) or (v  c1  v  c2), where c, c1, c2 are some
scalar
variables. What is the most efficient way to do this?

I am using sum(v  c) since TRUE's are 1's and FALSE's are 0's. This seems
to
me more efficient than length(which(v  c)), but, please, correct me if I'm
wrong. So, is there anything faster than what I already use?

I'm running R 2.14.2 on Linux kernel 3.4.34.

I appreciate your time,

Mikhail
 [[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed of a vector operation question

2013-04-26 Thread William Dunlap
R's findInterval can also take advantage of a sorted x vector.  E.g.,
in R-3.0.0 on the same 8-core Linux box:

 x - rexp(1e6, 2)
 system.time(for(i in 1:100)tabulate(findInterval(x, c(-Inf, .3, .5, Inf)))[2])
   user  system elapsed
  2.444   0.000   2.446
 xs - sort(x)
 system.time(for(i in 1:100)tabulate(findInterval(xs, c(-Inf, .3, .5, 
 Inf)))[2])
   user  system elapsed
  1.472   0.000   1.475
 
 tabulate(findInterval(xs, c(-Inf, .3, .5, Inf)))[2]
[1] 180636
 sum( xs  .3  xs = .5 )
[1] 180636


Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


 -Original Message-
 From: Martin Morgan [mailto:mtmor...@fhcrc.org]
 Sent: Friday, April 26, 2013 1:33 PM
 To: William Dunlap
 Cc: lcn; Mikhail Umorin; r-help@r-project.org
 Subject: Re: [R] speed of a vector operation question
 
 A very similar question was asked on StackOverflow (by Mikhail? and then I 
 guess
 the answers there were somehow not satisfactory...)
 
 
 http://stackoverflow.com/questions/16213029/more-efficient-strategy-for-which-or-
 match
 
 where it turns out that a binary search (implemented in R) on the sorted 
 vector
 is much faster than sum, etc. I guess because it's log N without copying. The
 more complicated condition x  .3  x  .5 could be satisfied with multiple
 calls to the search.
 
 Martin
 
 On 04/26/2013 01:20 PM, William Dunlap wrote:
 
  I think the sum way is the best.
 
  On my Linux machine running R-3.0.0 the sum way is slightly faster:
  x - rexp(1e6, 2)
  system.time(for(i in 1:100)sum(x.3  x.5))
user  system elapsed
   4.664   0.340   5.018
  system.time(for(i in 1:100)length(which(x.3  x.5)))
user  system elapsed
   5.017   0.160   5.186
 
  If you are doing many of these counts on the same dataset you
  can save time by using functions like cut(), table(), ecdf(), and
  findInterval().  E.g.,
  system.time(r1 - vapply(seq(0,1,by=1/128)[-1], function(i)sum(x(i-1/128) 
   x=i),
 FUN.VALUE=0L))
  user  system elapsed
 5.332   0.568   5.909
  system.time(r2 - table(cut(x, seq(0,1,by=1/128
  user  system elapsed
 0.500   0.008   0.511
  all.equal(as.vector(r1), as.vector(r2))
  [1] TRUE
 
  You should do the timings yourself, as the relative speeds will depend
  on the version or dialect of  the R interpreter and how it was compiled.
  E.g., with the current development version of 'TIBCO Enterprise Runtime for 
  R' (aka
 'TERR')
  on this same 8-core Linux box the sum way is considerably faster then
  the length(which) way:
  x - rexp(1e6, 2)
  system.time(for(i in 1:100)sum(x.3  x.5))
user  system elapsed
1.870.030.48
  system.time(for(i in 1:100)length(which(x.3  x.5)))
user  system elapsed
3.210.040.83
  system.time(r1 - vapply(seq(0,1,by=1/128)[-1], 
  function(i)sum(x(i-1/128)  x=i),
 FUN.VALUE=0L))
user  system elapsed
2.190.040.56
  system.time(r2 - table(cut(x, seq(0,1,by=1/128
user  system elapsed
0.270.010.13
  all.equal(as.vector(r1), as.vector(r2))
 [1] TRUE
 
  Bill Dunlap
  Spotfire, TIBCO Software
  wdunlap tibco.com
 
 
  -Original Message-
  From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
 Behalf
  Of lcn
  Sent: Friday, April 26, 2013 12:09 PM
  To: Mikhail Umorin
  Cc: r-help@r-project.org
  Subject: Re: [R] speed of a vector operation question
 
  I think the sum way is the best.
 
 
  On Fri, Apr 26, 2013 at 9:12 AM, Mikhail Umorin mike...@gmail.com wrote:
 
  Hello,
 
  I am dealing with numeric vectors 10^5 to 10^6 elements long. The values
  are
  sorted (with duplicates) in the vector (v). I am obtaining the length of
  vectors such as (v  c) or (v  c1  v  c2), where c, c1, c2 are some
  scalar
  variables. What is the most efficient way to do this?
 
  I am using sum(v  c) since TRUE's are 1's and FALSE's are 0's. This seems
  to
  me more efficient than length(which(v  c)), but, please, correct me if 
  I'm
  wrong. So, is there anything faster than what I already use?
 
  I'm running R 2.14.2 on Linux kernel 3.4.34.
 
  I appreciate your time,
 
  Mikhail
   [[alternative HTML version deleted]]
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide
  http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 
 
 [[alternative HTML version deleted]]
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide 
  http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read