Ravi: Well, that's a challenge!
Here is a solution which is faster than Bill's as long as length(y) << length(x). f3 <- function(x, y) length(y)*ecdf(y)(x) Then using your test on my workstation, I get for f2(): > tt [1] 0.761 1.071 1.329 1.565 1.894 3.865 11.824 and for f3(): > tt [1] 0.717 0.882 1.010 1.161 1.389 3.080 15.530 I suspect ecdf() is a little more intuitive than findInterval(). Hope this helps, Ray Brownrigg On Fri, 15 Apr 2011, Ravi Varadhan wrote: > Bill's code is insanely fast! > > f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y))) > > n1 <- 1e07 > n2 <- 10^c(1,2,3,4,5,6,7) > tt <- rep(NA, 7) > x <- rnorm(n1) > for (i in 1:length(n2)){ > y <- runif(n2[i]) > tt[i] <- system.time(a1 <- f2(x, y))[3] > } > > > tt > > [1] 0.70 0.86 1.03 1.28 1.54 4.99 12.07 > > > I would be surprised if this can be beaten even in C/C++. > > Ravi. > ________________________________________ > From: r-help-boun...@r-project.org [r-help-boun...@r-project.org] On Behalf > Of Douglas Bates [ba...@stat.wisc.edu] Sent: Thursday, April 14, 2011 6:22 > PM > To: William Dunlap > Cc: r-help@r-project.org; Sunduz Keles; rcpp-devel; Kevin Ummel > Subject: Re: [R] Find number of elements less than some number: > Elegant/fastsolution needed > > My colleague Sunduz Keles once mentioned a similar problem to me. She > had a large sample from a reference distribution and a test sample > (both real-valued in her case) and she wanted, for each element of the > test sample, the proportion of the reference sample that was less than > the element. It's a type of empirical p-value calculation. > > I forget the exact sizes of the samples (do you remember, Sunduz?) but > they were in the tens of thousands or larger. Solutions in R tended > to involve comparing every element in the test sample to every element > in the reference sample but, of course, that is unnecessary. If you > sort both samples then you can start the comparisons for a particular > element of the test sample at the element of the reference sample > where the last comparison failed. > > I was able to write a very short C++ function using the Rcpp package > that provided about a 1000-fold increase in speed relative to the best > I could do in R. I don't have the script on this computer so I will > post it tomorrow when I am back on the computer at the office. > > Apologies for cross-posting to the Rcpp-devel list but I am doing so > because this might make a good example of the usefulness of Rcpp and > inline. > > On Thu, Apr 14, 2011 at 4:45 PM, William Dunlap <wdun...@tibco.com> wrote: > >> -----Original Message----- > >> From: r-help-boun...@r-project.org > >> [mailto:r-help-boun...@r-project.org] On Behalf Of Kevin Ummel > >> Sent: Thursday, April 14, 2011 12:35 PM > >> To: r-help@r-project.org > >> Subject: [R] Find number of elements less than some number: > >> Elegant/fastsolution needed > >> > >> Take vector x and a subset y: > >> > >> x=1:10 > >> > >> y=c(4,5,7,9) > >> > >> For each value in 'x', I want to know how many elements in > >> 'y' are less than 'x'. > >> > >> An example would be: > >> > >> sapply(x,FUN=function(i) {length(which(y<i))}) > >> > >> [1] 0 0 0 0 1 2 2 3 3 4 > >> > >> But this solution is far too slow when x and y have lengths > >> in the millions. > >> > >> I'm certain an elegant (and computationally efficient) > >> solution exists, but I'm in the weeds at this point. > > > > If x is sorted findInterval(x, y) would do it for <= but you > > want strict <. Try > > > > f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y))) > > > > where your version is > > > > f0 <- function(x, y) sapply(x,FUN=function(i) {length(which(y<i))}) > > > > E.g., > > > > > x <- sort(sample(1e6, size=1e5)) > > > y <- sample(1e6, size=1e4, replace=TRUE) > > > system.time(r0 <- f0(x,y)) > > > > > user system elapsed > > > > 7.900 0.310 8.211 > > > > > system.time(r2 <- f2(x,y)) > > > > > user system elapsed > > > > 0.000 0.000 0.004 > > > > > identical(r0, r2) > > > > [1] TRUE > > > > Bill Dunlap > > Spotfire, TIBCO Software > > wdunlap tibco.com > > > >> Any help is much appreciated. > >> > >> Kevin > >> > >> University of Manchester > >> > >> > >> > >> > >> > >> > >> > >> > >> Take two vectors x and y, where y is a subset of x: > >> > >> x=1:10 > >> > >> y=c(2,5,6,9) > >> > >> If y is removed from x, the original x values now have a new > >> placement (index) in the resulting vector (new): > >> > >> new=x[-y] > >> > >> index=1:length(new) > >> > >> The challenge is: How can I *quickly* and *efficiently* > >> deduce the new 'index' value directly from the original 'x' > >> value -- using only 'y' as an input? > >> > >> In practice, I have very large matrices containing the 'x' > >> values, and I need to convert them to the corresponding > >> 'index' if the 'y' values are removed. > >> > >> [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.