Martin, Thanks for showing the timing tests. It is important to see how the time (and memory usage) grows with the size of the problem, where size may be the number of rows or length of the lag. Here is another function to toss in the hat. It uses no loops and does all the sum by diff'ing a cumsum, which loses some precision. I think the big speedup comes from the calculation of startPos. You can also use approx(method="const") or some zoo function to do this.
The gSum function, which computes sums of overlapping subsequences of its input, could be changed to a call to lapply without a dramatic lose of speed and thus avoid the precision problems. It also supposes that dat$a is already sorted. f.wwd <- function(dat, max=5) { # filtering approach a <- dat$a minStart <- a - max i <- rep(c(FALSE, TRUE), each=length(a))[order(c(minStart, a))] startPos <- cumsum(i)[!i] + 1 endPos <- seq(along=a) gSum <- function(x) { cs <- cumsum(x) cs[endPos] - cs[startPos] + x[startPos] } dat$b <- gSum(dat$b) dat$c <- gSum(dat$c) dat } Bill Dunlap TIBCO Software Inc - Spotfire Division wdunlap tibco.com > -----Original Message----- > From: r-help-boun...@r-project.org > [mailto:r-help-boun...@r-project.org] On Behalf Of Martin Morgan > Sent: Wednesday, September 02, 2009 9:17 AM > To: Alexander Shenkin > Cc: r-help@r-project.org; spec...@stat.berkeley.edu; > cbe...@tajo.ucsd.edu > Subject: Re: [R] Avoiding loops > > Alexander Shenkin wrote: > > Though, from my limited understanding, the 'apply' family > of functions > > are actually just loops. Please correct me if I'm wrong. So, while > > more readable (which is important), they're not necessarily more > > efficient than explicit 'for' loops. > > Hi Allie -- This uses an R-level loop (and a lot of C loops!), but the > length of the loop is only as long as the maximum lag > > > f0 <- function(df0, max_lag) > { > max_lag <- min(nrow(df0), max_lag) > a <- df0[[1]] > ans <- df <- df0[,-1, drop=FALSE] > for (lag in seq_len(max_lag)) { > idx <- diff(a, lag) <= max_lag > pad <- logical(lag) > ans[c(pad, idx),] <- ans[c(pad, idx),] + df[c(idx, pad),] > } > cbind(a, ans) > } > > it makes the assumption that 'a' is sorted and unique, as in a time > series. This > > f1 <- function(df0, max_lag) > { > max_lag <- min(nrow(df0), max_lag) > a <- df0[[1]] > ans <- df0[,-1, drop=FALSE] > lag <- 1 > while(sum(idx <- diff(a, lag) <= max_lag) != 0) { > pad <- logical(lag) > ans[c(pad, idx),] <- ans[c(pad, idx),] + df[c(idx, pad),] > lag <- lag + 1 > } > cbind(a, ans) > } > > relaxes the assumption that 'a' is unique, I think, but I > haven't tested > carefully; it seems to perform about the same as f0. I think there's a > clever recursive solution in there, too. > > This is my implementation of Phil's solution > > phil0 <- function(df0, max_lag) > { > with(df0, { > g <- function(x) > apply(df0[a - x >= -max_lag & a - x <= 0, c('b','c')], > 2, sum) > data.frame(a, t(sapply(a, g))) > }) > } > > Here's my implementation of Chuck Berry's solution > > chuck0 <- function(df0, max_lag) > { > criterion <- > as.matrix(dist(df0$a)) <= max_lag & outer(df0$a,df0$a,">=") > criterion %*% as.matrix(df0[, c("b","c")]) > } > > Here's a data generator > > setup <- function(n, m) > ## n: number of rows > ## m: expected counts per sum > { > a0 <- sort(sample(seq_len(m * n), n)) > data.frame(a=a0, b=as.integer(runif(n, 1, 10)), > c=as.integer(runif(n, 1, 10))) > } > > and a comparison with > > df0 <- setup(10^3, 3) > max_lag <- 5 > > > system.time(f0res <- f0(df0, max_lag), gcFirst=TRUE) > user system elapsed > 0.016 0.000 0.016 > > system.time(phil0res <- phil0(df0, max_lag), gcFirst=TRUE) > user system elapsed > 0.960 0.000 0.962 > > system.time(chuck0res <- chuck0(df0, 5), gcFirst=TRUE) > user system elapsed > 0.252 0.000 0.254 > > > all.equal(f0res, phil0res) > [1] TRUE > > > all.equal(as.matrix(f0res[,2:3]), chuck0res, check.attributes=FALSE) > [1] TRUE > > The f0 solution seems to be usable up to about a million rows, > > > df0 <- setup(10^6, 3) > > system.time(f0res <- f0(df0, max_lag), gcFirst=TRUE) > user system elapsed > 2.680 0.004 2.700 > > Martin > > > > > allie > > > > On 9/2/2009 3:13 AM, Phil Spector wrote: > >> Here's one way (assuming your data frame is named dat): > >> > >> with(dat, > >> data.frame(a,t(sapply(a,function(x){ > >> apply(dat[a - x >= -5 & a - x <= > >> 0,c('b','c')],2,sum)})))) > >> > >> > >> - Phil Spector > >> Statistical Computing Facility > >> Department of Statistics > >> UC Berkeley > >> spec...@stat.berkeley.edu > >> > >> > >> > >> On Tue, 1 Sep 2009, dolar wrote: > >> > >>> Would like some tips on how to avoid loops as I know they > are slow in R > >>> > >>> i've got a data frame : > >>> > >>> a b c > >>> 1 5 2 > >>> 4 6 9 > >>> 5 2 3 > >>> 8 3 2 > >>> > >>> What i'd like is to sum for each value of a, the sum of b > and the sum > >>> of c > >>> where a equal to or less than (with a distance of 5) > >>> > >>> i.e. for row three > >>> we have a=5 > >>> i'd like to sum up b and sum up c with the above rule > >>> since 5, 4 and 1 are less than (within a distance of 5) > or equal to > >>> 5, then > >>> we should get the following result: > >>> > >>> a b c > >>> 5 13 14 > >>> > >>> the overall result should be > >>> a b c > >>> 1 5 2 > >>> 4 11 11 > >>> 5 13 14 > >>> 8 11 14 > >>> > >>> how can i do this without a loop? > >>> -- > >>> View this message in context: > >>> http://www.nabble.com/Avoiding-loops-tp25251376p25251376.html > >>> Sent from the R help mailing list archive at Nabble.com. > >>> > >>> ______________________________________________ > >>> R-help@r-project.org mailing list > >>> https://stat.ethz.ch/mailman/listinfo/r-help > >>> PLEASE do read the posting guide > >>> http://www.R-project.org/posting-guide.html > >>> and provide commented, minimal, self-contained, reproducible code. > >>> > >> ______________________________________________ > >> R-help@r-project.org mailing list > >> https://stat.ethz.ch/mailman/listinfo/r-help > >> PLEASE do read the posting guide > >> http://www.R-project.org/posting-guide.html > >> and provide commented, minimal, self-contained, reproducible code. > > > > ______________________________________________ > > R-help@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.