> -----Original Message-----
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of William Dunlap
> Sent: Wednesday, May 05, 2010 12:59 PM
> To: Joris Meys; jim holtman
> Cc: R mailing list
> Subject: Re: [R] Avoiding for-loop for splitting vector into
> subvectors based on positions
>
> > -----Original Message-----
> > From: r-help-boun...@r-project.org
> > [mailto:r-help-boun...@r-project.org] On Behalf Of Joris Meys
> > Sent: Tuesday, May 04, 2010 2:02 PM
> > To: jim holtman
> > Cc: R mailing list
> > Subject: Re: [R] Avoiding for-loop for splitting vector into
> > subvectors based on positions
> >
> > Thanks, works nicely. I have to do some clocking to see how much the
> > improvement is, but I surely learnt again.
> >
> > Attentive readers might have noticed my initial code contains
> > an error.
> >     tmp <- x[pos2[i]:pos2[i+1]]
> > should be:
> >     tmp <- x[pos2[i]:(pos2[i+1]-1)]
> > of course...
>
> I think you also wanted your for loop to run
> along 1:length(pos) instead of 1:length(x).
>
> Your subject line asked how to avoid a for loop
> but you seem to be interested in how to make
> your function run quickly. These are different
> questions.
>
> The following test functions seem to show that
> your time (and probably memory) problems arise
> from growing a dataset:
>     out <- c()
>     for(i in 1:length(pos)) {
>         ...
>         out<-c(out, length(tmp))
>     }
> instead of preallocating it and inserting into it:
>     out <- numeric(length(pos)) # or integer or list or ... ?
>     for(i in 1:length(pos)) {
>         ...
>         out[i] <- length(tmp)
>     }
>
> makeData <- function (nX, nPos) {
>     # make data for timing tests
>     pos <- sort(sample(nX, size=nPos, replace=FALSE))
>     pos[1] <- 1L
>     list(x = seq_len(nX), pos = pos)
> }
>
> f0 <- function (x, pos, FUN = length) {
>     # OP's code, slightly modified
>     pos2 <- c(pos, length(x) + 1)
>     retval <- c()
>     for (i in seq_len(length(pos))) {
>         tmp <- x[pos2[i]:(pos2[i + 1] - 1)]
>         retval <- c(retval, FUN(tmp))
>     }
>     retval
> }
>
> f1 <- function (x, pos, FUN = length) {
>     # like f0 but we preallocate the result
>     pos2 <- c(pos, length(x) + 1)
>     retval <- numeric(length(pos))
>     for (i in seq_len(length(pos))) {
>         tmp <- x[pos2[i]:(pos2[i + 1] - 1)]
>         retval[i] <- FUN(tmp)
>     }
>     retval
> }
>
> f2 <- function (x, pos, FUN = length) {
>     # use tapply
>     groupId <- rep(seq_along(pos), diff(c(pos, length(x) + 1)))
>     tapply(x, groupId, FUN)
> }
>
> f3 <- function (x, pos, FUN = length) {
>     # lapply(split(...))
>     groupId <- rep(seq_along(pos), diff(c(pos, length(x) + 1)))
>     unlist(lapply(split(x, groupId), FUN))
> }
>
> # make one million numbers in 400 thousand groups
> z <- makeData(nX=1e6, nPos=4e5)
> t0 <- system.time( r0 <- f0(z$x, z$pos) )
> t1 <- system.time( r1 <- f1(z$x, z$pos) )
> t2 <- system.time( r2 <- f2(z$x, z$pos) )
> t3 <- system.time( r3 <- f3(z$x, z$pos) )
>
> > rbind(t0=t0, t1=t1, t2=t2, t3=t3)
>    user.self sys.self elapsed user.child sys.child
> t0    429.44     3.30  425.84         NA        NA
> t1      3.20     0.00    3.16         NA        NA
> t2      6.91     0.01    6.72         NA        NA
> t3      2.68     0.02    2.72         NA        NA
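For anyone following along, here is a small sketch of how the groupId vector used in f2 and f3 is built, run on the toy x and pos from the original question rather than on the timing data. The names sizes, byLoop and bySplit are only illustrative and do not appear in the functions above.

```r
# Toy data from the original question
x <- 1:10
pos <- c(1, 4, 7)

# Group sizes are the gaps between consecutive start positions,
# with length(x) + 1 appended as the end marker
sizes <- diff(c(pos, length(x) + 1))    # 3 3 4

# One group label per element of x
groupId <- rep(seq_along(pos), sizes)   # 1 1 1 2 2 2 3 3 3 3

# Group lengths from the preallocated loop (as in f1) ...
pos2 <- c(pos, length(x) + 1)
byLoop <- numeric(length(pos))
for (i in seq_along(pos)) {
    byLoop[i] <- length(x[pos2[i]:(pos2[i + 1] - 1)])
}

# ... and from lapply(split(...)) (as in f3)
bySplit <- unlist(lapply(split(x, groupId), length))

# Both should agree with the sizes computed directly
stopifnot(all(byLoop == sizes), all(bySplit == sizes))
```

With FUN = length the diff() call alone already gives the answer, so the loop and split() only pay off once FUN is something more complicated.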
I forgot to mention the new-to-R-2.11.0 vapply() function.
If you know the type of the output of FUN and the type
is simple enough, it can do what tapply() or [ls]apply(split())
do, but more reliably and using less time and memory.

    f4 <- function (x, pos, FUN = length) {
        groupId <- rep(seq_along(pos), diff(c(pos, length(x) + 1)))
        vapply(split(x, groupId), FUN = FUN, FUN.VALUE = numeric(1))
    }
    > system.time(r4 <- f4(z$x, z$pos))
       user  system elapsed
       2.23    0.01    2.31
    > all(r4==r0)
    [1] TRUE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

>
> The results from each, r0-r3, are almost the same.
> f1 produced a "numeric" (double precision) result
> instead of an integer one (length() returns an integer).
> tapply() spends time seeing if FUN always returns
> the same kind of result and simplifies the answer
> if it does. The others will run into problems
> if FUN doesn't always return a single number. Choose
> a method based on how general the code needs to be
> and how much error checking you require.
>
> In any case, growing a vector that is destined to be
> large can take a lot of time.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
> > On Tue, May 4, 2010 at 5:50 PM, jim holtman
> > <jholt...@gmail.com> wrote:
> >
> > > Try this:
> > >
> > > > x <- 1:10
> > > > pos <- c(1,4,7)
> > > > pat <- rep(seq_along(pos), times=diff(c(pos, length(x) + 1)))
> > > > split(x, pat)
> > > $`1`
> > > [1] 1 2 3
> > > $`2`
> > > [1] 4 5 6
> > > $`3`
> > > [1]  7  8  9 10
> > >
> > > On Tue, May 4, 2010 at 11:29 AM, Joris Meys
> > > <jorism...@gmail.com> wrote:
> > >
> > >> Dear all,
> > >>
> > >> I'm trying to optimize code and want to avoid for-loops as much as
> > >> possible.
> > >> I'm applying a calculation on subvectors from a big one, and I get the
> > >> subvectors by using a vector of starting positions:
> > >>
> > >> x <- 1:10
> > >> pos <- c(1,4,7)
> > >> n <- length(x)
> > >>
> > >> I try to do something like this :
> > >> pos2 <- c(pos, n+1)
> > >>
> > >> out <- c()
> > >> for(i in 1:n){
> > >>     tmp <- x[pos2[i]:pos2[i+1]]
> > >>     out <- c(out, length(tmp))
> > >> }
> > >>
> > >> Never mind the length function, I apply a far more complicated one. It's
> > >> about the use of the indices in the for-loop. I didn't see any way of
> > >> doing that with an apply, unless there is a very convenient way of
> > >> splitting my vector in a list of the subvectors or so.
> > >>
> > >> Anybody an idea?
> > >> Cheers
> > >> --
> > >> Joris Meys
> > >> Statistical Consultant
> > >>
> > >> Ghent University
> > >> Faculty of Bioscience Engineering
> > >> Department of Applied mathematics, biometrics and process control
> > >>
> > >> Coupure Links 653
> > >> B-9000 Gent
> > >>
> > >> tel : +32 9 264 59 87
> > >> joris.m...@ugent.be
> > >> -------------------------------
> > >> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
> > >>
> > >> [[alternative HTML version deleted]]
> > >>
> > >> ______________________________________________
> > >> R-help@r-project.org mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/r-help
> > >> PLEASE do read the posting guide
> > >> http://www.R-project.org/posting-guide.html
> > >> and provide commented, minimal, self-contained, reproducible code.
> > >
> > > --
> > > Jim Holtman
> > > Cincinnati, OH
> > > +1 513 646 9390
> > >
> > > What is the problem that you are trying to solve?