On 7/5/07, jim holtman <[EMAIL PROTECTED]> wrote:
> You are getting two very different results in what you are comparing.
>
> > system.time(lapply(1:10^4, mean))
>    user  system elapsed
>    1.31    0.00    1.31
> is returning a list with 10,000 values in it.  It is taking time to allocate
> the space and such.
>
> > system.time(for(i in 1:10^4) mean(i))
>    user  system elapsed
>    0.33    0.00    0.32
> is just returning a single value (mean(10^4)) and is not having to allocate
> space and setup the structure for a list.  Typically you use 'lapply' not
> only for 'looping', but more importantly returning the values associated
> with the processing.

The point still holds:

> system.time(lapply(1:10^4, mean))
   user  system elapsed
  3.748   2.404   6.161
> system.time({ a = numeric(10^4); for (i in 1:10^4) a[i] = mean(i) })
   user  system elapsed
  0.716   0.004   0.720

To really get rid of the for loop, you need to move the loop to pure C
code, e.g.

> system.time(rowMeans(matrix(1:10^4, ncol = 1)))
   user  system elapsed
  0.004   0.000   0.004

Sometimes you can do this using functions available in R, e.g. using
tapply() in your original question and rowMeans() in this example.
Sometimes you cannot, and the only way to gain efficiency is to write
custom C code (we do not have enough information to decide which is
the case in your real example, since we don't know what it is).

-Deepayan

> On 7/5/07, Michael Frumin <[EMAIL PROTECTED]> wrote:
> >
> > the problem I have is that userid's are not just sequential from
> > 1:n_users.  if they were, of course I'd have made a big matrix that was
> > n_users x n_fields and that would be that.  but, I think what I can do is
> > just use the hash to store the index into the result matrix, nothing
> > more.  then the rest of it will be easy.
> >
> > but please tell me more about eliminating loops.  In many cases in R I
> > have used lapply and derivatives to avoid loops, but in this case they
> > seem to give me extra overhead simply by the generation of their result
> > lists:
> >
> > > system.time(lapply(1:10^4, mean))
> >    user  system elapsed
> >    1.31    0.00    1.31
> > > system.time(for(i in 1:10^4) mean(i))
> >    user  system elapsed
> >    0.33    0.00    0.32
> >
> >
> > thanks,
> > mike
> >
> >
> > > I don't think that's a fair comparison--- much of the overhead comes
> > > from the use of data frames and the creation of the indexing vector.  I
> > > get
> > >
> > > > n_accts <- 10^3
> > > > n_trans <- 10^4
> > > > t <- list()
> > > > t$amt <- runif(n_trans)
> > > > t$acct <- as.character(round(runif(n_trans, 1, n_accts)))
> > > > uhash <- new.env(hash=TRUE, parent=emptyenv(), size=n_accts)
> > > > for (acct in as.character(1:n_accts)) uhash[[acct]] <- list(amt=0, n=0)
> > > > system.time(for (i in seq_along(t$amt)) {
> > > +     acct <- t$acct[i]
> > > +     x <- uhash[[acct]]
> > > +     uhash[[acct]] <- list(amt=x$amt + t$amt[i], n=x$n + 1)
> > > + }, gcFirst = TRUE)
> > >    user  system elapsed
> > >   0.508   0.008   0.517
> > > > udf <- matrix(0, nrow = n_accts, ncol = 2)
> > > > rownames(udf) <- as.character(1:n_accts)
> > > > colnames(udf) <- c("amt", "n")
> > > > system.time(for (i in seq_along(t$amt)) {
> > > +     idx <- t$acct[i]
> > > +     udf[idx, ] <- udf[idx, ] + c(t$amt[i], 1)
> > > + }, gcFirst = TRUE)
> > >    user  system elapsed
> > >   1.872   0.008   1.883
> > >
> > > The loop is still going to be the problem for realistic examples.
> > >
> > > -Deepayan
> >
> > ______________________________________________
> > R-help@stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> >
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem you are trying to solve?
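
For the per-account totals discussed in this thread, the suggestion to let built-in functions do the looping might look roughly like the sketch below. It is only an illustration: the account ids, the seed and the variable names are made up, and rowsum() is not mentioned anywhere in the thread (it is a base R function that sums rows by group, swapped in here alongside tapply()). The ids are deliberately non-sequential, since neither function needs them to run from 1:n_users.

```r
set.seed(42)                              # arbitrary seed, for reproducibility only
n_trans <- 10^4

# Hypothetical, non-sequential account ids (assumption, not from the thread)
ids  <- paste0("u", sample(10^6, 10^3))
acct <- sample(ids, n_trans, replace = TRUE)
amt  <- runif(n_trans)

# tapply() does the grouping internally: one call per summary
amt_by_acct <- tapply(amt, acct, sum)     # total amount per account
n_by_acct   <- tapply(amt, acct, length)  # number of transactions per account

# rowsum() sums every column by group in a single call;
# the constant column of 1s turns into a per-account count
totals <- rowsum(cbind(amt = amt, n = 1), group = acct)

head(totals)
```

Both results are keyed by the account id (names for tapply(), row names for rowsum()), so a single account can be looked up with e.g. totals[acct[1], ]. Whether this is faster than the explicit loop on real data would need to be checked with system.time() on that data.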