the problem I have is that user IDs are not just sequential from 1:n_users. if they were, of course I'd have made a big matrix that was n_users x n_fields and that would be that. but I think what I can do is use the hash to store just the index into the result matrix, nothing more. then the rest of it will be easy.
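A minimal sketch of that idea, assuming made-up user IDs and a two-column result (`user_ids`, `trans`, `idx_of` and `res` are hypothetical names, not from the thread): the environment holds only a row number per ID, and the totals live in a preallocated matrix addressed by that integer.

```r
# Hypothetical data: user IDs are arbitrary strings, not 1:n_users.
set.seed(1)
user_ids <- c("u1007", "u42", "u999983", "u5")  # assumed example IDs
n_trans  <- 1000
trans <- list(
  amt  = runif(n_trans),
  user = sample(user_ids, n_trans, replace = TRUE)
)

# Hash (environment) that stores nothing but the row index for each ID.
idx_of <- new.env(hash = TRUE, parent = emptyenv(), size = length(user_ids))
for (k in seq_along(user_ids)) idx_of[[user_ids[k]]] <- k

# Preallocated result matrix, addressed by integer row index.
res <- matrix(0, nrow = length(user_ids), ncol = 2,
              dimnames = list(NULL, c("amt", "n")))

# Accumulate amount and count per user via the hashed row index.
for (i in seq_along(trans$amt)) {
  r <- idx_of[[trans$user[i]]]
  res[r, ] <- res[r, ] + c(trans$amt[i], 1)
}

# The same row indices can also be computed in one vectorized call.
rows <- match(trans$user, user_ids)
```

If the full set of IDs is known up front, as assumed here, `match()` returns the same indices without a per-row hash lookup, which may make the environment unnecessary.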
but please tell me more about eliminating loops. In many cases in R I have used lapply and derivatives to avoid loops, but in this case they seem to give me extra overhead simply by the generation of their result lists:

> system.time(lapply(1:10^4, mean))
   user  system elapsed
   1.31    0.00    1.31
> system.time(for(i in 1:10^4) mean(i))
   user  system elapsed
   0.33    0.00    0.32

thanks,
mike

> I don't think that's a fair comparison--- much of the overhead comes
> from the use of data frames and the creation of the indexing vector. I
> get
>
> > n_accts <- 10^3
> > n_trans <- 10^4
> > t <- list()
> > t$amt <- runif(n_trans)
> > t$acct <- as.character(round(runif(n_trans, 1, n_accts)))
> > uhash <- new.env(hash=TRUE, parent=emptyenv(), size=n_accts)
> > for (acct in as.character(1:n_accts)) uhash[[acct]] <- list(amt=0, n=0)
> > system.time(for (i in seq_along(t$amt)) {
> +     acct <- t$acct[i]
> +     x <- uhash[[acct]]
> +     uhash[[acct]] <- list(amt=x$amt + t$amt[i], n=x$n + 1)
> + }, gcFirst = TRUE)
>    user  system elapsed
>   0.508   0.008   0.517
> > udf <- matrix(0, nrow = n_accts, ncol = 2)
> > rownames(udf) <- as.character(1:n_accts)
> > colnames(udf) <- c("amt", "n")
> > system.time(for (i in seq_along(t$amt)) {
> +     idx <- t$acct[i]
> +     udf[idx, ] <- udf[idx, ] + c(t$amt[i], 1)
> + }, gcFirst = TRUE)
>    user  system elapsed
>   1.872   0.008   1.883
>
> The loop is still going to be the problem for realistic examples.
>
> -Deepayan

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.