Cole, Bioconductor's high throughput sequencing infrastructure package IRanges contains code that may be useful for speeding up base::rbind.data.frame. I've extracted the salient bits from that rbind method, but left the corner case handling code out. IRanges's rbind method took the approach of treating a data set as a list of equal length columns, and so it contains a number of lapplys and vector concatenation calls. Given that base::rbind.data.frame sits at the core of many operations, I'm not sure if patches would be accepted for it, but I could take a crack at it.
biocRBind <- function(..., deparse.level=1) { ## Simplified version of IRanges's rbind method for DataFrame ## Removed all data checks, ignored row names args <- list(...) df <- args[[1L]] cn <- colnames(df) cl <- unlist(lapply(as.list(df, use.names = FALSE), class)) factors <- unlist(lapply(as.list(df, use.names = FALSE), is.factor)) cols <- lapply(seq_len(length(df)), function(i) { cols <- lapply(args, `[[`, cn[i]) if (factors[i]) { # combine factor levels, coerce to character levs <- unique(unlist(lapply(cols, levels), use.names=FALSE)) cols <- lapply(cols, as.character) } combined <- do.call(c, unname(cols)) if (factors[i]) combined <- factor(combined, levs) as(combined, cl[i]) }) names(cols) <- colnames(df) do.call(data.frame, cols) } # create list of data.frames set.seed(123) dat <- vector("list", 20000) for(i in seq_along(dat)) { size <- sample(1:30, 1) dat[[i]] <- data.frame(id=rep(i, size), value=rnorm(size), letter=sample(LETTERS, size, replace=TRUE), ind=sample(c(TRUE,FALSE), size, replace=TRUE)) } # sample runs > system.time(do.call(biocRBind, dat)) user system elapsed 2.120 0.000 2.125 > system.time(do.call(biocRBind, dat)) user system elapsed 2.092 0.000 2.091 > system.time(do.call(biocRBind, dat)) user system elapsed 2.080 0.000 2.077 > sessionInfo() R Under development (unstable) (2012-04-19 r59111) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] tools_2.16.0 Cheers, Patrick On 4/19/2012 3:34 PM, Cole Beck wrote: > It's normal for me to create a list of data.frames and then use > do.call('rbind', list(...)) to create a single data.frame. However, > I've noticed as the size of the list grows large, it is perhaps better > to do this in chunks. As an example here's a list of 20,000 similar > data.frames. > > # create list of data.frames > dat <- vector("list", 20000) > for(i in seq_along(dat)) { > size <- sample(1:30, 1) > dat[[i]] <- data.frame(id=rep(i, size), value=rnorm(size), > letter=sample(LETTERS, size, replace=TRUE), ind=sample(c(TRUE,FALSE), > size, replace=TRUE)) > } > # combine into one data.frame, normal usage > # system.time(do.call('rbind', dat)) # takes 2-3 minutes > combine <- function(x, steps=NA, verbose=FALSE) { > nr <- length(x) > if(is.na(steps)) steps <- nr > while(nr %% steps != 0) steps <- steps+1 > if(verbose) cat(sprintf("step size: %s\r\n", steps)) > dl <- vector("list", steps) > for(i in seq(steps)) { > ix <- seq(from=(i-1)*nr/steps+1, length.out=nr/steps) > dl[[i]] <- do.call("rbind", x[ix]) > } > do.call("rbind", dl) > } > # combine into one data.frame > system.time(combine(dat, 100)) # takes 5-10 seconds > > I'm very surprised by this result. Does this improvement seem > reasonable? I would think "do.call" could utilize something similar > by default when the length of "args" is too high. Is using "do.call" > not recommended in this scenario? > > Regards, > Cole Beck > > ______________________________________________ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel [[alternative HTML version deleted]] ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel