Here are some timings of seemingly minor variations in data structure, with run times ranging over a factor of 100 (a factor of 3 if the worst case is omitted). One key is to avoid the partial string matching that happens with ordinary data frame subscripting.
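The core contrast can be reduced to a minimal sketch (the objects here are illustrative, much smaller than in the timings below): subscripting a data frame by character row names goes through string matching of the row names, while an explicit match() does a hashed exact lookup.

```r
# Minimal sketch of the two lookup styles timed below (illustrative only;
# actual costs depend on R version and data size).
df <- data.frame(result=seq(len=5), row.names=paste("ID", seq(len=5), sep=""))
ids <- c("ID4", "ID2")

# Row-name subscripting: routes through row-name string matching
slow <- df[ids, , drop=FALSE]

# Explicit exact matching: match() uses a hash table internally
fast <- df[match(ids, rownames(df)), , drop=FALSE]

identical(slow, fast)  # same result, very different cost at scale
```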
-- Tony Plate

> n <- 10000   # number of rows in data frame
> k <- 500     # number of vectors in indexing list
> # use a data frame with regular row names and id as factor (defaults for data.frame)
> df <- data.frame(id=paste("ID", seq(len=n), sep=""), result=seq(len=n), stringsAsFactors=TRUE)
> object.size(df)
[1] 440648
> df[1:3,,drop=FALSE]
   id result
1 ID1      1
2 ID2      2
3 ID3      3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n, size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
   user  system elapsed
   3.00    0.00    3.03
>
> # use a data frame with automatic row names (should be low overhead) and id as factor
> df <- data.frame(id=paste("ID", seq(len=n), sep=""), result=seq(len=n), row.names=NULL, stringsAsFactors=TRUE)
> object.size(df)
[1] 440648
> df[1:3,,drop=FALSE]
   id result
1 ID1      1
2 ID2      2
3 ID3      3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n, size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
   user  system elapsed
   2.68    0.00    2.70
>
> # use a data frame with automatic row names (should be low overhead) and id as character
> df <- data.frame(id=paste("ID", seq(len=n), sep=""), result=seq(len=n), row.names=NULL, stringsAsFactors=FALSE)
> object.size(df)
[1] 400448
> df[1:3,,drop=FALSE]
   id result
1 ID1      1
2 ID2      2
3 ID3      3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n, size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
   user  system elapsed
   1.54    0.00    1.59
>
> # use a data frame with ids as the row names & subscripting for matching (should be high overhead)
> df <- data.frame(id=paste("ID", seq(len=n), sep=""), result=seq(len=n), row.names="id")
> object.size(df)
[1] 400384
> df[1:3,,drop=FALSE]
    result
ID1      1
ID2      2
ID3      3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n, size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) df[i,,drop=FALSE]))
   user  system elapsed
 109.15    0.04  111.28
>
> # use a data frame with ids as the row names & match()
> df <- data.frame(id=paste("ID", seq(len=n), sep=""), result=seq(len=n), row.names="id")
> object.size(df)
[1] 400384
> df[1:3,,drop=FALSE]
    result
ID1      1
ID2      2
ID3      3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n, size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) df[match(i, rownames(df)),,drop=FALSE]))
   user  system elapsed
   1.53    0.00    1.58
>
> # use a named numeric vector to store the same data as was stored in the data frame
> x <- seq(len=n)
> names(x) <- paste("ID", seq(len=n), sep="")
> object.size(x)
[1] 400104
> x[1:3]
ID1 ID2 ID3
  1   2   3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n, size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) x[match(i, names(x))]))
   user  system elapsed
   1.14    0.05    1.19

Iestyn Lewis wrote:
> Good tip - an Rprof trace over my real data set resulted in a file
> filled with:
>
> pmatch [.data.frame [ FUN lapply
> pmatch [.data.frame [ FUN lapply
> pmatch [.data.frame [ FUN lapply
> pmatch [.data.frame [ FUN lapply
> pmatch [.data.frame [ FUN lapply
> ...
>
> with very few other calls in there. pmatch seems to be the string
> search function, so I'm guessing there's no hashing going on, or not
> very good hashing.
>
> I'll let you know how the environment option works - the Bioconductor
> project seems to make extensive use of it, so I'm guessing it's the
> way to go.
>
> Iestyn
>
> hadley wickham wrote:
>>> But... it's not any faster, which is worrisome to me because it
>>> seems like your code uses rownames and would take advantage of the
>>> hashing potential of named items.
>> I'm pretty sure it will use a hash to access the specified rows.
>> Before you pursue an environment based solution, you might want to
>> profile the code to check that the hashing is actually the slowest
>> part - I suspect creating all the new data.frames is taking the most
>> time.
>>
>> Hadley
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
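For the environment-based option discussed above, a sketch along these lines (names are illustrative, not from the thread) uses an R environment as a hash table, the mechanism the Bioconductor project relies on for keyed lookup:

```r
# Sketch of environment-as-hash-table lookup (illustrative names).
n <- 10000
e <- new.env(hash=TRUE, size=n)
keys <- paste("ID", seq(len=n), sep="")
for (i in seq(len=n)) assign(keys[i], i, envir=e)

# Hashed exact lookup per id; no partial matching is involved
lookup <- function(ids) unlist(mget(ids, envir=e))

lookup(c("ID3", "ID7"))  # named vector: ID3 = 3, ID7 = 7
```

Whether this beats match() on a plain named vector is exactly the kind of question the timings above suggest profiling before committing to a redesign.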