On 07/15/2010 01:12 AM, Hervé Pagès wrote: > Hi, > > I'm subsetting a named vector using character indices. > My vector of indices (or keys) is 10x longer than the vector > I'm subsetting. All my keys are distinct and only 10% of them > are valid (i.e. match a name of the vector being subsetted). > It is surprisingly slow: > > x1 <- 1:1000 > names(x1) <- paste("a", x1, sep="") > keys <- sample(c(names(x1), paste("b", 1:9000, sep=""))) >> system.time(y1 <- x1[keys]) > user system elapsed > 0.410 0.000 0.416 > > x2 <- 1:2000 > names(x2) <- paste("a", x2, sep="") > keys <- sample(c(names(x2), paste("b", 1:18000, sep=""))) >> system.time(y2 <- x2[keys]) > user system elapsed > 1.730 0.000 1.736
For what its worth, I think this comes about in the loop starting at subscript.c:538, which seems to be there to allow [<-,*,character to extend a vector with new named elements > x=c(a=1) > x["b"] = 2 > x a b 1 2 It seems to be irrelevant (?) for sub-setting per se (though by analogy one might expect x["c"] to return a length-1 vector NA with name "c", whereas it returns a vector with names NA). Seems like the O(n^2) loop through NonNullStringMatch could be replaced by look-ups into a hash, or an additional argument could be propagated to stringSubscript to exit early when names aren't required. Or the call to makeSubscript at subset.c:164 could instead be made to matchE in unique.c. Martin > > x3 <- 1:4000 > names(x3) <- paste("a", x3, sep="") > keys <- sample(c(names(x3), paste("b", 1:36000, sep=""))) >> system.time(y3 <- x3[keys]) > user system elapsed > 8.900 0.010 9.227 > > x4 <- 1:8000 > names(x4) <- paste("a", x4, sep="") > keys <- sample(c(names(x4), paste("b", 1:72000, sep=""))) >> system.time(y4 <- x4[keys]) > user system elapsed > 130.390 0.000 132.316 > > And it's apparently worse than quadratic in time! > > I'm wondering why this subsetting by name is so slow since it > seems it could be implemented with x4[match(keys, names(x4))], > which is very fast: only 0.012s! > > This is with R-2.11.0 and R-2.12.0. > > Thanks, > H. > > -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel