R factors are the natural way to represent categorical data, and they should be efficient since they store small integers. In fact, for many (but not all) operations, R factors are considerably slower than integers, or even character strings. This appears to be because the entire levels vector is copied whenever a factor vector is subset. For example:

> i1 <- sample(1e4,1e6,replace=TRUE)
> c1 <- paste('x',i1)
> f1 <- factor(c1)
> system.time(replicate(1e4,{q1<-i1[100:200];1}))
   user  system elapsed
   0.03    0.00    0.04
> system.time(replicate(1e4,{q1<-c1[100:200];1}))
   user  system elapsed
   0.04    0.00    0.04
> system.time(replicate(1e4,{q1<-f1[100:200];1}))
   user  system elapsed
   0.67    0.00    0.68

Putting the levels vector in an environment speeds up subsetting:

myfactor <- function(...) {
  f <- factor(...)
  g <- unclass(f)
  class(g) <- "myfactor"
  # keep the levels in an environment, which is passed by reference
  attr(g,"mylevels") <- as.environment(list(levels=attr(f,"levels")))
  g
}

`[.myfactor` <- function (x, ...) {
  y <- NextMethod("[")
  # reattach the attributes, including the shared levels environment
  attributes(y) <- attributes(x)
  y
}

> m1 <- myfactor(f1)
> system.time(replicate(1e4,{q1<-m1[100:200];1}))
   user  system elapsed
   0.05    0.00    0.04

Given R's value semantics, I believe this approach can be extended to most of class factor's functionality without problems, copying the environment if necessary. Some quick tests seem to show that this is no slower than ordinary factors even for very small numbers of levels.

To do this, appropriate methods for this class (print, [<-, levels<-, etc.) would have to be written. Perhaps some core R functions would also have to be changed?

Am I missing some obvious flaw in this approach? Has anyone already implemented a factors package using this or some similar approach?

Thanks,

-s

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
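
P.S. For concreteness, here is a rough sketch of what a few of those methods might look like. It is only an illustration under my own assumptions, not a tested implementation: the constructor takes the codes with as.integer() so that the original "levels" attribute is not carried along, and levels<- builds a fresh environment rather than modifying the shared one, which is one way of "copying the environment if necessary". The method bodies are hypothetical and ignore names, NA levels, and ordered factors.

```r
# Constructor: integer codes plus an environment holding the levels.
# as.integer() is used here (an assumption) to drop the "levels" attribute.
myfactor <- function(...) {
  f <- factor(...)
  g <- as.integer(f)
  attr(g, "mylevels") <- as.environment(list(levels = levels(f)))
  class(g) <- "myfactor"
  g
}

# Subsetting reattaches the attributes, so the environment is shared.
`[.myfactor` <- function(x, ...) {
  y <- NextMethod("[")
  attributes(y) <- attributes(x)
  y
}

# Read the levels out of the environment.
levels.myfactor <- function(x) attr(x, "mylevels")$levels

# Replace the levels by attaching a NEW environment, so that other
# vectors still pointing at the old environment are unaffected.
`levels<-.myfactor` <- function(x, value) {
  attr(x, "mylevels") <- as.environment(list(levels = as.character(value)))
  x
}

# Map the integer codes back to their labels.
as.character.myfactor <- function(x, ...) levels(x)[unclass(x)]

print.myfactor <- function(x, ...) {
  print(as.character(x), quote = FALSE)
  cat("Levels:", levels(x), "\n")
  invisible(x)
}

m <- myfactor(c("b", "a", "b", "c"))
s <- m[2:3]                     # shares m's levels environment
levels(s) <- c("A", "B", "C")   # s gets its own environment; m is unchanged
print(m)
print(s)
```

Because levels<- never writes into an existing environment, relabelling s leaves m alone, which is what value semantics would require.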