Hi,

As far as I can tell, the data.frame class adds two features to those of lists:

* matrix structure via the [,] and [,]<- operators (well, I know these are actually "["(i, j, ...), not "[,]").
* the row names attribute.

It seems that the overhead of supporting row names, both computational and RAM-wise, is rather non-trivial. I frequently subscript data frames, i.e. use [,] on them, and my timing shows that the equivalent list operation is about 7 times faster; see below. On the other hand, at least in my usage pattern, I rarely benefit from the row names attribute, so as far as I am concerned row names are just overhead. (Of course the speed difference may be due to other factors; the only thing I can tell is that subscripting is very slow in data frames relative to lists.)

I thought of writing a new class, say lightweight.data.frame, that would be polymorphic with the existing data.frame class. The class would inherit from "list" and implement the [,] and [,]<- operators. It would also implement the "rownames" function, which would return seq(nrow(x)), etc. It should also implement as.data.frame to avoid the overhead of conversion to a full-blown data.frame in calls like lm(y ~ x, data=myLightweightDataframe).

Has anyone thought of this? Can you see any potential problems?

Thanks,
Vadim

P.S. These are the timing results comparing data.frame operations to those of lists:
# make a 1e6 * 5 list
> system.time(x <- lapply(seq(5), function(x) rnorm(1e6)))
[1] 4.46 0.10 4.57 0.00 0.00

# convert it to a data.frame
> system.time(y <- as.data.frame(x))
[1] 49.17 1.25 50.61 0.00 0.00

# do an equivalent of x[-1,] on the list
> i <- seq(2, nrow(y)); system.time(x.sub <- lapply(x, function(x) x[i]))
[1] 0.19 0.15 0.35 0.00 0.00

# do an equivalent of x[-1,] on the data.frame
> i <- seq(2, nrow(y)); system.time(y.sub <- y[i,])
[1] 2.08 0.56 2.64 0.00 0.00

> 2.64/0.35
[1] 7.542857

______________________________________________
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
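
The lightweight.data.frame idea described in the message could be sketched roughly as follows. This is only an illustration under assumptions, not an existing package or a tested design: the constructor, the requirement that every column be named, the choice to return row names as character strings, and the use of compact integer row names inside the as.data.frame method are all choices made here for the example. The [,]<- operator is left out for brevity.

```r
# Hypothetical constructor: a plain list of equal-length, named columns
# carrying only a class tag (no row.names attribute is stored).
lightweight.data.frame <- function(...) {
  cols <- list(...)
  stopifnot(length(cols) > 0L,
            !is.null(names(cols)), all(nzchar(names(cols))),
            length(unique(lengths(cols))) == 1L)
  structure(cols, class = "lightweight.data.frame")
}

# dim() is an internal generic, so nrow() and ncol() pick this up.
dim.lightweight.data.frame <- function(x) {
  cols <- unclass(x)
  c(length(cols[[1L]]), length(cols))
}

# Row names are computed on demand instead of being stored.
# rownames() goes through dimnames(), so defining this method is enough.
dimnames.lightweight.data.frame <- function(x) {
  list(as.character(seq_len(nrow(x))), names(x))
}

# Matrix-style subscripting x[i, j]; x[j] selects columns, as for lists.
"[.lightweight.data.frame" <- function(x, i, j, ...) {
  cols <- unclass(x)
  if (nargs() < 3L) {
    # one-index form: x[j]
    if (!missing(i)) cols <- cols[i]
  } else {
    if (!missing(j)) cols <- cols[j]
    if (!missing(i)) cols <- lapply(cols, function(col) col[i])
  }
  structure(cols, class = "lightweight.data.frame")
}

# Cheap conversion: attach compact row names and the class directly,
# rather than going through as.data.frame.list (an assumption of this
# sketch; it skips the validity checks that the usual conversion makes).
as.data.frame.lightweight.data.frame <- function(x, ...) {
  n <- nrow(x)
  cols <- unclass(x)
  attr(cols, "row.names") <- if (n > 0L) c(NA_integer_, -n) else integer(0)
  class(cols) <- "data.frame"
  cols
}

# Usage: the equivalent of x[-1, ] from the timings above, on a small example.
x <- lightweight.data.frame(a = 1:5, b = c(2, 4, 6, 8, 10))
sub <- x[-1, ]
df <- as.data.frame(sub)
```

Functions that call as.data.frame on a classed data argument should dispatch to the method above, but whether every modelling function accepts such an object without further methods (for example names<-, [[<-, or $<-) is not something this sketch establishes.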