Hi,

Following the previous discussion on this list (http://tolstoy.newcastle.edu.au/R/devel/05/12/3439.html), I have created a package as suggested and uploaded it to CRAN incoming: data.table.tar.gz.
** Your comments and feedback will be very much appreciated. **

From help(data.table):

This class really does very little. The only reason for its existence is that the white book specifies that data.frame must have rownames. Most of the code is copied from base functions with the code manipulating row.names removed.

A data.table is identical to a data.frame other than:
  * it doesn't have rownames
  * [,drop] by default is FALSE, so selecting a single row will always return a single row data.table, not a vector
  * The comma is optional inside [], so DT[3] returns the 3rd row as a 1 row data.table
  * [] is like a call to subset()
  * [,...] is like a call to with() (not yet implemented)

Motivation:
  * up to 10 times less memory
  * up to 10 times faster to create, and copy
  * simpler R code
  * the white book defines rownames, so data.frame can't be changed ... => new class

Examples:

    nr = 1000000
    D = rep(1:5,nr/5)
    system.time(DF <<- data.frame(colA=D, colB=D))  # 2.08
    system.time(DT <<- data.table(colA=D, colB=D))  # 0.15 (over 10 times faster to create)
    identical(as.data.table(DF), DT)
    identical(dim(DT),dim(DF))
    object.size(DF)/object.size(DT)                 # 10 times less memory

    tt = subset(DF,colA>3)
    ss = DT[colA>3]
    identical(as.data.table(tt), ss)

    mean(subset(DF,colA+colB>5,"colB"))
    mean(DT[colA+colB>5]$colB)

    tt = with(subset(DF,colA>3),colA+colB)
    ss = with(DT[colA>3],colA+colB)                 # but could be: DT[colA>3,colA+colB] (not yet implemented)
    identical(tt, ss)

    tt = DF[with(DF,tapply(1:nrow(DF),colB,last)),] # select last row grouping by colB
    ss = DT[tapply(1:nrow(DT),colB,last)]           # but could be: DT[last,group=colB] (not yet implemented)
    identical(as.data.table(tt), ss)

    Lkp=1:3
    tt = DF[with(DF,colA %in% Lkp),]
    ss = DT[colA %in% Lkp]                          # expressions inside the [] can see objects in the calling frame
    identical(as.data.table(tt), ss)

In each case above there is either a space, time, or code brevity advantage with the data.table.
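To make the drop and subset() points concrete, here is a minimal sketch in base R only. It does not use the data.table package, and the one-column DF below is a hypothetical example made up for illustration. It shows the data.frame default that the class is meant to avoid, and the base calls whose results the new defaults are described as matching.

```r
# Base R only; DF is a hypothetical one-column data.frame for illustration.
DF <- data.frame(colA = c(1L, 4L, 5L))

# data.frame default: when only one column is left, the result is
# dropped to a plain vector rather than a one-row data.frame.
v <- DF[2, ]
is.data.frame(v)   # FALSE

# drop = FALSE keeps the one-row shape. This is the behaviour the post
# describes as the default for data.table.
r <- DF[2, , drop = FALSE]
is.data.frame(r)   # TRUE

# DT[colA > 3] is described as being like subset(); the base equivalent:
s <- subset(DF, colA > 3)
nrow(s)            # 2
```

With a data.frame, the explicit drop = FALSE has to be written every time a one-row result must keep its shape, which is the kind of extra code the "simpler R code" point refers to.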
The motivation for the new class grew from the realization that the performance of data.frames can be improved by removing the rownames. See the previous discussion: http://tolstoy.newcastle.edu.au/R/devel/05/12/3439.html

Regards,
Matthew