[R] [R-pkgs] data.table is on CRAN (enhanced data.frame for time series joins and more)

Matthew Dowle Tue, 31 Mar 2009 00:57:59 -0700

Dear all,

The data.table package was released back in August 2008. This email is to publicise its existence in response to several suggestions to do so. It seems I didn't send a general announcement about it at the time and therefore perhaps, not surprisingly, not many people know about it. Glancing at some r-help threads recently supports the idea of sending a public announcement.

The main difference between data.frame and data.table is enhanced functionality in [.data.table, where most documentation for this package lives, i.e. help("[.data.table"). Selected extracts from the package documentation follow.


The package builds on base R functionality to reduce 2 types of time :
  1. programming time (easier to write, read, debug and maintain)
  2. compute time

when combining database-like operations (subset, with and by), and provides joins similar to those of merge, but faster. This is achieved by using R's column based ordered in-memory data.frame, eval within the environment of a list (i.e. with), the [.data.table mechanism to condense the features, and compiled C to make certain operations fast.
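As a small illustration of "eval within the environment of a list", the sketch below uses only base R. The list and its element names (cols, a, b) are made up for the example, and it is not a description of how data.table is implemented internally.

```r
# Illustrative names only: a list standing in for the columns of a table.
cols = list(a = 1:3, b = 4:6)

# Evaluate an unevaluated expression using the list as the environment,
# so that 'a' and 'b' resolve to the list elements.
r1 = eval(quote(a + b), cols)

# with() does the same thing with less typing.
r2 = with(cols, a + b)

identical(r1, r2)
```

Both calls look up a and b in the list rather than in the calling environment, which is what lets column names be written directly.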

[.data.table is like [.data.frame but i and j can be expressions of column names directly. Furthermore i may itself be a data.table, which invokes a fast table join using binary search in O(log n) time. Allowing i to be a data.table is consistent with subsetting an n-dimension array by an n-column matrix in base R. data.tables do not have rownames but may instead have a key of one or more columns, set using setkey. This key may be used for row indexing instead of rownames.
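For readers who have not met the base R behaviour referred to above, here is a minimal sketch of subsetting a 3-dimension array by a 3-column matrix. The object names (A, idx) are illustrative only.

```r
# A 2 x 3 x 4 array holding 1..24 in column-major order.
A = array(1:24, dim = c(2, 3, 4))

# Each row of idx is one (i, j, k) coordinate into A.
idx = rbind(c(1, 2, 3),
            c(2, 3, 4))

# One element is returned per row of idx.
picked = A[idx]
identical(picked, c(A[1, 2, 3], A[2, 3, 4]))
```

Each row of the matrix selects one cell, in the same way that each row of a data.table passed as i selects the matching key values.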


Examples comparing [.data.frame and [.data.table :

DF = data.frame(a=1:5, b=6:10)
DT = data.table(a=1:5, b=6:10)

tt = subset(DF,a==3)

ss = DT[a==3]   # just use the column name 'a' directly. No need to
                # remember the comma. The i argument is like the 'where'
                # in SQL.

identical(as.data.table(tt), ss)

tt = with(subset(DF,a==3),a+b+1)

ss = DT[a==3,a+b+1]   # j is like select in SQL and the select argument
                      # of subset in base R. j can be an expression of
                      # column names directly, including a data.table of
                      # multiple expressions. Here the j expression is
                      # executed just for the rows matching the i argument.

identical(tt, ss)

# Examples above use vector scans i.e. the "a==3" expression is evaluated
# for every row, creating a logical vector as long as the total number of
# rows.

# Examples below use binary search, invoked by passing in a data.table as
# the i argument. Joins in SQL are performed in the where clause and the i
# argument is like where, so this seems very natural (to me anyway!)
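To make the difference between a vector scan and a binary search concrete, here is a rough base R sketch. The bsearch function is a hypothetical helper written for this illustration; it is not part of data.table and says nothing about how the package's compiled C code is written.

```r
# Hypothetical helper: binary search for 'target' in a sorted vector 'keys'.
# Returns the position of a match, or NA if there is none.
bsearch = function(keys, target) {
  lo = 1L
  hi = length(keys)
  while (lo <= hi) {
    mid = (lo + hi) %/% 2L
    if (keys[mid] == target) return(mid)
    if (keys[mid] < target) lo = mid + 1L else hi = mid - 1L
  }
  NA_integer_
}

keys = c(2L, 3L, 5L, 8L, 13L, 21L)   # must already be sorted

scan_hit   = which(keys == 13L)      # vector scan: compares every element
search_hit = bsearch(keys, 13L)      # binary search: about log2(n) comparisons
search_miss = bsearch(keys, 4L)      # no match
```

The scan does work proportional to the number of rows, whereas the search halves the remaining range at each step, which is why it needs the data to be sorted by the key first.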


DT = data.table(a=letters[1:5], b=6:10)
setkey(DT,a)
identical(DT[J("d")], DT[4])        # binary search to row for 'd'

DT = data.table(id=rep(c("A","B"),each=3),date=c(20080501L,20080502L,20080506L), v=1:6)

setkey(DT,id,date)

DT["A"]                 # all 3 rows for A since mult by default is "all"
DT[J("A",20080502L)]    # row for A where date also matches exactly
DT[J("A",20080505L)]    # NA since 5 May is missing (outer join by default)

DT[J("A",20080505L),nomatch=0]             # inner join instead
dts = c(20080501L, 20080502L, 20080505L, 20080506L, 20080507L, 20080508L)

DT[J("A",dts)]                    # 3 of the dates in dts match exactly
DT[J("A",dts),roll=TRUE]          # roll previous data forward i.e. return
                                  # the prevailing observation
DT[J("A",dts),rolltolast=TRUE]    # roll all but last observation forward
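The three lookups above can be mimicked for a single id in base R, which may help in seeing what "prevailing observation" means. This is only a sketch under the assumption that the comments above fully describe the semantics; it uses match and findInterval rather than anything from data.table, and the names dates, v, idx are illustrative.

```r
dates = c(20080501L, 20080502L, 20080506L)   # sorted dates for one id
v     = 1:3                                  # values observed on those dates
dts   = c(20080501L, 20080502L, 20080505L, 20080506L, 20080507L, 20080508L)

# Exact match only: NA where the date is not present.
exact = v[match(dts, dates)]

# Roll forward: use the last observation on or before each date.
idx = findInterval(dts, dates)
idx[idx == 0L] = NA            # nothing to roll from before the first date
rolled = v[idx]

# As above, but do not carry the final observation past the last date.
idx2 = idx
idx2[dts > max(dates)] = NA
rolled_to_last = v[idx2]
```

Here 5 May picks up the value from 2 May in both rolled versions, while 7 and 8 May pick up the 6 May value only in the first.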


tables(mb=TRUE)   # prints table names, number of rows, size in memory

Thanks to all those who have made suggestions and feedback so far. Further comments and feedback on the package would be much appreciated.


Regards, Matthew

_______________________________________________
R-packages mailing list
r-packa...@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
