Stavros Macrakis <macrakis <at> alum.mit.edu> writes:
>
> data.table certainly has some useful mechanisms, and I've been
> experimenting with it as an implementation mechanism, though it's not a
> drop-in substitute for factors. Also, though it is efficient for set
> operations between small sets and large sets, it is not very efficient for
> operations between two large sets

As a general statement that could do with some clarification ;) data.table
likes keys consisting of multiple ordered columns, e.g. (id,date). It is (I
believe) efficient for joining two large 2+ column keyed data sets because
the upper bound of each row's one-sided binary search is localised in that
case (by group of the previous key column).

As I understand it, Stavros has a different type of 'two large datasets':
English language website data. Each set is one large vector of uniformly
distributed unique strings. That appears to be quite a different problem to
multiple columns of many times duplicated data.

Matthew

> Thanks everyone, and if you do come across a relevant CRAN package, I'd be
> very interested in hearing about it.
>
>            -s
>
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
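
To make the point about the localised binary search concrete, here is a rough sketch of the idea. It is written in Python purely for illustration and is not data.table's actual code; the function name keyed_join and the toy (id, date) data are made up for the example. It assumes both inputs are already sorted by (id, date), as a two-column key would be.

```python
from bisect import bisect_left, bisect_right


def keyed_join(x_rows, y_id, y_date):
    """Match sorted (id, date) rows of x against a table y sorted by (id, date).

    y is held as two parallel columns, y_id and y_date.
    Returns (i, j) pairs where x_rows[i] equals row j of y.
    Illustrative only: a sketch of the idea, not data.table internals.
    """
    matches = []
    lo = 0  # one-sided: x is sorted, so never search before the previous group
    for i, (xid, xdate) in enumerate(x_rows):
        # Find the block of y rows that share this id (the previous key column).
        g_lo = bisect_left(y_id, xid, lo)
        g_hi = bisect_right(y_id, xid, g_lo)
        # Search the second key column only inside that block, so the
        # upper bound of the search is the end of the id group, not of y.
        j = bisect_left(y_date, xdate, g_lo, g_hi)
        if j < g_hi and y_date[j] == xdate:
            matches.append((i, j))
        lo = g_lo
    return matches


# Toy data (hypothetical): y has many rows per id, which is the
# "many times duplicated" case the localisation benefits from.
y_id = [1, 1, 1, 2, 2, 3]
y_date = ["2010-01-01", "2010-01-02", "2010-01-03",
          "2010-01-01", "2010-01-03", "2010-01-02"]
x_rows = [(1, "2010-01-02"), (2, "2010-01-02"), (3, "2010-01-02")]

matches = keyed_join(x_rows, y_id, y_date)
print(matches)  # [(0, 1), (2, 5)]
```

With a single column of unique strings there is no earlier key column to narrow the range, so the search for each string still has to cover everything after the previous position in the other set.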