Matthew,

Yes, the case I am thinking of is a 1-column key; sorry for the overgeneralization. I haven't thought much about the multi-column key case.
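To make that concrete, here is a rough sketch of the kind of 1-column operation I mean; the data below are just made-up stand-ins for the real strings, and the sizes are arbitrary:

library(data.table)

## Two large vectors of (mostly unique) random strings -- invented stand-ins
## for the real website data.
rand_words <- function(n, len = 8)
  apply(matrix(sample(letters, n * len, replace = TRUE), nrow = n),
        1, paste, collapse = "")

a <- data.table(word = rand_words(1e5))
b <- data.table(word = rand_words(1e5))

setkey(a, word)              # single-column character key
setkey(b, word)

## Keyed join: each row of b is located by one binary search over a's sorted key.
common <- a[b, nomatch = 0]

## The base-R set operation this stands in for:
common2 <- intersect(a$word, b$word)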
-s

On Mon, Nov 7, 2011 at 12:48, Matthew Dowle <mdo...@mdowle.plus.com> wrote:

> Stavros Macrakis <macrakis <at> alum.mit.edu> writes:
> >
> > data.table certainly has some useful mechanisms, and I've been
> > experimenting with it as an implementation mechanism, though it's not a
> > drop-in substitute for factors. Also, though it is efficient for set
> > operations between small sets and large sets, it is not very efficient for
> > operations between two large sets
>
> As a general statement that could do with some clarification ;) data.table
> likes keys consisting of multiple ordered columns, e.g. (id,date). It is (I
> believe) efficient for joining two large 2+ column keyed data sets because
> the upper bound of each row's one-sided binary search is localised in that
> case (by group of the previous key column).
>
> As I understand it, Stavros has a different type of 'two large datasets':
> English language website data. Each set is one large vector of uniformly
> distributed unique strings. That appears to be quite a different problem to
> multiple columns of many times duplicated data.
>
> Matthew
>
> > Thanks everyone, and if you do come across a relevant CRAN package, I'd be
> > very interested in hearing about it.
> >
> > -s
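P.S. For concreteness, a rough sketch of the two-column keyed join described above; the toy data and sizes are invented purely for illustration:

library(data.table)

## Invented (id,date) data: each id has a handful of dates, values are random.
dt1 <- data.table(id   = rep(1:10000, each = 5),
                  date = rep(as.Date("2011-01-01") + 0:4, times = 10000),
                  x    = rnorm(50000))
dt2 <- data.table(id   = sample(1:10000, 2000),
                  date = as.Date("2011-01-03"),
                  y    = runif(2000))

setkey(dt1, id, date)   # 2-column ordered key
setkey(dt2, id, date)

## Keyed join: for each row of dt2, the binary search over dt1 is confined to
## that id's block of rows (the localisation mentioned above).
res <- dt1[dt2]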