On Fri, Feb 15, 2013 at 3:35 PM, Hadley Wickham <h.wick...@gmail.com> wrote:
> > There are R objects representing survey data sets, with the data stored in
> > a database table. The subset() method, when applied to these objects,
> > creates a new table indicating which rows of the data table are in the
> > subset -- we don't modify the original table, because that breaks the
> > call-by-value semantics. When the subset object in R goes out of scope, we
> > need to delete the extra database table.
>
> Isn't subset slightly too early to do this? It would be slightly more
> efficient for subset to return an object that creates the table when
> you first attempt to modify it.

The subset table isn't a copy of the subset; it contains the unique key and
an indicator column showing whether the element is in the subset. I need
this even if the subset is never modified, so that I can join it to the main
table and use it in SQL 'where' conditions to get computations for the right
subset of the data.

The whole point of this new sqlsurvey package is that most of the
aggregation operations happen in the database rather than in R, which is
faster for very large data tables. The use case is things like the American
Community Survey and the Nationwide Emergency Department Subsample, with
millions or tens of millions of records and quite a lot of variables. At
this scale, loading stuff into memory isn't feasible on commodity desktops
and laptops, and even on computers with enough memory, the database
(MonetDB) is faster.

   -thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
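
To make the key-plus-indicator idea concrete, here is a minimal sketch. It is not sqlsurvey's actual code: it uses Python's built-in sqlite3 module as a stand-in for MonetDB, and the table and column names (survey, subset_1, id, in_subset, age, income) are invented for illustration. It only shows the shape of the technique described above: the main table is left untouched, the subset lives in a separate table, and the aggregate is computed in the database through a join.

```python
import sqlite3

# In-memory database standing in for MonetDB (assumption for illustration).
con = sqlite3.connect(":memory:")

# Main data table: subsetting never modifies it.
con.execute(
    "CREATE TABLE survey (id INTEGER PRIMARY KEY, age INTEGER, income REAL)"
)
con.executemany(
    "INSERT INTO survey VALUES (?, ?, ?)",
    [(1, 25, 30000.0), (2, 40, 52000.0), (3, 67, 41000.0), (4, 33, 48000.0)],
)

# Subset table: one row per row of survey, holding only the unique key
# and a 0/1 indicator of subset membership (here: age >= 30).
con.execute(
    "CREATE TABLE subset_1 AS SELECT id, (age >= 30) AS in_subset FROM survey"
)

# The aggregation happens in the database: join on the key and filter
# on the indicator in the WHERE clause.
mean_income = con.execute(
    "SELECT AVG(s.income) FROM survey AS s "
    "JOIN subset_1 AS k ON s.id = k.id "
    "WHERE k.in_subset = 1"
).fetchone()[0]

# When the subset object is no longer needed, the extra table is dropped.
con.execute("DROP TABLE subset_1")
remaining = [
    row[0]
    for row in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
```

Because the subset table holds only the key and the indicator, it is needed for every computation on the subset, not just for modifications, which is why creating it lazily on first modification would not help.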