On Fri, Feb 15, 2013 at 3:35 PM, Hadley Wickham <h.wick...@gmail.com> wrote:

> > There are R objects representing survey data sets, with the data stored in
> > a database table.  The subset() method, when applied to these objects,
> > creates a new table indicating which rows of the data table are in the
> > subset -- we don't modify the original table, because that breaks the
> > call-by-value semantics. When the subset object in R goes out of scope, we
> > need to delete the extra database table.
>
> Isn't subset slightly too early to do this?  It would be slightly more
> efficient for subset to return an object that creates the table when
> you first attempt to modify it.
>

The subset table isn't a copy of the subset; it contains the unique key and
an indicator column showing whether each row is in the subset.  I need
this even if the subset is never modified, so that I can join it to the
main table and use it in SQL 'where' conditions to get computations for the
right subset of the data.
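
As a rough sketch of that idea (this is not the sqlsurvey code: SQLite through
Python's sqlite3 module stands in for the real database, and the table names,
column names, and data are made up for illustration), a key-plus-indicator
table can be joined to the main table so the aggregate runs in the database:

```python
import sqlite3

# SQLite stands in for the real database; all names and rows here are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE survey (id INTEGER PRIMARY KEY, age INTEGER, income REAL)")
con.executemany(
    "INSERT INTO survey VALUES (?, ?, ?)",
    [(1, 25, 30000.0), (2, 40, 52000.0), (3, 67, 18000.0), (4, 35, 61000.0)],
)

# The subset table holds only the unique key and a 0/1 indicator,
# not a copy of the data, and the main table is left unmodified.
con.execute(
    "CREATE TABLE subset_1 AS "
    "SELECT id, CASE WHEN age < 65 THEN 1 ELSE 0 END AS in_subset FROM survey"
)


def subset_mean_income(con):
    """Mean income over the subset, computed in the database via a join."""
    row = con.execute(
        "SELECT AVG(s.income) FROM survey AS s "
        "JOIN subset_1 AS k ON s.id = k.id "
        "WHERE k.in_subset = 1"
    ).fetchone()
    return row[0]


mean_income = subset_mean_income(con)
n_original = con.execute("SELECT COUNT(*) FROM survey").fetchone()[0]
```

Only the single aggregate value comes back to the client; the rows themselves
never leave the database.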

The whole point of this new sqlsurvey package is that most of the
aggregation operations happen in the database rather than in R, which is
faster for very large data tables.  The use cases are things like the
American Community Survey and the Nationwide Emergency Department
Sample, with millions or tens of millions of records and quite a lot of
variables.  At this scale, loading the data into memory isn't feasible on
commodity desktops and laptops, and even on computers with enough memory,
the database (MonetDB) is faster.

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
