"Sean" == Sean Davis <[EMAIL PROTECTED]> writes:

        Sean> Have you tried just grabbing the whole column using dbGetQuery?
        Sean> Try doing this:
        Sean> 
        Sean> spams <- dbGetQuery(con,"select unixtime from email limit
        Sean> 1000000")
        Sean> 
        Sean> Then increase from 1,000,000 to 1.5 million, to 2 million, etc.
        Sean> until you break something (run out of memory), if you do at all.

Yes, you are right.  For the example problem that I posed, R can indeed process
the entire query result in memory.  (The R process grows to 240MB, though!)

        Sean> However, the BETTER way to do this, if you already have the data
        Sean> in the database is to allow the database to do the histogram for
        Sean> you.  For example, to get a count of spams by day, in MySQL do
        Sean> something like: [...]

Yes, again you are right --- the particular problem that I posed is probably
better handled by formulating a more sophisticated SQL query.
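
For the specific spams-per-day question, a query along these lines would push
the aggregation into the server.  (This is just my own sketch, not Sean's
omitted code; it uses the table and column names from my example, and
FROM_UNIXTIME()/DATE() are MySQL-specific.)

    ## Sketch: let MySQL aggregate per-day counts.  `con', `email', and
    ## `unixtime' are the names from the example above.
    daily <- dbGetQuery(con,
        "SELECT DATE(FROM_UNIXTIME(unixtime)) AS day, COUNT(*) AS n
           FROM email
          GROUP BY day
          ORDER BY day")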

But really, my goal isn't to solve the example problem that I posed ---
rather, it is to understand how people use R to process very large data sets.
The research project that I'm working on will eventually need to deal with
query results that cannot fit in main memory, and for which the built-in
statistical facilities of most DBMSs will be insufficient.

Some of my colleagues have previously written their analyses "by hand," using
various scripting languages to read and process records from a DB in chunks.
Writing things in this way, however, can be tedious and error-prone.  Instead
of taking this approach, I would like to use existing statistics packages that
can handle large datasets gracefully.

So, I seek to understand the ways that people deal with these sorts of
situations in R.  Your advice is very helpful --- one should solve problems in
the simplest ways available! --- but I would still like to understand the
harder cases, and how one can use "general" R functions in combination with
DBI's `dbApply' and `fetch' interfaces, which divide results into chunks.
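
To make the kind of thing I have in mind concrete, here is a minimal sketch
(my own, under the assumptions of the earlier example: the connection `con',
an `email' table, and its `unixtime' column) of fetching a result set in
chunks with `dbSendQuery'/`fetch' and folding each chunk into a running
per-day count:

    library(DBI)

    res    <- dbSendQuery(con, "SELECT unixtime FROM email")
    counts <- integer(0)                  # named vector: day -> spam count

    repeat {
        chunk <- fetch(res, n = 100000)   # next 100,000 rows (or fewer)
        if (nrow(chunk) == 0) break
        days <- as.Date(as.POSIXct(chunk$unixtime, origin = "1970-01-01"))
        tab  <- table(format(days))       # per-day counts within this chunk
        for (d in names(tab))             # fold into the running totals
            counts[d] <- sum(counts[d], tab[[d]], na.rm = TRUE)
    }
    dbClearResult(res)

The per-chunk reduction here is just a table of counts, but in principle any
statistic that can be updated incrementally could be maintained in the same
way.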

Thanks!

Eric.

-- 
-------------------------------------------------------------------------------
Eric Eide <[EMAIL PROTECTED]>  .         University of Utah School of Computing
http://www.cs.utah.edu/~eeide/ . +1 (801) 585-5512 voice, +1 (801) 581-5843 FAX

