Re: [R] Processing large datasets

Roman Naumenko Wed, 25 May 2011 07:32:20 -0700

Thanks Jonathan. 

I'm already using RMySQL to load data for a couple of days. 
I wanted to know what the relevant R capabilities are if I want to process much 
bigger tables.


R always reads the whole set into memory and this might be a limitation in case 
of big tables, correct? 
Doesn't it use temporary files or something similar to deal with such amounts of 
data? 

As an example I know that SAS handles sas7bdat files up to 1TB on a box with 
76GB memory, without noticeable issues. 

--Roman 

----- Original Message -----

> In cases where I have to parse through large datasets that will not
> fit into R's memory, I will grab relevant data using SQL and then
> analyze said data using R. There are several packages designed to do
> this, like [1] and [2] below, that allow you to query a database
> using SQL and end up with that data in an R data.frame.

> [1] http://cran.cnr.berkeley.edu/web/packages/RMySQL/index.html
> [2] http://cran.cnr.berkeley.edu/web/packages/RSQLite/index.html
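
The workflow described above (filter in SQL, then analyze the result as a data.frame) might look roughly like the sketch below. It uses RSQLite with an in-memory database so that it runs without a server. The table name `quotes`, its columns, and the filter are made up for illustration; with RMySQL, only the dbConnect() call should need to change.

```r
library(DBI)

# In-memory SQLite database standing in for the real quote store (assumption).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table layout, loosely based on the sample rows in this thread.
quotes <- data.frame(
  symbol = c("DELL", "DELL", "DELL", "MSFT"),
  side   = c("Bid", "Bid", "Ask", "Bid"),
  size   = c(400L, 300L, 135L, 200L),
  price  = c(15.48, 15.48, 15.49, 24.10),
  ms     = c(35482391L, 35482391L, 35490000L, 35482500L),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "quotes", quotes)

# Let the database do the filtering; only matching rows reach R.
dell_bids <- dbGetQuery(
  con,
  "SELECT symbol, size, price, ms
     FROM quotes
    WHERE symbol = 'DELL' AND side = 'Bid'"
)

dbDisconnect(con)

str(dell_bids)  # an ordinary data.frame, ready for further analysis
```

Only the rows that match the WHERE clause are transferred, so R's memory only has to hold the filtered subset rather than the full table.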

> On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko
> <ro...@bestroman.com> wrote:
> > Hi R list,
> >
> > I'm new to R software, so I'd like to ask about its capabilities.
> > What I'm looking to do is to run some statistical tests on quite
> > big tables which are aggregated quotes from a market feed.
> >
> > This is a typical set of data.
> > Each day contains millions of records (up to 10 million unfiltered).
> >
> > 2011-05-24 750 Bid DELL 14130770 400 15.4800 BATS 35482391 Y 1 1 0 0
> > 2011-05-24 904 Bid DELL 14130772 300 15.4800 BATS 35482391 Y 1 0 0 0
> > 2011-05-24 904 Bid DELL 14130773 135 15.4800 BATS 35482391 Y 1 0 0 0
> >
> > I'll need to filter it first based on some criteria.
> > Since I keep it in a MySQL database, this can be done with a query.
> > It's not super efficient; I've checked that already.
> >
> > Then I need to aggregate the dataset into different time frames
> > (time is represented in ms from midnight, like 35482391).
> > Again, this can be done through a database query; I'm not sure
> > which is going to be faster.
> > The aggregated tables are going to be much smaller, on the order
> > of thousands of rows per observation day.
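
On the R side, the time-frame aggregation described above could be sketched as follows, using base R only. The column names, the values, and the one-minute window are assumptions for illustration; the timestamps are milliseconds from midnight, as in the sample rows.

```r
# Hypothetical quotes; ms is milliseconds since midnight.
quotes <- data.frame(
  ms    = c(35482391, 35482391, 35490000, 35545000, 35601000),
  size  = c(400, 300, 135, 250, 100),
  price = c(15.48, 15.48, 15.49, 15.50, 15.47)
)

window_ms <- 60 * 1000  # one-minute time frame (assumed)

# Integer division maps each quote to the start of its window.
quotes$window_start <- (quotes$ms %/% window_ms) * window_ms

# One row per window: total size and mean price.
total_size <- aggregate(size ~ window_start, data = quotes, FUN = sum)
mean_price <- aggregate(price ~ window_start, data = quotes, FUN = mean)
per_window <- merge(total_size, mean_price, by = "window_start")

print(per_window)
```

Changing `window_ms` gives other time frames. The equivalent in SQL would be a GROUP BY on the same integer-division expression, so either side can do this step.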
> >
> > Then calculate basic statistics: mean, standard deviation, sums, etc.
> > After the stats are calculated, I need to perform some statistical
> > hypothesis tests.
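
The statistics step needs nothing beyond base R. A minimal sketch on made-up aggregated values follows; the Welch two-sample t-test is only a stand-in, since the thread does not say which hypothesis tests are intended.

```r
# Made-up aggregated values for two observation days (illustration only).
day1 <- c(15.48, 15.49, 15.50, 15.47, 15.51, 15.49)
day2 <- c(15.52, 15.55, 15.53, 15.56, 15.54, 15.53)

# Basic descriptive statistics for one day.
stats <- c(mean = mean(day1), sd = sd(day1), sum = sum(day1))
print(stats)

# Welch two-sample t-test: do the two days share the same mean?
result <- t.test(day1, day2)
print(result$p.value)
```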
> >
> > So, my question is: which tool is faster for data aggregation and
> > filtering on big datasets: MySQL or R?
> >
> > Thanks,
> > --Roman N.
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >

> --
> ===============================================
> Jon Daily
> Technician
> ===============================================
> #!/usr/bin/env outside
> # It's great, trust me.

