[R] Processing large datasets

2011-05-25 Thread Roman Naumenko
Hi R list,

I'm new to R, so I'd like to ask about its capabilities.
What I'm looking to do is run some statistical tests on quite big
tables of aggregated quotes from a market feed.

This is a typical set of data.
Each day contains millions of records (up to 10 million unfiltered).

2011-05-24  750  Bid  DELL  14130770  400  15.4800  BATS  35482391  Y  1  1  0  0
2011-05-24  904  Bid  DELL  14130772  300  15.4800  BATS  35482391  Y  1  0  0  0
2011-05-24  904  Bid  DELL  14130773  135  15.4800  BATS  35482391  Y  1  0  0  0

I'll need to filter the data first based on some criteria.
Since I keep it in a MySQL database, that can be done with a query --
not super efficient, though; I've checked already.

Then I need to aggregate the dataset into different time frames (time is
represented in ms from midnight, e.g. 35482391).
Again, this can be done with a database query; I'm not sure which will be faster.
The aggregated tables are going to be much smaller, on the order of
thousands of rows per observation day.

Then I'll calculate basic statistics: mean, standard deviation, sums, etc.
After the stats are calculated, I need to perform some statistical
hypothesis tests.

So, my question is: which tool is faster for aggregation and filtering
on big datasets, MySQL or R?
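
For concreteness, the time-frame aggregation might look roughly like this in R
once one day's records are in a data frame (the column names time_ms, price and
size are made up here, since the sample rows above are unlabeled):

## bucket the ms-after-midnight time stamp into 1-minute intervals
quotes$bucket <- quotes$time_ms %/% 60000

## per-bucket count, mean, standard deviation and summed size
stats <- do.call(rbind, lapply(split(quotes, quotes$bucket), function(b)
    data.frame(bucket = b$bucket[1],
               n      = nrow(b),
               mean   = mean(b$price),
               sd     = sd(b$price),
               volume = sum(b$size))))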

Thanks,
--Roman N.



Re: [R] Processing large datasets

2011-05-25 Thread Jonathan Daily
In cases where I have to parse through large datasets that will not
fit into R's memory, I grab the relevant data using SQL and then
analyze it in R. There are several packages designed to do this, such
as [1] and [2] below, which allow you to query a database using SQL
and end up with that data in an R data.frame.

[1] http://cran.cnr.berkeley.edu/web/packages/RMySQL/index.html
[2] http://cran.cnr.berkeley.edu/web/packages/RSQLite/index.html
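
A minimal sketch of that workflow with RMySQL (the database, table and column
names are placeholders, not something from the original post):

library(RMySQL)   # loads DBI as well

con <- dbConnect(MySQL(), dbname = "ticks", user = "...", password = "...")

## let MySQL do the filtering; only the subset comes back as a data.frame
quotes <- dbGetQuery(con,
    "SELECT symbol, time_ms, price, size
       FROM quotes_20110524
      WHERE symbol = 'DELL' AND side = 'Bid'")

dbDisconnect(con)
str(quotes)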

On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko ro...@bestroman.com wrote:
 [snip]




-- 
===
Jon Daily
Technician
===
#!/usr/bin/env outside
# It's great, trust me.



Re: [R] Processing large datasets

2011-05-25 Thread Steve Lianoglou
Hi,

On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko ro...@bestroman.com wrote:
 [snip]

Why not try a few experiments and see for yourself? I guess the
answer will depend on what exactly you are doing.

If your datasets are *really* huge, check out some packages listed
under the "Large memory and out-of-memory data" section of the
"HighPerformanceComputing" task view at CRAN:

http://cran.r-project.org/web/views/HighPerformanceComputing.html

Also, if you find yourself needing to do lots of grouping/summarizing
type calculations over large data.frame-like objects, you might want
to check out the data.table package:

http://cran.r-project.org/web/packages/data.table/index.html
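
For example (an untested sketch; the column names are invented), grouping one
day's quotes into 1-minute buckets with data.table might look like:

library(data.table)

DT <- as.data.table(quotes)             # 'quotes' as read from the DB
DT[, bucket := time_ms %/% 60000]       # ms after midnight -> 1-minute bins
DT[, list(n = .N, mean.px = mean(price), sd.px = sd(price), vol = sum(size)),
   by = list(symbol, bucket)]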

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] Processing large datasets

2011-05-25 Thread Roman Naumenko
Thanks Jonathan. 

I'm already using RMySQL to load a couple of days of data at a time.
I wanted to know what the relevant R capabilities are if I want to process much
bigger tables.

R always reads the whole dataset into memory, and this might be a limitation in
the case of big tables, correct?
Doesn't it use temporary files or something similar to deal with such amounts
of data?

As an example, I know that SAS handles sas7bdat files of up to 1TB on a box with
76GB of memory without noticeable issues.

--Roman 

- Original Message -

 [snip]



Re: [R] Processing large datasets

2011-05-25 Thread Roman Naumenko
 Hi,

 On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko
 ro...@bestroman.com wrote:
  [snip]

 Why not try a few experiments and see for yourself -- I guess the
 answer will depend on what exactly you are doing.

 If your datasets are *really* huge, check out some packages listed
 under the Large memory and out-of-memory data section of the
 HighPerformanceComputing task view at CRAN:

 http://cran.r-project.org/web/views/HighPerformanceComputing.html

 Also, if you find yourself needing to do lots of
 grouping/summarizing type of calculations over large data
 frame-like objects, you might want to check out the data.table package:

 http://cran.r-project.org/web/packages/data.table/index.html

 --
 Steve Lianoglou
 Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
 Contact Info: http://cbio.mskcc.org/~lianos/contact

I don't think data.table is fundamentally different from the data.frame type,
but thanks for the suggestion.

http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf
"Just like data.frames, data.tables must fit inside RAM."

The ff package by Adler, listed under "Large memory and out-of-memory data", is
probably the most interesting.

--Roman Naumenko



Re: [R] Processing large datasets

2011-05-25 Thread Marc Schwartz

Take a look at the "High-Performance and Parallel Computing with R" CRAN Task
View:

  http://cran.us.r-project.org/web/views/HighPerformanceComputing.html

specifically at the section labeled "Large memory and out-of-memory data".

There are some specific R features that have been implemented in a fashion that
enables out-of-memory operations, but not all.

I believe that Revolution's commercial version of R has developed 'big data'
functionality, but I would defer to them for additional details.

You can of course use a 64-bit version of R on a 64-bit OS to increase
accessible RAM; however, there will still be object size limitations,
predicated upon the fact that R uses 32-bit signed integers for indexing into
objects. See ?Memory-limits for more information.
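
For reference, that indexing limit is visible from within R:

.Machine$integer.max   # 2147483647 = 2^31 - 1, hence the cap on elements per object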

HTH,

Marc Schwartz


On May 25, 2011, at 8:49 AM, Roman Naumenko wrote:

 [snip]



Re: [R] Processing large datasets

2011-05-25 Thread Steve Lianoglou
Hi,

On Wed, May 25, 2011 at 10:18 AM, Roman Naumenko ro...@bestroman.com wrote:
[snip]
 I don't think data.table is fundamentally different from data.frame type, but 
 thanks for the suggestion.

 http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf
 Just like data.frames, data.tables must fit inside RAM

Yeah, I know -- I only mentioned it in the context of manipulating
data.frame-like objects -- sorry if I wasn't clear.

If you've got data.frame-like data that you can store in RAM, AND you
find yourself wanting to do some summary calculations over different
subgroups of it, you might find that data.table is a quicker way to
get that done -- the larger your data.frame/table, the more
noticeable the speedup.

To give you an idea of which scenarios I'm talking about: other
packages you'd use to do the same thing would be plyr and sqldf.

For out-of-memory datasets, you're in a different realm -- hence the
HPC Task View link.

 The ff package by Adler, listed in Large memory and out-of-memory data is 
 probably most interesting.

Cool.

I've had some luck using the bigmemory package (and friends) in the
past as well.
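
For what it's worth, here is a bare-bones bigmemory sketch (a big.matrix holds a
single numeric type, so this assumes a numeric-only extract of the quotes; the
file name and layout are hypothetical):

library(bigmemory)
library(biganalytics)

## file-backed matrix: the data stay on disk rather than in RAM
x <- read.big.matrix("quotes_numeric.csv", header = TRUE, type = "double",
                     backingfile = "quotes.bin", descriptorfile = "quotes.desc")

colmean(x)   # column means computed without pulling the whole matrix in
colsd(x)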

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] Processing large datasets / non-answer, but a question on writing a data frame derivative

2011-05-25 Thread Mike Marchywka
 Date: Wed, 25 May 2011 09:49:00 -0400
 From: ro...@bestroman.com
 To: biomathjda...@gmail.com
 CC: r-help@r-project.org
 Subject: Re: [R] Processing large datasets

 Thanks Jonathan.

 I'm already using RMySQL to load data for couple of days.
 I wanted to know what are the relevant R capabilities if I want to process 
 much bigger tables.

 R always reads the whole set into memory and this might be a limitation in 
 case of big tables, correct?

OK, now I ask: perhaps for my first R effort I will try to find the source code
for data.frame and make a paging or streaming derivative. That is, at least for
fixed-size things, it could supply things like the total number of rows but
have facilities for paging data in and out of memory. Presumably all users of a
data frame have to work through a limited interface, which I guess could be
expanded with various hints such as "prefetch this", for example. I haven't
looked at this idea in a while, but the issue keeps coming up -- one for the
dev list, maybe?

Anyway, for your immediate needs with a few statistics, you could probably
write a simple C++ program that ultimately becomes part of an R package. It is
a good idea to see what is available, but these questions come up here a lot,
and the usual suggestion is a DB, which is exactly the opposite of what you
want if you have predictable access patterns (although even there prefetch
could probably be implemented).
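
A rough sketch of that streaming idea in plain R, before reaching for C++; it
assumes the day file is whitespace-delimited like the sample records above,
with the price in the 7th field:

con <- file("quotes_20110524.txt", open = "r")
n <- 0; s <- 0; s2 <- 0
repeat {
    lines <- readLines(con, n = 100000)      # one chunk at a time
    if (length(lines) == 0) break
    tc <- textConnection(lines)
    chunk <- read.table(tc)
    close(tc)
    px <- chunk[[7]]                         # price column in the sample layout
    n  <- n  + length(px)
    s  <- s  + sum(px)
    s2 <- s2 + sum(px * px)
}
close(con)
c(mean = s / n, sd = sqrt((s2 - s * s / n) / (n - 1)))   # one-pass mean and sd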






 [snip]
  


Re: [R] Processing large datasets

2011-05-25 Thread Mike Marchywka
 Date: Wed, 25 May 2011 10:18:48 -0400
 From: ro...@bestroman.com
 To: mailinglist.honey...@gmail.com
 CC: r-help@r-project.org
 Subject: Re: [R] Processing large datasets

  Hi,
  If your datasets are *really* huge, check out some packages listed
  under the Large memory and out-of-memory data section of the
  HighPerformanceComputing task view at CRAN:

  http://cran.r-project.org/web/views/HighPerformanceComputing.html

Does this have any specific limitations? It sounds, offhand, like it does
paging and all the needed buffering for arbitrary-size data. Does it work with
everything? I seem to recall bigmemory came up before in this context and there
was some problem.

Thanks.

  [snip]
  


Re: [R] Processing large datasets

2011-05-25 Thread Hugo Mildenberger
With PostgreSQL at least, R can also be used as the implementation
language for stored procedures, so data transfers between
processes can be avoided altogether.

   http://www.joeconway.com/plr/

Implementation of such a procedure in R appears to be straightforward:
 
   CREATE OR REPLACE FUNCTION overpaid (emp) RETURNS bool AS '
       if (20 < arg1$salary) {
           return(TRUE)
       }
       if (arg1$age < 30 && 10 < arg1$salary) {
           return(TRUE)
       }
       return(FALSE)
   ' LANGUAGE 'plr';

   CREATE TABLE emp (name text, age int, salary numeric(10,2));
   INSERT INTO emp VALUES ('Joe', 41, 25.00);
   INSERT INTO emp VALUES ('Jim', 25, 12.00);
   INSERT INTO emp VALUES ('Jon', 35, 5.00);

   SELECT name, overpaid(emp) FROM emp;
    name | overpaid
   ------+----------
    Joe  | t
    Jim  | t
    Jon  | f
   (3 rows)


Best 



On Wednesday 25 May 2011 14:12:23 Jonathan Daily wrote:
 [snip]




Re: [R] Processing large datasets

2011-05-25 Thread Steve Lianoglou
Hi,

On Wed, May 25, 2011 at 11:00 AM, Mike Marchywka marchy...@hotmail.com wrote:
[snip]
  If your datasets are *really* huge, check out some packages listed
  under the Large memory and out-of-memory data section of the
  HighPerformanceComputing task view at CRAN:

  http://cran.r-project.org/web/views/HighPerformanceComputing.html

 Does this have any specific limitations ? It sounds offhand like it
 does paging and all the needed buffering for arbitrary size
 data. Does it work with everything?

I'm not sure what the limitations are ... I know the bigmemory (and ff)
packages try hard to make using out-of-memory datasets as
transparent as possible.

That having been said, I guess you will have to port more advanced
methods to use such packages -- hence the existence of the biglm,
biganalytics and bigtabulate packages.
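
For instance, biglm fits a linear model in pieces, so chunks read from the
database one at a time can be folded in (a sketch; the model and the chunk
data frames are hypothetical):

library(biglm)

fit <- biglm(price ~ size, data = chunk1)   # first chunk of rows
fit <- update(fit, chunk2)                  # fold in further chunks as they arrive
summary(fit)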

 I seem to recall bigmemory came up
 before in this context and there was some problem.

Well -- I don't often see emails on this list complaining about their
functionality. That doesn't mean they're flawless (I also don't
scrutinize the list traffic too closely). It could be that not too
many people use them, or that people give up before they come knocking
when there is a problem.

Has something specifically failed for you in the past, or?

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] Processing large datasets

2011-05-25 Thread Mike Marchywka
 Date: Wed, 25 May 2011 12:32:37 -0400
 Subject: Re: [R] Processing large datasets
 From: mailinglist.honey...@gmail.com
 To: marchy...@hotmail.com
 CC: ro...@bestroman.com; r-help@r-project.org

 [snip]

  I seem to recall bigmemory came up
  before in this context and there was some problem.

 Well -- I don't often see emails on this list complaining about their
 functionality. That doesn't mean they're flawless (I also don't
 scrutinize the list traffic too closely). It could be that not too
 many people use them, or that people give up before they come knocking
 when there is a problem.

 Has something specifically failed for you in the past, or?

No, I haven't tried it; I may have it confused with something else.
But this question does come up a bit, usually along the lines of
"I tried to read a huge file into a data frame and wanted to pass
it to something with predictable memory access patterns, and it
ran out of memory. What can I do?" I guess I also stopped reading
anything after "using a DB", as this is generally not a replacement
for a data structure. I'll take a look when I have a big dataset that
I can't condense easily.