[R] Processing large datasets
Hi R list,

I'm new to R, so I'd like to ask about its capabilities. What I'm looking to do is run some statistical tests on quite big tables which are aggregated quotes from a market feed.

This is a typical set of data. Each day contains millions of records (up to 10 million before filtering):

2011-05-24 750 Bid DELL 14130770 400 15.4800 BATS 35482391 Y 1 1 0 0
2011-05-24 904 Bid DELL 14130772 300 15.4800 BATS 35482391 Y 1 0 0 0
2011-05-24 904 Bid DELL 14130773 135 15.4800 BATS 35482391 Y 1 0 0 0

I'll need to filter it first based on some criteria. Since I keep it in a MySQL database, that can be done with a query -- not super efficient, I've checked already. Then I need to aggregate the dataset into different time frames (time is represented in ms from midnight, like 35482391). Again, this can be done with a database query; I'm not sure which would be faster. The aggregated tables will be much smaller, on the order of thousands of rows per observation day. Then I calculate basic statistics: mean, standard deviation, sums, etc. After the stats are calculated, I need to perform some statistical hypothesis tests.

So my question is: which tool is faster for data aggregation and filtering on big datasets, MySQL or R?

Thanks,
--Roman N.
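P.S. In R terms, the aggregation step I have in mind would look roughly like this (just a sketch, untested; the data.frame 'quotes' and the column names 'msec' and 'price' are made up, the real schema differs):

    ## assume the filtered quotes for one day are already in a data.frame 'quotes'
    quotes$bucket <- quotes$msec %/% 60000   # 60,000 ms = 1-minute time frames
    agg <- aggregate(price ~ bucket, data = quotes,
                     FUN = function(p) c(mean = mean(p), sd = sd(p), sum = sum(p)))
    head(agg)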
Re: [R] Processing large datasets
In cases where I have to parse through large datasets that will not fit into R's memory, I will grab relevant data using SQL and then analyze said data using R. There are several packages designed to do this, like [1] and [2] below, that allow you to query a database using SQL and end up with that data in an R data.frame.

[1] http://cran.cnr.berkeley.edu/web/packages/RMySQL/index.html
[2] http://cran.cnr.berkeley.edu/web/packages/RSQLite/index.html

On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko ro...@bestroman.com wrote:
[original post snipped]

--
===
Jon Daily
Technician
===
#!/usr/bin/env outside
# It's great, trust me.
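For instance, with RMySQL something along these lines pulls a filtered subset straight into a data.frame (a rough sketch -- the connection details, table name and column names are placeholders, not your actual schema):

    library(RMySQL)
    con <- dbConnect(MySQL(), dbname = "marketdata", user = "me",
                     password = "secret", host = "localhost")
    ## let MySQL do the filtering; only the rows you need come back to R
    bids <- dbGetQuery(con, "SELECT * FROM quotes
                             WHERE symbol = 'DELL' AND side = 'Bid'")
    dbDisconnect(con)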
Re: [R] Processing large datasets
Hi,

On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko ro...@bestroman.com wrote:
[original post snipped]

Why not try a few experiments and see for yourself -- I guess the answer will depend on what exactly you are doing.

If your datasets are *really* huge, check out some packages listed under the "Large memory and out-of-memory data" section of the HighPerformanceComputing task view at CRAN:

http://cran.r-project.org/web/views/HighPerformanceComputing.html

Also, if you find yourself needing to do lots of grouping/summarizing type of calculations over large data frame-like objects, you might want to check out the data.table package:

http://cran.r-project.org/web/packages/data.table/index.html

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
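As a rough illustration of the grouping case (untested sketch; it assumes the day's filtered quotes fit in RAM as a data.frame 'quotes' with made-up columns 'symbol', 'bucket' and 'price'):

    library(data.table)
    dt <- data.table(quotes)
    setkey(dt, symbol, bucket)
    ## grouped summaries tend to be much faster than the data.frame equivalents
    stats <- dt[, list(n = .N, mean = mean(price), sd = sd(price),
                       total = sum(price)),
                by = list(symbol, bucket)]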
Re: [R] Processing large datasets
Thanks Jonathan.

I'm already using RMySQL to load data for a couple of days. I wanted to know what R's capabilities are if I want to process much bigger tables.

R always reads the whole set into memory and this might be a limitation in case of big tables, correct? Doesn't it use temporary files or something similar to deal with such amounts of data?

As an example, I know that SAS handles sas7bdat files up to 1TB on a box with 76GB memory, without noticeable issues.

--Roman

----- Original Message -----
[Jonathan Daily's reply and the quoted original post snipped]
Re: [R] Processing large datasets
Hi,

[Steve Lianoglou's reply snipped]

I don't think data.table is fundamentally different from the data.frame type, but thanks for the suggestion.

http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf
"Just like data.frames, data.tables must fit inside RAM"

The ff package by Adler, listed under "Large memory and out-of-memory data", is probably most interesting.

--Roman Naumenko
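For the record, getting a day of quotes into an ff structure looks roughly like this (untested sketch; the file name and the 'price' column are made up, and the read.csv.ffdf options may need tuning for a real feed file):

    library(ff)
    ## data are read in chunks and stored in flat files on disk,
    ## so the full table never has to fit in RAM at once
    qff <- read.csv.ffdf(file = "quotes-2011-05-24.csv", header = TRUE)
    dim(qff)
    mean(qff$price[])   # '[]' pulls one column into memory as an ordinary vector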
Re: [R] Processing large datasets
Take a look at the "High-Performance and Parallel Computing with R" CRAN Task View:

  http://cran.us.r-project.org/web/views/HighPerformanceComputing.html

specifically at the section labeled "Large memory and out-of-memory data".

There are some specific R features that have been implemented in a fashion that enables out-of-memory operations, but not all. I believe that Revolution's commercial version of R has developed 'big data' functionality, but I would defer to them for additional details.

You can of course use a 64-bit version of R on a 64-bit OS to increase accessible RAM; however, there will still be object size limitations predicated upon the fact that R uses 32-bit signed integers for indexing into objects. See ?Memory-limits for more information.

HTH,

Marc Schwartz

On May 25, 2011, at 8:49 AM, Roman Naumenko wrote:
[earlier messages snipped]
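A quick back-of-the-envelope check of whether one day fits in memory (the row and column counts below are only rough guesses based on the sample records):

    rows <- 10e6              # roughly one unfiltered day
    cols <- 14                # fields per record in the sample
    rows * cols * 8 / 2^20    # ~1068 MB if every field were an 8-byte double
    print(object.size(numeric(1e6)), units = "Mb")   # ~7.6 Mb per million doubles

Character columns and the intermediate copies R makes during filtering and aggregation can easily multiply that, so a 10-million-row day is already uncomfortable on a typical 4-8GB workstation unless the filtering happens in the database first.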
Re: [R] Processing large datasets
Hi,

On Wed, May 25, 2011 at 10:18 AM, Roman Naumenko ro...@bestroman.com wrote:
[snip]

I don't think data.table is fundamentally different from the data.frame type, but thanks for the suggestion.
http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf
"Just like data.frames, data.tables must fit inside RAM"

Yeah, I know -- I only mentioned it in the context of manipulating data.frame-like objects -- sorry if I wasn't clear.

If you've got data.frame-like data that you can store in RAM AND you find yourself wanting to do some summary calcs over different subgroups of it, you might find that data.table will be a quicker way to get that done -- the larger your data.frame/table, the more noticeable the speedup. To give you an idea of what scenarios I'm talking about, other packages you'd use to do the same would be plyr and sqldf.

For out-of-memory datasets, you're in a different realm -- hence the HPC Task View link.

The ff package by Adler, listed under "Large memory and out-of-memory data", is probably most interesting.

Cool. I've had some luck using the bigmemory package (and friends) in the past as well.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
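To sketch what the bigmemory route looks like (untested; the file name is invented, and note that a big.matrix holds a single numeric type, so character columns such as symbol or exchange would have to be encoded as numbers or kept elsewhere):

    library(bigmemory)
    library(biganalytics)
    ## file-backed matrix: the data live on disk and only touched pages come into RAM
    x <- read.big.matrix("quotes-numeric.csv", header = TRUE, type = "double",
                         backingfile = "quotes.bin", descriptorfile = "quotes.desc")
    colmean(x)   # biganalytics supplies basic column statistics for big.matrix objects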
Re: [R] Processing large datasets / non-answer, but Q on writing a data frame derivative
Roman Naumenko ro...@bestroman.com wrote:

I'm already using RMySQL to load data for a couple of days. I wanted to know what R's capabilities are if I want to process much bigger tables. R always reads the whole set into memory and this might be a limitation in case of big tables, correct?

OK, now I ask: perhaps for my first R effort I will try to find the source code for data.frame and make a paging or streaming derivative. That is, at least for fixed-size things, it could supply things like the total number of rows but have facilities for paging in and out of memory. Presumably all users of a data.frame have to work through a limited interface, which I guess could be expanded with various hints ("prefetch this", for example). I haven't looked at this idea in a while, but the issue keeps coming up -- dev list, maybe?

Anyway, for your immediate issues with a few statistics you could probably write a simple C++ program that ultimately becomes part of an R package. It is a good idea to see what is available, but these questions come up here a lot and the normal suggestion is a DB, which is exactly the opposite of what you want if you have predictable access patterns (although even here prefetch could probably be implemented).

[rest of quoted thread snipped]
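One quick way to prototype that kind of small C++ helper without building a full package first is the inline package together with Rcpp (rough, untested sketch; the function name 'basic_stats' is invented and it is not tuned for quote data specifically):

    library(inline)
    src <- '
        Rcpp::NumericVector v(x);
        double s = 0.0, ss = 0.0;
        int n = v.size();
        for (int i = 0; i < n; i++) { s += v[i]; ss += v[i] * v[i]; }
        double m = s / n;
        return Rcpp::List::create(Rcpp::Named("mean") = m,
                                  Rcpp::Named("sd") = sqrt((ss - n * m * m) / (n - 1)));
    '
    basic_stats <- cxxfunction(signature(x = "numeric"), body = src, plugin = "Rcpp")
    basic_stats(rnorm(1e6))   # single pass over the vector, returns mean and sd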
Re: [R] Processing large datasets
Steve Lianoglou wrote:

If your datasets are *really* huge, check out some packages listed under the "Large memory and out-of-memory data" section of the HighPerformanceComputing task view at CRAN:
http://cran.r-project.org/web/views/HighPerformanceComputing.html

Does this have any specific limitations? It sounds offhand like it does the paging and all the needed buffering for arbitrary-size data. Does it work with everything? I seem to recall bigmemory came up before in this context and there was some problem.

Thanks.

[rest of quoted thread snipped]
Re: [R] Processing large datasets
With PostgreSQL at least, R can also be used as the implementation language for stored procedures. Hence data transfers between processes can be avoided altogether.

http://www.joeconway.com/plr/

Implementation of such a procedure in R appears to be straightforward:

CREATE OR REPLACE FUNCTION overpaid (emp) RETURNS bool AS '
    if (20 < arg1$salary) {
        return(TRUE)
    }
    if (arg1$age < 30 && 10 < arg1$salary) {
        return(TRUE)
    }
    return(FALSE)
' LANGUAGE 'plr';

CREATE TABLE emp (name text, age int, salary numeric(10,2));
INSERT INTO emp VALUES ('Joe', 41, 25.00);
INSERT INTO emp VALUES ('Jim', 25, 12.00);
INSERT INTO emp VALUES ('Jon', 35, 5.00);

SELECT name, overpaid(emp) FROM emp;
 name | overpaid
------+----------
 Joe  | t
 Jim  | t
 Jon  | f
(3 rows)

Best

On Wednesday 25 May 2011 14:12:23 Jonathan Daily wrote:
[earlier messages snipped]
Re: [R] Processing large datasets
Hi,

On Wed, May 25, 2011 at 11:00 AM, Mike Marchywka marchy...@hotmail.com wrote:
[snip]

Does this have any specific limitations? It sounds offhand like it does the paging and all the needed buffering for arbitrary-size data. Does it work with everything?

I'm not sure what limitations ... I know the bigmemory (and ff) packages try hard to make using out-of-memory datasets as transparent as possible. That having been said, I guess you will have to port more advanced methods to use such packages, hence the existence of the biglm, biganalytics, and bigtabulate packages.

I seem to recall bigmemory came up before in this context and there was some problem.

Well -- I don't often see emails on this list complaining about their functionality. That doesn't mean they're flawless (I also don't scrutinize the list traffic too closely). It could be that not too many people use them, or that people give up before they come knocking when there is a problem.

Has something specifically failed for you in the past, or?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
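As an example of what that "porting" looks like in practice, biglm lets a regression be built up chunk by chunk, so only one slice of the table is in RAM at a time (untested sketch; the database, table, column names and chunk sizes are all invented):

    library(biglm)
    library(RMySQL)
    con <- dbConnect(MySQL(), dbname = "marketdata")
    ## fit on the first million rows, then fold in the next chunk with update()
    fit <- biglm(price ~ size,
                 data = dbGetQuery(con, "SELECT price, size FROM quotes LIMIT 0, 1000000"))
    fit <- update(fit, dbGetQuery(con, "SELECT price, size FROM quotes LIMIT 1000000, 1000000"))
    summary(fit)
    dbDisconnect(con)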
Re: [R] Processing large datasets
Steve Lianoglou mailinglist.honey...@gmail.com wrote:

Has something specifically failed for you in the past, or?

No, I haven't tried. I may have it confused with something else. But this question does come up a bit, usually along the lines of "I tried to read a huge file into a data frame and wanted to pass it to something with predictable memory access patterns, and it ran out of memory. What can I do?" I guess I also stopped reading anything after "use a DB", as this is generally not a replacement for a data structure. I'll take a look when I have a big dataset that I can't condense easily.

[rest of quoted thread snipped]