Re: [R] Large data and space use
Richard, I currently have no problem with running out of memory. I was referring to people who have said they use LARGE structures, and I am pointing out how those structures can temporarily get far larger even when not expected. Functions that temporarily balloon memory use might come with notices in their documentation. And, yes, some transformations may well be doable outside R, or in chunks. What gets me is how often users have no idea what happens when they invoke a package.

I am not against transformations and needed duplications. I am more interested in whether some existing code might be evaluated and updated in fairly harmless ways, such as removing objects as soon as they are definitely not needed. Of course there are tradeoffs. I have seen cases where only one column of a data.frame was needed, yet the entire data.frame was copied and then returned. That works, but clearly it might be more economical to ask for just the single column to be changed in place. People often use a sledgehammer when a thumbtack will do. As noted, though, R has features that often delay copying, so a full copy is not made and less memory is ever used. But people seem to think that since all "local" memory is generally returned when a function ends, why bother micromanaging it as the function runs?

Arguably, some R packages differ in what is kept and for how long. Standard R lets you specify which rows and which columns of a data.frame to keep in a single expression, as in df[rows, columns], while something like dplyr offers multiple smaller steps in a grammar of sorts, so you do a select() followed (often in a pipeline) by a filter(), or the same steps in the opposite order. Programmers sometimes decompose a change into minimal steps so that each one does just one thing well, which makes a more efficient combined implementation harder to achieve. That may also be a plus, especially if pipelined objects are released as the pipeline progresses rather than all at the end.
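As an editorial illustration of that contrast (the data.frame and column names here are made up), the one-step base-R subset and the two-verb dplyr version look like this:

```r
# Hypothetical data.frame; column names are illustrative only.
df <- data.frame(x = 1:10, y = letters[1:10], z = runif(10))

# Base R: rows and columns chosen in a single step, producing one new object.
base_sub <- df[df$x > 5, c("x", "y")]

# dplyr: each verb does one thing; an intermediate object exists between steps.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_sub <- dplyr::filter(dplyr::select(df, x, y), x > 5)
  stopifnot(identical(nrow(base_sub), nrow(dplyr_sub)))
}
```

Whether the intermediate between select() and filter() costs real memory depends on R's copy-on-modify behavior in each case, which is exactly the uncertainty discussed above.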
From: Richard O'Keefe
Sent: Sunday, November 28, 2021 3:54 AM
To: Avi Gross
Cc: R-help Mailing List
Subject: Re: [R] Large data and space use

If you have enough data that running out of memory is a serious problem, then a language like R or Python or Octave or Matlab that offers you NO control over storage may not be the best choice. You might need to consider Julia or even Rust. However, if you have enough data that running out of memory is a serious problem, your problems may be worse than you think. In 2021, Linux is *still* having OOM Killer problems. https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/ Your process hogging memory may cause some other process to be killed. Even if that doesn't happen, your process may simply be thrown off the machine without being warned.

It may be one of the biggest problems around in statistical computing: how to make it straightforward to carve up a problem so that it can be run on many machines. R has the 'Rmpi' and 'snow' packages, amongst others. https://CRAN.R-project.org/view=HighPerformanceComputing

Another approach is to select and transform data outside R. If you have data in some kind of database then doing select and transform in the database may be a good approach.

On Sun, 28 Nov 2021 at 06:57, Avi Gross via R-help <r-help@r-project.org> wrote: Several recent questions and answers have made me look at some code and I realized that some functions may not be great to use when you are dealing with very large amounts of data that may already be getting close to the limits of your memory. Does the function you call to do one thing to your object perhaps overdo it and make multiple copies and not delete them as soon as they are not needed? An example was a recent post suggesting a nice set of tools you can use to convert your data.frame so the columns are integers or dates no matter how they were read in from a CSV file or created.
What I noticed is that often copies of a sort were made by trying to change the original, say to one date format or another, and then deciding which, if any, to keep. Sometimes multiple transformations are tried, and this may be done repeatedly with intermediates left lying around. Yes, the memory will all be implicitly returned when the function completes. But often these functions invoke yet other functions which work on their own copies. You can end up with your original data temporarily using multiple times as much actual memory.

R does have features so some things are "shared" unless one copy or another changes. But in the cases I am looking at, changes are the whole idea. What I wonder is whether such functions should clearly call rm() or the equivalent as soon as possible when something is no longer needed. The various kinds of pipelines are another case in point, as they involve all kinds of hidden temporary variables that eventually need to be cleaned up. When are they removed?
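The "remove it as soon as it is no longer needed" idea can be sketched in a few lines for a linear (non-piped) workflow; the temp1/temp2 names follow the post, and the data are made up:

```r
# Stage 1: a large intermediate (size reduced here for illustration).
temp1 <- data.frame(x = rnorm(1e4))

# Stage 2 consumes stage 1's output...
temp2 <- transform(temp1, y = x^2)
rm(temp1)            # ...so release stage 1 immediately

result <- sum(temp2$y)
rm(temp2)            # release stage 2 once the final answer exists
invisible(gc())      # optionally nudge the garbage collector
```

Reusing the same name (temp1 <- f(temp1)) has a similar effect: the old contents lose their last reference and become collectable, though the old and new objects may briefly coexist during the assignment.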
Re: [R] Large data and space use
First priority is to obtain a correct answer. Second priority is to document it and write tests for it. Third priority is to optimize it. Sometimes it is useful to keep intermediate values around to support supplemental calculations a la "summary", and that may or may not lead to using rm where you might think it should be. But often the optimization step is simply neglected.

On November 27, 2021 9:56:50 AM PST, Avi Gross via R-help wrote:

> Several recent questions and answers have made me look at some code and I realized that some functions may not be great to use when you are dealing with very large amounts of data that may already be getting close to the limits of your memory. Does the function you call to do one thing to your object perhaps overdo it and make multiple copies and not delete them as soon as they are not needed?
>
> An example was a recent post suggesting a nice set of tools you can use to convert your data.frame so the columns are integers or dates no matter how they were read in from a CSV file or created.
>
> What I noticed is that often copies of a sort were made by trying to change the original, say to one date format or another, and then deciding which, if any, to keep. Sometimes multiple transformations are tried and this may be done repeatedly with intermediates left lying around. Yes, the memory will all be implicitly returned when the function completes. But often these functions invoke yet other functions which work on their own copies. You can end up with your original data temporarily using multiple times as much actual memory.
>
> R does have features so some things are "shared" unless one copy or another changes. But in the cases I am looking at, changes are the whole idea.
>
> What I wonder is whether such functions should clearly call an rm() or the equivalent as soon as possible when something is no longer needed.
>
> The various kinds of pipelines are another case in point as they involve all kinds of hidden temporary variables that eventually need to be cleaned up. When are they removed? I have seen pipelines with 10 or more steps as perhaps data is read in, has rows or columns removed or re-ordered, grouping applied, merged with others, and reports generated. The intermediates are often of similar size to the data and, if large, can add up. If writing the code linearly using temp1 and temp2 type variables to hold the output of one stage and the input of the next stage, I would be tempted to add an rm(temp1) as soon as it was finished being used, or just reuse the name temp1 so the previous contents are no longer being pointed to and can be taken by the garbage collector at some time.
>
> So I wonder if some functions should have a note in their manual pages specifying what may happen to the volume of data as they run. An example would be if I had a function that took a matrix and simply squared it using matrix multiplication. There are various ways to do this, and one of them simply makes a copy and invokes the built-in way in R that multiplies two matrices. It then returns the result. So you end up storing basically three times the size of the matrix right before you return it. Other methods might do the actual multiplication in loops operating on subsections of the matrix and, if done carefully, never keep more than say 2.1 times as much data around.
>
> Or is this not important often enough? All I know is data may be getting larger much faster than memory in our machines gets larger.

-- Sent from my phone.
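To make the matrix-squaring example concrete, here is a sketch (tiny sizes for illustration) of the one-shot product next to a blocked loop that writes one row at a time into a preallocated result, keeping the peak extra allocation to roughly one row:

```r
set.seed(1)
m <- matrix(rnorm(100), 10, 10)

# One-shot: input, result, and any internal copy coexist briefly.
sq_direct <- m %*% m

# Blocked: multiply one row at a time into a preallocated result.
sq_blocked <- matrix(0, nrow(m), ncol(m))
for (i in seq_len(nrow(m))) {
  sq_blocked[i, ] <- m[i, , drop = FALSE] %*% m
}
stopifnot(all.equal(sq_direct, sq_blocked))
```

The blocked version trades speed (R-level looping) for a lower memory ceiling, which is exactly the tradeoff the post is asking function authors to document.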
Please excuse my brevity. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Large data set
Hi, You can try dbLoad() from the hash package. Not sure whether it will be successful. A.K.

- Original Message -
From: Lorcan Treanor
To: r-help@r-project.org
Sent: Monday, July 23, 2012 8:02 AM
Subject: [R] Large data set

Hi all, I have a problem: trying to read in a data set that has about 112,000,000 rows and 8 columns, and obviously enough it was too big for R to handle. The columns are made up of 2 integer columns and 6 logical columns. The text file is about 4.2 Gb in size. Also I have 4 Gb of RAM and 218 Gb of available space on the hard drive. I tried the dumpDF function but it was too big. Also tried bringing the data in as 10 sets of about 12,000,000. Are there other ways of getting around the size of the data? Regards, Lorcan

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Large data set
First of all, try to determine the smallest file you can read with an empty workspace. Once you have done that, break up your file into sets of that size and read them in. The next question is what you want to do with 112M rows of data. Can you process them a set at a time and then aggregate the results? I have no problem reading in files with 10M rows on a 32-bit version of R on Windows with 3GB of memory. So a little more information on "what is the problem you are trying to solve" would be useful.

On Mon, Jul 23, 2012 at 8:02 AM, Lorcan Treanor wrote:
> Hi all,
>
> I have a problem: trying to read in a data set that has about 112,000,000
> rows and 8 columns, and obviously enough it was too big for R to handle. The
> columns are made up of 2 integer columns and 6 logical columns. The text
> file is about 4.2 Gb in size. Also I have 4 Gb of RAM and 218 Gb of
> available space on the hard drive. I tried the dumpDF function but it was
> too big. Also tried bringing the data in as 10 sets of about 12,000,000. Are
> there other ways of getting around the size of the data?
>
> Regards,
>
> Lorcan

-- Jim Holtman Data Munger Guru

What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it.

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
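Jim's suggestion (process a set at a time and aggregate) can be sketched with a file connection, so that only one chunk is ever held in memory; the file, chunk size, and column names below are made up for illustration:

```r
# Write a small stand-in for the big file (illustrative data).
f <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:100, b = 100:1), f, row.names = FALSE)

con <- file(f, open = "r")
invisible(readLines(con, n = 1))   # consume the header line once
total <- 0
repeat {
  chunk <- read.csv(con, header = FALSE, nrows = 30,
                    col.names = c("a", "b"))
  total <- total + sum(chunk$a)    # aggregate, then let the chunk go
  if (nrow(chunk) < 30) break      # a short chunk means end of file
}
close(con)
total                              # 5050, same as summing the whole column
```

Because read.csv picks up where the open connection left off, each pass sees only the next 30 rows; the per-chunk results are the only thing that accumulates.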
Re: [R] large data set (matrix) using image()
Works perfectly well with R-2.14.1 32-bit on a Windows device. Since you have not followed the posting guide and forgot to give details about your platform, there is not much we can do. Uwe Ligges

On 22.12.2011 23:08, Karen Liu wrote: When I use the image() function for a relatively small matrix it works perfectly, e.g.

x <- 1:100
z <- matrix(rnorm(10^4), 10^2, 10^2)
image(x = x, y = x, z = z, col = rainbow(3))

but when I want to plot a larger matrix, it doesn't really work. Most of the time, it just plots a few intermittent points.

x <- 1:1000
z <- matrix(rnorm(10^6), 10^3, 10^3)
image(x = x, y = x, z = z, col = rainbow(3))

Generating the matrix didn't seem to be a problem. I would appreciate any thoughts and ideas. I have tried using heatmap in bioconductor. However, I want to substitute the dendrograms with axes, but when I suppressed the dendrogram, I couldn't successfully add any axes. If anyone knows heatmap() well and would like to help via that function, it would work also. Cheers! Karen Liu

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
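One thing worth trying for matrices of this size (an editorial suggestion, not something proposed in the thread) is image()'s useRaster argument, which draws the regular grid as a single raster instead of a million individual rectangles, so devices with limited resolution don't drop cells:

```r
x <- 1:1000
z <- matrix(rnorm(10^6), 10^3, 10^3)

# Render to a file device; useRaster = TRUE draws one bitmap
# rather than 10^6 tiny rectangles (requires a regular grid).
out <- tempfile(fileext = ".png")
if (capabilities("png")) {
  png(out, width = 600, height = 600)
  image(x = x, y = x, z = z, col = rainbow(3), useRaster = TRUE)
  dev.off()
}
```

useRaster also needs a device that supports raster images (most modern screen and file devices do).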
Re: [R] Large Data
http://www.google.com/#hl=en&source=hp&q=R+big+data+sets&aq=f&aqi=g1&aql=&oq=&gs_rfai=&fp=686584f57664 Cheers Joris

On Mon, Jun 14, 2010 at 12:07 PM, Meenakshi wrote:
>
> Hi,
>
> I want to import a 1.5G CSV file in R. But the following error comes:
>
> 'Vector allocation 12.4 size'
>
> How to read the large CSV file in R?
>
> Anyone can help me?
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Large-Data-tp2254130p2254130.html
> Sent from the R help mailing list archive at Nabble.com.

-- Joris Meys Statistical consultant Ghent University Faculty of Bioscience Engineering Department of Applied mathematics, biometrics and process control tel : +32 9 264 59 87 joris.m...@ugent.be --- Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Large Data
And this one is only from last week. Please read the posting guides carefully. Cheers Joris

-- Forwarded message --
From: Joris Meys
Date: Sat, Jun 5, 2010 at 11:04 PM
Subject: Re: [R] What is the largest in memory data object you've worked with in R?
To: Nathan Stephens
Cc: r-help

You have to take some things into account:
- the maximum memory set for R might not be the maximum memory available
- R needs the memory not only for the dataset; matrix manipulations frequently require double the amount of memory taken by the dataset
- memory allocation is important when dealing with large datasets; there is plenty of information about that
- R has some packages to get around memory problems with big datasets

Read this discussion for example: http://tolstoy.newcastle.edu.au/R/help/05/05/4507.html and this page of Matthew Keller is a good summary too: http://www.matthewckeller.com/html/memory.html Cheers Joris

On Sat, Jun 5, 2010 at 12:32 AM, Nathan Stephens wrote:
> For me, I've found that I can easily work with 1 GB datasets. This includes
> linear models and aggregations. Working with 5 GB becomes cumbersome.
> Anything over that, and R croaks. I'm using a dual quad core Dell with 48
> GB of RAM.
>
> I'm wondering if there is anyone out there running jobs in the 100 GB
> range. If so, what does your hardware look like?
>
> --Nathan

-- Ghent University Faculty of Bioscience Engineering Department of Applied mathematics, biometrics and process control tel : +32 9 264 59 87 joris.m...@ugent.be --- Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

On Mon, Jun 14, 2010 at 12:07 PM, Meenakshi wrote:
>
> Hi,
>
> I want to import a 1.5G CSV file in R. But the following error comes:
>
> 'Vector allocation 12.4 size'
>
> How to read the large CSV file in R?
>
> Anyone can help me?
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Large-Data-tp2254130p2254130.html
> Sent from the R help mailing list archive at Nabble.com.

-- Joris Meys Statistical consultant Ghent University Faculty of Bioscience Engineering Department of Applied mathematics, biometrics and process control tel : +32 9 264 59 87 joris.m...@ugent.be --- Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Large data set in R
Thanks Kjetil. This is exactly what I wanted. Hardi

From: Kjetil Halvorsen
Cc: r-help
Sent: Monday, March 2, 2009 9:45:43 PM
Subject: Re: [R] Large data set in R

install.packages("biglm", dep=TRUE)
library(help=biglm)
kjetil

Hello, I'm trying to use the R statistical packages to do ANOVA analysis using aov() and lm(). I'm having a problem when I have a large data set as input, data from a Full Factorial Design Experiment with replications. R seems to store everything in memory, and it fails when memory is not enough to hold the massive computation. Has anyone successfully used R to do such analysis before? Are there any workarounds for this problem? Thanks, Hardi

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Large data set in R
install.packages("biglm", dep=TRUE)
library(help=biglm)
kjetil

On Mon, Mar 2, 2009 at 7:06 AM, Hardi wrote:
>
> Hello,
>
> I'm trying to use the R statistical packages to do ANOVA analysis using aov()
> and lm().
> I'm having a problem when I have a large data set as input, data from a Full
> Factorial Design Experiment with replications.
> R seems to store everything in memory, and it fails when memory is not
> enough to hold the massive computation.
>
> Has anyone successfully used R to do such analysis before? Are there any
> workarounds for this problem?
>
> Thanks,
>
> Hardi

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
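Kjetil's biglm suggestion keeps memory bounded by fitting the model on one chunk and folding further chunks in with update(), so the full data set never has to be in memory at once. A minimal sketch with made-up data (it only runs if the biglm package is installed):

```r
if (requireNamespace("biglm", quietly = TRUE)) {
  set.seed(42)
  d <- data.frame(x = rnorm(1000))
  d$y <- 2 * d$x + rnorm(1000)

  # Fit on the first chunk, then fold in the second chunk;
  # in real use each chunk would be read from disk in turn.
  fit <- biglm::biglm(y ~ x, data = d[1:500, ])
  fit <- update(fit, d[501:1000, ])
  print(coef(fit))   # slope should be near 2
}
```

biglm stores only the compact sufficient statistics of the fit, which is why the chunks can be discarded as soon as they have been folded in.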
Re: [R] Large data sets with R (binding to hadoop available?)
Hi Martin, Sorry for the late reply. I realize this might now be straying too far from r-help; if there is a better forum for this topic (R use with Hadoop) please let me know. I agree it would indeed be great to leverage Hadoop via R syntax or R itself. A first step is figuring out how computations could be translated into map and reduce steps. I am beginning to see efforts in this direction:

http://ml-site.grantingersoll.com/index.php?title=Incubator_proposal
http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf
http://cwiki.apache.org/MAHOUT/

Per Wikipedia, "A mahout is a person who drives an elephant". It would be nice if PIG and R either played well together or adopted each other's strengths (in driving the Hadoop elephant)! Avram

On Aug 22, 2008, at 9:24 AM, Martin Morgan wrote: Hi Avram -- My understanding is that Google-like map / reduce achieves throughput by coordinating distributed calculation with distributed data. snow, Rmpi, nws, etc provide a way of distributing calculations, but don't help with coordinating distributed calculation with distributed data. SQL (at least naively implemented as a single database server) doesn't help with distributed data, and the overhead of data movement from the server to compute nodes might be devastating. A shared file system across compute nodes (the implicit approach usually taken in parallel R applications) offloads data distribution to the file system, which may be effective for not-too-large (10's of GB?) data. Many non-trivial R algorithms are not directly usable in distributed map, because they expect to operate on 'all of the data' rather than on data chunks. Out-of-the-box 'reduce' in R is limited really to collation (the parallel lapply-like functions) or sapply-like simplification; one would rather have more talented reducers (e.g., to aggregate bootstrap results).
The list of talents required to exploit Hadoop starts to become intimidating (R, Java, Hadoop, PIG, + cluster management, etc), so it would certainly be useful to have that encapsulated in a way that requires only R skills! Martin <[EMAIL PROTECTED]> writes: Hi Apart from database interfaces such as sqldf which Gabor has mentioned, there are also packages specifically for handling large data: see the "ff" package, for instance. I am currently playing with parallelizing R computations via Hadoop. I haven't looked at PIG yet though. Rory -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Roland Rau Sent: 21 August 2008 20:04 To: Avram Aelony Cc: r-help@r-project.org Subject: Re: [R] Large data sets with R (binding to hadoop available?) Hi Avram Aelony wrote: Dear R community, I find R fantastic and use R whenever I can for my data analytic needs. Certain data sets, however, are so large that other tools seem to be needed to pre-process data such that it can be brought into R for further analysis. Questions I have for the many expert contributors on this list are: 1. How do others handle situations of large data sets (gigabytes, terabytes) for analysis in R ? I usually try to store the data in an SQLite database and interface via functions from the packages RSQLite (and DBI). No idea about Question No. 2, though. Hope this helps, Roland P.S. When I am sure that I only need a certain subset of large data sets, I still prefer to do some pre-processing in awk (gawk). 2.P.S. The size of my data sets are in the gigabyte range (not terabyte range). This might be important if your data sets are *really large* and you want to use sqlite: http://www.sqlite.org/whentouse.html __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting
Re: [R] Large Data Set Help
On Mon, 25 Aug 2008, Roland Rau wrote: Hi, Jason Thibodeau wrote: I am attempting to perform some simple data manipulation on a large data set. I have a snippet of the whole data set, and my small snippet is 2GB in CSV. Is there a way I can read my csv, select a few columns, and write it to an output file in real time? This is what I do right now to a small test file:

data <- read.csv('data.csv', header = FALSE)
data_filter <- data[c(1,3,4)]
write.table(data_filter, file = "filter_data.csv", sep = ",", row.names = FALSE, col.names = FALSE)

in this case, I think R is not the best tool for the job. I would rather suggest to use an implementation of the awk language (e.g. gawk). I just tried the following on WinXP (zipped file (87MB zipped, 1.2GB unzipped), piped into gawk)

unzip -p myzipfile.zip | gawk '{print $1, $3, $4}' > myfiltereddata.txt

Or

unzip -p myzipfile.zip | cut -d, -f1,3,4 > myfiltereddata.txt

But beware that both this and Roland's solution will return a,c,d for an input line consisting of a,"b,c",d,e,f HTH, Chuck

and it took about 90 seconds. Please note that you might need to specify your delimiter (field separator (FS) and output field separator (OFS)) =>

gawk '{FS=","; OFS=","} {print $1, $3, $4}' data.csv > filter_data.csv

I hope this helps (despite not encouraging the usage of R), Roland

Charles C. Berry (858) 534-2098 Dept of Family/Preventive Medicine E mailto:[EMAIL PROTECTED] UC San Diego http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Large Data Set Help
Hi, Jason Thibodeau wrote: I am attempting to perform some simple data manipulation on a large data set. I have a snippet of the whole data set, and my small snippet is 2GB in CSV. Is there a way I can read my csv, select a few columns, and write it to an output file in real time? This is what I do right now to a small test file:

data <- read.csv('data.csv', header = FALSE)
data_filter <- data[c(1,3,4)]
write.table(data_filter, file = "filter_data.csv", sep = ",", row.names = FALSE, col.names = FALSE)

In this case, I think R is not the best tool for the job. I would rather suggest using an implementation of the awk language (e.g. gawk). I just tried the following on WinXP (zipped file (87MB zipped, 1.2GB unzipped), piped into gawk)

unzip -p myzipfile.zip | gawk '{print $1, $3, $4}' > myfiltereddata.txt

and it took about 90 seconds. Please note that you might need to specify your delimiter (field separator (FS) and output field separator (OFS)) =>

gawk '{FS=","; OFS=","} {print $1, $3, $4}' data.csv > filter_data.csv

I hope this helps (despite not encouraging the usage of R), Roland

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Large Data Set Help
Establish a "connection" with the file you want to read, and read in 1,000 rows (or whatever you want) at a time. If you are using read.csv and there is a header, you might want to skip it after the first read, since there will be no header when you read the next 1000 rows. Also set 'as.is = TRUE' so that character fields are not converted to factors. You can then write out the columns that you want, and put the whole thing in a loop until you reach the end of the file.

On Mon, Aug 25, 2008 at 3:34 PM, Jason Thibodeau <[EMAIL PROTECTED]> wrote:
> I am attempting to perform some simple data manipulation on a large
> data set. I have a snippet of the whole data set, and my small snippet
> is 2GB in CSV.
>
> Is there a way I can read my csv, select a few columns, and write it
> to an output file in real time? This is what I do right now to a small
> test file:
>
> data <- read.csv('data.csv', header = FALSE)
> data_filter <- data[c(1,3,4)]
> write.table(data_filter, file = "filter_data.csv", sep = ",",
>             row.names = FALSE, col.names = FALSE)
>
> This test file writes the three columns to my desired output file. Can
> I do this while bypassing the storage of the entire array in memory?
>
> Thank you very much for the help.
> --
> Jason

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
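The loop Jim describes might be sketched like this (file names and the chunk size are illustrative):

```r
con <- file("data.csv", open = "r")   # the connection keeps its read position
chunk_size <- 1000
first <- TRUE

repeat {
  # read.csv signals an error once the connection is exhausted,
  # so treat that as end-of-file.
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size, as.is = TRUE),
    error = function(e) NULL)
  if (is.null(chunk) || nrow(chunk) == 0) break

  write.table(chunk[c(1, 3, 4)], file = "filter_data.csv", sep = ",",
              row.names = FALSE, col.names = FALSE,
              append = !first)        # create on the first pass, append after
  first <- FALSE
}
close(con)
```

Only chunk_size rows are ever in memory at once, at the cost of re-opening read.csv's parsing machinery per chunk.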
Re: [R] Large data sets with R (binding to hadoop available?)
Hi Avram --

My understanding is that Google-like map / reduce achieves throughput by coordinating distributed calculation with distributed data. snow, Rmpi, nws, etc. provide a way of distributing calculations, but don't help with coordinating distributed calculation with distributed data. SQL (at least naively implemented as a single database server) doesn't help with distributed data, and the overhead of data movement from the server to compute nodes might be devastating. A shared file system across compute nodes (the implicit approach usually taken by parallel R applications) offloads data distribution to the file system, which may be effective for not-too-large (10's of GB?) data.

Many non-trivial R algorithms are not directly usable in a distributed map, because they expect to operate on 'all of the data' rather than on data chunks. Out-of-the-box 'reduce' in R is limited really to collation (the parallel lapply-like functions) or sapply-like simplification; one would rather have more talented reducers (e.g., to aggregate bootstrap results).

The list of talents required to exploit Hadoop starts to become intimidating (R, Java, Hadoop, PIG, cluster management, etc.), so it would certainly be useful to have that encapsulated in a way that requires only R skills!

Martin

<[EMAIL PROTECTED]> writes:
> Hi
>
> Apart from database interfaces such as sqldf which Gabor has
> mentioned, there are also packages specifically for handling large
> data: see the "ff" package, for instance.
>
> I am currently playing with parallelizing R computations via Hadoop.
> I haven't looked at PIG yet though.
>
> Rory

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793
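The "distributing calculations" half that snow covers, and the collation-style reduce Martin mentions, can be sketched in a few lines (the cluster size and the toy chunked input are illustrative):

```r
library(snow)

# Two local socket workers; real use would list several machines.
cl <- makeCluster(2, type = "SOCK")

# "Map": each worker summarises its own chunk of the data.
chunks <- split(1:1e6, rep(1:2, each = 5e5))
partial <- parLapply(cl, chunks, sum)

# "Reduce": out of the box this is just collation plus a combine step --
# fine for sums, clumsy for, e.g., aggregating bootstrap results.
total <- Reduce(`+`, partial)

stopCluster(cl)
```

Note the data still had to exist on (or be shipped from) the master, which is exactly the coordination gap with distributed data that Martin describes.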
Re: [R] Large data sets with R (binding to hadoop available?)
On Thu, 21 Aug 2008, Roland Rau wrote (in part):

> Avram Aelony wrote:
>> 1. How do others handle situations of large data sets (gigabytes,
>> terabytes) for analysis in R ?
>
> I usually try to store the data in an SQLite database and interface
> via functions from the packages RSQLite (and DBI).

I use netCDF for (genomic) datasets in the 100Gb range, with the ncdf package, because SQLite was too slow for the sort of queries I needed. HDF5 would be another possibility; I'm not sure of the current status of the HDF5 support in Bioconductor, though.

-thomas

Thomas Lumley
Assoc. Professor, Biostatistics
[EMAIL PROTECTED]
University of Washington, Seattle
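The windowed access netCDF gives you looks roughly like this. The sketch below uses ncdf4 (the successor to the ncdf package Thomas mentions); the file name, variable name, and dimensions are hypothetical:

```r
library(ncdf4)  # successor to the ncdf package

nc <- nc_open("genotypes.nc")   # hypothetical 100Gb-scale file

# Read only rows 1..1000 of one 2-D variable; count = -1 means
# "all of that dimension". The rest of the file never enters memory.
block <- ncvar_get(nc, "dosage", start = c(1, 1), count = c(1000, -1))

nc_close(nc)
```

Because the file stores arrays in a known binary layout, a slab like this is a seek-and-read rather than the full-table scan that made SQLite too slow for such queries.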
Re: [R] Large data sets with R (binding to hadoop available?)
Hi

Apart from database interfaces such as sqldf which Gabor has mentioned, there are also packages specifically for handling large data: see the "ff" package, for instance.

I am currently playing with parallelizing R computations via Hadoop. I haven't looked at PIG yet though.

Rory

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Roland Rau
Sent: 21 August 2008 20:04
To: Avram Aelony
Cc: r-help@r-project.org
Subject: Re: [R] Large data sets with R (binding to hadoop available?)

> I usually try to store the data in an SQLite database and interface
> via functions from the packages RSQLite (and DBI).
> [...]
> Roland
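The ff approach Rory mentions keeps the data in flat files on disk and pulls sections into RAM only as they are touched; a minimal sketch (the file name is hypothetical):

```r
library(ff)

# Columns are stored on disk; the returned ffdf behaves much like a
# data.frame but does not hold the full data in memory.
d <- read.csv.ffdf(file = "data.csv", header = FALSE)

# Realise a single column in memory only when it is actually needed.
first_col <- d[["V1"]][]
```

This trades some indexing flexibility for the ability to work with objects larger than RAM inside a single R session.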
Re: [R] Large data sets with R (binding to hadoop available?)
Hi

Avram Aelony wrote:
> Dear R community,
>
> I find R fantastic and use R whenever I can for my data analytic
> needs. Certain data sets, however, are so large that other tools seem
> to be needed to pre-process data such that it can be brought into R
> for further analysis.
>
> Questions I have for the many expert contributors on this list are:
>
> 1. How do others handle situations of large data sets (gigabytes,
> terabytes) for analysis in R ?

I usually try to store the data in an SQLite database and interface via functions from the packages RSQLite (and DBI).

No idea about Question No. 2, though.

Hope this helps,
Roland

P.S. When I am sure that I only need a certain subset of large data sets, I still prefer to do some pre-processing in awk (gawk).

2.P.S. The size of my data sets are in the gigabyte range (not terabyte range). This might be important if your data sets are *really large* and you want to use sqlite: http://www.sqlite.org/whentouse.html
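A minimal sketch of the workflow Roland describes, via DBI/RSQLite (the file, table, and column names here are hypothetical; the tiny inline data.frame stands in for the real import, which RSQLite can also do straight from a delimited file so large data need not pass through R):

```r
library(DBI)

# One on-disk database file; only query results ever occupy R's memory.
con <- dbConnect(RSQLite::SQLite(), "bigdata.sqlite")

# Stand-in for the real bulk load (e.g. the sqlite3 CLI's .import).
dbWriteTable(con, "raw",
             data.frame(col1 = 1:5, col4 = c(50, 150, 99, 200, 3)),
             overwrite = TRUE)

# Pull back just the rows and columns the analysis needs.
d <- dbGetQuery(con, "SELECT col1 FROM raw WHERE col4 > 100")

dbDisconnect(con)
```

The selection and filtering happen inside SQLite, so d contains only the two qualifying rows rather than the whole table.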
Re: [R] Large data sets with R (binding to hadoop available?)
The RSQLite package can read files into an SQLite database without the data going through R. The sqldf package provides a front end that makes it particularly easy to use -- basically you need only a couple of lines of code. Other databases have similar facilities. See: http://sqldf.googlecode.com

On Thu, Aug 21, 2008 at 2:32 PM, Avram Aelony <[EMAIL PROTECTED]> wrote:
>
> Dear R community,
>
> I find R fantastic and use R whenever I can for my data analytic
> needs. Certain data sets, however, are so large that other tools seem
> to be needed to pre-process data such that it can be brought into R
> for further analysis.
>
> Questions I have for the many expert contributors on this list are:
>
> 1. How do others handle situations of large data sets (gigabytes,
> terabytes) for analysis in R ?
>
> 2. Are there existing ways or plans to devise ways to use the R
> language to interact with Hadoop or PIG ? The Hadoop project by Apache
> has been successful at processing data on a large scale using the
> map-reduce algorithm. A sister project uses an emerging language
> called "PIG-latin" or simply "PIG" for using the Hadoop framework in a
> manner reminiscent of the look and feel of R. Is there an opportunity
> here to create a conceptual bridge since these projects are also
> open-source? Does it already exist?
>
> Thanks in advance for your comments.
>
> -Avram
>
> ---
> Information about Hadoop:
> http://wiki.apache.org/hadoop/
> http://en.wikipedia.org/wiki/Hadoop
>
> "Apache Hadoop is a free Java software framework that supports data
> intensive distributed applications running on large clusters of
> commodity computers.[1] It enables applications to work with thousands
> of nodes and petabytes of data. Hadoop was inspired by Google's
> MapReduce and Google File System (GFS) papers."
>
> ---
> Information about PIG:
> http://incubator.apache.org/pig/
>
> "Pig is a platform for analyzing large data sets that consists of a
> high-level language for expressing data analysis programs, coupled
> with infrastructure for evaluating these programs. The salient
> property of Pig programs is that their structure is amenable to
> substantial parallelization, which in turn enables them to handle very
> large data sets.
>
> At the present time, Pig's infrastructure layer consists of a compiler
> that produces sequences of Map-Reduce programs, for which large-scale
> parallel implementations already exist (e.g., the Hadoop subproject).
> Pig's language layer currently consists of a textual language called
> Pig Latin, which has the following key properties:
>
> * Ease of programming. It is trivial to achieve parallel execution of
> simple, "embarrassingly parallel" data analysis tasks. Complex tasks
> comprised of multiple interrelated data transformations are explicitly
> encoded as data flow sequences, making them easy to write, understand,
> and maintain.
> * Optimization opportunities. The way in which tasks are encoded
> permits the system to optimize their execution automatically, allowing
> the user to focus on semantics rather than efficiency.
> * Extensibility. Users can create their own functions to do
> special-purpose processing."
> ---
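Gabor's "couple of lines" with sqldf might look like this (the file name and column names are hypothetical; in read.csv.sql the input is always referred to as 'file' in the SQL statement):

```r
library(sqldf)

# The CSV is loaded into a temporary SQLite database and only the
# result of the query comes back to R; the temporary database is
# then dropped.
flt <- read.csv.sql("data.csv",
                    sql = "select col1, col3, col4 from file")
```

This gives the select-and-filter-outside-R behaviour without leaving the R prompt.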