[R] FW: Large datasets in R
Hi,

I have two further comments/questions about large datasets in R.

1. Does R's ability to handle large datasets depend on the operating
system's use of virtual memory? In theory, at least, VM should make the
difference between installed RAM and virtual memory on a hard drive
primarily a determinant of how fast R will calculate rather than whether
or not it can do the calculations. However, if R has some low-level
routines that have to be memory resident and use more memory as the
amount of data increases, this may not hold. Can someone shed light on
this?

2. What 64-bit versions of R are available at present?

Marsh Feldman
The University of Rhode Island

-----Original Message-----
From: Thomas Lumley [mailto:[EMAIL PROTECTED]
Sent: Monday, July 17, 2006 3:21 PM
To: Deepankar Basu
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] Large datasets in R

On Mon, 17 Jul 2006, Deepankar Basu wrote:

> Hi!
>
> I am a student of economics and currently do most of my statistical
> work using STATA. For various reasons (not least of which is an
> aversion to proprietary software), I am thinking of shifting to R. At
> the current juncture my concern is the following: would I be able to
> work on relatively large data-sets using R? For instance, I am
> currently working on a data-set which is about 350MB in size. Would it
> be possible to work with data-sets of such sizes using R?

The answer depends on a lot of things, but most importantly

1) What you are going to do with the data
2) Whether you have a 32-bit or 64-bit version of R
3) How much memory your computer has.

In a 32-bit version of R (where R will not be allowed to address more
than 2-3Gb of memory) an object of size 350Mb is large enough to cause
problems (see e.g. the R Installation and Administration Guide).

If your 350Mb data set has lots of variables and you only use a few at a
time, then you may not have any trouble even on a 32-bit system once you
have read in the data.

If you have a 64-bit version of R and a few Gb of memory, then there
should be no real difficulty in working with a data set of that size for
most analyses. You might come across some analyses (e.g. some cluster
analysis functions) that use n^2 memory for n observations and so break
down.

-thomas

Thomas Lumley            Assoc. Professor, Biostatistics
[EMAIL PROTECTED]        University of Washington, Seattle
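A minimal R sketch of the advice above: estimating a data set's memory
footprint before reading it, and reading only the variables you actually
use. The file name "survey.csv" and the column layout are hypothetical.

    ## Back-of-the-envelope footprint: a double takes 8 bytes, so a
    ## numeric table of 1e6 rows by 50 columns needs roughly 381 Mb.
    1e6 * 50 * 8 / 2^20

    ## Read only the variables actually used by marking the rest "NULL"
    ## in colClasses; skipped columns never take up memory.
    classes <- rep("NULL", 50)
    classes[c(1, 7, 12)] <- c("numeric", "numeric", "factor")
    dat <- read.csv("survey.csv", colClasses = classes)

    ## How much memory does the resulting object actually occupy?
    print(object.size(dat), units = "Mb")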
Re: [R] FW: Large datasets in R
In my experience, the OS's use of virtual memory is only relevant in the
rough sense that the OS can keep *other* running applications in virtual
memory so that R can use as much of the physical memory as possible.
Once R itself overflows into virtual memory, it quickly becomes unusable.

I'm not sure I understand your second question. As R is available in
source code form, it can be compiled for many 64-bit operating systems.

-roger

Marshall Feldman wrote:
> 1. Does R's ability to handle large datasets depend on the operating
> system's use of virtual memory? [...]
> 2. What 64-bit versions of R are available at present?

--
Roger D. Peng | http://www.biostat.jhsph.edu/~rpeng/
Re: [R] FW: Large datasets in R
On Tue, 18 Jul 2006, Marshall Feldman wrote:

> 1. Does R's ability to handle large datasets depend on the operating
> system's use of virtual memory? [...]

The issue is address space, not RAM. The limits Thomas mentions are on
VM, not RAM, and it is common to have at least as much RAM installed as
the VM address space for a user process. There is no low-level code in R
that has any idea whether it is memory-resident, nor AFAIK is there any
portable way to find out in a user process on a modern OS. (R is as far
as possible written to the C99 and POSIX standards.)

> 2. What 64-bit versions of R are available at present?

Any OS with a 64-bit CPU for which you can find a viable 64-bit compiler
suite. We've had 64-bit versions of R since the last millennium on
Solaris, IRIX, HP-UX, OSF/1 and more recently on AIX, FreeBSD, Linux,
MacOS X (on the so-called G5) and probably others. The exception is
probably Windows, for which there is no known free `viable 64-bit
compiler suite', but it is likely that there are commercial ones.

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax: +44 1865 272595
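For reference, a one-liner to check which flavour a given installation
is (a minimal sketch; the exact output varies by platform and build):

    .Machine$sizeof.pointer   # 8 on a 64-bit build of R, 4 on 32-bit
    R.version$arch            # CPU architecture R was compiled for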
Re: [R] FW: Large datasets in R
Hi,

I have a related question. How differently do other statistical packages
handle large data? The original post claims that 350 MB is fine in
Stata. Someone suggested S-Plus. I have heard people say that SAS can
handle large data sets. Why can the others do it while R seems to have a
problem? Don't these packages load the data into RAM?

--
Ritwik Sinha
Graduate Student
Epidemiology and Biostatistics
Case Western Reserve University
http://darwin.cwru.edu/~rsinha
Re: [R] FW: Large datasets in R
S-Plus stores objects as files whereas R stores them in memory. SAS was
developed many years ago, when optimizing computer resources was more
important than it is now.

On 7/18/06, Ritwik Sinha <[EMAIL PROTECTED]> wrote:
> Why can the others do it while R seems to have a problem? Don't these
> packages load the data into RAM?
Re: [R] FW: Large datasets in R
On Tue, 18 Jul 2006, Ritwik Sinha wrote:

> How differently do other statistical packages handle large data? The
> original post claims that 350 MB is fine in Stata. [...] Don't these
> packages load the data into RAM?

Stata does load the data into RAM, and does have limits, for the same
reason that R does. However, Stata has a less flexible representation of
its data (basically one rectangular dataset), and so it can handle
somewhat larger data sets for any given memory size. For example, even
with 512Mb of memory a 350Mb data set might be usable in Stata, and with
1Gb it certainly would be. Stata is also faster for a given memory load,
apparently because of its simpler language design [some evidence for
this is that the recent language additions to support flexible graphics
run rather more slowly than e.g. lattice in R].

The other approach is to write the estimation routines so that only part
of the data need be in memory at a given time. *Some* procedures in SAS
and SPSS work this way, and this is the idea behind the S-PLUS 7.0
system for handling large data sets. This approach requires the
programmer to handle the reading of sections of the data from disk,
something that can only be automated to a limited extent. People have
used R in this way, storing data in a database and reading it as
required. There are also some efforts to provide facilities to support
this sort of programming (such as the current project funded by Google
Summer of Code: http://tolstoy.newcastle.edu.au/R/devel/06/05/5525.html).

One reason there isn't more of this is that relying on Moore's Law has
worked very well over the years.

-thomas

Thomas Lumley            Assoc. Professor, Biostatistics
[EMAIL PROTECTED]        University of Washington, Seattle
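A minimal sketch of the database approach described above, using the DBI
interface with an SQLite backend; the database file, table, and variable
names here (bigdata.sqlite, survey, income, education, year) are
hypothetical.

    library(DBI)
    library(RSQLite)

    ## The full data set lives on disk in an SQLite file; R only ever
    ## holds the rows and columns requested by the query.
    con <- dbConnect(SQLite(), dbname = "bigdata.sqlite")

    dat <- dbGetQuery(con,
        "SELECT income, education FROM survey WHERE year = 2005")
    fit <- lm(income ~ education, data = dat)

    dbDisconnect(con)

The selection and subsetting happen in the database engine, so the
memory R needs is proportional to the query result, not to the data set.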
Re: [R] FW: Large datasets in R
Well, SPSS used to claim that all its algorithms dealt with only one
case at a time, and that it could therefore handle very large files. I
suppose a large correlation matrix could cause it problems.

Marsh Feldman
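To illustrate the one-case(-or-chunk)-at-a-time idea: a correlation
depends only on running sums, so it can be computed while streaming a
file through memory in small pieces. A minimal sketch follows; the file
and variable names are hypothetical, and the single-pass formula is the
textbook one (it can be numerically unstable for badly scaled data).

    ## Stream survey.csv in 10,000-row chunks, accumulating the sums a
    ## correlation needs; the whole file is never in memory at once.
    con <- file("survey.csv", open = "r")
    hdr <- strsplit(readLines(con, n = 1), ",")[[1]]

    n <- 0; sx <- sy <- sxx <- syy <- sxy <- 0
    repeat {
      chunk <- tryCatch(
        read.csv(con, header = FALSE, nrows = 10000, col.names = hdr),
        error = function(e) NULL)   # read.csv errors at end of input
      if (is.null(chunk) || nrow(chunk) == 0) break
      x <- chunk$income; y <- chunk$education
      n   <- n + length(x)
      sx  <- sx + sum(x);    sy  <- sy + sum(y)
      sxx <- sxx + sum(x^2); syy <- syy + sum(y^2)
      sxy <- sxy + sum(x * y)
    }
    close(con)

    ## Pearson correlation from the accumulated sums.
    (n * sxy - sx * sy) / sqrt((n * sxx - sx^2) * (n * syy - sy^2))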
Re: [R] FW: Large datasets in R
[Thomas Lumley]
> People have used R in this way, storing data in a database and reading
> it as required. There are also some efforts to provide facilities to
> support this sort of programming (such as the current project funded
> by Google Summer of Code:
> http://tolstoy.newcastle.edu.au/R/devel/06/05/5525.html).

Interesting project indeed! However, if R needs more swapping because
arrays do not all fit in physical memory, crudely replacing swapping
with database accesses is not necessarily going to buy a drastic speed
improvement: the paging gets done in user space instead of being done in
the kernel.

Long ago, while working on CDC mainframes, astonishing at the time but
tiny by today's standards, there was a program able to invert or run
simplexes on very big matrices. I do not remember the name of the
program, and never studied it more than superficially (I was in computer
support for researchers, but not a researcher myself). The program was
documented as being extremely careful about organising accesses to rows
and columns (or parts thereof) in such a way that real memory was best
used. In other words, at the core of this program was a paging system,
very specialised and cooperative with the problems meant to be solved.
However, the source of this program was just plain huge (let's say, from
memory, about three or four times the size of the optimizing FORTRAN
compiler, which I already knew to be an impressive algorithmic
undertaking). So, right or wrong, the prejudice stuck solidly in me at
the time that handling big arrays the right way, speed-wise, ought to be
very difficult.

> One reason there isn't more of this is that relying on Moore's Law has
> worked very well over the years.

On the other hand, the computational needs of scientific problems grow
fairly quickly to the size of our ability to solve them. Take weather
forecasting, for example. 3-D geographical grids are never fine enough
for the resolution meteorologists would like to get, and the time
required for each prediction step grows very rapidly while precision
increases by not so much. By merely tuning a few parameters, these
people may easily pump nearly all the available cycles out of the
supercomputers given to them, and they do so without hesitation. Moore's
Law will never succeed in calming their starving hunger! :-)

--
François Pinard   http://pinard.progiciels-bpi.ca
Re: [R] FW: Large datasets in R
Or, more succinctly, Pinard's Law:

    The demands of ever more data always exceed the capabilities of
    ever better hardware.

;-D

--
Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA