[R] FW: Large datasets in R

2006-07-18 Thread Marshall Feldman
Hi,

I have two further comments/questions about large datasets in R.
 
1. Does R's ability to handle large datasets depend on the operating
system's use of virtual memory? In theory, at least, VM should make the
difference between installed RAM and virtual memory on a hard drive
primarily a determinant of how fast R will calculate rather than whether or
not it can do the calculations. However, if R has some low-level routines
that have to be memory resident and use more memory as the amount of data
increases, this may not hold. Can someone shed light on this?

2. What 64-bit versions of R are available at present?

Marsh Feldman
The University of Rhode Island

-Original Message-
From: Thomas Lumley [mailto:[EMAIL PROTECTED] 
Sent: Monday, July 17, 2006 3:21 PM
To: Deepankar Basu
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] Large datasets in R

On Mon, 17 Jul 2006, Deepankar Basu wrote:

> Hi!
>
> I am a student of economics and currently do most of my statistical work
> using STATA. For various reasons (not least of which is an aversion for
> proprietary software), I am thinking of shifting to R. At the current
> juncture my concern is the following: would I be able to work on
> relatively large data-sets using R? For instance, I am currently working
> on a data-set which is about 350MB in size. Would it be possible to work
> with data-sets of such sizes using R?


The answer depends on a lot of things, but most importantly
1) What you are going to do with the data
2) Whether you have a 32-bit or 64-bit version of R
3) How much memory your computer has.

In a 32-bit version of R (where R will not be allowed to address more than 
2-3Gb of memory) an object of size 350Mb is large enough to cause problems 
(see eg the R Installation and Administration Guide).

If your 350Mb data set has lots of variables and you only use a few at a 
time then you may not have any trouble even on a 32-bit system once you 
have read in the data.
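
For instance, something along these lines (a minimal sketch; the file 
name, column count and column types are invented) reads only a handful 
of columns and then shows what they cost in memory:

## Hypothetical file with 50 columns, of which we keep only three.
## A colClasses entry of "NULL" tells read.csv to skip that column.
cls <- rep("NULL", 50)
cls[c(1, 7, 12)] <- c("integer", "numeric", "factor")
dat <- read.csv("big-survey.csv", colClasses = cls)

as.numeric(object.size(dat)) / 2^20   # size of the loaded subset, in Mb
gc()                                  # overall memory currently in use by R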

If you have a 64-bit version of R and a few Gb of memory then there should 
be no real difficulty in working with that size of data set for most 
analyses.  You might come across some analyses (eg some cluster analysis 
functions) that use n^2 memory for n observations and so break down.
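
As a back-of-the-envelope illustration (a hypothetical sample size, not 
tied to any particular function): a dense distance matrix for n 
observations needs roughly n^2 doubles at 8 bytes each.

n <- 50000                     # a hypothetical sample size
n^2 * 8 / 2^30                 # ~18.6 Gb for the full n x n matrix
n * (n - 1) / 2 * 8 / 2^30     # ~9.3 Gb even for just the lower triangle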


-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] FW: Large datasets in R

2006-07-18 Thread Roger D. Peng
In my experience, the OS's use of virtual memory is only relevant in the rough 
sense that the OS can store *other* running applications in virtual memory so 
that R can use as much of the physical memory as possible.  Once R itself 
overflows into virtual memory it quickly becomes unusable.

I'm not sure I understand your second question.  As R is available in source 
code form, it can be compiled for many 64-bit operating systems.

-roger


-- 
Roger D. Peng  |  http://www.biostat.jhsph.edu/~rpeng/



Re: [R] FW: Large datasets in R

2006-07-18 Thread Prof Brian Ripley
On Tue, 18 Jul 2006, Marshall Feldman wrote:

> Hi,
> 
> I have two further comments/questions about large datasets in R.
>  
> 1. Does R's ability to handle large datasets depend on the operating
> system's use of virtual memory? In theory, at least, VM should make the
> difference between installed RAM and virtual memory on a hard drive
> primarily a determinant of how fast R will calculate rather than whether or
> not it can do the calculations. However, if R has some low-level routines
> that have to be memory resident and use more memory as the amount of data
> increases, this may not hold. Can someone shed light on this?

The issue is address space, not RAM.  The limits Thomas mentions are on 
VM, not RAM, and it is common to have at least as much RAM installed as 
the VM address space for a user process.

There is no low-level code in R that has any idea whether it is 
memory-resident, nor AFAIK is there any portable way to find this out from 
a user process on a modern OS.  (R is as far as possible written to the 
C99 and POSIX standards.)

> 2. What 64-bit versions of R are available at present?

Any OS with a 64-bit CPU that you can find a viable 64-bit compiler suite 
for.  We've had 64-bit versions of R since the last millennium on Solaris, 
IRIX, HP-UX, OSF/1 and more recently on AIX, FreeBSD, Linux, MacOS X (on 
so-called G5) and probably others.

The exception is probably Windows, for which there is no known free 
`viable 64-bit compiler suite', but it is likely that there are commercial 
ones.
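
A quick way to check which kind of build you are running is to ask R 
itself, for example:

8 * .Machine$sizeof.pointer   # 64 on a 64-bit build, 32 on a 32-bit one
R.version$arch                # CPU architecture the build targets
R.version$platform            # full platform string, e.g. x86_64-unknown-linux-gnu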
 


-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595



Re: [R] FW: Large datasets in R

2006-07-18 Thread Ritwik Sinha
Hi,

I have a related question: how do other statistical packages handle
large data?

The original post says that a 350 MB data set is fine in Stata. Someone
suggested S-Plus. I have heard people say that SAS can handle large
data sets. Why can the others do it while R seems to have a problem?
Don't these packages load the data into RAM?

-- 
Ritwik Sinha
Graduate Student
Epidemiology and Biostatistics
Case Western Reserve University

http://darwin.cwru.edu/~rsinha



Re: [R] FW: Large datasets in R

2006-07-18 Thread Gabor Grothendieck
S-Plus stores objects as files whereas R stores them in memory.
SAS was developed many years ago when optimizing computer
resources was more important than it is now.




Re: [R] FW: Large datasets in R

2006-07-18 Thread Thomas Lumley

Stata does load the data into RAM and does have limits for the same reason 
that R does. However, Stata has a less flexible representation of its data 
(basically one rectangular dataset) and so it can handle somewhat larger 
data sets for any given memory size. For example, even with 512Mb of 
memory a 350Mb data set might be usable in Stata, and with 1Gb it would 
certainly be. Stata is also faster for a given memory load, apparently 
because of its simpler language design [some evidence for this is that the 
recent language additions to support flexible graphics run rather more 
slowly than eg lattice in R].

The other approach is to write the estimation routines so that only part 
of the data need be in memory at a given time.  *Some* procedures in SAS 
and SPSS work this way, and this is the idea of the S-PLUS 7.0 system for 
handling large data sets.  This approach requires the programmer to 
handle the reading of sections of data into memory, something that can 
only be automated to a limited extent.
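
In plain R the same idea looks roughly like this (a sketch only: the 
file name, column name and chunk size are invented, and real 
implementations are far more elaborate). The point is that only one 
chunk is ever held in memory:

## One-pass mean of a single variable, read 10,000 rows at a time.
con <- file("bigdata.csv", open = "r")
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])
total <- 0; nobs <- 0
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = 10000, col.names = header),
    error = function(e) NULL)         # read.csv errors once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  total <- total + sum(chunk$income)  # 'income' is a made-up column name
  nobs  <- nobs + nrow(chunk)
}
close(con)
total / nobs                          # the mean, computed chunk by chunk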

People have used R in this way, storing data in a database and reading it 
as required. There are also some efforts to provide facilities to support 
this sort of programming (such as the current project funded by Google 
Summer of Code:  http://tolstoy.newcastle.edu.au/R/devel/06/05/5525.html). 
One reason there isn't more of this is that relying on Moore's Law has 
worked very well over the years.
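
The database version has the same shape; here is a hedged sketch using 
the DBI and RSQLite packages (the database file, table and column are 
invented, and this is not the Summer of Code project itself, just the 
general idea):

library(RSQLite)                       # provides the DBI interface to SQLite
con <- dbConnect(SQLite(), dbname = "survey.db")
res <- dbSendQuery(con, "SELECT income FROM respondents")
total <- 0; nobs <- 0
while (!dbHasCompleted(res)) {
  chunk <- fetch(res, n = 10000)       # pull 10,000 rows at a time
  if (nrow(chunk) == 0) break
  total <- total + sum(chunk$income)
  nobs  <- nobs + nrow(chunk)
}
dbClearResult(res)
dbDisconnect(con)
total / nobs                           # same one-pass mean as above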


  -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] FW: Large datasets in R

2006-07-18 Thread Marshall Feldman
Well, SPSS used to claim that all its algorithms dealt with only one case at
a time and therefore that it could handle very large files. I suppose a
large correlation matrix could cause it problems.

Marsh Feldman



Re: [R] FW: Large datasets in R

2006-07-18 Thread François Pinard
[Thomas Lumley]

>People have used R in this way, storing data in a database and reading it 
>as required. There are also some efforts to provide facilities to support 
>this sort of programming (such as the current project funded by Google 
>Summer of Code:  http://tolstoy.newcastle.edu.au/R/devel/06/05/5525.html). 

Interesting project indeed!  However, if R ends up doing more swapping 
because arrays do not all fit in physical memory, crudely replacing 
swapping with database accesses is not necessarily going to buy a 
drastic speed improvement: the paging gets done in user space instead 
of in the kernel.

Long ago, while working on CDC mainframes (astonishing machines at the 
time, but tiny by today's standards), there was a program able to invert, 
or run the simplex method on, very big matrices.  I do not remember the 
name of the program, and only ever studied it superficially (I was in 
computer support for researchers, not a researcher myself).  The program was
documented as being extremely careful at organising accesses to rows and 
columns (or parts thereof) in such a way that real memory was best used.
In other words, at the core of this program was a paging system very 
specialised and cooperative with the problems meant to be solved.

However, the source of this program was just plain huge (from memory, 
about three or four times the size of the optimizing FORTRAN compiler, 
which I already knew to be an impressive algorithmic undertaking).  So, 
right or wrong, the prejudice stuck solidly in me at the time that 
handling big arrays the right way, speed-wise, ought to be very difficult.

>One reason there isn't more of this is that relying on Moore's Law has 
>worked very well over the years.

On the other hand, the computational needs of scientific problems grow 
fairly quickly to the limit of our ability to meet them.  Take weather 
forecasting, for example: 3-D geographical grids are never fine enough 
for the resolution meteorologists would like, and the time required for 
each prediction step grows very rapidly for only a modest gain in 
precision.  By merely tuning a few parameters, these people can easily 
soak up nearly all the available cycles of the supercomputers given to 
them, and they do so without hesitation.  Moore's Law will never succeed 
at calming their hunger! :-)

-- 
François Pinard   http://pinard.progiciels-bpi.ca



Re: [R] FW: Large datasets in R

2006-07-18 Thread Berton Gunter
Or, more succinctly, "Pinard's Law":

The demands of ever more data always exceed the capabilities of ever better
hardware.

;-D

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
  

