[R] FW: Large datasets in R

2006-07-18 Thread Marshall Feldman
Hi,

I have two further comments/questions about large datasets in R.
 
1. Does R's ability to handle large datasets depend on the operating
system's use of virtual memory? In theory, at least, VM should make the
difference between installed RAM and virtual memory on a hard drive
primarily a determinant of how fast R will calculate rather than whether or
not it can do the calculations. However, if R has some low-level routines
that have to be memory resident and use more memory as the amount of data
increases, this may not hold. Can someone shed light on this?

2. What 64-bit versions of R are available at present?

Marsh Feldman
The University of Rhode Island

-Original Message-
From: Thomas Lumley [mailto:[EMAIL PROTECTED] 
Sent: Monday, July 17, 2006 3:21 PM
To: Deepankar Basu
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] Large datasets in R

On Mon, 17 Jul 2006, Deepankar Basu wrote:

 Hi!

 I am a student of economics and currently do most of my statistical work
 using STATA. For various reasons (not least of which is an aversion for
 proprietary software), I am thinking of shifting to R. At the current
 juncture my concern is the following: would I be able to work on
 relatively large data-sets using R? For instance, I am currently working
 on a data-set which is about 350MB in size. Would it be possible to work with
 data-sets of such sizes using R?


The answer depends on a lot of things, but most importantly
1) What you are going to do with the data
2) Whether you have a 32-bit or 64-bit version of R
3) How much memory your computer has.

In a 32-bit version of R (where R will not be allowed to address more than 
2-3Gb of memory) an object of size 350Mb is large enough to cause problems 
(see e.g. the R Installation and Administration Guide).
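
[Editor's sketch, not part of the original message: a quick way to see how close a given object comes to these limits is to check its size and R's total memory use directly. The data frame below is a small stand-in for a real 350Mb data set.]

  ## a small stand-in data frame (imagine this is the 350Mb data set)
  dat <- data.frame(x = rnorm(1e5), y = rnorm(1e5))

  ## size of one object in the workspace
  print(object.size(dat), units = "Mb")

  ## total memory R is currently using, in Mb (sum of the "(Mb)" column of gc())
  sum(gc()[, 2])

Because fitting functions usually copy the data (model frames, model matrices, residuals), a rough working rule is to allow several times the size of the raw object, which is why 350Mb is already uncomfortable under a 2-3Gb address-space ceiling.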

If your 350Mb data set has lots of variables and you only use a few at a 
time then you may not have any trouble even on a 32-bit system once you 
have read in the data.
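
[Editor's sketch, not part of the original message: with read.table() the colClasses argument can skip unneeded columns at read time, so only the variables actually used ever occupy memory. The file name and column layout here are made up.]

  ## suppose the file has 10 columns but only columns 2 and 5 are needed;
  ## a colClasses entry of "NULL" tells read.table() to drop that column
  cc <- rep("NULL", 10)
  cc[c(2, 5)] <- "numeric"
  dat <- read.table("bigfile.txt", header = TRUE, colClasses = cc)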

If you have a 64-bit version of R and a few Gb of memory then there should 
be no real difficulty in working with that size of data set for most 
analyses.  You might come across some analyses (eg some cluster analysis 
functions) that use n^2 memory for n observations and so break down.
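
[Editor's note, not part of the original message: the n^2 point is easy to quantify. An n-by-n matrix of doubles needs about n^2 * 8 bytes, so for example:]

  n <- 50000
  n^2 * 8 / 2^30               # full n x n distance matrix: about 18.6 Gb
  n * (n - 1) / 2 * 8 / 2^30   # lower triangle, as stored by dist(): about 9.3 Gb

Either way this is hopeless in a 32-bit address space, whatever the size of the original data frame.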


-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] FW: Large datasets in R

2006-07-18 Thread Roger D. Peng
In my experience, the OS's use of virtual memory is only relevant in the rough 
sense that the OS can store *other* running applications in virtual memory so 
that R can use as much of the physical memory as possible.  Once R itself 
overflows into virtual memory it quickly becomes unusable.

I'm not sure I understand your second question.  As R is available in source 
code form, it can be compiled for many 64-bit operating systems.

-roger

Marshall Feldman wrote:
 Hi,
 
 I have two further comments/questions about large datasets in R.
  
 1. Does R's ability to handle large datasets depend on the operating
 system's use of virtual memory? In theory, at least, VM should make the
 difference between installed RAM and virtual memory on a hard drive
 primarily a determinant of how fast R will calculate rather than whether or
 not it can do the calculations. However, if R has some low-level routines
 that have to be memory resident and use more memory as the amount of data
 increases, this may not hold. Can someone shed light on this?
 
 2. What 64-bit versions of R are available at present?
 
   Marsh Feldman
   The University of Rhode Island
 
 -Original Message-
 From: Thomas Lumley [mailto:[EMAIL PROTECTED] 
 Sent: Monday, July 17, 2006 3:21 PM
 To: Deepankar Basu
 Cc: r-help@stat.math.ethz.ch
 Subject: Re: [R] Large datasets in R
 
 On Mon, 17 Jul 2006, Deepankar Basu wrote:
 
 Hi!

 I am a student of economics and currently do most of my statistical work
 using STATA. For various reasons (not least of which is an aversion for
 proprietary software), I am thinking of shifting to R. At the current
 juncture my concern is the following: would I be able to work on
 relatively large data-sets using R? For instance, I am currently working
 on a data-set which is about 350MB in size. Would it be possible to work with
 data-sets of such sizes using R?
 
 
 The answer depends on a lot of things, but most importantly
 1) What you are going to do with the data
 2) Whether you have a 32-bit or 64-bit version of R
 3) How much memory your computer has.
 
 In a 32-bit version of R (where R will not be allowed to address more than 
 2-3Gb of memory) an object of size 350Mb is large enough to cause problems 
 (see e.g. the R Installation and Administration Guide).
 
 If your 350Mb data set has lots of variables and you only use a few at a 
 time then you may not have any trouble even on a 32-bit system once you 
 have read in the data.
 
 If you have a 64-bit version of R and a few Gb of memory then there should 
 be no real difficulty in working with that size of data set for most 
 analyses.  You might come across some analyses (eg some cluster analysis 
 functions) that use n^2 memory for n observations and so break down.
 
 
   -thomas
 
 Thomas Lumley Assoc. Professor, Biostatistics
 [EMAIL PROTECTED] University of Washington, Seattle
 
 

-- 
Roger D. Peng  |  http://www.biostat.jhsph.edu/~rpeng/

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] FW: Large datasets in R

2006-07-18 Thread Prof Brian Ripley
On Tue, 18 Jul 2006, Marshall Feldman wrote:

 Hi,
 
 I have two further comments/questions about large datasets in R.
  
 1. Does R's ability to handle large datasets depend on the operating
 system's use of virtual memory? In theory, at least, VM should make the
 difference between installed RAM and virtual memory on a hard drive
 primarily a determinant of how fast R will calculate rather than whether or
 not it can do the calculations. However, if R has some low-level routines
 that have to be memory resident and use more memory as the amount of data
 increases, this may not hold. Can someone shed light on this?

The issue is address space, not RAM.  The limits Thomas mentions are on 
VM, not RAM, and it is common to have at least as much RAM installed as 
the VM address space for a user process.

There is no low-level code in R that has any idea if it is 
memory-resident, nor AFAIK is there any portable way to do so in a user 
process in a modern OS.  (R is as far as possible written to C99 and POSIX 
standards.)

 2. What 64-bit versions of R are available at present?

Any OS with a 64-bit CPU that you can find a viable 64-bit compiler suite 
for.  We've had 64-bit versions of R since the last millennium on Solaris, 
IRIX, HP-UX, OSF/1 and more recently on AIX, FreeBSD, Linux, MacOS X (on 
so-called G5) and probably others.

The exception is probably Windows, for which there is no known free 
`viable 64-bit compiler suite', but it is likely that there are commercial 
ones.
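
[Editor's note, not part of the original message: from inside a running session you can check whether a particular R build is 32- or 64-bit.]

  .Machine$sizeof.pointer   # 4 bytes = 32-bit build, 8 bytes = 64-bit build
  R.version$arch            # e.g. "x86_64" for a 64-bit build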
 

 
   Marsh Feldman
   The University of Rhode Island
 
 -Original Message-
 From: Thomas Lumley [mailto:[EMAIL PROTECTED] 
 Sent: Monday, July 17, 2006 3:21 PM
 To: Deepankar Basu
 Cc: r-help@stat.math.ethz.ch
 Subject: Re: [R] Large datasets in R
 
 On Mon, 17 Jul 2006, Deepankar Basu wrote:
 
  Hi!
 
  I am a student of economics and currently do most of my statistical work
  using STATA. For various reasons (not least of which is an aversion for
  proprietary software), I am thinking of shifting to R. At the current
  juncture my concern is the following: would I be able to work on
  relatively large data-sets using R? For instance, I am currently working
  on a data-set which is about 350MB in size. Would it be possible to work with
  data-sets of such sizes using R?
 
 
 The answer depends on a lot of things, but most importantly
 1) What you are going to do with the data
 2) Whether you have a 32-bit or 64-bit version of R
 3) How much memory your computer has.
 
 In a 32-bit version of R (where R will not be allowed to address more than 
 2-3Gb of memory) an object of size 350Mb is large enough to cause problems 
 (see e.g. the R Installation and Administration Guide).
 
 If your 350Mb data set has lots of variables and you only use a few at a 
 time then you may not have any trouble even on a 32-bit system once you 
 have read in the data.
 
 If you have a 64-bit version of R and a few Gb of memory then there should 
 be no real difficulty in working with that size of data set for most 
 analyses.  You might come across some analyses (eg some cluster analysis 
 functions) that use n^2 memory for n observations and so break down.
 
 
   -thomas
 
 Thomas Lumley Assoc. Professor, Biostatistics
 [EMAIL PROTECTED] University of Washington, Seattle
 
 

-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] FW: Large datasets in R

2006-07-18 Thread Ritwik Sinha
Hi,

I have a related question. How differently do other statistical
software packages handle large data?

The original post claims that 350 MB is fine on Stata. Someone
suggested S-Plus. I have heard people say that SAS can handle large
data sets. Why can the others do it while R seems to have a problem? Don't
these packages load the data into RAM?

-- 
Ritwik Sinha
Graduate Student
Epidemiology and Biostatistics
Case Western Reserve University

http://darwin.cwru.edu/~rsinha

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] FW: Large datasets in R

2006-07-18 Thread Gabor Grothendieck
S-Plus stores objects as files whereas R stores them in memory.
SAS was developed many years ago when optimizing computer
resources was more important than it is now.

On 7/18/06, Ritwik Sinha [EMAIL PROTECTED] wrote:
 Hi,

 I have a related question. How differently do other statistical
 software packages handle large data?

 The original post claims that 350 MB is fine on Stata. Someone
 suggested S-Plus. I have heard people say that SAS can handle large
 data sets. Why can the others do it while R seems to have a problem? Don't
 these packages load the data into RAM?

 --
 Ritwik Sinha
 Graduate Student
 Epidemiology and Biostatistics
 Case Western Reserve University

 http://darwin.cwru.edu/~rsinha



__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] FW: Large datasets in R

2006-07-18 Thread Thomas Lumley
On Tue, 18 Jul 2006, Ritwik Sinha wrote:

 Hi,

 I have a related question. How differently do other statistical
 software packages handle large data?

 The original post claims that 350 MB is fine on Stata. Someone
 suggested S-Plus. I have heard people say that SAS can handle large
 data sets. Why can the others do it while R seems to have a problem? Don't
 these packages load the data into RAM?


Stata does load the data into RAM and does have limits for the same reason 
that R does. However, Stata has a less flexible representation of its data 
(basically one rectangular dataset) and so it can handle somewhat larger 
data sets for any given memory size. For example, even with 512Mb of 
memory a 350Mb data set might be usable in Stata, and with 1Gb it would 
certainly be. Stata is also faster for a given memory load, apparently 
because of its simpler language design [some evidence for this is that the 
recent language additions to support flexible graphics run rather more 
slowly than eg lattice in R].

The other approach is to write the estimation routines so that only part 
of the data need be in memory at a given time.  *Some* procedures in SAS 
and SPSS work this way, and this is the idea of the S-PLUS 7.0 system for 
handling large data sets.  This approach requires the programmer to 
handle the reading of sections of data into memory, something that can 
only be automated to a limited extent.
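
[Editor's sketch, not part of the original message, of that idea in plain R: for least squares only the cross-products X'X and X'y are needed, and they can be accumulated while reading the file in blocks, so the whole data set is never in memory at once. The file name, column names (y, x1, x2) and block size are made up.]

  con <- file("bigfile.txt", open = "r")
  vars <- scan(con, what = "", nlines = 1, quiet = TRUE)  # column names from the header line
  XtX <- matrix(0, 3, 3)
  Xty <- numeric(3)
  repeat {
      block <- try(read.table(con, header = FALSE, nrows = 10000,
                              col.names = vars), silent = TRUE)
      if (inherits(block, "try-error")) break    # no lines left in the file
      X <- cbind(1, block$x1, block$x2)          # intercept plus two predictors
      XtX <- XtX + crossprod(X)
      Xty <- Xty + crossprod(X, block$y)
  }
  close(con)
  solve(XtX, Xty)   # the least-squares coefficients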

People have used R in this way, storing data in a database and reading it 
as required. There are also some efforts to provide facilities to support 
this sort of programming (such as the current project funded by Google 
Summer of Code:  http://tolstoy.newcastle.edu.au/R/devel/06/05/5525.html). 
One reason there isn't more of this is that relying on Moore's Law has 
worked very well over the years.
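
[Editor's sketch, not part of the original message: one concrete version of the database route uses the DBI/RSQLite packages, keeping the data in an SQLite file on disk and fetching it a block at a time. The database, table and column names are made up; any DBI-compliant back end works the same way.]

  library(RSQLite)
  con <- dbConnect(SQLite(), dbname = "survey.db")

  res <- dbSendQuery(con, "SELECT income, educ FROM survey")
  while (!dbHasCompleted(res)) {
      block <- fetch(res, n = 10000)   # 10,000 rows at a time
      ## ... update running totals / fit by blocks here ...
  }
  dbClearResult(res)
  dbDisconnect(con)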


  -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] FW: Large datasets in R

2006-07-18 Thread Marshall Feldman
Well, SPSS used to claim that all its algorithms dealt with only one case at
a time and therefore that it could handle very large files. I suppose a
large correlation matrix could cause it problems.

Marsh Feldman

-Original Message-
From: Ritwik Sinha [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 18, 2006 10:54 AM
To: Prof Brian Ripley
Cc: Marshall Feldman; r-help@stat.math.ethz.ch
Subject: Re: [R] FW: Large datasets in R

Hi,

I have a related question. How differently do other statistical
software packages handle large data?

The original post claims that 350 MB is fine on Stata. Someone
suggested S-Plus. I have heard people say that SAS can handle large
data sets. Why can the others do it while R seems to have a problem? Don't
these packages load the data into RAM?

-- 
Ritwik Sinha
Graduate Student
Epidemiology and Biostatistics
Case Western Reserve University

http://darwin.cwru.edu/~rsinha

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] FW: Large datasets in R

2006-07-18 Thread François Pinard
[Thomas Lumley]

People have used R in this way, storing data in a database and reading it 
as required. There are also some efforts to provide facilities to support 
this sort of programming (such as the current project funded by Google 
Summer of Code:  http://tolstoy.newcastle.edu.au/R/devel/06/05/5525.html). 

Interesting project indeed!  However, if R requires more swapping 
because arrays do not all fit in physical memory, crudely replacing 
swapping with database accesses is not necessarily going to buy 
a drastic speed improvement: the paging gets done in user space instead 
of being done in the kernel.

Long ago, while working on CDC mainframes (astonishing at the time but 
tiny by today's standards), there was a program able to invert or run 
simplexes on very big matrices.  I do not remember the name of the 
program, and never studied it more than superficially (I was in computer 
support for researchers, but not a researcher myself).  The program was 
documented as being extremely careful about organising accesses to rows and 
columns (or parts thereof) in such a way that real memory was best used.
In other words, at the core of this program was a very specialised paging 
system, cooperative with the problems it was meant to solve.

However, the source of this program was just plain huge (let's say, from 
memory, about three or four times the size of the optimizing FORTRAN 
compiler, which I already knew to be an impressive algorithmic 
undertaking).  So, rightly or wrongly, the prejudice stuck solidly with me 
at the time, if nothing else, that handling big arrays the right way, 
speed-wise, ought to be very difficult.

One reason there isn't more of this is that relying on Moore's Law has 
worked very well over the years.

On the other hand, the computational needs of scientific problems grow 
fairly quickly to the size of our ability to solve them.  Let me take 
weather forecasting as an example.  3-D geographical grids are never fine 
enough for the resolution meteorologists would like to get, and the time 
required for each prediction step grows very rapidly for only a modest 
gain in precision.  By merely tuning a few parameters, these 
people may easily pump nearly all the available cycles out of the 
supercomputers given to them, and they do so without hesitation.  
Moore's Law will never succeed in calming their starving hunger! :-).

-- 
François Pinard   http://pinard.progiciels-bpi.ca

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] FW: Large datasets in R

2006-07-18 Thread Berton Gunter
Or, more succinctly, Pinard's Law:

The demands of ever more data always exceed the capabilities of ever better
hardware.

;-D

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
  


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.