Greg,

As far as I understand, SAS is probably more efficient at handling large data than S+/R. Do you have any idea why?
On 4/10/07, Greg Snow <[EMAIL PROTECTED]> wrote:
>
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of
> > Bi-Info (http://members.home.nl/bi-info)
> > Sent: Monday, April 09, 2007 4:23 PM
> > To: Gabor Grothendieck
> > Cc: Lorenzo Isella; r-help@stat.math.ethz.ch
> > Subject: Re: [R] Reasons to Use R
>
> [snip]
>
> > So what's the big deal about S using files instead of memory
> > like R? I don't get the point. Isn't there enough swap space
> > for S? (Who cares anyway: it works, doesn't it?) Or are there
> > any problems with S and large datasets? I don't get it. You use
> > them, Greg. So you might discuss that issue.
> >
> > Wilfred
>
> This is my understanding of the issue (not anything official).
>
> If you use up all the memory while in R, the OS will start swapping
> memory to disk, but the OS does not know which parts of memory correspond
> to which objects, so it is entirely possible that the chunk swapped to
> disk contains parts of several different data objects. When you need one
> of those objects again, everything has to be swapped back in. This is
> very inefficient.
>
> S-PLUS occasionally runs into the same problem, but since it does some
> of its own swapping to disk it can be more efficient by swapping single
> data objects (data frames, etc.). Also, since S-PLUS is already saving
> everything to disk, it does not actually need to do a full swap: it can
> simply notice that a particular data frame has not been used for a
> while, know that it is already saved on disk, and unload it from memory
> without having to write it to disk first.
>
> The g.data package for R provides some of this functionality, keeping
> data on disk until it is needed.
>
> The better approach for large data sets is to hold only some of the data
> in memory at a time and to automatically read just the parts that you
> need. So for big datasets it is recommended to store the actual data
> in a database and use one of the database connection packages to read
> in only the subset that you need. The SQLiteDF package for R is working
> on automating this process. The bigdata module for S-PLUS and the biglm
> package for R also have ways of doing some of the common analyses using
> chunks of the data at a time.
>
> This idea is not new. There was a program in the late 1970s and '80s
> called Rummage, by Del Scott (technically it still exists; I have a copy
> on a 5.25" floppy somewhere), that took the approach of specifying the
> model you wanted to fit first, then specifying the data file. Rummage
> would figure out which sufficient statistics were needed, read the data
> in chunks, compute the sufficient statistics on the fly, and never keep
> more than a couple of lines of the data in memory at once. Unfortunately
> it did not have much of a user interface, so once memory was cheap and
> datasets were only medium sized it did not compete well; I guess it was
> just a bit ahead of its time.
>
> Hope this helps,
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> [EMAIL PROTECTED]
> (801) 408-8111
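To make the database approach Greg describes concrete, here is a minimal sketch using RSQLite/DBI as one example of a database connection package (SQLiteDF aims to automate this further). The database file name, table name, and column names below are made up for illustration:

    library(RSQLite)
    ## connect to an on-disk database holding the full dataset
    ## ("bigdata.db" and the table "measurements" are hypothetical)
    con <- dbConnect(SQLite(), dbname = "bigdata.db")
    ## pull only the rows and columns actually needed into memory
    dat <- dbGetQuery(con,
        "SELECT y, x1, x2 FROM measurements WHERE year = 2006")
    fit <- lm(y ~ x1 + x2, data = dat)
    dbDisconnect(con)

The full table never enters R; only the query result does.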
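The biglm package works along the lines Greg mentions: fit the model on one chunk, then fold in further chunks, so only one chunk needs to be in memory at a time. A rough sketch, where chunk1 and chunk2 stand for hypothetical data frames holding pieces of the data:

    library(biglm)
    fit <- biglm(y ~ x1 + x2, data = chunk1)  # fit on the first chunk
    fit <- update(fit, chunk2)                # fold in the next chunk
    summary(fit)                              # based on all rows seen so far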
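And the Rummage idea of accumulating sufficient statistics chunk by chunk can be imitated in plain R. A sketch, assuming a headerless text file "big.csv" with columns y, x1, x2 (file and column names are made up):

    con <- file("big.csv", open = "r")
    xtx <- matrix(0, 3, 3)   # running X'X
    xty <- matrix(0, 3, 1)   # running X'y
    repeat {
      chunk <- try(read.csv(con, header = FALSE, nrows = 1000,
                            col.names = c("y", "x1", "x2")),
                   silent = TRUE)
      if (inherits(chunk, "try-error")) break  # no rows left to read
      X <- cbind(1, chunk$x1, chunk$x2)        # intercept + predictors
      xtx <- xtx + crossprod(X)
      xty <- xty + crossprod(X, chunk$y)
    }
    close(con)
    beta <- solve(xtx, xty)  # least-squares coefficients

This never holds more than 1000 rows in memory, at the cost of only getting results that can be built from X'X and X'y.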
--
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)