From: Douglas Bates

> On 4/10/07, Wensui Liu <[EMAIL PROTECTED]> wrote:
> > Greg,
> > As far as I understand, SAS is probably more efficient than S+/R at
> > handling large data. Do you have any idea why?
>
> SAS originated at a time when large data sets were stored on magnetic
> tape and the only reasonable way to process them was sequentially.
> Thus most statistics procedures in SAS act as filters, processing one
> record at a time and accumulating summary information. In the past
> SAS performed a least squares fit by accumulating the crossproduct of
> [X:y] and then using the sweep operator to reduce that matrix. With
> such an approach the number of observations does not affect the
> amount of storage required; adding observations just requires more
> time.
>
> This works fine (although there are numerical disadvantages to this
> approach - try mentioning the sweep operator to an expert in
> numerical linear algebra and you get a blank stare) as long as the
> operations that you wish to perform fit into this model.

For those who stared blankly at the above: the sweep operator is just a
fancier version of the good old Gaussian elimination...

Andy
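To expand on both remarks, below is a minimal base-R sketch of the
accumulate-and-sweep scheme Doug describes. The simulated chunks stand
in for blocks read sequentially from tape or disk, and sweep_op() is a
textbook formulation of the sweep step, not SAS's actual
implementation.

    ## One step of the sweep operator on pivot k of matrix A:
    ## scale the pivot row, eliminate it from the other rows.
    sweep_op <- function(A, k) {
      d <- A[k, k]
      A[k, ] <- A[k, ] / d
      for (i in seq_len(nrow(A))) {
        if (i != k) {
          b <- A[i, k]
          A[i, ] <- A[i, ] - b * A[k, ]
          A[i, k] <- -b / d
        }
      }
      A[k, k] <- 1 / d
      A
    }

    ## Accumulate the crossproduct of [1 : X : y] one chunk at a time,
    ## so storage never depends on the number of observations.
    set.seed(1)
    p <- 2
    S <- matrix(0, p + 2, p + 2)
    for (chunk in 1:50) {                     # 50 chunks stand in for a tape pass
      x <- matrix(rnorm(1000 * p), 1000, p)   # in practice: read the next block
      y <- 1 + x %*% c(2, -3) + rnorm(1000)
      S <- S + crossprod(cbind(1, x, y))      # t(Z) %*% Z, accumulated
    }
    for (k in 1:(p + 1)) S <- sweep_op(S, k)  # sweep out intercept + predictors
    S[1:(p + 1), p + 2]                       # coefficients: close to 1, 2, -3
    S[p + 2, p + 2]                           # residual sum of squares

The numerical disadvantage Doug alludes to is that forming X'X squares
the condition number of the problem; R's lm() instead takes a QR
decomposition of the full model matrix.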
> Making the desired operations fit into the filter model is the
> primary reason for the awkwardness of many SAS analyses.
>
> The emphasis in R is on flexibility and the use of good numerical
> techniques - not on processing large data sets sequentially. The
> algorithms used in R for most least squares fits generate and analyze
> the complete model matrix instead of summary quantities. (The
> algorithms in the biglm package are a compromise that works on
> horizontal sections of the model matrix.)
>
> If your only criterion for comparison is the ability to work with
> very large data sets, performing operations that fit into the filter
> model used by SAS, then SAS will be the better choice. However, you
> lock yourself into a certain set of operations, and you do so to save
> memory - a commodity whose price drops very rapidly.
>
> As mentioned in other replies, for many years the majority of SAS use
> has been for data manipulation rather than for statistical analysis,
> so the filter model has been modified in later versions.
>
> > On 4/10/07, Greg Snow <[EMAIL PROTECTED]> wrote:
> > > > -----Original Message-----
> > > > From: [EMAIL PROTECTED]
> > > > [mailto:[EMAIL PROTECTED] On Behalf Of Bi-Info
> > > > (http://members.home.nl/bi-info)
> > > > Sent: Monday, April 09, 2007 4:23 PM
> > > > To: Gabor Grothendieck
> > > > Cc: Lorenzo Isella; r-help@stat.math.ethz.ch
> > > > Subject: Re: [R] Reasons to Use R
> > >
> > > [snip]
> > >
> > > > So what's the big deal about S using files instead of memory
> > > > like R? I don't get the point. Isn't there enough swap space
> > > > for S? (Who cares anyway: it works, doesn't it?) Or are there
> > > > any problems with S and large datasets? I don't get it. You use
> > > > them, Greg, so you might discuss that issue.
> > > >
> > > > Wilfred
> > >
> > > This is my understanding of the issue (not anything official).
> > >
> > > If you use up all the memory while in R, the OS will start
> > > swapping memory to disk, but the OS does not know which parts of
> > > memory correspond to which objects, so it is entirely possible
> > > that a chunk swapped to disk contains parts of several different
> > > data objects; when you need one of those objects again, everything
> > > has to be swapped back in. This is very inefficient.
> > >
> > > S-PLUS occasionally runs into the same problem, but since it does
> > > some of its own swapping to disk it can be more efficient,
> > > swapping single data objects (data frames, etc.). Also, since
> > > S-PLUS is already saving everything to disk, it does not actually
> > > need to do a full swap: it can simply notice that a particular
> > > data frame has not been used for a while, know that it is already
> > > saved on the disk, and unload it from memory without having to
> > > write it out first.
> > >
> > > The g.data package for R has some of this functionality of
> > > keeping data on disk until needed.
> > >
> > > The better approach for large data sets is to keep only some of
> > > the data in memory at a time and to automatically read just the
> > > parts that you need.
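Here is a sketch of that chunked strategy using the biglm package
mentioned above. The file name and column names are made up, and the
reading loop is only one of several possibilities; the point is that
just one block of rows is ever in memory.

    ## Fit y ~ x1 + x2 from a (hypothetical) big.csv in 10,000-row blocks.
    library(biglm)

    con <- file("big.csv", open = "r")
    first <- read.csv(con, nrows = 10000)        # first block, with header
    fit <- biglm(y ~ x1 + x2, data = first)
    repeat {
      block <- tryCatch(
        read.csv(con, header = FALSE, nrows = 10000,
                 col.names = names(first)),
        error = function(e) NULL)                # reading past EOF errors out
      if (is.null(block)) break
      fit <- update(fit, block)                  # fold the block into the fit
    }
    close(con)
    coef(fit)

The same pattern works when the blocks come from a database query
rather than a flat file, which connects to the advice below.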
> > > So for big datasets the recommendation is to store the actual
> > > data in a database and use one of the database connection
> > > packages to read in only the subset that you need. The SQLiteDF
> > > package for R is working on automating this process. The bigdata
> > > module for S-PLUS and the biglm package for R also have ways of
> > > doing some of the common analyses using chunks of the data at a
> > > time.
> > >
> > > This idea is not new. There was a program in the late 1970s and
> > > 80s called Rummage, by Del Scott (I guess technically it still
> > > exists - I have a copy on a 5.25" floppy somewhere), that took
> > > the approach of specifying the model you wanted to fit first and
> > > then specifying the data file. Rummage would figure out which
> > > sufficient statistics were needed, read the data in chunks,
> > > compute the sufficient statistics on the fly, and never keep more
> > > than a couple of lines of the data in memory at once.
> > > Unfortunately it did not have much of a user interface, so once
> > > memory became cheap and datasets were only medium sized it did
> > > not compete well; I guess it was just a bit ahead of its time.
> > >
> > > Hope this helps,
> > >
> > > --
> > > Gregory (Greg) L. Snow Ph.D.
> > > Statistical Data Center
> > > Intermountain Healthcare
> > > [EMAIL PROTECTED]
> > > (801) 408-8111
> >
> > --
> > WenSui Liu
> > A lousy statistician who happens to know a little programming
> > (http://spaces.msn.com/statcompute/blog)
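To make Greg's database suggestion concrete, here is a minimal sketch
with the DBI and RSQLite packages. The file, table, and column names
are all hypothetical; the point is that only the selected subset ever
reaches R's memory.

    library(DBI)
    library(RSQLite)

    con <- dbConnect(SQLite(), "big.sqlite")   # full data set lives on disk
    sub <- dbGetQuery(con,                     # only this subset is loaded
      "SELECT y, x1, x2 FROM measurements WHERE year = 2006")
    fit <- lm(y ~ x1 + x2, data = sub)         # ordinary in-memory fit
    dbDisconnect(con)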