I think SAS has the database part built in. I have heard secondhand of new statisticians going to work for a company and asking whether it has SAS; the reply is "Yes, we use SAS for our database. Does it do statistics also?" I have also heard that SAS is no longer considered an acronym: they like having it be just a name, and don't want the fact that one of the S's used to stand for "Statistical" to scare away companies that use it as a database.
Maybe someone more up on SAS can confirm or deny this.

Another issue to always look at is central control versus ease of extensibility. If you have a program that is completely under your control and does one set of things, then extending it to a new model (big data) is fairly straightforward. R is at the opposite end of the spectrum, with many contributors and many techniques. Extending some basic pieces to be very efficient with big data could be done easily, but would break many other pieces. Getting all the different packages to conform to a single standard in a short amount of time would be nearly impossible.

With R's flexibility, there are probably some problems that can be done more quickly with proper use of biglm than with SAS, and I expect that with some more work and maturity the SQLiteDF package may start to rival SAS as well on certain problems.

While SAS is a useful program and great at certain things, there are some techniques that I would not even attempt in SAS that are fairly straightforward in R. (I remember seeing some SAS code to do a bootstrap that included a data step to read in and extract information from a SAS output file. <<SHUDDER>> SAS/ODS has improved this, but I would much rather bootstrap in R/S-PLUS than anything else.)

Remember, everything is better than everything else given the right comparison.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111

> -----Original Message-----
> From: Wensui Liu [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, April 10, 2007 3:26 PM
> To: Greg Snow
> Cc: Bi-Info (http://members.home.nl/bi-info); Gabor Grothendieck; Lorenzo Isella; r-help@stat.math.ethz.ch
> Subject: Re: [R] Reasons to Use R
>
> Greg,
> As far as I understand, SAS is probably more efficient at handling large data than S+/R. Do you have any idea why?
>
> On 4/10/07, Greg Snow <[EMAIL PROTECTED]> wrote:
> > > -----Original Message-----
> > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Bi-Info (http://members.home.nl/bi-info)
> > > Sent: Monday, April 09, 2007 4:23 PM
> > > To: Gabor Grothendieck
> > > Cc: Lorenzo Isella; r-help@stat.math.ethz.ch
> > > Subject: Re: [R] Reasons to Use R
> >
> > [snip]
> >
> > > So what's the big deal about S using files instead of memory like R. I don't get the point. Isn't there enough swap space for S? (Who cares anyway: it works, isn't it?) Or are there any problems with S and large datasets? I don't get it. You use them, Greg. So you might discuss that issue.
> > >
> > > Wilfred
> >
> > This is my understanding of the issue (not anything official).
> >
> > If you use up all the memory while in R, then the OS will start swapping memory to disk, but the OS does not know what parts of memory correspond to which objects, so it is entirely possible that the chunk swapped to disk contains parts of different data objects, so when you need one of those objects again, everything needs to be swapped back in. This is very inefficient.
> >
> > S-PLUS occasionally runs into the same problem, but since it does some of its own swapping to disk it can be more efficient by swapping single data objects (data frames, etc.). Also, since S-PLUS is already saving everything to disk, it does not actually need to do a full swap, it can just look and see that a particular data frame has not been used for a while, know that it is already saved on the disk, and unload it from memory without having to write it to disk first.
> >
> > The g.data package for R has some of this functionality of keeping data on the disk until needed.
> >
> > The better approach for large data sets is to have only some of the data in memory at a time and to automatically read just the parts that you need. So for big datasets it is recommended to have the actual data stored in a database and use one of the database connection packages to read in only the subset that you need. The SQLiteDF package for R is working on automating this process for R. Also, the bigdata module for S-PLUS and the biglm package for R have ways of doing some of the common analyses using chunks of data at a time. This idea is not new. There was a program in the late 1970s and 80s called Rummage by Del Scott (I guess technically it still exists, I have a copy on a 5.25" floppy somewhere) that used the approach of specifying the model you wanted to fit first, then specifying the data file. Rummage would then figure out which sufficient statistics were needed and read the data in chunks, compute the sufficient statistics on the fly, and not keep more than a couple of lines of the data in memory at once. Unfortunately it did not have much of a user interface, so when memory was cheap and datasets only medium sized it did not compete well. I guess it was just a bit too far ahead of its time.
> >
> > Hope this helps,
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > [EMAIL PROTECTED]
> > (801) 408-8111
> >
> > ______________________________________________
> > R-help@stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> --
> WenSui Liu
> A lousy statistician who happens to know a little programming
> (http://spaces.msn.com/statcompute/blog)
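
Addendum: the quoted message describes keeping the data in a database, reading it in chunks, and accumulating sufficient statistics on the fly. The following is only a minimal sketch of that idea, not the actual code of Rummage, biglm, or SQLiteDF. It assumes Python's standard sqlite3 module as a stand-in for a database connection package, a single-predictor regression y = a + b*x, and a made-up table called obs; the function names are hypothetical as well.

```python
import sqlite3


def chunked_simple_regression(conn, query, chunk_size=1000):
    """Fit y = a + b*x by streaming (x, y) rows from a database cursor.

    Only the running sums (the sufficient statistics) and one chunk of
    rows are held in memory at any time.
    """
    n = 0
    sx = sy = sxx = sxy = 0.0
    cur = conn.execute(query)
    while True:
        rows = cur.fetchmany(chunk_size)  # read one chunk of rows
        if not rows:
            break
        for x, y in rows:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    if n < 2:
        raise ValueError("need at least two rows")
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("x has no variation")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope, n


def demo():
    """Build a small in-memory table (illustrative data) and fit it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obs (x REAL, y REAL)")
    conn.executemany(
        "INSERT INTO obs VALUES (?, ?)",
        [(float(i), 3.0 + 2.0 * i) for i in range(10000)],
    )
    return chunked_simple_regression(conn, "SELECT x, y FROM obs", chunk_size=500)


if __name__ == "__main__":
    print(demo())
```

Raw sums like these can lose precision on badly scaled data, so a real implementation would likely use a more numerically stable update; the point here is only that memory use depends on the chunk size, not on the number of rows.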