My thanks to Drs. Armstrong, Bates, Harrell, Liaw, Lumley, Prager, Schwartz, and Mr. Wang for their replies. I have pasted my original message and their replies below.
After viewing http://www.itl.nist.gov/div898/strd/ as suggested by Dr. Schwartz, it occurred to me that it might be educational to search for some data repositories on Google. I was able to find some, though I'm sure many of the R listserv readers are already aware of them:

http://kdd.ics.uci.edu/
http://www.ics.uci.edu/~mlearn/MLOther.html
http://www.ldeo.columbia.edu/datarep/
http://data.geocomm.com/
http://libraries.mit.edu/gis/data/repository.html
http://nssdc.gsfc.nasa.gov/

-david paul

-----Original Message-----

I am one of only 5 or 6 people in my organization making the effort to include R/Splus as an analysis tool in everyday work - the rest of my colleagues use SAS exclusively. Today, one of them asserted that the numerical algorithms in SAS are superior to those in Splus and R -- i.e., optimization routines are faster in SAS, the SAS Institute has teams of excellent numerical analysts that ensure its superiority to anything freely available, PROC NLMIXED is more flexible than nlme( ) in the sense that it allows a much wider array of error structures than can be used in R/Splus, etc.

I obviously do not subscribe to these views and would like to refute them, but I am not a numerical analyst and am still a novice at R/Splus. Do there exist refereed papers comparing the numerical capabilities of these platforms? If not, are there other resources I might look up and pass along to my colleagues?

---------------------------

This link might give you some insight, but SAS is not one of the packages benchmarked here.

http://www.sciviews.org/other/benchmark.htm

[Whit Armstrong]

---------------------------

I don't have papers comparing the numerical capabilities but I say bunk to your colleagues. The last time I looked, SAS still relies on the out of date Gauss-Jordan sweep operator in many key places, in place of the QR decomposition that R and S-Plus use in regression.
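To make the sweep-versus-QR contrast concrete, here is a minimal textbook sketch. It is not a reproduction of what SAS, R, or S-Plus actually do internally: the function names are made up for illustration, the data are an artificial near-collinear example, and modified Gram-Schmidt stands in for the Householder-based QR that production libraries typically use. The only point is that elimination methods such as the sweep operator work on the cross-products matrix X'X, which can lose information that orthogonalizing X directly retains.

```python
import math

def dot(u, v):
    """Plain left-to-right dot product of two equal-length lists."""
    total = 0.0
    for a, b in zip(u, v):
        total += a * b
    return total

def solve_normal_equations(cols, y):
    """Two-predictor least squares via the cross-products matrix X'X.

    Hypothetical helper: returns None when X'X is exactly singular
    in floating point, so no elimination step can proceed.
    """
    g11 = dot(cols[0], cols[0])
    g12 = dot(cols[0], cols[1])
    g22 = dot(cols[1], cols[1])
    h1, h2 = dot(cols[0], y), dot(cols[1], y)
    det = g11 * g22 - g12 * g12
    if det == 0.0:
        return None
    return [(g22 * h1 - g12 * h2) / det, (g11 * h2 - g12 * h1) / det]

def solve_qr_mgs(cols, y):
    """Least squares via QR, using modified Gram-Schmidt.

    The response y is orthogonalized alongside the columns of X
    (the 'augmented' form), which is the numerically safer variant.
    """
    n = len(cols)
    v = [list(c) for c in cols]          # working copies of the columns
    r = [[0.0] * n for _ in range(n)]    # upper-triangular factor R
    c = [0.0] * n                        # Q'y, built column by column
    resid = list(y)
    for j in range(n):
        r[j][j] = math.sqrt(dot(v[j], v[j]))
        q = [x / r[j][j] for x in v[j]]
        for k in range(j + 1, n):
            r[j][k] = dot(q, v[k])
            v[k] = [x - r[j][k] * z for x, z in zip(v[k], q)]
        c[j] = dot(q, resid)
        resid = [x - c[j] * z for x, z in zip(resid, q)]
    # Back substitution on R x = Q'y.
    x = [0.0] * n
    for j in reversed(range(n)):
        s = c[j]
        for k in range(j + 1, n):
            s -= r[j][k] * x[k]
        x[j] = s / r[j][j]
    return x

# Artificial near-collinear design (assumed example, not from the thread):
# two columns that differ only in entries of size 1e-8.
eps = 1e-8
cols = [[1.0, eps, 0.0], [1.0, 0.0, eps]]
y = [2.0, eps, eps]                      # exactly X times (1, 1)

x_normal = solve_normal_equations(cols, y)
x_qr = solve_qr_mgs(cols, y)
print("normal equations:", x_normal)     # None: X'X rounds to a singular matrix
print("QR (MGS):        ", x_qr)         # close to [1, 1]
```

In double precision the diagonal of X'X here is 1 + 1e-16, which rounds to exactly 1, so the cross-products matrix becomes singular before any solver touches it. Real packages add safeguards (centering, tolerances, pivoting), so this shows the mechanism rather than the behaviour of any particular product.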
And SAS being closed source makes it impossible to see how it really does calculations in some cases. See http://hesweb1.med.virginia.edu/biostat/s/doc/splus.pdf Section 1.6 for a comparison of S and SAS (though this doesn't address numerical reliability). Overall, SAS is about 11 years behind R and S-Plus in statistical capabilities (last year it was about 10 years behind) in my estimation.

Frank Harrell
SAS User, 1969-1991
---
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University

---------------------------

Too bad your colleagues weren't at the "State of Statistical Software" session at JSM. I was there. It was so packed that people ran out of standing room. The three speakers were all R advocates (Jan De Leeuw, Luke Tierney and Duncan Temple Lang). The most interesting thing (to me) about the session was that the discussant was a person from SAS (first name Wolfgang). I just had to hear what he'd say.

The SAS person essentially said that the numerical accuracy of R (probability functions, especially) is unmatched because the routines were written by authority figures in the area. (That's one advantage he said R has, but he also said that because the code is open, even SAS is looking at the R source, and that, to him, is a disadvantage. He obviously missed the point of open source.)

One of the criticisms he had for R, compared to SAS, is that R may not have undergone extensive QA tests. He said that SAS now probably has only a handful of PROC developers (not exactly the "team" your colleague imagined), but 5-6 times more software testers.

I think hearing it from the horse's mouth beats reading journal articles for this sort of thing.

There was a recent article in The American Statistician bashing the numerical instability and bad quality of RNG in JMP (a SAS product).
SAS posted a "white paper" on their web site refuting some of those claims (but they did change the RNG to Mersenne Twister in JMP5), comparing JMP with Excel and SAS. I must say that comparison isn't convincing, as neither Excel nor SAS can really be trusted as a gold standard.

Andy [Liaw]

---------------------------

In follow up to Frank's reply, allow me to point you to some additional papers and articles on numerical accuracy issues. I have not reviewed these in some time and they may be a bit dated relative to current versions. These do not cover R specifically, but do address S-Plus and SAS. This is not an exhaustive list by any means, but many of the papers do have other references that may be of value.

1. http://www.stat.uni-muenchen.de/~knuesel/elv/accuracy.html
2. http://www.amstat.org/publications/tas/mccull-1.pdf
3. http://www.amstat.org/publications/tas/mccull.pdf
4. http://www.npl.co.uk/ssfm/download/documents/cmsc06_00.pdf

Another option is that NIST has reference datasets available for comparison at:

http://www.itl.nist.gov/div898/strd/

These would allow you to conduct your own comparisons if you desire.

HTH,
Marc Schwartz
(Also a former SAS user)

---------------------------

I can't speak for the optimisation routines, but I have found this...

When I was doing my MSc thesis, using tree-based models and neural networks for classification, I discovered something interesting. The Tree Node in SAS Enterprise Miner (SAS EM) is far more efficient than the rpart package. Using the same (or at least very similar) parameter settings, SAS EM can produce a tree in about 1 minute while it would take rpart 5 ~ 6 minutes (same data, same machine....).

Having said that, I still prefer rpart as it can draw a beautiful tree, whereas it is very difficult to fit the graphical tree produced by SAS EM onto one A4 page -- in the end I had to use the text tree.

However, the Neural Network node in SAS EM is less efficient than nnet.
The time it takes to fit a neural network in R using nnet is much shorter....

Cheers,
Kevin [Wang]

---------------------------

I suspect it will be difficult to find the answer to your colleagues' assertions without doing your own studies. How important is it to you to settle this disagreement? One could always name the many leading statisticians who contribute to R, but I don't think that name-dropping settles anything.

Nonetheless, even if SAS were faster, that would be only part of the issue. As you know, R offers vastly better exploratory graphics, better graphics overall, far more flexible programming, user extensibility, and more natural programming access to the results of previous computations. So even if your colleagues were right in their assertions, they would be overlooking many capabilities of the S language that are not readily available in SAS.

IMO, SAS shines in its ability to read files in almost any format, to handle gigantic data sets without burping, and to produce formatted cross-tabulations and other highly structured text reports. However, if your colleagues work at all in data exploration, they are ignoring important tools by not exploring R or S-Plus.

Michael Prager, Ph.D.
NOAA Center for Coastal Fisheries and Habitat Research
Beaufort, North Carolina 28516
http://shrimp.ccfhrb.noaa.gov/~mprager/
DISCLAIMER: Opinions expressed are personal, not official. No government endorsement of any commercial product is made or implied.

---------------------------

Although they are out of date, there are some comparisons of accuracy in

McCullough, B. D. (1998), "Assessing the reliability of statistical software: Part I", The American Statistician, 52, 358-366.

McCullough, B. D. (1999), "Assessing the reliability of statistical software: Part II", The American Statistician, 53, 149-159.

Regarding PROC NLMIXED versus nlme, there are a lot of differences between them. I don't think that PROC NLMIXED will handle nested random effects while nlme does.
However, nlme assumes the underlying noise is Gaussian while PROC NLMIXED allows Gaussian or binomial or Poisson. PROC NLMIXED uses adaptive Gaussian quadrature to evaluate the marginal log-likelihood whereas nlme uses a less accurate evaluation but better parameterizations of the variance of the random effects. I think it would be difficult to declare one to be superior to the other.

[Douglas Bates]

---------------------------

While I don't subscribe to the general theory, they have a point about PROC NLMIXED. It does more accurate calculations for generalised linear mixed models than are currently available in R/S-PLUS, and for logistic random effects models the difference can sometimes be large enough to matter.

-Thomas [Lumley]

______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help