I mentioned exactly the same thing to others, and the feedback I got was: 'when you have a hammer, everything looks like a nail' ^_^.

On 6/7/07, Frank E Harrell Jr <[EMAIL PROTECTED]> wrote:
> Robert Wilkins wrote:
> > As noted on the R-project web site itself ( www.r-project.org ->
> > Manuals -> R Data Import/Export ), it can be cumbersome to prepare
> > messy and dirty data for analysis with the R tool itself. I've also
> > seen at least one S programming book (one of the yellow Springer ones)
> > that says, more briefly, the same thing.
> > The R Data Import/Export page recommends examples using SAS, Perl,
> > Python, and Java. It takes a bit of courage to say that ( when you go
> > to a corporate software web site, you'll never see a page saying "This
> > is the type of problem that our product is not the best at, here's
> > what we suggest instead" ). I'd like to provide a few more
> > suggestions, especially for volunteers who are willing to evaluate new
> > candidates.
> >
> > SAS is fine if you're not paying for the license out of your own
> > pocket. But maybe one reason you're using R is you don't have
> > thousands of spare dollars.
> > Using Java for data cleaning is an exercise in sado-masochism, Java
> > has a learning curve (almost) as difficult as C++.
> >
> > There are different types of data transformation, and for some data
> > preparation problems an all-purpose programming language is a good
> > choice ( e.g. Perl , or maybe Python/Ruby ). Perl, for example, has
> > excellent regular expression facilities.
> >
> > However, for some types of complex demanding data preparation
> > problems, an all-purpose programming language is a poor choice. For
> > example: cleaning up and preparing clinical lab data and adverse event
> > data - you could do it in Perl, but it would take way, way too much
> > time. A specialized programming language is needed. And since data
> > transformation is quite different from data query, SQL is not the
> > ideal solution either.
>
> We deal with exactly those kinds of data solely using R.  R is
> exceptionally powerful for data manipulation, just a bit hard to learn.
> Many examples are at
> http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf
>
> Frank
>
> >
> > There are only three statistical programming languages that are
> > well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more
> > popular than S for data cleaning.
> >
> > If you're an R user with difficult data preparation problems, frankly
> > you are out of luck, because the products I'm about to mention are
> > new, unknown, and therefore regarded as immature. And while the
> > founders of these products would be very happy if you kicked the
> > tires, most people don't like to look at brand new products. Most
> > innovators and inventors don't realize this, I've learned it the hard
> > way.
> >
> > But if you are a volunteer who likes to help out by evaluating,
> > comparing, and reporting upon new candidates, well you could certainly
> > help out R users and the developers of the products by kicking the
> > tires of these products. And there is a huge need for such volunteers.
> >
> > 1. DAP
> > This is an open source implementation of SAS.
> > The founder: Susan Bassein
> > Find it at: directory.fsf.org/math/stats (GNU GPL)
> >
> > 2. PSPP
> > This is an open source implementation of SPSS.
> > The relatively early version number might not give a good idea of how
> > mature the data transformation features are, it reflects the fact
> > that he has only started doing the statistical tests.
> > The founder: Ben Pfaff, either a grad student or professor at
> > Stanford CS dept.
> > Also at : directory.fsf.org/math/stats (GNU GPL)
> >
> > 3. Vilno
> > This uses a programming language similar to SPSS and SAS, but quite
> > unlike S.
> > Essentially, it's a substitute for the SAS datastep, and also
> > transposes data and calculates averages and such. (No t-tests or
> > regressions in this version). I created this, during the years
> > 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in
> > my opinion. The tarball includes about 100 or so test cases used for
> > debugging - for logical calculation errors, but not for extremely high
> > volumes of data.
> > The maintenance of Vilno has slowed down, because I am currently
> > (desperately) looking for employment. But once I've found new
> > employment and living quarters and settled in, I will continue to
> > enhance Vilno in my spare time.
> > The founder: that would be me, Robert Wilkins
> > Find it at: code.google.com/p/vilno ( GNU GPL )
> > ( In particular, the tarball at code.google.com/p/vilno/downloads/list
> > , since I have yet to figure out how to use Subversion ).
> >
> >
> > 4. Who knows?
> > It was not easy to find out about the existence of DAP and PSPP. So
> > who knows what else is out there. However, I think you'll find a lot
> > more statistics software ( regression , etc ) out there, and not so
> > much data transformation software. Not many people work on data
> > preparation software. In fact, the category is so obscure that there
> > isn't one agreed term: data cleaning , data munging , data crunching ,
> > or just getting the data ready for analysis.
> >
> > ______________________________________________
> > R-help@stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
> --
> Frank E Harrell Jr   Professor and Chair           School of Medicine
>                      Department of Biostatistics   Vanderbilt University

--
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)