Robert Wilkins wrote: > As noted on the R-project web site itself ( www.r-project.org -> > Manuals -> R Data Import/Export ), it can be cumbersome to prepare > messy and dirty data for analysis with the R tool itself. I've also > seen at least one S programming book (one of the yellow Springer ones) > that says, more briefly, the same thing. > The R Data Import/Export page recommends examples using SAS, Perl, > Python, and Java. It takes a bit of courage to say that ( when you go > to a corporate software web site, you'll never see a page saying "This > is the type of problem that our product is not the best at, here's > what we suggest instead" ). I'd like to provide a few more > suggestions, especially for volunteers who are willing to evaluate new > candidates. > > SAS is fine if you're not paying for the license out of your own > pocket. But maybe one reason you're using R is you don't have > thousands of spare dollars. > Using Java for data cleaning is an exercise in sado-masochism, Java > has a learning curve (almost) as difficult as C++. > > There are different types of data transformation, and for some data > preparation problems an all-purpose programming language is a good > choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has > excellent regular expression facilities. > > However, for some types of complex demanding data preparation > problems, an all-purpose programming language is a poor choice. For > example: cleaning up and preparing clinical lab data and adverse event > data - you could do it in Perl, but it would take way, way too much > time. A specialized programming language is needed. And since data > transformation is quite different from data query, SQL is not the > ideal solution either.
We deal with exactly those kinds of data solely using R. R is exceptionally powerful for data manipulation, just a bit hard to learn. Many examples are at http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf Frank > > There are only three statistical programming languages that are > well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more > popular than S for data cleaning. > > If you're an R user with difficult data preparation problems, frankly > you are out of luck, because the products I'm about to mention are > new, unknown, and therefore regarded as immature. And while the > founders of these products would be very happy if you kicked the > tires, most people don't like to look at brand new products. Most > innovators and inventers don't realize this, I've learned it the hard > way. > > But if you are a volunteer who likes to help out by evaluating, > comparing, and reporting upon new candidates, well you could certainly > help out R users and the developers of the products by kicking the > tires of these products. And there is a huge need for such volunteers. > > 1. DAP > This is an open source implementation of SAS. > The founder: Susan Bassein > Find it at: directory.fsf.org/math/stats (GNU GPL) > > 2. PSPP > This is an open source implementation of SPSS. > The relatively early version number might not give a good idea of how > mature the > data transformation features are, it reflects the fact that he has > only started doing the statistical tests. > The founder: Ben Pfaff, either a grad student or professor at Stanford CS > dept. > Also at : directory.fsf.org/math/stats (GNU GPL) > > 3. Vilno > This uses a programming language similar to SPSS and SAS, but quite unlike S. > Essentially, it's a substitute for the SAS datastep, and also > transposes data and calculates averages and such. (No t-tests or > regressions in this version). I created this, during the years > 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in > my opinion. The tarball includes about 100 or so test cases used for > debugging - for logical calculation errors, but not for extremely high > volumes of data. > The maintenance of Vilno has slowed down, because I am currently > (desparately) looking for employment. But once I've found new > employment and living quarters and settled in, I will continue to > enhance Vilno in my spare time. > The founder: that would be me, Robert Wilkins > Find it at: code.google.com/p/vilno ( GNU GPL ) > ( In particular, the tarball at code.google.com/p/vilno/downloads/list > , since I have yet to figure out how to use Subversion ). > > > 4. Who knows? > It was not easy to find out about the existence of DAP and PSPP. So > who knows what else is out there. However, I think you'll find a lot > more statistics software ( regression , etc ) out there, and not so > much data transformation software. Not many people work on data > preparation software. In fact, the category is so obscure that there > isn't one agreed term: data cleaning , data munging , data crunching , > or just getting the data ready for analysis. > > ______________________________________________ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.