Since R is supposed to be a complete programming language, I wonder why these tools couldn't be implemented in R (unless speed is the issue). Of course, it's a naive desire to have a single language that does everything, but it seems that R currently has most of the functions necessary to do the type of data cleaning described.
For instance, Gabor and Peter showed some snippets of ways to do this elegantly; my [physical science] data is often not as horrendously structured, so I can usually get away with a program containing this type of code:

    txtin <- scan(filename, what="", sep="\n")
    filteredList <- lapply(strsplit(txtin, delimiter), FUN=filterfunction)
    # filterfunction() returns selected (and possibly transformed)
    # elements if present and NULL otherwise
    # may include calls to grep(), regexpr(), gsub(), substring(), ...
    # nchar(), sscanf(), type.convert(), paste(), etc.
    mydataframe <- do.call(rbind, filteredList)
    # then match(), subset(), aggregate(), etc.

If the file is large, I open a file connection, scan a single line at a time, and apply filterfunction() successively in a for loop instead of using lapply(). Of course, the devil is in the details of the filtering function, but I believe most of the required text processing facilities are already provided by R.

I often have tasks that involve a combination of shell scripting and text processing to construct the data frame for analysis; I started out using Python+NumPy to do the front-end work but have been using R progressively more (frankly, for all of it) to take over that portion, since I generally prefer the data structures and methods in R.

--- Peter Dalgaard <[EMAIL PROTECTED]> wrote:

> Douglas Bates wrote:
> > Frank Harrell indicated that it is possible to do a lot of difficult
> > data transformation within R itself if you try hard enough but that
> > sometimes means working against the S language and its "whole object"
> > view to accomplish what you want and it can require knowledge of
> > subtle aspects of the S language.
>
> Actually, I think Frank's point was subtly different: It is *because* of
> the differences in view that it sometimes seems difficult to find the
> way to do something in R that is apparently straightforward in SAS.
> I.e. the solutions exist and are often elegant, but may require some
> lateral thinking.
>
> Case in point: Finding the first or the last observation for each
> subject when there are multiple records for each subject. The SAS way
> would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that
> you can compare the subject ID with the one from the previous record,
> working with data that are sorted appropriately.
>
> You can do the same thing in R with a for loop, but there are better
> ways e.g.
> subset(df, !duplicated(ID)), and subset(df, rev(!duplicated(rev(ID)))), or
> maybe
> do.call("rbind", lapply(split(df, df$ID), head, 1)), resp. tail. Or
> something involving aggregate(). (The latter approaches generalize
> better to other within-subject functionals like cumulative doses, etc.).
>
> The hardest cases that I know of are the ones where you need to turn one
> record into many, such as occurs in survival analysis with
> time-dependent, piecewise constant covariates. This may require
> "transposing the problem", i.e. for each interval you find out which
> subjects contribute and with what, whereas the SAS way would be a
> within-subject loop over intervals containing an OUTPUT statement.
>
> Also, there are some really weird data formats, where e.g. the input
> format is different in different records. Back in the 80's, when
> punched-card input was still common, it was quite popular to have one
> card with background information on a patient plus several cards
> detailing visits, and you'd get a stack of cards containing both kinds.
> In R you would most likely split on the card type using grep() and then
> read the two kinds separately and merge() them later.
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
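
The first/last-observation idioms quoted above can be tried on a toy data set. This is only a sketch: the data frame df and its columns ID, visit and dose are made up for illustration, and the cumulative-dose line uses ave(), which is one possible choice and not something named in the quoted message.

```r
# Hypothetical data: several records per subject, sorted by ID and visit
df <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 3),
  visit = c(1, 2, 3, 1, 2, 1),
  dose  = c(10, 20, 30, 5, 5, 40)
)

# First record per subject: duplicated() is FALSE for the first
# occurrence of each ID
first_obs <- subset(df, !duplicated(ID))

# Last record per subject: same idea applied to the reversed ID vector,
# then reversed back so the flags line up with the original rows
last_obs <- subset(df, rev(!duplicated(rev(ID))))

# The split/lapply/rbind variants of the same two selections
first_split <- do.call("rbind", lapply(split(df, df$ID), head, 1))
last_split  <- do.call("rbind", lapply(split(df, df$ID), tail, 1))

# A within-subject functional: running (cumulative) dose per subject.
# ave() is an assumption here, not part of the quoted message.
df$cumdose <- ave(df$dose, df$ID, FUN = cumsum)
```

The duplicated() versions depend only on the order of the rows, so the data must already be sorted within subject; the split() versions carry over more directly to other per-subject summaries.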
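
For completeness, here is a minimal runnable sketch of the scan/strsplit/filter pattern described in this post, in both the whole-file and the line-by-line form. Everything specific in it is an assumption made for illustration: the "key=value" input, the temporary file, and the body of filterfunction() are hypothetical, and readLines(con, n = 1) stands in for the single-line scan() on the connection.

```r
# Hypothetical input: "key=value" records mixed with a comment and a blank line
tmp <- tempfile(fileext = ".txt")
writeLines(c("# instrument log", "temp=21.5", "pressure=1013", "", "temp=22.1"),
           tmp)
delimiter <- "="

# Hypothetical filter: keep two-field records, return NULL for anything else
filterfunction <- function(fields) {
  if (length(fields) != 2) return(NULL)
  data.frame(name = fields[1], value = as.numeric(fields[2]),
             stringsAsFactors = FALSE)
}

# Whole-file version: read all lines, split, filter, bind.
# rbind() on data frames ignores the NULL entries.
txtin <- scan(tmp, what = "", sep = "\n", quiet = TRUE)
filteredList <- lapply(strsplit(txtin, delimiter), FUN = filterfunction)
mydataframe <- do.call(rbind, filteredList)

# Large-file version: one line at a time from an open connection
con <- file(tmp, open = "r")
rows <- list()
while (length(line <- readLines(con, n = 1)) > 0) {
  rec <- filterfunction(strsplit(line, delimiter)[[1]])
  if (!is.null(rec)) rows[[length(rows) + 1]] <- rec
}
close(con)
streamed <- do.call(rbind, rows)

# Downstream step: mean value per name
means <- aggregate(value ~ name, data = mydataframe, FUN = mean)

unlink(tmp)
```

Both versions should produce the same data frame; the loop trades speed for a bounded memory footprint, since only one line of raw text is held at a time.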