[R] A really simple data manipulation example
In response to those who asked for a better explanation of what the Vilno software does, here's a simple example that gives some idea of what it does.

LABRESULTS is a dataset with multiple rows per patient, with lab sodium measurements. It has columns: PATIENT_ID, VISIT_NUM, and SODIUM.

DEMO is a dataset with one row per patient, with demographic data. It has columns: PATIENT_ID, GENDER.

The following paragraph of code is a data processing function (dpf):

    inlist LABRESULTS DEMO ;
    mergeby PATIENT_ID ;
    if (SODIUM == -9) SODIUM = NULL ;
    if (VISIT_NUM != 2) deleterow ;
    select AVERAGE_SODIUM = avg(SODIUM) by GENDER ;
    sendoff(RESULTS_DATASET) GENDER AVERAGE_SODIUM ;
    turnoff;  // just means end-of-paragraph, version 1.0 won't need this.

Can you guess what it does?

The lab result rows are merged with the demographic rows, just to get the gender information merged in. Obviously, they are merged by patient. The code -9 is used to denote "missing", so convert that to NULL. I'm about to take a statistic for visit 2, so rows for any other visit (0 or 1 here) must be deleted. I'm assuming that, for visit 2, each patient has at most one row. Now, for each sex group, take the average sodium level. After the select statement, I have just two rows, for male and female, with the average sodium level in the AVERAGE_SODIUM column. The sendoff statement then just stores the current data table into a datafile called RESULTS_DATASET.

So you have a sequence of data tables, each calculation reading in the current table and leaving a new data table for the next calculation. In other words: input datasets, a bunch of intermediate calculations, and one or more output datasets. Pretty simple idea.

Some caveats:

LABRESULTS and DEMO are binary datasets. The asciitobinary and binarytoascii statements are used to convert between binary datasets and comma-separated ascii data files. (You can use any delimiter: comma, vertical bar, etc.)
An asciitobinary statement is typically just two lines of code.

The dpf begins with the inlist statement and, for the moment, needs "turnoff ;" as the last line. Version 1.0 won't require the use of "turnoff;", but version 0.85 does. It only means that this paragraph of code ends here (a program can, of course, contain many paragraphs: data processing functions, print statements, asciitobinary statements, etc.).

If you've worked with lab data, you know lab data does not look so simplistic. I need a simple example.

Vilno has a lot of functionality: many-to-many joins, adding columns, firstrow() and lastrow() flags, and so forth. A fair number of complex data manipulations have already been tested with test programs (in the tarball). Of course a simple example cannot show you that; it's just a small taste.

If you've never used SPSS or SAS before, you won't care, but this programming language falls in the same family as the SPSS and SAS programming languages. All three programming languages have a fair amount in common, but are quite different from the S programming language. The Vilno data processing function can replace the SAS datastep. (It can also replace PROC TRANSPOSE and much of PROC MEANS, except that standard deviation calculations still need to be added to the select statement.)

I hope that helps.

http://code.google.com/p/vilno

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
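For readers who do not know the Vilno syntax, the sodium dpf from the post above can be approximated in plain Python. This is only a sketch of the same sequence of steps, not how Vilno works internally. The sample rows are invented, and two behaviours are assumptions rather than anything stated in the post: that avg() skips NULL values, and that every lab row has a matching DEMO row.

```python
from collections import defaultdict

# Invented sample rows; -9 is the "missing" code described in the post.
LABRESULTS = [
    {"PATIENT_ID": 1, "VISIT_NUM": 1, "SODIUM": 139},
    {"PATIENT_ID": 1, "VISIT_NUM": 2, "SODIUM": 141},
    {"PATIENT_ID": 2, "VISIT_NUM": 2, "SODIUM": -9},
    {"PATIENT_ID": 3, "VISIT_NUM": 2, "SODIUM": 137},
    {"PATIENT_ID": 4, "VISIT_NUM": 2, "SODIUM": 143},
]
DEMO = [
    {"PATIENT_ID": 1, "GENDER": "F"},
    {"PATIENT_ID": 2, "GENDER": "F"},
    {"PATIENT_ID": 3, "GENDER": "M"},
    {"PATIENT_ID": 4, "GENDER": "M"},
]


def average_sodium_by_gender(labresults, demo):
    # mergeby PATIENT_ID: DEMO has one row per patient, so a lookup table
    # is enough (assumption: every lab row has a matching DEMO row).
    gender_of = {row["PATIENT_ID"]: row["GENDER"] for row in demo}
    sodium_by_gender = defaultdict(list)
    for row in labresults:
        # if (SODIUM == -9) SODIUM = NULL ;
        sodium = None if row["SODIUM"] == -9 else row["SODIUM"]
        # if (VISIT_NUM != 2) deleterow ;
        if row["VISIT_NUM"] != 2:
            continue
        group = sodium_by_gender[gender_of[row["PATIENT_ID"]]]
        if sodium is not None:  # assumption: avg() skips NULL values
            group.append(sodium)
    # select AVERAGE_SODIUM = avg(SODIUM) by GENDER ;
    return [
        {"GENDER": gender, "AVERAGE_SODIUM": sum(vals) / len(vals) if vals else None}
        for gender, vals in sorted(sodium_by_gender.items())
    ]


# sendoff(RESULTS_DATASET) GENDER AVERAGE_SODIUM ;
RESULTS_DATASET = average_sodium_by_gender(LABRESULTS, DEMO)
for row in RESULTS_DATASET:
    print(row)
```

The Python version needs an explicit lookup table, an explicit loop and explicit missing-value handling, which is roughly what the seven-line dpf hides.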
[R] Has anyone tried out my software?
Hello all,

Has anyone (who uses a Linux desktop) tried out the software I mentioned a few weeks ago? Perhaps installed it and run a couple of example programs? If you have, tell me what you think.

Robert

(It's the tarball in the download section at http://code.google.com/p/vilno, discussed briefly in comparison to Awk a couple of weeks ago.)

PS: Of R users, how many use Windows XP, how many use an Apple, and how many use a Linux desktop? Are there a lot of Linux users out there? Is R more popular in Europe than North America? I'll need to do a statistical analysis of the mailing list. I notice a ton of Europeans.
Re: [R] Tools For Preparing Data For Analysis
[ Arrggh, not reply, but reply to all, cross my fingers again, sorry Peter! ]

Hmm, I don't think you need a retain statement.

    if first.patientID ;

or

    if last.patientID ;

ought to do it. It's actually better than the Vilno version, I must admit, a bit more concise:

    if ( not firstrow(patientID) ) deleterow ;

Ah well.

For the folks asking for the location of the software (I know I posted it, but it didn't connect to the thread, and you get a huge number of posts each day, sorry):

Vilno: http://code.google.com/p/vilno
DAP & PSPP: http://directory.fsf.org/math/stats
Awk: lots of places, for example http://www.gnu.org/software/gawk/gawk.html

Anything else? DAP & PSPP are hard to find, so I'm sure there's more out there! What about MDX? Nahh, not really the right problem domain. Nobody uses MDX for this stuff.

If my examples using clinical trial data are boring and hard to understand for those who asked for examples (and presumably don't work in clinical trials), let me know. Some of the other examples I'm reading about are quite interesting. It doesn't help that clinical trial databases cannot be public. Making a fake database would take a lot of time.

The irony is, even with my deep understanding of data preparation in clinical trials, the pharmas still don't want to give me a job (because I was gone for many years).

Let's see if this post works: thanks to the folks who gave me advice on how to properly respond to a post within a thread. (Although the thread in my gmail account is only a subset of the posts visible in the archives.) Crossing my fingers.

On 6/10/07, Peter Dalgaard <[EMAIL PROTECTED]> wrote:
> Douglas Bates wrote:
> > Frank Harrell indicated that it is possible to do a lot of difficult
> > data transformation within R itself if you try hard enough but that
> > sometimes means working against the S language and its "whole object"
> > view to accomplish what you want and it can require knowledge of
> > subtle aspects of the S language.
> >
> Actually, I think Frank's point was subtly different: It is *because* of
> the differences in view that it sometimes seems difficult to find the
> way to do something in R that is apparently straightforward in SAS.
> I.e. the solutions exist and are often elegant, but may require some
> lateral thinking.
>
> Case in point: Finding the first or the last observation for each
> subject when there are multiple records for each subject. The SAS way
> would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that
> you can compare the subject ID with the one from the previous record,
> working with data that are sorted appropriately.
>
> You can do the same thing in R with a for loop, but there are better
> ways e.g.
> subset(df,!duplicated(ID)), and subset(df, rev(!duplicated(rev(ID)))), or
> maybe
> do.call("rbind",lapply(split(df,df$ID), head, 1)), resp. tail. Or
> something involving aggregate(). (The latter approaches generalize
> better to other within-subject functionals like cumulative doses, etc.).
>
> The hardest cases that I know of are the ones where you need to turn one
> record into many, such as occurs in survival analysis with
> time-dependent, piecewise constant covariates. This may require
> "transposing the problem", i.e. for each interval you find out which
> subjects contribute and with what, whereas the SAS way would be a
> within-subject loop over intervals containing an OUTPUT statement.
>
> Also, there are some really weird data formats, where e.g. the input
> format is different in different records. Back in the 80's where
> punched-card input was still common, it was quite popular to have one
> card with background information on a patient plus several cards
> detailing visits, and you'd get a stack of cards containing both kinds.
> In R you would most likely split on the card type using grep() and then
> read the two kinds separately and merge() them later.
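The "first or last observation per subject" task discussed in this exchange can also be sketched in plain Python, mirroring the !duplicated(ID) idea from the quoted R code. The function names first_rows and last_rows and the sample rows are made up for illustration; they are not Vilno, SAS or R functions.

```python
def first_rows(rows, key):
    """Keep the first row seen for each value of `key`, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept


def last_rows(rows, key):
    """Keep the last row for each value of `key` (same trick, run backwards)."""
    return list(reversed(first_rows(list(reversed(rows)), key)))


# Invented sample data: several visits per patient.
VISITS = [
    {"patientID": 1, "visit": 0},
    {"patientID": 1, "visit": 1},
    {"patientID": 1, "visit": 2},
    {"patientID": 2, "visit": 0},
    {"patientID": 2, "visit": 1},
]

print(first_rows(VISITS, "patientID"))
print(last_rows(VISITS, "patientID"))
```

Unlike the SAS first./last. flags, this version does not require sorted input. It keeps the first (or last) row in whatever order the rows arrive, so sorting is still the caller's job if "first" is supposed to mean "earliest visit".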
[R] Where to Find Data Transformation Software
Hello All,

Here is the requested information. Most of it was in the original post for the "Tools For Preparing Data For Analysis" thread from last week, but it got overlooked. All of these are released under an open source license. Check 'em out!

Vilno: data transformation software that reads in input datasets (rows and columns), crunches through the data, and writes out output datasets. It's an open source application that can replace the SAS datastep (and also replaces proc transpose and proc means).
Find it at: http://code.google.com/p/vilno
(Look in the download section for a tarball. It's a Linux application, and can be opened up (and maybe installed) on an Apple as well.)

DAP and PSPP: open source implementations of SAS and SPSS.
Find them at: http://directory.fsf.org/math/stats

Awk: data transformation/filtering software for semi-structured ASCII files. A predecessor to Perl.
Find it at: a lot of places, but try http://www.gnu.org/software/gawk/gawk.html

Some, but not all, data crunching problems can be handled fairly well by an all-purpose programming language, such as Perl or Python or Ruby. Some, but not all, data crunching problems can be handled reasonably well with the S programming language (i.e., R).
[R] Difficulties With Posting To Ongoing Threads on the R Mailing List
A number of people are having the same problem as me: when you post a response to an ongoing thread, the following message appears in place of your message:

    An embedded & charset-unspecified text was scrubbed ...

and a link is given that leads to the desired message.

It's better than nothing, but it sure is annoying, and some readers will skip it instead of following the extra link. It's also annoying to read a thread when several posters, through no fault of their own, get "scrubbed".

I always think of an e-mail as pure ASCII text, unless you add an attachment. Is it possible that some e-mail hosts (I use gmail) embed binary code into the e-mail? Maybe the R mailing list software is reacting to that.

On another note, I tried posting on gmane, to add to the thread from last week. It just disappeared, or maybe not, I don't know. Maybe it's related to the one-time registration requirement for gmane.

As far as I can tell, the above problem (scrubbing) does not occur when you do a stand-alone post, as opposed to a response to an ongoing thread. Hope it stays that way!

Have a nice day.
[R] Awk and Vilno
In clinical trial data preparation and many other data situations, the statistical programmer needs to merge and re-merge multiple input files countless times. A syntax for merging files that is clear and concise is very important for the statistical programmer's productivity.

Here is how Vilno does it:

    inlist dataset1 dataset2 dataset3 ;
    joinby variable1 variable2 where ( var3<=var4 ) ;

Each column in a dataset has a variable name (variable1, variable2, var3, var4). You are merging three input datafiles: dataset1, dataset2, and dataset3. The joinby statement asks for a many-to-many join, rather like the SQL SELECT statement.

[ The mergeby statement asks for a many-to-one join, which is more efficient. ]
[ The readby statement asks for interleaving of rows: the rows don't "match up", but one row goes under the preceding row (100 rows + 100 rows -> 200 output rows). ]

The join (or merge) is done within variable1*variable2 subgroups: a row from dataset1 where variable1=4 and variable2="Sam" can only match a row from dataset2 where variable1=4 and variable2="Sam". Any match-ups where var3<=var4 does not hold are also excluded.

Here's how the SAS datastep will do it:

    merge dataset1 dataset2 dataset3 ;
    by variable1 variable2 ;
    if ^( var3<=var4 ) then delete ;

[ Actually, the SAS datastep can only do a many-to-one join, but you can use a PROC SQL paragraph to run an SQL SELECT statement, then pass the results to a SAS datastep afterwards. ]

The point is: there are lots of data preparation scenarios where large numbers of merges need to be done. This is an example where Vilno and SAS are easier to use than the competition. I'm sure an Awk programmer can come up with something, but the result would be awkward.

You can also find other data preparation problems where the best tool is Awk. Looking through "Sed & Awk" (O'Reilly) gives a good idea. I'm no expert Awk-er, sure, but I think I can see that Awk and Vilno are really like apples and oranges.
For scanning inconsistently structured ASCII data files, where different rows have different column specifications, Awk is a better tool.

For data problems that lend themselves to UNIX-style regular expressions, Awk, again, is a great tool.

If you have a data manipulation problem that is incredibly simple, then converting an ascii data file to binary, and then back, may not seem worth it. Awk, again, wins. But the asciitobinary and binarytoascii statements (there and back) only take 4 lines or so, so Vilno is really not that bad.

Certain aspects of Vilno and SAS are a bit more user-friendly:

Each column has a variable name, such as "PatientID". Awk uses $1, $2, $3 as variable names for columns. Not user-friendly.

In both Vilno and SAS (and SQL) the possibility of "MISSING" (or "NULL") is built into the data values held in the columns. So you don't have to use separate boolean variables to track MISSING vs NOT-MISSING. Very convenient.

Vilno does have a lot of functionality that is a lot harder to implement in most other programming languages. (You can implement that functionality, but it would take a ton of code - the three merge-in options for Vilno are an example.)

The upshot: Awk is a hammer. Vilno is a screwdriver.
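To make the joinby semantics described in this post concrete, here is a small Python sketch of a many-to-many join within by-variable subgroups, followed by a where filter. It is an illustration under assumptions, not Vilno's implementation: the function name joinby is borrowed only for readability, it joins two datasets rather than three (a third can be chained in), it behaves like an inner join because the post does not say what happens to unmatched rows, and the sample rows are invented.

```python
from collections import defaultdict


def joinby(left, right, by, where=lambda row: True):
    """Many-to-many join of two lists of dict rows on the `by` columns."""
    # Index the right-hand rows by their by-variable values.
    groups = defaultdict(list)
    for r in right:
        groups[tuple(r[k] for k in by)].append(r)

    joined = []
    for l in left:
        # Every left row pairs with every right row in the same subgroup.
        for r in groups.get(tuple(l[k] for k in by), []):
            candidate = {**l, **r}
            if where(candidate):  # e.g. var3 <= var4
                joined.append(candidate)
    return joined


# Invented sample data.
dataset1 = [
    {"variable1": 4, "variable2": "Sam", "var3": 1},
    {"variable1": 4, "variable2": "Sam", "var3": 5},
    {"variable1": 7, "variable2": "Ann", "var3": 2},
]
dataset2 = [
    {"variable1": 4, "variable2": "Sam", "var4": 3},
    {"variable1": 4, "variable2": "Sam", "var4": 6},
    {"variable1": 7, "variable2": "Ann", "var4": 1},
]

joined = joinby(
    dataset1,
    dataset2,
    by=["variable1", "variable2"],
    where=lambda row: row["var3"] <= row["var4"],
)
for row in joined:
    print(row)
```

Of the four possible Sam-with-Sam pairings, three satisfy var3<=var4, and the single Ann pairing does not, so three rows come out. Even this stripped-down version is noticeably longer than the two-line Vilno paragraph, which is the productivity point the post is making.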
Re: [R] Tools For Preparing Data For Analysis
Here are some examples of the type of data crunching you might have to do, in response to the requests by Christophe Pallier and Martin Stevens.

Before I started developing Vilno, some six years ago, I had been working in pharmaceuticals for eight years (it's not easy to show you actual data though, because it's all confidential of course).

Lab data can be especially messy, especially if one clinical trial allows the physicians to use different labs. So let's consider lab data.

Merge normal ranges into the lab data. This has to be done by lab-site and lab testcode (PLT for platelets, etc.), obviously. I've seen cases where you also need to match by sex and age. The sex column in the normal ranges could be: blank, F, M, or B (B meaning both sexes). The age column in the normal ranges could be: blank, or something like "40 <55". Even worse, you could have an ageunits column in the normal ranges dataset: usually "Y", but if there are children in the clinical trial, you will have "D" or "M", for Days and Months. If the clinical trial is for adults, all rows with "D" or "M" should be tossed out at the start. Clearly the statistical programmer has to spend time looking at the data before writing the program. Remember, all of these details can change any time you move to a new clinical trial.

So for the lab data, you have to merge in the patient's date of birth, calculate age, and somehow relate that to the age-group column in the normal ranges dataset.

(By the way, in clinical trial data preparation, the SAS datastep is much more useful and convenient, in my opinion, than the SQL SELECT syntax, at least 97% of the time. But in the middle of this program, when you merge the normal ranges into the lab data, you get a better solution with PROC SQL (just the SQL SELECT statement implemented inside SAS). This is because of the trickiness of the age match-up, and because the SAS datastep does not do well with many-to-many joins.)
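The age match-up described above can be sketched in Python. Everything specific here is an assumption made for illustration: the column names (SITE, TESTCODE, SEX, AGE, LOW, HIGH), the invented normal-range values, the reading of "40 <55" as 40 <= age < 55, and the treatment of a blank sex or age as "applies to everyone". Real data would have to be inspected before settling any of these.

```python
from datetime import date

# Invented normal ranges (ageunits assumed to be years throughout).
NORMAL_RANGES = [
    {"SITE": "LAB01", "TESTCODE": "PLT", "SEX": "B", "AGE": "18 <40", "LOW": 150, "HIGH": 400},
    {"SITE": "LAB01", "TESTCODE": "PLT", "SEX": "F", "AGE": "40 <55", "LOW": 140, "HIGH": 390},
    {"SITE": "LAB01", "TESTCODE": "PLT", "SEX": "M", "AGE": "40 <55", "LOW": 130, "HIGH": 380},
]


def age_in_years(birth, on):
    """Completed years between date of birth and the lab date."""
    before_birthday = (on.month, on.day) < (birth.month, birth.day)
    return on.year - birth.year - before_birthday


def age_matches(age_text, age):
    """Assumed reading of '40 <55': 40 <= age < 55; blank matches any age."""
    if not age_text.strip():
        return True
    low, high = age_text.split("<")
    return int(low) <= age < int(high)


def sex_matches(range_sex, sex):
    """Assumption: blank and 'B' apply to both sexes."""
    return range_sex.strip() in ("", "B") or range_sex.strip() == sex


def matching_ranges(lab_row, normal_ranges):
    """All normal-range rows that apply to one lab record."""
    age = age_in_years(lab_row["BIRTHDATE"], lab_row["LABDATE"])
    return [
        nr for nr in normal_ranges
        if nr["SITE"] == lab_row["SITE"]
        and nr["TESTCODE"] == lab_row["TESTCODE"]
        and sex_matches(nr["SEX"], lab_row["SEX"])
        and age_matches(nr["AGE"], age)
    ]


lab_row = {
    "SITE": "LAB01", "TESTCODE": "PLT", "SEX": "F",
    "BIRTHDATE": date(1960, 3, 15), "LABDATE": date(2007, 6, 8), "RESULT": 120,
}
print(matching_ranges(lab_row, NORMAL_RANGES))
```

Returning a list rather than a single row is deliberate: with messy normal-range tables, zero matches or several matches per lab record are both possible, and each case needs a look before the merge can be trusted.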
Merge various study drug administration dates into the lab data. Now, for each lab record, calculate the treatment period (or cycle number), depending on the statistician's specifications and the way the clinical trial is structured.

Different clinical sites chose to use different lab providers. So, for example, for Monocytes, you have 10 different units (essentially 6 units, but spelling inconsistencies as well). The statistician has requested that you use standardized units in some of the listings (% units, and only one type of non-% unit, for example). At the same time, lab values need to be converted (*1.61, divide by 1000, etc.). This can be very time consuming no matter what software you use, and, in my experience, when the SAS programmer asks for more clinical information or lab guidebooks, the response is incomplete, so he does a lot of guesswork. SAS programmers do not have expertise in lab science, hence the guesswork.

Your program has to accommodate numeric values ("1.54"), quasi-numeric values ("<1"), and non-numeric values ("Trace").

Your data listing is tight for space, so print "PROLONGED CELL CONT" as "PRCC".

Once normal ranges are merged in, figure out which values are out-of-range and high, which are low, and which are within normal range. In the data listing, you may have "H" or "L" appended to the result value being printed.

For each treatment period, you may need a unique lab record selected, in case there are two or three for the same treatment period. The statistician will tell the SAS programmer how: maybe the average of the results for that treatment period, maybe the lab record closest to the mid-point of the treatment period. This isn't for the data listing, but for a summary table.

For the differentials (monocytes, lymphocytes, etc.), merge in the WBC (total white blood cell count) values, to convert values between % units and absolute count units.
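Two of the steps above, sorting result values into numeric, quasi-numeric and non-numeric, and appending "H" or "L" for out-of-range values, can be sketched in Python as follows. The function names and the rule that only plainly numeric values get a flag are assumptions for illustration. How a value like "<1" should be flagged is exactly the kind of decision the statistician has to specify.

```python
import re

# Quasi-numeric pattern: a comparison sign followed by a number, e.g. "<1".
QUASI_NUMERIC = re.compile(r"([<>]=?)\s*(\d+(?:\.\d+)?)")


def classify_result(text):
    """Return (kind, number) where kind is numeric, quasi-numeric or non-numeric."""
    text = text.strip()
    try:
        return ("numeric", float(text))
    except ValueError:
        pass
    match = QUASI_NUMERIC.fullmatch(text)
    if match:
        return ("quasi-numeric", float(match.group(2)))
    return ("non-numeric", None)


def range_flag(value, low, high):
    """'H' above the normal range, 'L' below it, '' inside it."""
    if value > high:
        return "H"
    if value < low:
        return "L"
    return ""


def listing_value(text, low, high):
    """Result as printed in the listing; only numeric values are flagged here."""
    kind, value = classify_result(text)
    if kind == "numeric":
        return text.strip() + range_flag(value, low, high)
    return text.strip()


for raw in ["1.54", "<1", "Trace", "0.1"]:
    print(raw, classify_result(raw), listing_value(raw, 0.2, 1.0))
```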
When printing the values in the data listing, you need "H" or "L" to the right of the value. But you also need the values to be well lined up (at the decimal place). This can be stupidly time consuming.

AND ON AND ON AND ON.

I think you see why clinical trials statisticians and SAS programmers enjoy lots of job security.

On 6/8/07, Martin Henry H. Stevens <[EMAIL PROTECTED]> wrote:
>
> Is there an example available of this sort of problematic data that
> requires this kind of data screening and filtering? For many of us,
> this issue would be nice to learn about, and deal with within R. If a
> package could be created, that would be optimal for some of us. I
> would like to learn a tad more, if it were not too much effort for
> someone else to point me in the right direction?
> Cheers,
> Hank
> On Jun 8, 2007, at 8:47 AM, Douglas Bates wrote:
> >
[R] How do you do an e-mail post that is within an ongoing thread?
That may sound like a stupid question, but if it confuses me, I'm sure it confuses others as well. I've tried to find that information on the R mail-group info pages, but can't seem to find it. Is it something obvious?

To begin a brand new discussion, you do your post as an e-mail sent to r-help@stat.math.ethz.ch, as I am doing right now.

How do I do an additional post that gets included in the "[R] Tools For Preparing Data For Analysis" thread, a thread which I started myself yesterday (thanks for all the responses, everybody)? There's got to be a real easy answer to that, since everybody else does it.

(I'm using gmail. Does it make a difference what e-mail host you use?)

---
PS: If you happen to be reading this, Christophe Pallier & Martin Stevens, I will respond to your request for examples shortly, once I figure this posting how-to out. My examples will come from data preparation problems in clinical trial data (I worked for 8 years on clinical trial analysis before beginning work on Vilno). I'll probably use lab data as an example because lab data can be messy and difficult to work with.
[R] Tools For Preparing Data For Analysis
As noted on the R-project web site itself (www.r-project.org -> Manuals -> R Data Import/Export), it can be cumbersome to prepare messy and dirty data for analysis with the R tool itself. I've also seen at least one S programming book (one of the yellow Springer ones) that says, more briefly, the same thing.

The R Data Import/Export page recommends examples using SAS, Perl, Python, and Java. It takes a bit of courage to say that (when you go to a corporate software web site, you'll never see a page saying "This is the type of problem that our product is not the best at, here's what we suggest instead"). I'd like to provide a few more suggestions, especially for volunteers who are willing to evaluate new candidates.

SAS is fine if you're not paying for the license out of your own pocket. But maybe one reason you're using R is that you don't have thousands of spare dollars.

Using Java for data cleaning is an exercise in sado-masochism; Java has a learning curve (almost) as difficult as C++.

There are different types of data transformation, and for some data preparation problems an all-purpose programming language is a good choice (i.e. Perl, or maybe Python/Ruby). Perl, for example, has excellent regular expression facilities.

However, for some types of complex, demanding data preparation problems, an all-purpose programming language is a poor choice. For example: cleaning up and preparing clinical lab data and adverse event data. You could do it in Perl, but it would take way, way too much time. A specialized programming language is needed. And since data transformation is quite different from data query, SQL is not the ideal solution either.

There are only three statistical programming languages that are well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more popular than S for data cleaning.
If you're an R user with difficult data preparation problems, frankly you are out of luck, because the products I'm about to mention are new, unknown, and therefore regarded as immature. And while the founders of these products would be very happy if you kicked the tires, most people don't like to look at brand new products. Most innovators and inventors don't realize this; I've learned it the hard way.

But if you are a volunteer who likes to help out by evaluating, comparing, and reporting upon new candidates, well, you could certainly help out R users and the developers of the products by kicking the tires of these products. And there is a huge need for such volunteers.

1. DAP
This is an open source implementation of SAS.
The founder: Susan Bassein
Find it at: directory.fsf.org/math/stats (GNU GPL)

2. PSPP
This is an open source implementation of SPSS. The relatively early version number might not give a good idea of how mature the data transformation features are; it reflects the fact that the founder has only started doing the statistical tests.
The founder: Ben Pfaff, either a grad student or professor at Stanford CS dept.
Also at: directory.fsf.org/math/stats (GNU GPL)

3. Vilno
This uses a programming language similar to SPSS and SAS, but quite unlike S. Essentially, it's a substitute for the SAS datastep, and it also transposes data and calculates averages and such. (No t-tests or regressions in this version.) I created this, mainly during the years 2001-2006. It's version 0.85, and has a fairly low bug rate, in my opinion. The tarball includes about 100 or so test cases used for debugging - for logical calculation errors, but not for extremely high volumes of data. The maintenance of Vilno has slowed down, because I am currently (desperately) looking for employment. But once I've found new employment and living quarters and settled in, I will continue to enhance Vilno in my spare time.
The founder: that would be me, Robert Wilkins
Find it at: code.google.com/p/vilno (GNU GPL)
(In particular, the tarball at code.google.com/p/vilno/downloads/list, since I have yet to figure out how to use Subversion.)

4. Who knows?
It was not easy to find out about the existence of DAP and PSPP. So who knows what else is out there. However, I think you'll find a lot more statistics software (regression, etc.) out there, and not so much data transformation software. Not many people work on data preparation software. In fact, the category is so obscure that there isn't one agreed term: data cleaning, data munging, data crunching, or just getting the data ready for analysis.
[R] Why is the R mailing list so hard to figure out?
Why does the R mailing list need such an unusual and customized user interface?

Last January, I figured out how to read Usenet mailing lists (or Usenet groups), and they all pretty much work the same: learn to use one, and you've learned to use them all (gnu.misc.discuss, comp.lang.lisp, and so on).

What's the best way to view and read discussions in this group for recent days? Can I view the postings for the current day via Google Groups? I hope I'm posting correctly.

What do "ethz" and "ch" stand for? Is "ch" for Switzerland?

Robert