Re: [R] Tools For Preparing Data For Analysis
I am posting to this thread, which has been quiet for some time, because
I remembered the following question.

Christophe Pallier wrote:
> Hi,
>
> Can you provide examples of data formats that are problematic to read
> and clean with R?

Today I had a data manipulation problem that I don't know how to do in R,
so I solved it with perl. Since I'm always interested in learning more
about complex data manipulation in R, I am posting my problem in the hope
of receiving some hints for doing this in R. If anyone has nothing better
to do than play with other people's data, I would be happy to send the
raw files off-list.

Background:

I have been given data that contains two measurements of left ventricular
ejection fraction. One of the methods is echocardiogram, which sometimes
gives a true quantitative value and other times a semi-quantitative
value. The desire is to compare echo with the other method (MUGA). In
most cases, patients had either quantitative or semi-quantitative. Some
patients had both. The data came to me in Excel files with, basically, no
patient identifiers to link the "both" patients with the
semi-quantitative patients (the "both" patients were in multiple data
sets).

What I wanted to do was extract from the semi-quantitative data file
those patients with only semi-quantitative. All I have to link with are
the semi-quantitative echo and the MUGA, and these pairs of values are
not unique.

To make this more concrete, here are some portions of the raw data.

Both

ID NUM,ECHO,MUGA,Semiquant,Quant
B,12,37,10,12
D,13,13,10,13
E,13,26,10,15
F,13,31,10,13
H,15,15,10,15
I,15,21,10,15
J,15,22,10,15
K,17,22,10,17
N,17.5,4,10,17.5
P,18,25,10,18
R,19,25,10,19

Semi-quantitative

echo,muga,quant
10,20,0 -- keep
10,20,0 -- keep
10,21,0 -- remove
10,21,0 -- keep
10,24,0 -- keep
10,25,0 -- remove
10,25,0 -- remove
10,25,0 -- keep

Here is the perl program I wrote for this.
#!/usr/bin/perl

# Count how often each (semi-quantitative echo, MUGA) pair occurs
# among the patients who had both kinds of echo.
open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
# Discard first row;
$_ = <BOTH>;
while(<BOTH>) {
    chomp;
    ($id, $e, $m, $sq, $qu) = split(/,/);
    $both{$sq,$m}++;
}
close(BOTH);

# Write the semi-quantitative-only patients: each pair seen in the
# "both" file cancels one matching row of the semi-quantitative file.
open(OUT, "> qual_echo_only.csv") || die "Can't open qual_echo_only.csv";
print OUT "pid,echo,muga,quant\n";
$pid = 2001;

open(QUAL, "qual_echo.csv") || die "Can't open qual_echo.csv";
# Discard first row
$_ = <QUAL>;
while(<QUAL>) {
    chomp;
    ($echo, $muga, $quant) = split(/,/);
    if ($both{$echo,$muga} > 0) {
        $both{$echo,$muga}--;
    }
    else {
        print OUT "$pid,$echo,$muga,$quant\n";
        $pid++;
    }
}
close(QUAL);
close(OUT);

# Write the "both" patients in long form: two rows per patient,
# quant = 0 for the semi-quantitative value and 1 for the quantitative.
open(OUT, "> both_echo.csv") || die "Can't open both_echo.csv";
print OUT "pid,echo,muga,quant\n";
$pid = 3001;

open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
# Discard first row;
$_ = <BOTH>;
while(<BOTH>) {
    chomp;
    ($id, $e, $m, $sq, $qu) = split(/,/);
    print OUT "$pid,$sq,$m,0\n";
    print OUT "$pid,$qu,$m,1\n";
    $pid++;
}
close(BOTH);
close(OUT);

-- 
Kevin E. Thorpe
Biostatistician/Trialist, Knowledge Translation Program
Assistant Professor, Department of Public Health Sciences
Faculty of Medicine, University of Toronto
email: [EMAIL PROTECTED]  Tel: 416.864.5776  Fax: 416.864.6057

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Tools For Preparing Data For Analysis
On 6/22/07, Kevin E. Thorpe [EMAIL PROTECTED] wrote:
> Today I had a data manipulation problem that I don't know how to do in
> R so I solved it with perl.

If I understand correctly (from your Perl script):

1. you count the number of occurrences of each (echo, muga) pair in the
   first file;
2. you remove from the second file the lines that correspond to these
   occurrences.

If this is indeed your aim, here's a solution in R:

cumcount <- function(x) {
  y <- numeric(length(x))
  for (i in 1:length(y)) {
    y[i] = sum(x[1:i] == x[i])
  }
  y
}

both <- read.csv('both_echo.csv')
v <- table(paste(both$echo, "_", both$muga, sep=""))

semi <- read.csv('qual_echo.csv')
s <- paste(semi$echo, "_", semi$muga, sep="")
cs = cumcount(s)
count = v[s]
count[is.na(count)]=0

semi2 <- data.frame(semi, s, cs, count, keep = cs > count)

> semi2
  echo muga quant     s cs count  keep
1   10   20     0 10_20  1     0  TRUE
2   10   20     0 10_20  2     0  TRUE
3   10   21     0 10_21  1     1 FALSE
4   10   21     0 10_21  2     1  TRUE
5   10   24     0 10_24  1     0  TRUE
6   10   25     0 10_25  1     2 FALSE
7   10   25     0 10_25  2     2 FALSE
8   10   25     0 10_25  3     2  TRUE

My code is not very readable... Yet the 'trick' of using a helper
function like 'cumcount' might be instructive.

Christophe Pallier (http://www.pallier.org)
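A possible vectorised variant of the cumcount() idea, offered only as a sketch: the running count per key can be obtained with ave(), which avoids the explicit loop. The toy data below are copied from the excerpts quoted in the thread; the object names (key_both, key_semi, n_both, semi_only) are illustrative and not from the original posts, and the keys are built directly from the Semiquant and MUGA columns rather than from the both_echo.csv file.

```r
# Toy data taken from the excerpts in the thread.
both <- data.frame(semiquant = rep(10, 11),
                   muga = c(37, 13, 26, 31, 15, 21, 22, 22, 4, 25, 25))
semi <- data.frame(echo = rep(10, 8),
                   muga = c(20, 20, 21, 21, 24, 25, 25, 25),
                   quant = 0)

# One key per (semi-quantitative echo, MUGA) pair.
key_both <- paste(both$semiquant, both$muga, sep = "_")
key_semi <- paste(semi$echo, semi$muga, sep = "_")

# Running count of each key: 1 for its first occurrence, 2 for the second, ...
cs <- ave(seq_along(key_semi), key_semi, FUN = seq_along)

# How often each key occurs among the "both" patients (0 if absent).
n_both <- table(key_both)
count <- as.vector(n_both[match(key_semi, names(n_both))])
count[is.na(count)] <- 0

# Keep a row only once all matching "both" patients have been cancelled.
semi_only <- semi[cs > count, ]
```

On the excerpt this keeps the same five rows that are marked "keep" in the original post. Whether ave() is clearer than the explicit loop is a matter of taste; it does scale better on long files.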
Re: [R] Tools For Preparing Data For Analysis
As a tangent to this thread, there is a very relevant article in the
latest issue of the RSS magazine Significance, which I have just
received:

  Dr Fisher's Casebook
  The trouble with data
  Significance, Vol 4 (2007) Issue 2.

Full current contents at

  http://www.blackwell-synergy.com/toc/sign/4/2

but unfortunately you can only read any of it by paying money to
Blackwell (unless you're an RSS member).

Best wishes to all,
Ted.

E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 14-Jun-07  Time: 12:24:46
-- XFMail --
Re: [R] Tools For Preparing Data For Analysis
--- [EMAIL PROTECTED] wrote:
> As a tangent to this thread, there is a very relevant article in the
> latest issue of the RSS magazine Significance, which I have just
> received:
>
>   Dr Fisher's Casebook
>   The trouble with data
>   Significance, Vol 4 (2007) Issue 2.
>
> Full current contents at
>
>   http://www.blackwell-synergy.com/toc/sign/4/2
>
> but unfortunately you can only read any of it by paying money to
> Blackwell (unless you're an RSS member).
>
> Best wishes to all,
> Ted.

A lovely article. I'm not a member but the local university has a
subscription.

The examples of men who claimed to have cervical smears (F) and women
who were 5' tall weighing 15 stone (T) ring true. I've found people
walking at 30 km/hr (F) and an addict using 240 needles a month (T).

I've even found a set of 16 variables the study designers never heard
of!
Re: [R] Tools For Preparing Data For Analysis
[ Arrggh, not "reply" but "reply to all"; cross my fingers again. Sorry,
Peter! ]

On 6/10/07, Peter Dalgaard [EMAIL PROTECTED] wrote:
> Case in point: Finding the first or the last observation for each
> subject when there are multiple records for each subject. The SAS way
> would be a datastep with IF-THEN-DELETE, and a RETAIN statement so
> that you can compare the subject ID with the one from the previous
> record, working with data that are sorted appropriately.

Hmm, I don't think you need a retain statement.

  if first.patientID ;
or
  if last.patientID ;

ought to do it.

It's actually better than the Vilno version, I must admit, a bit more
concise:

  if ( not firstrow(patientID) ) deleterow ;

Ah well.

**

For the folks asking for the location of software (I know I posted it,
but it didn't connect to the thread, and you get a huge number of posts
each day, sorry):

Vilno: find at http://code.google.com/p/vilno
DAP & PSPP: find at http://directory.fsf.org/math/stats
Awk: find at lots of places, e.g. http://www.gnu.org/software/gawk/gawk.html

Anything else? DAP & PSPP are hard to find; I'm sure there's more out
there! What about MDX? Nahh, not really the right problem domain.
Nobody uses MDX for this stuff.

**

If my examples, using clinical trial data, are boring and hard to
understand for those who asked for examples (and presumably don't work
in clinical trials), let me know. Some of these other examples I'm
reading about are quite interesting. It doesn't help that clinical
trial databases cannot be public. Making a fake database would take a
lot of time.

The irony is, even with my deep understanding of data preparation in
clinical trials, the pharmas still don't want to give me a job (because
I was gone for many years).

Let's see if this post works: thanks to the folks who gave me advice on
how to properly respond to a post within a thread. (Although the thread
in my gmail account is only a subset of the posts visible in the
archives.) Crossing my fingers.
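For comparison with the SAS first./last. idiom, here is a small R sketch of the same task. The data frame and its column names (visits, patientID, visit, dose) are made up for illustration, and the rows are assumed to be sorted by patient and visit already.

```r
# Hypothetical visit records, already sorted by patient and visit.
visits <- data.frame(patientID = c(1, 1, 1, 2, 3, 3),
                     visit     = c(1, 2, 3, 1, 1, 2),
                     dose      = c(5, 10, 10, 20, 5, 5))

# First record per patient: the rows where the ID has not been seen yet.
first_rows <- visits[!duplicated(visits$patientID), ]

# Last record per patient: the same test, scanning from the end.
last_rows <- visits[!duplicated(visits$patientID, fromLast = TRUE), ]

# A within-patient running total, e.g. cumulative dose.
visits$cumdose <- ave(visits$dose, visits$patientID, FUN = cumsum)
```

The fromLast argument of duplicated() plays the role of last.patientID here, and the ave() line hints at how other within-subject summaries can be handled without a RETAIN-style loop.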
Re: [R] Tools For Preparing Data For Analysis
(Ted Harding) sent the following at 10/06/2007 09:28:

... much snipped ...

> (As is implicit in many comments in Robert's blog, and indeed also
> from many postings to this list over time and undoubtedly well known
> to many of us in practice, a lot of the problems with data files
> arise at the data gathering and entry stages, where people can behave
> as if stuffing unpaired socks and unattributed underwear randomly
> into a drawer, and then banging it shut).

And they look surprised when pointing a statistician at the chest of
drawers doesn't result in a cut price display worthy of Figleaf (or
Victoria's Secret I think for those of you in N. America) and get them
their degree, doctorate, latest publication ...

Ah me, how wonderfully, wonderfully ... sadly, accurate!

Thanks Ted, great thread and I'm impressed with EpiData that I've
discovered through this. I'd still like something that is even more
integrated with R but maybe some day, if EpiData go fully open source
as I think they are doing ("A full conversion plan to secure this and
convert the software to open-source has been made (See complete
description of license and principles)." at http://www.epidata.dk/ but
the link to http://www.epidata.dk/about.htm doesn't exactly clarify
this, I don't think. But I can hope.)

Thanks, yet again, to everyone who creates and contributes to the R
system and this list: wonderful!

C

-- 
Chris Evans [EMAIL PROTECTED] Skype: chris-psyctc
Professor of Psychotherapy, Nottingham University;
Consultant Psychiatrist in Psychotherapy, Notts PDD network;
Research Programmes Director, Nottinghamshire NHS Trust;
*If I am writing from one of those roles, it will be clear. Otherwise*
*my views are my own and not representative of those institutions*
Re: [R] Tools For Preparing Data For Analysis
Chris Evans wrote:
> Thanks Ted, great thread and I'm impressed with EpiData that I've
> discovered through this. I'd still like something that is even more
> integrated with R but maybe some day, if EpiData go fully open source
> as I think they are doing ("A full conversion plan to secure this and
> convert the software to open-source has been made (See complete
> description of license and principles)." at http://www.epidata.dk/
> but the link to http://www.epidata.dk/about.htm doesn't exactly
> clarify this, I don't think. But I can hope.)
>
> Thanks, yet again, to everyone who creates and contributes to the R
> system and this list: wonderful!

Perhaps what we need is an XML standard for describing record-oriented
data and its validation? This could then be used to validate a set of
records and possibly also to build input forms with built-in validation
for new records.

You could then write R code that did 'check this data frame against
this XML description and tell me the invalid rows'. Or Python code.

This is the kind of thing that is traditionally built using a database
front-end, but keeping the description in XML means that alternate
interfaces (web forms, standalone programs using Qt or GTK libraries)
can be used on the same description set.

I had a quick search to see if this kind of thing exists already, but
google searches for 'data entry verification' indicate that I should
really pay some people in India to do that kind of thing for me...

Barry
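To make the "check this data frame against a description and tell me the invalid rows" idea concrete, here is a rough R sketch. It deliberately swaps the XML description for a plain R list of per-column predicates, purely as a stand-in; the rule set, the column names and the function invalid_rows() are all hypothetical.

```r
# Hypothetical rule set: one predicate per column, TRUE meaning "valid".
rules <- list(
  age    = function(x) !is.na(x) & x >= 0 & x <= 120,
  sex    = function(x) x %in% c("F", "M"),
  weight = function(x) !is.na(x) & x > 0 & x < 400
)

# Return the row numbers that fail at least one rule.
invalid_rows <- function(df, rules) {
  ok <- rep(TRUE, nrow(df))
  for (col in names(rules)) {
    ok <- ok & rules[[col]](df[[col]])
  }
  which(!ok)
}

# Made-up records: row 2 has a negative age, row 3 an unknown sex code,
# row 4 a missing age.
records <- data.frame(age    = c(34, -2, 51, NA),
                      sex    = c("F", "M", "X", "M"),
                      weight = c(70, 82, 65, 90))

bad <- invalid_rows(records, rules)
```

A description held in XML (or any other neutral format) would need a small translation layer that turns each declared range or code list into a predicate of this kind; that layer is not shown here.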
Re: [R] Tools For Preparing Data For Analysis
On 10-Jun-07 02:16:46, Gabor Grothendieck wrote:
> That can be elegantly handled in R through R's object oriented
> programming by defining a class for the fancy input. See this post:
>   https://stat.ethz.ch/pipermail/r-help/2007-April/130912.html
> for a simple example of that style.
>
> On 6/9/07, Robert Wilkins [EMAIL PROTECTED] wrote:
>> Here are some examples of the type of data crunching you might
>> have to do.
>>
>> In response to the requests by Christophe Pallier and Martin Stevens.
>>
>> Before I started developing Vilno, some six years ago, I had
>> been working in the pharmaceuticals for eight years ( it's not
>> easy to show you actual data though, because it's all confidential
>> of course).

I hadn't heard of Vilno before (except as a variant of Vilnius). And it
seems remarkably hard to find info about it from a Google search. The
best I've come up with, searching on "vilno data", is at

  http://www.xanga.com/datahelper

This is a blog site, apparently with postings by Robert Wilkins.

At the end of the "Sunday, September 17, 2006" posting "Tedious coding
at the Pharmas" is a link:

  "I have created a new data crunching programming language."
  http://www.my.opera.com/datahelper

which appears to be totally empty. In another blog article:

  "go to the www.my.opera.com/datahelper site, go to the August 31
  blog article, and there you will find a tarball-file to download,
  called vilnoAUG2006package.tgz"

so again inaccessible; and a google on "vilnoAUG2006package.tgz" gives
a single hit which is simply the same article.

In the Xanga blog there are a few examples of tasks which are no big
deal in any programming language (and, relative to their simplicity,
appear a bit cumbersome in Vilno). I've not seen in the blog any
instance of data transformation which could not be quite easily done in
any straightforward language (even awk).

>> Lab data can be especially messy, especially if one clinical trial
>> allows the physicians to use different labs. So let's consider lab
>> data.
>> [...]

That's a fairly daunting description, though indeed not at all extreme
for the sort of data that can arise in practice (and not just in
pharmaceutical investigations). But the complexity is in the situation,
and, whatever language you use, the writing of the program will involve
the writer getting to grips with the complexity, and the complexity
will be present in the code simply because of the need to accommodate
all the special cases, exceptions and faults that have to be
anticipated in "feral" data.

Once these have been anticipated and incorporated in the code, the
actual transformations are again no big deal.

Frankly, I haven't yet seen anything in Vilno that couldn't be
accommodated in an 'awk' program. Not that I'm advocating awk for
universal use (I'm not that monolithic about it). But I'm using it as
my favourite example of a flexible, capable, transparent and efficient
data filtering language, as far as it goes.

SO: where can one find out more about Vilno, to see what it may really
be capable of that can not be done so easily in other ways?

(As is implicit in many comments in Robert's blog, and indeed also from
many postings to this list over time and undoubtedly well known to many
of us in practice, a lot of the problems with data files arise at the
data gathering and entry stages, where people can behave as if stuffing
unpaired socks and unattributed underwear randomly into a drawer, and
then banging it shut).

Best wishes to all,
Ted.

E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07  Time: 09:28:10
-- XFMail --
Re: [R] Tools For Preparing Data For Analysis
Douglas Bates wrote:
> Frank Harrell indicated that it is possible to do a lot of difficult
> data transformation within R itself if you try hard enough but that
> sometimes means working against the S language and its "whole object"
> view to accomplish what you want and it can require knowledge of
> subtle aspects of the S language.

Actually, I think Frank's point was subtly different: It is *because*
of the differences in view that it sometimes seems difficult to find
the way to do something in R that is apparently straightforward in SAS.
I.e. the solutions exist and are often elegant, but may require some
lateral thinking.

Case in point: Finding the first or the last observation for each
subject when there are multiple records for each subject. The SAS way
would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that
you can compare the subject ID with the one from the previous record,
working with data that are sorted appropriately.

You can do the same thing in R with a for loop, but there are better
ways, e.g.

  subset(df, !duplicated(ID)), and
  subset(df, rev(!duplicated(rev(ID)))),

or maybe

  do.call(rbind, lapply(split(df, df$ID), head, 1)), resp. tail.

Or something involving aggregate(). (The latter approaches generalize
better to other within-subject functionals like cumulative doses, etc.)

The hardest cases that I know of are the ones where you need to turn
one record into many, such as occurs in survival analysis with
time-dependent, piecewise constant covariates. This may require
"transposing the problem", i.e. for each interval you find out which
subjects contribute and with what, whereas the SAS way would be a
within-subject loop over intervals containing an OUTPUT statement.

Also, there are some really weird data formats, where e.g. the input
format is different in different records. Back in the 80's, when
punched-card input was still common, it was quite popular to have one
card with background information on a patient plus several cards
detailing visits, and you'd get a stack of cards containing both kinds.
In R you would most likely split on the card type using grep() and then
read the two kinds separately and merge() them later.
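The split-then-merge approach for mixed record types might look roughly like the following in R. The card layout is invented for illustration: "P" cards are assumed to carry patient background, "V" cards one visit each, with the patient id in the second field; none of these names or layouts come from the thread.

```r
# Hypothetical card images with two record layouts in one stack.
cards <- c("P,101,F,1950",
           "V,101,1,120",
           "V,101,2,118",
           "P,102,M,1947",
           "V,102,1,135")

# Split on the card type found at the start of each record.
is_patient <- grepl("^P,", cards)

# Read each kind with its own column layout.
patients <- read.csv(text = cards[is_patient], header = FALSE,
                     col.names = c("type", "id", "sex", "born"))
visits   <- read.csv(text = cards[!is_patient], header = FALSE,
                     col.names = c("type", "id", "visit", "sbp"))

# Drop the card-type column and attach the background data to each visit.
merged <- merge(patients[, -1], visits[, -1], by = "id")
```

With real fixed-width cards, read.fwf() would probably replace read.csv(), but the split, read and merge steps would stay the same.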
Re: [R] Tools For Preparing Data For Analysis
On 6/10/07, Ted Harding [EMAIL PROTECTED] wrote:
> ... a lot of the problems with data files arise at the data gathering
> and entry stages, where people can behave as if stuffing unpaired
> socks and unattributed underwear randomly into a drawer, and then
> banging it shut.

Not specifically R-related, but this would make a great fortune.

Sarah

-- 
Sarah Goslee
http://www.functionaldiversity.org
Re: [R] Tools For Preparing Data For Analysis
Since R is supposed to be a complete programming language, I wonder why
these tools couldn't be implemented in R (unless speed is the issue).
Of course, it's a naive desire to have a single language that does
everything, but it seems that R currently has most of the functions
necessary to do the type of data cleaning described.

For instance, Gabor and Peter showed some snippets of ways to do this
elegantly; my [physical science] data is often not as horrendously
structured, so usually I can get away with a program containing this
type of code:

txtin <- scan(filename, what="", sep="\n")
filteredList <- lapply(strsplit(txtin, delimiter), FUN=filterfunction)
# filterfunction() returns selected (and possibly transformed)
# elements if present and NULL otherwise
# may include calls to grep(), regexpr(), gsub(), substring(),...
# nchar(), sscanf(), type.convert(), paste(), etc.
mydataframe <- do.call(rbind, filteredList)
# then match(), subset(), aggregate(), etc.

In the case that the file is large, I open a file connection and scan a
single line + apply filterfunction() successively in a FOR-LOOP instead
of using lapply(). Of course, the devil is in the details of the
filtering function, but I believe most of the required text processing
facilities are already provided by R.

I often have tasks that involve a combination of shell-scripting and
text processing to construct the data frame for analysis; I started out
using Python+NumPy to do the front-end work but have been using R
progressively more (frankly, all of it) to take over that portion since
I generally prefer the data structures and methods in R.

--- Peter Dalgaard [EMAIL PROTECTED] wrote:
> Actually, I think Frank's point was subtly different: It is *because*
> of the differences in view that it sometimes seems difficult to find
> the way to do something in R that is apparently straightforward in
> SAS. I.e. the solutions exist and are often elegant, but may require
> some lateral thinking.
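A filled-in version of the scan / strsplit / filter / rbind pattern described in this message could look as follows. The input lines, the ";" delimiter and the body of filterfunction() are all made up for the sake of a runnable example; only the overall shape follows the message.

```r
# Hypothetical raw lines; in practice these would come from
# scan(filename, what = "", sep = "\n") or readLines(filename).
txtin <- c("# instrument log", "a;1.5;ok", "b;2.5;bad", "", "c;4.0;ok")

# Keep only three-field lines flagged "ok"; return NULL for everything else.
filterfunction <- function(fields) {
  if (length(fields) != 3 || fields[3] != "ok") return(NULL)
  data.frame(id = fields[1], value = as.numeric(fields[2]),
             stringsAsFactors = FALSE)
}

# Split each line into fields, filter, and stack the survivors.
filteredList <- lapply(strsplit(txtin, ";"), filterfunction)
mydataframe  <- do.call(rbind, filteredList)
```

The NULL entries simply vanish in the rbind() step, which is what makes returning NULL from the filter function convenient. For very large files, building one data frame per line is slow, so collecting plain vectors and assembling the data frame once at the end may be preferable.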
Re: [R] Tools For Preparing Data For Analysis
On 10-Jun-07 14:04:44, Sarah Goslee wrote:
> On 6/10/07, Ted Harding [EMAIL PROTECTED] wrote:
>> ... a lot of the problems with data files arise at the data
>> gathering and entry stages, where people can behave as if stuffing
>> unpaired socks and unattributed underwear randomly into a drawer,
>> and then banging it shut.
>
> Not specifically R-related, but this would make a great fortune.
>
> Sarah
>
> --
> Sarah Goslee
> http://www.functionaldiversity.org

I'm not going to object to that!
Ted.

E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07  Time: 21:18:45
-- XFMail --
Re: [R] Tools For Preparing Data For Analysis
On 10-Jun-07 19:27:50, Stephen Tucker wrote:
> Since R is supposed to be a complete programming language, I wonder
> why these tools couldn't be implemented in R (unless speed is the
> issue). Of course, it's a naive desire to have a single language
> that does everything, but it seems that R currently has most of the
> functions necessary to do the type of data cleaning described.

In principle that is certainly true. A couple of comments, though.

1. R's rich data structures are likely to be superfluous. Mostly, at
   the sanitisation stage, one is working with "flat" files
   (row/column). This straightforward format is often easier to handle
   using simple programs for the kind of basic filtering needed, rather
   than getting into the heavier programming constructs of R.

2. As follow-on and contrast at the same time, very often what should
   be a nice flat file with no rough edges is not. If there are
   variable numbers of fields per line, R will not handle it
   straightforwardly (you can force it in, but it's more elaborate).
   There are related issues as well.

a) If someone entering data into an Excel table lets their cursor
   wander outside the row/col range of the table, this can cause
   invisible entities to be planted in the extraneous cells. When saved
   as a CSV, this file then has variable numbers of fields per line,
   and possibly also extra lines with arbitrary blank fields.

     cat datafile.csv | awk 'BEGIN{FS=","}{n=NF;print n}'

   will give you the numbers of fields in each line.

   If you further pipe it into | sort -nu you will get the distinct
   field-numbers. If you know (by now) how many fields there should be
   (e.g. 10), then

     cat datafile.csv | awk 'BEGIN{FS=","} (NF != 10){print NR ", " NF}'

   will tell you which lines have the wrong number of fields, and how
   many fields they have. You can similarly count how many lines there
   are (e.g. pipe into wc -l).

b) People sometimes randomly use a blank space or a "." in a cell to
   denote a missing value. Consistent use of either is OK: ",," in a
   CSV will be treated as NA by R. The use of "." can be more
   problematic. If for instance you try to read the following CSV into
   R as a dataframe:

     1,2,.,4
     2,.,4,5
     3,4,.,6

   the "." in cols 2 and 3 is treated as the character ".", with the
   result that something complicated happens to the typing of the
   items.

   typeof(D[i,j]) is always "integer". sum(D[1,1]) = 1, but
   sum(D[1,2]) gives a type-error, even though the entry is in fact 2.
   And so on, in various combinations.

   And as.matrix(D) is of course a matrix of "character"s.

   In fact, columns 2 and 3 of D are treated as factors!

     for(i in (1:3)){ for(j in (1:4)){ print( (D[i,j]))}}
     [1] 1
     [1] 2
     Levels: . 2 4
     [1] .
     Levels: . 4
     [1] 4
     [1] 2
     [1] .
     Levels: . 2 4
     [1] 4
     Levels: . 4
     [1] 5
     [1] 3
     [1] 4
     Levels: . 2 4
     [1] .
     Levels: . 4
     [1] 6

   This is getting altogether too complicated for the job one wants to
   do!

   And it gets worse when people mix ",," and ",.,"!

   On the other hand, a simple brush with awk (or sed in this case) can
   sort it once and for all, without waking the sleeping dogs in R.

I could go on. R undoubtedly has the power, but it can very quickly get
over-complicated for simple jobs.

Best wishes to all,
Ted.

E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07  Time: 22:14:35
-- XFMail --
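For readers who would rather stay inside R, both checks have close counterparts: count.fields() reports the number of fields per line, and the na.strings argument of read.csv() lets "." (and the empty field) be read as missing. The sketch below uses the three-line CSV from the message plus one invented ragged line; the temporary file and the object names are illustrative only.

```r
# The three-line CSV from the message, plus one hypothetical ragged line.
csv_lines <- c("1,2,.,4", "2,.,4,5", "3,4,.,6", "7,8,9,10,11")
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# Fields per line, comparable to printing NF in awk.
nf <- count.fields(tmp, sep = ",")

# Line numbers with the wrong number of fields (4 expected here).
bad_lines <- which(nf != 4)

# Read the well-formed lines, treating "", "." and "NA" as missing.
D <- read.csv(text = csv_lines[nf == 4], header = FALSE,
              na.strings = c("", ".", "NA"))

unlink(tmp)
```

Read this way, every column of D is numeric and the "." cells become NA, so sum(D[1,2]) returns 2 without complaint. This does not contradict the point made in the message: a quick awk or sed pass is often the lighter tool, and the R route only helps once the problem has been recognised.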
Re: [R] Tools For Preparing Data For Analysis
An important potential benefit of R solutions, shared by awk, sed, ..., is that they provide a reproducible way to document exactly how one got from one version of the data to the next. This seems to be the main problem with handicraft methods like editing Excel files: it is too easy to introduce new errors that can't be tracked down at later stages of the analysis.

url:    www.econ.uiuc.edu/~roger    Roger Koenker
email   [EMAIL PROTECTED]           Department of Economics
vox:    217-333-4558                University of Illinois
fax:    217-244-6678                Champaign, IL 61820
Re: [R] Tools For Preparing Data For Analysis
Embarrassingly, I don't know awk or sed, but R's code seems to be shorter for most tasks than Python, which is my basis for comparison.

It's true that R's more powerful data structures usually aren't necessary for the data cleaning, but sometimes in the filtering process I will pick out lines that contain certain data, in which case I have to convert text to numbers and perform operations like which.min(), order(), etc., so in that sense I like to have R's vectorized notation and the objects/functions that support it.

As far as some of the tasks you described, I've tried transcribing them to R. I know you provided only the simplest examples, but even in these cases I think R's functions for handling these situations exemplify their usefulness in this step of the analysis. But perhaps you would argue that this code is too long... In any event it will still save the trouble of keeping track of an extra (intermediate) file passed between awk and R.

(1) the numbers of fields in each line
equivalent to
  cat datafile.csv | awk 'BEGIN{FS=","}{n=NF;print n}'
in awk

  # R equivalent:
  nFields <- count.fields("datafile.csv", sep=",")
  # or
  nFields <- sapply(strsplit(readLines("datafile.csv"), ","), length)

(2) which lines have the wrong number of fields, and how many fields they have. You can similarly count how many lines there are (e.g. pipe into wc -l).

  # number of lines with wrong number of fields
  nWrongFields <- length(nFields[nFields != 10])

  # select only first ten fields from each line
  # and return a matrix
  firstTenFields <-
    do.call(rbind,
            lapply(strsplit(readLines("datafile.csv"), ","),
                   function(x) x[1:10]))

  # select only those lines which contain ten fields
  # and return a matrix
  onlyTenFields <-
    do.call(rbind,
            lapply(strsplit(readLines("datafile.csv"), ","),
                   function(x) if(length(x) == 10) x else NULL))

(3) If for instance you try to read the following CSV into R as a dataframe:

  1,2,.,4
  2,.,4,5
  3,4,.,6

  # (a text connection is used up once it has been read;
  #  re-create txtC before each of the calls below)
  txtC <- textConnection(
  "1,2,.,4
  2,.,4,5
  3,4,.,6")
  # using read.csv() specifying the na.strings argument:
  > read.csv(txtC, header=FALSE, na.strings=".")
    V1 V2 V3 V4
  1  1  2 NA  4
  2  2 NA  4  5
  3  3  4 NA  6

  # Of course, read.csv will work only if data is formatted correctly.
  # More generally, using readLines(), strsplit(), etc., which are more
  # flexible:

  > do.call(rbind,
  +         lapply(strsplit(readLines(txtC), ","),
  +                type.convert, na.strings="."))
       [,1] [,2] [,3] [,4]
  [1,]    1    2   NA    4
  [2,]    2   NA    4    5
  [3,]    3    4   NA    6

(4) Situations where people mix ",," and ",.,"!

  # type.convert (and read.csv) will still work when missing values are ",,"
  # and ",.," (automatically recognizes "" as NA and, through
  # specification of 'na.strings', can recognize "." as NA)

  # If it is desired to convert "." to "" first, this is simple as
  # well:

  m <- do.call(rbind,
               lapply(strsplit(readLines(txtC), ","),
                      function(x) gsub("^\\.$", "", x)))
  > m
       [,1] [,2] [,3] [,4]
  [1,] "1"  "2"  ""   "4"
  [2,] "2"  ""   "4"  "5"
  [3,] "3"  "4"  ""   "6"
  # then
  mode(m) <- "numeric"
  # or
  m <- apply(m, 2, type.convert)
  # will give
  > m
       [,1] [,2] [,3] [,4]
  [1,]    1    2   NA    4
  [2,]    2   NA    4    5
  [3,]    3    4   NA    6
Re: [R] Tools For Preparing Data For Analysis
Here are some examples of the type of data crunching you might have to do, in response to the requests by Christophe Pallier and Martin Stevens.

Before I started developing Vilno, some six years ago, I had been working in the pharmaceutical industry for eight years (it's not easy to show you actual data though, because it's all confidential of course).

Lab data can be especially messy, especially if one clinical trial allows the physicians to use different labs. So let's consider lab data.

Merge normal ranges into the lab data. This has to be done by lab-site and lab testcode (PLT for platelets, etc.), obviously. I've seen cases where you also need to match by sex and age. The sex column in the normal ranges could be: blank, F, M, or B (B meaning for both sexes). The age column in the normal ranges could be: blank, or something like 40 55. Even worse, you could have an ageunits column in the normal ranges dataset: usually Y, but if there are children in the clinical trial, you will have D or M, for Days and Months. If the clinical trial is for adults, all rows with D or M should be tossed out at the start. Clearly the statistical programmer has to spend time looking at the data before writing the program. Remember, all of these details can change any time you move to a new clinical trial.

So for the lab data, you have to merge in the patient's date of birth, calculate age, and somehow relate that to the age-group column in the normal ranges dataset.

(By the way, in clinical trial data preparation, the SAS datastep is much more useful and convenient, in my opinion, than the SQL SELECT syntax, at least 97% of the time. But in the middle of this program, when you merge the normal ranges into the lab data, you get a better solution with PROC SQL (just the SQL SELECT statement implemented inside SAS). This is because of the trickiness of the age match-up, and because the SAS datastep does not do well with many-to-many joins.)

Merge various study drug administration dates into the lab data. Now, for each lab record, calculate treatment period (or cycle number), depending on the statistician's specifications and the way the clinical trial is structured.

Different clinical sites chose to use different lab providers. So, for example, for Monocytes, you have 10 different units (essentially 6 units, but spelling inconsistencies as well). The statistician has requested that you use standardized units in some of the listings (% units, and only one type of non-% unit, for example). At the same time, lab values need to be converted (*1.61, divide by 1000, etc.). This can be very time consuming no matter what software you use, and, in my experience, when the SAS programmer asks for more clinical information or lab guidebooks, the response is incomplete, so he does a lot of guesswork. SAS programmers do not have expertise in lab science, hence the guesswork.

Your program has to accommodate numeric values, 1.54 , quasi-numeric values 1 , and non-numeric values "Trace".

Your data listing is tight for space, so print "PROLONGED CELL CONT" as "PRCC".

Once normal ranges are merged in, figure out which values are out-of-range and high, which are low, and which are within normal range. In the data listing, you may have "H" or "L" appended to the result value being printed.

For each treatment period, you may need a unique lab record selected, in case there are two or three for the same treatment period. The statistician will tell the SAS programmer how. Maybe the average of the results for that treatment period, maybe the lab record closest to the mid-point of the treatment period. This isn't for the data listing, but for a summary table.

For the differentials (monocytes, lymphocytes, etc.), merge in the WBC (total white blood cell count) values, to convert values between % units and absolute count units.

When printing the values in the data listing, you need "H" or "L" to the right of the value. But you also need the values to be well lined up (the decimal place). This can be stupidly time consuming.

AND ON AND ON AND ON .

I think you see why clinical trials statisticians and SAS programmers enjoy lots of job security.

On 6/8/07, Martin Henry H. Stevens [EMAIL PROTECTED] wrote:
> Is there an example available of this sort of problematic data that requires this kind of data screening and filtering? For many of us, this issue would be nice to learn about, and deal with within R. If a package could be created, that would be optimal for some of us. I would like to learn a tad more, if it were not too much effort for someone else to point me in the right direction?
> Cheers,
> Hank
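The normal-range merge described in Robert Wilkins's message above can be sketched with awk, the tool suggested elsewhere in this thread. This is only an illustration under strong assumptions: both file layouts, the sample values, and the function name flag_labs are invented here; ages are already computed and in years; and unit conversion is ignored. A real trial would need the extra handling the message lists.

```shell
#!/bin/sh
# Hypothetical layouts, invented for this sketch:
#   ranges.csv: lab site, test code, sex (F, M, or B = both),
#               minimum age, maximum age, low limit, high limit
#   labs.csv:   patient, lab site, test code, sex, age in years, result
cat > ranges.csv <<'EOF'
S1,PLT,B,18,120,150,400
S1,HGB,F,18,120,12.0,15.5
S1,HGB,M,18,120,13.5,17.5
EOF
cat > labs.csv <<'EOF'
101,S1,PLT,F,44,120
102,S1,HGB,M,61,14.2
103,S1,HGB,F,37,16.1
104,S1,PLT,M,50,Trace
EOF

# Append the matching low limit, high limit and a flag to each lab record:
#   H = above range, L = below range, N = within range,
#   ? = no matching range, or a result that is not a plain number.
# Usage: flag_labs RANGES LABS
flag_labs() {
    awk -F, -v OFS=, '
        NR == FNR {                 # first file: remember every normal range
            n++
            site[n] = $1; tcode[n] = $2; sex[n] = $3
            lo_age[n] = $4; hi_age[n] = $5
            low[n] = $6; high[n] = $7
            next
        }
        {                           # second file: one lab record per line
            flag = "?"; lo = ""; hi = ""
            for (i = 1; i <= n; i++)
                if (site[i] == $2 && tcode[i] == $3 &&
                    (sex[i] == "B" || sex[i] == $4) &&
                    $5 + 0 >= lo_age[i] + 0 && $5 + 0 <= hi_age[i] + 0) {
                    lo = low[i]; hi = high[i]
                    if ($6 ~ /^[0-9]+\.?[0-9]*$/)
                        flag = ($6 + 0 > hi + 0) ? "H" : (($6 + 0 < lo + 0) ? "L" : "N")
                    break
                }
            print $0, lo, hi, flag
        }' "$1" "$2"
}

flag_labs ranges.csv labs.csv
```

The first matching range wins, so overlapping ranges would need an explicit rule. Values such as "Trace" are passed through with a "?" flag rather than guessed at.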
Re: [R] Tools For Preparing Data For Analysis
That can be elegantly handled in R through R's object-oriented programming, by defining a class for the fancy input. See this post:

  https://stat.ethz.ch/pipermail/r-help/2007-April/130912.html

for a simple example of that style.
Re: [R] Tools For Preparing Data For Analysis
Hi,

Can you provide examples of data formats that are problematic to read and clean with R?

The only problematic cases I have encountered were cases with multiline and/or varying-length records (optional information). Then, it is sometimes a good idea to preprocess the data to present it in a tabular format (one record per line).

For this purpose, I use awk (e.g. http://www.vectorsite.net/tsawk.html), which is very adept at processing ascii data files (awk is much simpler to learn than perl, spss, sas, ...).

I have never encountered a data file in ascii format that I could not reformat with Awk. With binary formats, it is another story...

But, again, this is my limited experience; I would like to know if there are situations where using SAS/SPSS is really a better approach.

Christophe Pallier
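As an illustration of the preprocessing Christophe Pallier describes (multiline records with optional information turned into one record per line), here is a minimal awk sketch. The input layout ("key: value" lines, records separated by blank lines), the field names id, name and age, the file name records.txt, and the function name flatten are all assumptions made up for this example.

```shell
#!/bin/sh
# Made-up input: one "key: value" pair per line, blank line between
# records, and the "age" line is optional.
cat > records.txt <<'EOF'
id: 1
name: Ann
age: 34

id: 2
name: Bob

id: 3
name: Cy
age: 51
EOF

# Write one CSV row per record, leaving missing optional fields empty.
# Usage: flatten FILE
flatten() {
    awk 'BEGIN {
             RS = ""            # paragraph mode: blank lines separate records
             FS = "\n"          # each line of a record is one field
             OFS = ","
             print "id,name,age"
         }
         {
             split("", rec)     # portable way to empty the array
             for (i = 1; i <= NF; i++) {
                 p = index($i, ": ")
                 if (p > 0) rec[substr($i, 1, p - 1)] = substr($i, p + 2)
             }
             print rec["id"], rec["name"], rec["age"]
         }' "$1"
}

flatten records.txt
```

The result is a rectangular file with a fixed number of fields per line, which read.csv() in R should accept without special handling.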
Re: [R] Tools For Preparing Data For Analysis
On 08-Jun-07 08:27:21, Christophe Pallier wrote:
> Hi,
>
> Can you provide examples of data formats that are problematic to read and clean with R?
>
> The only problematic cases I have encountered were cases with multiline and/or varying-length records (optional information). Then, it is sometimes a good idea to preprocess the data to present it in a tabular format (one record per line).
>
> For this purpose, I use awk (e.g. http://www.vectorsite.net/tsawk.html), which is very adept at processing ascii data files (awk is much simpler to learn than perl, spss, sas, ...).

I want to join in with an enthusiastic "Me too!!". For anything which has to do with basic checking for the kind of messes that people can get data into when they put it on the computer, I think awk is ideal. It is very flexible (far more so than many, even long-time, awk users suspect), very transparent in its programming language (as opposed to, say, perl), fast, and with light impact on system resources (a rare delight in these days, when upgrading your software may require upgrading your hardware).

Although it may seem on the surface that awk is "two-dimensional" in its view of data (line by line, and per field in a line), it has some flexible internal data structures and recursive function capability, which allows a lot more to be done with the data that have been read in.

For example, I've used awk to trace ancestry through a genealogy, given a data file where each line includes the identifier of an individual and the identifiers of its male and female parents (where known). And that was for pedigree dogs, where what happens in real life makes Oedipus look trivial.

> I have never encountered a data file in ascii format that I could not reformat with Awk. With binary formats, it is another story...

But then it is a good idea to process the binary file using an instance of the creating software, to produce an ASCII file (say in CSV format).

> But, again, this is my limited experience; I would like to know if there are situations where using SAS/SPSS is really a better approach.

The main thing often useful for data cleaning that awk does not have is any associated graphics. It is -- by design -- a line-by-line text-file processor. While, for instance, you could use awk to accumulate numerical histogram counts, you would have to use something else to display the histogram. And for scatter-plots there's probably not much point in bringing awk into the picture at all (unless a preliminary filtration of mess is needed anyway).

That being said, though, there can still be a use for awk in extracting data fields from a file for submission to other software.

Another kind of area where awk would not have much to offer is where, as a part of your preliminary data inspection, you want to inspect the results of some standard statistical analyses.

As a final comment, utilities like awk can be used far more fruitfully on operating systems (the unixoid family) which incorporate at ground level the infrastructure for "plumbing" together streams of data output from different programs.

Ted.

E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 08-Jun-07  Time: 10:43:05
-- XFMail --
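Ted's genealogy program is not shown in the thread. The following is only a guess at the general shape of such a task, using awk's associative arrays and a recursive function. The file layout (individual, sire, dam, with "?" where a parent is unknown), the sample names, and the function names ancestors and trace are assumptions made up for this sketch.

```shell
#!/bin/sh
# Hypothetical pedigree file: individual, sire, dam ("?" where unknown).
cat > pedigree.txt <<'EOF'
rex bruno bella
bruno max ?
bella max daisy
max ? ?
daisy ? ?
EOF

# Print an individual followed by all of its known ancestors,
# indented two spaces per generation.
# Usage: ancestors FILE NAME
ancestors() {
    awk -v who="$2" '
        # pad and i are extra parameters, which awk treats as local variables
        function trace(id, depth,    pad, i) {
            pad = ""
            for (i = 0; i < depth; i++) pad = pad "  "
            print pad id
            if (id in sire) {
                if (sire[id] != "?") trace(sire[id], depth + 1)
                if (dam[id]  != "?") trace(dam[id],  depth + 1)
            }
        }
        { sire[$1] = $2; dam[$1] = $3 }      # load the whole file first
        END { trace(who, 0) }                # then walk up the tree
    ' "$1"
}

ancestors pedigree.txt rex
```

In this sample, "max" appears on both the sire's and the dam's side of "rex", which is the sort of repeated ancestry the message alludes to. The sketch assumes the file contains no circular references.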
Re: [R] Tools For Preparing Data For Analysis
On 6/7/07, Robert Wilkins [EMAIL PROTECTED] wrote:
> As noted on the R-project web site itself (www.r-project.org - Manuals - R Data Import/Export), it can be cumbersome to prepare messy and dirty data for analysis with the R tool itself. I've also seen at least one S programming book (one of the yellow Springer ones) that says, more briefly, the same thing. The R Data Import/Export page recommends examples using SAS, Perl, Python, and Java. It takes a bit of courage to say that (when you go to a corporate software web site, you'll never see a page saying "This is the type of problem that our product is not the best at, here's what we suggest instead"). I'd like to provide a few more suggestions, especially for volunteers who are willing to evaluate new candidates.
>
> SAS is fine if you're not paying for the license out of your own pocket. But maybe one reason you're using R is you don't have thousands of spare dollars. Using Java for data cleaning is an exercise in sado-masochism; Java has a learning curve (almost) as difficult as C++.
>
> There are different types of data transformation, and for some data preparation problems an all-purpose programming language is a good choice (e.g. Perl, or maybe Python/Ruby). Perl, for example, has excellent regular expression facilities.
>
> However, for some types of complex demanding data preparation problems, an all-purpose programming language is a poor choice. For example: cleaning up and preparing clinical lab data and adverse event data - you could do it in Perl, but it would take way, way too much time. A specialized programming language is needed. And since data transformation is quite different from data query, SQL is not the ideal solution either.
>
> There are only three statistical programming languages that are well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more popular than S for data cleaning.
>
> If you're an R user with difficult data preparation problems, frankly you are out of luck, because the products I'm about to mention are new, unknown, and therefore regarded as immature. And while the founders of these products would be very happy if you kicked the tires, most people don't like to look at brand new products. Most innovators and inventors don't realize this; I've learned it the hard way.
>
> But if you are a volunteer who likes to help out by evaluating, comparing, and reporting upon new candidates, well you could certainly help out R users and the developers of the products by kicking the tires of these products. And there is a huge need for such volunteers.
>
> 1. DAP
> This is an open source implementation of SAS.
> The founder: Susan Bassein
> Find it at: directory.fsf.org/math/stats (GNU GPL)
>
> 2. PSPP
> This is an open source implementation of SPSS. The relatively early version number might not give a good idea of how mature the data transformation features are; it reflects the fact that he has only started doing the statistical tests.
> The founder: Ben Pfaff, either a grad student or professor at Stanford CS dept.
> Also at: directory.fsf.org/math/stats (GNU GPL)
>
> 3. Vilno
> This uses a programming language similar to SPSS and SAS, but quite unlike S. Essentially, it's a substitute for the SAS datastep, and also transposes data and calculates averages and such. (No t-tests or regressions in this version.) I created this, during the years 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in my opinion. The tarball includes about 100 or so test cases used for debugging - for logical calculation errors, but not for extremely high volumes of data. The maintenance of Vilno has slowed down, because I am currently (desperately) looking for employment. But once I've found new employment and living quarters and settled in, I will continue to enhance Vilno in my spare time.
> The founder: that would be me, Robert Wilkins
> Find it at: code.google.com/p/vilno (GNU GPL)
> (In particular, the tarball at code.google.com/p/vilno/downloads/list, since I have yet to figure out how to use Subversion.)
>
> 4. Who knows?
> It was not easy to find out about the existence of DAP and PSPP. So who knows what else is out there. However, I think you'll find a lot more statistics software (regression, etc.) out there, and not so much data transformation software. Not many people work on data preparation software. In fact, the category is so obscure that there isn't one agreed term: "data cleaning", "data munging", "data crunching", or just getting the data ready for analysis.

Thanks for bringing up this topic. I think there is definitely a place for such languages, which I would regard as data-filtering languages, but I also think that trying to reproduce the facilities in SAS or SPSS for data analysis is redundant.

Other responses in this thread have mentioned 'little language' filters like awk, which is fine for those who were raised in the Bell Labs tradition of programming
Re: [R] Tools For Preparing Data For Analysis
I had mentioned exactly the same thing to others, and the feedback I got was: 'when you have a hammer, everything will look like a nail' ^_^.

On 6/7/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote:

Robert Wilkins wrote:

... However, for some types of complex demanding data preparation problems, an all-purpose programming language is a poor choice. For example: cleaning up and preparing clinical lab data and adverse event data - you could do it in Perl, but it would take way, way too much time. A specialized programming language is needed. And since data transformation is quite different from data query, SQL is not the ideal solution either.

We deal with exactly those kinds of data solely using R. R is exceptionally powerful for data manipulation, just a bit hard to learn. Many examples are at http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

Frank

... rest snipped ...
Re: [R] Tools For Preparing Data For Analysis
Is there an example available of this sort of problematic data that requires this kind of data screening and filtering? For many of us, this issue would be nice to learn about, and deal with within R. If a package could be created, that would be optimal for some of us. I would like to learn a tad more, if it were not too much effort for someone else to point me in the right direction?

Cheers,
Hank

On Jun 8, 2007, at 8:47 AM, Douglas Bates wrote:

On 6/7/07, Robert Wilkins [EMAIL PROTECTED] wrote:

As noted on the R-project web site itself ( www.r-project.org - Manuals - R Data Import/Export ), it can be cumbersome to prepare messy and dirty data for analysis with the R tool itself.

... rest snipped ...
Re: [R] Tools For Preparing Data For Analysis
Martin Henry H. Stevens sent the following at 08/06/2007 15:11:

Is there an example available of this sort of problematic data that requires this kind of data screening and filtering? For many of us, this issue would be nice to learn about, and deal with within R. If a package could be created, that would be optimal for some of us. I would like to learn a tad more, if it were not too much effort for someone else to point me in the right direction?

Cheers,
Hank

On Jun 8, 2007, at 8:47 AM, Douglas Bates wrote:

On 6/7/07, Robert Wilkins [EMAIL PROTECTED] wrote:

As noted on the R-project web site itself ( www.r-project.org - ... rest snipped ...

OK, I can't resist that invitation. I think there are many kinds of problematic data. I handle some nasty textish things in perl (and I loved the purgatory quote), I'm afraid I do some things in Excel, and some cleaning I can handle in R, but I never enter data directly into R.

However, one very common scenario I have faced all my working life is psych data from questionnaires or interviews in low budget work, mostly student research or routine entry of therapists' data. Typically you have an identifier, a date, some demographics and then a lot of item data. There's little money (usually zero) involved for data entry and cleaning, but I've produced a lot of good(ish) papers out of this sort of very low budget work over the last 20 years. (Right at the other end of the financial spectrum from the FDA/validated s'ware thread, but this is about validation again!)

The problem I often face is that people are lousy data entry machines (well, actually, they vary ... enormously) and if they mess up the data entry we all know how horrible this can be.

SPSS (boo hiss) used to have an excellent module, actually a standalone PC/Windoze program, that allowed you to define variables so they had allowed values, and it would refuse to accept out of range or otherwise unacceptable entries. It also allowed you to create checking rules, and rules that would, in the light of earlier entries, set later values and not ask about them. In a rudimentary way you could also lay things out on the screen so that it paginated where the q'aire or paper data record did, etc.

The final nice touch was that you could define some variables as invariant and then set the thing so an independent data entry person could re-enter the other data (i.e. pick up q'aire, see if ID fits the one showing on screen, if so, enter the rest of the data). It would bleep and not move on if you entered a value other than that entered by the first person, and you had to confirm that one of you was right.

That saved me wasted weeks, I'm sure, on analysing data that turned out to be awful, and I'd love to see someone build something to replace that.

Currently I tend to use (boo hiss) Excel for this, as everyone I work with seems to have it (and not all can install open office, and anyway I haven't had time to learn that properly yet either ...), and I set up spreadsheets with validation rules set. That doesn't get the branching rules and checks (e.g. if male, skip questions about periods, PMT and pregnancies), or at least, with my poor Excel skills it doesn't. I just skip a column to indicate page breaks in the q'aire, and I get, when I can, two people to enter the data separately and then use R to compare the two spreadsheets, having yanked them into data frames.

I would really, really love someone to develop (and perhaps replace) the rather buggy edit() and fix() routines (they seem to hang on big data frames in Rcmdr, which is what I'm trying to get students onto) with something that did some or all of what SPSS/DE used to do for me, or that I bodge now in Excel. If any generous coding whiz were willing to do this, I'll try to alpha and beta test and write help etc.

There _may_ be good open source things out there that do what I need, but something that really integrated into R would be another huge step forward in being able to phase out SPSS in my work settings and phase in R.

Very best all,
Chris

--
Chris Evans [EMAIL PROTECTED] Skype: chris-psyctc
Professor of Psychotherapy, Nottingham University;
Consultant Psychiatrist in Psychotherapy, Notts PDD network;
Research Programmes Director, Nottinghamshire NHS Trust;
*If I am writing from one of those roles, it will be clear. Otherwise*
*my views are my own and not representative of those institutions*
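The double-entry check and the allowed-value rules described in this message can be sketched in a few lines. This is only an illustration in Python, not a replacement for SPSS/DE or EpiData: the column names (id, sex, item1, item2), the sample records, and the helper names read_entries, compare_entries and check_ranges are all made up for the example.

```python
import csv
import io

def read_entries(text, id_col="id"):
    """Parse CSV text into a dict mapping each ID to its row (a dict of strings)."""
    rows = csv.DictReader(io.StringIO(text))
    return {row[id_col]: row for row in rows}

def compare_entries(first, second):
    """List (id, column, value_1, value_2) for every cell the two entries disagree on."""
    problems = []
    for pid in sorted(set(first) | set(second)):
        a, b = first.get(pid), second.get(pid)
        if a is None or b is None:
            # One person entered this record and the other did not.
            problems.append((pid, "<row>",
                             "present" if a else "missing",
                             "present" if b else "missing"))
            continue
        for col in a:
            if a[col] != b.get(col):
                problems.append((pid, col, a[col], b.get(col)))
    return problems

def check_ranges(entries, allowed):
    """List (id, column, value) for values outside the allowed set of each column."""
    return [(pid, col, row[col])
            for pid, row in sorted(entries.items())
            for col, legal in allowed.items()
            if row[col] not in legal]

# Hypothetical data: the same three questionnaires typed in by two people.
ENTRY_1 = "id,sex,item1,item2\n101,M,3,4\n102,F,2,5\n103,F,1,1\n"
ENTRY_2 = "id,sex,item1,item2\n101,M,3,4\n102,F,2,3\n103,F,7,1\n"

first = read_entries(ENTRY_1)
second = read_entries(ENTRY_2)

mismatches = compare_entries(first, second)
out_of_range = check_ranges(
    second,
    {"sex": {"M", "F"}, "item1": set("12345"), "item2": set("12345")},
)

for problem in mismatches:
    print("disagreement:", problem)
for problem in out_of_range:
    print("out of range:", problem)
```

Every disagreement still has to be resolved against the paper record; the sketch only reports where the two entries differ and which values fall outside the declared range. Branching rules (such as skipping items depending on an earlier answer) are not covered.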
Re: [R] Tools For Preparing Data For Analysis
For Windows users, EpiData Entry http://www.epidata.dk/ is an excellent (free) tool for data entry and documentation.

--Dale

On 6/8/07, Chris Evans [EMAIL PROTECTED] wrote:

Martin Henry H. Stevens sent the following at 08/06/2007 15:11:

Is there an example available of this sort of problematic data that requires this kind of data screening and filtering?

... rest snipped ...
Re: [R] Tools For Preparing Data For Analysis
Dale Steele wrote:

For windows users, EpiData Entry http://www.epidata.dk/ is an excellent (free) tool for data entry and documentation.

--Dale

Note that EpiData seems to work well under linux using wine.

Frank
Re: [R] Tools For Preparing Data For Analysis
On 6/8/07, Douglas Bates [EMAIL PROTECTED] wrote:

Other responses in this thread have mentioned 'little language' filters like awk, which is fine for those who were raised in the Bell Labs tradition of programming (why type three characters when two character names should suffice for anything one wants to do on a PDP-11) but the typical field scientist finds this a bit too terse to understand and would rather write a filter as a paragraph of code that they have a chance of reading and understanding a week later.

Hum,

Concerning awk, I think that this comment does not apply: because the language is simple and somewhat limited, awk scripts are typically quite clean and readable (of course, it is possible to write horrible code in any language).

I have introduced awk to dozens of people (mostly scientists in social sciences, and dos/windows users...) over the last 15 years; it is sometimes the only programming language they know, and they are very happy with what they can do with it.

The philosophy of using it as a filter (that is, a converter) is also good, because many problems are best solved in 2 or 3 steps (2 or 3 short scripts run sequentially) rather than in one single step, as people tend to do with languages that encourage the use of data structures more complex than associative arrays.

It could be argued that awk is the swiss army knife of simple text manipulations. All in all, awk+R is a very efficient combination for data manipulation (at least for the cases I have encountered).

It would be a pity if your remark led people to overlook awk, as it would efficiently solve many of the input parsing problems that are posted on this list (I am talking here about extracting information from text files, not data entry).

awk, like R, is not free of defects, yet both are tools that one gets attached to because they increase your productivity a lot.

--
Christophe Pallier (http://www.pallier.org)
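The "two or three short steps" filter style described above can be illustrated without awk itself. The following is a rough Python sketch of the same idea, where each step does one thing and the only data structure is an associative array (a dict). The input format, the field positions and the function names normalise and count_by_field are assumptions made for the example, not anything taken from the thread.

```python
from collections import Counter

def normalise(lines):
    """Step 1: drop blank and comment lines, then split each line on whitespace."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        yield line.split()

def count_by_field(records, index):
    """Step 2: tally records by one field, much like count[$2]++ in awk."""
    counts = Counter()
    for fields in records:
        counts[fields[index]] += 1
    return dict(counts)

# Hypothetical raw file: irregular spacing, a comment line and a blank line.
RAW = """# subject condition rt
s01  word     512
s01  nonword  640

s02  word     498
s02  word     530
"""

records = list(normalise(RAW.splitlines()))
per_condition = count_by_field(records, 1)

for condition, n in sorted(per_condition.items()):
    print(condition, n)
```

Keeping the steps separate means the cleaned records from step 1 can be written to a file and read into R directly, which is the awk+R division of labour the message argues for.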
[R] Tools For Preparing Data For Analysis
As noted on the R-project web site itself ( www.r-project.org - Manuals - R Data Import/Export ), it can be cumbersome to prepare messy and dirty data for analysis with the R tool itself. I've also seen at least one S programming book (one of the yellow Springer ones) that says, more briefly, the same thing.

The R Data Import/Export page recommends examples using SAS, Perl, Python, and Java. It takes a bit of courage to say that (when you go to a corporate software web site, you'll never see a page saying "This is the type of problem that our product is not the best at, here's what we suggest instead"). I'd like to provide a few more suggestions, especially for volunteers who are willing to evaluate new candidates.

SAS is fine if you're not paying for the license out of your own pocket. But maybe one reason you're using R is you don't have thousands of spare dollars.

Using Java for data cleaning is an exercise in sado-masochism; Java has a learning curve (almost) as difficult as C++.

There are different types of data transformation, and for some data preparation problems an all-purpose programming language is a good choice (e.g. Perl, or maybe Python/Ruby). Perl, for example, has excellent regular expression facilities.

However, for some types of complex, demanding data preparation problems, an all-purpose programming language is a poor choice. For example: cleaning up and preparing clinical lab data and adverse event data - you could do it in Perl, but it would take way, way too much time. A specialized programming language is needed. And since data transformation is quite different from data query, SQL is not the ideal solution either.

There are only three statistical programming languages that are well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more popular than S for data cleaning.

If you're an R user with difficult data preparation problems, frankly you are out of luck, because the products I'm about to mention are new, unknown, and therefore regarded as immature. And while the founders of these products would be very happy if you kicked the tires, most people don't like to look at brand new products. Most innovators and inventors don't realize this; I've learned it the hard way.

But if you are a volunteer who likes to help out by evaluating, comparing, and reporting upon new candidates, you could certainly help out R users and the developers of the products by kicking the tires of these products. And there is a huge need for such volunteers.

1. DAP
This is an open source implementation of SAS.
The founder: Susan Bassein
Find it at: directory.fsf.org/math/stats (GNU GPL)

2. PSPP
This is an open source implementation of SPSS. The relatively early version number might not give a good idea of how mature the data transformation features are; it reflects the fact that the author has only started on the statistical tests.
The founder: Ben Pfaff, either a grad student or professor at Stanford CS dept.
Also at: directory.fsf.org/math/stats (GNU GPL)

3. Vilno
This uses a programming language similar to SPSS and SAS, but quite unlike S. Essentially, it's a substitute for the SAS datastep, and it also transposes data and calculates averages and such. (No t-tests or regressions in this version.) I created this, mainly during the years 2001-2006. It's version 0.85, and has a fairly low bug rate, in my opinion. The tarball includes about 100 test cases used for debugging - for logical calculation errors, but not for extremely high volumes of data. The maintenance of Vilno has slowed down, because I am currently (desperately) looking for employment. But once I've found new employment and living quarters and settled in, I will continue to enhance Vilno in my spare time.
The founder: that would be me, Robert Wilkins
Find it at: code.google.com/p/vilno (GNU GPL) (in particular, the tarball at code.google.com/p/vilno/downloads/list, since I have yet to figure out how to use Subversion).

4. Who knows?
It was not easy to find out about the existence of DAP and PSPP. So who knows what else is out there. However, I think you'll find a lot more statistics software (regression, etc.) out there, and not so much data transformation software. Not many people work on data preparation software. In fact, the category is so obscure that there isn't one agreed term: data cleaning, data munging, data crunching, or just getting the data ready for analysis.
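To make the kind of task discussed in this post concrete, here is a small sketch of a typical preparation step: cleaning messy lab values with a regular expression, then transposing from one row per measurement to one record per patient, with averages. It is written in Python (one of the all-purpose languages mentioned above) and is not Vilno, SAS or R code. The sample rows, the rule for values such as "<138", and the names parse_value and to_wide are assumptions made for the example.

```python
import re
from collections import defaultdict
from statistics import mean

# Accepts optional "<" or ">" and either "." or "," as the decimal separator.
VALUE = re.compile(r"^\s*[<>]?\s*(\d+(?:[.,]\d+)?)\s*$")

def parse_value(raw):
    """Return the numeric value as a float, or None if the text is not a number."""
    match = VALUE.match(raw)
    if match is None:
        return None
    # Assumption for this sketch: a censored value like "<138" is kept as 138.
    return float(match.group(1).replace(",", "."))

def to_wide(rows):
    """Transpose (patient, test, raw value) rows into {patient: {test: mean value}}."""
    grouped = defaultdict(list)
    for patient, test, raw in rows:
        value = parse_value(raw)
        if value is not None:
            grouped[(patient, test)].append(value)
    wide = defaultdict(dict)
    for (patient, test), values in grouped.items():
        wide[patient][test] = mean(values)
    return {patient: dict(tests) for patient, tests in wide.items()}

# Hypothetical long-format lab data with typical entry problems.
RAW_ROWS = [
    ("p1", "glucose", " 5.1"),
    ("p1", "glucose", "5,5"),
    ("p1", "sodium", "140"),
    ("p2", "glucose", "n/a"),
    ("p2", "sodium", "<138"),
]

wide = to_wide(RAW_ROWS)
for patient in sorted(wide):
    print(patient, wide[patient])
```

Real clinical data needs far more rules than this (units, reference ranges, repeated visits), which is the point of the argument above about how much code such work takes in a general-purpose language.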
Re: [R] Tools For Preparing Data For Analysis
An additional option for Windows users is Micro Osiris http://www.microsiris.com/

best
robert

On 6/7/07, Robert Wilkins [EMAIL PROTECTED] wrote:

As noted on the R-project web site itself ( www.r-project.org - Manuals - R Data Import/Export ), it can be cumbersome to prepare messy and dirty data for analysis with the R tool itself.

... rest snipped ...
Re: [R] Tools For Preparing Data For Analysis
Robert Wilkins wrote:

As noted on the R-project web site itself (www.r-project.org - Manuals - R Data Import/Export), it can be cumbersome to prepare messy and dirty data for analysis with the R tool itself. I've also seen at least one S programming book (one of the yellow Springer ones) that says, more briefly, the same thing. The R Data Import/Export page recommends examples using SAS, Perl, Python, and Java. It takes a bit of courage to say that (when you go to a corporate software web site, you'll never see a page saying "This is the type of problem that our product is not the best at, here's what we suggest instead"). I'd like to provide a few more suggestions, especially for volunteers who are willing to evaluate new candidates.

SAS is fine if you're not paying for the license out of your own pocket. But maybe one reason you're using R is you don't have thousands of spare dollars. Using Java for data cleaning is an exercise in sado-masochism; Java has a learning curve (almost) as difficult as C++.

There are different types of data transformation, and for some data preparation problems an all-purpose programming language is a good choice (e.g. Perl, or maybe Python/Ruby). Perl, for example, has excellent regular expression facilities. However, for some types of complex, demanding data preparation problems, an all-purpose programming language is a poor choice. For example: cleaning up and preparing clinical lab data and adverse event data - you could do it in Perl, but it would take way, way too much time. A specialized programming language is needed. And since data transformation is quite different from data query, SQL is not the ideal solution either.

We deal with exactly those kinds of data solely using R. R is exceptionally powerful for data manipulation, just a bit hard to learn. Many examples are at http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

Frank

There are only three statistical programming languages that are well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more popular than S for data cleaning. If you're an R user with difficult data preparation problems, frankly you are out of luck, because the products I'm about to mention are new, unknown, and therefore regarded as immature. And while the founders of these products would be very happy if you kicked the tires, most people don't like to look at brand new products. Most innovators and inventors don't realize this; I've learned it the hard way. But if you are a volunteer who likes to help out by evaluating, comparing, and reporting upon new candidates, well, you could certainly help out R users and the developers of the products by kicking the tires of these products. And there is a huge need for such volunteers.

1. DAP
This is an open source implementation of SAS.
The founder: Susan Bassein
Find it at: directory.fsf.org/math/stats (GNU GPL)

2. PSPP
This is an open source implementation of SPSS. The relatively early version number might not give a good idea of how mature the data transformation features are; it reflects the fact that he has only started doing the statistical tests.
The founder: Ben Pfaff, either a grad student or professor at Stanford CS dept.
Also at: directory.fsf.org/math/stats (GNU GPL)

3. Vilno
This uses a programming language similar to SPSS and SAS, but quite unlike S. Essentially, it's a substitute for the SAS datastep, and also transposes data and calculates averages and such. (No t-tests or regressions in this version.) I created this, during the years 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in my opinion. The tarball includes about 100 or so test cases used for debugging - for logical calculation errors, but not for extremely high volumes of data.
The maintenance of Vilno has slowed down, because I am currently (desperately) looking for employment. But once I've found new employment and living quarters and settled in, I will continue to enhance Vilno in my spare time.
The founder: that would be me, Robert Wilkins
Find it at: code.google.com/p/vilno (GNU GPL) (in particular, the tarball at code.google.com/p/vilno/downloads/list, since I have yet to figure out how to use Subversion).

4. Who knows?
It was not easy to find out about the existence of DAP and PSPP. So who knows what else is out there. However, I think you'll find a lot more statistics software (regression, etc.) out there, and not so much data transformation software. Not many people work on data preparation software. In fact, the category is so obscure that there isn't one agreed term: data cleaning, data munging, data crunching, or just getting the data ready for analysis.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide