Re: [R] Tools For Preparing Data For Analysis

2007-06-22 Thread Kevin E. Thorpe
I am posting to this thread that has been quiet for some time because I
remembered the following question.

Christophe Pallier wrote:
 Hi,
 
 Can you provide examples of data formats that are problematic to read and
 clean with R ?

Today I had a data manipulation problem that I don't know how to do in R
so I solved it with perl.  Since I'm always interested in learning more
about complex data manipulation in R I am posting my problem in the
hopes of receiving some hints for doing this in R.

If anyone has nothing better to do than play with other people's data,
I would be happy to send the raw files off-list.

Background:

I have been given data that contains two measurements of left
ventricular ejection fraction.  One of the methods is echocardiogram
which sometimes gives a true quantitative value and other times a
semi-quantitative value.  The desire is to compare echo with the
other method (MUGA).  In most cases, patients had either quantitative
or semi-quantitative.  Some patients had both.  The data came
to me in Excel files with, basically, no patient identifiers to link
the "both" patients with the semi-quantitative patients (the "both"
patients were in multiple data sets).

What I wanted to do was extract from the semi-quantitative data file
those patients with only semi-quantitative.  All I have to link with
are the semi-quantitative echo and the MUGA and these pairs of values
are not unique.

To make this more concrete, here are some portions of the raw data.

Both

ID NUM,ECHO,MUGA,Semiquant,Quant
B,12,37,10,12
D,13,13,10,13
E,13,26,10,15
F,13,31,10,13
H,15,15,10,15
I,15,21,10,15
J,15,22,10,15
K,17,22,10,17
N,17.5,4,10,17.5
P,18,25,10,18
R,19,25,10,19

Semi-quantitative

echo,muga,quant
10,20,0  <-- keep
10,20,0  <-- keep
10,21,0  <-- remove
10,21,0  <-- keep
10,24,0  <-- keep
10,25,0  <-- remove
10,25,0  <-- remove
10,25,0  <-- keep

Here is the perl program I wrote for this.

#!/usr/bin/perl

open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
# Discard first row
$_ = <BOTH>;
while(<BOTH>) {
    chomp;
    ($id, $e, $m, $sq, $qu) = split(/,/);
    $both{$sq,$m}++;
}
close(BOTH);

open(OUT, "> qual_echo_only.csv") || die "Can't open qual_echo_only.csv";
print OUT "pid,echo,muga,quant\n";
$pid = 2001;

open(QUAL, "qual_echo.csv") || die "Can't open qual_echo.csv";
# Discard first row
$_ = <QUAL>;
while(<QUAL>) {
    chomp;
    ($echo, $muga, $quant) = split(/,/);
    if ($both{$echo,$muga} > 0) {
        $both{$echo,$muga}--;
    }
    else {
        print OUT "$pid,$echo,$muga,$quant\n";
        $pid++;
    }
}
close(QUAL);
close(OUT);

open(OUT, "> both_echo.csv") || die "Can't open both_echo.csv";
print OUT "pid,echo,muga,quant\n";
$pid = 3001;

open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
# Discard first row
$_ = <BOTH>;
while(<BOTH>) {
    chomp;
    ($id, $e, $m, $sq, $qu) = split(/,/);
    print OUT "$pid,$sq,$m,0\n";
    print OUT "$pid,$qu,$m,1\n";
    $pid++;
}
close(BOTH);
close(OUT);


-- 
Kevin E. Thorpe
Biostatistician/Trialist, Knowledge Translation Program
Assistant Professor, Department of Public Health Sciences
Faculty of Medicine, University of Toronto
email: [EMAIL PROTECTED]  Tel: 416.864.5776  Fax: 416.864.6057

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-22 Thread Christophe Pallier
If I understand correctly (from your Perl script)

1. you count the number of occurrences of each (echo, muga) pair in the
first file.

2. you remove from the second file the lines that correspond to these
occurrences.

If this is indeed your aim, here's a solution in R:

cumcount <- function(x) {
  y <- numeric(length(x))
  for (i in 1:length(y)) {
    y[i] = sum(x[1:i] == x[i])
  }
  y
}

both <- read.csv('both_echo.csv')
v <- table(paste(both$echo, "_", both$muga, sep=""))

semi <- read.csv('qual_echo.csv')
s <- paste(semi$echo, "_", semi$muga, sep="")
cs = cumcount(s)
count = v[s]
count[is.na(count)] = 0

semi2 <- data.frame(semi, s, cs, count, keep = cs > count)

> semi2
  echo muga quant s cs count  keep
1   10   20 0 10_20  1 0  TRUE
2   10   20 0 10_20  2 0  TRUE
3   10   21 0 10_21  1 1 FALSE
4   10   21 0 10_21  2 1  TRUE
5   10   24 0 10_24  1 0  TRUE
6   10   25 0 10_25  1 2 FALSE
7   10   25 0 10_25  2 2 FALSE
8   10   25 0 10_25  3 2  TRUE


My code is not very readable...
Yet, the 'trick' of using a helper function like 'cumcount' might be
instructive.
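
For comparison, a vectorized sketch of the same idea (not part of the
original post; the name 'cumcount2' is invented) could use ave() instead
of the explicit loop:

cumcount2 <- function(x) ave(seq_along(x), x, FUN = seq_along)
# numbers the occurrences of each value of x as 1, 2, 3, ... in order,
# so the keep/remove flag is again cumcount2(s) > count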

Christophe Pallier


On 6/22/07, Kevin E. Thorpe [EMAIL PROTECTED] wrote:

 I am posting to this thread that has been quiet for some time because I
 remembered the following question.

 Christophe Pallier wrote:
  Hi,
 
  Can you provide examples of data formats that are problematic to read
 and
  clean with R ?

 Today I had a data manipulation problem that I don't know how to do in R
 so I solved it with perl.  Since I'm always interested in learning more
 about complex data manipulation in R I am posting my problem in the
 hopes of receiving some hints for doing this in R.

 If anyone has nothing better to do than play with other people's data,
 I would be happy to send the row files off-list.

 Background:

 I have been given data that contains two measurements of left
 ventricular ejection fraction.  One of the methods is echocardiogram
 which sometimes gives a true quantitative value and other times a
 semi-quantitative value.  The desire is to compare echo with the
 other method (MUGA).  In most cases, patients had either quantitative
 or semi-quantitative.  Same patients had both.  The data came
 to me in excel files with, basically, no patient identifiers to link
 the both with the semi-quantitative patients (the both patients
 were in multiple data sets).

 What I wanted to do was extract from the semi-quantitative data file
 those patients with only semi-quantitative.  All I have to link with
 are the semi-quantitative echo and the MUGA and these pairs of values
 are not unique.

 To make this more concrete, here are some portions of the raw data.

 Both

 ID NUM,ECHO,MUGA,Semiquant,Quant
 B,12,37,10,12
 D,13,13,10,13
 E,13,26,10,15
 F,13,31,10,13
 H,15,15,10,15
 I,15,21,10,15
 J,15,22,10,15
 K,17,22,10,17
 N,17.5,4,10,17.5
 P,18,25,10,18
 R,19,25,10,19

 Seimi-quantitative

 echo,muga,quant
 10,20,0  -- keep
 10,20,0  -- keep
 10,21,0  -- remove
 10,21,0  -- keep
 10,24,0  -- keep
 10,25,0  -- remove
 10,25,0  -- remove
 10,25,0  -- keep

 Here is the perl program I wrote for this.

 #!/usr/bin/perl

 open(BOTH, quant_qual_echo.csv) || die Can't open quant_qual_echo.csv;
 # Discard first row;
 $_ = BOTH;
 while(BOTH) {
 chomp;
 ($id, $e, $m, $sq, $qu) = split(/,/);
 $both{$sq,$m}++;
 }
 close(BOTH);

 open(OUT,  qual_echo_only.csv) || die Can't open qual_echo_only.csv;
 print OUT pid,echo,muga,quant\n;
 $pid = 2001;

 open(QUAL, qual_echo.csv) || die Can't open qual_echo.csv;
 # Discard first row
 $_ = QUAL;
 while(QUAL) {
 chomp;
 ($echo, $muga, $quant) = split(/,/);
 if ($both{$echo,$muga}  0) {
 $both{$echo,$muga}--;
 }
 else {
 print OUT $pid,$echo,$muga,$quant\n;
 $pid++;
 }
 }
 close(QUAL);
 close(OUT);

 open(OUT,  both_echo.csv) || die Can't open both_echo.csv;
 print OUT pid,echo,muga,quant\n;
 $pid = 3001;

 open(BOTH, quant_qual_echo.csv) || die Can't open quant_qual_echo.csv;
 # Discard first row;
 $_ = BOTH;
 while(BOTH) {
 chomp;
 ($id, $e, $m, $sq, $qu) = split(/,/);
 print OUT $pid,$sq,$m,0\n;
 print OUT $pid,$qu,$m,1\n;
 $pid++;
 }
 close(BOTH);
 close(OUT);


 --
 Kevin E. Thorpe
 Biostatistician/Trialist, Knowledge Translation Program
 Assistant Professor, Department of Public Health Sciences
 Faculty of Medicine, University of Toronto
 email: [EMAIL PROTECTED]  Tel: 416.864.5776  Fax: 416.864.6057

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




-- 
Christophe Pallier (http://www.pallier.org)

Re: [R] Tools For Preparing Data For Analysis

2007-06-14 Thread Ted Harding
As a tangent to this thread, there is a very relevant
article in the latest issue of the RSS magazine Significance,
which I have just received:

  Dr Fisher's Casebook
  The trouble with data

Significance, Vol 4 (2007) Issue 2.

Full current contents at

http://www.blackwell-synergy.com/toc/sign/4/2

but unfortunately you can only read any of it by paying
money to Blackwell (unless you're an RSS member).

Best wishes to all,
Ted.


E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 14-Jun-07   Time: 12:24:46
-- XFMail --



Re: [R] Tools For Preparing Data For Analysis

2007-06-14 Thread John Kane

--- [EMAIL PROTECTED] wrote:

 As a tangent to this thread, there is a very relevant article in the
 latest issue of the RSS magazine Significance ...
 ... much snipped ...

A lovely article.  I'm not a member but the local
university has a subscription.  

The examples of men who claimed to have cervical 
smears (F) and women who were 5' tall weighing 15
stone (T) ring true.  

I've found people walking at 30 km/hr (F) and an
addict using 240 needles a month (T). I've even found
a set of 16 variables the study designers never heard
of!



Re: [R] Tools For Preparing Data For Analysis

2007-06-14 Thread Robert Wilkins
[Arrggh, not reply, but reply-to-all, cross my fingers again, sorry Peter!]

Hmm,

I don't think you need a RETAIN statement.

if first.patientID ;
or
if last.patientID ;

ought to do it.

It's actually better than the Vilno version, I must admit, a bit more concise:

if ( not firstrow(patientID) ) deleterow ;

Ah well.

**
For the folks asking for the location of the software (I know I posted it,
but it didn't connect to the thread, and you get a huge number of posts
each day, sorry):

Vilno , find at
http://code.google.com/p/vilno

DAP and PSPP, find at
http://directory.fsf.org/math/stats

Awk, find at lots of places,
http://www.gnu.org/software/gawk/gawk.html

Anything else? DAP and PSPP are hard to find, I'm sure there's more out there!
What about MDX? Nahh, not really the right problem domain.
Nobody uses MDX for this stuff.

**

If my examples, using clinical trial data, are boring and hard to
understand for those who asked for examples (and presumably don't work
in clinical trials), let me know. Some of these other examples I'm
reading about are quite interesting.
It doesn't help that clinical trial databases cannot be public. Making
a fake database would take a lot of time.
The irony is, even with my deep understanding of data preparation in
clinical trials, the pharmas still don't want to give me a job (because
I was gone for many years).


Let's see if this post works: thanks to the folks who gave me advice
on how to properly respond to a post within a thread. (Although the
thread in my gmail account is only a subset of the posts visible in
the archives.) Crossing my fingers ...

On 6/10/07, Peter Dalgaard [EMAIL PROTECTED] wrote:
  ... much snipped ...

  Case in point: Finding the first or the last observation for each
  subject when there are multiple records for each subject. The SAS way
  would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that
  you can compare the subject ID with the one from the previous record,
  working with data that are sorted appropriately.

  [the rest of Peter's message appears in full later in this digest]





Re: [R] Tools For Preparing Data For Analysis

2007-06-11 Thread Chris Evans
(Ted Harding) sent the following at 10/06/2007 09:28:

... much snipped ...

 (As is implicit in many comments in Robert's blog, and indeed also
 from many postings to this list over time and undoubtedly well
 known to many of us in practice, a lot of the problems with data
 files arise at the data gathering and entry stages, where people
 can behave as if stuffing unpaired socks and unattributed underwear
 randomly into a drawer, and then banging it shut).

And they look surprised when pointing a statistician at the chest of
drawers doesn't result in a cut price display worthy of Figleaf (or
Victoria's Secret I think for those of you in N.America) and get them
their degree, doctorate, latest publication ...

Ah me, how wonderfully, wonderfully ... sadly, accurate!

Thanks Ted, great thread and I'm impressed with EpiData, which I've
discovered through this. I'd still like something that is even more
integrated with R, but maybe some day, if EpiData goes fully open source
as I think they are doing ("A full conversion plan to secure this and
convert the software to open-source has been made (See complete
description of license and principles)." at http://www.epidata.dk/, but
the link to http://www.epidata.dk/about.htm doesn't exactly clarify
this, I don't think.  But I can hope.)

Thanks, yet again, to everyone who creates and contributes to the R
system and this list: wonderful!

C


-- 
Chris Evans [EMAIL PROTECTED] Skype: chris-psyctc
Professor of Psychotherapy, Nottingham University;
Consultant Psychiatrist in Psychotherapy, Notts PDD network;
Research Programmes Director, Nottinghamshire NHS Trust;
*If I am writing from one of those roles, it will be clear. Otherwise*
*my views are my own and not representative of those institutions*



Re: [R] Tools For Preparing Data For Analysis

2007-06-11 Thread Barry Rowlingson
Chris Evans wrote:

  ... much snipped ...

  Perhaps what we need is an XML standard for describing record-oriented 
data and its validation? This could then be used to validate a set of 
records and possibly also to build input forms with built-in validation 
for new records.

  You could then write R code that did 'check this data frame against 
this XML description and tell me the invalid rows'. Or Python code.
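
  As a rough base-R sketch of that check, with the rules held in an
ordinary R list rather than XML (the column names 'age' and 'sex' are
invented for the example):

rules <- list(
  age = function(x) !is.na(x) & x >= 0 & x <= 120,
  sex = function(x) x %in% c("M", "F")
)
invalid.rows <- function(df, rules) {
  ok <- sapply(names(rules), function(nm) rules[[nm]](df[[nm]]))
  which(!apply(ok, 1, all))              # rows failing at least one rule
}
d <- data.frame(age = c(34, -1, 200), sex = c("M", "X", "F"))
invalid.rows(d, rules)                   # rows 2 and 3

An XML description of the fields could simply be translated into such a
list of predicates.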

  This is the kind of thing that is traditionally built using a database 
front-end, but keeping the description in XML means that alternate 
interfaces (web forms, standalone programs using Qt or GTK libraries) 
can be used on the same description set.

  I had a quick search to see if this kind of thing exists already, but 
google searches for 'data entry verification' indicate that I should 
really pay some people in India to do that kind of thing for me...

Barry



Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Ted Harding
On 10-Jun-07 02:16:46, Gabor Grothendieck wrote:
 That can be elegantly handled in R through R's object
 oriented programming by defining a class for the fancy input.
 See this post:
   https://stat.ethz.ch/pipermail/r-help/2007-April/130912.html
 for a simple example of that style.
 
 On 6/9/07, Robert Wilkins [EMAIL PROTECTED] wrote:
 Here are some examples of the type of data crunching you might
 have to do.

 In response to the requests by Christophe Pallier and Martin Stevens.

 Before I started developing Vilno, some six years ago, I had
 been working in the pharmaceuticals for eight years ( it's not
 easy to show you actual data though, because it's all confidential
 of course).

I hadn't heard of Vilno before (except as a variant of Vilnius).
And it seems remarkably hard to find info about it from a Google
search. The best I've come up with, searching on

  vilno  data

is at
  http://www.xanga.com/datahelper

This is a blog site, apparently with postings by Robert Wilkins.

At the end of the Sunday, September 17, 2006 posting "Tedious
coding at the Pharmas" is a link:

  "I have created a new data crunching programming language.
   http://www.my.opera.com/datahelper"

which appears to be totally empty. In another blog article:

  "go to the www.my.opera.com/datahelper site, go to the August 31
   blog article, and there you will find a tarball-file to download,
   called vilnoAUG2006package.tgz"

so again inaccessible; and a Google on vilnoAUG2006package.tgz
gives a single hit which is simply the same article.

In the Xanga blog there are a few examples of tasks which are
no big deal in any programming language (and, relative to their
simplicity, appear a bit cumbersome in Vilno). 

I've not seen in the blog any instance of data transformation
which could not be quite easily done in any straightforward
language (even awk).

 Lab data can be especially messy, especially if one clinical
 trial allows the physicians to use different labs. So let's
 consider lab data.
 [...]

That's a fairly daunting description, though indeed not at all
extreme for the sort of data that can arise in practice (and
not just in pharmaceutical investigations). But the complexity
is in the situation, and, whatever language you use, the writing
of the program will involve the writer getting to grips with
the complexity, and the complexity will be present in the code
simply because of the need to accommodate all the special cases,
exceptions and faults that have to be anticipated in feral data.

Once these have been anticipated and incorporated in the code,
the actual transformations are again no big deal.

Frankly, I haven't yet seen anything in Vilno that couldn't be
accommodated in an 'awk' program. Not that I'm advocating awk for
universal use (I'm not that monolithic about it). But I'm using
it as my favourite example of a flexible, capable, transparent
and efficient data filtering language, as far as it goes.


SO: where can one find out more about Vilno, to see what it may
really be capable of that cannot be done so easily in other ways?


(As is implicit in many comments in Robert's blog, and indeed also
from many postings to this list over time and undoubtedly well
known to many of us in practice, a lot of the problems with data
files arise at the data gathering and entry stages, where people
can behave as if stuffing unpaired socks and unattributed underwear
randomly into a drawer, and then banging it shut).

Best wishes to all,
Ted.


E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07   Time: 09:28:10
-- XFMail --



Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Peter Dalgaard
Douglas Bates wrote:
 Frank Harrell indicated that it is possible to do a lot of difficult
 data transformation within R itself if you try hard enough but that
 sometimes means working against the S language and its whole object
 view to accomplish what you want and it can require knowledge of
 subtle aspects of the S language.
   
Actually, I think Frank's point was subtly different: It is *because* of 
the differences in view that it sometimes seems difficult to find the 
way to do something in R that  is apparently straightforward in SAS. 
I.e. the solutions exist and are often elegant, but may require some 
lateral thinking.

Case in point: Finding the first or the last observation for each 
subject when there are multiple records for each subject. The SAS way 
would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that 
you can compare the subject ID with the one from the previous record, 
working with data that are sorted appropriately.

You can do the same thing in R with a for loop, but there are better 
ways e.g.
subset(df,!duplicated(ID)), and subset(df, rev(!duplicated(rev(ID)))), or 
maybe
do.call(rbind,lapply(split(df,df$ID), head, 1)), resp. tail. Or 
something involving aggregate(). (The latter approaches generalize 
better to other within-subject functionals like cumulative doses, etc.).
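
A toy illustration of those idioms (not from the original post):

df <- data.frame(ID   = c(1, 1, 2, 2, 2, 3),
                 dose = c(10, 20, 5, 5, 10, 7))
subset(df, !duplicated(ID))                        # first record per subject
subset(df, rev(!duplicated(rev(ID))))              # last record per subject
do.call(rbind, lapply(split(df, df$ID), tail, 1))  # last record, via split()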

The hardest cases that I know of are the ones where you need to turn one 
record into many, such as occurs in survival analysis with 
time-dependent, piecewise constant covariates. This may require 
transposing the problem, i.e. for each  interval you find out which 
subjects contribute and with what, whereas the SAS way would be a 
within-subject loop over intervals containing an OUTPUT statement.

Also, there are some really weird data formats, where e.g. the input 
format is different in different records. Back in the 80's where 
punched-card input was still common, it was quite popular to have one 
card with background information on a patient plus several cards 
detailing visits, and you'd get a stack of cards containing both kinds. 
In R you would most likely split on the card type using grep() and then 
read the two kinds separately and merge() them later.
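
A rough sketch of that last approach, with the file name, card-type codes
and column names all invented for the example:

# lines starting "P" carry patient background, lines starting "V" carry visits
lines <- readLines("cards.txt")
pat <- read.csv(textConnection(grep("^P", lines, value = TRUE)),
                header = FALSE,
                col.names = c("type", "id", "sex", "birth"))
vis <- read.csv(textConnection(grep("^V", lines, value = TRUE)),
                header = FALSE,
                col.names = c("type", "id", "visit", "value"))
full <- merge(pat[-1], vis[-1], by = "id")   # one row per visit, with background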



Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Sarah Goslee
On 6/10/07, Ted Harding [EMAIL PROTECTED] wrote:

 ... a lot of the problems with data
 files arise at the data gathering and entry stages, where people
 can behave as if stuffing unpaired socks and unattributed underwear
 randomly into a drawer, and then banging it shut.

Not specifically R-related, but this would make a great fortune.

Sarah
-- 
Sarah Goslee
http://www.functionaldiversity.org



Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Stephen Tucker

Since R is supposed to be a complete programming language, I wonder
why these tools couldn't be implemented in R (unless speed is the
issue). Of course, it's a naive desire to have a single language that
does everything, but it seems that R currently has most of the
functions necessary to do the type of data cleaning described.

For instance, Gabor and Peter showed some snippets of ways to do this
elegantly; my [physical science] data is often not as horrendously
structured so usually I can get away with a program containing this
type of code

txtin <- scan(filename, what="", sep="\n")
filteredList <- lapply(strsplit(txtin, delimiter), FUN=filterfunction)
   # filterfunction() returns selected (and possibly transformed)
   # elements if present and NULL otherwise
   # may include calls to grep(), regexpr(), gsub(), substring(),...
   # nchar(), sscanf(), type.convert(), paste(), etc.
mydataframe <- do.call(rbind, filteredList)
   # then match(), subset(), aggregate(), etc.

In the case that the file is large, I open a file connection and scan
a single line + apply filterfunction() successively in a FOR-LOOP
instead of using lapply(). Of course, the devil is in the details of
the filtering function, but I believe most of the required text
processing facilities are already provided by R.
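
Roughly, that loop might look like the following (a sketch only;
filterfunction() and the file and delimiter names are placeholders, as
above):

con <- file("datafile.txt", open = "r")
results <- list()
repeat {
  line <- readLines(con, n = 1)
  if (length(line) == 0) break                 # end of file
  rec <- filterfunction(strsplit(line, delimiter)[[1]])
  if (!is.null(rec)) results[[length(results) + 1]] <- rec
}
close(con)
mydataframe <- do.call(rbind, results)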

I often have tasks that involve a combination of shell-scripting and
text processing to construct the data frame for analysis; I started
out using Python+NumPy to do the front-end work but have been using R
progressively more (frankly, all of it) to take over that portion
since I generally prefer the data structures and methods in R.


--- Peter Dalgaard [EMAIL PROTECTED] wrote:

  ... much snipped ... [Peter's message, which appears in full earlier in
  this digest, was quoted here]



  




Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Ted Harding
On 10-Jun-07 14:04:44, Sarah Goslee wrote:
 On 6/10/07, Ted Harding [EMAIL PROTECTED] wrote:
 
 ... a lot of the problems with data
 files arise at the data gathering and entry stages, where people
 can behave as if stuffing unpaired socks and unattributed underwear
 randomly into a drawer, and then banging it shut.
 
 Not specifically R-related, but this would make a great fortune.
 
 Sarah
 -- 
 Sarah Goslee
 http://www.functionaldiversity.org

I'm not going to object to that!
Ted.


E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07   Time: 21:18:45
-- XFMail --



Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Ted Harding
On 10-Jun-07 19:27:50, Stephen Tucker wrote:
 
 Since R is supposed to be a complete programming language,
 I wonder why these tools couldn't be implemented in R
 (unless speed is the issue). Of course, it's a naive desire
 to have a single language that does everything, but it seems
 that R currently has most of the functions necessary to do
 the type of data cleaning described.

In principle that is certainly true. A couple of comments,
though.

1. R's rich data structures are likely to be superfluous.
   Mostly, at the sanitisation stage, one is working with
   flat files (row & column). This straightforward format
   is often easier to handle using simple programs for the
   kind of basic filtering needed, rather than getting into
   the heavier programming constructs of R.

2. As follow-on and contrast at the same time, very often
   what should be a nice flat file with no rough edges is not.
   If there are variable numbers of fields per line, R will
   not handle it straightforwardly (you can force it in,
   but it's more elaborate). There are related issues as well.

a) If someone entering data into an Excel table lets their
   cursor wander outside the row/col range of the table,
   this can cause invisible entities to be planted in the
   extraneous cells. When saved as a CSV, this file then
   has variable numbers of fields per line, and possibly
   also extra lines with arbitrary blank fields.

   cat datafile.csv | awk 'BEGIN{FS=","}{n=NF;print n}'

   will give you the numbers of fields in each line.

   If you further pipe it into | sort -nu you will get
   the distinct field-numbers. If you know (by now) how many
   fields there should be (e.g. 10), then

   cat datafile.csv | awk 'BEGIN{FS=","} (NF != 10){print NR ", " NF}'

   will tell you which lines have the wrong number of fields,
   and how many fields they have. You can similarly count how
   many lines there are (e.g. pipe into wc -l).

b) People sometimes randomly use a blank space or a "." in a
   cell to denote a missing value. Consistent use of either
   is OK: ",," in a CSV will be treated as NA by R. The use
   of "." can be more problematic. If for instance you try to
   read the following CSV into R as a dataframe:

   1,2,.,4
   2,.,4,5
   3,4,.,6

   the "." in cols 2 and 3 is treated as the character ".",
   with the result that something complicated happens to
   the typing of the items.

   typeof(D[i,j]) is always "integer". sum(D[1,1]) = 1, but
   sum(D[1,2]) gives a type error, even though the entry
   is in fact 2. And so on, in various combinations.

   And as.matrix(D) is of course a matrix of characters.

   In fact, columns 2 and 3 of D are treated as factors!

   for(i in (1:3)){ for(j in (1:4)){ print(D[i,j]) }}
   [1] 1
   [1] 2
   Levels: . 2 4
   [1] .
   Levels: . 4
   [1] 4
   [1] 2
   [1] .
   Levels: . 2 4
   [1] 4
   Levels: . 4
   [1] 5
   [1] 3
   [1] 4
   Levels: . 2 4
   [1] .
   Levels: . 4
   [1] 6

   This is getting altogether too complicated for the job
   one wants to do!

   And it gets worse when people mix ",," and ",.,"!

   On the other hand, a simple brush with awk (or sed in
   this case) can sort it once and for all, without waking
   the sleeping dogs in R.

I could go on. R undoubtedly has the power, but it can very
quickly get over-complicated for simple jobs.

Best wishes to all,
Ted.


E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07   Time: 22:14:35
-- XFMail --



Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread roger koenker
An important potential benefit of R solutions, shared by awk, sed, ...,
is that they provide a reproducible way to document exactly how one got
from one version of the data to the next.  This seems to be the main
problem with handicraft methods like editing Excel files: it is too
easy to introduce new errors that can't be tracked down at later
stages of the analysis.


url:    www.econ.uiuc.edu/~roger        Roger Koenker
email   [EMAIL PROTECTED]               Department of Economics
vox:    217-333-4558                    University of Illinois
fax:    217-244-6678                    Champaign, IL 61820


On Jun 10, 2007, at 4:14 PM, (Ted Harding) wrote:

  ... much snipped ... [Ted's message, which appears in full earlier in
  this digest, was quoted here]


Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Stephen Tucker
Embarrassingly, I don't know awk or sed, but R's code seems to be
shorter for most tasks than Python, which is my basis for comparison.

It's true that R's more powerful data structures usually aren't
necessary for the data cleaning, but sometimes in the filtering
process I will pick out lines that contain certain data, in which case
I have to convert text to numbers and perform operations like
which.min(), order(), etc., so in that sense I like to have R's
vectorized notation and the objects/functions that support it.

As far as some of the tasks you described, I've tried transcribing
them to R. I know you provided only the simplest examples, but even in
these cases I think R's functions for handling these situations
exemplify their usefulness in this step of the analysis. But perhaps
you would argue that this code is too long... In any event it will
still save the trouble of keeping track of an extra (intermediate)
file passed between awk and R.

(1) The numbers of fields in each line, equivalent to
cat datafile.csv | awk 'BEGIN{FS=","}{n=NF;print n}'
in awk:

# R equivalent:
nFields <- count.fields("datafile.csv", sep=",")
# or
nFields <- sapply(strsplit(readLines("datafile.csv"), ","), length)

(2) which lines have the wrong number of fields, and how many fields
they have. You can similarly count how many lines there are (e.g. pipe
into wc -l).

# number of lines with wrong number of fields
nWrongFields <- length(nFields[nFields > 10])

# select only first ten fields from each line
# and return a matrix
firstTenFields <-
  do.call(rbind,
          lapply(strsplit(readLines("datafile.csv"), ","),
                 function(x) x[1:10]))

# select only those lines which contain ten fields
# and return a matrix
onlyTenFields <-
  do.call(rbind,
          lapply(strsplit(readLines("datafile.csv"), ","),
                 function(x) if(length(x) >= 10) x else NULL))

(3)
If for instance you try to
read the following CSV into R as a dataframe:
 
1,2,.,4
2,.,4,5
3,4,.,6
 

txtC <- textConnection(
"1,2,.,4
2,.,4,5
3,4,.,6")
# using read.csv() specifying na.string argument:
> read.csv(txtC, header=FALSE, na.string=".")
  V1 V2 V3 V4
1  1  2 NA  4
2  2 NA  4  5
3  3  4 NA  6

# Of course, read.csv will work only if data is formatted correctly.
# More generally, using readLines(), strsplit(), etc., which are more
# flexible :

> do.call(rbind,
+   lapply(strsplit(readLines(txtC), ","),
+          type.convert, na.string="."))
 [,1] [,2] [,3] [,4]
[1,]12   NA4
[2,]2   NA45
[3,]34   NA6

(4) Situations where people mix ",," and ",.,"!

# type.convert (and read.csv) will still work when missing values are ",,"
# and ",.," (automatically recognizes "" as NA and, through
# specification of 'na.string', can recognize "." as NA)

# If it is desired to convert "." to "" first, this is simple as
# well:

m <- do.call(rbind,
     lapply(strsplit(readLines(txtC), ","),
            function(x) gsub("^\\.$", "", x)))
> m
     [,1] [,2] [,3] [,4]
[1,] "1"  "2"  ""   "4"
[2,] "2"  ""   "4"  "5"
[3,] "3"  "4"  ""   "6"

# then
mode(m) <- "numeric"
# or
m <- apply(m, 2, type.convert)
# will give
> m
 [,1] [,2] [,3] [,4]
[1,]12   NA4
[2,]2   NA45
[3,]34   NA6


--- [EMAIL PROTECTED] wrote:

 ... much snipped ... [Ted's message, which appears in full earlier in
 this digest, was quoted here; the quotation is truncated in the archive]

Re: [R] Tools For Preparing Data For Analysis

2007-06-09 Thread Robert Wilkins
Here are some examples of the type of data crunching you might have to do.

In response to the requests by Christophe Pallier and Martin Stevens.

Before I started developing Vilno, some six years ago, I had been working in
pharmaceuticals for eight years (it's not easy to show you actual data,
though, because it's all confidential of course).

Lab data can be especially messy, especially if one clinical trial allows
the physicians to use different labs. So let's consider lab data.

Merge in normal ranges, into the lab data. This has to be done by lab-site
and lab testcode (PLT for platelets, etc.), obviously. I've seen cases where
you also need to match by sex and age. The sex column in the normal ranges
could be: blank, F, M, or B (B meaning for Both sexes). The age column in
the normal ranges could be: blank, or something like "40 to 55". Even worse,
you could have an ageunits column in the normal ranges dataset: usually "Y",
but if there are children in the clinical trial, you will have "D" or "M",
for Days and Months. If the clinical trial is for adults, all rows with "D"
or "M" should be tossed out at the start. Clearly the statistical programmer
has to spend time looking at the data, before writing the program. Remember,
all of these details can change any time you move to a new clinical trial.

So for the lab data, you have to merge in the patient's date of birth,
calculate age, and somehow relate that to the age-group column in the normal
ranges dataset.

(By the way, in clinical trial data preparation, the SAS datastep is much
more useful and convenient, in my opinion, than the SQL SELECT syntax, at
least 97% of the time. But in the middle of this program, when you merge the
normal ranges into the lab data, you get a better solution with PROC SQL (
just the SQL SELECT statement implemented inside SAS) This is because of the
trickiness of the age match-up, and the SAS datastep does not do well with
many-to-many joins.).
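
In R, one possible way to express that kind of range match is to merge on
the exact keys and then filter on the sex and age conditions afterwards;
the small sketch below uses invented column names and values:

lab    <- data.frame(labsite = "A", testcode = "PLT", sex = "F", age = 47,
                     result = 230)
ranges <- data.frame(labsite = "A", testcode = "PLT", sexrng = "B",
                     agelow = c(18, 65), agehigh = c(65, 120),
                     lo = c(150, 140), hi = c(400, 380))
m <- merge(lab, ranges, by = c("labsite", "testcode"))   # many-to-many join
m <- subset(m, (sexrng == "B" | sexrng == sex) &
               age >= agelow & age < agehigh)            # keep matching window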

Merge in various study drug administration dates into the lab data. Now, for
each lab record, calculate treatment period ( or cycle number ), depending
on the statistician's specifications and the way the clinical trial is
structured.

Different clinical sites chose to use different lab providers. So, for
example, for Monocytes, you have 10 different units ( essentially 6 units,
but spelling inconsistencies as well). The statistician has requested that
you use standardized units in some of the listings ( % units, and only one
type of non-% unit, for example ). At the same time, lab values need to be
converted ( *1.61 , divide by 1000, etc. ). This can be very time consuming
no matter what software you use, and, in my experience, when the SAS
programmer asks for more clinical information or lab guidebooks, the
response is incomplete, so he does a lot of guesswork. SAS programmers do
not have expertise in lab science, hence the guesswork.

Your program has to accommodate numeric values ("1.54"), quasi-numeric
values ("<1"), and non-numeric values ("Trace").

Your data listing is tight for space, so print "PROLONGED CELL CONT" as
"PRCC".

Once normal ranges are merged in, figure out which values are out-of-range
and high, which are low, and which are within normal range. In the data
listing, you may have "H" or "L" appended to the result value being printed.

For each treatment period, you may need a unique lab record selected, in
case there are two or three for the same treatment period. The statistician
will tell the SAS programmer how. Maybe the averages of the results for that
treatment period, maybe that lab record closest to the mid-point of of the
treatment period. This isn't for the data listing, but for a summary table.

For the differentials ( monocytes, lymphocytes, etc) , merge in the WBC
(total white blood cell count) values , to convert values between % units
and absolute count units.

When printing the values in the data listing, you need "H" or "L" to the
right of the value. But you also need the values to be well lined up (the
decimal place). This can be stupidly time consuming.
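
In R, at least, the lining-up part is fairly painless with formatC() (a
small illustration, not part of the original post):

vals  <- c(1.5, 12, 0.76)
flags <- c("", "H", "L")
formatC(vals, format = "f", digits = 1, width = 6)   # "   1.5" "  12.0" "   0.8"
paste(formatC(vals, format = "f", digits = 1, width = 6), flags)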



AND ON AND ON AND ON .

I think you see why clinical trials statisticians and SAS programmers enjoy
lots of job security.



On 6/8/07, Martin Henry H. Stevens [EMAIL PROTECTED] wrote:

 Is there an example available of this sort of problematic data that
 requires this kind of data screening and filtering? For many of us,
 this issue would be nice to learn about, and deal with within R. If a
 package could be created, that would be optimal for some of us. I
 would like to learn a tad more, if it were not too much effort for
 someone else to point me in the right direction?
 Cheers,
 Hank
 On Jun 8, 2007, at 8:47 AM, Douglas Bates wrote:

  On 6/7/07, Robert Wilkins [EMAIL PROTECTED] wrote:
  As noted on the R-project web site itself ( www.r-project.org -
  Manuals - R Data Import/Export ), it can be cumbersome to prepare
  messy and dirty data for analysis with the R tool itself. I've also
  seen at least one S 

Re: [R] Tools For Preparing Data For Analysis

2007-06-09 Thread Gabor Grothendieck
That can be elegantly handled in R through R's object oriented programming
by defining a class for the fancy input.  See this post:
  https://stat.ethz.ch/pipermail/r-help/2007-April/130912.html
for a simple example of that style.
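
The flavour of that style, without reproducing the linked post's code (all
names below are invented), is roughly:

read.labfile <- function(file) {
  raw <- read.csv(file, as.is = TRUE)
  structure(raw, class = c("labdata", "data.frame"))   # custom class on top
}
print.labdata <- function(x, ...) {
  cat("Lab data:", nrow(x), "records\n")
  NextMethod()                                 # then usual data frame printing
}

The messy input conventions then live in the reader and its methods, while
the object still behaves like a data frame everywhere else.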


On 6/9/07, Robert Wilkins [EMAIL PROTECTED] wrote:
 ... much snipped ... [Robert's message, which appears in full earlier in
 this digest, was quoted here]

This could be readily handled in R using object oriented programming.
You would specify a class for the strange input,



Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Christophe Pallier
Hi,

Can you provide examples of data formats that are problematic to read and
clean with R ?

The only problematic cases I have encountered were cases with multiline
and/or varying length records (optional information). Then, it is sometimes
a good idea to preprocess the data to present it in a tabular format (one
record per line).

For this purpose, I use awk (e.g. http://www.vectorsite.net/tsawk.html),
which is very adept at processing ascii data files (awk is much simpler to
learn than perl, spss, sas, ...).

I have never encountered a data file in ascii format that I could not
reformat with Awk.  With binary formats, it is another story...

But, again, this is my limited experience; I would like to know if there are
situations where using SAS/SPSS is really a better approach.

Christophe Pallier


On 6/8/07, Robert Wilkins [EMAIL PROTECTED] wrote:

 As noted on the R-project web site itself ( www.r-project.org -
 Manuals - R Data Import/Export ), it can be cumbersome to prepare
 messy and dirty data for analysis with the R tool itself. I've also
 seen at least one S programming book (one of the yellow Springer ones)
 that says, more briefly, the same thing.
 The R Data Import/Export page recommends examples using SAS, Perl,
 Python, and Java. It takes a bit of courage to say that ( when you go
 to a corporate software web site, you'll never see a page saying This
 is the type of problem that our product is not the best at, here's
 what we suggest instead ). I'd like to provide a few more
 suggestions, especially for volunteers who are willing to evaluate new
 candidates.

 SAS is fine if you're not paying for the license out of your own
 pocket. But maybe one reason you're using R is you don't have
 thousands of spare dollars.
 Using Java for data cleaning is an exercise in sado-masochism, Java
 has a learning curve (almost) as difficult as C++.

 There are different types of data transformation, and for some data
 preparation problems an all-purpose programming language is a good
 choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has
 excellent regular expression facilities.

 However, for some types of complex demanding data preparation
 problems, an all-purpose programming language is a poor choice. For
 example: cleaning up and preparing clinical lab data and adverse event
 data - you could do it in Perl, but it would take way, way too much
 time. A specialized programming language is needed. And since data
 transformation is quite different from data query, SQL is not the
 ideal solution either.

 There are only three statistical programming languages that are
 well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more
 popular than S for data cleaning.

 If you're an R user with difficult data preparation problems, frankly
 you are out of luck, because the products I'm about to mention are
 new, unknown, and therefore regarded as immature. And while the
 founders of these products would be very happy if you kicked the
 tires, most people don't like to look at brand new products. Most
 innovators and inventers don't realize this, I've learned it the hard
 way.

 But if you are a volunteer who likes to help out by evaluating,
 comparing, and reporting upon new candidates, well you could certainly
 help out R users and the developers of the products by kicking the
 tires of these products. And there is a huge need for such volunteers.

 1. DAP
 This is an open source implementation of SAS.
 The founder: Susan Bassein
 Find it at: directory.fsf.org/math/stats (GNU GPL)

 2. PSPP
 This is an open source implementation of SPSS.
 The relatively early version number might not give a good idea of how
 mature the
 data transformation features are, it reflects the fact that he has
 only started doing the statistical tests.
 The founder: Ben Pfaff, either a grad student or professor at Stanford CS
 dept.
 Also at : directory.fsf.org/math/stats (GNU GPL)

 3. Vilno
 This uses a programming language similar to SPSS and SAS, but quite unlike
 S.
 Essentially, it's a substitute for the SAS datastep, and also
 transposes data and calculates averages and such. (No t-tests or
 regressions in this version). I created this, during the years
 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in
 my opinion. The tarball includes about 100 or so test cases used for
 debugging - for logical calculation errors, but not for extremely high
 volumes of data.
 The maintenance of Vilno has slowed down, because I am currently
 (desparately) looking for employment. But once I've found new
 employment and living quarters and settled in, I will continue to
 enhance Vilno in my spare time.
 The founder: that would be me, Robert Wilkins
 Find it at: code.google.com/p/vilno ( GNU GPL )
 ( In particular, the tarball at code.google.com/p/vilno/downloads/list
 , since I have yet to figure out how to use Subversion ).


 4. Who knows?
 It was not easy to find out about the existence of DAP and 

Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Ted Harding
On 08-Jun-07 08:27:21, Christophe Pallier wrote:
 Hi,
 
 Can you provide examples of data formats that are problematic
 to read and clean with R ?
 
 The only problematic cases I have encountered were cases with
 multiline and/or  varying length records (optional information).
 Then, it is sometimes a good idea to preprocess the data to
 present in a tabular format (one record per line).
 
 For this purpose, I use awk (e.g.
 http://www.vectorsite.net/tsawk.html),
 which is very adept at processing ascii data files  (awk is
 much simpler to learn than perl, spss, sas, ...).

I want to join in with an enthusiastic "Me too!!". For anything
which has to do with basic checking for the kind of messes that
people can get data into when they put it on the computer,
I think awk is ideal. It is very flexible (far more so than
many, even long-time, awk users suspect), very transparent
in its programming language (as opposed to, say, perl), fast,
and with a light impact on system resources (a rare delight in
these days, when upgrading your software may require upgrading
your hardware).

Although it may seem on the surface that awk is two-dimensional
in its view of data (line by line, and per field in a line),
it has some flexible internal data structures and recursive
function capability, which allows a lot more to be done with
the data that have been read in.

For example, I've used awk to trace ancestry through a genealogy,
given a data file where each line includes the identifier of an
individual and the identifiers of its male and female parents
(where known). And that was for pedigree dogs, where what happens
in real life makes Oedipus look trivial.
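
(The same kind of trace, sketched in R rather than awk, with made-up
column names id, sire and dam:)

ped  <- read.csv("pedigree.csv", stringsAsFactors = FALSE)
sire <- setNames(ped$sire, ped$id)
dam  <- setNames(ped$dam,  ped$id)

ancestors <- function(id) {
  # walk up both parental lines; unknown parents are "" or NA
  if (is.na(id) || !nzchar(id) || !(id %in% names(sire))) return(character(0))
  p <- c(sire[[id]], dam[[id]])
  p <- p[!is.na(p) & nzchar(p)]
  unique(c(p, unlist(lapply(p, ancestors))))
}

ancestors("CH-1234")   # all known ancestors of one individual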

 I have never encountered a data file in ascii format that I
 could not reformat with Awk.  With binary formats, it is
 another story...

But then it is a good idea to process the binary file using an
instance of the creating software, to produce an ASCII file (say
in CSV format).

 But, again, this is my limited experience; I would like to
 know if there are situations where using SAS/SPSS is really
 a better approach.

The main thing often useful for data cleaning that awk does
not have is any associated graphics. It is -- by design -- a
line-by-line text-file processor. While, for instance, you
could use awk to accumulate numerical histogram counts, you
would have to use something else to display the histogram.
And for scatter-plots there's probably not much point in
bringing awk into the picture at all (unless a preliminary
filtration of mess is needed anyway).

That being said, though, there can still be a use to extract
data fields from a file for submission to other software.

Another kind of area where awk would not have much to offer
is where, as a part of your preliminary data inspection,
you want to inspect the results of some standard statistical
analyses.

As a final comment, utilities like awk can be used far more
fruitfully on operating systems (the unixoid family) which
incorporate at ground level the infrastructure for plumbing
together streams of data output from different programs.

Ted.


E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 08-Jun-07   Time: 10:43:05
-- XFMail --

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Douglas Bates
On 6/7/07, Robert Wilkins [EMAIL PROTECTED] wrote:
 As noted on the R-project web site itself ( www.r-project.org -
 Manuals - R Data Import/Export ), it can be cumbersome to prepare
 messy and dirty data for analysis with the R tool itself. I've also
 seen at least one S programming book (one of the yellow Springer ones)
 that says, more briefly, the same thing.
 The R Data Import/Export page recommends examples using SAS, Perl,
 Python, and Java. It takes a bit of courage to say that ( when you go
 to a corporate software web site, you'll never see a page saying This
 is the type of problem that our product is not the best at, here's
 what we suggest instead ). I'd like to provide a few more
 suggestions, especially for volunteers who are willing to evaluate new
 candidates.

 SAS is fine if you're not paying for the license out of your own
 pocket. But maybe one reason you're using R is you don't have
 thousands of spare dollars.
 Using Java for data cleaning is an exercise in sado-masochism, Java
 has a learning curve (almost) as difficult as C++.

 There are different types of data transformation, and for some data
 preparation problems an all-purpose programming language is a good
 choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has
 excellent regular expression facilities.

 However, for some types of complex demanding data preparation
 problems, an all-purpose programming language is a poor choice. For
 example: cleaning up and preparing clinical lab data and adverse event
 data - you could do it in Perl, but it would take way, way too much
 time. A specialized programming language is needed. And since data
 transformation is quite different from data query, SQL is not the
 ideal solution either.

 There are only three statistical programming languages that are
 well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more
 popular than S for data cleaning.

 If you're an R user with difficult data preparation problems, frankly
 you are out of luck, because the products I'm about to mention are
 new, unknown, and therefore regarded as immature. And while the
 founders of these products would be very happy if you kicked the
 tires, most people don't like to look at brand new products. Most
 innovators and inventers don't realize this, I've learned it the hard
 way.

 But if you are a volunteer who likes to help out by evaluating,
 comparing, and reporting upon new candidates, well you could certainly
 help out R users and the developers of the products by kicking the
 tires of these products. And there is a huge need for such volunteers.

 1. DAP
 This is an open source implementation of SAS.
 The founder: Susan Bassein
 Find it at: directory.fsf.org/math/stats (GNU GPL)

 2. PSPP
 This is an open source implementation of SPSS.
 The relatively early version number might not give a good idea of how
 mature the
 data transformation features are, it reflects the fact that he has
 only started doing the statistical tests.
 The founder: Ben Pfaff, either a grad student or professor at Stanford CS 
 dept.
 Also at : directory.fsf.org/math/stats (GNU GPL)

 3. Vilno
 This uses a programming language similar to SPSS and SAS, but quite unlike S.
 Essentially, it's a substitute for the SAS datastep, and also
 transposes data and calculates averages and such. (No t-tests or
 regressions in this version). I created this, during the years
 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in
 my opinion. The tarball includes about 100 or so test cases used for
 debugging - for logical calculation errors, but not for extremely high
 volumes of data.
 The maintenance of Vilno has slowed down, because I am currently
 (desparately) looking for employment. But once I've found new
 employment and living quarters and settled in, I will continue to
 enhance Vilno in my spare time.
 The founder: that would be me, Robert Wilkins
 Find it at: code.google.com/p/vilno ( GNU GPL )
 ( In particular, the tarball at code.google.com/p/vilno/downloads/list
 , since I have yet to figure out how to use Subversion ).

 4. Who knows?
 It was not easy to find out about the existence of DAP and PSPP. So
 who knows what else is out there. However, I think you'll find a lot
 more statistics software ( regression , etc ) out there, and not so
 much data transformation software. Not many people work on data
 preparation software. In fact, the category is so obscure that there
 isn't one agreed term: data cleaning , data munging , data crunching ,
 or just getting the data ready for analysis.

Thanks for bringing up this topic.  I think there is definitely a
place for such languages, which I would regard as data-filtering
languages, but I also think that trying to reproduce the facilities in
SAS or SPSS for data analysis is redundant.

Other responses in this thread have mentioned 'little language'
filters like awk, which is fine for those who were raised in the Bell
Labs tradition of programming (why type three characters when two
character names should suffice for anything one wants to do on a
PDP-11) but the typical field scientist finds this a bit too terse to
understand and would rather write a filter as a paragraph of code that
they have a chance of reading and understanding a week later.

Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Wensui Liu
I had mentioned exactly the same thing to others, and the feedback I got was
'when you have a hammer, everything will look like a nail'
^_^.

On 6/7/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote:
 Robert Wilkins wrote:
  As noted on the R-project web site itself ( www.r-project.org -
  Manuals - R Data Import/Export ), it can be cumbersome to prepare
  messy and dirty data for analysis with the R tool itself. I've also
  seen at least one S programming book (one of the yellow Springer ones)
  that says, more briefly, the same thing.
  The R Data Import/Export page recommends examples using SAS, Perl,
  Python, and Java. It takes a bit of courage to say that ( when you go
  to a corporate software web site, you'll never see a page saying This
  is the type of problem that our product is not the best at, here's
  what we suggest instead ). I'd like to provide a few more
  suggestions, especially for volunteers who are willing to evaluate new
  candidates.
 
  SAS is fine if you're not paying for the license out of your own
  pocket. But maybe one reason you're using R is you don't have
  thousands of spare dollars.
  Using Java for data cleaning is an exercise in sado-masochism, Java
  has a learning curve (almost) as difficult as C++.
 
  There are different types of data transformation, and for some data
  preparation problems an all-purpose programming language is a good
  choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has
  excellent regular expression facilities.
 
  However, for some types of complex demanding data preparation
  problems, an all-purpose programming language is a poor choice. For
  example: cleaning up and preparing clinical lab data and adverse event
  data - you could do it in Perl, but it would take way, way too much
  time. A specialized programming language is needed. And since data
  transformation is quite different from data query, SQL is not the
  ideal solution either.

 We deal with exactly those kinds of data solely using R.  R is
 exceptionally powerful for data manipulation, just a bit hard to learn.
   Many examples are at
 http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

 Frank

 
  ... rest snipped ...

Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Martin Henry H. Stevens
Is there an example available of this sort of problematic data that
requires this kind of data screening and filtering? For many of us,
this issue would be nice to learn about, and deal with, within R. If a
package could be created, that would be optimal for some of us. I
would like to learn a tad more, if it is not too much effort for
someone else to point me in the right direction.
Cheers,
Hank
On Jun 8, 2007, at 8:47 AM, Douglas Bates wrote:

 On 6/7/07, Robert Wilkins [EMAIL PROTECTED] wrote:
 ... rest snipped ...

Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Chris Evans

Martin Henry H. Stevens sent the following  at 08/06/2007 15:11:
 Is there an example available of this sort of problematic data that  
 requires this kind of data screening and filtering? For many of us,  
 this issue would be nice to learn about, and deal with within R. If a  
 package could be created, that would be optimal for some of us. I  
 would like to learn a tad more, if it were not too much effort for  
 someone else to point me in the right direction?
 Cheers,
 Hank
 On Jun 8, 2007, at 8:47 AM, Douglas Bates wrote:
 
 On 6/7/07, Robert Wilkins [EMAIL PROTECTED] wrote:
 As noted on the R-project web site itself ( www.r-project.org -

... rest snipped ...

OK, I can't resist that invitation.  I think there are many kinds of
problematic data.  I handle some nasty textish things in perl (and I
loved the purgatory quote) and I'm afraid I do some things in Excel and
some cleaning I can handle in R, but I never enter data directly into R.

However, one very common scenario I have faced all my working life is
psych data from questionnaires or interviews in low-budget work, mostly
student research or routine entry of therapists' data.  Typically you
have an identifier, a date, some demographics and then a lot of item
data.  There's little money (usually zero) involved for data entry and
cleaning, but I've produced a lot of good(ish) papers out of this sort of
very low-budget work over the last 20 years.  (Right at the other end of
a financial spectrum from the FDA/validated s'ware thread but this is
about validation again!)

The problem I often face is that people are lousy data entry machines
(well, actually, they vary ... enormously) and if they mess up the data
entry we all know how horrible this can be.

SPSS (boo hiss) used to have an excellent module, actually a
standalone PC/Windoze program, that allowed you to define variables so
they had allowed values, and it would refuse to accept out-of-range or
otherwise unacceptable entries; it also allowed you to create checking rules
and rules that would, in the light of earlier entries, set later values
and not ask about them.  In a rudimentary way you could also lay things
out on the screen so that it paginated where the q'aire or paper data
record did etc.  The final nice touch was that you could define some
variables as invariant and then set the thing so an independent data
entry person could re-enter the other data (i.e. pick up the q'aire, see if
the ID fits the one showing on screen and, if so, enter the rest of the data).
It would bleep and not move on if you entered a value other than that
entered by the first person, and you had to confirm that one of you was
right.

That saved me, I'm sure, weeks wasted on analysing data that turned out to
be awful, and I'd love to see someone build something to replace it.

Currently I tend to use (boo hiss) Excel for this as everyone I work
with seems to have it (and not all can install open office and anyway I
haven't had time to learn that properly yet either ...) and I set up
spreadsheets with validation rules set.  That doesn't get the branching
rules and checks (e.g. if male, skip questions about periods, PMT and
pregnancies), or at least, with my poor Excel skills it doesn't.  I just
skip a column to indicate page breaks in the q'aire, and I get, when I
can, two people to enter the data separately and then use R to compare
the two spreadsheets having yanked them into data frames.
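
(For what it's worth, that comparison step only takes a few lines of base
R; the file names and the "id" column below are just for illustration, and
both sheets are assumed to have been exported with the same columns:)

a <- read.csv("entry_person1.csv", stringsAsFactors = FALSE)
b <- read.csv("entry_person2.csv", stringsAsFactors = FALSE)
b <- b[match(a$id, b$id), ]                  # line the rows up on the identifier

mism <- which(a != b | xor(is.na(a), is.na(b)), arr.ind = TRUE)
data.frame(id     = a$id[mism[, "row"]],     # one row per disagreement
           field  = names(a)[mism[, "col"]],
           entry1 = as.matrix(a)[mism],
           entry2 = as.matrix(b)[mism])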

I would really, really love someone to develop (and perhaps replace) the
rather buggy edit() and fix() routines (they seem to hang on big data frames
in Rcmdr, which is what I'm trying to get students onto) with something
that did some or all of what SPSS/DE used to do for me or I bodge now in
Excel.  If any generous coding whiz were willing to do this, I'll try to
alpha and beta test and write help etc.

There _may_ be good open source things out there that do what I need, but
something really integrated into R would be another huge step
forward in being able to phase out SPSS in my work settings and phase in R.

Very best all,

Chris



-- 
Chris Evans [EMAIL PROTECTED] Skype: chris-psyctc
Professor of Psychotherapy, Nottingham University;
Consultant Psychiatrist in Psychotherapy, Notts PDD network;
Research Programmes Director, Nottinghamshire NHS Trust;
*If I am writing from one of those roles, it will be clear. Otherwise*
*my views are my own and not representative of those institutions*

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Dale Steele
For Windows users, EpiData Entry (http://www.epidata.dk/) is an
excellent (free) tool for data entry and documentation.  --Dale


On 6/8/07, Chris Evans [EMAIL PROTECTED] wrote:

 ... original message snipped ...



Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Frank E Harrell Jr
Dale Steele wrote:
 For windows users, EpiData Entry http://www.epidata.dk/ is an
 excellent (free) tool for data entry and documentation.--Dale

Note that EpiData seems to work well under linux using wine.
Frank

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Christophe Pallier
On 6/8/07, Douglas Bates [EMAIL PROTECTED] wrote:


 Other responses in this thread have mentioned 'little language'
 filters like awk, which is fine for those who were raised in the Bell
 Labs tradition of programming (why type three characters when two
 character names should suffice for anything one wants to do on a
 PDP-11) but the typical field scientist finds this a bit too terse to
 understand and would rather write a filter as a paragraph of code that
 they have a chance of reading and understanding a week later.


Hum,


Concerning awk, I think that this comment does not apply: because the
language is simple and somewhat limited, awk scripts are typically quite
clean and readable (of course, it is possible to write horrible code in any
language).

I have introduced awk to dozens of people (mostly scientists in social
sciences, and dos/windows users...) over the last 15 years; it is sometimes
the only programming language they know, and they are very happy with what
they can do with it.

The philosophy of using it as a filter (that is, a converter) is also good
because many problems are best solved in 2 or 3 steps (2/3 short scripts run
sequentially) rather than in one single step, as people tend to do with
languages that encourage the use of more complex data structures than
associative arrays.

It could be argued that awk is the Swiss Army knife of simple text
manipulations. All in all, awk+R is a very efficient combination for data
manipulation (at least for the cases I have encountered).
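
(In practice the combination can be as direct as piping awk's output
straight into R; the awk filter and the column names below are only
placeholders:)

# keep complete, comma-delimited records only, then read the result into R
clean <- pipe("awk -F',' 'NF == 5 && $3 != \"\" {print}' rawdata.csv")
dat   <- read.csv(clean, header = FALSE,
                  col.names = c("id", "visit", "value", "unit", "flag"))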

It would be a pity if your remark led people to overlook awk, as it would
efficiently solve many of the input parsing problems that are posted on this
list (I am talking here about extracting information from text files, not
data entry).

awk, like R, is not exempt from defects, yet both are tools that one gets
attached to because they increase one's productivity a lot.


-- 
Christophe Pallier (http://www.pallier.org)


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Tools For Preparing Data For Analysis

2007-06-07 Thread Robert Wilkins
As noted on the R-project web site itself ( www.r-project.org -
Manuals - R Data Import/Export ), it can be cumbersome to prepare
messy and dirty data for analysis with the R tool itself. I've also
seen at least one S programming book (one of the yellow Springer ones)
that says, more briefly, the same thing.
The R Data Import/Export page recommends examples using SAS, Perl,
Python, and Java. It takes a bit of courage to say that (when you go
to a corporate software web site, you'll never see a page saying "This
is the type of problem that our product is not the best at, here's
what we suggest instead"). I'd like to provide a few more
suggestions, especially for volunteers who are willing to evaluate new
candidates.

SAS is fine if you're not paying for the license out of your own
pocket. But maybe one reason you're using R is you don't have
thousands of spare dollars.
Using Java for data cleaning is an exercise in sado-masochism; Java
has a learning curve (almost) as difficult as C++.

There are different types of data transformation, and for some data
preparation problems an all-purpose programming language is a good
choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has
excellent regular expression facilities.

However, for some types of complex demanding data preparation
problems, an all-purpose programming language is a poor choice. For
example: cleaning up and preparing clinical lab data and adverse event
data - you could do it in Perl, but it would take way, way too much
time. A specialized programming language is needed. And since data
transformation is quite different from data query, SQL is not the
ideal solution either.

There are only three statistical programming languages that are
well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more
popular than S for data cleaning.

If you're an R user with difficult data preparation problems, frankly
you are out of luck, because the products I'm about to mention are
new, unknown, and therefore regarded as immature. And while the
founders of these products would be very happy if you kicked the
tires, most people don't like to look at brand new products. Most
innovators and inventors don't realize this; I've learned it the hard
way.

But if you are a volunteer who likes to help out by evaluating,
comparing, and reporting upon new candidates, well you could certainly
help out R users and the developers of the products by kicking the
tires of these products. And there is a huge need for such volunteers.

1. DAP
This is an open source implementation of SAS.
The founder: Susan Bassein
Find it at: directory.fsf.org/math/stats (GNU GPL)

2. PSPP
This is an open source implementation of SPSS.
The relatively early version number might not give a good idea of how
mature the data transformation features are; it reflects the fact that he
has only started doing the statistical tests.
The founder: Ben Pfaff, either a grad student or professor at Stanford CS dept.
Also at : directory.fsf.org/math/stats (GNU GPL)

3. Vilno
This uses a programming language similar to SPSS and SAS, but quite unlike S.
Essentially, it's a substitute for the SAS datastep, and also
transposes data and calculates averages and such. (No t-tests or
regressions in this version). I created this, during the years
2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in
my opinion. The tarball includes about 100 or so test cases used for
debugging - for logical calculation errors, but not for extremely high
volumes of data.
The maintenance of Vilno has slowed down, because I am currently
(desperately) looking for employment. But once I've found new
employment and living quarters and settled in, I will continue to
enhance Vilno in my spare time.
The founder: that would be me, Robert Wilkins
Find it at: code.google.com/p/vilno ( GNU GPL )
( In particular, the tarball at code.google.com/p/vilno/downloads/list
, since I have yet to figure out how to use Subversion ).


4. Who knows?
It was not easy to find out about the existence of DAP and PSPP. So
who knows what else is out there. However, I think you'll find a lot
more statistics software ( regression , etc ) out there, and not so
much data transformation software. Not many people work on data
preparation software. In fact, the category is so obscure that there
isn't one agreed term: data cleaning , data munging , data crunching ,
or just getting the data ready for analysis.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-07 Thread Robert Duval
An additional option for Windows users is Micro Osiris

http://www.microsiris.com/

best
robert

On 6/7/07, Robert Wilkins [EMAIL PROTECTED] wrote:
 ... original message snipped ...



Re: [R] Tools For Preparing Data For Analysis

2007-06-07 Thread Frank E Harrell Jr
Robert Wilkins wrote:
 As noted on the R-project web site itself ( www.r-project.org -
 Manuals - R Data Import/Export ), it can be cumbersome to prepare
 messy and dirty data for analysis with the R tool itself. I've also
 seen at least one S programming book (one of the yellow Springer ones)
 that says, more briefly, the same thing.
 The R Data Import/Export page recommends examples using SAS, Perl,
 Python, and Java. It takes a bit of courage to say that ( when you go
 to a corporate software web site, you'll never see a page saying This
 is the type of problem that our product is not the best at, here's
 what we suggest instead ). I'd like to provide a few more
 suggestions, especially for volunteers who are willing to evaluate new
 candidates.
 
 SAS is fine if you're not paying for the license out of your own
 pocket. But maybe one reason you're using R is you don't have
 thousands of spare dollars.
 Using Java for data cleaning is an exercise in sado-masochism, Java
 has a learning curve (almost) as difficult as C++.
 
 There are different types of data transformation, and for some data
 preparation problems an all-purpose programming language is a good
 choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has
 excellent regular expression facilities.
 
 However, for some types of complex demanding data preparation
 problems, an all-purpose programming language is a poor choice. For
 example: cleaning up and preparing clinical lab data and adverse event
 data - you could do it in Perl, but it would take way, way too much
 time. A specialized programming language is needed. And since data
 transformation is quite different from data query, SQL is not the
 ideal solution either.

We deal with exactly those kinds of data solely using R.  R is 
exceptionally powerful for data manipulation, just a bit hard to learn. 
  Many examples are at 
http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf
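
(A toy base-R illustration of the flavour of manipulation meant here -- the
columns and recodes are invented for the example, not taken from the notes
above:)

lab <- data.frame(id    = c(1, 1, 2, 2),
                  test  = c("wbc", "hgb", "wbc", "hgb"),
                  value = c("6.2", "13.1", "<0.5", "12 g/dL"),
                  stringsAsFactors = FALSE)

lab$value <- sub("^<", "", lab$value)                      # drop censoring marks
lab$value <- as.numeric(sub("[^0-9.].*$", "", lab$value))  # strip trailing units
reshape(lab, idvar = "id", timevar = "test", direction = "wide")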

Frank

 
 ... rest snipped ...
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide