Re: [R] Tools For Preparing Data For Analysis

2007-06-22 Thread Christophe Pallier
If I understand correctly (from your Perl script)

1. you count the number of occurrences of each "(echo, muga)" pair in the
first file.

2. you remove from the second file the lines that correspond to these
occurrences.

If this is indeed your aim, here's a solution in R:

cumcount <- function(x) {
  y <- numeric(length(x))
  for (i in 1:length(y)) {
    y[i] <- sum(x[1:i] == x[i])
  }
  y
}

both <- read.csv('both_echo.csv')
v <- table(paste(both$echo, "_", both$muga, sep=""))

semi <- read.csv('qual_echo.csv')
s <- paste(semi$echo, "_", semi$muga, sep="")
cs <- cumcount(s)
count <- v[s]
count[is.na(count)] <- 0

semi2 <- data.frame(semi, s, cs, count, keep = cs > count)

> semi2
  echo muga quant     s cs count  keep
1   10   20     0 10_20  1     0  TRUE
2   10   20     0 10_20  2     0  TRUE
3   10   21     0 10_21  1     1 FALSE
4   10   21     0 10_21  2     1  TRUE
5   10   24     0 10_24  1     0  TRUE
6   10   25     0 10_25  1     2 FALSE
7   10   25     0 10_25  2     2 FALSE
8   10   25     0 10_25  3     2  TRUE


My code is not very readable...
Yet, the 'trick' of using a helper function like 'cumcount' might be
instructive.
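(For what it's worth, a vectorised version of the same helper, which should
give the same running counts, is

cumcount2 <- function(x) ave(rep(1, length(x)), x, FUN = cumsum)

so 'cs' could also be computed as ave(rep(1, length(s)), s, FUN = cumsum).
This is only a sketch; I have not checked it against the data above.)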

Christophe Pallier


On 6/22/07, Kevin E. Thorpe <[EMAIL PROTECTED]> wrote:
>
> I am posting to this thread that has been quiet for some time because I
> remembered the following question.
>
> Christophe Pallier wrote:
> > Hi,
> >
> > Can you provide examples of data formats that are problematic to read
> and
> > clean with R ?
>
> Today I had a data manipulation problem that I don't know how to do in R
> so I solved it with perl.  Since I'm always interested in learning more
> about complex data manipulation in R I am posting my problem in the
> hopes of receiving some hints for doing this in R.
>
> If anyone has nothing better to do than play with other people's data,
> I would be happy to send the raw files off-list.
>
> Background:
>
> I have been given data that contains two measurements of left
> ventricular ejection fraction.  One of the methods is echocardiogram
> which sometimes gives a true quantitative value and other times a
> semi-quantitative value.  The desire is to compare echo with the
> other method (MUGA).  In most cases, patients had either quantitative
> or semi-quantitative.  Some patients had both.  The data came
> to me in excel files with, basically, no patient identifiers to link
> the "both" with the semi-quantitative patients (the "both" patients
> were in multiple data sets).
>
> What I wanted to do was extract from the semi-quantitative data file
> those patients with only semi-quantitative.  All I have to link with
> are the semi-quantitative echo and the MUGA and these pairs of values
> are not unique.
>
> To make this more concrete, here are some portions of the raw data.
>
> "Both"
>
> "ID NUM","ECHO","MUGA","Semiquant","Quant"
> "B",12,37,10,12
> "D",13,13,10,13
> "E",13,26,10,15
> "F",13,31,10,13
> "H",15,15,10,15
> "I",15,21,10,15
> "J",15,22,10,15
> "K",17,22,10,17
> "N",17.5,4,10,17.5
> "P",18,25,10,18
> "R",19,25,10,19
>
> Semi-quantitative
>
> "echo","muga","quant"
> 10,20,0  <-- keep
> 10,20,0  <-- keep
> 10,21,0  <-- remove
> 10,21,0  <-- keep
> 10,24,0  <-- keep
> 10,25,0  <-- remove
> 10,25,0  <-- remove
> 10,25,0  <-- keep
>
> Here is the perl program I wrote for this.
>
> #!/usr/bin/perl
>
> open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
> # Discard first row;
> $_ = <BOTH>;
> while(<BOTH>) {
> chomp;
> ($id, $e, $m, $sq, $qu) = split(/,/);
> $both{$sq,$m}++;
> }
> close(BOTH);
>
> open(OUT, "> qual_echo_only.csv") || die "Can't open qual_echo_only.csv";
> print OUT "pid,echo,muga,quant\n";
> $pid = 2001;
>
> open(QUAL, "qual_echo.csv") || die "Can't open qual_echo.csv";
> # Discard first row
> $_ = <QUAL>;
> while(<QUAL>) {
> chomp;
> ($echo, $muga, $quant) = split(/,/);
> if ($both{$echo,$muga} > 0) {
> $both{$echo,$muga}--;
> }
> else {
> print OUT "$pid,$echo,$muga,$quant\n";
> $pid++;
> }
> }
> close(QUAL);
> close(OUT);
>
> open(OUT, "> both_echo.csv") || die "Can't open both_echo.csv";
> print OUT "pid,echo,muga,quant\n";
> $pid = 3001;
>
> open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
> # Discard first row;
> $_ = <BOTH>;
> while(<BOTH>) {
> chomp;
> ($id, $e, $m, $sq, $qu) = split(/,/);
> print OUT "$pid,$sq,$m,0\n";
> print OUT "$pid,$qu,$m,1\n";
> $pid++;
> }
> close(BOTH);
> close(OUT);
>
>
> --
> Kevin E. Thorpe
> Biostatistician/Trialist, Knowledge Translation Program
> Assistant Professor, Department of Public Health Sciences
> Faculty of Medicine, University of Toronto
> email: [EMAIL PROTECTED]  Tel: 416.864.5776  Fax: 416.864.6057
>
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self

Re: [R] Tools For Preparing Data For Analysis

2007-06-22 Thread Kevin E. Thorpe
I am posting to this thread that has been quiet for some time because I
remembered the following question.

Christophe Pallier wrote:
> Hi,
> 
> Can you provide examples of data formats that are problematic to read and
> clean with R ?

Today I had a data manipulation problem that I don't know how to do in R
so I solved it with perl.  Since I'm always interested in learning more
about complex data manipulation in R I am posting my problem in the
hopes of receiving some hints for doing this in R.

If anyone has nothing better to do than play with other people's data,
I would be happy to send the raw files off-list.

Background:

I have been given data that contains two measurements of left
ventricular ejection fraction.  One of the methods is echocardiogram
which sometimes gives a true quantitative value and other times a
semi-quantitative value.  The desire is to compare echo with the
other method (MUGA).  In most cases, patients had either quantitative
or semi-quantitative.  Some patients had both.  The data came
to me in excel files with, basically, no patient identifiers to link
the "both" with the semi-quantitative patients (the "both" patients
were in multiple data sets).

What I wanted to do was extract from the semi-quantitative data file
those patients with only semi-quantitative.  All I have to link with
are the semi-quantitative echo and the MUGA and these pairs of values
are not unique.

To make this more concrete, here are some portions of the raw data.

"Both"

"ID NUM","ECHO","MUGA","Semiquant","Quant"
"B",12,37,10,12
"D",13,13,10,13
"E",13,26,10,15
"F",13,31,10,13
"H",15,15,10,15
"I",15,21,10,15
"J",15,22,10,15
"K",17,22,10,17
"N",17.5,4,10,17.5
"P",18,25,10,18
"R",19,25,10,19

Semi-quantitative

"echo","muga","quant"
10,20,0  <-- keep
10,20,0  <-- keep
10,21,0  <-- remove
10,21,0  <-- keep
10,24,0  <-- keep
10,25,0  <-- remove
10,25,0  <-- remove
10,25,0  <-- keep

Here is the perl program I wrote for this.

#!/usr/bin/perl

open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
# Discard first row;
$_ = <BOTH>;
while(<BOTH>) {
chomp;
($id, $e, $m, $sq, $qu) = split(/,/);
$both{$sq,$m}++;
}
close(BOTH);

open(OUT, "> qual_echo_only.csv") || die "Can't open qual_echo_only.csv";
print OUT "pid,echo,muga,quant\n";
$pid = 2001;

open(QUAL, "qual_echo.csv") || die "Can't open qual_echo.csv";
# Discard first row
$_ = <QUAL>;
while(<QUAL>) {
chomp;
($echo, $muga, $quant) = split(/,/);
if ($both{$echo,$muga} > 0) {
$both{$echo,$muga}--;
}
else {
print OUT "$pid,$echo,$muga,$quant\n";
$pid++;
}
}
close(QUAL);
close(OUT);

open(OUT, "> both_echo.csv") || die "Can't open both_echo.csv";
print OUT "pid,echo,muga,quant\n";
$pid = 3001;

open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
# Discard first row;
$_ = <BOTH>;
while(<BOTH>) {
chomp;
($id, $e, $m, $sq, $qu) = split(/,/);
print OUT "$pid,$sq,$m,0\n";
print OUT "$pid,$qu,$m,1\n";
$pid++;
}
close(BOTH);
close(OUT);


-- 
Kevin E. Thorpe
Biostatistician/Trialist, Knowledge Translation Program
Assistant Professor, Department of Public Health Sciences
Faculty of Medicine, University of Toronto
email: [EMAIL PROTECTED]  Tel: 416.864.5776  Fax: 416.864.6057

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-14 Thread Robert Wilkins
[Arrggh, not reply, but reply to all, cross my fingers again, sorry Peter!]

Hmm,

I don't think you need a retain statement.

if first.patientID ;
or
if last.patientID ;

ought to do it.

It's actually better than the Vilno version, I must admit, a bit more concise:

if ( not firstrow(patientID) ) deleterow ;

Ah well.
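(An R equivalent, assuming a data frame dat sorted by patientID as in
Peter's examples quoted below, would be roughly

first <- dat[!duplicated(dat$patientID), ]            # like 'if first.patientID;'
last  <- dat[rev(!duplicated(rev(dat$patientID))), ]  # like 'if last.patientID;'

with dat and patientID as placeholder names.)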

**
For the folks asking for the location of the software (I know I posted it,
but it didn't connect to the thread, and you get a huge number of posts
each day, sorry):

Vilno , find at
http://code.google.com/p/vilno

DAP & PSPP,  find at
http://directory.fsf.org/math/stats

Awk, find at lots of places,
http://www.gnu.org/software/gawk/gawk.html

Anything else? DAP & PSPP are hard to find, I'm sure there's more out there!
What about MDX? Nahh, not really the right problem domain.
Nobody uses MDX for this stuff.

**

If my examples, using clinical trial data, are boring and hard to
understand for those who asked for examples
(and presumably don't work in clinical trials), let me
know. Some of these other examples I'm reading about are quite interesting.
It doesn't help that clinical trial databases cannot be public. Making
a fake database would take a lot of time.
The irony is, even with my deep understanding of data preparation in
clinical trials, the pharmas still don't want to give me a job (because
I was gone for many years).


Let's see if this post works: thanks to the folks who gave me advice
on how to properly respond to a post within a thread. (Although the
thread in my gmail account is only a subset of the posts visible in
the archives.) Crossing my fingers ...

On 6/10/07, Peter Dalgaard <[EMAIL PROTECTED]> wrote:
> Douglas Bates wrote:
> > Frank Harrell indicated that it is possible to do a lot of difficult
> > data transformation within R itself if you try hard enough but that
> > sometimes means working against the S language and its "whole object"
> > view to accomplish what you want and it can require knowledge of
> > subtle aspects of the S language.
> >
> Actually, I think Frank's point was subtly different: It is *because* of
> the differences in view that it sometimes seems difficult to find the
> way to do something in R that  is apparently straightforward in SAS.
> I.e. the solutions exist and are often elegant, but may require some
> lateral thinking.
>
> Case in point: Finding the first or the last observation for each
> subject when there are multiple records for each subject. The SAS way
> would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that
> you can compare the subject ID with the one from the previous record,
> working with data that are sorted appropriately.
>
> You can do the same thing in R with a for loop, but there are better
> ways e.g.
> subset(df,!duplicated(ID)), and subset(df, rev(!duplicated(rev(ID)))), or
> maybe
> do.call("rbind",lapply(split(df,df$ID), head, 1)), resp. tail. Or
> something involving aggregate(). (The latter approaches generalize
> better to other within-subject functionals like cumulative doses, etc.).
>
> The hardest cases that I know of are the ones where you need to turn one
> record into many, such as occurs in survival analysis with
> time-dependent, piecewise constant covariates. This may require
> "transposing the problem", i.e. for each  interval you find out which
> subjects contribute and with what, whereas the SAS way would be a
> within-subject loop over intervals containing an OUTPUT statement.
>
> Also, there are some really weird data formats, where e.g. the input
> format is different in different records. Back in the 80's where
> punched-card input was still common, it was quite popular to have one
> card with background information on a patient plus several cards
> detailing visits, and you'd get a stack of cards containing both kinds.
> In R you would most likely split on the card type using grep() and then
> read the two kinds separately and merge() them later.
>
>

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-14 Thread John Kane

--- [EMAIL PROTECTED] wrote:

> As a tangent to this thread, there is a very
> relevant
> article in the latest issue of the RSS magazine
> "Significance",
> which I have just received:
> 
>   Dr Fisher's Casebook
>   The trouble with data
> 
> Significance, Vol 4 (2007) Issue 2.
> 
> Full current contents at
> 
> http://www.blackwell-synergy.com/toc/sign/4/2
> 
> but unfortunately you can only read any of it by
> paying
> money to Blackwell (unless you're an RSS member).
> 
> Best wishes to all,
> Ted.

A lovely article.  I'm not a member but the local
university has a subscription.  

The examples of "men who claimed to have cervical
smears" (F) and "women who were 5' tall weighing 15
stone" (T) ring true.

I've found people walking at 30 km/hr (F) and an
addict using 240 needles a month (T). I've even found
a set of 16 variables the study designers never heard
of!

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-14 Thread Ted Harding
As a tangent to this thread, there is a very relevant
article in the latest issue of the RSS magazine "Significance",
which I have just received:

  Dr Fisher's Casebook
  The trouble with data

Significance, Vol 4 (2007) Issue 2.

Full current contents at

http://www.blackwell-synergy.com/toc/sign/4/2

but unfortunately you can only read any of it by paying
money to Blackwell (unless you're an RSS member).

Best wishes to all,
Ted.


E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 14-Jun-07   Time: 12:24:46
-- XFMail --

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-11 Thread Barry Rowlingson
Chris Evans wrote:

> Thanks Ted, great thread and I'm impressed with EpiData that I've
> discovered through this. I'd still like something that is even more
> integrated with R but maybe some day, if EpiData go fully open source as
> I think they are doing ("A full conversion plan to secure this and
> convert the software to open-source has been made (See complete
> description of license and principles)." at http://www.epidata.dk/ but
> the link to http://www.epidata.dk/about.htm doesn't exactly clarify this
> I don't think.  But I can hope.)
> 
> Thanks, yet again, to everyone who creates and contributes to the R
> system and this list: wonderful!

  Perhaps what we need is an XML standard for describing record-oriented 
data and its validation? This could then be used to validate a set of 
records and possibly also to build input forms with built-in validation 
for new records.

  You could then write R code that did 'check this data frame against 
this XML description and tell me the invalid rows'. Or Python code.
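  As a toy sketch of the R side (with a plain R list standing in for the
XML description, and made-up column rules):

# made-up description: allowed values or numeric range per column
spec <- list(sex = c("M", "F"), age = c(0, 120))

invalid_rows <- function(df, spec) {
  bad <- rep(FALSE, nrow(df))
  for (v in names(spec)) {
    s <- spec[[v]]
    if (is.numeric(s))
      bad <- bad | df[[v]] < s[1] | df[[v]] > s[2]  # numeric range rule
    else
      bad <- bad | !(df[[v]] %in% s)                # allowed-values rule
  }
  which(bad)
}

invalid_rows(data.frame(sex = c("M", "X"), age = c(34, 250)), spec)
# should flag row 2

  A real description would of course need richer rules (cross-field checks,
missingness codes and so on), but that is the general shape of the R end.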

  This is the kind of thing that is traditionally built using a database 
front-end, but keeping the description in XML means that alternate 
interfaces (web forms, standalone programs using Qt or GTK libraries) 
can be used on the same description set.

  I had a quick search to see if this kind of thing exists already, but 
google searches for 'data entry verification' indicate that I should 
really pay some people in India to do that kind of thing for me...

Barry

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Chris Evans
(Ted Harding) sent the following  at 10/06/2007 09:28:

... much snipped ...

> (As is implicit in many comments in Robert's blog, and indeed also
> from many postings to this list over time and undoubtedly well
> known to many of us in practice, a lot of the problems with data
> files arise at the data gathering and entry stages, where people
> can behave as if stuffing unpaired socks and unattributed underwear
> randomly into a drawer, and then banging it shut).

And they look surprised when pointing a statistician at the chest of
drawers doesn't result in a cut price display worthy of Figleaf (or
Victoria's Secret I think for those of you in N.America) and get them
their degree, doctorate, latest publication ...

Ah me, how wonderfully, wonderfully ... sadly, accurate!

Thanks Ted, great thread and I'm impressed with EpiData that I've
discovered through this. I'd still like something that is even more
integrated with R but maybe some day, if EpiData go fully open source as
I think they are doing ("A full conversion plan to secure this and
convert the software to open-source has been made (See complete
description of license and principles)." at http://www.epidata.dk/ but
the link to http://www.epidata.dk/about.htm doesn't exactly clarify this
I don't think.  But I can hope.)

Thanks, yet again, to everyone who creates and contributes to the R
system and this list: wonderful!

C


-- 
Chris Evans <[EMAIL PROTECTED]> Skype: chris-psyctc
Professor of Psychotherapy, Nottingham University;
Consultant Psychiatrist in Psychotherapy, Notts PDD network;
Research Programmes Director, Nottinghamshire NHS Trust;
*If I am writing from one of those roles, it will be clear. Otherwise*
*my views are my own and not representative of those institutions*

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Stephen Tucker
Embarrassingly, I don't know awk or sed, but R's code seems to be
shorter for most tasks than Python, which is my basis for comparison.

It's true that R's more powerful data structures usually aren't
necessary for the data cleaning, but sometimes in the filtering
process I will pick out lines that contain certain data, in which case
I have to convert text to numbers and perform operations like
which.min(), order(), etc., so in that sense I like to have R's
vectorized notation and the objects/functions that support it.

As far as some of the tasks you described, I've tried transcribing
them to R. I know you provided only the simplest examples, but even in
these cases I think R's functions for handling these situations
exemplify their usefulness in this step of the analysis. But perhaps
you would argue that this code is too long... In any event it will
still save the trouble of keeping track of an extra (intermediate)
file passed between awk and R.

(1) the numbers of fields in each line equivalent to 
>cat datafile.csv | awk 'BEGIN{FS=","}{n=NF;print n}'
in awk

# R equivalent:
nFields <- count.fields("datafile.csv",sep=",")
# or 
nFields <- sapply(strsplit(readLines("datafile.csv"),","),length)

(2) which lines have the wrong number of fields, and how many fields
they have. You can similarly count how many lines there are (e.g. pipe
into wc -l).

# number of lines with wrong number of fields
nWrongFields <- length(nFields[nFields > 10])

# select only first ten fields from each line
# and return a matrix
firstTenFields <- 
  do.call(rbind,
  lapply(strsplit(readLines("datafile.csv"),","),
 function(x) x[1:10]))

# select only those lines which contain ten fields
# and return a matrix
onlyTenFields <- 
  do.call(rbind,
  lapply(strsplit(readLines("datafile.csv"),","),
 function(x) if(length(x) <= 10) x else NULL))

(3)
>If for instance you try to
>read the following CSV into R as a dataframe:
> 
>1,2,.,4
>2,.,4,5
>3,4,.,6
> 

txtC <- textConnection(
"1,2,.,4
2,.,4,5
3,4,.,6")
# using read.csv() specifying na.string argument:
> read.csv(txtC,header=FALSE,na.string=".")
  V1 V2 V3 V4
1  1  2 NA  4
2  2 NA  4  5
3  3  4 NA  6

# Of course, read.csv will work only if data is formatted correctly.
# More generally, using readLines(), strsplit(), etc., which are more
# flexible :

> do.call(rbind,
+   lapply(strsplit(readLines(txtC),","),
+          type.convert, na.string="."))
     [,1] [,2] [,3] [,4]
[1,]    1    2   NA    4
[2,]    2   NA    4    5
[3,]    3    4   NA    6

(4) Situations where people mix ",," and ",.,"!

# type.convert (and read.csv) will still work when missing values are ",,"
# and ",.," (automatically recognizes "" as NA and through
# specification of 'na.string', can recognize "." as NA)

# If it is desired to convert "." to "" first, this is simple as
# well:

m <- do.call(rbind,
             lapply(strsplit(readLines(txtC),","),
                    function(x) gsub("^\\.$","",x)))
> m
     [,1] [,2] [,3] [,4]
[1,] "1"  "2"  ""   "4" 
[2,] "2"  ""   "4"  "5" 
[3,] "3"  "4"  ""   "6" 

# then
mode(m) <- "numeric"
# or
m <- apply(m,2,type.convert)
# will give
> m
     [,1] [,2] [,3] [,4]
[1,]    1    2   NA    4
[2,]    2   NA    4    5
[3,]    3    4   NA    6


--- [EMAIL PROTECTED] wrote:

> On 10-Jun-07 19:27:50, Stephen Tucker wrote:
> > 
> > Since R is supposed to be a complete programming language,
> > I wonder why these tools couldn't be implemented in R
> > (unless speed is the issue). Of course, it's a naive desire
> > to have a single language that does everything, but it seems
> > that R currently has most of the functions necessary to do
> > the type of data cleaning described.
> 
> In principle that is certainly true. A couple of comments,
> though.
> 
> 1. R's rich data structures are likely to be superfluous.
>Mostly, at the sanitisation stage, one is working with
>"flat" files (row & column). This straightforward format
>is often easier to handle using simple programs for the
>kind of basic filtering needed, rather than getting into
>the heavier programming constructs of R.
> 
> 2. As follow-on and contrast at the same time, very often
>what should be a nice flat file with no rough edges is not.
>If there are variable numbers of fields per line, R will
>not handle it straightforwardly (you can force it in,
>but it's more elaborate). There are related issues as well.
> 
> a) If someone entering data into an Excel table lets their
>cursor wander outside the row/col range of the table,
>this can cause invisible entities to be planted in the
>extraneous cells. When saved as a CSV, this file then
>has variable numbers of fields per line, and possibly
>also extra lines with arbitrary blank fields.
> 
>cat datafile.csv | awk 'BEGIN{FS=","}{n=NF;print n}'
> 
>will give you the numbers of fields in each line.
> 
>If you further pipe it into | s

Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread roger koenker
An important potential benefit of R solutions shared by awk, sed, ...
is that they provide a reproducible way to document exactly how one got
from one version of the data to the next.  This seems to be the main
problem with handicraft methods like editing excel files: it is too
easy to introduce new errors that can't be tracked down at later
stages of the analysis.


url:    www.econ.uiuc.edu/~roger     Roger Koenker
email:  [EMAIL PROTECTED]            Department of Economics
vox:    217-333-4558                 University of Illinois
fax:    217-244-6678                 Champaign, IL 61820


On Jun 10, 2007, at 4:14 PM, (Ted Harding) wrote:

> On 10-Jun-07 19:27:50, Stephen Tucker wrote:
>>
>> Since R is supposed to be a complete programming language,
>> I wonder why these tools couldn't be implemented in R
>> (unless speed is the issue). Of course, it's a naive desire
>> to have a single language that does everything, but it seems
>> that R currently has most of the functions necessary to do
>> the type of data cleaning described.
>
> In principle that is certainly true. A couple of comments,
> though.
>
> 1. R's rich data structures are likely to be superfluous.
>Mostly, at the sanitisation stage, one is working with
>"flat" files (row & column). This straightforward format
>is often easier to handle using simple programs for the
>kind of basic filtering needed, rather than getting into
>the heavier programming constructs of R.
>
> 2. As follow-on and contrast at the same time, very often
>what should be a nice flat file with no rough edges is not.
>If there are variable numbers of fields per line, R will
>not handle it straightforwardly (you can force it in,
>but it's more elaborate). There are related issues as well.
>
> a) If someone entering data into an Excel table lets their
>cursor wander outside the row/col range of the table,
>this can cause invisible entities to be planted in the
>extraneous cells. When saved as a CSV, this file then
>has variable numbers of fields per line, and possibly
>also extra lines with arbitrary blank fields.
>
>cat datafile.csv | awk 'BEGIN{FS=","}{n=NF;print n}'
>
>will give you the numbers of fields in each line.
>
>If you further pipe it into | sort -nu you will get
>the distinct field-numbers. If you know (by now) how many
>fields there should be (e.g. 10), then
>
>cat datafile.csv | awk 'BEGIN{FS=","} (NF != 10){print NR ", " NF}'
>
>will tell you which lines have the wrong number of fields,
>and how many fields they have. You can similarly count how
>many lines there are (e.g. pipe into wc -l).
>
> b) People sometimes randomly use a blank space or a "." in a
>cell to denote a missing value. Consistent use of either
>is OK: ",," in a CSV will be treated as "NA" by R. The use
>of "." can be more problematic. If for instance you try to
>read the following CSV into R as a dataframe:
>
>1,2,.,4
>2,.,4,5
>3,4,.,6
>
>the "." in cols 2 and 3 is treated as the character ".",
>with the result that something complicated happens to
>the typing of the items.
>
>typeof(D[i,j]) is always integer. sum(D[1,1])=1, but
>sum(D[1,2]) gives a type-error, even though the entry
>is in fact 2. And so on, in various combinations.
>
>And as.matrix(D) is of course a matrix of characters.
>
>In fact, columns 2 and 3 of D are treated as factors!
>
>for(i in (1:3)){ for(j in (1:4)){ print( (D[i,j]))}}
>[1] 1
>[1] 2
>Levels: . 2 4
>[1] .
>Levels: . 4
>[1] 4
>[1] 2
>[1] .
>Levels: . 2 4
>[1] 4
>Levels: . 4
>[1] 5
>[1] 3
>[1] 4
>Levels: . 2 4
>[1] .
>Levels: . 4
>[1] 6
>
>This is getting altogether too complicated for the job
>one wants to do!
>
>And it gets worse when people mix ",," and ",.,"!
>
>On the other hand, a simple brush with awk (or sed in
>this case) can sort it once and for all, without waking
>the sleeping dogs in R.
>
> I could go on. R undoubtedly has the power, but it can very
> quickly get over-complicated for simple jobs.
>
> Best wishes to all,
> Ted.
>
> 
> E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
> Fax-to-email: +44 (0)870 094 0861
> Date: 10-Jun-07   Time: 22:14:35
> -- XFMail --
>
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting- 
> guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the 

Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Ted Harding
On 10-Jun-07 19:27:50, Stephen Tucker wrote:
> 
> Since R is supposed to be a complete programming language,
> I wonder why these tools couldn't be implemented in R
> (unless speed is the issue). Of course, it's a naive desire
> to have a single language that does everything, but it seems
> that R currently has most of the functions necessary to do
> the type of data cleaning described.

In principle that is certainly true. A couple of comments,
though.

1. R's rich data structures are likely to be superfluous.
   Mostly, at the sanitisation stage, one is working with
   "flat" files (row & column). This straightforward format
   is often easier to handle using simple programs for the
    kind of basic filtering needed, rather than getting into
   the heavier programming constructs of R.

2. As follow-on and contrast at the same time, very often
   what should be a nice flat file with no rough edges is not.
   If there are variable numbers of fields per line, R will
   not handle it straightforwardly (you can force it in,
   but it's more elaborate). There are related issues as well.

a) If someone entering data into an Excel table lets their
   cursor wander outside the row/col range of the table,
   this can cause invisible entities to be planted in the
   extraneous cells. When saved as a CSV, this file then
   has variable numbers of fields per line, and possibly
   also extra lines with arbitrary blank fields.

   cat datafile.csv | awk 'BEGIN{FS=","}{n=NF;print n}'

   will give you the numbers of fields in each line.

   If you further pipe it into | sort -nu you will get
   the distinct field-numbers. If you know (by now) how many
   fields there should be (e.g. 10), then

   cat datafile.csv | awk 'BEGIN{FS=","} (NF != 10){print NR ", " NF}'

   will tell you which lines have the wrong number of fields,
   and how many fields they have. You can similarly count how
   many lines there are (e.g. pipe into wc -l).

b) People sometimes randomly use a blank space or a "." in a
   cell to denote a missing value. Consistent use of either
   is OK: ",," in a CSV will be treated as "NA" by R. The use
   of "." can be more problematic. If for instance you try to
   read the following CSV into R as a dataframe:

   1,2,.,4
   2,.,4,5
   3,4,.,6

   the "." in cols 2 and 3 is treated as the character ".",
   with the result that something complicated happens to
   the typing of the items.

   typeof(D[i,j]) is always integer. sum(D[1,1])=1, but
   sum(D[1,2]) gives a type-error, even though the entry
   is in fact 2. And so on, in various combinations.

   And as.matrix(D) is of course a matrix of characters.

   In fact, columns 2 and 3 of D are treated as factors!

   for(i in (1:3)){ for(j in (1:4)){ print( (D[i,j]))}}
   [1] 1
   [1] 2
   Levels: . 2 4
   [1] .
   Levels: . 4
   [1] 4
   [1] 2
   [1] .
   Levels: . 2 4
   [1] 4
   Levels: . 4
   [1] 5
   [1] 3
   [1] 4
   Levels: . 2 4
   [1] .
   Levels: . 4
   [1] 6

   This is getting altogether too complicated for the job
   one wants to do!

   And it gets worse when people mix ",," and ",.,"!

   On the other hand, a simple brush with awk (or sed in
   this case) can sort it once and for all, without waking
   the sleeping dogs in R.

I could go on. R undoubtedly has the power, but it can very
quickly get over-complicated for simple jobs.

Best wishes to all,
Ted.


E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07   Time: 22:14:35
-- XFMail --

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Ted Harding
On 10-Jun-07 14:04:44, Sarah Goslee wrote:
> On 6/10/07, Ted Harding <[EMAIL PROTECTED]> wrote:
> 
>> ... a lot of the problems with data
>> files arise at the data gathering and entry stages, where people
>> can behave as if stuffing unpaired socks and unattributed underwear
>> randomly into a drawer, and then banging it shut.
> 
> Not specifically R-related, but this would make a great fortune.
> 
> Sarah
> -- 
> Sarah Goslee
> http://www.functionaldiversity.org

I'm not going to object to that!
Ted.


E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07   Time: 21:18:45
-- XFMail --

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Stephen Tucker

Since R is supposed to be a complete programming language, I wonder
why these tools couldn't be implemented in R (unless speed is the
issue). Of course, it's a naive desire to have a single language that
does everything, but it seems that R currently has most of the
functions necessary to do the type of data cleaning described.

For instance, Gabor and Peter showed some snippets of ways to do this
elegantly; my [physical science] data is often not as horrendously
structured so usually I can get away with a program containing this
type of code

txtin <- scan(filename, what="", sep="\n")
filteredList <- lapply(strsplit(txtin, delimiter), FUN=filterfunction)
   # filterfunction() returns selected (and possibly transformed)
   # elements if present, and NULL otherwise;
   # may include calls to grep(), regexpr(), gsub(), substring(),
   # nchar(), type.convert(), paste(), etc.
mydataframe <- do.call(rbind, filteredList)
   # then match(), subset(), aggregate(), etc.

In the case that the file is large, I open a file connection and scan
a single line + apply filterfunction() successively in a FOR-LOOP
instead of using lapply(). Of course, the devil is in the details of
the filtering function, but I believe most of the required text
processing facilities are already provided by R.
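For concreteness, a made-up filterfunction for the skeleton above might be
(purely hypothetical: keep records with exactly 5 fields whose first field
is numeric, and convert them):

filterfunction <- function(x) {
  # x is one line of the file, already split into fields
  if (length(x) == 5 && !is.na(suppressWarnings(as.numeric(x[1]))))
    as.numeric(x)
  else
    NULL   # dropped records; do.call(rbind, ...) skips NULL elements
}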

I often have tasks that involve a combination of shell-scripting and
text processing to construct the data frame for analysis; I started
out using Python+NumPy to do the front-end work but have been using R
progressively more (frankly, all of it) to take over that portion
since I generally prefer the data structures and methods in R.


--- Peter Dalgaard <[EMAIL PROTECTED]> wrote:

> Douglas Bates wrote:
> > Frank Harrell indicated that it is possible to do a lot of difficult
> > data transformation within R itself if you try hard enough but that
> > sometimes means working against the S language and its "whole object"
> > view to accomplish what you want and it can require knowledge of
> > subtle aspects of the S language.
> >   
> Actually, I think Frank's point was subtly different: It is *because* of 
> the differences in view that it sometimes seems difficult to find the 
> way to do something in R that  is apparently straightforward in SAS. 
> I.e. the solutions exist and are often elegant, but may require some 
> lateral thinking.
> 
> Case in point: Finding the first or the last observation for each 
> subject when there are multiple records for each subject. The SAS way 
> would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that 
> you can compare the subject ID with the one from the previous record, 
> working with data that are sorted appropriately.
> 
> You can do the same thing in R with a for loop, but there are better 
> ways e.g.
> subset(df,!duplicated(ID)), and subset(df, rev(!duplicated(rev(ID)))), or 
> maybe
> do.call("rbind",lapply(split(df,df$ID), head, 1)), resp. tail. Or 
> something involving aggregate(). (The latter approaches generalize 
> better to other within-subject functionals like cumulative doses, etc.).
> 
> The hardest cases that I know of are the ones where you need to turn one 
> record into many, such as occurs in survival analysis with 
> time-dependent, piecewise constant covariates. This may require 
> "transposing the problem", i.e. for each  interval you find out which 
> subjects contribute and with what, whereas the SAS way would be a 
> within-subject loop over intervals containing an OUTPUT statement.
> 
> Also, there are some really weird data formats, where e.g. the input 
> format is different in different records. Back in the 80's where 
> punched-card input was still common, it was quite popular to have one 
> card with background information on a patient plus several cards 
> detailing visits, and you'd get a stack of cards containing both kinds. 
> In R you would most likely split on the card type using grep() and then 
> read the two kinds separately and merge() them later.
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 



  


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Sarah Goslee
On 6/10/07, Ted Harding <[EMAIL PROTECTED]> wrote:

> ... a lot of the problems with data
> files arise at the data gathering and entry stages, where people
> can behave as if stuffing unpaired socks and unattributed underwear
> randomly into a drawer, and then banging it shut.

Not specifically R-related, but this would make a great fortune.

Sarah
-- 
Sarah Goslee
http://www.functionaldiversity.org

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Peter Dalgaard
Douglas Bates wrote:
> Frank Harrell indicated that it is possible to do a lot of difficult
> data transformation within R itself if you try hard enough but that
> sometimes means working against the S language and its "whole object"
> view to accomplish what you want and it can require knowledge of
> subtle aspects of the S language.
>   
Actually, I think Frank's point was subtly different: It is *because* of 
the differences in view that it sometimes seems difficult to find the 
way to do something in R that  is apparently straightforward in SAS. 
I.e. the solutions exist and are often elegant, but may require some 
lateral thinking.

Case in point: Finding the first or the last observation for each 
subject when there are multiple records for each subject. The SAS way 
would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that 
you can compare the subject ID with the one from the previous record, 
working with data that are sorted appropriately.

You can do the same thing in R with a for loop, but there are better 
ways e.g.
subset(df,!duplicated(ID)), and subset(df, rev(!duplicated(rev(ID)))), or 
maybe
do.call("rbind",lapply(split(df,df$ID), head, 1)), resp. tail. Or 
something involving aggregate(). (The latter approaches generalize 
better to other within-subject functionals like cumulative doses, etc.).
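A throwaway example of the duplicated() idiom (made-up data):

df <- data.frame(ID = c(1, 1, 1, 2, 2, 3), visit = 1:6)
subset(df, !duplicated(ID))              # first record per subject
subset(df, rev(!duplicated(rev(ID))))    # last record per subject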

The hardest cases that I know of are the ones where you need to turn one 
record into many, such as occurs in survival analysis with 
time-dependent, piecewise constant covariates. This may require 
"transposing the problem", i.e. for each  interval you find out which 
subjects contribute and with what, whereas the SAS way would be a 
within-subject loop over intervals containing an OUTPUT statement.

Also, there are some really weird data formats, where e.g. the input 
format is different in different records. Back in the 80's where 
punched-card input was still common, it was quite popular to have one 
card with background information on a patient plus several cards 
detailing visits, and you'd get a stack of cards containing both kinds. 
In R you would most likely split on the card type using grep() and then 
read the two kinds separately and merge() them later.
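A rough sketch of that last approach, with a made-up layout in which patient
cards start with "P" and visit cards with "V":

lines <- readLines("cards.txt")                       # hypothetical raw file
pat <- read.table(textConnection(lines[grep("^P", lines)]),
                  col.names = c("type", "id", "sex"))
vis <- read.table(textConnection(lines[grep("^V", lines)]),
                  col.names = c("type", "id", "visitdate"))
merged <- merge(pat[-1], vis[-1], by = "id")          # drop card type, join on id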

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Ted Harding
On 10-Jun-07 02:16:46, Gabor Grothendieck wrote:
> That can be elegantly handled in R through R's object
> oriented programming by defining a class for the fancy input.
> See this post:
>   https://stat.ethz.ch/pipermail/r-help/2007-April/130912.html
> for a simple example of that style.
> 
> On 6/9/07, Robert Wilkins <[EMAIL PROTECTED]> wrote:
>> Here are some examples of the type of data crunching you might
>> have to do.
>>
>> In response to the requests by Christophe Pallier and Martin Stevens.
>>
>> Before I started developing Vilno, some six years ago, I had
>> been working in the pharmaceuticals for eight years ( it's not
>> easy to show you actual data though, because it's all confidential
>> of course).

I hadn't heard of Vilno before (except as a variant of "Vilnius").
And it seems remarkably hard to find info about it from a Google
search. The best I've come up with, searching on

  vilno  data

is at
  http://www.xanga.com/datahelper

This is a blog site, apparently with postings by Robert Wilkins.

At the end of the Sunday, September 17, 2006 posting "Tedious
coding at the Pharmas" is a link:

  "I have created a new data crunching programming language."
   http://www.my.opera.com/datahelper

which appears to be totally empty. In another blog article:

  "go to the www.my.opera.com/datahelper site, go to the August 31
   blog article, and there you will find a tarball-file to download,
   called vilnoAUG2006package.tgz"

so again inaccessible; and a google on "vilnoAUG2006package.tgz"
gives a single hit which is simply the same article.

In the Xanga blog there are a few examples of tasks which are
no big deal in any programming language (and, relative to their
simplicity, appear a bit cumbersome in "Vilno"). 

I've not seen in the blog any instance of data transformation
which could not be quite easily done in any straightforward
language (even awk).

>> Lab data can be especially messy, especially if one clinical
>> trial allows the physicians to use different labs. So let's
>> consider lab data.
>> [...]

That's a fairly daunting description, though indeed not at all
extreme for the sort of data that can arise in practice (and
not just in pharmaceutical investigations). But the complexity
is in the situation, and, whatever language you use, the writing
of the program will involve the writer getting to grips with
the complexity, and the complexity will be present in the code
simply because of the need to accommodate all the special cases,
exceptions and faults that have to be anticipated in "feral" data.

Once these have been anticipated and incorporated in the code,
the actual transformations are again no big deal.

Frankly, I haven't yet seen anything in "Vilno" that couldn't be
accommodated in an 'awk' program. Not that I'm advocating awk for
universal use (I'm not that monolithic about it). But I'm using
it as my favourite example of a flexible, capable, transparent
and efficient data filtering language, as far as it goes.


SO: where can one find out more about Vilno, to see what it may
really be capable of that can not be done so easily in other ways?


(As is implicit in many comments in Robert's blog, and indeed also
from many postings to this list over time and undoubtedly well
known to many of us in practice, a lot of the problems with data
files arise at the data gathering and entry stages, where people
can behave as if stuffing unpaired socks and unattributed underwear
randomly into a drawer, and then banging it shut).

Best wishes to all,
Ted.


E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07   Time: 09:28:10
-- XFMail --

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-09 Thread Gabor Grothendieck
That can be  elegantly handled in R through R's object oriented programming
by defining a class for the fancy input.  See this post:
  https://stat.ethz.ch/pipermail/r-help/2007-April/130912.html
for a simple example of that style.
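The general shape of such a reader (made-up names, not the code from the
linked post) is something like:

read.labdata <- function(file) {
  txt <- readLines(file)
  # ... whatever site-specific cleaning the raw format needs ...
  x <- read.csv(textConnection(txt))
  class(x) <- c("labdata", class(x))
  x
}

print.labdata <- function(x, ...) {
  cat("labdata:", nrow(x), "records,", ncol(x), "variables\n")
  NextMethod()   # fall through to print.data.frame
}

after which print, summary and plot methods can hide the messy format from
the rest of the analysis.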


On 6/9/07, Robert Wilkins <[EMAIL PROTECTED]> wrote:
> Here are some examples of the type of data crunching you might have to do.
>
> In response to the requests by Christophe Pallier and Martin Stevens.
>
> Before I started developing Vilno, some six years ago, I had been working in
> the pharmaceuticals for eight years ( it's not easy to show you actual data
> though, because it's all confidential of course).
>
> Lab data can be especially messy, especially if one clinical trial allows
> the physicians to use different labs. So let's consider lab data.
>
> Merge in normal ranges, into the lab data. This has to be done by lab-site
> and lab testcode(PLT for platelets, etc.), obviously. I've seen cases where
> you also need to match by sex and age. The sex column in the normal ranges
> could be: blank, F, M, or B ( B meaning for Both sexes). The age column in
> the normal ranges could be: blank, or something like "40 <55". Even worse,
> you could have an ageunits column in the normal ranges dataset: usually "Y",
> but if there are children in the clinical trial, you will have "D" or "M",
> for Days and Months. If the clinical trial is for adults, all rows with "D"
> or "M" should be tossed out at the start. Clearly the statistical programmer
> has to spend time looking at the data, before writing the program. Remember,
> all of these details can change any time you move to a new clinical trial.
>
> So for the lab data, you have to merge in the patient's date of birth,
> calculate age, and somehow relate that to the age-group column in the normal
> ranges dataset.
>
> (By the way, in clinical trial data preparation, the SAS datastep is much
> more useful and convenient, in my opinion, than the SQL SELECT syntax, at
> least 97% of the time. But in the middle of this program, when you merge the
> normal ranges into the lab data, you get a better solution with PROC SQL (
> just the SQL SELECT statement implemented inside SAS) This is because of the
> trickiness of the age match-up, and the SAS datastep does not do well with
> many-to-many joins.).
>
> Merge in various study drug administration dates into the lab data. Now, for
> each lab record, calculate treatment period ( or cycle number ), depending
> on the statistician's specifications and the way the clinical trial is
> structured.
>
> Different clinical sites chose to use different lab providers. So, for
> example, for Monocytes, you have 10 different units ( essentially 6 units,
> but spelling inconsistencies as well). The statistician has requested that
> you use standardized units in some of the listings ( % units, and only one
> type of non-% unit, for example ). At the same time, lab values need to be
> converted ( *1.61 , divide by 1000, etc. ). This can be very time consuming
> no matter what software you use, and, in my experience, when the SAS
> programmer asks for more clinical information or lab guidebooks, the
> response is incomplete, so he does a lot of guesswork. SAS programmers do
> not have expertise in lab science, hence the guesswork.
>
> Your program has to accommodate numeric values, "1.54", quasi-numeric values
> "<1", and non-numeric values "Trace".
>
> Your data listing is tight for space, so print "PROLONGED CELL CONT" as
> "PRCC".
>
> Once normal ranges are merged in, figure out which values are out-of-range
> and high , which are low, and which are within normal range. In the data
> listing, you may have "H" or "L" appended to the result value being printed.
>
> For each treatment period, you may need a unique lab record selected, in
> case there are two or three for the same treatment period. The statistician
> will tell the SAS programmer how. Maybe the averages of the results for that
> treatment period, maybe that lab record closest to the mid-point of of the
> treatment period. This isn't for the data listing, but for a summary table.
>
> For the differentials ( monocytes, lymphocytes, etc) , merge in the WBC
> (total white blood cell count) values , to convert values between % units
> and absolute count units.
>
> When printing the values in the data listing, you need "H" or "L" to the
> right of the value. But you also need the values to be well lined up ( the
> decimal place ). This can be stupidly time consuming.
>
>
>
> AND ON AND ON AND ON .
>
> I think you see why clinical trials statisticians and SAS programmers enjoy
> lots of job security.

This could be readily handled in R using object oriented programming.
You would specify a class for the strange input,

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide comm

Re: [R] Tools For Preparing Data For Analysis

2007-06-09 Thread Robert Wilkins
Here are some examples of the type of data crunching you might have to do.

In response to the requests by Christophe Pallier and Martin Stevens.

Before I started developing Vilno, some six years ago, I had been working in
the pharmaceuticals for eight years ( it's not easy to show you actual data
though, because it's all confidential of course).

Lab data can be especially messy, especially if one clinical trial allows
the physicians to use different labs. So let's consider lab data.

Merge in normal ranges, into the lab data. This has to be done by lab-site
and lab testcode(PLT for platelets, etc.), obviously. I've seen cases where
you also need to match by sex and age. The sex column in the normal ranges
could be: blank, F, M, or B ( B meaning for Both sexes). The age column in
the normal ranges could be: blank, or something like "40 <55". Even worse,
you could have an ageunits column in the normal ranges dataset: usually "Y",
but if there are children in the clinical trial, you will have "D" or "M",
for Days and Months. If the clinical trial is for adults, all rows with "D"
or "M" should be tossed out at the start. Clearly the statistical programmer
has to spend time looking at the data, before writing the program. Remember,
all of these details can change any time you move to a new clinical trial.

So for the lab data, you have to merge in the patient's date of birth,
calculate age, and somehow relate that to the age-group column in the normal
ranges dataset.

(By the way, in clinical trial data preparation, the SAS datastep is much
more useful and convenient, in my opinion, than the SQL SELECT syntax, at
least 97% of the time. But in the middle of this program, when you merge the
normal ranges into the lab data, you get a better solution with PROC SQL (
just the SQL SELECT statement implemented inside SAS) This is because of the
trickiness of the age match-up, and the SAS datastep does not do well with
many-to-many joins.).
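(In R, a rough equivalent, with made-up names, would be an ordinary merge by
lab site and test code, which is happily many-to-many, followed by keeping
the rows whose age and sex actually match the range:

m <- merge(lab, ranges, by = c("labsite", "testcode"))
m <- subset(m, (is.na(agelow)  | age >= agelow) &
               (is.na(agehigh) | age <  agehigh) &
               (sexrange %in% c("B", "") | sexrange == sex))

Here lab, ranges, agelow, agehigh, sexrange and so on are hypothetical
column names, not the real ones.)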

Merge in various study drug administration dates into the lab data. Now, for
each lab record, calculate treatment period ( or cycle number ), depending
on the statistician's specifications and the way the clinical trial is
structured.

Different clinical sites chose to use different lab providers. So, for
example, for Monocytes, you have 10 different units ( essentially 6 units,
but spelling inconsistencies as well). The statistician has requested that
you use standardized units in some of the listings ( % units, and only one
type of non-% unit, for example ). At the same time, lab values need to be
converted ( *1.61 , divide by 1000, etc. ). This can be very time consuming
no matter what software you use, and, in my experience, when the SAS
programmer asks for more clinical information or lab guidebooks, the
response is incomplete, so he does a lot of guesswork. SAS programmers do
not have expertise in lab science, hence the guesswork.

Your program has to accommodate numeric values, "1.54", quasi-numeric values
"<1", and non-numeric values "Trace".

Your data listing is tight for space, so print "PROLONGED CELL CONT" as
"PRCC".

Once normal ranges are merged in, figure out which values are out-of-range
and high , which are low, and which are within normal range. In the data
listing, you may have "H" or "L" appended to the result value being printed.
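(A small R sketch of those last two steps, with made-up limits; how to treat
"<1" and "Trace" is of course the statistician's decision:

lo <- c(2, 2, 2, 2)                            # hypothetical merged-in lower limits
hi <- c(8, 8, 8, 8)                            # hypothetical merged-in upper limits
labval <- c("1.54", "<1", "Trace", "9.2")
num  <- suppressWarnings(as.numeric(labval))   # "<1" and "Trace" become NA
flag <- ifelse(!is.na(num) & num > hi, "H",
        ifelse(!is.na(num) & num < lo, "L", ""))
data.frame(labval, num, flag)

Just an illustration, not a recommendation.)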

For each treatment period, you may need a unique lab record selected, in
case there are two or three for the same treatment period. The statistician
will tell the SAS programmer how. Maybe the averages of the results for that
treatment period, maybe that lab record closest to the mid-point of of the
treatment period. This isn't for the data listing, but for a summary table.
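(A sketch of the "closest to the mid-point" rule in R, with hypothetical
columns patid, period, labdate and middate:

one.per.period <- do.call(rbind,
    lapply(split(lab, list(lab$patid, lab$period), drop = TRUE),
           function(d) d[which.min(abs(d$labdate - d$middate)), ]))

Again only an illustration, untested on real lab data.)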

For the differentials ( monocytes, lymphocytes, etc) , merge in the WBC
(total white blood cell count) values , to convert values between % units
and absolute count units.

When printing the values in the data listing, you need "H" or "L" to the
right of the value. But you also need the values to be well lined up ( the
decimal place ). This can be stupidly time consuming.



AND ON AND ON AND ON .

I think you see why clinical trials statisticians and SAS programmers enjoy
lots of job security.



On 6/8/07, Martin Henry H. Stevens <[EMAIL PROTECTED]> wrote:
>
> Is there an example available of this sort of problematic data that
> requires this kind of data screening and filtering? For many of us,
> this issue would be nice to learn about, and deal with within R. If a
> package could be created, that would be optimal for some of us. I
> would like to learn a tad more, if it were not too much effort for
> someone else to point me in the right direction?
> Cheers,
> Hank
> On Jun 8, 2007, at 8:47 AM, Douglas Bates wrote:
>
> > On 6/7/07, Robert Wilkins <[EMAIL PROTECTED]> wrote:
> >> As noted on the R-project web site itself ( www.r-project.org ->
> >> Manuals -> R Data Import/Export ), it can be cumbersome to prepare
> >> messy and dirty data for analysis 

Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Christophe Pallier
On 6/8/07, Douglas Bates <[EMAIL PROTECTED]> wrote:
>
>
> Other responses in this thread have mentioned 'little language'
> filters like awk, which is fine for those who were raised in the Bell
> Labs tradition of programming ("why type three characters when two
> character names should suffice for anything one wants to do on a
> PDP-11") but the typical field scientist finds this a bit too terse to
> understand and would rather write a filter as a paragraph of code that
> they have a chance of reading and understanding a week later.


Hum,


Concerning awk, I think that this comment does not apply: because the
language is simple and somewhat limited, awk scripts are typically quite
clean and readable (of course, it is possible to write horrible code in
any language).

I have introduced awk to dozens of people (mostly scientists in social
sciences, and DOS/Windows users...) over the last 15 years; it is sometimes
the only programming language they know, and they are very happy with what
they can do with it.

The philosophy of using it as a filter (that is, a converter) is also good
because many problems are best solved in 2 or 3 steps (2 or 3 short scripts
run sequentially) rather than in one single step, as people tend to do with
languages that encourage the use of more complex data structures than
associative arrays.

It could be argued that awk is the swiss army knife of simple text
manipulations. All in all, awk+R is a very efficient combination for data
manipulation (at least for the cases I have encountered).

It would be a pity if your remark led people to overlook awk as it would
efficiently solve many of the input parsing problems that are posted on this
list (I am talking here about extracting information from text files, not
data entry).

awk, like R, is not without defects, yet both are tools that one gets
attached to because they increase your productivity a lot.


-- 
Christophe Pallier (http://www.pallier.org)

[[alternative HTML version deleted]]

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Frank E Harrell Jr
Dale Steele wrote:
> For Windows users, EpiData Entry is an excellent (free) tool for data
> entry and documentation. --Dale

Note that EpiData seems to work well under linux using wine.
Frank

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Dale Steele
For Windows users, EpiData Entry is an excellent (free) tool for data entry
and documentation. --Dale


On 6/8/07, Chris Evans <[EMAIL PROTECTED]> wrote:
>
> Martin Henry H. Stevens sent the following  at 08/06/2007 15:11:
> > Is there an example available of this sort of problematic data that
> > requires this kind of data screening and filtering? For many of us,
> > this issue would be nice to learn about, and deal with within R. If a
> > package could be created, that would be optimal for some of us. I
> > would like to learn a tad more, if it were not too much effort for
> > someone else to point me in the right direction?
> > Cheers,
> > Hank
> > On Jun 8, 2007, at 8:47 AM, Douglas Bates wrote:
> >
> >> On 6/7/07, Robert Wilkins <[EMAIL PROTECTED]> wrote:
> >>> As noted on the R-project web site itself ( www.r-project.org ->
>
> ... rest snipped ...
>
> OK, I can't resist that invitation.  I think there are many kinds of
> problematic data.  I handle some nasty textish things in perl (and I
> loved the purgatory quote) and I'm afraid I do some things in Excel and
> some cleaning I can handle in R, but I never enter data directly into R.
>
> However, one very common scenario I have faced all my working life is
> psych data from questionnaires or interviews in low budget work, mostly
> student research or routine entry of therapists' data.  Typically you
> have an identifier, a date, some demographics and then a lot of item
> data.  There's little money (usually zero) involved for data entry and
> cleaning but I've produced a lot of good(ish) papers out of this sort of
> very low budget work over the last 20 years.  (Right at the other end of
> a financial spectrum from the FDA/validated s'ware thread but this is
> about validation again!)
>
> The problem I often face is that people are lousy data entry machines
> (well, actually, they vary ... enormously) and if they mess up the data
> entry we all know how horrible this can be.
>
> SPSS (boo hiss) used to have an excellent "module", actually a
> standalone PC/Windoze program, that allowed you to define variables so
> they had allowed values and it would refuse to accept out of range or
> out of acceptable entries, it also allowed you to create checking rules
> and rules that would, in the light of earlier entries, set later values
> and not ask about them.  In a rudimentary way you could also lay things
> out on the screen so that it paginated where the q'aire or paper data
> record did etc.  The final nice touch was that you could define some
> variables as invariant and then set the thing so an independent data
> entry person could re-enter the other data (i.e. pick up q'aire, see if
> ID fits the one showing on screen, if so, enter the rest of the data).
> It would bleep and not move on if you entered a value other than that
> entered by the first person and you had to confirm that one of you was
> right.
>
> That saved me wasted weeks I'm sure on analysing data that turned out to
> be awful and I'd love to see someone build something to replace that.
>
> Currently I tend to use (boo hiss) Excel for this as everyone I work
> with seems to have it (and not all can install open office and anyway I
> haven't had time to learn that properly yet either ...) and I set up
> spreadsheets with validation rules set.  That doesn't get the branching
> rules and checks (e.g. if male, skip questions about periods, PMT and
> pregnancies), or at least, with my poor Excel skills it doesn't.  I just
> skip a column to indicate page breaks in the q'aire, and I get, when I
> can, two people to enter the data separately and then use R to compare
> the two spreadsheets having yanked them into data frames.
>
> I would really, really love someone to develop (and perhaps replace) the
> rather buggy edit() and fix() routines (seem to hang on big data frames
> in Rcmdr which is what I'm trying to get students onto) with something
> that did some or all of what SPSS/DE used to do for me or I bodge now in
> Excel.  If any generous coding whiz were willing to do this, I'll try to
> alpha and beta test and write help etc.
>
> There _may_ be good open source things out there that do what I need but
> something that really integrates into R would be another huge step
> forward in being able to phase out SPSS in my work settings and phase in R.
>
> Very best all,
>
> Chris
>
>
>
> --
> Chris Evans <[EMAIL PROTECTED]> Skype: chris-psyctc
> Professor of Psychotherapy, Nottingham University;
> Consultant Psychiatrist in Psychotherapy, Notts PDD network;
> Research Programmes Director, Nottinghamshire NHS Trust;
> *If I am writing from one of those roles, it will be clear. Otherwise*
> *my views are my own and not representative of those institutions*
>
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-g

Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Chris Evans

Martin Henry H. Stevens sent the following  at 08/06/2007 15:11:
> Is there an example available of this sort of problematic data that  
> requires this kind of data screening and filtering? For many of us,  
> this issue would be nice to learn about, and deal with within R. If a  
> package could be created, that would be optimal for some of us. I  
> would like to learn a tad more, if it were not too much effort for  
> someone else to point me in the right direction?
> Cheers,
> Hank
> On Jun 8, 2007, at 8:47 AM, Douglas Bates wrote:
> 
>> On 6/7/07, Robert Wilkins <[EMAIL PROTECTED]> wrote:
>>> As noted on the R-project web site itself ( www.r-project.org ->

... rest snipped ...

OK, I can't resist that invitation.  I think there are many kinds of
problematic data.  I handle some nasty textish things in perl (and I
loved the purgatory quote) and I'm afraid I do some things in Excel and
some cleaning I can handle in R, but I never enter data directly into R.

However, one very common scenario I have faced all my working life is
psych data from questionnaires or interviews in low budget work, mostly
student research or routine entry of therapists' data.  Typically you
have an identifier, a date, some demographics and then a lot of item
data.  There's little money (usually zero) involved for data entry and
cleaning but I've produced a lot of good(ish) papers out of this sort of
very low budget work over the last 20 years.  (Right at the other end of
a financial spectrum from the FDA/validated s'ware thread but this is
about validation again!)

The problem I often face is that people are lousy data entry machines
(well, actually, they vary ... enormously) and if they mess up the data
entry we all know how horrible this can be.

SPSS (boo hiss) used to have an excellent "module", actually a
standalone PC/Windoze program, that allowed you to define variables so
they had allowed values and it would refuse to accept out of range or
out of acceptable entries, it also allowed you to create checking rules
and rules that would, in the light of earlier entries, set later values
and not ask about them.  In a rudimentary way you could also lay things
out on the screen so that it paginated where the q'aire or paper data
record did etc.  The final nice touch was that you could define some
variables as invariant and then set the thing so an independent data
entry person could re-enter the other data (i.e. pick up q'aire, see if
ID fits the one showing on screen, if so, enter the rest of the data).
It would bleep and not move on if you entered a value other than that
entered by the first person and you had to confirm that one of you was
right.
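The range-check part of that is easy enough to sketch in R; the variable
names and limits below are invented, and this is only the simplest piece of
what that SPSS module did:

## flag out-of-range entries in a data frame of keyed-in values
## (variable names and limits are invented for illustration)
rules <- list(age = c(0, 110), item1 = c(1, 5), item2 = c(1, 5))
check.range <- function(d, rules) {
  for (v in names(rules)) {
    bad <- which(!is.na(d[[v]]) &
                 (d[[v]] < rules[[v]][1] | d[[v]] > rules[[v]][2]))
    if (length(bad) > 0)
      cat("Variable", v, ": out-of-range rows", paste(bad, collapse = ", "),
          "\n")
  }
}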

That saved me wasted weeks I'm sure on analysing data that turned out to
be awful and I'd love to see someone build something to replace that.

Currently I tend to use (boo hiss) Excel for this as everyone I work
with seems to have it (and not all can install open office and anyway I
haven't had time to learn that properly yet either ...) and I set up
spreadsheets with validation rules set.  That doesn't get the branching
rules and checks (e.g. if male, skip questions about periods, PMT and
pregnancies), or at least, with my poor Excel skills it doesn't.  I just
skip a column to indicate page breaks in the q'aire, and I get, when I
can, two people to enter the data separately and then use R to compare
the two spreadsheets having yanked them into data frames.
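That comparison might look something like the following sketch (file names
are hypothetical, and both files are assumed to have exactly the same
layout):

a <- read.csv("entry_person1.csv")   # hypothetical file names
b <- read.csv("entry_person2.csv")
stopifnot(identical(dim(a), dim(b)))
A <- as.matrix(a)     # compare everything as text; enough to locate typos
B <- as.matrix(b)
mism <- which(A != B | (is.na(A) != is.na(B)), arr.ind = TRUE)
if (nrow(mism) > 0)
  print(data.frame(row      = mism[, 1],
                   variable = colnames(A)[mism[, 2]],
                   first    = A[mism],
                   second   = B[mism]))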

I would really, really love someone to develop (and perhaps replace) the
rather buggy edit() and fix() routines (seem to hang on big data frames
in Rcmdr which is what I'm trying to get students onto) with something
that did some or all of what SPSS/DE used to do for me or I bodge now in
Excel.  If any generous coding whiz were willing to do this, I'll try to
alpha and beta test and write help etc.

There _may_ be good open source things out there that do what I need but
something that really integrates into R would be another huge step
forward in being able to phase out SPSS in my work settings and phase in R.

Very best all,

Chris



-- 
Chris Evans <[EMAIL PROTECTED]> Skype: chris-psyctc
Professor of Psychotherapy, Nottingham University;
Consultant Psychiatrist in Psychotherapy, Notts PDD network;
Research Programmes Director, Nottinghamshire NHS Trust;
*If I am writing from one of those roles, it will be clear. Otherwise*
*my views are my own and not representative of those institutions*

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Martin Henry H. Stevens
Is there an example available of this sort of problematic data that  
requires this kind of data screening and filtering? For many of us,  
this issue would be nice to learn about, and deal with within R. If a  
package could be created, that would be optimal for some of us. I  
would like to learn a tad more, if it were not too much effort for  
someone else to point me in the right direction?
Cheers,
Hank
On Jun 8, 2007, at 8:47 AM, Douglas Bates wrote:

> On 6/7/07, Robert Wilkins <[EMAIL PROTECTED]> wrote:
>> As noted on the R-project web site itself ( www.r-project.org ->
>> Manuals -> R Data Import/Export ), it can be cumbersome to prepare
>> messy and dirty data for analysis with the R tool itself. I've also
>> seen at least one S programming book (one of the yellow Springer  
>> ones)
>> that says, more briefly, the same thing.
>> The R Data Import/Export page recommends examples using SAS, Perl,
>> Python, and Java. It takes a bit of courage to say that ( when you go
>> to a corporate software web site, you'll never see a page saying  
>> "This
>> is the type of problem that our product is not the best at, here's
>> what we suggest instead" ). I'd like to provide a few more
>> suggestions, especially for volunteers who are willing to evaluate  
>> new
>> candidates.
>>
>> SAS is fine if you're not paying for the license out of your own
>> pocket. But maybe one reason you're using R is you don't have
>> thousands of spare dollars.
>> Using Java for data cleaning is an exercise in sado-masochism, Java
>> has a learning curve (almost) as difficult as C++.
>>
>> There are different types of data transformation, and for some data
>> preparation problems an all-purpose programming language is a good
>> choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has
>> excellent regular expression facilities.
>>
>> However, for some types of complex demanding data preparation
>> problems, an all-purpose programming language is a poor choice. For
>> example: cleaning up and preparing clinical lab data and adverse  
>> event
>> data - you could do it in Perl, but it would take way, way too much
>> time. A specialized programming language is needed. And since data
>> transformation is quite different from data query, SQL is not the
>> ideal solution either.
>>
>> There are only three statistical programming languages that are
>> well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more
>> popular than S for data cleaning.
>>
>> If you're an R user with difficult data preparation problems, frankly
>> you are out of luck, because the products I'm about to mention are
>> new, unknown, and therefore regarded as immature. And while the
>> founders of these products would be very happy if you kicked the
>> tires, most people don't like to look at brand new products. Most
>> innovators and inventors don't realize this, I've learned it the hard
>> way.
>>
>> But if you are a volunteer who likes to help out by evaluating,
>> comparing, and reporting upon new candidates, well you could  
>> certainly
>> help out R users and the developers of the products by kicking the
>> tires of these products. And there is a huge need for such  
>> volunteers.
>>
>> 1. DAP
>> This is an open source implementation of SAS.
>> The founder: Susan Bassein
>> Find it at: directory.fsf.org/math/stats (GNU GPL)
>>
>> 2. PSPP
>> This is an open source implementation of SPSS.
>> The relatively early version number might not give a good idea of how
>> mature the
>> data transformation features are, it reflects the fact that he has
>> only started doing the statistical tests.
>> The founder: Ben Pfaff, either a grad student or professor at  
>> Stanford CS dept.
>> Also at : directory.fsf.org/math/stats (GNU GPL)
>>
>> 3. Vilno
>> This uses a programming language similar to SPSS and SAS, but  
>> quite unlike S.
>> Essentially, it's a substitute for the SAS datastep, and also
>> transposes data and calculates averages and such. (No t-tests or
>> regressions in this version). I created this, during the years
>> 2001-2006 mainly. It's version 0.85, and has a fairly low bug  
>> rate, in
>> my opinion. The tarball includes about 100 or so test cases used for
>> debugging - for logical calculation errors, but not for extremely  
>> high
>> volumes of data.
>> The maintenance of Vilno has slowed down, because I am currently
>> (desperately) looking for employment. But once I've found new
>> employment and living quarters and settled in, I will continue to
>> enhance Vilno in my spare time.
>> The founder: that would be me, Robert Wilkins
>> Find it at: code.google.com/p/vilno ( GNU GPL )
>> ( In particular, the tarball at code.google.com/p/vilno/downloads/ 
>> list
>> , since I have yet to figure out how to use Subversion ).
>>
>> 4. Who knows?
>> It was not easy to find out about the existence of DAP and PSPP. So
>> who knows what else is out there. However, I think you'll find a lot
>> more statistics software ( regression , etc )

Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Wensui Liu
I had mentioned exactly the same thing to others and the feedback I got is -
'when you have a hammer, everything will look like a nail'
^_^.

On 6/7/07, Frank E Harrell Jr <[EMAIL PROTECTED]> wrote:
> Robert Wilkins wrote:
> > As noted on the R-project web site itself ( www.r-project.org ->
> > Manuals -> R Data Import/Export ), it can be cumbersome to prepare
> > messy and dirty data for analysis with the R tool itself. I've also
> > seen at least one S programming book (one of the yellow Springer ones)
> > that says, more briefly, the same thing.
> > The R Data Import/Export page recommends examples using SAS, Perl,
> > Python, and Java. It takes a bit of courage to say that ( when you go
> > to a corporate software web site, you'll never see a page saying "This
> > is the type of problem that our product is not the best at, here's
> > what we suggest instead" ). I'd like to provide a few more
> > suggestions, especially for volunteers who are willing to evaluate new
> > candidates.
> >
> > SAS is fine if you're not paying for the license out of your own
> > pocket. But maybe one reason you're using R is you don't have
> > thousands of spare dollars.
> > Using Java for data cleaning is an exercise in sado-masochism, Java
> > has a learning curve (almost) as difficult as C++.
> >
> > There are different types of data transformation, and for some data
> > preparation problems an all-purpose programming language is a good
> > choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has
> > excellent regular expression facilities.
> >
> > However, for some types of complex demanding data preparation
> > problems, an all-purpose programming language is a poor choice. For
> > example: cleaning up and preparing clinical lab data and adverse event
> > data - you could do it in Perl, but it would take way, way too much
> > time. A specialized programming language is needed. And since data
> > transformation is quite different from data query, SQL is not the
> > ideal solution either.
>
> We deal with exactly those kinds of data solely using R.  R is
> exceptionally powerful for data manipulation, just a bit hard to learn.
>   Many examples are at
> http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf
>
> Frank
>
> >
> > There are only three statistical programming languages that are
> > well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more
> > popular than S for data cleaning.
> >
> > If you're an R user with difficult data preparation problems, frankly
> > you are out of luck, because the products I'm about to mention are
> > new, unknown, and therefore regarded as immature. And while the
> > founders of these products would be very happy if you kicked the
> > tires, most people don't like to look at brand new products. Most
> > innovators and inventors don't realize this, I've learned it the hard
> > way.
> >
> > But if you are a volunteer who likes to help out by evaluating,
> > comparing, and reporting upon new candidates, well you could certainly
> > help out R users and the developers of the products by kicking the
> > tires of these products. And there is a huge need for such volunteers.
> >
> > 1. DAP
> > This is an open source implementation of SAS.
> > The founder: Susan Bassein
> > Find it at: directory.fsf.org/math/stats (GNU GPL)
> >
> > 2. PSPP
> > This is an open source implementation of SPSS.
> > The relatively early version number might not give a good idea of how
> > mature the
> > data transformation features are, it reflects the fact that he has
> > only started doing the statistical tests.
> > The founder: Ben Pfaff, either a grad student or professor at Stanford CS 
> > dept.
> > Also at : directory.fsf.org/math/stats (GNU GPL)
> >
> > 3. Vilno
> > This uses a programming language similar to SPSS and SAS, but quite unlike 
> > S.
> > Essentially, it's a substitute for the SAS datastep, and also
> > transposes data and calculates averages and such. (No t-tests or
> > regressions in this version). I created this, during the years
> > 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in
> > my opinion. The tarball includes about 100 or so test cases used for
> > debugging - for logical calculation errors, but not for extremely high
> > volumes of data.
> > The maintenance of Vilno has slowed down, because I am currently
> > (desperately) looking for employment. But once I've found new
> > employment and living quarters and settled in, I will continue to
> > enhance Vilno in my spare time.
> > The founder: that would be me, Robert Wilkins
> > Find it at: code.google.com/p/vilno ( GNU GPL )
> > ( In particular, the tarball at code.google.com/p/vilno/downloads/list
> > , since I have yet to figure out how to use Subversion ).
> >
> >
> > 4. Who knows?
> > It was not easy to find out about the existence of DAP and PSPP. So
> > who knows what else is out there. However, I think you'll find a lot
> > more statistics software ( regression , etc ) out th

Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Douglas Bates
On 6/7/07, Robert Wilkins <[EMAIL PROTECTED]> wrote:
> As noted on the R-project web site itself ( www.r-project.org ->
> Manuals -> R Data Import/Export ), it can be cumbersome to prepare
> messy and dirty data for analysis with the R tool itself. I've also
> seen at least one S programming book (one of the yellow Springer ones)
> that says, more briefly, the same thing.
> The R Data Import/Export page recommends examples using SAS, Perl,
> Python, and Java. It takes a bit of courage to say that ( when you go
> to a corporate software web site, you'll never see a page saying "This
> is the type of problem that our product is not the best at, here's
> what we suggest instead" ). I'd like to provide a few more
> suggestions, especially for volunteers who are willing to evaluate new
> candidates.
>
> SAS is fine if you're not paying for the license out of your own
> pocket. But maybe one reason you're using R is you don't have
> thousands of spare dollars.
> Using Java for data cleaning is an exercise in sado-masochism, Java
> has a learning curve (almost) as difficult as C++.
>
> There are different types of data transformation, and for some data
> preparation problems an all-purpose programming language is a good
> choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has
> excellent regular expression facilities.
>
> However, for some types of complex demanding data preparation
> problems, an all-purpose programming language is a poor choice. For
> example: cleaning up and preparing clinical lab data and adverse event
> data - you could do it in Perl, but it would take way, way too much
> time. A specialized programming language is needed. And since data
> transformation is quite different from data query, SQL is not the
> ideal solution either.
>
> There are only three statistical programming languages that are
> well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more
> popular than S for data cleaning.
>
> If you're an R user with difficult data preparation problems, frankly
> you are out of luck, because the products I'm about to mention are
> new, unknown, and therefore regarded as immature. And while the
> founders of these products would be very happy if you kicked the
> tires, most people don't like to look at brand new products. Most
> innovators and inventors don't realize this, I've learned it the hard
> way.
>
> But if you are a volunteer who likes to help out by evaluating,
> comparing, and reporting upon new candidates, well you could certainly
> help out R users and the developers of the products by kicking the
> tires of these products. And there is a huge need for such volunteers.
>
> 1. DAP
> This is an open source implementation of SAS.
> The founder: Susan Bassein
> Find it at: directory.fsf.org/math/stats (GNU GPL)
>
> 2. PSPP
> This is an open source implementation of SPSS.
> The relatively early version number might not give a good idea of how
> mature the
> data transformation features are, it reflects the fact that he has
> only started doing the statistical tests.
> The founder: Ben Pfaff, either a grad student or professor at Stanford CS 
> dept.
> Also at : directory.fsf.org/math/stats (GNU GPL)
>
> 3. Vilno
> This uses a programming language similar to SPSS and SAS, but quite unlike S.
> Essentially, it's a substitute for the SAS datastep, and also
> transposes data and calculates averages and such. (No t-tests or
> regressions in this version). I created this, during the years
> 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in
> my opinion. The tarball includes about 100 or so test cases used for
> debugging - for logical calculation errors, but not for extremely high
> volumes of data.
> The maintenance of Vilno has slowed down, because I am currently
> (desperately) looking for employment. But once I've found new
> employment and living quarters and settled in, I will continue to
> enhance Vilno in my spare time.
> The founder: that would be me, Robert Wilkins
> Find it at: code.google.com/p/vilno ( GNU GPL )
> ( In particular, the tarball at code.google.com/p/vilno/downloads/list
> , since I have yet to figure out how to use Subversion ).
>
> 4. Who knows?
> It was not easy to find out about the existence of DAP and PSPP. So
> who knows what else is out there. However, I think you'll find a lot
> more statistics software ( regression , etc ) out there, and not so
> much data transformation software. Not many people work on data
> preparation software. In fact, the category is so obscure that there
> isn't one agreed term: data cleaning , data munging , data crunching ,
> or just getting the data ready for analysis.

Thanks for bringing up this topic.  I think there is definitely a
place for such languages, which I would regard as data-filtering
languages, but I also think that trying to reproduce the facilities in
SAS or SPSS for data analysis is redundant.

Other responses in this thread have mentioned 'little language'
filters 

Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Ted Harding
On 08-Jun-07 08:27:21, Christophe Pallier wrote:
> Hi,
> 
> Can you provide examples of data formats that are problematic
> to read and clean with R ?
> 
> The only problematic cases I have encountered were cases with
> multiline and/or  varying length records (optional information).
> Then, it is sometimes a good idea to preprocess the data to
> present in a tabular format (one record per line).
> 
> For this purpose, I use awk (e.g.
> http://www.vectorsite.net/tsawk.html),
> which is very adept at processing ascii data files  (awk is
> much simpler to learn than perl, spss, sas, ...).

I want to join in with an enthusiastic "Me too!!". For anything
which has to do with basic checking for the kind of messes that
people can get data into when they "put it on the computer",
I think awk is ideal. It is very flexible (far more so than
many, even long-time, awk users suspect), very transparent
in its programming language (as opposed to say perl), fast,
and with light impact on system resources (rare delight in
these days, when upgrading your software may require upgrading
your hardware).

Although it may seem on the surface that awk is "two-dimensional"
in its view of data (line by line, and per field in a line),
it has some flexible internal data structures and recursive
function capability, which allows a lot more to be done with
the data that have been read in.

For example, I've used awk to trace ancestry through a genealogy,
given a data file where each line includes the identifier of an
individual and the identifiers of its male and female parents
(where known). And that was for pedigree dogs, where what happens
in real life makes Oedipus look trivial.
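The same kind of recursion is straightforward in R as well; here is a toy
sketch (the pedigree itself is invented):

## toy pedigree (invented): an individual and its (possibly unknown)
## sire and dam on each row
ped <- data.frame(id   = c("A", "B", "C", "D", "E"),
                  sire = c(NA,  NA,  "A", "A", "C"),
                  dam  = c(NA,  NA,  "B", "B", "D"),
                  stringsAsFactors = FALSE)

ancestors <- function(id, ped) {
  row <- ped[ped$id == id, ]
  if (nrow(row) == 0) return(character(0))
  parents <- c(row$sire, row$dam)
  parents <- parents[!is.na(parents)]
  unique(c(parents, unlist(lapply(parents, ancestors, ped = ped))))
}

ancestors("E", ped)    # "C" "D" "A" "B"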

> I have never encountered a data file in ascii format that I
> could not reformat with Awk.  With binary formats, it is
> another story...

But then it is a good idea to process the binary file using an
instance of the creating software, to produce an ASCII file (say
in CSV format).

> But, again, this is my limited experience; I would like to
> know if there are situations where using SAS/SPSS is really
> a better approach.

The main thing often useful for data cleaning that awk does
not have is any associated graphics. It is -- by design -- a
line-by-line text-file processor. While, for instance, you
could use awk to accumulate numerical histogram counts, you
would have to use something else to display the histogram.
And for scatter-plots there's probably not much point in
bringing awk into the picture at all (unless a preliminary
filtration of mess is needed anyway).
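In this thread's context the natural companion is R: counts written out by
an earlier filtering pass (the file name and layout here are invented) can
simply be read back in and drawn:

## one "value count" pair per line, produced by an earlier pass
h <- read.table("counts.txt", col.names = c("value", "count"))
barplot(h$count, names.arg = h$value, xlab = "value", ylab = "frequency")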

That being said, though, there can still be a use to extract
data fields from a file for submission to other software.

Another kind of area where awk would not have much to offer
is where, as a part of your preliminary data inspection,
you want to inspect the results of some standard statistical
analyses.

As a final comment, utilities like awk can be used far more
fruitfully on operating systems (the unixoid family) which
incorporate at ground level the infrastructure for "plumbing"
together streams of data output from different programs.

Ted.


E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Jun-07   Time: 10:43:05
-- XFMail --

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Christophe Pallier
Hi,

Can you provide examples of data formats that are problematic to read and
clean with R ?

The only problematic cases I have encountered were cases with multiline
and/or varying-length records (optional information). Then, it is sometimes
a good idea to preprocess the data to present it in a tabular format (one
record per line).
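That flattening can also be sketched directly in R; suppose (purely for
illustration) that each record is a block of "key: value" lines and records
are separated by blank lines:

txt   <- readLines("records.txt")          # file name and layout are invented
recno <- cumsum(txt == "") + 1             # blank lines delimit records
keep  <- txt != ""
long  <- data.frame(rec   = recno[keep],
                    key   = sub(":.*", "", txt[keep]),
                    value = sub("^[^:]*: *", "", txt[keep]))
wide  <- reshape(long, idvar = "rec", timevar = "key", direction = "wide")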

For this purpose, I use awk (e.g. http://www.vectorsite.net/tsawk.html),
which is very adept at processing ascii data files  (awk is much simpler to
learn than perl, spss, sas, ...).

I have never encountered a data file in ascii format that I could not
reformat with Awk.  With binary formats, it is another story...

But, again, this is my limited experience; I would like to know if there are
situations where using SAS/SPSS is really a better approach.

Christophe Pallier


On 6/8/07, Robert Wilkins <[EMAIL PROTECTED]> wrote:
>
> As noted on the R-project web site itself ( www.r-project.org ->
> Manuals -> R Data Import/Export ), it can be cumbersome to prepare
> messy and dirty data for analysis with the R tool itself. I've also
> seen at least one S programming book (one of the yellow Springer ones)
> that says, more briefly, the same thing.
> The R Data Import/Export page recommends examples using SAS, Perl,
> Python, and Java. It takes a bit of courage to say that ( when you go
> to a corporate software web site, you'll never see a page saying "This
> is the type of problem that our product is not the best at, here's
> what we suggest instead" ). I'd like to provide a few more
> suggestions, especially for volunteers who are willing to evaluate new
> candidates.
>
> SAS is fine if you're not paying for the license out of your own
> pocket. But maybe one reason you're using R is you don't have
> thousands of spare dollars.
> Using Java for data cleaning is an exercise in sado-masochism, Java
> has a learning curve (almost) as difficult as C++.
>
> There are different types of data transformation, and for some data
> preparation problems an all-purpose programming language is a good
> choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has
> excellent regular expression facilities.
>
> However, for some types of complex demanding data preparation
> problems, an all-purpose programming language is a poor choice. For
> example: cleaning up and preparing clinical lab data and adverse event
> data - you could do it in Perl, but it would take way, way too much
> time. A specialized programming language is needed. And since data
> transformation is quite different from data query, SQL is not the
> ideal solution either.
>
> There are only three statistical programming languages that are
> well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more
> popular than S for data cleaning.
>
> If you're an R user with difficult data preparation problems, frankly
> you are out of luck, because the products I'm about to mention are
> new, unknown, and therefore regarded as immature. And while the
> founders of these products would be very happy if you kicked the
> tires, most people don't like to look at brand new products. Most
> innovators and inventors don't realize this, I've learned it the hard
> way.
>
> But if you are a volunteer who likes to help out by evaluating,
> comparing, and reporting upon new candidates, well you could certainly
> help out R users and the developers of the products by kicking the
> tires of these products. And there is a huge need for such volunteers.
>
> 1. DAP
> This is an open source implementation of SAS.
> The founder: Susan Bassein
> Find it at: directory.fsf.org/math/stats (GNU GPL)
>
> 2. PSPP
> This is an open source implementation of SPSS.
> The relatively early version number might not give a good idea of how
> mature the
> data transformation features are, it reflects the fact that he has
> only started doing the statistical tests.
> The founder: Ben Pfaff, either a grad student or professor at Stanford CS
> dept.
> Also at : directory.fsf.org/math/stats (GNU GPL)
>
> 3. Vilno
> This uses a programming language similar to SPSS and SAS, but quite unlike
> S.
> Essentially, it's a substitute for the SAS datastep, and also
> transposes data and calculates averages and such. (No t-tests or
> regressions in this version). I created this, during the years
> 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in
> my opinion. The tarball includes about 100 or so test cases used for
> debugging - for logical calculation errors, but not for extremely high
> volumes of data.
> The maintenance of Vilno has slowed down, because I am currently
> (desperately) looking for employment. But once I've found new
> employment and living quarters and settled in, I will continue to
> enhance Vilno in my spare time.
> The founder: that would be me, Robert Wilkins
> Find it at: code.google.com/p/vilno ( GNU GPL )
> ( In particular, the tarball at code.google.com/p/vilno/downloads/list
> , since I have yet to figure out how to use Subv

Re: [R] Tools For Preparing Data For Analysis

2007-06-07 Thread Frank E Harrell Jr
Robert Wilkins wrote:
> As noted on the R-project web site itself ( www.r-project.org ->
> Manuals -> R Data Import/Export ), it can be cumbersome to prepare
> messy and dirty data for analysis with the R tool itself. I've also
> seen at least one S programming book (one of the yellow Springer ones)
> that says, more briefly, the same thing.
> The R Data Import/Export page recommends examples using SAS, Perl,
> Python, and Java. It takes a bit of courage to say that ( when you go
> to a corporate software web site, you'll never see a page saying "This
> is the type of problem that our product is not the best at, here's
> what we suggest instead" ). I'd like to provide a few more
> suggestions, especially for volunteers who are willing to evaluate new
> candidates.
> 
> SAS is fine if you're not paying for the license out of your own
> pocket. But maybe one reason you're using R is you don't have
> thousands of spare dollars.
> Using Java for data cleaning is an exercise in sado-masochism, Java
> has a learning curve (almost) as difficult as C++.
> 
> There are different types of data transformation, and for some data
> preparation problems an all-purpose programming language is a good
> choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has
> excellent regular expression facilities.
> 
> However, for some types of complex demanding data preparation
> problems, an all-purpose programming language is a poor choice. For
> example: cleaning up and preparing clinical lab data and adverse event
> data - you could do it in Perl, but it would take way, way too much
> time. A specialized programming language is needed. And since data
> transformation is quite different from data query, SQL is not the
> ideal solution either.

We deal with exactly those kinds of data solely using R.  R is 
exceptionally powerful for data manipulation, just a bit hard to learn. 
  Many examples are at 
http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

Frank

> 
> There are only three statistical programming languages that are
> well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more
> popular than S for data cleaning.
> 
> If you're an R user with difficult data preparation problems, frankly
> you are out of luck, because the products I'm about to mention are
> new, unknown, and therefore regarded as immature. And while the
> founders of these products would be very happy if you kicked the
> tires, most people don't like to look at brand new products. Most
> innovators and inventors don't realize this, I've learned it the hard
> way.
> 
> But if you are a volunteer who likes to help out by evaluating,
> comparing, and reporting upon new candidates, well you could certainly
> help out R users and the developers of the products by kicking the
> tires of these products. And there is a huge need for such volunteers.
> 
> 1. DAP
> This is an open source implementation of SAS.
> The founder: Susan Bassein
> Find it at: directory.fsf.org/math/stats (GNU GPL)
> 
> 2. PSPP
> This is an open source implementation of SPSS.
> The relatively early version number might not give a good idea of how
> mature the
> data transformation features are, it reflects the fact that he has
> only started doing the statistical tests.
> The founder: Ben Pfaff, either a grad student or professor at Stanford CS 
> dept.
> Also at : directory.fsf.org/math/stats (GNU GPL)
> 
> 3. Vilno
> This uses a programming language similar to SPSS and SAS, but quite unlike S.
> Essentially, it's a substitute for the SAS datastep, and also
> transposes data and calculates averages and such. (No t-tests or
> regressions in this version). I created this, during the years
> 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in
> my opinion. The tarball includes about 100 or so test cases used for
> debugging - for logical calculation errors, but not for extremely high
> volumes of data.
> The maintenance of Vilno has slowed down, because I am currently
> (desperately) looking for employment. But once I've found new
> employment and living quarters and settled in, I will continue to
> enhance Vilno in my spare time.
> The founder: that would be me, Robert Wilkins
> Find it at: code.google.com/p/vilno ( GNU GPL )
> ( In particular, the tarball at code.google.com/p/vilno/downloads/list
> , since I have yet to figure out how to use Subversion ).
> 
> 
> 4. Who knows?
> It was not easy to find out about the existence of DAP and PSPP. So
> who knows what else is out there. However, I think you'll find a lot
> more statistics software ( regression , etc ) out there, and not so
> much data transformation software. Not many people work on data
> preparation software. In fact, the category is so obscure that there
> isn't one agreed term: data cleaning , data munging , data crunching ,
> or just getting the data ready for analysis.
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat

Re: [R] Tools For Preparing Data For Analysis

2007-06-07 Thread Robert Duval
An additional option for Windows users is Micro Osiris

http://www.microsiris.com/

best
robert

On 6/7/07, Robert Wilkins <[EMAIL PROTECTED]> wrote:
> As noted on the R-project web site itself ( www.r-project.org ->
> Manuals -> R Data Import/Export ), it can be cumbersome to prepare
> messy and dirty data for analysis with the R tool itself. I've also
> seen at least one S programming book (one of the yellow Springer ones)
> that says, more briefly, the same thing.
> The R Data Import/Export page recommends examples using SAS, Perl,
> Python, and Java. It takes a bit of courage to say that ( when you go
> to a corporate software web site, you'll never see a page saying "This
> is the type of problem that our product is not the best at, here's
> what we suggest instead" ). I'd like to provide a few more
> suggestions, especially for volunteers who are willing to evaluate new
> candidates.
>
> SAS is fine if you're not paying for the license out of your own
> pocket. But maybe one reason you're using R is you don't have
> thousands of spare dollars.
> Using Java for data cleaning is an exercise in sado-masochism, Java
> has a learning curve (almost) as difficult as C++.
>
> There are different types of data transformation, and for some data
> preparation problems an all-purpose programming language is a good
> choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has
> excellent regular expression facilities.
>
> However, for some types of complex demanding data preparation
> problems, an all-purpose programming language is a poor choice. For
> example: cleaning up and preparing clinical lab data and adverse event
> data - you could do it in Perl, but it would take way, way too much
> time. A specialized programming language is needed. And since data
> transformation is quite different from data query, SQL is not the
> ideal solution either.
>
> There are only three statistical programming languages that are
> well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more
> popular than S for data cleaning.
>
> If you're an R user with difficult data preparation problems, frankly
> you are out of luck, because the products I'm about to mention are
> new, unknown, and therefore regarded as immature. And while the
> founders of these products would be very happy if you kicked the
> tires, most people don't like to look at brand new products. Most
> innovators and inventors don't realize this, I've learned it the hard
> way.
>
> But if you are a volunteer who likes to help out by evaluating,
> comparing, and reporting upon new candidates, well you could certainly
> help out R users and the developers of the products by kicking the
> tires of these products. And there is a huge need for such volunteers.
>
> 1. DAP
> This is an open source implementation of SAS.
> The founder: Susan Bassein
> Find it at: directory.fsf.org/math/stats (GNU GPL)
>
> 2. PSPP
> This is an open source implementation of SPSS.
> The relatively early version number might not give a good idea of how
> mature the
> data transformation features are, it reflects the fact that he has
> only started doing the statistical tests.
> The founder: Ben Pfaff, either a grad student or professor at Stanford CS 
> dept.
> Also at : directory.fsf.org/math/stats (GNU GPL)
>
> 3. Vilno
> This uses a programming language similar to SPSS and SAS, but quite unlike S.
> Essentially, it's a substitute for the SAS datastep, and also
> transposes data and calculates averages and such. (No t-tests or
> regressions in this version). I created this, during the years
> 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in
> my opinion. The tarball includes about 100 or so test cases used for
> debugging - for logical calculation errors, but not for extremely high
> volumes of data.
> The maintenance of Vilno has slowed down, because I am currently
> (desperately) looking for employment. But once I've found new
> employment and living quarters and settled in, I will continue to
> enhance Vilno in my spare time.
> The founder: that would be me, Robert Wilkins
> Find it at: code.google.com/p/vilno ( GNU GPL )
> ( In particular, the tarball at code.google.com/p/vilno/downloads/list
> , since I have yet to figure out how to use Subversion ).
>
>
> 4. Who knows?
> It was not easy to find out about the existence of DAP and PSPP. So
> who knows what else is out there. However, I think you'll find a lot
> more statistics software ( regression , etc ) out there, and not so
> much data transformation software. Not many people work on data
> preparation software. In fact, the category is so obscure that there
> isn't one agreed term: data cleaning , data munging , data crunching ,
> or just getting the data ready for analysis.
>
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> an

[R] Tools For Preparing Data For Analysis

2007-06-07 Thread Robert Wilkins
As noted on the R-project web site itself ( www.r-project.org ->
Manuals -> R Data Import/Export ), it can be cumbersome to prepare
messy and dirty data for analysis with the R tool itself. I've also
seen at least one S programming book (one of the yellow Springer ones)
that says, more briefly, the same thing.
The R Data Import/Export page recommends examples using SAS, Perl,
Python, and Java. It takes a bit of courage to say that ( when you go
to a corporate software web site, you'll never see a page saying "This
is the type of problem that our product is not the best at, here's
what we suggest instead" ). I'd like to provide a few more
suggestions, especially for volunteers who are willing to evaluate new
candidates.

SAS is fine if you're not paying for the license out of your own
pocket. But maybe one reason you're using R is you don't have
thousands of spare dollars.
Using Java for data cleaning is an exercise in sado-masochism, Java
has a learning curve (almost) as difficult as C++.

There are different types of data transformation, and for some data
preparation problems an all-purpose programming language is a good
choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has
excellent regular expression facilities.

However, for some types of complex demanding data preparation
problems, an all-purpose programming language is a poor choice. For
example: cleaning up and preparing clinical lab data and adverse event
data - you could do it in Perl, but it would take way, way too much
time. A specialized programming language is needed. And since data
transformation is quite different from data query, SQL is not the
ideal solution either.

There are only three statistical programming languages that are
well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more
popular than S for data cleaning.

If you're an R user with difficult data preparation problems, frankly
you are out of luck, because the products I'm about to mention are
new, unknown, and therefore regarded as immature. And while the
founders of these products would be very happy if you kicked the
tires, most people don't like to look at brand new products. Most
innovators and inventors don't realize this, I've learned it the hard
way.

But if you are a volunteer who likes to help out by evaluating,
comparing, and reporting upon new candidates, well you could certainly
help out R users and the developers of the products by kicking the
tires of these products. And there is a huge need for such volunteers.

1. DAP
This is an open source implementation of SAS.
The founder: Susan Bassein
Find it at: directory.fsf.org/math/stats (GNU GPL)

2. PSPP
This is an open source implementation of SPSS.
The relatively early version number might not give a good idea of how
mature the data transformation features are; it reflects the fact that he
has only started doing the statistical tests.
The founder: Ben Pfaff, either a grad student or professor at Stanford CS dept.
Also at : directory.fsf.org/math/stats (GNU GPL)

3. Vilno
This uses a programming language similar to SPSS and SAS, but quite unlike S.
Essentially, it's a substitute for the SAS datastep, and also
transposes data and calculates averages and such. (No t-tests or
regressions in this version). I created this, during the years
2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in
my opinion. The tarball includes about 100 or so test cases used for
debugging - for logical calculation errors, but not for extremely high
volumes of data.
The maintenance of Vilno has slowed down, because I am currently
(desperately) looking for employment. But once I've found new
employment and living quarters and settled in, I will continue to
enhance Vilno in my spare time.
The founder: that would be me, Robert Wilkins
Find it at: code.google.com/p/vilno ( GNU GPL )
( In particular, the tarball at code.google.com/p/vilno/downloads/list
, since I have yet to figure out how to use Subversion ).


4. Who knows?
It was not easy to find out about the existence of DAP and PSPP. So
who knows what else is out there. However, I think you'll find a lot
more statistics software ( regression , etc ) out there, and not so
much data transformation software. Not many people work on data
preparation software. In fact, the category is so obscure that there
isn't one agreed term: data cleaning , data munging , data crunching ,
or just getting the data ready for analysis.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.