[R] A really simple data manipulation example

2007-06-26 Thread Robert Wilkins
In response to those who asked for a better explanation of what the
Vilno software does, here's a simple example that gives some idea of
what it does.

LABRESULTS is a dataset with multiple rows per patient, with lab
sodium measurements. It has columns: PATIENT_ID, VISIT_NUM, and
SODIUM.

DEMO is a dataset with one row per patient, with demographic data.
It has columns: PATIENT_ID, GENDER.

Here's a simple example. The following paragraph of code is a
data processing function (dpf):


inlist LABRESULTS DEMO ;
mergeby PATIENT_ID ;
if (SODIUM == -9) SODIUM = NULL ;
if (VISIT_NUM != 2) deleterow ;
select AVERAGE_SODIUM = avg(SODIUM) by GENDER ;
sendoff(RESULTS_DATASET)  GENDER AVERAGE_SODIUM ;
turnoff; // just means end-of-paragraph , version 1.0 won't need this.

Can you guess what it does? The lab result rows are merged with the
demographic rows, just to get the gender information merged in.
Obviously, they are merged by patient. The code -9 is used to denote
"missing", so convert that to NULL. I'm about to take a statistic for
visit 2, so rows with visit 0 or 1 must be deleted. I'm assuming, for
visit 2, each patient has at most one row. Now, for each sex group,
take the average sodium level. After the select statement, I have just
two rows, for male and female, with the average sodium level in the
AVERAGE_SODIUM column. Now the sendoff statement just stores the
current data table into a datafile, called RESULTS_DATASET.
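
For readers who do not have Vilno installed, the same sequence of steps can be sketched in plain Python. This is only an illustration under assumptions, not how Vilno is implemented: the sample rows and the helper name average_sodium_by_gender are made up, lab rows without a matching DEMO row are assumed to be dropped, and the average is assumed to skip missing values (as SQL's avg does).

```python
from collections import defaultdict

# Made-up sample rows standing in for the LABRESULTS and DEMO datasets.
LABRESULTS = [
    {"PATIENT_ID": 1, "VISIT_NUM": 1, "SODIUM": 139},
    {"PATIENT_ID": 1, "VISIT_NUM": 2, "SODIUM": 141},
    {"PATIENT_ID": 2, "VISIT_NUM": 2, "SODIUM": -9},   # -9 means "missing"
    {"PATIENT_ID": 3, "VISIT_NUM": 2, "SODIUM": 137},
    {"PATIENT_ID": 4, "VISIT_NUM": 2, "SODIUM": 143},
]
DEMO = [
    {"PATIENT_ID": 1, "GENDER": "F"},
    {"PATIENT_ID": 2, "GENDER": "F"},
    {"PATIENT_ID": 3, "GENDER": "M"},
    {"PATIENT_ID": 4, "GENDER": "M"},
]


def average_sodium_by_gender(labresults, demo):
    """Hypothetical helper mirroring the dpf: merge, recode, filter, average."""
    # "mergeby PATIENT_ID": look up each patient's gender (one DEMO row per patient).
    gender_of = {row["PATIENT_ID"]: row["GENDER"] for row in demo}
    values = defaultdict(list)
    for row in labresults:
        gender = gender_of.get(row["PATIENT_ID"])
        if gender is None:
            continue  # assumption: lab rows with no DEMO match are dropped
        # "if (SODIUM == -9) SODIUM = NULL"
        sodium = None if row["SODIUM"] == -9 else row["SODIUM"]
        # "if (VISIT_NUM != 2) deleterow"
        if row["VISIT_NUM"] != 2:
            continue
        values[gender].append(sodium)
    # "select AVERAGE_SODIUM = avg(SODIUM) by GENDER"
    results = []
    for gender in sorted(values):
        present = [v for v in values[gender] if v is not None]
        average = sum(present) / len(present) if present else None
        results.append({"GENDER": gender, "AVERAGE_SODIUM": average})
    return results


# "sendoff(RESULTS_DATASET)": here the result is simply kept in a variable.
RESULTS_DATASET = average_sodium_by_gender(LABRESULTS, DEMO)
print(RESULTS_DATASET)
```

The result has one row per gender, just like the table left behind by the select statement.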

So you have a sequence of data tables, each calculation reading in the
current table , and leaving a new data table for the next calculation.

So you have input datasets, a bunch of intermediate calculations, and
one or more output datasets. Pretty simple idea.

*

Some caveats:

LABRESULTS and DEMO are binary datasets. The asciitobinary and
binarytoascii statements are used to convert between binary datasets
and comma-separated ascii data files. (You can use any delimiter:
comma, vertical bar , etc). An asciitobinary statement is typically
just two lines of code.

The dpf begins with the inlist statement , and , for the moment ,
needs "turnoff ;" as the last line. Version 1.0 won't require the use
of "turnoff;", but version 0.85 does. It only means this paragraph of
code ends here ( a program can , of course , contain many paragraphs:
data processing functions, print statements, asciitobinary statements,
etc.).

If you've worked with lab data, you know lab data does not look so
simplistic. I need a simple example.

Vilno has a lot of functionality: many-to-many joins, adding columns,
firstrow() and lastrow() flags, and so forth. A fair number of complex
data manipulations have already been tested with test programs ( in
the tarball ). Of course, a simple example cannot show you that; it's
just a small taste.

*

If you've never used SPSS or SAS before, you won't care, but this
programming language falls in the same family as the SPSS and SAS
programming languages. All three programming languages have a fair
amount in common, but are quite different from the S programming
language. The vilno data processing function can replace the SAS
datastep. (It can also replace PROC TRANSPOSE and much of PROC MEANS,
except standard deviation calculations still need to be included in
the select statement).



I hope that helps.

http://code.google.com/p/vilno

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Has anyone tried out my software?

2007-06-25 Thread Robert Wilkins
Hello all,

Has anyone ( who uses a Linux desktop ) tried out my stuff I mentioned
a few weeks ago?
Perhaps installed it and run a couple of example programs?

If you have, tell me what you think.


Robert


( it's the tarball in the download section at
http://code.google.com/p/vilno , discussed briefly in comparison to
Awk a couple of weeks ago )



PS

Of R users: how many use Windows XP, how many use an Apple, and how
many use a Linux desktop? Are there a lot of Linux users out there?

Is R more popular in Europe than North America? I'll need to do a
statistical analysis of the mailing list. I notice a ton of Europeans.



Re: [R] Tools For Preparing Data For Analysis

2007-06-14 Thread Robert Wilkins
[ Arrggh, not reply , but reply to all , cross my fingers again , sorry Peter! ]

Hmm,

I don't think you need a retain statement.

if first.patientID ;
or
if last.patientID ;

ought to do it.

It's actually better than the Vilno version, I must admit, a bit more concise:

if ( not firstrow(patientID) ) deleterow ;

Ah well.
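
For comparison, here is a minimal Python sketch of the same "keep only the first (or last) row per patient" idea. The function names and sample rows are hypothetical, and the sketch keeps the first occurrence in input order, so it assumes the rows are already in the order you care about.

```python
def first_rows(rows, key):
    """Keep the first row of each key group, preserving input order."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept


def last_rows(rows, key):
    """Keep the last row of each key group by scanning the rows in reverse."""
    return first_rows(rows[::-1], key)[::-1]


# Made-up rows: patient 1 has two visits, patient 2 has one.
visits = [
    {"patientID": 1, "visit": 1},
    {"patientID": 1, "visit": 2},
    {"patientID": 2, "visit": 1},
]
print(first_rows(visits, "patientID"))
print(last_rows(visits, "patientID"))
```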

**
For the folks asking for location of software ( I know I posted it, but
it didn't connect to the thread, and you get a huge number of posts
each day , sorry):

Vilno , find at
http://code.google.com/p/vilno

DAP & PSPP,  find at
http://directory.fsf.org/math/stats

Awk, find at lots of places,
http://www.gnu.org/software/gawk/gawk.html

Anything else? DAP & PSPP are hard to find, I'm sure there's more out there!
What about MDX? Nahh, not really the right problem domain.
Nobody uses MDX for this stuff.

**

If my examples, using clinical trial data, are boring and hard to
understand for those who asked for examples (and presumably don't work
in clinical trials), let me know. Some of these other examples I'm
reading about are quite interesting.
It doesn't help that clinical trial databases cannot be public. Making
a fake database would take a lot of time.
The irony is, even with my deep understanding of data preparation in
clinical trials, the pharmas still don't want to give me a job (because
I was gone for many years).


Let's see if this post works : thanks to the folks who gave me advice
on how to properly respond to a post within a  thread . ( Although the
thread in my gmail account is only a subset of the posts visible in
the archives ). Crossing my fingers 

On 6/10/07, Peter Dalgaard <[EMAIL PROTECTED]> wrote:
> Douglas Bates wrote:
> > Frank Harrell indicated that it is possible to do a lot of difficult
> > data transformation within R itself if you try hard enough but that
> > sometimes means working against the S language and its "whole object"
> > view to accomplish what you want and it can require knowledge of
> > subtle aspects of the S language.
> >
> Actually, I think Frank's point was subtly different: It is *because* of
> the differences in view that it sometimes seems difficult to find the
> way to do something in R that  is apparently straightforward in SAS.
> I.e. the solutions exist and are often elegant, but may require some
> lateral thinking.
>
> Case in point: Finding the first or the last observation for each
> subject when there are multiple records for each subject. The SAS way
> would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that
> you can compare the subject ID with the one from the previous record,
> working with data that are sorted appropriately.
>
> You can do the same thing in R with a for loop, but there are better
> ways e.g.
> subset(df,!duplicated(ID)), and subset(df, rev(!duplicated(rev(ID)))), or
> maybe
> do.call("rbind",lapply(split(df,df$ID), head, 1)), resp. tail. Or
> something involving aggregate(). (The latter approaches generalize
> better to other within-subject functionals like cumulative doses, etc.).
>
> The hardest cases that I know of are the ones where you need to turn one
> record into many, such as occurs in survival analysis with
> time-dependent, piecewise constant covariates. This may require
> "transposing the problem", i.e. for each  interval you find out which
> subjects contribute and with what, whereas the SAS way would be a
> within-subject loop over intervals containing an OUTPUT statement.
>
> Also, there are some really weird data formats, where e.g. the input
> format is different in different records. Back in the 80's where
> punched-card input was still common, it was quite popular to have one
> card with background information on a patient plus several cards
> detailing visits, and you'd get a stack of cards containing both kinds.
> In R you would most likely split on the card type using grep() and then
> read the two kinds separately and merge() them later.
>
>



[R] Where to Find Data Transformation Software

2007-06-13 Thread Robert Wilkins
Hello All,

Here is the requested information. Most of it was on the original post for the
"Tools For Preparing Data For Analysis" thread from last week, but it
got overlooked.
They are all given under an open source license.
Check 'em out!

***

Vilno: data transformation software, that reads in input datasets
(rows and columns), crunches through the data, and writes out output
datasets. It's an open source application that can replace the SAS
datastep ( and also replaces proc transpose and proc means ).

Find it at: http://code.google.com/p/vilno
( look in the download section for a tarball, it's a Linux
application, can be opened up (and maybe installed) on an Apple as
well ).



DAP and PSPP: open source implementations of SAS and SPSS.

Find them at: http://directory.fsf.org/math/stats

*

Awk: data transformation/filtering software for semi-structured ASCII files.
A predecessor to Perl.

Find it at: a lot of places, but try:
http://www.gnu.org/software/gawk/gawk.html

*

Some, but not all , data crunching problems can be handled fairly well by an
all-purpose programming language, such as Perl or Python or Ruby.
Some, but not all, data crunching problems can be handled reasonably
well with the
S programming language ( i.e., R ).



[R] Difficulties With Posting To Ongoing Threads on the R Mailing List

2007-06-13 Thread Robert Wilkins
A number of people are having the same problem as me, when you post as
a response to an ongoing thread, in place of your message, the
following message appears:

An embedded & charset-unspecified text was scrubbed ...

and a link is given that leads to the desired message.

It's better than nothing , but it sure is annoying, and some readers
will skip it instead of doing the extra link. It's also annoying to
read a thread, when several posters , through no fault of their own,
get "scrubbed".

I always think of an e-mail as pure ASCII text, unless you add an attachment.
Is it possible that some e-mail hosts ( I use gmail ) embed binary
code into the e-mail?
Maybe the R mailing list software is reacting to that.

**

On another note, I tried posting on gmane, to add to the thread from
last week. It just disappeared , or maybe not, I don't know. Maybe
it's related to the one-time registration requirement for gmane.

*

As far as I can tell, the above problem (scrubbing) does not occur
when you do a stand-alone post, not as a response to an ongoing
thread. Hope it stays that way!


**
Have a nice day.



[R] Awk and Vilno

2007-06-12 Thread Robert Wilkins
In clinical trial data preparation and many other data situations, the
statistical programmer needs to merge and re-merge multiple input
files countless times. A syntax for merging files that is clear and
concise is very important for the statistical programmer's
productivity.

Here is how Vilno does it:

inlist dataset1 dataset2 dataset3 ;
joinby variable1 variable2  where ( var3<=var4 ) ;

Each column in a dataset has a variable name ( variable1, variable2,
var3, var4 ).
You are merging three input datafiles: dataset1, dataset2, and dataset3.
The joinby statement asks for a many-to-many join, rather like the SQL
SELECT statement.
[ The mergeby statement asks for a many-to-one join , more efficient ]
[ The readby statement asks for interleaving of rows: the rows don't
"match up", but one row goes under the preceding row
(100 rows + 100 rows -> 200 output rows) ]
The join( or merge ) is done with variable1*variable2 subgroups: A row
from dataset1 where variable1=4 and variable2="Sam" can only match to
a row from dataset2 where variable1=4 and variable2="Sam". Also, any
match-ups where it is not the case that var3<=var4 are also excluded.
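
To make the many-to-many semantics concrete, here is a rough Python sketch of such a join. It is an illustration under assumptions, not Vilno's implementation: the function name join_by and the sample rows are made up, only key values present in every input dataset are assumed to produce output, and the example uses two datasets for brevity although the function accepts any number.

```python
from itertools import product


def join_by(datasets, keys, where=lambda row: True):
    """Hypothetical many-to-many join over any number of datasets."""
    # Group each dataset's rows by their key values.
    grouped = []
    for rows in datasets:
        groups = {}
        for row in rows:
            groups.setdefault(tuple(row[k] for k in keys), []).append(row)
        grouped.append(groups)
    # Assumption: only key values found in every dataset produce output.
    common = set(grouped[0])
    for groups in grouped[1:]:
        common &= set(groups)
    output = []
    for key in sorted(common):
        # Every combination of matching rows is a candidate output row.
        for combo in product(*(groups[key] for groups in grouped)):
            merged = {}
            for row in combo:
                merged.update(row)
            if where(merged):
                output.append(merged)
    return output


dataset1 = [
    {"variable1": 4, "variable2": "Sam", "var3": 1},
    {"variable1": 4, "variable2": "Sam", "var3": 9},
]
dataset2 = [
    {"variable1": 4, "variable2": "Sam", "var4": 5},
    {"variable1": 4, "variable2": "Sam", "var4": 10},
    {"variable1": 7, "variable2": "Ann", "var4": 3},
]
joined = join_by(
    [dataset1, dataset2],
    ["variable1", "variable2"],
    where=lambda row: row["var3"] <= row["var4"],
)
for row in joined:
    print(row)
```

Of the four candidate match-ups within the (4, "Sam") subgroup, the one with var3=9 and var4=5 is excluded by the where condition, and the (7, "Ann") row has no partner at all.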

Here's how the SAS datastep will do it:

merge dataset1 dataset2 dataset3 ;
by variable1 variable2 ;
if ^( var3<=var4 ) then delete ;

[Actually, the SAS datastep can only do a many-to-one join, but you
can do a PROC SQL paragraph to do an SQL SELECT statement, then export
the results to a SAS datastep afterwards.]

The point is : there are lots of data preparation scenarios where
large numbers of merges need to be done. This is an example where
Vilno and SAS are easier to use than the competition. I'm sure an Awk
programmer can come up with something, but the result would be
awkward.

You can also find other data preparation problems where the best tool
is Awk. Looking through "Sed & Awk" (O'Reilly) gives a good idea. I'm
not an expert Awk-er, sure, but I think I can see that Awk and Vilno are
really like apples and oranges.

For scanning inconsistently structured ASCII data files, where
different rows have different column specifications, Awk is a better
tool.

For data problems that lend themselves to UNIX-style regular
expressions, Awk, again, is a great tool.

If you have a data manipulation problem that is incredibly simple,
then converting an ascii data file to binary, and then back, may not
seem worth it. Awk, again, wins. But the asciitobinary and
binarytoascii statements ( there and back ) only take 4 lines or so,
so Vilno is really not that bad.

Certain aspects of Vilno and SAS are a bit more user-friendly:
Each column has a variable name, such as "PatientID".
Awk uses $1, $2, $3 , as variable names for columns. Not user-friendly.
In both Vilno and SAS (and SQL) the possibility of "MISSING" ( or
"NULL" ) is built into the data values held in the columns. So you
don't have to use separate boolean variables to track MISSING vs
NOT-MISSING. Very convenient.

Vilno does have a lot of functionality that is a lot harder to
implement in most other programming languages. (You can implement that
functionality, but it would take a ton of code - the three merge-in
options for Vilno are an example).

The upshot:

Awk is a hammer.
Vilno is a screwdriver.



Re: [R] Tools For Preparing Data For Analysis

2007-06-09 Thread Robert Wilkins
Here are some examples of the type of data crunching you might have to do.

In response to the requests by Christophe Pallier and Martin Stevens.

Before I started developing Vilno, some six years ago, I had been working in
the pharmaceutical industry for eight years ( it's not easy to show you actual data
though, because it's all confidential of course).

Lab data can be especially messy, especially if one clinical trial allows
the physicians to use different labs. So let's consider lab data.

Merge in normal ranges, into the lab data. This has to be done by lab-site
and lab testcode(PLT for platelets, etc.), obviously. I've seen cases where
you also need to match by sex and age. The sex column in the normal ranges
could be: blank, F, M, or B ( B meaning for Both sexes). The age column in
the normal ranges could be: blank, or something like "40 <55". Even worse,
you could have an ageunits column in the normal ranges dataset: usually "Y",
but if there are children in the clinical trial, you will have "D" or "M",
for Days and Months. If the clinical trial is for adults, all rows with "D"
or "M" should be tossed out at the start. Clearly the statistical programmer
has to spend time looking at the data, before writing the program. Remember,
all of these details can change any time you move to a new clinical trial.

So for the lab data, you have to merge in the patient's date of birth,
calculate age, and somehow relate that to the age-group column in the normal
ranges dataset.

(By the way, in clinical trial data preparation, the SAS datastep is much
more useful and convenient, in my opinion, than the SQL SELECT syntax, at
least 97% of the time. But in the middle of this program, when you merge the
normal ranges into the lab data, you get a better solution with PROC SQL (
just the SQL SELECT statement implemented inside SAS) This is because of the
trickiness of the age match-up, and the SAS datastep does not do well with
many-to-many joins.).

Merge in various study drug administration dates into the lab data. Now, for
each lab record, calculate treatment period ( or cycle number ), depending
on the statistician's specifications and the way the clinical trial is
structured.

Different clinical sites chose to use different lab providers. So, for
example, for Monocytes, you have 10 different units ( essentially 6 units,
but spelling inconsistencies as well). The statistician has requested that
you use standardized units in some of the listings ( % units, and only one
type of non-% unit, for example ). At the same time, lab values need to be
converted ( *1.61 , divide by 1000, etc. ). This can be very time consuming
no matter what software you use, and, in my experience, when the SAS
programmer asks for more clinical information or lab guidebooks, the
response is incomplete, so he does a lot of guesswork. SAS programmers do
not have expertise in lab science, hence the guesswork.

Your program has to accommodate numeric values, "1.54" , quasi-numeric values
"<1" , and non-numeric values "Trace".

Your data listing is tight for space, so print "PROLONGED CELL CONT" as
"PRCC".

Once normal ranges are merged in, figure out which values are out-of-range
and high , which are low, and which are within normal range. In the data
listing, you may have "H" or "L" appended to the result value being printed.
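
As a rough illustration of two of the steps above (telling numeric, quasi-numeric, and non-numeric result strings apart, then flagging out-of-range values), here is a small Python sketch. The function names are hypothetical, and the choices that only plain numeric results get a flag and that values exactly on a range boundary count as normal are assumptions; a real study would follow the statistician's specifications.

```python
import re

_NUMBER = r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)"
_NUMERIC = re.compile(rf"^{_NUMBER}$")
_QUASI = re.compile(rf"^(<=|>=|<|>)\s*({_NUMBER})$")


def classify_result(text):
    """Return (kind, number, extra) for a raw lab result string."""
    text = text.strip()
    if _NUMERIC.match(text):
        return ("numeric", float(text), None)
    match = _QUASI.match(text)
    if match:
        # Keep the comparison sign so a listing can still print "<1".
        return ("quasi-numeric", float(match.group(2)), match.group(1))
    return ("non-numeric", None, text)


def range_flag(text, low, high):
    """Return "H", "L", or "" for a result string against a normal range."""
    kind, number, _ = classify_result(text)
    if kind != "numeric" or low is None or high is None:
        return ""  # assumption: only plain numeric results are flagged
    if number > high:
        return "H"
    if number < low:
        return "L"
    return ""


for raw in ["1.54", "<1", "Trace", "151"]:
    print(raw, classify_result(raw), repr(range_flag(raw, 135, 145)))
```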

For each treatment period, you may need a unique lab record selected, in
case there are two or three for the same treatment period. The statistician
will tell the SAS programmer how. Maybe the averages of the results for that
treatment period, maybe that lab record closest to the mid-point of the
treatment period. This isn't for the data listing, but for a summary table.
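
The two selection rules mentioned here can be sketched in Python as follows. The function names, field names, dates, and values are all made up for illustration; ties for "closest to the mid-point" are assumed to go to the earlier record in the list.

```python
from datetime import date


def period_average(records):
    """Average the non-missing result values within one treatment period."""
    values = [r["value"] for r in records if r["value"] is not None]
    return sum(values) / len(values) if values else None


def closest_to_midpoint(records, start, end):
    """Pick the record whose date is nearest the mid-point of the period."""
    midpoint = start + (end - start) / 2
    return min(records, key=lambda r: abs(r["date"] - midpoint))


# Made-up lab records for one patient in one treatment period.
period_start, period_end = date(2007, 1, 1), date(2007, 1, 29)
records = [
    {"date": date(2007, 1, 3), "value": 4.1},
    {"date": date(2007, 1, 14), "value": 4.5},
    {"date": date(2007, 1, 27), "value": 5.0},
]
print(period_average(records))
print(closest_to_midpoint(records, period_start, period_end))
```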

For the differentials ( monocytes, lymphocytes, etc) , merge in the WBC
(total white blood cell count) values , to convert values between % units
and absolute count units.
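
The arithmetic behind that conversion is simple: absolute count = WBC x percent / 100. A minimal Python sketch, with hypothetical function names and the assumption that the absolute count is expressed in the same units as the WBC value:

```python
def pct_to_absolute(pct, wbc):
    """Convert a differential from % units to an absolute count."""
    if pct is None or wbc is None:
        return None  # missing in, missing out
    return wbc * pct / 100.0


def absolute_to_pct(absolute, wbc):
    """Convert an absolute differential count back to % of total WBC."""
    if absolute is None or wbc is None or wbc == 0:
        return None
    return absolute / wbc * 100.0


# Made-up example: monocytes at 8% with a total WBC of 6.0.
print(pct_to_absolute(8.0, 6.0))
print(absolute_to_pct(0.48, 6.0))
```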

When printing the values in the data listing, you need "H" or "L" to the
right of the value. But you also need the values to be well lined up ( the
decimal place ). This can be stupidly time consuming.
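
One way to line values up on the decimal point, sketched in Python. The function name align_on_decimal is hypothetical and the sample values are made up; the idea is to pad the part before the decimal point on the left and the part after it on the right, then append the flag.

```python
def align_on_decimal(values, flags):
    """Pad result strings so decimal points line up; the flag goes on the right."""
    split = [v.partition(".") for v in values]
    left = max(len(whole) for whole, _, _ in split)
    right = max(len(dot) + len(frac) for _, dot, frac in split)
    lines = []
    for (whole, dot, frac), flag in zip(split, flags):
        body = whole.rjust(left) + (dot + frac).ljust(right)
        lines.append((body + " " + flag).rstrip())
    return lines


for line in align_on_decimal(["1.54", "12.1", "140"], ["", "H", "L"]):
    print(line)
```

Because every value is padded to the same width before the flag is appended, the "H" and "L" flags also end up in a single column.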



AND ON AND ON AND ON .

I think you see why clinical trials statisticians and SAS programmers enjoy
lots of job security.



On 6/8/07, Martin Henry H. Stevens <[EMAIL PROTECTED]> wrote:
>
> Is there an example available of this sort of problematic data that
> requires this kind of data screening and filtering? For many of us,
> this issue would be nice to learn about, and deal with within R. If a
> package could be created, that would be optimal for some of us. I
> would like to learn a tad more, if it were not too much effort for
> someone else to point me in the right direction?
> Cheers,
> Hank
> On Jun 8, 2007, at 8:47 AM, Douglas Bates wrote:
>

[R] How do you do an e-mail post that is within an ongoing thread?

2007-06-08 Thread Robert Wilkins
That may sound like a stupid question, but if it confuses me, I'm sure
it confuses others as well. I've tried to find that information on the
R mail-group info pages, can't seem to find it. Is it something
obvious?

To begin a brand new discussion, you do your post as an e-mail sent to
 r-help@stat.math.ethz.ch .
As I am doing right now.

How do I do an additional post that gets included in the
"[R] Tools For Preparing Data For Analysis" thread, a thread which I
started myself yesterday ( thanks for all the responses everybody )?

There's got to be a real easy answer to that, since everybody else does that.
(I'm using gmail, does it make a difference what e-mail host you use?).

---


PS
If you happen to be reading this, Christophe Pallier & Martin Stevens,
I will respond to your request for examples shortly, once I figure
this posting how-to out. My examples will come from data preparation
problems in clinical trial data ( I worked for 8 years on clinical
trial analysis before beginning work on Vilno ). I'll probably use lab
data as an example because  lab data can be messy and difficult to
work with.



[R] Tools For Preparing Data For Analysis

2007-06-07 Thread Robert Wilkins
As noted on the R-project web site itself ( www.r-project.org ->
Manuals -> R Data Import/Export ), it can be cumbersome to prepare
messy and dirty data for analysis with the R tool itself. I've also
seen at least one S programming book (one of the yellow Springer ones)
that says, more briefly, the same thing.
The R Data Import/Export page recommends examples using SAS, Perl,
Python, and Java. It takes a bit of courage to say that ( when you go
to a corporate software web site, you'll never see a page saying "This
is the type of problem that our product is not the best at, here's
what we suggest instead" ). I'd like to provide a few more
suggestions, especially for volunteers who are willing to evaluate new
candidates.

SAS is fine if you're not paying for the license out of your own
pocket. But maybe one reason you're using R is you don't have
thousands of spare dollars.
Using Java for data cleaning is an exercise in sado-masochism; Java
has a learning curve (almost) as difficult as C++.

There are different types of data transformation, and for some data
preparation problems an all-purpose programming language is a good
choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has
excellent regular expression facilities.

However, for some types of complex demanding data preparation
problems, an all-purpose programming language is a poor choice. For
example: cleaning up and preparing clinical lab data and adverse event
data - you could do it in Perl, but it would take way, way too much
time. A specialized programming language is needed. And since data
transformation is quite different from data query, SQL is not the
ideal solution either.

There are only three statistical programming languages that are
well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more
popular than S for data cleaning.

If you're an R user with difficult data preparation problems, frankly
you are out of luck, because the products I'm about to mention are
new, unknown, and therefore regarded as immature. And while the
founders of these products would be very happy if you kicked the
tires, most people don't like to look at brand new products. Most
innovators and inventors don't realize this; I've learned it the hard
way.

But if you are a volunteer who likes to help out by evaluating,
comparing, and reporting upon new candidates, well you could certainly
help out R users and the developers of the products by kicking the
tires of these products. And there is a huge need for such volunteers.

1. DAP
This is an open source implementation of SAS.
The founder: Susan Bassein
Find it at: directory.fsf.org/math/stats (GNU GPL)

2. PSPP
This is an open source implementation of SPSS.
The relatively early version number might not give a good idea of how
mature the
data transformation features are, it reflects the fact that he has
only started doing the statistical tests.
The founder: Ben Pfaff, either a grad student or professor at Stanford CS dept.
Also at : directory.fsf.org/math/stats (GNU GPL)

3. Vilno
This uses a programming language similar to SPSS and SAS, but quite unlike S.
Essentially, it's a substitute for the SAS datastep, and also
transposes data and calculates averages and such. (No t-tests or
regressions in this version). I created this, during the years
2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in
my opinion. The tarball includes about 100 or so test cases used for
debugging - for logical calculation errors, but not for extremely high
volumes of data.
The maintenance of Vilno has slowed down, because I am currently
(desperately) looking for employment. But once I've found new
employment and living quarters and settled in, I will continue to
enhance Vilno in my spare time.
The founder: that would be me, Robert Wilkins
Find it at: code.google.com/p/vilno ( GNU GPL )
( In particular, the tarball at code.google.com/p/vilno/downloads/list
, since I have yet to figure out how to use Subversion ).


4. Who knows?
It was not easy to find out about the existence of DAP and PSPP. So
who knows what else is out there. However, I think you'll find a lot
more statistics software ( regression , etc ) out there, and not so
much data transformation software. Not many people work on data
preparation software. In fact, the category is so obscure that there
isn't one agreed term: data cleaning , data munging , data crunching ,
or just getting the data ready for analysis.



[R] Why is the R mailing list so hard to figure out?

2007-06-04 Thread Robert Wilkins
Why does the R mailing list need such an unusual and customized user interface?

Last January, I figured out how to read Usenet mailing lists ( or
Usenet groups ) and they all pretty much work the same, learn to use
one, you've learned to use them all ( gnu.misc.discuss ,
comp.lang.lisp , and so on ).

What's the best way to view and read discussions in this group for
recent days? Can I view the postings for the current day via Google
Groups?

I hope I'm posting correctly.

What do "ethz" and "ch" stand for? Is "ch" for Switzerland?


Robert
