Re: [R] Awk and Vilno

2007-06-13 Thread Tim Churches
Rogerio Porto wrote:
> Hey,
> 
>> What we should really compare is the four situations:
>> R alone
>> R + awk
>> R + vilno
>> R + awk + vilno
>> and maybe "R + SAS Data step"
>> and see what scripts are more elegant (read 'short and understandable')

I don't think that short and understandable necessarily go hand-in-hand.
Sometimes longer scripts which are more explicit and use fewer tricky
syntax shortcuts are much easier to understand a year or two later. Ease
and speed of script writing (taking into account the learning curve and
the time taken to consult the scripting language documentation) are
important. The ability to revisit your own scripts, or examine someone
else's, and work out what they do and how they work is vital. Speed of
execution also counts with large datasets. The ubiquity of the tool is a
further consideration: whether it is freely available on many platforms,
either pre-installed or in an easy-to-install form.

> What do you guys think of creating an R-wiki page for syntax
> comparisons among the various options to enhance R use?
> 
> I already have two suggestions:
> 
> 1) syntax examples for using R and other tools to manipulate
> and analyze large datasets (with a concise description of the
> datasets);
> 
> 2) syntax examples for using R and other tools (or R alone) to clean
> and prepare datasets (simple and very small datasets, for didactic
> purposes).

The ability of the tools to scale to large or very large datasets is
also a consideration, as is their speed when dealing with such large data.

> I think this could be interesting for R users and to promote other
> software tools, since it seems there are a lot of R users who use
> other tools also.
> 
> Besides that, questions on the two subjects above are prevalent
> on this list. Thus a wiki page seems to be the right place to discuss
> and teach this to other users.
> 
> What do you think?

Yes, happy to contribute R + Python examples to such wiki pages. Please
post the URL.

Tim C

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Awk and Vilno

2007-06-13 Thread Rogerio Porto
Hey,

> What we should really compare is the four situations:
> R alone
> R + awk
> R + vilno
> R + awk + vilno
> and maybe "R + SAS Data step"
> and see what scripts are more elegant (read 'short and understandable')

What do you guys think of creating an R-wiki page for syntax
comparisons among the various options to enhance R use?

I already have two suggestions:

1) syntax examples for using R and other tools to manipulate
and analyze large datasets (with a concise description of the
datasets);

2) syntax examples for using R and other tools (or R alone) to clean
and prepare datasets (simple and very small datasets, for didactic
purposes).

I think this could be interesting for R users and to promote other
software tools, since it seems there are a lot of R users who use
other tools also.

Besides that, questions on the two subjects above are prevalent
on this list. Thus a wiki page seems to be the right place to discuss
and teach this to other users.

What do you think?

Rogerio



Re: [R] Awk and Vilno

2007-06-13 Thread Ted Harding
On 13-Jun-07 01:24:41, Robert Wilkins wrote:
> In clinical trial data preparation and many other data situations,
> the statistical programmer needs to merge and re-merge multiple
> input files countless times. A syntax for merging files that is
> clear and concise is very important for the statistical programmer's
> productivity.
> 
> Here is how Vilno does it:
> 
> inlist dataset1 dataset2 dataset3 ;
> joinby variable1 variable2  where ( var3<=var4 ) ;
> [...]

Thanks to Robert for this more explicit illustration of what Vilno
does. Its potential usefulness is clear.
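
For readers who do not know the syntax, here is a rough Python sketch of
what the quoted "joinby ... where" statement appears to express. It
assumes that joinby is an inner join on the listed key variables and that
the where clause filters the joined rows; neither point is confirmed by
any Vilno documentation I have seen. The sample rows and the "site"
column are invented for illustration.

```python
# Hypothetical sketch: join lists of row dicts on shared key columns,
# then keep only the rows that satisfy a condition. The inner-join
# semantics are an assumption, not documented Vilno behaviour.
from collections import defaultdict


def join_by(left, right, keys):
    """Inner-join two lists of dicts on the given key columns."""
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in keys)].append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**row, **match})
    return joined


# Invented sample data; only the key and var3/var4 names come from the post.
dataset1 = [
    {"variable1": 1, "variable2": "a", "var3": 5},
    {"variable1": 2, "variable2": "b", "var3": 9},
]
dataset2 = [
    {"variable1": 1, "variable2": "a", "var4": 7},
    {"variable1": 2, "variable2": "b", "var4": 3},
]
dataset3 = [
    {"variable1": 1, "variable2": "a", "site": "north"},
    {"variable1": 2, "variable2": "b", "site": "south"},
]

keys = ["variable1", "variable2"]
merged = join_by(join_by(dataset1, dataset2, keys), dataset3, keys)

# Counterpart of: where ( var3<=var4 )
result = [row for row in merged if row["var3"] <= row["var4"]]
print(result)
```

Only the first key combination survives the filter here, since 5 <= 7
holds and 9 <= 3 does not.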

I broadly agree with the comments that have been made about the
various approaches (with/without awk/sed/R etc).

Is there any URL that leads to a fuller explicit exposition of Vilno?

As I said previously, a web-search on "vilno" leads to very little
that is relevant. What I did find didn't amount to much.

Best wishes,
Ted.


E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 13-Jun-07   Time: 10:00:12
-- XFMail --



Re: [R] Awk and Vilno

2007-06-12 Thread Christophe Pallier
On 6/13/07, Robert Wilkins <[EMAIL PROTECTED]> wrote:
>
> The point is: there are lots of data preparation scenarios where
> large numbers of merges need to be done. This is an example where
> Vilno and SAS are easier to use than the competition. I'm sure an Awk
> programmer can come up with something, but the result would be
> awkward.


Agreed.
In the awk+R scenario, it is clear that the merges are often better done
with R.
My strategy is to use awk only to clean/reformat data into a tabular format
and to do most of the "consolidation" (computations/filtering/merges) in R.
I suggested using awk only for manipulations that would be more complex to
do within R (especially multiline records or records with optional fields).
I try to keep the scripts as simple as possible on both sides.
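
To make the "clean into a tabular format first" step concrete, here is a
minimal sketch of that pre-processing stage, written in Python rather than
awk. The input layout (blank-line-separated records, one "key: value" pair
per line, some fields absent) and the field names are invented for
illustration; real files will differ.

```python
# Hypothetical input: records separated by blank lines, one "key: value"
# pair per line, with some fields missing from some records.
import csv
import io

raw = """\
id: 101
age: 34
weight: 70.5

id: 102
weight: 81.0

id: 103
age: 58
"""

fields = ["id", "age", "weight"]  # assumed column order for the output table


def parse_records(text):
    """Yield one dict per blank-line-separated record."""
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
        yield record


out = io.StringIO()
# restval fills in any field that a record does not supply.
writer = csv.DictWriter(out, fieldnames=fields, restval="NA",
                        delimiter="\t", lineterminator="\n")
writer.writeheader()
for rec in parse_records(raw):
    writer.writerow(rec)

table = out.getvalue()
print(table)
```

Missing fields are written as "NA", which read.table in R treats as a
missing value by default, so the merges and computations can then be done
on the R side.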



> Certain aspects of Vilno and SAS are a bit more user-friendly:
> Each column has a variable name, such as "PatientID".
> Awk uses $1, $2, $3, as variable names for columns. Not user-friendly.


In the first lines of awk scripts, I usually assign column numbers to
variables (e.g. "Code=1, Time=3") and then access the fields with "$Code",
"$Time"...
Yet, it is true that it is cumbersome, in awk, to use the labels on the
first line of a file as variable names (my major complaint about awk).
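
For comparison, here is a small sketch of what header-based access looks
like in a tool that supports it directly, using Python's standard csv
module. The tab-separated sample data and the "Subject" column are made
up; "Code" and "Time" sit in columns 1 and 3 as in the awk example above.

```python
import csv
import io

# Hypothetical tab-separated data with a header line.
data = "Code\tSubject\tTime\nA\ts01\t512\nB\ts01\t430\nA\ts02\t388\n"

# DictReader takes the first line as field names, so each row can be
# addressed by label instead of by position.
reader = csv.DictReader(io.StringIO(data), delimiter="\t")
times_for_a = [int(row["Time"]) for row in reader if row["Code"] == "A"]
print(times_for_a)
```

If the columns are later reordered in the file, this script keeps working,
whereas the positional awk assignments would have to be updated by hand.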

I looked at a few examples of SAS Data step scripts on the Net, and found
that the awk scripts would be very similar (except for merges), but there
may be manipulations that I missed.


> For scanning inconsistently structured ASCII data files, where
> different rows have different column specifications, Awk is a better
> tool.
>
> For data problems that lend themselves to UNIX-style regular
> expressions, Awk, again, is a great tool.



The examples of messy data formats that were described earlier on the list
are good examples where regular expressions will help a lot. In the very
first stage of data inspection, to detect coding "mistakes", awk (sometimes
with the help of other GNU tools such as 'uniq' and 'sort') can be very
efficient.
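
The kind of first-pass inspection meant here is a tally of the distinct
values in a column, in the spirit of piping a column through 'sort' and
'uniq -c'. A minimal Python sketch of the same idea follows; the column
contents and the set of expected codes are invented for illustration.

```python
from collections import Counter

# Hypothetical column of hand-entered codes with a few inconsistencies.
sex_column = ["M", "F", "F", "m", "M", "F ", "female", "M"]

# Tally each distinct value; repr() makes stray whitespace visible.
counts = Counter(sex_column)
for value, n in sorted(counts.items()):
    print(f"{n:4d}  {value!r}")

# Anything outside the assumed set of valid codes is worth a closer look.
expected = {"M", "F"}
suspects = sorted(v for v in counts if v not in expected)
print(suspects)
```

Rare or unexpected values (wrong case, trailing blanks, spelled-out labels)
stand out immediately in such a tally, before any analysis is attempted.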

> The upshot:

> Awk is a hammer.
> Vilno is a screwdriver.

Nice analogy. Using the right tool for the right task is very important.
So awk and vilno seem complementary.
Yet, when R enters into the equation, do you still "need" the three tools?

What we should really compare is the four situations:

R alone
R + awk
R + vilno
R + awk + vilno

and maybe "R + SAS Data step"

and see which scripts are more elegant (read 'short and understandable')


Best,

Christophe



-- 
Christophe Pallier (http://www.pallier.org)
