Re: [R] Awk and Vilno
Rogerio Porto wrote: > Hey, > >> What we should really compare is the four situations: >> R alone >> R + awk >> R + vilno >> R + awk + vilno >> and maybe "R + SAS Data step" >> and see what scripts are more elegant (read 'short and understandable') I don't think that short and understandable necessarily go hand-in-hand. Sometimes longer scripts which are more explicit and use less tricky syntax shortcuts are much easier to understand a year or two later. Ease and speed of script writing (taking into account learning curve and time taken to consult scripting language documentation) are important, as is the ability to re-visit scripts or examine someone else's script and be able to work out what it does and how it works is vital, and speed of execution also counts with large datasets. Also ubiquity of the tool, whether it is freely available on many platforms, either pre-installed or in an easy-to-install form are also considerations. > what do you guys think of creating a R-wiki page for syntax > comparisons among the various options to enhance R use? > > I already have two sugestions: > > 1) syntax examples for using R and other tools to manipulate > and analyze large datasets (with a concise description of the > datasets); > > 2) syntax examples for using R and other tools (or R alone) to clean > and prepare datasets (simple and very small datasets, for didatic > purposes). The ability of the tools to scale to large or very large datasets is also a consideration, as is their speed when dealing with such large data. > I think this could be interesting for R users and to promote other > software tools, since it seems there is a lot of R users that use > other tools also. > > Besides that, questions on those two above subjects are prevalent > at this list. Thus a wiki page seems to be the right place to discuss > and teach this to other users. > > What do you think? Yes, happy to contribute R + Python examples to such wiki pages. Please post the URL. Tim C __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Awk and Vilno
Hey, > What we should really compare is the four situations: > R alone > R + awk > R + vilno > R + awk + vilno > and maybe "R + SAS Data step" > and see what scripts are more elegant (read 'short and understandable') what do you guys think of creating a R-wiki page for syntax comparisons among the various options to enhance R use? I already have two sugestions: 1) syntax examples for using R and other tools to manipulate and analyze large datasets (with a concise description of the datasets); 2) syntax examples for using R and other tools (or R alone) to clean and prepare datasets (simple and very small datasets, for didatic purposes). I think this could be interesting for R users and to promote other software tools, since it seems there is a lot of R users that use other tools also. Besides that, questions on those two above subjects are prevalent at this list. Thus a wiki page seems to be the right place to discuss and teach this to other users. What do you think? Rogerio __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Awk and Vilno
On 13-Jun-07 01:24:41, Robert Wilkins wrote: > In clinical trial data preparation and many other data situations, > the statistical programmer needs to merge and re-merge multiple > input files countless times. A syntax for merging files that is > clear and concise is very important for the statistical programmer's > productivity. > > Here is how Vilno does it: > > inlist dataset1 dataset2 dataset3 ; > joinby variable1 variable2 where ( var3<=var4 ) ; > [...] Thanks to Robert for this more explicit illustration of what Vilno does. Its potential usefulness is clear. I broadly agree with the comments that have been made about the various approaches (with/without awk/sed/R etc). Is there any URL that leads to a fuller explicit exposition of Vilno? As I said previously, a web-search on "vilno" leads to very little that is relevant. What I did find didn't amount to much. Best wishes, Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 13-Jun-07 Time: 10:00:12 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Awk and Vilno
On 6/13/07, Robert Wilkins <[EMAIL PROTECTED]> wrote: > > The point is : there are lots of data preparation scenarios where > large numbers of merges need to be done. This is an example where > Vilno and SAS are easier to use than the competition. I'm sure an Awk > programmer can come up with something, but the result would be > awkward. Agreed. In the awk+R scenario, it is clear that the merges are often better done with R. My strategy is to use awk only to clean/reformat data into a tabular format and do most of the "consolidation" (computations/filtering/merges) in R. I suggested to use awk only to perform manipulations that would be more complex to do within R (especially mutliline records or recors with optionnal fields). I try to keep the scripts as simple as possible on both sides > Certain apsects of Vilno and SAS are a bit more user-friendly: > > Each column has a variable name, such as "PatientID". > > Awk uses $1, $2, $3 , as variable names for columns. Not user-friendly. > > In the first lines of awk scripts, I usually assign column numbers to variables (e.g. "Code=1, time=3") and then access the fields with "$Code", "$Time"... Yet, it is true that it is cumbersome, in awk, to use the labels on the first line of a file as a variable names (my major complain about awk). I looked at a few examples of SAS Data step scripts on the Net, and found that the awk scripts would be very similar (except for merges), but there may manipulations which I missed. > For scanning inconsistently structured ASCII data files, where > different rows have different column specifications, Awk is a better > tool. > > For data problems that lend themselves to UNIX-style regular > expressions, Awk, again, is a great tool. The examples of messy data formats that were described ealier on the list are good examples where regular expressions will help a lot. In the very first stage of data inspection, to detect coding "mistakes", awk (sometimes with the help ot other gnutools such as 'uniq' and 'sort') can be very efficient. > The upshot: > Awk is a hammer. > Vilno is a screwdriver. Nice analogy. Using the right tool for the right task is very important. So awk and vilno seem complementary. Yet, when R enters into the equation, do you still "need" the three tools? What we should really compare is the four situations: R alone R + awk R + vilno R + awk + vilno and maybe "R + SAS Data step" and see what scripts are more elegant (read 'short and understandable') Best, Christophe -- Christophe Pallier (http://www.pallier.org) [[alternative HTML version deleted]] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.