..............................................<°}))><........
 ) ) ) ) )
( ( ( ( (    Prof. Philippe Grosjean
 ) ) ) ) )
( ( ( ( (    Numerical Ecology of Aquatic Systems
 ) ) ) ) )   Mons-Hainaut University, Belgium
( ( ( ( (
..............................................................

Romain Francois wrote:
strapply in package gsubfn brings elegance here:

Don't! If you write functions for a package that is meant to be included in the base or recommended packages, then your package should depend only on base (preferably) or on the recommended packages themselves!

So, forget about gsubfn, unless it is itself incorporated in base or utils.
Best,

Philippe

 > txt <- '<foo>bar</foo>'
 > rx <- "<(.*?)>(.*?)</(.*?)>"
 > strapply( txt, rx, c , perl = T )
[[1]]
[1] "foo" "bar" "foo"

Too bad you pay for this in performance:

 > txt <- rep( '<foo>bar</foo>', 1000 )
 > rx <- "<(.*?)>(.*?)</(.*?)>"
 > system.time( out <- strapply( txt, rx, c , perl = T ) )
  user  system elapsed
 2.923   0.005   3.063
 > system.time( out2 <- sapply( paste('\\', 1:3, sep=''), function(x){
+ gsub(rx, x, txt, perl=TRUE)
+ } ) )
  user  system elapsed
 0.011   0.000   0.011

Not sure what the right play is.
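
For comparison, a minimal base-only sketch of the same extraction, with no dependency on gsubfn. It assumes an R version that provides regmatches() and whose regexec() accepts perl = TRUE; the name out3 is only illustrative and not part of the original thread.

```r
# Sketch: extract the three capture groups using base R only.
txt <- rep('<foo>bar</foo>', 1000)
rx <- "<(.*?)>(.*?)</(.*?)>"

# regexec() returns the positions of the full match and of each group;
# regmatches() turns those positions into substrings, one vector per input.
m <- regmatches(txt, regexec(rx, txt, perl = TRUE))

# Drop the first element (the full match) to keep only the groups.
out3 <- lapply(m, function(x) x[-1])
out3[[1]]
# [1] "foo" "bar" "foo"
```

Unlike the gsub approach, this only looks at the first match in each string, so it is closer to a per-line extraction than to a global replacement.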


Wacek Kusnierczyk wrote:
Romain Francois wrote:
   txt <- grep( '^<tr.*<td align=right.*<a', readLines( url ), value = TRUE )
   rx <- '^.*?<a href="(.*?)">(.*?)</a>.*<td>(.*?)</td>.*$'
   out <- data.frame(
       url = gsub( rx, "\\1", txt ),
       group = gsub( rx, "\\2", txt ),
       description = gsub( rx, "\\3", txt ),

looking at this bit of your code, i wonder why gsub is not vectorized
for the pattern and replacement arguments, although it is for the x
argument.  the three lines above could be collapsed to just one with a
vectorized gsub:

    gsubm = function(pattern, replacement, x, ...)
       mapply(USE.NAMES=FALSE, SIMPLIFY=FALSE,
           gsub, pattern=pattern, replacement=replacement, x=x, ...)

for example, given the sample data

    txt = '<foo>foo</foo><bar>bar</bar>'
    rx = '<(.*?)>(.*?)</(.*?)>'

the sequence

    open = gsub(rx, '\\1', txt, perl=TRUE)
    content = gsub(rx, '\\2', txt, perl=TRUE)
    close = gsub(rx, '\\3', txt, perl=TRUE)

    print(list(open, content, close))
could be replaced with

    data = structure(names=c('open', 'content', 'close'),
        gsubm(rx, paste('\\', 1:3, sep=''), txt, perl=TRUE))

    print(data)

surely, a call to mapply does not improve performance, but a
source-level fix should not be too difficult;  unfortunately, i can't
find myself willing to struggle with r sources right now.
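
one caveat with the mapply version: mapply vectorizes over every argument passed through its dots, so gsubm as defined above also iterates over the elements of x in parallel with the replacements, which is not what one wants when x has more than one element. a possible variant, sketched here under the hypothetical name gsubm2, passes x whole to each gsub call through MoreArgs:

```r
# Hypothetical variant (gsubm2 is not from the original post):
# vectorize over pattern and replacement only; x and any extra
# arguments are handed unchanged to every gsub call via MoreArgs.
gsubm2 <- function(pattern, replacement, x, ...)
    mapply(gsub, pattern = pattern, replacement = replacement,
           MoreArgs = list(x = x, ...),
           SIMPLIFY = FALSE, USE.NAMES = FALSE)

txt <- rep('<foo>bar</foo>', 4)
rx <- '<(.*?)>(.*?)</(.*?)>'

# One list element per replacement, each as long as txt.
data <- structure(names = c('open', 'content', 'close'),
    gsubm2(rx, paste('\\', 1:3, sep = ''), txt, perl = TRUE))

print(data$content)
# [1] "bar" "bar" "bar" "bar"
```

this is still a loop over gsub calls, so it does not address the performance point; it only keeps the result shaped like the three separate gsub lines.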


note also that .*? does not work as a non-greedy .* with the default
regex engine, e.g.,

    txt = "foo='FOO' bar='BAR'"
    gsub("(.*?)='(.*?)'", '\\1', txt)
    # "foo='FOO' bar"
    gsub("(.*?)='(.*?)'", '\\2', txt)
    # "BAR"

because the first .*? matches everything up to, but not including, the
second, *not* the first, '='.  for a non-greedy match, you'd need pcre
(and using pcre generally improves performance anyway):

    txt = "foo='FOO' bar='BAR'"
    gsub("(.*?)='(.*?)'", '\\1', txt, perl=TRUE)
    # "foo bar"
    gsub("(.*?)='(.*?)'", '\\2', txt, perl=TRUE)
    # "FOOBAR" (the space before bar is part of the second match)
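
an alternative that does not rely on lazy quantifiers at all is to use negated character classes, which behave the same way under either engine. a minimal sketch on the same sample string (the pattern name rx2 is only illustrative):

```r
# Sketch: bound each group with a negated character class instead of .*?
txt <- "foo='FOO' bar='BAR'"
rx2 <- "([^=]*)='([^']*)'"

# Group 1 cannot cross an '=', group 2 cannot cross a quote,
# so neither group can run past its own key='value' pair.
keys <- gsub(rx2, '\\1', txt)
keys
# [1] "foo bar"

vals <- gsub(rx2, '\\2', txt)
vals
# [1] "FOOBAR"
```

the second match starts right after the first closing quote, so the space before bar ends up inside group 1 of that match and disappears when only group 2 is kept.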

vQ





______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
