strapply in package gsubfn brings elegance here:

> txt <- '<foo>bar</foo>'
> rx <- "<(.*?)>(.*?)</(.*?)>"
> strapply( txt, rx, c , perl = TRUE )
[[1]]
[1] "foo" "bar" "foo"

Too bad you pay for it in performance:

> txt <- rep( '<foo>bar</foo>', 1000 )
> rx <- "<(.*?)>(.*?)</(.*?)>"
> system.time( out <- strapply( txt, rx, c , perl = TRUE ) )
  user  system elapsed
 2.923   0.005   3.063
> system.time( out2 <- sapply( paste('\\', 1:3, sep=''), function(x){
+ gsub(rx, x, txt, perl=TRUE)
+ } ) )
  user  system elapsed
 0.011   0.000   0.011

Not sure what the right approach is.
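
One more base-R option that might be worth timing, offered only as a sketch: regexec() with regmatches() returns all capture groups per element in a single pass. This assumes an R version whose regexec() accepts perl = TRUE; the name out3 is purely illustrative, and no timings are claimed here.

```r
txt <- rep( '<foo>bar</foo>', 1000 )
rx <- "<(.*?)>(.*?)</(.*?)>"

# regexec() records the full match plus each capture group, per element
m <- regmatches( txt, regexec( rx, txt, perl = TRUE ) )

# drop the full match (first entry) and keep the three groups
out3 <- lapply( m, function(x) x[-1] )
out3[[1]]
```

The result has the same list-of-character-vectors shape as the strapply output, so it could be dropped in where that shape is wanted.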


Wacek Kusnierczyk wrote:
Romain Francois wrote:
   txt <- grep( '^<tr.*<td align=right.*<a', readLines( url ), value = TRUE )
   rx <- '^.*?<a href="(.*?)">(.*?)</a>.*<td>(.*?)</td>.*$'
   out <- data.frame(
       url = gsub( rx, "\\1", txt ),
       group = gsub( rx, "\\2", txt ),
       description = gsub( rx, "\\3", txt ),

looking at this bit of your code, i wonder why gsub is not vectorized
for the pattern and replacement arguments, although it is for the x
argument.  the three lines above could be collapsed to just one with a
vectorized gsub:

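    gsubm = function(pattern, replacement, x, ...)
       # vectorize over pattern and replacement only; x is passed whole via
       # MoreArgs, so a vector x is not paired element-wise with replacement
       mapply(gsub, pattern=pattern, replacement=replacement,
           MoreArgs=list(x=x, ...), USE.NAMES=FALSE, SIMPLIFY=FALSE)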

for example, given the sample data

    txt = '<foo>foo</foo><bar>bar</bar>'
    rx = '<(.*?)>(.*?)</(.*?)>'

the sequence

    open = gsub(rx, '\\1', txt, perl=TRUE)
    content = gsub(rx, '\\2', txt, perl=TRUE)
    close = gsub(rx, '\\3', txt, perl=TRUE)

    print(list(open, content, close))

could be replaced with

    data = structure(names=c('open', 'content', 'close'),
        gsubm(rx, paste('\\', 1:3, sep=''), txt, perl=TRUE))

    print(data)

surely, a call to mapply does not improve performance, but a
source-level fix should not be too difficult;  unfortunately, i can't
find myself willing to struggle with r sources right now.
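
As a rough sketch of how such a helper could feed the data.frame from the quoted code: the version below loops over the replacements and passes x whole each time. The name gsub_each and the two sample lines are hypothetical, made up only for illustration; they are not the real page contents.

```r
# hypothetical helper: one gsub call per replacement, x passed whole
gsub_each = function(pattern, replacement, x, ...)
    lapply(replacement, function(r) gsub(pattern, r, x, ...))

# made-up sample lines standing in for the grep'ed html rows
txt = c('<a href="u1">g1</a> <td>d1</td>',
        '<a href="u2">g2</a> <td>d2</td>')
rx = '^.*?<a href="(.*?)">(.*?)</a>.*<td>(.*?)</td>.*$'

# one column per backreference, then name and bind them
cols = gsub_each(rx, paste('\\', 1:3, sep=''), txt, perl=TRUE)
names(cols) = c('url', 'group', 'description')
out = as.data.frame(cols, stringsAsFactors=FALSE)
print(out)
```

This still calls gsub once per column, so it is only a source-level convenience, not a performance fix.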


note also that .*? does not work as a non-greedy .* with the default
regex engine, e.g.,

    txt = "foo='FOO' bar='BAR'"
    gsub("(.*?)='(.*?)'", '\\1', txt)
    # "foo='FOO' bar"
    gsub("(.*?)='(.*?)'", '\\2', txt)
    # "BAR"

because the first .*? matches everything up to and exclusive of the
second, *not* the first, '='.  for a non-greedy match, you'd need pcre
(and using pcre generally improves performance anyway):

    txt = "foo='FOO' bar='BAR'"
    gsub("(.*?)='(.*?)'", '\\1', txt, perl=TRUE)
    # "foo bar"
    gsub("(.*?)='(.*?)'", '\\2', txt, perl=TRUE)
    # "FOOBAR"  (the second match starts at the space, which \\1 absorbs)
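
To isolate the effect of the ? itself, here is a small sketch that runs both the greedy and the non-greedy pattern under pcre on the same input; the variable names are only illustrative.

```r
txt = "foo='FOO' bar='BAR'"

# greedy: the first group runs up to the last =' in the string
greedy1 = gsub("(.*)='(.*)'", '\\1', txt, perl=TRUE)
greedy2 = gsub("(.*)='(.*)'", '\\2', txt, perl=TRUE)

# non-greedy: each group stops at the first place the rest can match,
# so the pattern matches twice and both matches get replaced
lazy1 = gsub("(.*?)='(.*?)'", '\\1', txt, perl=TRUE)
lazy2 = gsub("(.*?)='(.*?)'", '\\2', txt, perl=TRUE)

print(c(greedy1, greedy2, lazy1, lazy2))
```

The greedy pair yields a single match spanning the whole string, while the non-greedy pair yields two matches, the second of which begins at the space before bar.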

vQ




--
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
