Romain Francois wrote:
>
> txt <- grep( '^<tr.*<td align=right.*<a', readLines( url ), value = TRUE )
> rx <- '^.*?<a href="(.*?)">(.*?)</a>.*<td>(.*?)</td>.*$'
> out <- data.frame(
>   url = gsub( rx, "\\1", txt ),
>   group = gsub( rx, "\\2", txt ),
>   description = gsub( rx, "\\3", txt ),

looking at this bit of your code, i wonder why gsub is not vectorized over the pattern and replacement arguments, although it is over the x argument. the three lines above could be collapsed into just one with a vectorized gsub:

    # note: mapply iterates over x in parallel with pattern and replacement,
    # so as written this assumes a length-one x (as in the example below)
    gsubm = function(pattern, replacement, x, ...)
       mapply(USE.NAMES=FALSE, SIMPLIFY=FALSE,
          gsub, pattern=pattern, replacement=replacement, x=x, ...)

for example, given the sample data

    txt = '<foo>foo</foo><bar>bar</bar>'
    rx = '<(.*?)>(.*?)</(.*?)>'

the sequence

    open = gsub(rx, '\\1', txt, perl=TRUE)
    content = gsub(rx, '\\2', txt, perl=TRUE)
    close = gsub(rx, '\\3', txt, perl=TRUE)
    print(list(open, content, close))

could be replaced with

    data = structure(names=c('open', 'content', 'close'),
       gsubm(rx, paste('\\', 1:3, sep=''), txt, perl=TRUE))
    print(data)

surely, a call to mapply does not improve performance, but a source-level fix should not be too difficult; unfortunately, i can't find myself willing to struggle with the r sources right now.

note also that .*? does not work as a non-greedy .* with the default regex engine, e.g.,

    txt = "foo='FOO' bar='BAR'"
    gsub("(.*?)='(.*?)'", '\\1', txt)
    # "foo='FOO' bar"
    gsub("(.*?)='(.*?)'", '\\2', txt)
    # "BAR"

because the first .*? matches everything up to, but not including, the second '=', *not* the first. for a non-greedy match, you'd need pcre (and using pcre generally improves performance anyway):

    txt = "foo='FOO' bar='BAR'"
    gsub("(.*?)='(.*?)'", '\\1', txt, perl=TRUE)
    # "foo bar"
    gsub("(.*?)='(.*?)'", '\\2', txt, perl=TRUE)
    # "FOO BAR"

vQ

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
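
For the original use case, where txt is a vector of many lines rather than a single string, a wrapper along the lines of the one proposed above would probably need to keep x whole in every call. The following is only a minimal sketch under that assumption: the name gsub_each and the two-element sample vector are hypothetical, and it relies on the MoreArgs argument of mapply, which passes its contents unchanged to each call instead of iterating over them.

```r
# Hypothetical helper (gsub_each is not part of base R): call gsub() once per
# pattern/replacement pair, passing the whole vector x to every call.
gsub_each <- function(pattern, replacement, x, ...) {
  mapply(gsub, pattern = pattern, replacement = replacement,
         MoreArgs = list(x = x, ...),   # x and extra args are not iterated over
         SIMPLIFY = FALSE, USE.NAMES = FALSE)
}

# Illustrative sample data (an assumption, not taken from the original page)
txt <- c("<a>one</b>", "<c>two</d>")
rx  <- "<(.*?)>(.*?)</(.*?)>"

# One call yields all three capture groups for every element of txt
fields <- gsub_each(rx, paste0("\\", 1:3), txt, perl = TRUE)
names(fields) <- c("open", "content", "close")
out <- as.data.frame(fields, stringsAsFactors = FALSE)
print(out)
```

If x were instead listed among the iterated arguments of mapply, x[i] would be paired with replacement[i], which only coincides with the intended result when x has length one.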