Bill, Michael,

good to see I'm not the only one who sees potential for improvements
in the regexpr domain.  Adding a subpattern argument is certainly a
step in the right direction and would make my life much easier.
However, in my application I need to know not only the position of one
group but also the position of the overall match in the original
string.  The ideal solution would provide positions and match lengths
for the whole pattern and for all groups if desired.  Only this would
solve all related issues.  One possibility is to have a subpattern
argument that accepts a vector of numbers (0 refers to the whole
pattern):

  > gregexpr("a+(b+)", "abcdaabbc", subpattern=c(0,1))
 [[1]]:
 [[1]][[1]]:
 [1] 1 5
 attr(, "match.length"):
 [1] 2 4
 [[1]][[2]]:
 [1] 2 7
 attr(, "match.length"):
 [1] 1 2

A weakness of this solution is that the structure of the return values
changes if length(subpattern)>1.  An alternative is to have a separate
function, say ggregexpr for group gregexpr, that returns a list of
lists as in the above example.  This function would always return
positions and match lengths of the whole pattern (group 0) and all
groups.  The original gregexpr could still have the subpattern
argument but it would only accept single numbers.  This way the return
format of gregexpr remains the same.
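
To make the second alternative concrete, here is a rough sketch of what
such a function could look like at the R level.  The name ggregexpr is
only the hypothetical one suggested above, and the sketch assumes an R
version in which gregexpr(perl = TRUE) reports "capture.start" and
"capture.length" attributes; it illustrates the return format and is
not a proposed implementation.

```r
# Hypothetical helper named after the proposal above; not part of base R.
# Assumption: gregexpr(perl = TRUE) attaches "capture.start" and
# "capture.length" matrices (one row per match, one column per group).
ggregexpr <- function(pattern, text) {
  lapply(gregexpr(pattern, text, perl = TRUE), function(m) {
    # Group 0: start positions and lengths of the whole pattern.
    whole <- as.integer(m)
    attr(whole, "match.length") <- as.integer(attr(m, "match.length"))
    result <- list(whole)

    starts  <- attr(m, "capture.start")
    lengths <- attr(m, "capture.length")

    # One further entry per capture group, in the same format.
    if (!is.null(starts)) {
      for (g in seq_len(ncol(starts))) {
        pos <- as.integer(starts[, g])
        attr(pos, "match.length") <- as.integer(lengths[, g])
        result[[g + 1]] <- pos
      }
    }
    result
  })
}

res <- ggregexpr("a+(b+)", "abcdaabbc")
```

Here res[[1]][[1]] holds the whole-pattern matches and res[[1]][[2]]
the matches of group 1, so the list-of-lists layout is the same
whether the pattern has one group or several.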

Best,

  Titus


On Wed, Sep 29, 2010 at 2:42 AM, Michael Bedward
<michael.bedw...@gmail.com> wrote:
> Ah, that's interesting - thanks Bill. That's certainly on the right
> track for me (Titus, you too ?) especially if the subpattern argument
> accepted a vector of multiple group indices.
>
> As you say, this is straightforward in C. I'd be happy to (try to)
> make a patch for the R sources if there was some consensus on the best
> way to implement it, i.e. as a new R function or by extending existing
> function(s).

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
