Re: [R] Regular expressions: offsets of groups

2010-09-30 Thread Titus von der Malsburg
Ok, we decided to have a shot at modifying gregexpr.  Let's see how it
works out.  If anybody is interested in discussing this please contact
me.  R-help doesn't seem like the right place for further discussion.
Is there a default place for discussing things like that?

Thanks everybody for your responses!

  Titus

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Regular expressions: offsets of groups

2010-09-29 Thread Titus von der Malsburg
On Wed, Sep 29, 2010 at 1:58 PM, Michael Bedward
 wrote:
> How is your C coding ? Bill ? Anyone else ?  I could have a go at
> writing some prototype code to test in the next few days, though if
> someone else with decent C skills is itching to do it please speak up.

We have a skilled C- and R-programmer who could work on it. I'll talk to him.

   Titus



Re: [R] Regular expressions: offsets of groups

2010-09-29 Thread Michael Bedward
I'd definitely be a customer for it Titus. And it does seem like an
obvious hole in regex processing in R that cries out to be filled.

Um, ggregexpr isn't the sexiest of function names :)  Perhaps we can
think of something a little easier ?

How is your C coding ? Bill ? Anyone else ?  I could have a go at
writing some prototype code to test in the next few days, though if
someone else with decent C skills is itching to do it please speak up.

Michael

On 29 September 2010 20:08, Titus von der Malsburg  wrote:
> Bill, Michael,
>
> good to see I'm not the only one who sees potential for improvements
> in the regexpr domain.  Adding a subpattern argument is certainly a
> step in the right direction and would make my life much easier.
> However, in my application I need to know not only the position of one
> group but also the position of the overall match in the original
> string.  The ideal solution would provide positions and match lengths
> for the whole pattern and for all groups if desired.  Only this would
> solve all related issues.  One possibility is to have a subpattern
> argument that accepts a vector of numbers (0 refers to the whole
> pattern):
>
>  > gregexpr("a+(b+)", "abcdaabbc", subpattern=c(0,1))
>  [[1]]:
>  [[1]][[1]]:
>  [1] 1 5
>  attr(, "match.length"):
>  [1] 2 4
>  [[1]][[2]]:
>  [1] 2 7
>  attr(, "match.length"):
>  [1] 1 2
>
> A weakness of this solution is that the structure of the return values
> changes if length(subpattern)>1.  An alternative is to have a separate
> function, say ggregexpr for group gregexpr, that returns a list of
> lists as in the above example.  This function would always return
> positions and match lengths of the whole pattern (group 0) and all
> groups.  The original gregexpr could still have the subpattern
> argument but it would only accept single numbers.  This way the return
> format of gregexpr remains the same.
>
> Best,
>
>  Titus
>
>
> On Wed, Sep 29, 2010 at 2:42 AM, Michael Bedward
>  wrote:
>> Ah, that's interesting - thanks Bill. That's certainly on the right
>> track for me (Titus, you too ?) especially if the subpattern argument
>> accepted a vector of multiple group indices.
>>
>> As you say, this is straightforward in C. I'd be happy to (try to)
>> make a patch for the R sources if there was some consensus on the best
>> way to implement it, ie. as a new R function or by extending existing
>> function(s).
>



Re: [R] Regular expressions: offsets of groups

2010-09-29 Thread Titus von der Malsburg
Bill, Michael,

good to see I'm not the only one who sees potential for improvements
in the regexpr domain.  Adding a subpattern argument is certainly a
step in the right direction and would make my life much easier.
However, in my application I need to know not only the position of one
group but also the position of the overall match in the original
string.  The ideal solution would provide positions and match lengths
for the whole pattern and for all groups if desired.  Only this would
solve all related issues.  One possibility is to have a subpattern
argument that accepts a vector of numbers (0 refers to the whole
pattern):

  > gregexpr("a+(b+)", "abcdaabbc", subpattern=c(0,1))
 [[1]]:
 [[1]][[1]]:
 [1] 1 5
 attr(, "match.length"):
 [1] 2 4
 [[1]][[2]]:
 [1] 2 7
 attr(, "match.length"):
 [1] 1 2

A weakness of this solution is that the structure of the return values
changes if length(subpattern)>1.  An alternative is to have a separate
function, say ggregexpr for group gregexpr, that returns a list of
lists as in the above example.  This function would always return
positions and match lengths of the whole pattern (group 0) and all
groups.  The original gregexpr could still have the subpattern
argument but it would only accept single numbers.  This way the return
format of gregexpr remains the same.
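
A minimal sketch of what such a function might look like, assuming a
version of R in which gregexpr(..., perl = TRUE) attaches
"capture.start" and "capture.length" attributes to its result.  The
name ggregexpr and the shape of the return value are hypothetical and
only follow the proposal above; this is not an existing R function.

```r
# Hypothetical helper: for each input string, return a list with one
# element per group (element 1 = whole pattern, i.e. group 0).
ggregexpr <- function(pattern, text) {
  lapply(gregexpr(pattern, text, perl = TRUE), function(m) {
    # Column 1 holds the whole match, the remaining columns the groups.
    starts  <- cbind(as.vector(m), attr(m, "capture.start"))
    lengths <- cbind(as.vector(attr(m, "match.length")),
                     attr(m, "capture.length"))
    lapply(seq_len(ncol(starts)), function(g) {
      list(start        = unname(starts[, g]),
           match.length = unname(lengths[, g]))
    })
  })
}

res <- ggregexpr("a+(b+)", "abcdaabbc")
res[[1]][[1]]  # whole pattern: starts 1 5, lengths 2 4
res[[1]][[2]]  # group 1:       starts 2 7, lengths 1 2
```

Because the return value is always a list of lists, the format of
gregexpr itself stays untouched.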

Best,

  Titus


On Wed, Sep 29, 2010 at 2:42 AM, Michael Bedward
 wrote:
> Ah, that's interesting - thanks Bill. That's certainly on the right
> track for me (Titus, you too ?) especially if the subpattern argument
> accepted a vector of multiple group indices.
>
> As you say, this is straightforward in C. I'd be happy to (try to)
> make a patch for the R sources if there was some consensus on the best
> way to implement it, ie. as a new R function or by extending existing
> function(s).



Re: [R] Regular expressions: offsets of groups

2010-09-28 Thread Michael Bedward
Ah, that's interesting - thanks Bill. That's certainly on the right
track for me (Titus, you too ?) especially if the subpattern argument
accepted a vector of multiple group indices.

As you say, this is straightforward in C. I'd be happy to (try to)
make a patch for the R sources if there was some consensus on the best
way to implement it, ie. as a new R function or by extending existing
function(s).

Michael

On 29 September 2010 01:46, William Dunlap wrote:
>
> S+ has a subpattern=number argument to regexpr and
> related functions.  It means that the text matched
> by the subpattern'th parenthesized expression in the
> pattern will be considered the matched text.  E.g.,
> to find runs of b's that come immediately after a's:
>
>  > gregexpr("a+(b+)", "abcdaabbc", subpattern=1)
>  [[1]]:
>  [1] 2 7
>  attr(, "match.length"):
>  [1] 1 2
>
> or to find bc's that come after 2 or more ab's
>  > gregexpr("(ab){2,}bc", "abbcabababbcabcababbc", subpattern=1)
>
> regexpr() and strsplit() have this argument in S+ 8.1 but
> gregexpr() is not yet in a released version of S+.
>
> subpattern=0, the default, means to use the entire
> pattern.  regexpr allows subpattern=-1, which means
> to return a list with one element for each subpattern.
> I don't know if the extra complexity is worth it.
> (gregexpr does not allow subpattern=-1.)
>
> The usual C regexec() returns this information.
> Perhaps it would be handy to have it in R.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>



Re: [R] Regular expressions: offsets of groups

2010-09-28 Thread William Dunlap

> -Original Message-
> From: r-help-boun...@r-project.org 
> [mailto:r-help-boun...@r-project.org] On Behalf Of Michael Bedward
> Sent: Tuesday, September 28, 2010 12:46 AM
> To: Titus von der Malsburg
> Cc: r-help@r-project.org
> Subject: Re: [R] Regular expressions: offsets of groups
> 
> What Titus wants to do is akin to retrieving capturing groups from a
> Matcher object in Java. I also thought there must be an existing,
> elegant solution to this some time ago and searched for it, including
> looking at the sources (albeit with not much expertise) but came up
> blank.
> 
> I also looked at the stringr package (which is nice) but it doesn't
> quite do it either.

S+ has a subpattern=number argument to regexpr and
related functions.  It means that the text matched
by the subpattern'th parenthesized expression in the
pattern will be considered the matched text.  E.g.,
to find runs of b's that come immediately after a's:

  > gregexpr("a+(b+)", "abcdaabbc", subpattern=1)
  [[1]]:
  [1] 2 7
  attr(, "match.length"):
  [1] 1 2

or to find bc's that come after 2 or more ab's
  > gregexpr("(ab){2,}bc", "abbcabababbcabcababbc", subpattern=1)

regexpr() and strsplit() have this argument in S+ 8.1 but
gregexpr() is not yet in a released version of S+.

subpattern=0, the default, means to use the entire
pattern.  regexpr allows subpattern=-1, which means
to return a list with one element for each subpattern.
I don't know if the extra complexity is worth it.
(gregexpr does not allow subpattern=-1.)

The usual C regexec() returns this information.
Perhaps it would be handy to have it in R.
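
For comparison, a small sketch that assumes a version of R that ships
its own regexec() function in base.  It reports the first match only,
with the whole match first and each parenthesized group after it; the
variable names are just for illustration.

```r
# regexec() returns a list with one element per input string.
m <- regexec("a+(b+)", "abcdaabbc")

# Element 1 is the whole match, element 2 is group 1.
starts  <- as.vector(m[[1]])
lengths <- as.vector(attr(m[[1]], "match.length"))

starts   # 1 2
lengths  # 2 1
```

Note that this covers only the first match in the string, so it is
closer to regexpr with subpattern=-1 than to gregexpr.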

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> 
> Michael
> 
> On 28 September 2010 01:48, Titus von der Malsburg 
>  wrote:
> > Dear list!
> >
> >> gregexpr("a+(b+)", "abcdaabbc")
> > [[1]]
> > [1] 1 5
> > attr(,"match.length")
> > [1] 2 4
> >
> > What I want is the offsets of the matches for the group (b+), i.e. 2
> > and 7, not the offsets of the complete matches.  Is there a way in R
> > to get that?
> >
> > I know about gsubfn and strapply, but they only give me the strings
> > matched by groups not their offsets.
> >
> > I could write something myself that first takes the above matches
> > ("ab" and "aabb") and then searches again using only the group (b+).
> > For this to work, I'd have to parse the regular expression and search
> > several times (> 2, for nested groups) instead of just once.  But I'm
> > sure there is a better way to do this.
> >
> > Thanks for any suggestion!
> >
> >   Titus
> >
> >
> 
> 



Re: [R] Regular expressions: offsets of groups

2010-09-28 Thread Gabor Grothendieck
On Tue, Sep 28, 2010 at 6:52 AM, Titus von der Malsburg
 wrote:
> On Tue, Sep 28, 2010 at 9:46 AM, Michael Bedward
>  wrote:
>> What Titus wants to do is akin to retrieving capturing groups from a
>> Matcher object in Java.
>
> Precisely.  Here's the description:
>
>  http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#start(int)
>
> Gabor's lookbehind trick solves some special cases but it's not the

The only limitation is that in the regular expressions supported by R
you cannot have repetition in the (?<=...) portion, but none of your
examples -- neither the one you gave nor the one below -- requires that,
since if the prior expression ends in X+ you can just use X.  Are
you sure it does not cover all your actual situations?

If you truly do have situations that require repetition, a
gregexpr plus gsubfn will do it in one line.   Parenthesize the
portion of the regular expression you want to capture and replace
every character in it with X (or some other character that does not
otherwise occur).  Then find the positions and lengths of strings of
X.

> gregexpr("X+", gsubfn("a(b+)", ~ gsub(".", "X", x), "abcdaabbcbbb"))
[[1]]
[1] 1 5
attr(,"match.length")
[1] 1 2
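
A base-R variant of the same masking idea, as a sketch only.  It
assumes an R version that has regmatches<- and strrep, and it leaves
the leading a in place so that the offsets of the X runs refer to
positions in the original string.  The variable names are made up for
the example, and X must not otherwise occur in the string.

```r
s <- "abcdaabbcbbb"

# Find each "a followed by b's" and overwrite only the b's with X.
m <- gregexpr("ab+", s)
masked <- s
regmatches(masked, m) <- lapply(regmatches(s, m), function(x) {
  paste0("a", strrep("X", nchar(x) - 1))
})

masked  # "aXcdaaXXcbbb"

# The runs of X now mark the groups, at their original offsets.
hits <- gregexpr("X+", masked)[[1]]
as.vector(hits)                        # 2 7
as.vector(attr(hits, "match.length"))  # 1 2
```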

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Regular expressions: offsets of groups

2010-09-28 Thread Titus von der Malsburg
On Tue, Sep 28, 2010 at 9:46 AM, Michael Bedward
 wrote:
> What Titus wants to do is akin to retrieving capturing groups from a
> Matcher object in Java.

Precisely.  Here's the description:

  
http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#start(int)

Gabor's lookbehind trick solves some special cases but it's not the
kind of general solution I'm looking for.  Let me explain what I'm
trying to achieve here.  I'm working on a package that provides tools
for processing and analyzing eye movements (we're doing reading
research).  In most situations, eye movements consist of fixations
where the eyes are relatively stationary and saccades, quick movements
between fixations.  A common way to represent eye movements is as
strings of symbols, where each symbol corresponds to a fixation on a
particular region.  AABC means two fixations followed by a fixation on
B and then C.  When people analyze eye movements it's often necessary
to find specific events in the eye movement record like: fixations on
the word C preceded by fixations on words D-F and followed by
fixations on words A-C.  This event can be specified using this
regexpr: "[D-F]+(C)[A-C]+".  The group (in parentheses) indicates the
substring for which I'd like to know the position in the overall
string.  Another application is the extraction of subsequences from a
sequence of fixations.  Note that in some situations people might have
to use more groups in their regexprs and that groups can be nested.
In this case the user would have to indicate for which group he/she
wants to know the offset.  I'm not an expert on regexpr engines but
I'm pretty sure the necessary information is available in the engine.
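
To make the eye movement example concrete, here is a small sketch.  It
assumes a version of R in which regexpr(..., perl = TRUE) reports the
group offsets in a "capture.start" attribute; the fixation string and
the variable names are invented for illustration.

```r
# One symbol per fixation; the event is a fixation on C preceded by
# fixations on D-F and followed by fixations on A-C.
fixations <- "ABDEFCABCC"
m <- regexpr("[D-F]+(C)[A-C]+", fixations, perl = TRUE)

event_start  <- as.vector(m)                          # whole event
group_start  <- as.vector(attr(m, "capture.start"))   # the (C) group
group_length <- as.vector(attr(m, "capture.length"))

event_start                                      # 3
group_start                                      # 6
substring(fixations, group_start, group_start)   # "C"
```

With more than one group in the pattern, "capture.start" has one
column per group, so the user would pick the column for the group of
interest.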

Gabor, I see you're the author of gsubfn (fantastic package!).  Do you
see a relatively simple way to expose information about group offsets
and their corresponding match lengths?  I think this could be useful
for other applications as well.  At least it seems Michael could use
it, too.  We can cook up something for ourselves but a general
solution would benefit the larger community.

   Titus



Re: [R] Regular expressions: offsets of groups

2010-09-28 Thread Michael Bedward
What Titus wants to do is akin to retrieving capturing groups from a
Matcher object in Java. I also thought there must be an existing,
elegant solution to this some time ago and searched for it, including
looking at the sources (albeit with not much expertise) but came up
blank.

I also looked at the stringr package (which is nice) but it doesn't
quite do it either.

Michael

On 28 September 2010 01:48, Titus von der Malsburg  wrote:
> Dear list!
>
>> gregexpr("a+(b+)", "abcdaabbc")
> [[1]]
> [1] 1 5
> attr(,"match.length")
> [1] 2 4
>
> What I want is the offsets of the matches for the group (b+), i.e. 2
> and 7, not the offsets of the complete matches.  Is there a way in R
> to get that?
>
> I know about gsubfn and strapply, but they only give me the strings
> matched by groups not their offsets.
>
> I could write something myself that first takes the above matches
> ("ab" and "aabb") and then searches again using only the group (b+).
> For this to work, I'd have to parse the regular expression and search
> several times (> 2, for nested groups) instead of just once.  But I'm
> sure there is a better way to do this.
>
> Thanks for any suggestion!
>
>   Titus
>
>



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Gabor Grothendieck
On Mon, Sep 27, 2010 at 1:34 PM, Titus von der Malsburg
 wrote:
> On Mon, Sep 27, 2010 at 7:29 PM, Gabor Grothendieck
>  wrote:
>> Try this zero width negative look behind expression:
>>
>>> gregexpr("(?!a+)(b+)", "abcdaabbc", perl = TRUE)
>> [[1]]
>> [1] 2 7
>> attr(,"match.length")
>> [1] 1 2
>
> Thanks Gabor, but this gives me the same result as
>
>  gregexpr("b+", "abcdaabbc", perl = TRUE)
>
> which is wrong if the string is "abcdaabbcbbb".
>

Sorry, try this:

>  gregexpr("(?<=a)b+", "abcdaabbcbbb", perl = TRUE)
[[1]]
[1] 2 7
attr(,"match.length")
[1] 1 2

Note that it does not give the same answer as:

>  gregexpr("b+", "abcdaabbcbbb", perl = TRUE)
[[1]]
[1]  2  7 10
attr(,"match.length")
[1] 1 2 3






-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Henrique Dallazuanna
Have you tried:

gregexpr("b+", "abcdaabbc")


On Mon, Sep 27, 2010 at 12:48 PM, Titus von der Malsburg  wrote:

> Dear list!
>
> > gregexpr("a+(b+)", "abcdaabbc")
> [[1]]
> [1] 1 5
> attr(,"match.length")
> [1] 2 4
>
> What I want is the offsets of the matches for the group (b+), i.e. 2
> and 7, not the offsets of the complete matches.  Is there a way in R
> to get that?
>
> I know about gsubfn and strapply, but they only give me the strings
> matched by groups not their offsets.
>
> I could write something myself that first takes the above matches
> ("ab" and "aabb") and then searches again using only the group (b+).
> For this to work, I'd have to parse the regular expression and search
> several times (> 2, for nested groups) instead of just once.  But I'm
> sure there is a better way to do this.
>
> Thanks for any suggestion!
>
>   Titus
>
>



-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O




Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Henrique Dallazuanna
You could do this:

gregexpr("ab+", "abcdaabbcbb")[[1]] + 1

On Mon, Sep 27, 2010 at 2:25 PM, Titus von der Malsburg
wrote:

> On Mon, Sep 27, 2010 at 7:16 PM, Henrique Dallazuanna 
> wrote:
> > Have you tried:
> >
> > gregexpr("b+", "abcdaabbc")
>
> But this would match the third occurrence of b+ in "abcdaabbcbb".  But
> in this example I'm only interested in b+ if it's preceded by a+.
>
>  Titus
>



-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O




Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Titus von der Malsburg
On Mon, Sep 27, 2010 at 7:29 PM, Gabor Grothendieck
 wrote:
> Try this zero width negative look behind expression:
>
>> gregexpr("(?!a+)(b+)", "abcdaabbc", perl = TRUE)
> [[1]]
> [1] 2 7
> attr(,"match.length")
> [1] 1 2

Thanks Gabor, but this gives me the same result as

  gregexpr("b+", "abcdaabbc", perl = TRUE)

which is wrong if the string is "abcdaabbcbbb".

  Titus



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Gabor Grothendieck
On Mon, Sep 27, 2010 at 11:48 AM, Titus von der Malsburg
 wrote:
> Dear list!
>
>> gregexpr("a+(b+)", "abcdaabbc")
> [[1]]
> [1] 1 5
> attr(,"match.length")
> [1] 2 4
>
> What I want is the offsets of the matches for the group (b+), i.e. 2
> and 7, not the offsets of the complete matches.  Is there a way in R
> to get that?
>
> I know about gsubfn and strapply, but they only give me the strings
> matched by groups not their offsets.
>
> I could write something myself that first takes the above matches
> ("ab" and "aabb") and then searches again using only the group (b+).
> For this to work, I'd have to parse the regular expression and search
> several times (> 2, for nested groups) instead of just once.  But I'm
> sure there is a better way to do this.
>

Try this zero width negative look behind expression:

> gregexpr("(?!a+)(b+)", "abcdaabbc", perl = TRUE)
[[1]]
[1] 2 7
attr(,"match.length")
[1] 1 2

See ?regexp for more info.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Titus von der Malsburg
On Mon, Sep 27, 2010 at 7:16 PM, Henrique Dallazuanna  wrote:
> Have you tried:
>
> gregexpr("b+", "abcdaabbc")

But this would match the third occurrence of b+ in "abcdaabbcbb".  But
in this example I'm only interested in b+ if it's preceded by a+.

  Titus



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Titus von der Malsburg
Thank you Jim, but just as the solution that I discussed, your
proposal involves deconstructing the pattern and searching several
times.  I'm looking for a general and efficient solution.  Internally,
the regexpr engine has all necessary information after one pass
through the string.  What I need is an interface that exposes this
information.

  Titus

On Mon, Sep 27, 2010 at 6:43 PM, jim holtman  wrote:
> try this:
>
>> x <-  gregexpr("a+(b+)", "abcdaabbcaaacaaab")
>> justA <-  gregexpr("a+", "abcdaabbcaaacaaab")
>> # find matches in 'x' for 'justA'
>> indx <- which(justA[[1]] %in% x[[1]])
>> # now determine where 'b' starts
>> justA[[1]][indx] + attr(justA[[1]], 'match.length')[indx]
> [1]  2  7 17
>>
>
>
> On Mon, Sep 27, 2010 at 11:48 AM, Titus von der Malsburg
>  wrote:
>> Dear list!
>>
>>> gregexpr("a+(b+)", "abcdaabbc")
>> [[1]]
>> [1] 1 5
>> attr(,"match.length")
>> [1] 2 4
>>
>> What I want is the offsets of the matches for the group (b+), i.e. 2
>> and 7, not the offsets of the complete matches.  Is there a way in R
>> to get that?
>>
>> I know about gsubfn and strapply, but they only give me the strings
>> matched by groups not their offsets.
>>
>> I could write something myself that first takes the above matches
>> ("ab" and "aabb") and then searches again using only the group (b+).
>> For this to work, I'd have to parse the regular expression and search
>> several times (> 2, for nested groups) instead of just once.  But I'm
>> sure there is a better way to do this.
>>
>> Thanks for any suggestion!
>>
>>   Titus
>>
>>
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
>



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread jim holtman
try this:

> x <-  gregexpr("a+(b+)", "abcdaabbcaaacaaab")
> justA <-  gregexpr("a+", "abcdaabbcaaacaaab")
> # find matches in 'x' for 'justA'
> indx <- which(justA[[1]] %in% x[[1]])
> # now determine where 'b' starts
> justA[[1]][indx] + attr(justA[[1]], 'match.length')[indx]
[1]  2  7 17
>


On Mon, Sep 27, 2010 at 11:48 AM, Titus von der Malsburg
 wrote:
> Dear list!
>
>> gregexpr("a+(b+)", "abcdaabbc")
> [[1]]
> [1] 1 5
> attr(,"match.length")
> [1] 2 4
>
> What I want is the offsets of the matches for the group (b+), i.e. 2
> and 7, not the offsets of the complete matches.  Is there a way in R
> to get that?
>
> I know about gsubfn and strapply, but they only give me the strings
> matched by groups not their offsets.
>
> I could write something myself that first takes the above matches
> ("ab" and "aabb") and then searches again using only the group (b+).
> For this to work, I'd have to parse the regular expression and search
> several times (> 2, for nested groups) instead of just once.  But I'm
> sure there is a better way to do this.
>
> Thanks for any suggestion!
>
>   Titus
>
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



[R] Regular expressions: offsets of groups

2010-09-27 Thread Titus von der Malsburg
Dear list!

> gregexpr("a+(b+)", "abcdaabbc")
[[1]]
[1] 1 5
attr(,"match.length")
[1] 2 4

What I want is the offsets of the matches for the group (b+), i.e. 2
and 7, not the offsets of the complete matches.  Is there a way in R
to get that?

I know about gsubfn and strapply, but they only give me the strings
matched by groups not their offsets.

I could write something myself that first takes the above matches
("ab" and "aabb") and then searches again using only the group (b+).
For this to work, I'd have to parse the regular expression and search
several times (> 2, for nested groups) instead of just once.  But I'm
sure there is a better way to do this.
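
The two-step workaround described above might look roughly like this
for the simple, non-nested case.  It is a sketch only: it assumes an R
version that provides regmatches(), the inner pattern "b+" is written
out by hand rather than parsed from the original expression, and the
variable names are made up.

```r
s <- "abcdaabbc"

# Step 1: find the complete matches and extract their text.
m <- gregexpr("a+(b+)", s)
matched <- regmatches(s, m)[[1]]   # "ab" "aabb"

# Step 2: search each matched piece again for the group alone.
inner <- regexpr("b+", matched)

# Shift the inner offsets back into the original string.
group_start  <- as.vector(m[[1]]) + as.vector(inner) - 1L
group_length <- as.vector(attr(inner, "match.length"))

group_start   # 2 7
group_length  # 1 2
```

This only works when the inner pattern can be searched on its own and
finds the group at the right place, which is why nested groups would
need further passes.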

Thanks for any suggestion!

   Titus
