Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Say I have some string that begins with an arbitrary sequence of characters 
and then alternates repeating the letters 'a' and 'b' any number of times, 
e.g.

xyz123aaabbaaabaaaabb

I'm looking for a regular expression that matches the first, and only the 
first, sequence of the letter 'a', and only if the length of the sequence is 
exactly 3.

Does such a regular expression exist?  If so, any ideas as to what it could 
be?

-- 
Roger L. Cauvin
[EMAIL PROTECTED] (omit the nospam_ part)
Cauvin, Inc.
Product Management / Market Research
http://www.cauvin-inc.com 


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Christoph Conrad [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 Hello Roger,

 I'm looking for a regular expression that matches the first, and only
 the first, sequence of the letter 'a', and only if the length of the
 sequence is exactly 3.

 import sys, re, os

 if __name__ == '__main__':
     m = re.search('a{3}', 'xyz123aaabbaaaabaaabb')
     print(m.group(0))
     print('Preceded by: "' + m.string[0:m.start(0)] + '"')

The correct pattern should reject the string:

'xyz123aabbaaab'

since the length of the first sequence of the letter 'a' is 2.  Yours 
accepts it, right?
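
The objection can be checked directly. A minimal sketch, written for Python 3 (which the thread predates); the variable names are illustrative only:

```python
import re

# Christoph's pattern: any run of three consecutive 'a' characters.
pattern = re.compile(r"a{3}")

# The first run of 'a' here is "aa" (length 2), so this string should be rejected.
should_reject = "xyz123aabbaaab"

# search() skips the short run and finds the later "aaa" instead.
m = pattern.search(should_reject)
found_at = m.start() if m else None
print(found_at)
```

A match is found at the second run of 'a', so the pattern accepts a string that the requirement says to reject.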



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Alex Martelli [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 Tim Chase [EMAIL PROTECTED] wrote:
   ...
 I'm not quite sure what your intent here is, as the
 resulting find would obviously be aaa, of length 3.

 But that would also match 'aaaa'; I think he wants negative lookbehind
 and lookahead assertions around the 'aaa' part.  But then there's the
 spec about matching only if the sequence is the first occurrence of
 'a's, so maybe he wants '^[^a]*' instead of the lookbehind (and maybe
 parentheses around the 'aaa' to somehow 'match' it specially?).

 It's definitely not very clear what exactly the intent is, no...

Sorry for the confusion.  The correct pattern should reject all strings 
except those in which the first sequence of the letter 'a' that is followed 
by the letter 'b' has a length of exactly three.

Hope that's clearer . . . .



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Sybren Stuvel [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 Roger L. Cauvin enlightened us with:
 I'm looking for a regular expression that matches the first, and
 only the first, sequence of the letter 'a', and only if the length
 of the sequence is exactly 3.

 Your request is ambiguous:

 1) You're looking for the first, and only the first, sequence of the
   letter 'a'. If the length of this first, and only the first,
   sequence of the letter 'a' is not 3, no match is made at all.

 2) You're looking for the first, and only the first, sequence of
   length 3 of the letter 'a'.

 What is it?

The first option describes what I want, with the additional restriction that 
the first sequence of the letter 'a' is defined as 1 or more consecutive 
occurrences of the letter 'a', followed directly by the letter 'b'.
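
Stated as plain code rather than as a regular expression, the requirement looks roughly like the sketch below. The function names `first_a_run_before_b` and `accept` are made up for illustration and do not appear in the thread; Python 3 is assumed.

```python
def first_a_run_before_b(s):
    """Length of the first run of 'a' directly followed by 'b', or None if there is none."""
    i, n = 0, len(s)
    while i < n:
        if s[i] == "a":
            j = i
            while j < n and s[j] == "a":
                j += 1              # j ends just past the run of 'a'
            if j < n and s[j] == "b":
                return j - i        # this run counts: it is followed by 'b'
            i = j                   # run not followed by 'b'; keep scanning
        else:
            i += 1
    return None


def accept(s):
    """Accept only if the first qualifying run has length exactly 3."""
    return first_a_run_before_b(s) == 3


print(accept("xyz123aaabbaaabaaaabb"))  # the example string from the first post
print(accept("xyz123aabbaaab"))         # first run has length 2
```

A function like this can serve as a reference when judging the regular expressions proposed in the replies.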



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Tim Chase [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 Sorry for the confusion.  The correct pattern should reject
 all strings except those in which the first sequence of the
 letter 'a' that is followed by the letter 'b' has a length of
 exactly three.

 Ah...a little more clear.

 r = re.compile("[^a]*a{3}b+(a+b*)*")
 matches = [s for s in listOfStringsToTest if r.match(s)]

Wow, I like it, but it allows some strings it shouldn't.  For example:

xyz123aabbaaab

(It skips over the two-letter sequence of 'a' and matches 'bbaaab'.)
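
The 'bbaaab' result described here is what `search()` returns; with `match()`, as in Tim's snippet, this particular string is not matched at all. A small sketch (Python 3 assumed, variable names illustrative) showing the difference:

```python
import re

r = re.compile(r"[^a]*a{3}b+(a+b*)*")
s = "xyz123aabbaaab"

# match() only tries position 0: "xyz123" is consumed, then a{3} meets "aab" and fails.
m_match = r.match(s)

# search() retries at later positions and succeeds starting at the "bb".
m_search = r.search(s)

print(m_match)
print(m_search.group())
```

So whether the pattern appears to "skip over" the two-letter run depends on which method the surrounding code calls.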



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Christos Georgiou [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 On Thu, 26 Jan 2006 14:09:54 GMT, rumours say that Roger L. Cauvin
 [EMAIL PROTECTED] might have written:

Say I have some string that begins with an arbitrary sequence of 
characters
and then alternates repeating the letters 'a' and 'b' any number of times,
e.g.

xyz123aaabbaaabaaaabb

I'm looking for a regular expression that matches the first, and only the
first, sequence of the letter 'a', and only if the length of the sequence 
is
exactly 3.

Does such a regular expression exist?  If so, any ideas as to what it 
could
be?

 Is this what you mean?

 ^[^a]*(a{3})(?:[^a].*)?$

Close, but the pattern should allow the arbitrary sequence of characters 
that precedes the alternating a's and b's to contain the letter 'a'.  In 
other words, the pattern should accept:

xayz123aaabbab

since the 'a' between the 'x' and 'y' is not directly followed by a 'b'.

Your proposed pattern  rejects this string.
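
A quick check of that claim, as a sketch in Python 3 with illustrative variable names:

```python
import re

christos = re.compile(r"^[^a]*(a{3})(?:[^a].*)?$")

plain_prefix = "xyz123aaabbab"    # no 'a' before the run of three
prefix_with_a = "xayz123aaabbab"  # Roger's counterexample

# [^a]* cannot step over the 'a' in "xa...", so the second string never matches.
plain_ok = christos.match(plain_prefix) is not None
counter_ok = christos.match(prefix_with_a) is not None
print(plain_ok, counter_ok)
```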



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Tim Chase [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
r = re.compile("[^a]*a{3}b+(a+b*)*")
matches = [s for s in listOfStringsToTest if r.match(s)]

 Wow, I like it, but it allows some strings it shouldn't.  For example:

 xyz123aabbaaab

 (It skips over the two-letter sequence of 'a' and matches 'bbaaab'.)

 Anchoring it to the beginning/end might solve that:

 r = re.compile("^[^a]*a{3}b+(a+b*)*$")

 this ensures that no "a"s come before the first 3x"a" and nothing but "b" 
 and "a" follows it.

Anchoring may be the key here, but this pattern rejects

xayz123aaabab

which it should accept, since the 'a' between the 'x' and the 'y' is not 
directly followed by the letter 'b'.



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Peter Hansen [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 Roger L. Cauvin wrote:
 Sorry for the confusion.  The correct pattern should reject all strings 
 except those in which the first sequence of the letter 'a' that is 
 followed by the letter 'b' has a length of exactly three.

 Hope that's clearer . . . .

 Examples are a *really* good way to clarify ambiguous or complex 
 requirements.  In fact, when made executable they're called test cases 
 :-), and supplying a few of those (showing input values and expected 
 output values) would help, not only to clarify your goals for the humans, 
 but also to let the proposed solutions easily be tested.

Good suggestion.  Here are some test cases:

xyz123aaabbab accept
xyz123aabbaab reject
xayz123aaabab accept
xaaayz123abab reject
xaaayz123aaabab accept



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Alex Martelli [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 Tim Chase [EMAIL PROTECTED] wrote:

  Sorry for the confusion.  The correct pattern should reject
  all strings except those in which the first sequence of the
  letter 'a' that is followed by the letter 'b' has a length of
  exactly three.
...
...
 If a little more than just REs and matching was allowed, it would be
 reasonably easy, but I don't know how to fashion a RE r such that
 r.match(s) will succeed if and only if s meets those very precise and
 complicated specs.  That doesn't mean it just can't be done, just that I
 can't do it so far.  Perhaps the OP can tell us what constrains him to
 use r.match ONLY, rather than a little bit of logic around it, so we can
 see if we're trying to work in an artificially overconstrained domain?

Alex, you seem to grasp exactly what the requirements are in this case.  I 
of course don't *have* to use regular expressions only, but I'm working with 
an infrastructure that uses regexps in configuration files so that the code 
doesn't have to change to add or change patterns.  Before throwing up my 
hands and re-architecting, I wanted to see if regexps would handle the job 
(they have in every case but one).
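
To make the constraint concrete, here is a rough sketch of what a configuration-driven setup of this kind might look like. Everything in it is hypothetical (the `[patterns]` section, the rule names, the helper functions); the thread does not describe the actual infrastructure. Python 3 and the standard `configparser` module are assumed.

```python
import configparser
import re

# Hypothetical configuration text: rule names mapped to regular expressions.
CONFIG_TEXT = r"""
[patterns]
three_a = a{3}
digits = \d+
"""


def load_patterns(text):
    """Compile every regex listed in the (hypothetical) [patterns] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {name: re.compile(rx) for name, rx in parser["patterns"].items()}


def matching_rules(patterns, s):
    """Names of all rules whose regex matches somewhere in s."""
    return sorted(name for name, rx in patterns.items() if rx.search(s))


patterns = load_patterns(CONFIG_TEXT)
print(matching_rules(patterns, "xyz123aaabbab"))
```

In a design like this the only per-rule input is the pattern string, which is why a check such as `len(m.group()) == 4` in code is not an option without changing the program.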



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Fredrik Lundh [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 Roger L. Cauvin wrote:

 Good suggestion.  Here are some test cases:

 xyz123aaabbab accept
 xyz123aabbaab reject
 xayz123aaabab accept
 xaaayz123abab reject
 xaaayz123aaabab accept

 $ more test.py

 import re

 print("got expected")
 print("---")

 testsuite = (
     ("xyz123aaabbab", "accept"),
     ("xyz123aabbaab", "reject"),
     ("xayz123aaabab", "accept"),
     ("xaaayz123abab", "reject"),
     ("xaaayz123aaabab", "accept"),
 )

 for string, result in testsuite:
     m = re.search("aaab", string)
     if m:
         print("accept", end=" ")
     else:
         print("reject", end=" ")
     print(result)


 $ python test.py
 got expected
 ---
 accept accept
 reject reject
 accept accept
 reject reject
 accept accept

Thanks, but the second test case I listed contained a typo.  It should have 
contained a sequence of three of the letter 'a'.  The test cases should be:

xyz123aaabbab accept
xyz123aabbaaab reject
xayz123aaabab accept
xaaayz123abab reject
xaaayz123aaabab accept

Your pattern fails the second test.
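
The failure is easy to reproduce with a small table-driven check. This is only a sketch: the helper name `failures` is made up here, and Python 3 is assumed.

```python
import re

# Corrected test cases from this message: (input, expected verdict).
CASES = [
    ("xyz123aaabbab", "accept"),
    ("xyz123aabbaaab", "reject"),
    ("xayz123aaabab", "accept"),
    ("xaaayz123abab", "reject"),
    ("xaaayz123aaabab", "accept"),
]


def failures(pattern, cases):
    """Return the inputs whose verdict under re.search differs from the expected one."""
    rx = re.compile(pattern)
    failed = []
    for text, expected in cases:
        got = "accept" if rx.search(text) else "reject"
        if got != expected:
            failed.append(text)
    return failed


print(failures("aaab", CASES))
```

Only the corrected second case is reported, because a plain search for "aaab" finds the later run of three and ignores the earlier run of two.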



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Christos Georgiou [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 On Thu, 26 Jan 2006 16:41:08 GMT, rumours say that Roger L. Cauvin
 [EMAIL PROTECTED] might have written:

Good suggestion.  Here are some test cases:

xyz123aaabbab accept
xyz123aabbaab reject
xayz123aaabab accept
xaaayz123abab reject
xaaayz123aaabab accept

 Applying my last regex to your test cases:

 >>> r.match("xyz123aaabbab")
 <_sre.SRE_Match object at 0x00B47F60>
 >>> r.match("xyz123aabbaab")
 >>> r.match("xayz123aaabab")
 <_sre.SRE_Match object at 0x00B50020>
 >>> r.match("xaaayz123abab")
 >>> r.match("xaaayz123aaabab")
 <_sre.SRE_Match object at 0x00B47F60>
 >>> print(r.pattern)
 ^(?:.*?[^a])?(a{3})(?:b[ab]*)?$

 You should also remember to check the (match_object).start(1) to verify 
 that
 it matches the aaa you want.

Thanks, but the second test case I listed contained a typo.  It should have 
contained a sequence of three of the letter 'a'.  The test cases should be:

xyz123aaabbab accept
xyz123aabbaaab reject
xayz123aaabab accept
xaaayz123abab reject
xaaayz123aaabab accept

Your pattern fails the second test.



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Christos Georgiou [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 On Thu, 26 Jan 2006 16:26:57 GMT, rumours say that Roger L. Cauvin
 [EMAIL PROTECTED] might have written:

Christos Georgiou [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]

 On Thu, 26 Jan 2006 14:09:54 GMT, rumours say that Roger L. Cauvin
 [EMAIL PROTECTED] might have written:

Say I have some string that begins with an arbitrary sequence of
characters
and then alternates repeating the letters 'a' and 'b' any number of 
times,
e.g.

xyz123aaabbaaabaaaabb

I'm looking for a regular expression that matches the first, and only 
the
first, sequence of the letter 'a', and only if the length of the 
sequence
is
exactly 3.

Does such a regular expression exist?  If so, any ideas as to what it
could
be?

 Is this what you mean?

 ^[^a]*(a{3})(?:[^a].*)?$

Close, but the pattern should allow the arbitrary sequence of characters
that precedes the alternating a's and b's to contain the letter 'a'.  In
other words, the pattern should accept:

xayz123aaabbab

since the 'a' between the 'x' and 'y' is not directly followed by a 'b'.

Your proposed pattern  rejects this string.

 1.

 (a{3})(?:b[ab]*)?$

 This finds the first (leftmost) aaa either at the end of the string or
 followed by 'b' and then arbitrary sequences of 'a' and 'b'.

 This will also match 'aaaa' (from second position on).

 2.

 If you insist on only three 'a's and you can add the constraint that:

 * let s be the arbitrary sequence of characters at the start of your
 searched text
 * len(s) >= 1 and not s.endswith('a')

 then you'll have this reg.ex.

 (?<=[^a])(a{3})(?:b[ab]*)?$

 3.

 If you want to allow for a possible empty arbitrary sequence of 
 characters
 at the start and you don't mind search speed

 ^(?:.*?[^a])?(a{3})(?:b[ab]*)?$

 This should cover you:

 >>> s="xayzbaaa123aaabbab"
 >>> r=re.compile(r"^(?:.*?[^a])?(a{3})(?:b[ab]*)?$")
 >>> m= r.match(s)
 >>> m.group(1)
 'aaa'
 >>> m.start(1)
 11
 >>> s[11:]
 'aaabbab'

Thanks for continuing to follow up, Christos.  Please see my reply to your 
other post (in which you applied the test cases).



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Christos Georgiou [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 On Thu, 26 Jan 2006 18:01:07 +0100, rumours say that Fredrik Lundh
 [EMAIL PROTECTED] might have written:

Roger L. Cauvin wrote:

 Good suggestion.  Here are some test cases:

 xyz123aaabbab accept
 xyz123aabbaab reject
 xayz123aaabab accept
 xaaayz123abab reject
 xaaayz123aaabab accept

$ more test.py

 [snip of code]
m = re.search("aaab", string)
 [snip of more code]

$ python test.py
got expected
---
accept accept
reject reject
accept accept
reject reject
accept accept

 You're right, Fredrik, but we (graciously as a group :) take also notice 
 of
 the other requirements that the OP has provided elsewhere and that are not
 covered by the simple test that he specified.

My fault, guys.  The second test case should be

xyz123aabbaaab reject

instead of

xyz123aabbaab reject

Fredrik's pattern fails this test case.



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Christos Georgiou [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 On Thu, 26 Jan 2006 17:09:18 GMT, rumours say that Roger L. Cauvin
 [EMAIL PROTECTED] might have written:

Thanks, but the second test case I listed contained a typo.  It should 
have
contained a sequence of three of the letter 'a'.  The test cases should 
be:

xyz123aaabbab accept
xyz123aabbaaab reject

 Here I object to either you or your need for a regular expression.  You 
 see,
 before the aaa in your second test case, you have an arbitrary sequence
 of characters, so your requirements are met.

Well, thank you for your efforts so far, Christos.

My purpose is to determine whether it's possible to do this using regular 
expressions, since my application is already architected around 
configuration files that use regular expressions.  It may not be the best 
architecture, but I still don't know the answer to my question.  Is it 
*possible* to fulfill my requirements with regular expressions, even if it's 
not the best way to do it?

The requirements are not met by your regular expression, since by definition 
the arbitrary sequence of characters stops once the sequences of a's and 
b's starts.



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Fredrik Lundh [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 Roger L. Cauvin wrote:

  $ python test.py
  got expected
  ---
  accept accept
  reject reject
  accept accept
  reject reject
  accept accept

 Thanks, but the second test case I listed contained a typo.  It should 
 have
 contained a sequence of three of the letter 'a'.  The test cases should 
 be:

 xyz123aaabbab accept
 xyz123aabbaaab reject
 xayz123aaabab accept
 xaaayz123abab reject
 xaaayz123aaabab accept

 Your pattern fails the second test.

 $ more test.py

 import re

 print("got expected")
 print("---")

 testsuite = (
     ("xyz123aaabbab", "accept"),
     ("xyz123aabbaaab", "reject"),
     ("xayz123aaabab", "accept"),
     ("xaaayz123abab", "reject"),
     ("xaaayz123aaabab", "accept"),
 )

 for string, result in testsuite:
     # the first run of 'a' followed by 'b' must be exactly "aaab"
     m = re.search("a+b", string)
     if m and len(m.group()) == 4:
         print("accept", end=" ")
     else:
         print("reject", end=" ")
     print(result)

 $ python test.py

 got expected
 ---
 accept accept
 reject reject
 accept accept
 reject reject
 accept accept

Thanks, but I'm looking for a solution in terms of a regular expression 
only.  In other words, "accept" means the regular expression matched, and 
"reject" means the regular expression did not match.  I want to see if I can 
fulfill the requirements without additional code (such as checking 
len(m.group())).



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Michael Spencer [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 Roger L. Cauvin wrote:

 xyz123aaabbab accept
 xyz123aabbaaab reject
 xayz123aaabab accept
 xaaayz123abab reject
 xaaayz123aaabab accept

 This passes your tests.  I haven't closely followed the thread for other 
 requirements:

   pattern = ".*?(?<![a+b])aaab" #look for aaab not preceded by any a+b

Very interesting.  I think you may have solved the problem.  The key seems 
to be the "not preceded by" part.  I'm unfamiliar with some of the notation. 
Can you explain what [a+b] and the (?<! do?
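
For readers with the same question, a short sketch of the two constructs, assuming the pattern as posted was `.*?(?<![a+b])aaab` (the `<` appears to have been lost in the archive). Python 3 is assumed and the variable names are illustrative.

```python
import re

# Inside square brackets '+' is a literal, so [a+b] matches ONE character:
# an 'a', a '+', or a 'b'. It does not mean "one or more a, then b".
char_class = re.compile(r"[a+b]")
print(char_class.findall("a+bc"))

# (?<!X) is a negative lookbehind: it succeeds only where the text just
# before the current position does NOT match X. It consumes no characters.
pattern = re.compile(r".*?(?<![a+b])aaab")

accepted = bool(pattern.match("xyz123aaabbab"))   # "aaab" preceded by '3'
rejected = bool(pattern.match("xyz123aabbaaab"))  # "aaab" preceded by 'b'
print(accepted, rejected)
```

In other words, the pattern accepts an "aaab" only when the character immediately before it is not 'a', '+' or 'b'.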



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Tim Chase [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 The below seems to pass all the tests you threw at it (taking the modified 
 2nd test into consideration)

 One other test that occurs to me would be

 xyz123aaabbaaabab

 where you have aaab in there twice.

Good suggestion.

 ^([^b]|((?<!a)b))*aaab+[ab]*$

Looks good, although I've been unable to find a good explanation of the 
negative lookbehind construct (?<!.  How does it work?
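
A brief sketch of how the construct behaves, followed by Tim's pattern run against the test cases posted so far. It assumes the pattern read `^([^b]|((?<!a)b))*aaab+[ab]*$` before the archive dropped the `<`, and that Tim's extra string should be accepted (its first run of 'a' before a 'b' has length 3). Python 3; names are illustrative.

```python
import re

# (?<!a)b matches a 'b' only when the character just before it is not 'a'.
# The lookbehind checks the preceding text without consuming it.
lookbehind = re.compile(r"(?<!a)b")
positions = [m.start() for m in lookbehind.finditer("abxbbb")]
print(positions)  # the 'b' at index 1 follows an 'a', so it is skipped

# Prefix: any non-'b', or a 'b' not preceded by 'a'; then the run "aaab".
tim = re.compile(r"^([^b]|((?<!a)b))*aaab+[ab]*$")

cases = [
    ("xyz123aaabbab", True),
    ("xyz123aabbaaab", False),
    ("xayz123aaabab", True),
    ("xaaayz123abab", False),
    ("xaaayz123aaabab", True),
    ("xyz123aaabbaaabab", True),   # Tim's extra case; expected verdict assumed
]
mismatches = [s for s, expected in cases if bool(tim.match(s)) != expected]
print(mismatches)

# One input the posted tests do not cover (a made-up string): the prefix
# may absorb a leading 'a', so a run of four before 'b' is still accepted.
edge_accepted = bool(tim.match("xyz123aaaab"))
print(edge_accepted)
```

The pattern agrees with every listed case. The last line shows a possible gap if a run of four 'a's can directly precede the first 'b'; whether that matters depends on the real input.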



Re: Match First Sequence in Regular Expression?

2006-01-26 Thread Roger L. Cauvin
Peter Hansen [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 Roger L. Cauvin wrote:
 Michael Spencer [EMAIL PROTECTED] wrote in message 
 news:[EMAIL PROTECTED]

Roger L. Cauvin wrote:
xyz123aaabbab accept
xyz123aabbaaab reject
xayz123aaabab accept
xaaayz123abab reject
xaaayz123aaabab accept


This passes your tests.  I haven't closely followed the thread for other 
requirements:

  pattern = ".*?(?<![a+b])aaab" #look for aaab not preceded by any a+b

 Very interesting.  I think you may have solved the problem.  The key 
 seems to be the "not preceded by" part.  I'm unfamiliar with some of the 
 notation. Can you explain what [a+b] and the (?<! do?

 I think you might need to add a test case involving a pattern of b 
 prior to another aaab.  From what I gather (not reading too closely), you 
 would want this to be rejected.  Is that true?

 xyz123babaaabab

Adding that test would be a good idea.  You're right; I would want that 
string to be rejected, since in that string the first sequence of 'a' 
directly preceding a 'b' is of length 4 instead of 3.

Thanks for the solution!
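
As a closing check, here is a sketch that runs Michael's pattern over every test string mentioned in the thread, including the two later additions. It assumes the pattern was `.*?(?<![a+b])aaab`, that Tim's extra string should be accepted, and it adds one made-up string whose first run really has length 4. Python 3; names are illustrative.

```python
import re

michael = re.compile(r".*?(?<![a+b])aaab")

cases = [
    ("xyz123aaabbab", True),
    ("xyz123aabbaaab", False),
    ("xayz123aaabab", True),
    ("xaaayz123abab", False),
    ("xaaayz123aaabab", True),
    ("xyz123aaabbaaabab", True),   # Tim's extra case; expected verdict assumed
    ("xyz123babaaabab", False),    # Peter's extra case
    ("xyz123aaaab", False),        # hypothetical: first run has length 4
]

# Collect every input where the regex verdict differs from the expected one.
mismatches = [s for s, expected in cases if bool(michael.match(s)) != expected]
print(mismatches)
```

An empty list means the pattern gives the expected verdict for all of these strings. It says nothing about inputs where characters other than 'a' and 'b' appear after the alternating part begins.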
