Match First Sequence in Regular Expression?
Say I have some string that begins with an arbitrary sequence of characters and then alternates repeating the letters 'a' and 'b' any number of times, e.g.

    xyz123aaabbaaabaaaabb

I'm looking for a regular expression that matches the first, and only the first, sequence of the letter 'a', and only if the length of the sequence is exactly 3.

Does such a regular expression exist? If so, any ideas as to what it could be?

--
Roger L. Cauvin
[EMAIL PROTECTED] (omit the nospam_ part)
Cauvin, Inc.
Product Management / Market Research
http://www.cauvin-inc.com
--
http://mail.python.org/mailman/listinfo/python-list
Re: Match First Sequence in Regular Expression?
Christoph Conrad [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> Hello Roger,
>
>> I'm looking for a regular expression that matches the first, and only the first, sequence of the letter 'a', and only if the length of the sequence is exactly 3.
>
>     import sys, re, os
>
>     if __name__ == '__main__':
>         m = re.search('a{3}', 'xyz123aaabbaaaabaaabb')
>         print m.group(0)
>         print "Preceded by: \"" + m.string[0:m.start(0)] + "\""

The correct pattern should reject the string:

    'xyz123aabbaaab'

since the length of the first sequence of the letter 'a' is 2. Yours accepts it, right?

-- Roger L. Cauvin
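To make the objection concrete, here is a minimal Python 3 sketch (the code quoted in the thread is Python 2). It only illustrates that an unanchored search for `a{3}` still finds a match in the string that should be rejected. The variable names are illustrative, not from the thread.

```python
import re

# The string that, per the reply above, must be rejected:
# its first run of 'a' has length 2.
should_reject = "xyz123aabbaaab"

# An unanchored search for exactly three a's ignores the short run
# and matches a later run instead.
m = re.search(r"a{3}", should_reject)
found_at = m.start() if m else None
print(found_at)  # index where the matched 'aaa' begins
```

The search succeeds on the later run, so the pattern cannot express "only the first sequence".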
Re: Match First Sequence in Regular Expression?
Alex Martelli [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> Tim Chase [EMAIL PROTECTED] wrote:
> ...
>> I'm not quite sure what your intent here is, as the resulting find would obviously be "aaa", of length 3.
>
> But that would also match 'aaaa'; I think he wants negative lookbehind and lookahead assertions around the 'aaa' part. But then there's the spec about matching only if the sequence is the first occurrence of 'a's, so maybe he wants '$[^a]*' instead of the lookbehind (and maybe parentheses around the 'aaa' to somehow 'match' it specially?). It's definitely not very clear what exactly the intent is, no...

Sorry for the confusion. The correct pattern should reject all strings except those in which the first sequence of the letter 'a' that is followed by the letter 'b' has a length of exactly three.

Hope that's clearer . . . .

-- Roger L. Cauvin
Re: Match First Sequence in Regular Expression?
Sybren Stuvel [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> Roger L. Cauvin enlightened us with:
>
>> I'm looking for a regular expression that matches the first, and only the first, sequence of the letter 'a', and only if the length of the sequence is exactly 3.
>
> Your request is ambiguous:
>
> 1) You're looking for the first, and only the first, sequence of the letter 'a'. If the length of this first, and only the first, sequence of the letter 'a' is not 3, no match is made at all.
>
> 2) You're looking for the first, and only the first, sequence of length 3 of the letter 'a'.
>
> What is it?

The first option describes what I want, with the additional restriction that the "first sequence of the letter 'a'" is defined as 1 or more consecutive occurrences of the letter 'a', followed directly by the letter 'b'.

-- Roger L. Cauvin
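The requirement as restated above can be pinned down with a small regex-free reference function. This is only a sketch of the specification in Python 3, not a proposed answer to the regex question. The function name `first_a_run_before_b` is made up for illustration.

```python
def first_a_run_before_b(s):
    """Return the length of the first run of 'a' that is directly
    followed by 'b', or None if there is no such run."""
    i = 0
    while i < len(s):
        if s[i] == "a":
            j = i
            while j < len(s) and s[j] == "a":
                j += 1                      # j ends just past the run of a's
            if j < len(s) and s[j] == "b":
                return j - i                # run is directly followed by 'b'
            i = j                           # run not followed by 'b': skip it
        else:
            i += 1
    return None


def should_accept(s):
    # Accept only when that first qualifying run has length exactly 3.
    return first_a_run_before_b(s) == 3


print(should_accept("xyz123aaabbab"), should_accept("xyz123aabbaaab"))
```

A function like this is handy as an oracle when checking candidate patterns against many strings.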
Re: Match First Sequence in Regular Expression?
Tim Chase [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

>> Sorry for the confusion. The correct pattern should reject all strings except those in which the first sequence of the letter 'a' that is followed by the letter 'b' has a length of exactly three.
>
> Ah...a little more clear.
>
>     r = re.compile("[^a]*a{3}b+(a+b*)*")
>     matches = [s for s in listOfStringsToTest if r.match(s)]

Wow, I like it, but it allows some strings it shouldn't. For example:

    xyz123aabbaaab

(It skips over the two-letter sequence of 'a' and matches 'bbaaab'.)

-- Roger L. Cauvin
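The behaviour described in the reply ("matches 'bbaaab'") is what an unanchored search does. With `r.match`, which only tries position 0, the same pattern does not match that string. The Python 3 sketch below shows both. It assumes, without confirmation from the thread, that the configuration-file infrastructure applies patterns with search semantics.

```python
import re

r = re.compile(r"[^a]*a{3}b+(a+b*)*")
s = "xyz123aabbaaab"

anchored = r.match(s)     # must match starting at index 0
unanchored = r.search(s)  # may start matching at any index

print(anchored)
print(unanchored.group() if unanchored else None)
```

This is why anchoring with `^` comes up next in the thread: it makes search behave like match for this pattern.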
Re: Match First Sequence in Regular Expression?
Christos Georgiou [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> On Thu, 26 Jan 2006 14:09:54 GMT, rumours say that Roger L. Cauvin [EMAIL PROTECTED] might have written:
>
>> Say I have some string that begins with an arbitrary sequence of characters and then alternates repeating the letters 'a' and 'b' any number of times, e.g.
>>
>>     xyz123aaabbaaabaaaabb
>>
>> I'm looking for a regular expression that matches the first, and only the first, sequence of the letter 'a', and only if the length of the sequence is exactly 3.
>>
>> Does such a regular expression exist? If so, any ideas as to what it could be?
>
> Is this what you mean?
>
>     ^[^a]*(a{3})(?:[^a].*)?$

Close, but the pattern should allow the arbitrary sequence of characters that precedes the alternating a's and b's to contain the letter 'a'. In other words, the pattern should accept:

    xayz123aaabbab

since the 'a' between the 'x' and 'y' is not directly followed by a 'b'. Your proposed pattern rejects this string.

-- Roger L. Cauvin
Re: Match First Sequence in Regular Expression?
Tim Chase [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

>>>     r = re.compile("[^a]*a{3}b+(a+b*)*")
>>>     matches = [s for s in listOfStringsToTest if r.match(s)]
>>
>> Wow, I like it, but it allows some strings it shouldn't. For example:
>>
>>     xyz123aabbaaab
>>
>> (It skips over the two-letter sequence of 'a' and matches 'bbaaab'.)
>
> Anchoring it to the beginning/end might solve that:
>
>     r = re.compile("^[^a]*a{3}b+(a+b*)*$")
>
> this ensures that no "a"s come before the first 3x"a" and nothing but "b" and "a" follows it.

Anchoring may be the key here, but this pattern rejects

    xayz123aaabab

which it should accept, since the 'a' between the 'x' and the 'y' is not directly followed by the letter 'b'.

-- Roger L. Cauvin
Re: Match First Sequence in Regular Expression?
Peter Hansen [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> Roger L. Cauvin wrote:
>
>> Sorry for the confusion. The correct pattern should reject all strings except those in which the first sequence of the letter 'a' that is followed by the letter 'b' has a length of exactly three.
>>
>> Hope that's clearer . . . .
>
> Examples are a *really* good way to clarify ambiguous or complex requirements. In fact, when made executable they're called test cases :-), and supplying a few of those (showing input values and expected output values) would help, not only to clarify your goals for the humans, but also to let the proposed solutions easily be tested.

Good suggestion. Here are some test cases:

    xyz123aaabbab accept
    xyz123aabbaab reject
    xayz123aaabab accept
    xaaayz123abab reject
    xaaayz123aaabab accept

-- Roger L. Cauvin
Re: Match First Sequence in Regular Expression?
Alex Martelli [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> Tim Chase [EMAIL PROTECTED] wrote:
>
>>> Sorry for the confusion. The correct pattern should reject all strings except those in which the first sequence of the letter 'a' that is followed by the letter 'b' has a length of exactly three.
>> ...
>
> ...
> If a little more than just REs and matching was allowed, it would be reasonably easy, but I don't know how to fashion a RE r such that r.match(s) will succeed if and only if s meets those very precise and complicated specs. That doesn't mean it just can't be done, just that I can't do it so far. Perhaps the OP can tell us what constrains him to use r.match ONLY, rather than a little bit of logic around it, so we can see if we're trying to work in an artificially overconstrained domain?

Alex, you seem to grasp exactly what the requirements are in this case. I of course don't *have* to use regular expressions only, but I'm working with an infrastructure that uses regexps in configuration files so that the code doesn't have to change to add or change patterns. Before throwing up my hands and re-architecting, I wanted to see if regexps would handle the job (they have in every case but one).

-- Roger L. Cauvin
Re: Match First Sequence in Regular Expression?
Fredrik Lundh [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> Roger L. Cauvin wrote:
>
>> Good suggestion. Here are some test cases:
>>
>>     xyz123aaabbab accept
>>     xyz123aabbaab reject
>>     xayz123aaabab accept
>>     xaaayz123abab reject
>>     xaaayz123aaabab accept
>
>     $ more test.py
>
>     import re
>
>     print "got expected"
>     print "---"
>
>     testsuite = (
>         ("xyz123aaabbab", "accept"),
>         ("xyz123aabbaab", "reject"),
>         ("xayz123aaabab", "accept"),
>         ("xaaayz123abab", "reject"),
>         ("xaaayz123aaabab", "accept"),
>     )
>
>     for string, result in testsuite:
>         m = re.search("aaab", string)
>         if m:
>             print "accept",
>         else:
>             print "reject",
>         print result
>
>     $ python test.py
>     got expected
>     ---
>     accept accept
>     reject reject
>     accept accept
>     reject reject
>     accept accept

Thanks, but the second test case I listed contained a typo. It should have contained a sequence of three of the letter 'a'. The test cases should be:

    xyz123aaabbab accept
    xyz123aabbaaab reject
    xayz123aaabab accept
    xaaayz123abab reject
    xaaayz123aaabab accept

Your pattern fails the second test.

-- Roger L. Cauvin
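For reference, here is a Python 3 sketch of the same kind of test harness using the corrected test cases from the reply above. The helper name `failures` is made up for illustration. It treats "the regex found a match" as accept and reports the strings where that disagrees with the expectation.

```python
import re

# Corrected test cases from the message above.
TEST_CASES = [
    ("xyz123aaabbab", "accept"),
    ("xyz123aabbaaab", "reject"),
    ("xayz123aaabab", "accept"),
    ("xaaayz123abab", "reject"),
    ("xaaayz123aaabab", "accept"),
]


def failures(pattern, cases=TEST_CASES):
    """Return the strings whose search result disagrees with the expectation."""
    regex = re.compile(pattern)
    failed = []
    for string, expected in cases:
        got = "accept" if regex.search(string) else "reject"
        if got != expected:
            failed.append(string)
    return failed


print(failures("aaab"))
```

With the plain `aaab` search, only the corrected second case is reported, which is the failure pointed out in the reply.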
Re: Match First Sequence in Regular Expression?
Christos Georgiou [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> On Thu, 26 Jan 2006 16:41:08 GMT, rumours say that Roger L. Cauvin [EMAIL PROTECTED] might have written:
>
>> Good suggestion. Here are some test cases:
>>
>>     xyz123aaabbab accept
>>     xyz123aabbaab reject
>>     xayz123aaabab accept
>>     xaaayz123abab reject
>>     xaaayz123aaabab accept
>
> Applying my last regex to your test cases:
>
>     >>> r.match("xyz123aaabbab")
>     <_sre.SRE_Match object at 0x00B47F60>
>     >>> r.match("xyz123aabbaab")
>     >>> r.match("xayz123aaabab")
>     <_sre.SRE_Match object at 0x00B50020>
>     >>> r.match("xaaayz123abab")
>     >>> r.match("xaaayz123aaabab")
>     <_sre.SRE_Match object at 0x00B47F60>
>     >>> print r.pattern
>     ^(?:.*?[^a])?(a{3})(?:b[ab]*)?$
>
> You should also remember to check the (match_object).start(1) to verify that it matches the "aaa" you want.

Thanks, but the second test case I listed contained a typo. It should have contained a sequence of three of the letter 'a'. The test cases should be:

    xyz123aaabbab accept
    xyz123aabbaaab reject
    xayz123aaabab accept
    xaaayz123abab reject
    xaaayz123aaabab accept

Your pattern fails the second test.

-- Roger L. Cauvin
Re: Match First Sequence in Regular Expression?
Christos Georgiou [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> On Thu, 26 Jan 2006 16:26:57 GMT, rumours say that Roger L. Cauvin [EMAIL PROTECTED] might have written:
>
>> Christos Georgiou [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]
>>
>>> On Thu, 26 Jan 2006 14:09:54 GMT, rumours say that Roger L. Cauvin [EMAIL PROTECTED] might have written:
>>>
>>>> Say I have some string that begins with an arbitrary sequence of characters and then alternates repeating the letters 'a' and 'b' any number of times, e.g.
>>>>
>>>>     xyz123aaabbaaabaaaabb
>>>>
>>>> I'm looking for a regular expression that matches the first, and only the first, sequence of the letter 'a', and only if the length of the sequence is exactly 3.
>>>>
>>>> Does such a regular expression exist? If so, any ideas as to what it could be?
>>>
>>> Is this what you mean?
>>>
>>>     ^[^a]*(a{3})(?:[^a].*)?$
>>
>> Close, but the pattern should allow the arbitrary sequence of characters that precedes the alternating a's and b's to contain the letter 'a'. In other words, the pattern should accept:
>>
>>     xayz123aaabbab
>>
>> since the 'a' between the 'x' and 'y' is not directly followed by a 'b'. Your proposed pattern rejects this string.
>
> 1.
>
>     (a{3})(?:b[ab]*)?$
>
> This finds the first (leftmost) "aaa" either at the end of the string or followed by 'b' and then arbitrary sequences of 'a' and 'b'. This will also match "aaaa" (from second position on).
>
> 2.
>
> If you insist on only three 'a's and you can add the constraint that:
>
> * let s be the "arbitrary sequence of characters" at the start of your searched text
> * len(s) >= 1 and not s.endswith('a')
>
> then you'll have this reg.ex.
>
>     (?<=[^a])(a{3})(?:b[ab]*)?$
>
> 3.
>
> If you want to allow for a possible empty "arbitrary sequence of characters" at the start and you don't mind search speed
>
>     ^(?:.*?[^a])?(a{3})(?:b[ab]*)?$
>
> This should cover you:
>
>     >>> s = "xayzbaaa123aaabbab"
>     >>> r = re.compile(r"^(?:.*?[^a])?(a{3})(?:b[ab]*)?$")
>     >>> m = r.match(s)
>     >>> m.group(1)
>     'aaa'
>     >>> m.start(1)
>     11
>     >>> s[11:]
>     'aaabbab'

Thanks for continuing to follow up, Christos. Please see my reply to your other post (in which you applied the test cases).

-- Roger L. Cauvin
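The interactive session quoted above can be replayed in Python 3 as a quick check. The sketch below compiles the same pattern as the session and also applies it to the corrected second test case (`xyz123aabbaaab`, expected to be rejected) to show where it falls short. Variable names are illustrative.

```python
import re

# Same pattern as compiled in the quoted session.
r = re.compile(r"^(?:.*?[^a])?(a{3})(?:b[ab]*)?$")

m = r.match("xayzbaaa123aaabbab")
print(m.group(1), m.start(1))   # the captured 'aaa' and where it begins

# The corrected second test case should be rejected, but the lazy prefix
# is free to swallow the short 'aa' run together with the b's after it.
second_case = r.match("xyz123aabbaaab")
print(second_case is not None)
```

The prefix `.*?[^a]` places no restriction on earlier runs of 'a' followed by 'b', so the later `aaab` is still found.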
Re: Match First Sequence in Regular Expression?
Christos Georgiou [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> On Thu, 26 Jan 2006 18:01:07 +0100, rumours say that Fredrik Lundh [EMAIL PROTECTED] might have written:
>
>> Roger L. Cauvin wrote:
>>
>>> Good suggestion. Here are some test cases:
>>>
>>>     xyz123aaabbab accept
>>>     xyz123aabbaab reject
>>>     xayz123aaabab accept
>>>     xaaayz123abab reject
>>>     xaaayz123aaabab accept
>>
>>     $ more test.py
>>     [snip of code]
>>     m = re.search("aaab", string)
>>     [snip of more code]
>>
>>     $ python test.py
>>     got expected
>>     ---
>>     accept accept
>>     reject reject
>>     accept accept
>>     reject reject
>>     accept accept
>
> You're right, Fredrik, but we (graciously as a group :) also take notice of the other requirements that the OP has provided elsewhere and that are not covered by the simple test that he specified.

My fault, guys. The second test case should be

    xyz123aabbaaab reject

instead of

    xyz123aabbaab reject

Fredrik's pattern fails this test case.

-- Roger L. Cauvin
Re: Match First Sequence in Regular Expression?
Christos Georgiou [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> On Thu, 26 Jan 2006 17:09:18 GMT, rumours say that Roger L. Cauvin [EMAIL PROTECTED] might have written:
>
>> Thanks, but the second test case I listed contained a typo. It should have contained a sequence of three of the letter 'a'. The test cases should be:
>>
>>     xyz123aaabbab accept
>>     xyz123aabbaaab reject
>
> Here I object to either you or your need for a regular expression. You see, before the "aaa" in your second test case, you have an arbitrary sequence of characters, so your requirements are met.

Well, thank you for your efforts so far, Christos.

My purpose is to determine whether it's possible to do this using regular expressions, since my application is already architected around configuration files that use regular expressions. It may not be the best architecture, but I still don't know the answer to my question. Is it *possible* to fulfill my requirements with regular expressions, even if it's not the best way to do it?

The requirements are not met by your regular expression, since by definition the "arbitrary sequence of characters" stops once the sequences of a's and b's starts.

-- Roger L. Cauvin
Re: Match First Sequence in Regular Expression?
Fredrik Lundh [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> Roger L. Cauvin wrote:
>
>>>     $ python test.py
>>>     got expected
>>>     ---
>>>     accept accept
>>>     reject reject
>>>     accept accept
>>>     reject reject
>>>     accept accept
>>
>> Thanks, but the second test case I listed contained a typo. It should have contained a sequence of three of the letter 'a'. The test cases should be:
>>
>>     xyz123aaabbab accept
>>     xyz123aabbaaab reject
>>     xayz123aaabab accept
>>     xaaayz123abab reject
>>     xaaayz123aaabab accept
>>
>> Your pattern fails the second test.
>
>     $ more test.py
>
>     import re
>
>     print "got expected"
>     print "--"
>
>     testsuite = (
>         ("xyz123aaabbab", "accept"),
>         ("xyz123aabbaaab", "reject"),
>         ("xayz123aaabab", "accept"),
>         ("xaaayz123abab", "reject"),
>         ("xaaayz123aaabab", "accept"),
>     )
>
>     for string, result in testsuite:
>         m = re.search("a+b", string)
>         if m and len(m.group()) == 4:
>             print "accept",
>         else:
>             print "reject",
>         print result
>
>     $ python test.py
>     got expected
>     --
>     accept accept
>     reject reject
>     accept accept
>     reject reject
>     accept accept

Thanks, but I'm looking for a solution in terms of a regular expression only. In other words, "accept" means the regular expression matched, and "reject" means the regular expression did not match. I want to see if I can fulfill the requirements without additional code (such as checking "len(m.group())").

-- Roger L. Cauvin
Re: Match First Sequence in Regular Expression?
Michael Spencer [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> Roger L. Cauvin wrote:
>
>>     xyz123aaabbab accept
>>     xyz123aabbaaab reject
>>     xayz123aaabab accept
>>     xaaayz123abab reject
>>     xaaayz123aaabab accept
>
> This passes your tests. I haven't closely followed the thread for other requirements:
>
>     pattern = ".*?(?<![a+b])aaab" #look for aaab not preceded by any a+b

Very interesting. I think you may have solved the problem. The key seems to be the "not preceded by" part. I'm unfamiliar with some of the notation. Can you explain what "[a+b]" and the "(?<!" do?

-- Roger L. Cauvin
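As a partial answer to the notation question, here is a Python 3 sketch. It assumes the quoted pattern uses a negative lookbehind, written `(?<!...)` in Python's `re` syntax, which is what the "not preceded by" comment suggests. The probe strings `"a+bc"` and `"xy_y"` are made up for illustration.

```python
import re

# Inside square brackets '+' is a literal, so [a+b] matches exactly one
# character: 'a', '+' or 'b'. It does not mean "one or more a's, then b".
class_members = [c for c in "a+bc" if re.fullmatch(r"[a+b]", c)]

# (?<!x) is a zero-width negative lookbehind: it succeeds only where the
# preceding character is not 'x', and it consumes no input.
lookbehind_hits = [m.start() for m in re.finditer(r"(?<!x)y", "xy_y")]

# Assumed form of the quoted pattern, with the lookbehind spelled out.
pattern = re.compile(r".*?(?<![a+b])aaab")
cases = ["xyz123aaabbab", "xyz123aabbaaab", "xayz123aaabab",
         "xaaayz123abab", "xaaayz123aaabab"]
results = ["accept" if pattern.match(s) else "reject" for s in cases]

print(class_members, lookbehind_hits, results)
```

Read this way, the pattern accepts an `aaab` only when the character just before it is none of 'a', '+' or 'b', so the `aaab` cannot be the tail of a longer run of a's and cannot directly follow an earlier 'b'.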
Re: Match First Sequence in Regular Expression?
Tim Chase [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> The below seems to pass all the tests you threw at it (taking the modified 2nd test into consideration)
>
> One other test that occurs to me would be
>
>     xyz123aaabbaaabab
>
> where you have "aaab" in there twice.

Good suggestion.

>     ^([^b]|((?<!a)b))*aaab+[ab]*$

Looks good, although I've been unable to find a good explanation of the "negative lookbehind" construct "(?<!". How does it work?

-- Roger L. Cauvin
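Here is a Python 3 sketch of how the quoted pattern behaves, assuming the lookbehind is written `(?<!a)`. The prefix group `([^b]|((?<!a)b))*` consumes any character that is not 'b', or a 'b' whose preceding character is not 'a'. It therefore cannot step over an "a-run followed by b" before reaching `aaab`. The expected value for the "aaab twice" string is derived from the stated requirement rather than given in the thread, and the final string is a hypothetical extra case.

```python
import re

# Assumed form of the quoted pattern, with the lookbehind written as (?<!a).
pattern = re.compile(r"^([^b]|((?<!a)b))*aaab+[ab]*$")

cases = {
    "xyz123aaabbab": "accept",
    "xyz123aabbaaab": "reject",
    "xayz123aaabab": "accept",
    "xaaayz123abab": "reject",
    "xaaayz123aaabab": "accept",
    "xyz123aaabbaaabab": "accept",  # 'aaab' appears twice; first run is 3
}
results = {s: ("accept" if pattern.match(s) else "reject") for s in cases}
print(results == cases)

# Hypothetical extra case, not from the thread: a first run of four a's.
# [^b] in the prefix group can absorb the leading 'a', so this still matches.
four_as = pattern.match("xyz123aaaab")
print(four_as is not None)
```

If runs longer than three must also be rejected, one option (an untested-in-thread assumption) is a second lookbehind directly before `aaab`, i.e. `(?<!a)aaab`.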
Re: Match First Sequence in Regular Expression?
Peter Hansen [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> Roger L. Cauvin wrote:
>
>> Michael Spencer [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]
>>
>>> Roger L. Cauvin wrote:
>>>
>>>>     xyz123aaabbab accept
>>>>     xyz123aabbaaab reject
>>>>     xayz123aaabab accept
>>>>     xaaayz123abab reject
>>>>     xaaayz123aaabab accept
>>>
>>> This passes your tests. I haven't closely followed the thread for other requirements:
>>>
>>>     pattern = ".*?(?<![a+b])aaab" #look for aaab not preceded by any a+b
>>
>> Very interesting. I think you may have solved the problem. The key seems to be the "not preceded by" part. I'm unfamiliar with some of the notation. Can you explain what "[a+b]" and the "(?<!" do?
>
> I think you might need to add a test case involving a pattern of aaaab prior to another aaab. From what I gather (not reading too closely), you would want this to be rejected. Is that true?
>
>     xyz123aaaababaaabab

Adding that test would be a good idea. You're right; I would want that string to be rejected, since in that string the first sequence of 'a' directly preceding a 'b' is of length 4 instead of 3.

Thanks for the solution!

-- Roger L. Cauvin