Re: [Tutor] re.findall(), but with overlaps?
On Sat, 3 Sep 2005, Kent Johnson wrote:

> AFAIK that is the way to do it.

I may put in an enhancement request to change the name of re.findall to re.findsome.

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] re.findall(), but with overlaps?
On Sat, 3 Sep 2005, Kent Johnson wrote:

> But I would say your chances of getting the name changed are slim to none,
> the Python developers are extremely reluctant to make changes that break
> existing code.

Yeah, I know. I was mostly joking. Mostly.
Re: [Tutor] re.findall(), but with overlaps?
> I may put in an enhancement request to change the name of re.findall to
> re.findsome.

Hi Terry,

A typical use of regular expressions is to break text into a sequence of non-overlapping tokens. There's nothing that technically stops us from applying the theory of regular expressions to get overlapping matches, but that use case is rare enough that it probably won't get into the Standard Library anytime soon.

A third-party approach, writing customized code that allows overlaps, will probably work better. You may want to ask on comp.lang.python and see if someone else has had the need for overlapping matches; there might be other people who've run into that problem too.

I've helped to adapt a specialized pattern matcher for Python; not sure if this might interest you, but:

    http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

and the Aho-Corasick search automaton that I've adapted does do overlapping matches of keywords:

##
>>> import ahocorasick
>>> tree = ahocorasick.KeywordTree()
>>> for i in range(ord('A'), ord('Z') + 1):
...     tree.add('B' + chr(i) + 'B')
...
>>> tree.make()
>>> tree.findall('BABBEBIB', allow_overlaps = True)
<generator object at 0x403a9fec>
>>> list(tree.findall('BABBEBIB', allow_overlaps = True))
[(0, 3), (3, 6), (5, 8)]
##

The ahocorasick module doesn't provide full regexp power (and the example shows that I have to simulate wildcards... *grin*), but it might still be useful, depending on what you're really trying to do.

The link above also refers to Nicolas Nehuen's 'pytst' module, which might also be useful for you.

Best of wishes to you!
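For readers without the ahocorasick module installed, the same (start, end) spans can be reproduced with a naive stand-in built on str.find. This is only a sketch: it is not the Aho-Corasick algorithm (it rescans the text once per keyword), and the name overlapping_keyword_spans is made up for illustration, not part of any module mentioned above.

```python
def overlapping_keyword_spans(keywords, text):
    """Return sorted (start, end) spans of every keyword occurrence in text.

    Hypothetical helper: a naive stand-in, not the Aho-Corasick automaton.
    """
    spans = []
    for keyword in keywords:
        start = text.find(keyword)
        while start != -1:
            spans.append((start, start + len(keyword)))
            # Step forward by one character so overlapping hits are not skipped.
            start = text.find(keyword, start + 1)
    return sorted(spans)


# Same simulated wildcard as in the session above: 'BAB', 'BBB', ..., 'BZB'.
keywords = ["B" + chr(i) + "B" for i in range(ord("A"), ord("Z") + 1)]
print(overlapping_keyword_spans(keywords, "BABBEBIB"))
```

For a handful of keywords this is fine; with a large keyword set, a single-pass automaton like the one described above would be the better fit.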
[Tutor] re.findall(), but with overlaps?
A friend of mine got bitten by an expectations bug. He was using re.findall to look for all occurrences of strings matching a pattern, and a substring he *knew* was in there did not pop out. The bug was that it overlapped another matching substring, and findall only returns non-overlapping strings. This is documented; he just missed it.

But he asked me, is there a standard method to get even overlapped strings? Cut to its basics, here's an artificial example:

>>> import re
>>> rexp=re.compile("B.B")
>>> sequence="BABBEBIB"
>>> rexp.findall(sequence)
['BAB', 'BEB']

What he would have wanted was the list ['BAB', 'BEB', 'BIB']; but since the last 'B' in BEB is also the first 'B' in BIB, BIB is not picked up.

After looking through the docs, I couldn't find a way to do this in standard methods, so I gave him a quick RYO solution:

>>> def myfindall(regex, seq):
...    resultlist=[]
...    pos=0
...    while True:
...       result = regex.search(seq, pos)
...       if result is None:
...          break
...       resultlist.append(seq[result.start():result.end()])
...       pos = result.start()+1
...    return resultlist
...
>>> myfindall(rexp,sequence)
['BAB', 'BEB', 'BIB']

But just curious; are we reinventing the wheel here? Is there already a way to match even overlapping substrings? I'm surprised I can't find one.
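One alternative that is not raised in this thread is to wrap the pattern in a zero-width lookahead with a capturing group. Since the lookahead itself consumes no characters, findall tries every position in turn and returns whatever the group captured there. A minimal sketch, assuming the same pattern and sequence as above (the variable name lookahead is just for illustration):

```python
import re

sequence = "BABBEBIB"

# (?=...) matches without consuming text, so the scan advances one
# character at a time; the inner group captures the actual substring.
lookahead = re.compile(r"(?=(B.B))")

print(lookahead.findall(sequence))
```

Like the loop above, this reports at most one match per starting position.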
Re: [Tutor] re.findall(), but with overlaps?
Terry Carroll wrote:

> But he asked me, is there a standard method to get even overlapped strings?
>
> After looking through the docs, I couldn't find a way to do this in standard
> methods, so I gave him a quick RYO solution:
>
> >>> def myfindall(regex, seq):
> ...    resultlist=[]
> ...    pos=0
> ...    while True:
> ...       result = regex.search(seq, pos)
> ...       if result is None:
> ...          break
> ...       resultlist.append(seq[result.start():result.end()])
> ...       pos = result.start()+1
> ...    return resultlist
> ...
> >>> myfindall(rexp,sequence)
> ['BAB', 'BEB', 'BIB']
>
> But just curious; are we reinventing the wheel here? Is there already a way
> to match even overlapping substrings? I'm surprised I can't find one.

AFAIK that is the way to do it. You can shorten it a little by using result.group() instead of seq[result.start():result.end()].

Kent
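Put together, Kent's suggestion might look like the sketch below. The function name find_overlapping is made up for illustration, and the pos <= len(seq) guard is an addition that is not in the thread: without it, a pattern that can match the empty string would keep matching at the end of the sequence forever.

```python
import re


def find_overlapping(regex, seq):
    """Return every match of regex in seq, including overlapping ones."""
    results = []
    pos = 0
    # Guard (an addition, not from the thread): stop once pos runs past
    # the end, so patterns that match the empty string cannot loop forever.
    while pos <= len(seq):
        match = regex.search(seq, pos)
        if match is None:
            break
        # match.group() is the matched text, same as seq[start:end].
        results.append(match.group())
        # Resume one character past the start of this match, not its end.
        pos = match.start() + 1
    return results


rexp = re.compile("B.B")
print(find_overlapping(rexp, "BABBEBIB"))
```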