Re: [Tutor] re.findall(), but with overlaps?

2005-09-03 Thread Terry Carroll
On Sat, 3 Sep 2005, Kent Johnson wrote:

> AFAIK that is the way to do it.

I may put in an enhancement request to change the name of re.findall to 
re.findsome.

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] re.findall(), but with overlaps?

2005-09-03 Thread Terry Carroll
On Sat, 3 Sep 2005, Kent Johnson wrote:

> But I would say your chances of getting the name changed are slim to
> none; the Python developers are extremely reluctant to make changes that
> break existing code.

Yeah, I know.  I was mostly joking.

Mostly.



Re: [Tutor] re.findall(), but with overlaps?

2005-09-03 Thread Danny Yoo


> I may put in an enhancement request to change the name of re.findall to
> re.findsome.

Hi Terry,

A typical use of regular expressions is to break text into a sequence of
non-overlapping tokens.  There's nothing that technically stops us from
applying the theory of regular expressions to get overlapping matches, but
that use case is rare enough that it probably won't get into the Standard
Library anytime soon.  A third-party approach, writing customized code
that allows overlaps, will probably work better.
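
As an illustration of what such customized code might look like, here is a
minimal sketch that stays within the standard re module.  It is not a
technique proposed in this thread: it assumes that wrapping the pattern in
a zero-width lookahead with a capturing group is acceptable, and the helper
name find_overlapping is hypothetical.

```python
import re

def find_overlapping(pattern, text):
    """Return matches of pattern in text, including overlapping ones.

    Assumption: pattern is a regex source string.  It is wrapped in a
    lookahead so that each match consumes no characters, which lets the
    scanner try every start position in turn.
    """
    lookahead = re.compile('(?=(' + pattern + '))')
    # group(1) is the outer capturing group added by the wrapper above.
    return [m.group(1) for m in lookahead.finditer(text)]

print(re.findall('B.B', 'BABBEBIB'))        # plain findall, no overlaps
print(find_overlapping('B.B', 'BABBEBIB'))  # overlaps included
```

Like a search loop that restarts one character after each hit, this yields
at most one match per starting position.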

You may want to ask on comp.lang.python and see if someone else has had
the need for overlapping matches --- there might be other people who've
run into that problem too.

I've helped to adapt a specialized pattern matcher for Python; not sure if
this might interest you, but:

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

and the Aho-Corasick search automaton that I've adapted does do
overlapping matches of keywords:

##
>>> import ahocorasick
>>> tree = ahocorasick.KeywordTree()
>>> for i in range(ord('A'), ord('Z') + 1):
...     tree.add('B' + chr(i) + 'B')
...
>>> tree.make()
>>> tree.findall('BABBEBIB', allow_overlaps = True)
<generator object at 0x403a9fec>
>>> list(tree.findall('BABBEBIB', allow_overlaps = True))
[(0, 3), (3, 6), (5, 8)]
##

The ahocorasick module doesn't provide full regexp power (and the example
shows that I have to simulate wildcards... *grin*), but it might still be
useful, depending on what you're really trying to do.  The link above also
refers to Nicolas Lehuen's 'pytst' module, which might also be useful for
you.
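
For readers who do not have the ahocorasick module installed, the same
(start, end) output can be reproduced with a plain-Python stand-in.  This
is only a sketch: it is not the Aho-Corasick algorithm (it rescans the text
once per keyword rather than building an automaton), and the function name
find_keyword_spans is made up for illustration.

```python
def find_keyword_spans(keywords, text):
    """Return sorted (start, end) spans of every keyword occurrence.

    Overlapping occurrences are kept, because each search resumes one
    character after the previous hit instead of after its end.
    """
    spans = []
    for keyword in keywords:
        start = text.find(keyword)
        while start != -1:
            spans.append((start, start + len(keyword)))
            start = text.find(keyword, start + 1)
    return sorted(spans)

# Same simulated wildcard as above: 'BAB', 'BBB', ..., 'BZB'.
keywords = ['B' + chr(i) + 'B' for i in range(ord('A'), ord('Z') + 1)]
print(find_keyword_spans(keywords, 'BABBEBIB'))
```

For a handful of short keywords this is fine; with many keywords or long
texts, a real automaton such as the one in the module above is the better
fit.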

Best of wishes to you!



[Tutor] re.findall(), but with overlaps?

2005-09-02 Thread Terry Carroll

A friend of mine got bitten by an expectations bug.  He was using
re.findall to look for all occurrences of strings matching a pattern, and a
substring he *knew* was in there did not pop out.

The bug was that it overlapped another matching substring, and findall
only returns non-overlapping strings.  This is documented; he just missed
it.

But he asked me, is there a standard method to get even overlapped
strings?

Cut to its basics, here's an artificial example:

>>> import re
>>> rexp = re.compile('B.B')
>>> sequence = 'BABBEBIB'
>>> rexp.findall(sequence)
['BAB', 'BEB']

What he would have wanted was the list ['BAB', 'BEB', 'BIB']; but since
the last 'B' in 'BEB' is also the first 'B' in 'BIB', 'BIB' is not picked
up.

After looking through the docs, I couldn't find a way to do this in 
standard methods, so I gave him a quick RYO solution:

>>> def myfindall(regex, seq):
...     resultlist = []
...     pos = 0
...     while True:
...         result = regex.search(seq, pos)
...         if result is None:
...             break
...         resultlist.append(seq[result.start():result.end()])
...         pos = result.start() + 1
...     return resultlist
...
>>> myfindall(rexp, sequence)
['BAB', 'BEB', 'BIB']

But just curious; are we reinventing the wheel here?  Is there already a 
way to match even overlapping substrings?  I'm surprised I can't find one.



Re: [Tutor] re.findall(), but with overlaps?

2005-09-02 Thread Kent Johnson
Terry Carroll wrote:
> But he asked me, is there a standard method to get even overlapped
> strings?
>
> After looking through the docs, I couldn't find a way to do this in
> standard methods, so I gave him a quick RYO solution:
>
> >>> def myfindall(regex, seq):
> ...     resultlist = []
> ...     pos = 0
> ...     while True:
> ...         result = regex.search(seq, pos)
> ...         if result is None:
> ...             break
> ...         resultlist.append(seq[result.start():result.end()])
> ...         pos = result.start() + 1
> ...     return resultlist
> ...
> >>> myfindall(rexp, sequence)
> ['BAB', 'BEB', 'BIB']
>
> But just curious; are we reinventing the wheel here?  Is there already a
> way to match even overlapping substrings?  I'm surprised I can't find one.

AFAIK that is the way to do it.  You can shorten it a little by using 
result.group() instead of seq[result.start():result.end()].
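
Spelled out as a standalone script, the shortened version might look like
the sketch below.  The loop bound is an addition that is not in the
original function: without it, a pattern that can match the empty string
would keep matching at the end of the sequence forever.

```python
import re

def myfindall(regex, seq):
    """Collect every match of a compiled regex, overlapping ones included."""
    resultlist = []
    pos = 0
    # Added guard (an assumption, not in the original): stop once pos
    # runs past the end, so empty matches cannot loop forever.
    while pos <= len(seq):
        result = regex.search(seq, pos)
        if result is None:
            break
        # result.group() is the matched text, same as the slice it replaces.
        resultlist.append(result.group())
        # Restart one character after the match start, not after its end.
        pos = result.start() + 1
    return resultlist

rexp = re.compile('B.B')
print(myfindall(rexp, 'BABBEBIB'))
```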

Kent
