Re: Overlapping Regular Expression Matches With findall()

Bengt Richter Thu, 15 Dec 2005 13:30:50 -0800

On Thu, 15 Dec 2005 20:33:42 +0000, Simon Brunning <[EMAIL PROTECTED]> wrote:


>On 15 Dec 2005 12:26:07 -0800, Mystilleef <[EMAIL PROTECTED]> wrote:
>> I want a pattern that scans the entire string but avoids
>> returning duplicate matches. For example "cat", "cate",
>> "cater" may all well be valid matches, but I don't want
>> duplicate matches of any of them. I know I can filter the
>> list containing found matches myself, but that is somewhat
>> expensive for a list containing thousands of matches.
>
>Probably the cheapest way of de-duping the list would be to dump it
>straight into a set, provided that you aren't concerned about the
>order.
>
Or if concerned, maybe try a combination like:

 >>> s = """\
 ... I want a pattern that scans the entire string but avoids
 ... returning duplicate matches. For example "cat", "cate",
 ... "cater" may all well be valid matches, but I don't want
 ... duplicate matches of any of them. I know I can filter the
 ... list containing found matches myself, but that is somewhat
 ... expensive for a list containing thousands of matches.
 ... """
 >>> import re
 >>> rxo = re.compile(r'cat(?:er|e)?')
 >>> rxo.findall(s)
 ['cate', 'cat', 'cate', 'cater', 'cate']
 >>> seen = set()
 >>> [w for w in (m.group(0) for m in rxo.finditer(s))
 ...  if w not in seen and not seen.add(w)]
 ['cate', 'cat', 'cater']
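
For readers on a newer interpreter, the same order-preserving de-dupe
can probably be written without the explicit "seen" set. This is only a
sketch: it assumes Python 3.7 or later, where dicts are guaranteed to
keep insertion order, and the helper name unique_matches and the short
sample text are made up for illustration.

```python
import re

# Shortened sample text (illustrative, not the full message above).
text = 'duplicate "cat", "cate", "cater" duplicate'
rxo = re.compile(r'cat(?:er|e)?')

def unique_matches(pattern, s):
    """Return each distinct match once, in first-seen order.

    Hypothetical helper; relies on dict insertion order (Python 3.7+).
    """
    return list(dict.fromkeys(m.group(0) for m in pattern.finditer(s)))

result = unique_matches(rxo, text)
print(result)
```

The matches are still consumed lazily from finditer(), so no full list
with duplicates is built first.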

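BTW, note that when one alternative is a prefix of another, the longer
one should come first in the re. With r'cat(?:e|er)?' the shorter 'e' is
tried first and succeeds, so "cater" would never be matched in full in
the example above.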
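
A small sketch of that ordering effect, using two hypothetical pattern
names (longest_first, shortest_first) that differ only in the order of
the alternatives:

```python
import re

# Same alternatives, different order inside the optional group.
longest_first = re.compile(r'cat(?:er|e)?')
shortest_first = re.compile(r'cat(?:e|er)?')

word = 'cater'

# Alternatives are tried left to right and the first that lets the
# overall match succeed wins, so 'e' shadows 'er' in shortest_first.
a = longest_first.findall(word)
b = shortest_first.findall(word)

print(a)  # ['cater']
print(b)  # ['cate']
```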

Regards,
Bengt Richter
-- 
http://mail.python.org/mailman/listinfo/python-list
