Re: fast regex

MRAB Sat, 08 May 2010 07:30:35 -0700

Bryan wrote:

Tim Chase wrote:

James wrote:

[Tim had written:]

If the keys in your word_list are more than just words, then the
regexp may not find them all, and thus not replace them all.  In
that case you may have to resort to my 2nd regexp which builds
the 5k branch regexp from your actual dictionary keys:

 r = re.compile(r'\b(%s)\b' % (
   '|'.join(re.escape(s) for s in words_list.keys())
   ),
   re.IGNORECASE)

This method on the above dictionary (modified)

   d = {
     'hello': 'goodbye',
     'world': 'python',
     'stuff with spaces?': 'tadah!',
     }

would create a regexp of

   \b(about|world|stuff\ with\ spaces\?)\b



There's a subtle issue of the order of the keywords. Pythons
(ir)regular expressions do not strictly guarantee to find the longest
match. For example,

re.findall(r'\b(about|about\-faced)\b',
        'does about-faced match?')

returns ['about'], which is the first complete match, not the longest.
Sorting the keywords into longest-first order avoids the problem.

This has considerable performance implications as len(word_list)
grows, unless you can figure a way to determine that some
replacements are more probable than others and push them to the
front of this regexp, but that's more complex and requires
knowledge of your word-list.


There is another method which is to factor out common prefixes, so the
re module's search-and-backtrack engine never has a choice of multiple
paths. Instead of:

    \b(abolished|abolition)\b

we'd use:

    \b(aboli(shed|tion)\b

The RE-generator is of course more complex than Tim's one-liner, but I
happen to have code, which I'll include below. It's probably academic,
as I'd agree with Tim that his first solution is the better candidate
at this point.

With my giant prefix-combined RE's, I can re.compile one with 4000
words randomly chosen from a Unix words file, but 5000 results in
"regular expression code size limit exceeded". Tim's version which
doesn't combine prefixes tops out a little lower. This is on 32-bit
Windows, standard distribution. One could, of course, build multiple
smaller RE's, but that starts to suck.

There's an alternative regex module at:

    http://bugs.python.org/issue2636

and also at:

    http://pypi.python.org/pypi/regex/0.1.2010414

It looks for common prefixes and suffixes internally and can handle much
larger regexes.
--
http://mail.python.org/mailman/listinfo/python-list

Re: fast regex

Reply via email to