New module: Regexp::Trie

David Landgren Wed, 14 Jul 2004 14:29:34 -0700

Hello,

I gave a talk at the French Perl Workshop in June about some work I was doing to produce really large (i.e. length($re) > 50000) regular expressions for Postfix access maps. (Postfix can be compiled with the PCRE library). A number of people expressed interest in the approach and wondered if and when it would be available as a module on CPAN.

The idea is that sometimes you have a large set of regular expressions (e.g. 2000), and you want to test whether a string matches any of them. You don't particularly care *which* expression matches, the fact that one matches is sufficient. Brute forcing it with a loop is not very efficient. Concatenating them with | is not efficient either, if a large subset of the expressions share a common prefix.

I know about Jarkko's Regex::PreSuf and another module whose name escapes me this instant, but they both suffer from the limitation of not being metacharacter-aware. For instance:

use Regex::PreSuf;
print presuf(qw(a\d+foo a\D+123));

produces:

a\\(?:D\+123|d\+foo)

The module I'm developing works with variable length tokens, and thus deals with the above correctly:

use Regexp::Trie;

my $rt = Regexp::Trie->new;
$rt->add( qw/a \\d+ f o o/ );
$rt->add( qw/a \\D+ 1 2 3/ );
print $rt->re;

produces:

a(?:\D+123|\d+foo)

(modulo me getting the backslashes escaped correctly here -- the algorithm does the right thing).

The above example contains (IMO) too much make-work code, so I'm planning on distributing a number of helper packages as well, which will take care of the general cases. I'm thinking of something like

my $rt = Regexp::Trie->new(
        Regexp::Trie::Lex::simple(qw( a\\d+foo a \\D+123 ))
);
print $rt->re;

I.e., the Regexp::Trie::Lex::* namespace, whose packages simply have to return an object that contains an 'add' method. The Regexp::Trie::new constructor will simply take the returned object and call its 'add' method until exhaustion, and fill up its own internal structure.

As another example, I have a set of regexps that should never contain a bare . (dot) metachar. To do so is an error. Writing a seperate lexer package for this allows such error-checking to take place.

I spoke with Nicholas Clark about the item in the latest perltodo, specifically:

<quote>
=head2 common suffices/prefices in regexps (trie optimization)

Currently, the user has to optimize C<foo|far> and C<foo|goo> into
C<f(?:oo|ar)> and C<[fg]oo> by hand; this could be done automatically.
</quote>

This is apparently a non-trivial undertaking to do in core and he suggested I pursue the release of this module regardless (I'm targetting 5.005_03 as a baseline anyway).

If I haven't put you to sleep by now, I have the following questions:

1. Has this been done before (i.e. shoot me now and put me out of my misery).

2. Is Regexp::Trie a good name? (I fall into the "regexp is spelt with a p" camp, but if Regex is preferred that's fine by me. I can never remember which, if either, is deprecated).

3. Is the lexer namespace a good idea? Or is there a better way do to this? I'm open to any design suggestions on this issue since nothing is written yet.

Thanks for reading this far,
David Landgren

New module: Regexp::Trie

Reply via email to