Hello,
I gave a talk at the French Perl Workshop in June about some work I was doing to produce really large (i.e. length($re) > 50000) regular expressions for Postfix access maps. (Postfix can be compiled with the PCRE library). A number of people expressed interest in the approach and wondered if and when it would be available as a module on CPAN.
The idea is that sometimes you have a large set of regular expressions (e.g. 2000), and you want to test whether a string matches any of them. You don't particularly care *which* expression matches, the fact that one matches is sufficient. Brute forcing it with a loop is not very efficient. Concatenating them with | is not efficient either, if a large subset of the expressions share a common prefix.
I know about Jarkko's Regex::PreSuf and another module whose name escapes me this instant, but they both suffer from the limitation of not being metacharacter-aware. For instance:
use Regex::PreSuf; print presuf(qw(a\d+foo a\D+123));
produces:
a\\(?:D\+123|d\+foo)
The module I'm developing works with variable length tokens, and thus deals with the above correctly:
use Regexp::Trie;
my $rt = Regexp::Trie->new; $rt->add( qw/a \\d+ f o o/ ); $rt->add( qw/a \\D+ 1 2 3/ ); print $rt->re;
produces:
a(?:\D+123|\d+foo)
(modulo me getting the backslashes escaped correctly here -- the algorithm does the right thing).
The above example contains (IMO) too much make-work code, so I'm planning on distributing a number of helper packages as well, which will take care of the general cases. I'm thinking of something like
my $rt = Regexp::Trie->new( Regexp::Trie::Lex::simple(qw( a\\d+foo a \\D+123 )) ); print $rt->re;
I.e., the Regexp::Trie::Lex::* namespace, whose packages simply have to return an object that contains an 'add' method. The Regexp::Trie::new constructor will simply take the returned object and call its 'add' method until exhaustion, and fill up its own internal structure.
As another example, I have a set of regexps that should never contain a bare . (dot) metachar. To do so is an error. Writing a seperate lexer package for this allows such error-checking to take place.
I spoke with Nicholas Clark about the item in the latest perltodo, specifically:
<quote> =head2 common suffices/prefices in regexps (trie optimization)
Currently, the user has to optimize C<foo|far> and C<foo|goo> into C<f(?:oo|ar)> and C<[fg]oo> by hand; this could be done automatically. </quote>
This is apparently a non-trivial undertaking to do in core and he suggested I pursue the release of this module regardless (I'm targetting 5.005_03 as a baseline anyway).
If I haven't put you to sleep by now, I have the following questions:
1. Has this been done before (i.e. shoot me now and put me out of my misery).
2. Is Regexp::Trie a good name? (I fall into the "regexp is spelt with a p" camp, but if Regex is preferred that's fine by me. I can never remember which, if either, is deprecated).
3. Is the lexer namespace a good idea? Or is there a better way do to this? I'm open to any design suggestions on this issue since nothing is written yet.
Thanks for reading this far, David Landgren