Re: [Boston.pm] Q: giant-but-simple regex efficiency

Ted Zlatanov Sat, 05 Feb 2011 20:03:09 -0800

On Sat, 05 Feb 2011 18:27:13 -0500 Charlie <creit...@rcn.com> wrote:


C> The sample program below runs in 00:09:04 on 1.15GB  (1024 copies of
C> Moby Dick).   Replacing the hard-coded map with 2 entries with 6000
C> words taken from the text (randomly selected, unique, >5 chars)  runs
C> in 00:09:17.   I.e. the map lookup is trivial and has little impact on
C> the overall performance.   A trie might do just a bit better than a
C> hash table, but you are optimizing the wrong part of the program.

C> my %keywords = ( Foo => 1, Bar => 2 );

C> while ( <> )
C> {
C>    my $line = '';
C>    my @tokens = split /\s/;
C>    for my $tkn ( @tokens )
C>    {
C>        if ( '' ne $line ) {
C>          $line .= ' ';
C>        }
C>        if ( exists $keywords{$tkn} ) {
C>          $line .= 'prefix_';
C>        }
C>        $line .= $tkn;
C>    }
C>    print $line, "\n";
C> }

You assumed that \s will delimit the tokens.  That's not the case (see
the original message, the interesting data can occur anywhere).  So you
can't tokenize and do a simple hash lookup.  If you benchmark 6000
index() calls vs. 2 index() calls you'll see why a trie makes sense.
You may benefit if you tokenize, which I mentioned in my message:

TZ> If you know a boundary (e.g. newline or space or \b or \W) which
TZ> always demarcates your strings, you can split your input on that and
TZ> feed the pieces to Tree::Trie.

There is often a significant benefit to optimizing this kind of work in
C, as Text::Match::FastAlternatives does, because you may be able to
keep the code in the tight inner loop from busting the instruction
cache.  The performance benefits are huge.

I attach a benchmark of trie vs. fastmatch.  I commented out the index()
benchmark because it was ridiculously slow, on the order of 23 per
second on a fairly recent CPU.  Text::Match::FastAlternatives seems the
best choice, as advertised.

Ted

#!/usr/bin/perl

use warnings;
use strict;
use Data::Dumper;
use Benchmark qw/cmpthese/;
use Text::Match::FastAlternatives;
use Regexp::Trie;

my $text = 'abcd ' x 100;

my @words = 'a' .. 'hzz';

my $fast = Text::Match::FastAlternatives->new(@words);

my $rt = Regexp::Trie->new;
$rt->add($_) foreach @words;
my $rtr = $rt->regexp();
my $word_count = scalar @words;

cmpthese(5000000,
         {
          # "$word_count keywords, index()" => sub
          # {
          #  my $found;
          #  foreach (@words)
          #  {
          #   $found += index $text, $_, 0;
          #  }
          # },

          "$word_count keywords, fastmatch" => sub
          {
           return $fast->match($text)
          },

          "$word_count keywords, trie" => sub
          {
           return ($text =~ m/$rtr/);
          },
         });
__DATA__
                              Rate  6110 keywords, trie 6110 keywords, fastmatch
6110 keywords, trie       638570/s                   --                     -73%
6110 keywords, fastmatch 2325581/s                 264%                       --

_______________________________________________
Boston-pm mailing list
Boston-pm@mail.pm.org
http://mail.pm.org/mailman/listinfo/boston-pm

Re: [Boston.pm] Q: giant-but-simple regex efficiency

Reply via email to