On Sat, 05 Feb 2011 18:27:13 -0500 Charlie <creit...@rcn.com> wrote:
C> The sample program below runs in 00:09:04 on 1.15GB (1024 copies of C> Moby Dick). Replacing the hard-coded map with 2 entries with 6000 C> words taken from the text (randomly selected, unique, >5 chars) runs C> in 00:09:17. I.e. the map lookup is trivial and has little impact on C> the overall performance. A trie might do just a bit better than a C> hash table, but you are optimizing the wrong part of the program. C> my %keywords = ( Foo => 1, Bar => 2 ); C> while ( <> ) C> { C> my $line = ''; C> my @tokens = split /\s/; C> for my $tkn ( @tokens ) C> { C> if ( '' ne $line ) { C> $line .= ' '; C> } C> if ( exists $keywords{$tkn} ) { C> $line .= 'prefix_'; C> } C> $line .= $tkn; C> } C> print $line, "\n"; C> } You assumed that \s will delimit the tokens. That's not the case (see the original message, the interesting data can occur anywhere). So you can't tokenize and do a simple hash lookup. If you benchmark 6000 index() calls vs. 2 index() calls you'll see why a trie makes sense. You may benefit if you tokenize, which I mentioned in my message: TZ> If you know a boundary (e.g. newline or space or \b or \W) which TZ> always demarcates your strings, you can split your input on that and TZ> feed the pieces to Tree::Trie. There is often a significant benefit to optimizing this kind of work in C, as Text::Match::FastAlternatives does, because you may be able to keep the code in the tight inner loop from busting the instruction cache. The performance benefits are huge. I attach a benchmark of trie vs. fastmatch. I commented out the index() benchmark because it was ridiculously slow, on the order of 23 per second on a fairly recent CPU. Text::Match::FastAlternatives seems the best choice, as advertised. Ted #!/usr/bin/perl use warnings; use strict; use Data::Dumper; use Benchmark qw/cmpthese/; use Text::Match::FastAlternatives; use Regexp::Trie; my $text = 'abcd ' x 100; my @words = 'a' .. 'hzz'; my $fast = Text::Match::FastAlternatives->new(@words); my $rt = Regexp::Trie->new; $rt->add($_) foreach @words; my $rtr = $rt->regexp(); my $word_count = scalar @words; cmpthese(5000000, { # "$word_count keywords, index()" => sub # { # my $found; # foreach (@words) # { # $found += index $text, $_, 0; # } # }, "$word_count keywords, fastmatch" => sub { return $fast->match($text) }, "$word_count keywords, trie" => sub { return ($text =~ m/$rtr/); }, }); __DATA__ Rate 6110 keywords, trie 6110 keywords, fastmatch 6110 keywords, trie 638570/s -- -73% 6110 keywords, fastmatch 2325581/s 264% -- _______________________________________________ Boston-pm mailing list Boston-pm@mail.pm.org http://mail.pm.org/mailman/listinfo/boston-pm