On Fri, 4 Feb 2011 18:43:57 -0500 Kripa Sundar <kripa.sun...@synopsys.com> wrote:
KS> I have a 900 Meg text file, containing random text. I also have a list KS> of 6000 names (alphanumeric strings) that occur in the random text. KS> I need to tag a prefix on to each occurrence of each of these 6000 KS> names. KS> My premise: KS> I believe a regex would give the simplest and most efficient algorithm. KS> If I am mistaken, I would be happy to learn. KS> Solution attempt: KS> I built a large-but-simple regex, consisting of all the names in KS> alternation. I applied this regex to each input line. KS> My code: KS> 1: my @names = [...]; # my 6000 names. KS> 2: my $regex = join "|", @names; KS> 3: $regex = qr/\b($regex)\b/; KS> 4: KS> 5: # Read the input, and write out to all the copies simultaneously. KS> 6: while (<>) { KS> 7: s/$regex/prefix_$1/g; KS> 8: } On CPAN you can find Regex::Trie which builds this regex in a more optimized way (http://search.cpan.org/~dankogai/Regexp-Trie-0.02/lib/Regexp/Trie.pm). This is the most general solution. There's also Tree::Trie (http://search.cpan.org/~avif/Tree-Trie-1.5/Trie.pm) which is for general text matching. If you know a boundary (e.g. newline or space or \b or \W) which always demarcates your strings, you can split your input on that and feed the pieces to Tree::Trie. You have to benchmark to find out if that's faster that Regex::Trie (which claims to be built for speed). This is a more specific solution than Regex::Trie. I also found http://aaroncrane.co.uk/2008/05/text_match_fastalternatives/ which benchmarks a C implementation called Text::Match::FastAlternatives to be very fast. Incidentally it claims that Perl 5.10 has specific optimizations for your use case and that Text::Match::FastAlternatives beats all the alternatives in speed. Text::Match::FastAlternatives only handles ASCII characters and is the most single-purpose solution of the three. So try it if it fits, it's also on CPAN :) If you decide on one in particular, please let us know what you found. Ted _______________________________________________ Boston-pm mailing list Boston-pm@mail.pm.org http://mail.pm.org/mailman/listinfo/boston-pm