Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-06 Thread Greg London
is faster. but but not tokenizing gives your grammers more flexibility I think. -Original message- From: Charlie To: Ted Zlatanov Cc: boston...@pm.org Sent: Sun, Feb 6, 2011 16:49:56 GMT+00:00 Subject: Re: [Boston.pm] Q: giant-but-simple regex efficiency Given how you frame the problem

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-06 Thread belg4mit
Too bad Text::Match::FastAlternatives's return values aren't more useful i.e; the matched position. This with a /g equivalent and Bob's your uncle. ___ Boston-pm mailing list Boston-pm@mail.pm.org http://mail.pm.org/mailman/listinfo/boston-pm

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-06 Thread Ted Zlatanov
On Sun, 06 Feb 2011 11:49:56 -0500 Charlie wrote: C> Given how you frame the problem, then the hash lookup isn't even an C> option! No question, 6000+ string searches will be slow vs. a trie. C> Given the varying requirements we all encounter, day-to-day, I think C> this is an interesting

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-06 Thread Charlie
Given how you frame the problem, then the hash lookup isn't even an option! No question, 6000+ string searches will be slow vs. a trie. Given the varying requirements we all encounter, day-to-day, I think this is an interesting exercise. Thanks for sharing these modules, Ted. The OP

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-06 Thread belg4mit
>You assumed that \s will delimit the tokens. That's not the case (see >the original message, the interesting data can occur anywhere). So you >can't tokenize and do a simple hash lookup. If you benchmark 6000 Acutally I believe the OP said that there were still delimters required, they just

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-06 Thread Greg London
it to a tree search, return the coderef. Is there a name for this kind of behavior? wondering what i would call the subroutine. Greg -Original message- From: Charlie To: boston...@pm.org Sent: Sat, Feb 5, 2011 23:27:13 GMT+00:00 Subject: Re: [Boston.pm] Q: giant-but-simple regex efficiency

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-05 Thread Ted Zlatanov
On Sat, 05 Feb 2011 18:27:13 -0500 Charlie wrote: C> The sample program below runs in 00:09:04 on 1.15GB (1024 copies of C> Moby Dick). Replacing the hard-coded map with 2 entries with 6000 C> words taken from the text (randomly selected, unique, >5 chars) runs C> in 00:09:17. I.e. the

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-05 Thread Charlie
Short answer, no, Perl regex will not build an optimal lookup of a token into your set of 6000 names. In general, if speed is the issue, do not use regex. It does not scale. Also, be clear on the 2 problems at hand: 1) tokenizing 1GB of input text and 2) adding a prefix to identified

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-05 Thread Bill Ricker
On Sat, Feb 5, 2011 at 3:43 PM, Alex Vandiver wrote: > Since we're talking about literals, this hasn't been true since 2007, > with the release of perl 5.10.  Perl now uses a Aho-Corasick trie > algorithm internally for literal alternations, which allows for > matching without backtracking: Aha,

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-05 Thread Alex Vandiver
At Fri Feb 04 18:53:09 -0500 2011, Uri Guttman wrote: > that will kill your cpu. alternations are very slow since they have to > go back and try from the beginning of the list each time. Since we're talking about literals, this hasn't been true since 2007, with the release of perl 5.10. Perl now

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-05 Thread Uri Guttman
> "MP" == Martyn Peck writes: MP> What's wrong with something like this: MP> while($line=<>){ MP> foreach my $name (@names){ MP> $line ~= s/$name/prefix_$1/g; MP> } MP> } it is O( N^2 ) which is very slow for large data sets. MP> I know it seems

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-05 Thread Martyn Peck
hi Ok, I've been reading over the responses you've been getting and I just have to ask everyone. What's wrong with something like this: while($line=<>){ foreach my $name (@names){ $line ~= s/$name/prefix_$1/g; } } I know it seems kind of

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-04 Thread Greg London
m.org" Sent: Sat, Feb 5, 2011 00:53:35 GMT+00:00 Subject: Re: [Boston.pm] Q: giant-but-simple regex efficiency Thanks for the prompt replies, folks! Unfortunately, my names can be embedded in larger "words" of the input text, as long as they are delimited by certain punctuation. If

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-04 Thread Kripa Sundar
Thanks for the prompt replies, folks! Unfortunately, my names can be embedded in larger "words" of the input text, as long as they are delimited by certain punctuation. If I can figure out all of the permitted punctuation, I will try out a split() and a hash lookup. But Regex::Trie seems more

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-04 Thread Ted Zlatanov
On Fri, 4 Feb 2011 18:10:08 -0600 Ted Zlatanov wrote: TZ> On CPAN you can find Regex::Trie which builds this regex in a more TZ> optimized way TZ> (http://search.cpan.org/~dankogai/Regexp-Trie-0.02/lib/Regexp/Trie.pm). TZ> This is the most general solution. Oh, and for the Emacs fans, this is

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-04 Thread Ted Zlatanov
On Fri, 4 Feb 2011 18:43:57 -0500 Kripa Sundar wrote: KS> I have a 900 Meg text file, containing random text. I also have a list KS> of 6000 names (alphanumeric strings) that occur in the random text. KS> I need to tag a prefix on to each occurrence of each of these 6000 KS> names. KS> My

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-04 Thread Uri Guttman
> "KS" == Kripa Sundar writes: KS> I have a 900 Meg text file, containing random text. I also have a list KS> of 6000 names (alphanumeric strings) that occur in the random text. KS> I need to tag a prefix on to each occurrence of each of these 6000 KS> names. KS> My premise:

[Boston.pm] Q: giant-but-simple regex efficiency

2011-02-04 Thread Kripa Sundar
Hi folks, Problem: I have a 900 Meg text file, containing random text. I also have a list of 6000 names (alphanumeric strings) that occur in the random text. I need to tag a prefix on to each occurrence of each of these 6000 names. My premise: I believe a regex would give the simplest and most