is faster. but but not tokenizing gives
your grammers more flexibility I think.
-Original message-
From: Charlie
To: Ted Zlatanov
Cc: boston...@pm.org
Sent: Sun, Feb 6, 2011 16:49:56 GMT+00:00
Subject: Re: [Boston.pm] Q: giant-but-simple regex efficiency
Given how you frame the problem
Too bad Text::Match::FastAlternatives's return values aren't more useful
i.e; the matched position. This with a /g equivalent and Bob's your uncle.
___
Boston-pm mailing list
Boston-pm@mail.pm.org
http://mail.pm.org/mailman/listinfo/boston-pm
On Sun, 06 Feb 2011 11:49:56 -0500 Charlie wrote:
C> Given how you frame the problem, then the hash lookup isn't even an
C> option! No question, 6000+ string searches will be slow vs. a trie.
C> Given the varying requirements we all encounter, day-to-day, I think
C> this is an interesting
Given how you frame the problem, then the hash lookup isn't even an
option! No question, 6000+ string searches will be slow vs. a
trie. Given the varying requirements we all encounter, day-to-day, I
think this is an interesting exercise. Thanks for sharing these modules, Ted.
The OP
>You assumed that \s will delimit the tokens. That's not the case (see
>the original message, the interesting data can occur anywhere). So you
>can't tokenize and do a simple hash lookup. If you benchmark 6000
Acutally I believe the OP said that there were still delimters required,
they just
it to a tree
search, return the coderef.
Is there a name for this kind of behavior? wondering what i would call the
subroutine.
Greg
-Original message-
From: Charlie
To: boston...@pm.org
Sent: Sat, Feb 5, 2011 23:27:13 GMT+00:00
Subject: Re: [Boston.pm] Q: giant-but-simple regex efficiency
On Sat, 05 Feb 2011 18:27:13 -0500 Charlie wrote:
C> The sample program below runs in 00:09:04 on 1.15GB (1024 copies of
C> Moby Dick). Replacing the hard-coded map with 2 entries with 6000
C> words taken from the text (randomly selected, unique, >5 chars) runs
C> in 00:09:17. I.e. the
Short answer, no, Perl regex will not build an optimal lookup of a token
into your set of 6000 names. In general, if speed is the issue, do not use
regex. It does not scale.
Also, be clear on the 2 problems at hand: 1) tokenizing 1GB of input text
and 2) adding a prefix to identified
On Sat, Feb 5, 2011 at 3:43 PM, Alex Vandiver wrote:
> Since we're talking about literals, this hasn't been true since 2007,
> with the release of perl 5.10. Perl now uses a Aho-Corasick trie
> algorithm internally for literal alternations, which allows for
> matching without backtracking:
Aha,
At Fri Feb 04 18:53:09 -0500 2011, Uri Guttman wrote:
> that will kill your cpu. alternations are very slow since they have to
> go back and try from the beginning of the list each time.
Since we're talking about literals, this hasn't been true since 2007,
with the release of perl 5.10. Perl now
> "MP" == Martyn Peck writes:
MP> What's wrong with something like this:
MP> while($line=<>){
MP> foreach my $name (@names){
MP> $line ~= s/$name/prefix_$1/g;
MP> }
MP> }
it is O( N^2 ) which is very slow for large data sets.
MP> I know it seems
hi
Ok, I've been reading over the responses you've been getting and I just
have to ask everyone.
What's wrong with something like this:
while($line=<>){
foreach my $name (@names){
$line ~= s/$name/prefix_$1/g;
}
}
I know it seems kind of
m.org"
Sent: Sat, Feb 5, 2011 00:53:35 GMT+00:00
Subject: Re: [Boston.pm] Q: giant-but-simple regex efficiency
Thanks for the prompt replies, folks!
Unfortunately, my names can be embedded in larger "words" of the input
text, as long as they are delimited by certain punctuation.
If
Thanks for the prompt replies, folks!
Unfortunately, my names can be embedded in larger "words" of the input
text, as long as they are delimited by certain punctuation.
If I can figure out all of the permitted punctuation, I will try out
a split() and a hash lookup.
But Regex::Trie seems more
On Fri, 4 Feb 2011 18:10:08 -0600 Ted Zlatanov wrote:
TZ> On CPAN you can find Regex::Trie which builds this regex in a more
TZ> optimized way
TZ> (http://search.cpan.org/~dankogai/Regexp-Trie-0.02/lib/Regexp/Trie.pm).
TZ> This is the most general solution.
Oh, and for the Emacs fans, this is
On Fri, 4 Feb 2011 18:43:57 -0500 Kripa Sundar
wrote:
KS> I have a 900 Meg text file, containing random text. I also have a list
KS> of 6000 names (alphanumeric strings) that occur in the random text.
KS> I need to tag a prefix on to each occurrence of each of these 6000
KS> names.
KS> My
> "KS" == Kripa Sundar writes:
KS> I have a 900 Meg text file, containing random text. I also have a list
KS> of 6000 names (alphanumeric strings) that occur in the random text.
KS> I need to tag a prefix on to each occurrence of each of these 6000
KS> names.
KS> My premise:
Hi folks,
Problem:
I have a 900 Meg text file, containing random text. I also have a list
of 6000 names (alphanumeric strings) that occur in the random text.
I need to tag a prefix on to each occurrence of each of these 6000
names.
My premise:
I believe a regex would give the simplest and most
18 matches
Mail list logo