Bowie Bailey wrote:
[EMAIL PROTECTED] wrote:
While I doubt it'd have quite the performance gains that A-C can
offer, Regexp::Assemble certainly sounds like something worth
trying...
The coderef trick, in particular, is very nifty.
I forgot to mention this in the other thread I just replied to: if you've
downloaded the package, look at eg/ircwatcher for a slightly mindless
demo of the tracked mode. If you have a copy of O'Reilly's _Perl Hacks_,
a much more fleshed-out demo appears there.
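For anyone who has not seen tracked mode: the idea is that a single combined pattern can still report which of the original patterns matched, and that answer can then be used to look up a handler. The following is only a rough Python analogy, not how Regexp::Assemble does it internally. The named groups p0, p1, ... and the handler table are made up for illustration; the three patterns are the samples quoted further down in this message.

```python
import re

# Sample patterns from this thread, copied verbatim.
PATTERNS = [
    r"\d+-\d+-\d+-\d+\.netabc\.com\.br",
    r"a\d+[abc]\d+\.neo\.lrun\.com",
    r"\d+\.\d+\.\d+\.\d+\.adsl\.abc\.tiscali\.dk",
]

def assemble_tracked(patterns):
    """Join patterns into one alternation, one named group per pattern.

    The group names p0, p1, ... are an invention of this sketch.
    """
    parts = [f"(?P<p{i}>{p})" for i, p in enumerate(patterns)]
    return re.compile("|".join(parts))

TRACKED = assemble_tracked(PATTERNS)

def which_pattern(host):
    """Return the original pattern that matched host, or None."""
    m = TRACKED.search(host)
    if m is None:
        return None
    # lastgroup is the name of the group that matched, e.g. "p1".
    return PATTERNS[int(m.lastgroup[1:])]

# Hypothetical dispatch table: original pattern -> callable.
HANDLERS = {p: (lambda host, p=p: f"reject {host} (matched {p})") for p in PATTERNS}

def dispatch(host):
    """Call the handler for whichever pattern matched, if any."""
    pat = which_pattern(host)
    return None if pat is None else HANDLERS[pat](host)

print(dispatch("a12b345.neo.lrun.com"))
print(dispatch("mail.example.org"))
```

This relies on the sample patterns having no named groups of their own; patterns that do would need different bookkeeping.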
It can work well. After reading about it here, I tried it on one of
my programs that compares about 1600 words and phrases against a
document. My scan time dropped by 30%. This doesn't count the time
taken to assemble the regex (about 0.27 seconds), but since this
program runs as a daemon and only has to do the assembly once, it
wasn't relevant to me.
Here's some background that people may find interesting.
I have a Postfix access map that is currently an assembly of 4145
patterns that correspond to residential broadband DNS names.
Patterns like
\d+-\d+-\d+-\d+\.netabc\.com\.br
a\d+[abc]\d+\.neo\.lrun\.com
\d+\.\d+\.\d+\.\d+\.adsl\.abc\.tiscali\.dk
to match DNS names like
host217-34-41-132.in-addr.btopenworld.com
dsl-200-67-157-162.prodigy.net.mx
host80-39.pool212171.interbusiness.it
bgp01069788bgs.vnburn01.mi.comcast.net
cpe-68-112-253-235.ma.charter.com
adsl-68-73-64-222.dsl.klmzmi.ameritech.net
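To make the matching concrete, here is a small Python sketch that runs the three sample patterns against a few host names. The host names in HOSTS are hypothetical, invented so that each sample pattern has something to match; the real names listed above are caught by other patterns in the full set, which is not reproduced here.

```python
import re

# The three sample patterns quoted above, copied verbatim.
PATTERNS = [
    r"\d+-\d+-\d+-\d+\.netabc\.com\.br",
    r"a\d+[abc]\d+\.neo\.lrun\.com",
    r"\d+\.\d+\.\d+\.\d+\.adsl\.abc\.tiscali\.dk",
]

# Hypothetical host names, made up for illustration only.
HOSTS = [
    "200-10-20-30.netabc.com.br",
    "a12b345.neo.lrun.com",
    "1.2.3.4.adsl.abc.tiscali.dk",
    "mail.example.org",
]

def matching_pattern(host, patterns=PATTERNS):
    """Return the first pattern that matches host, or None if none does."""
    for pat in patterns:
        if re.search(pat, host):
            return pat
    return None

for host in HOSTS:
    print(host, "->", matching_pattern(host))
```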
At first this was to catch spam. An unexpected side effect, which I'm
happy about, is that it also discards connections during virus storms. I
never even accept the DATA, much less overload my AV scanner.
Anyway, when I started out, I noticed the performance of the Postfix
server dropping through the floor. So I wound up writing
Regexp::Assemble. Now, instead of going through a list of patterns,
Postfix goes through one. (I had to recompile pcre and raise the
LINK_SIZE #define so that pcre could compile a pattern this large.)
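The "one pattern instead of a list" idea can be sketched in a few lines of Python. Note the big caveat: this is a naive alternation that simply joins the patterns with |. Regexp::Assemble does considerably more than that (it merges what the patterns have in common), so treat this as an illustration of the interface, not of the module. The function name assemble and the test host names are my own inventions here.

```python
import re

# Sample patterns from this message, copied verbatim.
PATTERNS = [
    r"\d+-\d+-\d+-\d+\.netabc\.com\.br",
    r"a\d+[abc]\d+\.neo\.lrun\.com",
    r"\d+\.\d+\.\d+\.\d+\.adsl\.abc\.tiscali\.dk",
]

def assemble(patterns):
    """Naively combine patterns into a single compiled alternation.

    Each pattern is wrapped in (?:...) so that any | inside a pattern
    cannot leak into its neighbours.
    """
    return re.compile("|".join(f"(?:{p})" for p in patterns))

COMBINED = assemble(PATTERNS)

def is_listed(host):
    """One search against the combined pattern replaces a loop over the list."""
    return COMBINED.search(host) is not None

# Hypothetical host names, for illustration only.
print(is_listed("200-10-20-30.netabc.com.br"))
print(is_listed("mail.example.org"))
```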
Running a test on a small (1000) sample of host names speaks eloquently:
% perl5.9.4 racmp host.1k
assembled 4145 patterns in 3.83324813842773 seconds
R::A: good = 971, bad = 29 in 0.0148990154266357 seconds
list: good = 971, bad = 29 in 5.72843599319458 seconds
A_C: good = 971, bad = 29 in 8.56000709533691 seconds
RA len: 87491
A_C len: 174644
That is, the assembled approach runs in roughly 1/380th of the time of
looping through the list of REs. A_C is even worse. Given that its
pattern is about twice as long and chock full of metacharacters, I
suppose this shouldn't come as a surprise, but it does seem odd. I'll
check back with Yves and see if my methodology is sane there.
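For anyone who wants to repeat this kind of comparison, the shape of the test is roughly as follows: classify the same host names with each approach, check that the counts agree, and time each approach separately. This Python sketch is not the racmp script. It assumes "good" means "matched some pattern", uses a naive alternation rather than a real assembly, and runs on a made-up four-name sample, so its timings say nothing about the figures above.

```python
import re
import time

# Sample patterns from this message, copied verbatim.
PATTERNS = [
    r"\d+-\d+-\d+-\d+\.netabc\.com\.br",
    r"a\d+[abc]\d+\.neo\.lrun\.com",
    r"\d+\.\d+\.\d+\.\d+\.adsl\.abc\.tiscali\.dk",
]

# Hypothetical sample of host names, for illustration only.
HOSTS = [
    "200-10-20-30.netabc.com.br",
    "a12b345.neo.lrun.com",
    "1.2.3.4.adsl.abc.tiscali.dk",
    "mail.example.org",
]

def count_with_list(compiled, hosts):
    """Loop over every pattern for every host; return (good, bad)."""
    good = sum(1 for h in hosts if any(rx.search(h) for rx in compiled))
    return good, len(hosts) - good

def count_with_single(rx, hosts):
    """One search per host against the combined pattern; return (good, bad)."""
    good = sum(1 for h in hosts if rx.search(h))
    return good, len(hosts) - good

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

compiled_list = [re.compile(p) for p in PATTERNS]
single = re.compile("|".join(f"(?:{p})" for p in PATTERNS))

list_result, list_secs = timed(count_with_list, compiled_list, HOSTS)
single_result, single_secs = timed(count_with_single, single, HOSTS)

# The timings only mean something if both approaches classify identically.
assert list_result == single_result

print(f"single: good = {single_result[0]}, bad = {single_result[1]} in {single_secs} seconds")
print(f"list:   good = {list_result[0]}, bad = {list_result[1]} in {list_secs} seconds")
```

The agreement check is the part worth keeping: identical good/bad counts across approaches, as in the output above, are what make the timings comparable.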
David