Re: body speedups using new features in perl 5.9.x

2006-07-12 Thread David Landgren

David Landgren wrote:

[...]


% perl5.9.4 racmp host.1k
assembled 4145 patterns in 3.83324813842773 seconds
R::A: good = 971, bad = 29 in 0.0148990154266357 seconds
list: good = 971, bad = 29 in 5.72843599319458 seconds
 A_C: good = 971, bad = 29 in 8.56000709533691 seconds
RA  len: 87491
A_C len: 174644

That is, the assembled approach runs in under 1/500th of the time of 
looping through the list of REs. A_C is even worse, but given that the 
pattern is over twice as long, and chock full of metacharacters I 
suppose this shouldn't come as a surprise but it does seem odd. I'll 
check back with Yves and see if my methodology is sane there.


I was doing something silly with the A-C test that killed the optimiser. 
Fixing that up made the blead time descend from 8.56 seconds down to 
0.43. In a nutshell, I was doing /^ab|^cd|^ef/ instead of /^(?:ab|cd|ef)/.
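The two forms accept exactly the same strings; the rewrite only changes what the optimiser can do with the pattern. A quick sanity check of the equivalence (sketched in Python's re for convenience, since the equivalence itself is engine-independent; the trie optimisation is specific to blead perl):

```python
import re

slow = re.compile(r"^ab|^cd|^ef")    # anchor repeated in every branch
fast = re.compile(r"^(?:ab|cd|ef)")  # one anchor, grouped alternation

# both forms accept and reject exactly the same strings
for s in ["abxx", "cdxx", "efgh", "xxab", "zz"]:
    assert bool(slow.search(s)) == bool(fast.search(s))
```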


David
--
hope still, a little resistance always maybe stubborn tiny lights vs. 
clustering darkness forever ok?


Re: body speedups using new features in perl 5.9.x

2006-07-12 Thread David Landgren

Justin Mason did write:

David Landgren writes:

[...]

That is, the assembled approach runs in under 1/500th of the time of 
looping through the list of REs. A_C is even worse, but given that the 
pattern is over twice as long, and chock full of metacharacters I 
suppose this shouldn't come as a surprise but it does seem odd. I'll 
check back with Yves and see if my methodology is sane there.


"perl5.9.4" -- that's using bleadperl, right?


Yes. I wanted to check out the Aho-Corasick matching. I haven't been 
following super closely, but I'm pretty sure Yves said that it doesn't 
deal well with meta-characters. These results tend to bear that out.



Can you post the output of "use re qw(debug); /regexp/"?  I'd
like to see what the structure looks like from the R::A regexp -- 500
times is a stunning speedup.


The uncompressed output is 1.7Mb, so I've bzipped it. You can pick it up 
here:


  http://www.landgren.net/ re.debug.txt.bz2

(remove the space in the middle: I've done it that way to stop robots 
from stumbling over it via a list archive and wasting bandwidth).


What may help you follow what's going on more easily is looking at

  http://www.landgren.net/ re.indented.txt.bz2

(again, lose the space) which is the result of the assembly, indented. 
You can put it directly into a m/.../x and it will work; it's legal Perl.


The speedup comes from the fact that once you hit the engine, for all 
the legit hostnames like 'mail.example.com', only the first few 
characters will be examined and then you hit a fail state. There is no 
backtracking left on the stack, so the entire match fails at that point, 
and 99% of the regop tree is never consulted. Whereas the 
loop-over-a-list approach will try the pattern, and the next, and the 
next...


To put it another way, it's not so much that it matches quickly, but 
rather it fails quickly. Of course, patterns that contain .*? sequences 
will require backtracking, but that's just the nature of the beast.
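A minimal sketch of the two approaches being compared (Python's re for illustration; the two patterns are stand-ins I made up, not entries from the real access map):

```python
import re

# a handful of host-name patterns, stand-ins for the assembled set
patterns = [r"dsl-\d+-\d+\.example\.net", r"cpe-\d+\.cable\.example\.org"]

# loop approach: each pattern gets its own full scan of the subject
def match_loop(host):
    return any(re.fullmatch(p, host) for p in patterns)

# assembled approach: one alternation, one scan; a legit host name
# falls off the shared prefix early and the whole match fails at once
combined = re.compile("(?:" + "|".join(patterns) + ")")
def match_assembled(host):
    return combined.fullmatch(host) is not None

for h in ["dsl-1-2.example.net", "mail.example.com"]:
    assert match_loop(h) == match_assembled(h)
```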


David
--
"It's overkill of course, but you can never have too much overkill."


Re: body speedups using new features in perl 5.9.x

2006-07-12 Thread David Landgren

Bowie Bailey wrote:

[EMAIL PROTECTED] wrote:

While I doubt it'd have quite the performance gains that A-C can
offer, Regexp::Assemble certainly sounds like something worth
trying... 
the coderef trick, in particular, is very nifty.


Forgot to mention in the other thread I just replied to, if you've 
downloaded the package, look at eg/ircwatcher for a slightly mindless 
demo of the tracked mode. If you have a copy of O'Reilly's _Perl Hacks_, 
a much more fleshed-out demo appears in there.



It can work well.  After reading about it here, I tried it on one of
my programs that compares about 1600 words and phrases against a
document.  My scan time dropped by 30%.  This doesn't count the time
taken to assemble the regex (about .27 seconds), but since this
program runs as a daemon and only has to do the assembly once, it
wasn't relevant to me.


Here's some background that people may find interesting.

I have a Postfix access map that is an assembly of currently 4145 
patterns that correspond to residential broadband DNS names.


Patterns like

\d+-\d+-\d+-\d+\.netabc\.com\.br
a\d+[abc]\d+\.neo\.lrun\.com
\d+\.\d+\.\d+\.\d+\.adsl\.abc\.tiscali\.dk

to match DNS names like

host217-34-41-132.in-addr.btopenworld.com
dsl-200-67-157-162.prodigy.net.mx
host80-39.pool212171.interbusiness.it
bgp01069788bgs.vnburn01.mi.comcast.net
cpe-68-112-253-235.ma.charter.com
adsl-68-73-64-222.dsl.klmzmi.ameritech.net

At first this was to catch spam; now I'm happy that an unexpected 
side-effect is that it discards connections during virus storms. I never 
even accept the DATA, much less overload my AV scanner.


Anyway, when I started out, I noticed the performance of the Postfix 
server dropping through the floor. So I wound up writing 
Regexp::Assemble. Now, instead of going through a list of patterns, it 
goes through one. (I had to recompile pcre and up the LINK_TYPE #define 
so that pcre could compile the pattern).


Running a test on a small (1000) sample of host names speaks eloquently:

% perl5.9.4 racmp host.1k
assembled 4145 patterns in 3.83324813842773 seconds
R::A: good = 971, bad = 29 in 0.0148990154266357 seconds
list: good = 971, bad = 29 in 5.72843599319458 seconds
 A_C: good = 971, bad = 29 in 8.56000709533691 seconds
RA  len: 87491
A_C len: 174644

That is, the assembled approach runs in under 1/500th of the time of 
looping through the list of REs. A_C is even worse, but given that the 
pattern is over twice as long, and chock full of metacharacters I 
suppose this shouldn't come as a surprise but it does seem odd. I'll 
check back with Yves and see if my methodology is sane there.


David
--
Much of the propaganda that passes for news in our own society is given 
to immobilising and pacifying people and diverting them from the idea 
that they can confront power. -- John Pilger




Re: body speedups using new features in perl 5.9.x

2006-07-12 Thread David Landgren

Stuart Johnston wrote:

Bowie Bailey wrote:

[EMAIL PROTECTED] wrote:

While I doubt it'd have quite the performance gains that A-C can
offer, Regexp::Assemble certainly sounds like something worth
trying... the coderef trick, in particular, is very nifty.


It can work well.  After reading about it here, I tried it on one of
my programs that compares about 1600 words and phrases against a
document.  My scan time dropped by 30%.  This doesn't count the time
taken to assemble the regex (about .27 seconds), but since this
program runs as a daemon and only has to do the assembly once, it
wasn't relevant to me.



Wouldn't assembling all rules into one regex make it impossible to have 
per rule scores?


No. Using tracked mode, you recover the original pattern, use that as a 
key into a hash that gives you the name of the rule, and then update a 
scoring hash keyed on the rule name.


After the body scan, you look at what keys have been set in your scoring 
hash and you have your base set of rules that have been triggered. You 
would then coalesce them as usual according to the various meta rules.
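The recover-the-pattern, look-up-the-rule, bump-the-score loop can be sketched like this (Python's re for illustration, with named groups standing in for Regexp::Assemble's tracked mode; the rules and scores are made up):

```python
import re

# hypothetical: original pattern -> rule name, rule name -> score
rule_name = {r"v[i1]agra": "DRUG_OBFU", r"click here": "CLICK_HERE"}
score = {"DRUG_OBFU": 2.5, "CLICK_HERE": 0.8}

# name each alternative after its rule, so the rule that fired
# can be recovered directly from the match
combined = re.compile(
    "|".join(f"(?P<{name}>{pat})" for pat, name in rule_name.items())
)

tally = {}
for m in combined.finditer("please click here for v1agra"):
    name = m.lastgroup                    # which rule fired
    tally[name] = tally.get(name, 0) + score[name]

# tally now holds the per-rule scores, ready for the meta-rule pass
```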


David.




Re: body speedups using new features in perl 5.9.x

2006-07-11 Thread David Landgren

Justin Mason wrote:

There's an interesting discussion on my weblog at
http://taint.org/2006/07/07/184022a.html , regarding the
recently-contributed "trie" and Aho-Corasick regexp optimizations, which
have been added to perl 5.9.x.

These could provide a way to get *serious* speedups in SpamAssassin.


Note that these work really well on fixed text. Once you start adding 
STAR, PLUS and CURLY the gain diminishes.



If anyone is interested in working on speedups, this would be a really
great place to do it; body rules are by far the biggest CPU time sink in
our code.


One source of slowness is that a lot of SA pattern writers use 
(?=[abcd]) zero-width lookahead assertions in the belief, I think, that 
they supply more information to the engine, but even in older perls 
this was always a loss. You need something like 95-99% of strings 
failing to match /[abcd]/ for the benefit to kick in; at least, that 
has been my experience.


I think the biggest problem is the fact that the scanning algorithm 
consists of applying one pattern after the other to see what it triggers.


I wrote Regexp::Assemble to allow one to take a large slab of patterns 
and concoct a single expression (also structured as a trie, 
incidentally) as a result. The benefit is that all the common paths are 
shared, so all the expensive curlies and stars are visited only once.


For instance, with /a+b+c+/ and /a+c+e+c+/ the result is 
/a+(c+e+|b+)c+/, so if you are given 'fc', you don't have to try the 
first pattern, fail, and then try the second pattern and fail. As soon as the 
alternation in the middle of the resulting pattern above fails, there's 
no backtracking and so you stop. When you have three thousand patterns 
instead of two, this starts to become very interesting.
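The factoring is exact: the assembled pattern accepts precisely the union of what the two originals accept. A quick check (Python's re for illustration; the patterns are the ones from the paragraph above):

```python
import re

# the two original patterns and their assembled form
p1, p2 = re.compile(r"a+b+c+"), re.compile(r"a+c+e+c+")
merged = re.compile(r"a+(c+e+|b+)c+")

# merged accepts exactly what either original accepts
for s in ["aabbcc", "aaccecc", "fc", "abc", "acec"]:
    either = bool(p1.fullmatch(s)) or bool(p2.fullmatch(s))
    assert either == bool(merged.fullmatch(s))
```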


Regexp::Assemble also has a mode whereby you can build up a hash of 
pattern => coderef pairs, assemble all the patterns into a single 
pattern, and then feed that the body text.


This then means that your entire scan of the body just becomes a single 
//g pattern match, and each time something matches, you can go back to 
the dispatch table and call out the action according to what just 
matched. The more patterns you have, the more efficient it becomes.
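The dispatch-table idea can be sketched as follows (Python's re standing in for the Perl, with named groups playing the role of the coderef lookup; the rules and the callback mechanics are mine, not Regexp::Assemble's API):

```python
import re

# hypothetical rules: pattern -> action (the "coderef trick")
hits = []
rules = {
    r"viagra": lambda m: hits.append("DRUG_SPAM"),
    r"\$\d+ million": lambda m: hits.append("NIGERIAN"),
}

# name each branch so a match can be traced back to its rule
combined = re.compile(
    "|".join(f"(?P<r{i}>{p})" for i, p in enumerate(rules))
)
actions = list(rules.values())

def scan(body):
    # one pass over the body; dispatch per match via the group name
    for m in combined.finditer(body):
        actions[int(m.lastgroup[1:])](m)

scan("send $5 million for viagra")
```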


I'd be happy to explain this in more detail if this sounds like 
something you want to explore.


David




Re: Fw: spamc error using perl

2006-06-29 Thread David Landgren

Alberto Iovino wrote:

Hi


[...]


the process start correctly but if I do the same with spamc
 
perl -T /usr/local/bin/spamc --syslog-socket=inet -d
 
I get the following error
 
*Unrecognized character \x7F at /usr/local/bin/spamc line 1*


Hello,

as I already explained in the perl bug report you filed, this is because 
spamc is not a Perl application; it is a binary that is executed 
directly by the kernel. At the bare minimum, you need to change the 
above line to


  /usr/local/bin/spamc --syslog-socket=inet -d

(that is, drop the 'perl -T' from the command).


Can you explain why?


You are asking perl to execute a datafile it is unable to parse. People 
sometimes disparage Perl as being akin to line noise, but this is 
literally what you are trying to get perl to execute in this instance.


Try running 'perl /bin/ls'. You will see that it achieves about the same 
level of success.



Is there another valid option? Should I modify the Makefile.PL?
What can I do to make this process run? 


Just run it directly.

Regards,
David



Regards
 


Alberto Iovino
Sistemi Informatici
Chelab srl
Via Fratta 25
31023 Resana (TV)
mail [EMAIL PROTECTED] 
Tel 0423/717980
Fax 0423/715058








Re: The Future of Email is SQL

2006-06-13 Thread David Landgren

Jim C. Nasby wrote:

On Sat, Jun 10, 2006 at 01:23:35PM -0600,  wrote:
I would defer to the smart people to figure out the details. However I do 
wonder if the actual body content of the message would be best stored in a 
file and the SQL used to store anything and everything you would want to 
index. That would keep the SQL file size down if that's an issue. However, 
SQL databases might have to be changed to accommodate the need to store 
email.
I think this is what I was getting at early in the thread.  I would think 
that a 5 MB body would do better on file but I don't know enough in regards 
to DBs to even make a call.


A good rule of thumb about storing something in the database is: are you
going to search that data? If you're going to search the text of an
email body, that makes it a more likely candidate for storing it in the
database (though there are ways to do this searching while storing the
file externally).


SQL databases suck dead dogs through a straw on full text searches. The 
language specification isn't designed for it. Database vendors offer 
support for it in various mutually-incompatible ways.


It's easy to precalc search indices for maildir and produce fast search 
results; mairix is one such tool that I know of, there are no doubt many 
other solutions available.


David





Re: Random Header Construct

2006-05-31 Thread David Landgren

Dan wrote:
Thought you might enjoy this unintended peek at the construction of 
randomized header variables:





  (SMTPD32-8.15) id ACE84FB700A2; Wed, 31 May 2006 06:16:08 -0400
Received: from [221.7.214.83]
by
id 1FlNkA-00029J-ST
for [EMAIL PROTECTED]; Wed, 31 May 2006 03:16:05 -0700
X-Message-Info: 
%RNDDIGIT13%RNDLCCHAR13%RNDUCCHAR12%RNDLCCHAR13%RNDUCCHAR13%RNDDIGIT13%RNDUCCHAR13%RNDLCCHAR15%RNDDIGIT13%RNDLCCHAR13%RNDUCCHAR13%RNDLCCHAR13%RNDUCCHAR13%RNDDIGIT13%RNDUCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDUCCHAR13%RNDLCCHAR13%RNDUCCHAR13%RNDDIGIT13 

Received: from dns%RNDDIGIT13yahoo.com (216.80.250.105) by 
%RNDLCCHAR13%RNDDIGIT13-%RNDLCCHAR13%RNDDIGIT12.yahoo.com with Microsoft 


I actually have postfix header checks that hunt down and kill this sort 
of stuff. It's commoner than you think. Well no, quite infrequent, the 
test is cheap...


David






Re: Advanced regex question - backtracking vs. negative lookahead s

2006-04-21 Thread David Landgren

Bowie Bailey wrote:

[...]


An alternative solution would be this:

/style="[^>]+color:blue/


This looks better.  It is probably less resource-intensive than your
previous attempt and is definitely easier to read.  But why are you
looking for > when you anchor the beginning with a quote?

How about this:

/style="[^"]+?color:blue/

This is also non-greedy, so it will start looking for the "color:blue"
match at the beginning of the string instead of having the + slurp up
everything up to the quote and then backtracking to find the match.


The regexp engine doesn't slurp. It just scans from left to right, 
noting "I might have to come back here" along the way.



For SA purposes, you may want to limit the search as well.

/style="[^"]{1,20}?color:blue/

This way, it will stop looking after 20 characters.  This prevents it
from using lots of memory if the quotes aren't closed.


Good point.
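The bound turns a potentially unbounded scan into a cheap, fixed-cost failure. Illustrated (Python's re, which handles the bounded lazy quantifier the same way; the sample strings are mine):

```python
import re

# bounded, lazy version: give up after 20 chars of style text
bounded = re.compile(r'style="[^"]{1,20}?color:blue')

# a nearby color:blue is found; one 30 chars away is abandoned
assert bounded.search('style="a;color:blue">')
assert not bounded.search('style="' + "x" * 30 + 'color:blue')
```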


But this will certainly involve some backtracking, especially if
there is even more text after the "color:blue" but before the
closing > character, for example the "font-size:small" text.


No it won't. It will scan once and quit. It never encountered any other 
alternatives that would require backtracking.


David



Re: Advanced regex question - backtracking vs. negative lookaheads

2006-04-21 Thread David Landgren

Jeremy Fairbrass wrote:

[...]


So one possible solution would be the following:

/style="(.(?!color))+.color:blue/


Eeep!

In other words, after the first " (quote mark) it looks for any character 
NOT followed by the word "color", and repeats that with the + character, 
until it gets to the actual word "color". I believe this results in no (or 
almost no?) backtracking. But I'm not sure if it's resource-intensive 
anyway, because of the negative lookahead - are negative lookaheads 
particularly resource intensive, when compared to backtracking? Is one 
preferable over the other?


An alternative solution would be this:

/style="[^>]+color:blue/

But this will certainly involve some backtracking,


False.

especially if there is 
even more text after the "color:blue" but before the closing > character, 
for example the "font-size:small" text.


Irrelevant. If you get a '>', then you didn't find what you were looking 
for. If you get a "color:blue", you did. In either case, the regexp 
quits right there.


You can use perl from the command line to get a good idea of the 
differences:


# all on one line

perl -Mre=debug -e 'q{} =~ /style="(.(?!color))+.color:blue/'


The above produces scads of output: you can see it inching down the 
string char by char. On the other hand,


perl -Mre=debug -e 'q{} =~ /style="[^>]+color:blue/'


is *fast*. Slightly less good is

/style=".*?color:blue/
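All three patterns accept the same inputs; the difference is purely how much work the engine does on the way. A sketch of that equivalence (Python's re for illustration, as its backtracking engine behaves comparably here; the sample strings are mine):

```python
import re

# the three candidate patterns from this exchange
lookahead = re.compile(r'style="(.(?!color))+.color:blue')
charclass = re.compile(r'style="[^>]+color:blue')
lazy      = re.compile(r'style=".*?color:blue')

# they accept the same inputs; only the amount of work differs
for s in ['style="font-size:small;color:blue">', 'style="color:red">']:
    results = {bool(p.search(s)) for p in (lookahead, charclass, lazy)}
    assert len(results) == 1  # all three agree
```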

David



Re: RFC: ruleset to tag French-language hoax virus warnings

2006-04-05 Thread David Landgren

Rick Macdougall wrote:

David Landgren wrote:

List,

if you have a significant number of French-language users, I'd love to 
have some feedback on a ruleset to catch the endless virus hoax 
messages. Things like "the worst ever according to CNN", "only 
discovered yesterday, no rememdy, according to McAfee", "burns the 
sector zero of your hard disk" and that sort of stuff, except in French.


  http://www.landgren.net/sa/

It works okay for me, although I should point out that I draw the line 
at 8.0, so you might want to reduce the scores a bit if you cut off at 5.0.


David



Strange, I'm the Network admin for an ISP in Montreal, Quebec and I've 
never ever seen anything like that (about 30k email accounts, 90% French).


Maybe it's geographically limited to France, I dunno. We have about 1000 
accounts, and the help desk regularly gets hit with panicky consultants 
wondering whether they should forward it to the entire company and/or 
their address book.


There's also the jdbgmgr teddy bear virus hoax, but it hasn't come up on 
my radar for quite some time now, so I lack a suitable corpus.


David







RFC: ruleset to tag French-language hoax virus warnings

2006-04-05 Thread David Landgren

List,

if you have a significant number of French-language users, I'd love to 
have some feedback on a ruleset to catch the endless virus hoax 
messages. Things like "the worst ever according to CNN", "only 
discovered yesterday, no remedy, according to McAfee", "burns the 
sector zero of your hard disk" and that sort of stuff, except in French.


  http://www.landgren.net/sa/

It works okay for me, although I should point out that I draw the line 
at 8.0, so you might want to reduce the scores a bit if you cut off at 5.0.


David



Re: Rule help ... if one rule matched, ignore another

2006-04-02 Thread David Landgren

mouss wrote:

David Gibbs wrote:

Folks:


[...]


My particular example ...

I want to create a rule that will assign a specific score if the subject
contains the word 'euromillion', but have a lower score if the subject
contains 'million'.

Obviously if I put two separate rules with the 'euromillion' rule just
adding to the score set by 'million' it would work... but I would rather
have only one rule listed in the spam header.



use a META rule.



Or a zero-width negative look-behind assertion:

  # million not preceded by euro
  /(?<!euro)million/

David
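The "million not preceded by euro" lookbehind can be sketched as follows (Python's re, whose lookbehind syntax matches Perl's; the sample subjects are mine):

```python
import re

# 'million' only counts when it is not preceded by 'euro'
plain_million = re.compile(r"(?<!euro)million")

assert plain_million.search("win a million dollars")
assert not plain_million.search("euromillion jackpot")
```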


Re: CheapTickets newsletter triggering SARE_BAYES plus others

2006-03-14 Thread David Landgren

Chris Purves wrote:

Loren Wilton wrote:


The other rule is looking for a really standard spammer trick:
.


Interesting.  How is this helpful to spammers?


Indeed. This used to crop up regularly in MS-Frontpage circa 1998 when 
people added and then removed markup. Dunno if that is still the case. I 
suspect many HTML editing tools will leave cruft like this lying around.


So some legitimate HTML e-mail (I know, contradiction in terms) is 
likely to suffer.


David



Contextual rules in SA

2006-03-02 Thread David Landgren

Hello list,

I've seen a sharp increase in spam that is really easy to identify, but 
I'm struggling to get SA to do so. One example is


The To: header matches:

  /^([EMAIL PROTECTED])@(?:regexp-of-my-local-domains)$/

The Subject matches:

  /^Fw: Discount for (\S+)/

and the $1 capture in both these patterns is the same, in an 'eq' 
sense. Finally, the message body is a multipart message, purporting to 
be a forwarded message, and in the body of the message there is


  /^Subject: (\S+)$/

and once again the captured pattern is the same as the previous two. If 
those three conditions match up, that's worth 3 to 5 straight away. And 
there's other garbage to improve the score.
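The three-way capture comparison being described could be sketched like this (Python for illustration; the function name, the example domains standing in for "regexp-of-my-local-domains", and the sample values are all mine):

```python
import re

# sketch of the three-way capture comparison
def looks_like_discount_spam(to_hdr, subject, body):
    m1 = re.match(r"([^@]+)@(?:example\.com|example\.org)", to_hdr)
    m2 = re.match(r"Fw: Discount for (\S+)", subject)
    m3 = re.search(r"^Subject: (\S+)$", body, re.MULTILINE)
    # all three must capture the same string, in an eq sense
    return bool(m1 and m2 and m3
                and m1.group(1) == m2.group(1) == m3.group(1))

assert looks_like_discount_spam(
    "bob@example.com",
    "Fw: Discount for bob",
    "From: x\nSubject: bob\n",
)
```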


If I could write a contextual rule that remembers those three items and 
then makes a decision, I'd be a happy man. I think I need to write a 
plugin, since you can't capture stuff with basic rules and then 
manipulate the captures (at least, not in any rulesets I've seen). Plus, 
the captured pattern is the LHS of a local email address, so I could 
refine things even further if necessary by doing a recipient lookup.


I'm sure people have already done this sort of thing, so I tried 
searching for ideas to steal but came up short. So... does anyone have 
some ideas (or code) they can point out to me?


I know how to do this in a policy server for Postfix, but I'd rather do 
it in SA if possible.


Thanks,
David