Re: more efficient big scoring

2008-01-23 Thread Justin Mason

To clarify -- here's how the current code orders rule evaluation:

- message metadata is extracted.

- header DNSBL tests are started.

- the decoded forms of the body text are extracted and cached.

- the URIs in the message body are extracted and cached.

- it iterates through each known priority level defined in the active
  ruleset, from lowest to highest, and:

  - checks to see if it's shortcircuited; if it has, it breaks the loop

  - calls the 'check_rules_at_priority' plugin hook

  - runs all head and head-eval rules defined for that priority level
  - checks to see if there are network rules to harvest
  - runs all body and body-eval rules defined for that priority level
  - checks to see if there are network rules to harvest
  - runs all uri rules defined for that priority level
  - checks to see if there are network rules to harvest
  - runs all rawbody and rawbody-eval rules defined for that priority
level
  - checks to see if there are network rules to harvest
  - runs all full and full-eval rules defined for that priority level
  - checks to see if there are network rules to harvest
  - runs all meta rules defined for that priority level (note: if the
meta rules depend on a network rule, this may block until that rule
completes)
  - checks to see if there are network rules to harvest

  - calls the check_tick plugin hook

- finally, it waits for any remaining unharvested network rules (if it
  hasn't shortcircuited)

- calls the check_post_dnsbl plugin hook

- auto-learns from the message, if applicable

- calls the check_post_learn plugin hook

- and returns


In 3.2.x and 3.3.0 this is all in the Check plugin, in the check_main()
method, so it can be redefined or overridden with alternative orderings
quite easily.
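
For illustration, here is a heavily condensed sketch of that loop; it is not
the actual check_main() code.  The do_*_tests() calls are the ones quoted
from Check.pm later in this thread, the two PerMsgStatus getters are the 3.2
API as I recall it (verify against the source), and harvest_net_results()
and the 'shortcircuited' flag are placeholder names for this sketch.

  # Condensed, illustrative sketch of the ordering above; not real Check.pm code.
  sub check_main_sketch {
    my ($self, $pms) = @_;

    my $decoded = $pms->get_decoded_stripped_body_text_array();
    my @uris    = $pms->get_uri_list();

    foreach my $priority (sort { $a <=> $b } keys %{$pms->{conf}->{priorities}}) {
      last if $pms->{shortcircuited};             # placeholder for the shortcircuit check

      $self->do_head_tests($pms, $priority);
      $self->do_head_eval_tests($pms, $priority);
      $self->harvest_net_results($pms);           # placeholder: non-blocking harvest

      $self->do_body_tests($pms, $priority, $decoded);
      $self->do_uri_tests($pms, $priority, @uris);
      $self->harvest_net_results($pms);

      # ... rawbody, full and meta rules follow the same pattern ...
    }

    $self->harvest_net_results($pms, 'block');    # placeholder: wait for stragglers
  }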

--j.

Loren Wilton writes:
   maybe if there was some way to establish a hierarchy at startup
   which groups rule processing into nodes. some nodes finish
   quickly, some have dependencies, some are negative, etc.
 
  Just wanted to point out, this topic came out when site dns
  cache service started to fail due to excessive dnsbl queries. My
  slowdown was due to multiple timeouts and/or delay, probably
  related to answering joe-job rbldns backscatter -- that's the
  reason I was looking for early exit on scans in process.
 
 There is a little of splitting rules into processing speed groups done. 
 Specifically, the net-based tests, being dependent on external events for 
 completion, are split out from the other tests and are processed in two 
 phases.  The first phase issues the request for information over the net, 
 and the second phase then waits for an answer.  There is a background 
 routine that is harvesting incoming net results while other rules are 
 processed, so when a net result is required it may already be present and no 
 delay will be incurred.
 
 This is not an area I understand at all fully, but reading moderately recent 
 comments on Bugzilla leads me to believe that this is an area where some 
 improvement is still possible; there are some net tests that (I think) end 
 up waiting immediately for an answer rather than doing the two-phase 
 processing.  How much that slows down the result for the overall email 
 probably depends on many factors.
 
 Also note that even issuing the requests and then waiting for the result 
 only when it is needed doesn't guarantee that the mail will not have to wait 
 for results.  It could be that one of the very first rules processed (due to 
 priority or meta dependency, for instance) will need a net result, and so 
 the entire rule process will be forced to wait on it.
 
 As far as splitting non-net rules up based on speed, that isn't very 
 practical.  Regex rules should in general be quite fast, and all of them are 
 going to require the use of the processor full-time anyway.  The speed of 
 the rule will depend on how it is written and the exact content of the email 
 it is processing.  So a rule that is dog slow on one email may be blindingly 
 fast on most other emails.  I don't know that there is any good way to 
 estimate the speed of a regex simply by looking at it.
 
  Loren


RE: more efficient big scoring

2008-01-23 Thread Robert - elists

 
 Just wanted to point out, this topic came out when site dns
 cache service started to fail due to excessive dnsbl queries. My
 slowdown was due to multiple timeouts and/or delay, probably
 related to answering joe-job rbldns backscatter -- that's the
 reason I was looking for early exit on scans in process.
 
 // George
 George Georgalis, information system scientist IXOYE

George, That is correct!

I still maintain that the SA Team is more than bright and talented enough,
that over time, they will come up with new algorithms to allow this behavior
w/o a substantial SA processing speed decrease.

I just cannot imagine that the theoretical and real world limits of these
functions have been met yet.

And bottom line is, even if a function initially slows down the processing,
if the theory behind it is the truth, shouldn't it be pursued until it can
be implemented properly?

... or is that just wishful thinking?

jdow has suggested some reading to enlighten me that I will be getting to in
short order.

 - rh



Re: more efficient big scoring

2008-01-22 Thread George Georgalis
On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote:
On Sat, 19 Jan 2008, Loren Wilton wrote:

 I would not be terribly surprised to find out that on average
 there was no appreciable difference in running all rules of all
 types in priority order, over the current method;

Neither am I. Another thing to consider is the fraction of defined
rules that actually hit and affect the score is rather small. The
greatest optimization would be to not test REs you know will fail;  
but how do you do *that*?

thanks for all the followups on my inquiry. I'm glad the topic is/was
considered and it looks like there is some room for development, but
I now realize it is not as simple as I thought it might have been.
In answer to above question, maybe the tests need their own scoring?
eg fast tests and with big spam scores get a higher test score than
slow tests with low spam scores.

maybe if there was some way to establish a hierarchy at startup
which groups rule processing into nodes. some nodes finish
quickly, some have dependencies, some are negative, etc.

Then utilize some sort of looping test (eg check every .5 second)
which can kill remaining tests and short circuit, eg anytime in the
hierarchy the score is above what the negative test can fix, etc.

Appreciate the discussion thus far, unfortunately discussion is
all I'm able to contribute at this time.

Thanks,
// George



-- 
George Georgalis, information system scientist IXOYE


Re: more efficient big scoring

2008-01-22 Thread John D. Hardin
On Tue, 22 Jan 2008, George Georgalis wrote:

 On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote:

 Neither am I. Another thing to consider is the fraction of defined
 rules that actually hit and affect the score is rather small. The
 greatest optimization would be to not test REs you know will fail;  
 but how do you do *that*?
 
 thanks for all the followups on my inquiry. I'm glad the topic is/was
 considered and it looks like there is some room for development, but
 I now realize it is not as simple as I thought it might have been.
 In answer to above question, maybe the tests need their own scoring?
 eg fast tests and with big spam scores get a higher test score than
 slow tests with low spam scores.
 
 maybe if there was some way to establish a hierarchy at startup
 which groups rule processing into nodes. some nodes finish
 quickly, some have dependencies, some are negative, etc.

Loren mentioned to me in a private email: common subexpressions.

It would be theoretically possible to analyze all the rules in a given
set (e.g. body rules) to extract common subexpressions and develop a
processing/pruning tree based on that. You'd probably gain some
performance scanning messages, but at the cost of how much
startup/compiling time?
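
As a toy illustration of the idea (hand-written rules and a hand-built
pruning table; not SA code, and not what sa-compile actually does):

  # Toy sketch: rules sharing a required literal substring are grouped, and a
  # group's regexes run only when a cheap index() probe finds that literal.
  use strict;
  use warnings;
  my %rules = (
      CHEAP_VIAGRA => qr/\bcheap\b.{0,20}\bviagra\b/i,
      VIAGRA_PRICE => qr/\bviagra\b.{0,40}\$\d+/i,
      LOTTERY_WIN  => qr/\byou have won\b.{0,60}\blottery\b/i,
  );
  my %prune = (        # common literal => rules that cannot hit without it
      'viagra'  => [qw(CHEAP_VIAGRA VIAGRA_PRICE)],
      'lottery' => [qw(LOTTERY_WIN)],
  );

  my $body = lc "Subject says: cheap viagra today only";
  for my $literal (keys %prune) {
      next if index($body, $literal) < 0;          # one cheap probe prunes the whole group
      for my $name (@{ $prune{$literal} }) {
          print "$name hit\n" if $body =~ $rules{$name};
      }
  }

Whether building that table automatically from a few thousand real rules
saves more at scan time than it costs at startup is exactly the open
question here.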

--
 John Hardin KA7OHZ  http://www.impsec.org/~jhardin/
 [EMAIL PROTECTED]  FALaholic #11174  pgpk -a [EMAIL PROTECTED]
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
 To prevent conflict and violence from undermining development,
 effective disarmament programmes are vital...
  -- the UN, who doesn't want to confiscate guns
---
 5 days until the 41st anniversary of the loss of Apollo 1



Re: more efficient big scoring

2008-01-22 Thread Justin Mason

John D. Hardin writes:
 On Tue, 22 Jan 2008, George Georgalis wrote:
 
  On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote:
 
  Neither am I. Another thing to consider is the fraction of defined
  rules that actually hit and affect the score is rather small. The
  greatest optimization would be to not test REs you know will fail;  
  but how do you do *that*?
  
  thanks for all the followups on my inquiry. I'm glad the topic is/was
  considered and it looks like there is some room for development, but
  I now realize it is not as simple as I thought it might have been.
  In answer to above question, maybe the tests need their own scoring?
  eg fast tests and with big spam scores get a higher test score than
  slow tests with low spam scores.
  
  maybe if there was some way to establish a hierarchy at startup
  which groups rule processing into nodes. some nodes finish
  quickly, some have dependencies, some are negative, etc.
 
 Loren mentioned to me in a private email: common subexpressions.
 
 It would be theoretically possible to analyze all the rules in a given
 set (e.g. body rules) to extract common subexpressions and develop a
 processing/pruning tree based on that. You'd probably gain some
 performance scanning messages, but at the cost of how much
 startup/compiling time?

I experimented with this concept in my sa-compile work, but I could
achieve any speedup on real-world mixed spam/ham datasets.

Feel free to give it a try though ;)

--j.


Re: more efficient big scoring

2008-01-22 Thread Jim Maul

Justin Mason wrote:

John D. Hardin writes:

On Tue, 22 Jan 2008, George Georgalis wrote:


On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote:


Neither am I. Another thing to consider is the fraction of defined
rules that actually hit and affect the score is rather small. The
greatest optimization would be to not test REs you know will fail;  
but how do you do *that*?

thanks for all the followups on my inquiry. I'm glad the topic is/was
considered and it looks like there is some room for development, but
I now realize it is not as simple as I thought it might have been.
In answer to above question, maybe the tests need their own scoring?
eg fast tests and with big spam scores get a higher test score than
slow tests with low spam scores.

maybe if there was some way to establish a hierarchy at startup
which groups rule processing into nodes. some nodes finish
quickly, some have dependencies, some are negative, etc.

Loren mentioned to me in a private email: common subexpressions.

It would be theoretically possible to analyze all the rules in a given
set (e.g. body rules) to extract common subexpressions and develop a
processing/pruning tree based on that. You'd probably gain some
performance scanning messages, but at the cost of how much
startup/compiling time?


I experimented with this concept in my sa-compile work, but I could
achieve any speedup on real-world mixed spam/ham datasets.

Feel free to give it a try though ;)

--j.




You do mean *couldn't* achieve any speedup, correct?

-Jim



Re: more efficient big scoring

2008-01-22 Thread Justin Mason

Jim Maul writes:
 Justin Mason wrote:
  John D. Hardin writes:
  On Tue, 22 Jan 2008, George Georgalis wrote:
 
  On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote:
 
  Neither am I. Another thing to consider is the fraction of defined
  rules that actually hit and affect the score is rather small. The
  greatest optimization would be to not test REs you know will fail;  
  but how do you do *that*?
  thanks for all the followups on my inquiry. I'm glad the topic is/was
  considered and it looks like there is some room for development, but
  I now realize it is not as simple as I thought it might have been.
  In answer to above question, maybe the tests need their own scoring?
  eg fast tests and with big spam scores get a higher test score than
  slow tests with low spam scores.
 
  maybe if there was some way to establish a hierarchy at startup
  which groups rule processing into nodes. some nodes finish
  quickly, some have dependencies, some are negative, etc.
  Loren mentioned to me in a private email: common subexpressions.
 
  It would be theoretically possible to analyze all the rules in a given
  set (e.g. body rules) to extract common subexpressions and develop a
  processing/pruning tree based on that. You'd probably gain some
  performance scanning messages, but at the cost of how much
  startup/compiling time?
  
  I experimented with this concept in my sa-compile work, but I could
  achieve any speedup on real-world mixed spam/ham datasets.
  
  Feel free to give it a try though ;)
  
  --j.
  
  
 
 You do mean *couldn't* achieve any speedup, correct?

yep


Re: more efficient big scoring

2008-01-22 Thread John D. Hardin
John D. Hardin writes:
 
 Loren mentioned to me in a private email: common subexpressions.

Whoops! Matt Kettler mentioned it to me, not Loren. Sorry!

--
 John Hardin KA7OHZ  http://www.impsec.org/~jhardin/
 [EMAIL PROTECTED]  FALaholic #11174  pgpk -a [EMAIL PROTECTED]
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The difference is that Unix has had thirty years of technical
  types demanding basic functionality of it. And the Macintosh has
  had fifteen years of interface fascist users shaping its progress.
  Windows has the hairpin turns of the Microsoft marketing machine
  and that's all.-- Red Drag Diva
---
 5 days until Wolfgang Amadeus Mozart's 252nd Birthday



Re: more efficient big scoring

2008-01-22 Thread George Georgalis
On Tue, Jan 22, 2008 at 05:24:00PM +, Justin Mason wrote:

Jim Maul writes:
 Justin Mason wrote:
  John D. Hardin writes:
  On Tue, 22 Jan 2008, George Georgalis wrote:
 
  On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote:
 
  Neither am I. Another thing to consider is the fraction of defined
  rules that actually hit and affect the score is rather small. The
  greatest optimization would be to not test REs you know will fail;  
  but how do you do *that*?
  thanks for all the followups on my inquiry. I'm glad the topic is/was
  considered and it looks like there is some room for development, but
  I now realize it is not as simple as I thought it might have been.
  In answer to above question, maybe the tests need their own scoring?
  eg fast tests and with big spam scores get a higher test score than
  slow tests with low spam scores.
 
  maybe if there was some way to establish a hierarchy at startup
  which groups rule processing into nodes. some nodes finish
  quickly, some have dependencies, some are negative, etc.
  Loren mentioned to me in a private email: common subexpressions.
 
  It would be theoretically possible to analyze all the rules in a given
  set (e.g. body rules) to extract common subexpressions and develop a
  processing/pruning tree based on that. You'd probably gain some
  performance scanning messages, but at the cost of how much
  startup/compiling time?
  
  I experimented with this concept in my sa-compile work, but I could
  achieve any speedup on real-world mixed spam/ham datasets.
  
  Feel free to give it a try though ;)
  
  --j.
  
  
 
 You do mean *couldn't* achieve any speedup, correct?

yep


Just wanted to point out, this topic came out when site dns
cache service started to fail due to excessive dnsbl queries. My
slowdown was due to multiple timeouts and/or delay, probably
related to answering joe-job rbldns backscatter -- that's the
reason I was looking for early exit on scans in process.

// George



-- 
George Georgalis, information system scientist IXOYE


Re: more efficient big scoring

2008-01-22 Thread Loren Wilton

John D. Hardin writes:


Loren mentioned to me in a private email: common subexpressions.


Whoops! Matt Kettler mentioned it to me, not Loren. Sorry!


I was going to mention that I didn't think that had been me.
Unless I was asleep when I wrote the reply.  Which could have been the case. 
:-)


   Loren




Re: more efficient big scoring

2008-01-22 Thread Loren Wilton

 maybe if there was some way to establish a hierarchy at startup
 which groups rule processing into nodes. some nodes finish
 quickly, some have dependencies, some are negative, etc.


Just wanted to point out, this topic came out when site dns
cache service started to fail due to excessive dnsbl queries. My
slowdown was due to multiple timeouts and/or delay, probably
related to answering joe-job rbldns backscatter -- that's the
reason I was looking for early exit on scans in process.


There is a little of splitting rules into processing speed groups done. 
Specifically, the net-based tests, being dependent on external events for 
completion, are split out from the other tests and are processed in two 
phases.  The first phase issues the request for information over the net, 
and the second phase then waits for an answer.  There is a background 
routine that is harvesting incoming net results while other rules are 
processed, so when a net result is required it may already be present and no 
delay will be incurred.
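
Purely as illustration, a toy model of that two-phase pattern; the
subroutine and rule names below are stand-ins, not anything from the SA
source:

  # Toy model of the two-phase net-rule handling described above.  Nothing
  # here touches the network.
  use strict;
  use warnings;
  my (@pending, @answers);
  sub start_dns_query { push @pending, $_[0] }                       # phase 1: fire and forget
  sub harvest_any     { push @answers, shift @pending if @pending }  # cheap, non-blocking pickup
  sub wait_remaining  { push @answers, splice @pending }             # phase 2: block at the end

  start_dns_query($_) for qw(NET_RULE_A NET_RULE_B);   # issue all net queries up front

  for my $rule (qw(BODY_RULE_1 HEADER_RULE_1)) {
      # ... run the CPU-bound regex rule here ...
      harvest_any();     # pick up any answers that have already arrived
  }

  wait_remaining();      # only now wait for whatever is still outstanding
  print "harvested: @answers\n";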


This is not an area I understand at all fully, but reading moderately recent 
comments on Bugzilla leads me to believe that this is an area where some 
improvement is still possible; there are some net tests that (I think) end 
up waiting immediately for an answer rather than doing the two-phase 
processing.  How much that slows down the result for the overall email 
probably depends on many factors.


Also note that even issuing the requests and then waiting for the result 
only when it is needed doesn't guarantee that the mail will not have to wait 
for results.  It could be that one of the very first rules processed (due to 
priority or meta dependency, for instance) will need a net result, and so 
the entire rule process will be forced to wait on it.


As far as splitting non-net rules up based on speed, that isn't very 
practical.  Regex rules should in general be quite fast, and all of them are 
going to require the use of the processor full-time anyway.  The speed of 
the rule will depend on how it is written and the exact content of the email 
it is processing.  So a rule that is dog slow on one email may be blindingly 
fast on most other emails.  I don't know that there is any good way to 
estimate the speed of a regex simply by looking at it.


   Loren




Re: more efficient big scoring

2008-01-20 Thread Matt Kettler

Loren Wilton wrote:
Well, it looks like I need to spend some time reading the code to 
study exactly how SA runs rules, and see if it's doing something that 
pollutes the memory cache, which would cause the over-sorting to not 
matter..


As best I recall, it runs rules by type, and sorted by priority within 
type. There is also code to resolve meta-chaining ordering; I don't 
recall that I've seen that code since Justin wrote that.

I read the code; it runs by priority, then type within priority.

The first loop can be found in Check.pm, sub check_main.

foreach my $priority (sort { $a <=> $b } keys
%{$pms->{conf}->{priorities}}) {


and within that loop, after some code for DNSBL handling, you've got:

   $self->do_head_tests($pms, $priority);
   $self->do_head_eval_tests($pms, $priority);

   $self->do_body_tests($pms, $priority, $decoded);
   $self->do_uri_tests($pms, $priority, @uris);
   ...
 






Since it runs rules by type, I don't think it's guaranteed that a -1000 
rule will run before a -900 rule if they aren't the same rule type.  
(Maybe it is; I'd have to look at the code again.  For what I remember 
that wouldn't have been guaranteed.)

It's guaranteed.


There is (at least in theory) a cache advantage to doing things like 
running all the head tests and then all the body tests, rather than 
some of each. OTOH, both head and body are probably in memory, and the 
headers are generally not huge.  The body of course may be even 
smaller on many spams. So I'm not *convinced* that the cache locality 
argument will hold up under actual testing, albeit the theory sounds 
good. 
Well, I was thinking about the performance on large messages, which has 
to do with how it handles lines in the body (even though line wraps 
are removed for body rules, SA does break it up into largish chunks 
the code calls "lines").


This part of the code runs the entire body on one rule at a time, not 
all the rules on each line at a time..




What is useful is starting the net tests as early as possible, and 
harvesting them as late as possible.  

It already does that. Or at least it harvests them late.
However, net tests can be started early regardless of priority or 
short-circuiting, with (probably) minimal performance loss.  If you 
decide the case before all the net results arrive, you just ignore the 
stragglers.


I would not be terribly surprised to find out that on average there 
was no appreciable difference in running all rules of all types in 
priority order, over the current method; at least if this didn't push 
a lot of net rule mandatory result checking too early.
Of course it resulted in no difference. The code as it stands makes zero 
effort to take advantage of cache locality at all. Well, I guess you 
could say it's maximizing locality of the rule code, while minimizing 
locality of the message data.
  And even if that happened, it would slow throughput per item, but it 
wouldn't necessarily increase processor overhead.  indeed, it might in 
some cases reduce processor overhead.


Doing something like you did of assigning a priority to every rule 
that doesn't already have one, with the rule based on the score pretty 
much in the order the OP suggested, then sorting the rules by priority 
regardless of rule type, and running all of them that way I *suspect* 
will be about the same performance as the current algorithm.

Well that's roughly what my test did. The code doesn't group by type.
A reduction in performance would I suspect most likely occur from the 
code having to switch on rule type for each rule it runs.  There are 
probably clever tricks (like the current eval-ed compiled procedures) 
that would eliminate this switch overhead.


This still doesn't necessarily check for bailing on score.  But note 
that short-circuiting is already present.  I think it is based on a 
 'short circuit rule' hitting rather than a score comparison.  But it 
is still potentially a per-rule bailout test.  An extra numeric 
comparison after each rule that evaluates true would likely be trivial 
compared to the other tracking that is done for rules that hit.
Agreed.. or you could do it for each priority... This would give you 
flexibility to control how often the check is actually made.










Re: more efficient big scoring

2008-01-20 Thread John D. Hardin
On Sat, 19 Jan 2008, Loren Wilton wrote:

 I would not be terribly surprised to find out that on average
 there was no appreciable difference in running all rules of all
 types in priority order, over the current method;

Neither am I. Another thing to consider is the fraction of defined
rules that actually hit and affect the score is rather small. The
greatest optimization would be to not test REs you know will fail;  
but how do you do *that*?

--
 John Hardin KA7OHZ  http://www.impsec.org/~jhardin/
 [EMAIL PROTECTED]  FALaholic #11174  pgpk -a [EMAIL PROTECTED]
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  How can you reason with someone who thinks we're on a glidepath to
  a police state and yet their solution is to grant the government a
  monopoly on force? They are insane.
---
 7 days until Wolfgang Amadeus Mozart's 252nd Birthday



Re: more efficient big scoring

2008-01-19 Thread Justin Mason

Theo Van Dinter writes:
 Yes and no.  There aren't many negative scored rules, which could easily be
 put into a low priority to run first.
 
 The issue, which is where Matt was going I believe, is that the reason score
 based short circuiting was removed is that it's horribly slow to keep checking
 the score after each rule runs.  You can do it at the end of a priority's run,
 but then you have to split the rules across multiple priorities, which does
 impact performance.
 
 I made some comments about this kind of thing in
 http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3109 and envisioned SA
 auto-prioritizing rules for short circuiting for things like what I mentioned
 in c7, but there was some strong disagreement about things like SC based on
 score and so it didn't get implemented in the current code.

Well, there's room to implement it given current plugin hooks, but
nobody's done it yet. 
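
For anyone curious what that might look like, a rough sketch of such a
plugin follows.  Treat the details as assumptions to verify against the
Plugin and PerMsgStatus docs for your version: in particular that check_tick
receives the per-message status in $opts->{permsgstatus}, that get_score()
returns the running total, and that setting a flag like this actually stops
the scan.

  # Rough, unverified sketch of a score-based bail-out plugin; this is not
  # shipped SpamAssassin code.
  package ScoreBailout;
  use strict;
  use warnings;
  use Mail::SpamAssassin::Plugin;
  our @ISA = qw(Mail::SpamAssassin::Plugin);

  sub new {
    my ($class, $mailsa) = @_;
    my $self = $class->SUPER::new($mailsa);
    bless($self, $class);
    return $self;
  }

  # check_tick runs periodically between rule batches (see the ordering
  # posted earlier in this thread).
  sub check_tick {
    my ($self, $opts) = @_;
    my $pms = $opts->{permsgstatus} or return;   # ASSUMPTION: hook argument name
    if ($pms->get_score() >= 20) {               # ASSUMPTION: get_score(); threshold is arbitrary
      $pms->{shortcircuited} = 1;                # ASSUMPTION: internal flag, not a public API
    }
    return;
  }

  1;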

--j.


Re: more efficient big scoring

2008-01-19 Thread Matt Kettler

Robert - elists wrote:

You can't run the rules in score-order without driving SA's performance
into the ground.

The key here is SA doesn't run tests sequentially, it runs them in
parallel as it works its way through the body. this allows for good,
efficient use of memory cache.

By running rules in score-order, you break this, forcing SA to run
through the body multiple times, degrading performance.




Mr K

SA is an awesome, incredible product and tool.

Wonderful Job!

I am not an expert on the programming theory, design, and implementation
behind SA.

So... are you saying SA takes a single email and breaks it apart into
several pieces and scans those pieces via multiple processing threads and
comes back with an additive single end result for that single email's
multiple scan processing threads?
  
No, I'm saying it breaks the emails into pieces, then for the first 
piece, it runs all the rules. Then it runs all the rules on the second 
piece, and the third, and the fourth, etc.


Forcing score order causes it to run the whole message on one rule, then 
the whole message on the next rule, etc.


Looping through the entire message body multiple times is slow because 
you defeat the benefits of processor memory caches. It's much better to 
loop through pieces that fit in the cache.
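
To make the two orderings concrete, a tiny standalone comparison; which
order SA actually uses is argued over elsewhere in this thread, and the
chunks and rules here are made up:

  # Standalone toy comparison of the two loop orders; this is not SA code.
  use strict;
  use warnings;
  my @chunks = ('first chunk of the body', 'second chunk', 'third chunk');
  my @rules  = (qr/first/, qr/second/, qr/third/);

  # Order A: for each piece of the message, run every rule.  The piece stays
  # hot in the processor cache while all the rules touch it.
  for my $chunk (@chunks) {
      for my $re (@rules) { my $hit = ($chunk =~ $re); }
  }

  # Order B: for each rule, walk the whole message.  The message is re-read
  # once per rule, which is what a strict score-order pass would force.
  for my $re (@rules) {
      for my $chunk (@chunks) { my $hit = ($chunk =~ $re); }
  }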


Think of it like an assembly line. Even with one worker, assembly line 
methods are overall considerably faster. Think of a worker building 100 
things. That worker can get the tool he needs for the first task, do 
it 100 times, then get the next tool, do the next task 100 times, etc. 
Overall he'll finish much faster than making one thing, switching tools 
many times, then the next one, again switching tools many times.. etc.



I do admit that I am respectfully optimistic about your team's ability to
design code that would run just as fast if not faster with a score order
end result.

Maybe you could let us make that decision with local.cf knob?
  
Well, you can't do the score-checks with a local.cf knob. believe me, 
it's been tested, *YEARS* ago.. it *killed* SA's performance.


However, if you wanted to see the effects, you could use the priority 
setting on each rule in local.cf. You can cause them all to run in score 
order and see what happens...

I mean, most processors are so fast nowadays..
  

Wait, if they're so fast, why are you trying to optimize?

It doesn't help you to try and make things faster by hoping to bail out 
halfway through processing, when the cost is making SA run at less than 
half its normal speed.

I am thinking we would brute force it under some circumstances 'till you
folks come forth with even more brilliant design and implementation
breakthroughs.

What think?
  

This is a VERY, Very, Very old idea.

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=304

If it didn't work in 2002, it's not going to work any better now. And 
yes, the "run in score order" idea was advocated there too; see 
comment #10. When we finally got the ability to run rules in arbitrary 
order and tried a strict score order with a threshold check on every 
rule, the performance sucked.



The better way is the current way, priority and short-circuiting. If you 
configure this, you can at least control the number of passes SA makes 
at the message, and bail out on certain trusted rules, rather than total 
scores.


See the docs for shortcircuit:

http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Plugin_Shortcircuit.html
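
For reference, a minimal local.cf fragment along the lines of the examples
in those docs; the rule names, priorities and on/ham/spam settings here are
only illustrative, so check the linked documentation for your version
before copying:

  # Illustrative only; see the Shortcircuit plugin docs for the real examples.
  loadplugin Mail::SpamAssassin::Plugin::Shortcircuit

  # trusted ham indicators: stop the scan as soon as one of these hits
  shortcircuit USER_IN_WHITELIST     on
  priority     USER_IN_WHITELIST     -1000
  shortcircuit ALL_TRUSTED           on
  priority     ALL_TRUSTED           -950

  # high-confidence Bayes results can also end the scan early
  shortcircuit BAYES_99              spam
  shortcircuit BAYES_00              ham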


Is there somewhere you recommend that we can view discussions on making
processing faster?
  

Archives of sa-dev, and the bugzilla..

:-)

 - rh


  




Re: more efficient big scoring

2008-01-19 Thread Matt Kettler

Matt Kettler wrote:
No, I'm saying it breaks the emails into pieces, then for the first 
piece, it runs all the rules. Then it runs all the rules on the second 
piece, and the third, and the fourth, etc.


Forcing score order causes it to run the whole message on one rule, 
then the whole message on the next rule, etc.


Looping through the entire message body multiple times is slow because 
you defeat the benefits of processor memory caches. It's much better 
to loop through pieces that fit in the cache.


Think of it like an assembly line. Even with one worker, assembly line 
methods are overall considerably faster. Think of a worker building 
100 things. That worker can get the tool he needs for the first 
task, do it 100 times, then 
Ok, I've just proven myself wrong.. However, this does mean I'm 
concerned that there's other problems with how SA processes messages 
that's causing it to not matter.


What I did:

as a quick, crude demonstration of how using priorities affects SA, I 
went and hacked 50_scores.cf of a vanilla sa 3.2.3 into a priority.cf.


grep -P '^score [A-Z]' 50_scores.cf | sed 's/score/priority/' | cut -d ' ' -f -3 | sed 's/\.//' > priority.cf


Note: 6 rules end up with no priority value, because these rules have 
lots of spaces before their scores, i.e.:

score RDNS_NONE                 0.1

You can lint the file and fix it..

This creates a score-based priority for every rule. It's not strictly 
score order, as rules scored 1.0 run at priority 10, and rules scored 
1.000 run at 1000, but it's close. It's certainly close enough to create 
lots of different priorities for the rules to run at.
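
In other words, the generated priority is just the score with its first
decimal point deleted, for example (made-up rule names):

  # 50_scores.cf input            ->  resulting priority.cf line
  score SOME_RULE 1.0             ->  priority SOME_RULE 10
  score OTHER_RULE 1.000          ->  priority OTHER_RULE 1000
  score SOME_NICE_RULE -2.599     ->  priority SOME_NICE_RULE -2599

Also, because the cut keeps only the first three fields, only the first
value of a multi-score line survives into priority.cf.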


In my testing, I grabbed a corpus file:

http://spamassassin.apache.org/publiccorpus/20021010_easy_ham.tar.bz2


And ran:

time /home/mkettler/spamassassin-3.2/masses/mass-check --file easy_ham/* > test.out


Due to disk caching, I ran it 3 times on a vanilla 3.2.3 install. The 
first run was noticeably slower than the other 2, which gained from disk 
cache. I've discarded the first run.


The times I came up with were:
real  2m25.432s   user  2m12.880s   sys  0m11.521s
real  2m25.571s   user  2m12.818s   sys  0m11.672s

I then installed my priority.cf into /etc/mail/spamassassin, and re-ran..

real  2m25.212s   user  2m12.507s   sys  0m11.694s
real  2m25.435s   user  2m12.852s   sys  0m11.596s

No significant difference..

Well, it looks like I need to spend some time reading the code to study 
exactly how SA runs rules, and see if it's doing something that pollutes 
the memory cache, which would cause the over-sorting to not matter..





Note: by default, mass-check runs without network tests




Re: more efficient big scoring

2008-01-19 Thread Loren Wilton
Well, it looks like I need to spend some time reading the code to study 
exactly how SA runs rules, and see if it's doing something that pollutes 
the memory cache, which would cause the over-sorting to not matter..


As best I recall, it runs rules by type, and sorted by priority within type. 
There is also code to resolve meta-chaining ordering; I don't recall that 
I've seen that code since Justin wrote that.


Since it runs rules by type, I don't think it's guaranteed that a -1000 rule 
will run before a -900 rule if they aren't the same rule type.  (Maybe it 
is; I'd have to look at the code again.  For what I remember that wouldn't 
have been guaranteed.)


There is (at least in theory) a cache advantage to doing things like running 
all the head tests and then all the body tests, rather than some of each. 
OTOH, both head and body are probably in memory, and the headers are 
generally not huge.  The body of course may be even smaller on many spams. 
So I'm not *convinced* that the cache locality argument will hold up under 
actual testing, albeit the theory sounds good.


What is useful is starting the net tests as early as possible, and 
harvesting them as late as possible.  However, net tests can be started 
early regardless of priority or short-circuiting, with (probably) minimal 
performance loss.  If you decide the case before all the net results arrive, 
you just ignore the stragglers.


I would not be terribly surprised to find out that on average there was no 
appreciable difference in running all rules of all types in priority order, 
over the current method; at least if this didn't push a lot of net rule 
mandatory result checking too early.  And even if that happened, it would 
slow throughput per item, but it wouldn't necessarily increase processor 
overhead.  indeed, it might in some cases reduce processor overhead.


Doing something like you did of assigning a priority to every rule that 
doesn't already have one, with the rule based on the score pretty much in 
the order the OP suggested, then sorting the rules by priority regardless of 
rule type, and running all of them that way I *suspect* will be about the 
same performance as the current algorithm.  A reduction in performance would 
I suspect most likely occur from the code having to switch on rule type for 
each rule it runs.  There are probably clever tricks (like the current 
eval-ed compiled procedures) that would eliminate this switch overhead.


This still doesn't necessarily check for bailing on score.  But note that 
short-circuiting is already present.  I think it is based on a 'short 
circuit rule' hitting rather than a score comparison.  But it is still 
potentially a per-rule bailout test.  An extra numeric comparison after each 
rule that evaluates true would likely be trivial compared to the other 
tracking that is done for rules that hit.


   Loren




Re: more efficient big scoring

2008-01-18 Thread Matt Kettler
You can't run the rules in score-order without driving SA's performance 
into the ground.


The key here is SA doesn't run tests sequentially, it runs them in 
parallel as it works its way through the body. this allows for good, 
efficient use of memory cache.


By running rules in score-order, you break this, forcing SA to run 
through the body multiple times, degrading performance.



George Georgalis wrote:

Noticed today (again) how long some messages take to test.  The
first thing that comes to mind is some dns is getting overloaded
answering joe-job rbldns backscatter, causing timeouts or slow
response times.

Then I was thinking about how some tests are excluded because they
generate too much regex load, which can be problematic even if
it's a good test.

Some time back I recall a thread amounting to "why not quit the
remaining tests if the spam threshold is reached"; the answer was that
some tests have negative scores and could change the result.

So, here are two ideas. On startup, after all the conf files are
parsed, create a hash that has tests sorted by score, with the
largest positive tests starting after zero, ordered like this:

-5
-5
-2
-1
0
6
5
4
2
2
1

then test in that order; whenever a test brings the message
to the spam score level, exit with the result (and add a switch to
optionally run all tests).

Another approach might be simpler to integrate than the above: simply
do all the negative-score tests first and pull out if the score
gets to spam level.

// George


  




Re: more efficient big scoring

2008-01-18 Thread Theo Van Dinter
Yes and no.  There aren't many negative scored rules, which could easily be
put into a low priority to run first.

The issue, which is where Matt was going I believe, is that the reason score
based short circuiting was removed is that it's horribly slow to keep checking
the score after each rule runs.  You can do it at the end of a priority's run,
but then you have to split the rules across multiple priorities, which does
impact performance.
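
As a toy sketch of the per-priority variant (made-up scores, not SA
internals), the point is that there is one comparison per priority level
instead of one per rule:

  # Check the running total only at priority boundaries.
  use strict;
  use warnings;
  my %rules_at = ( -100 => [ 1.2, -0.5 ], 0 => [ 2.5, 3.1 ], 100 => [ 4.0 ] );
  my ($score, $required) = (0, 5.0);

  for my $priority (sort { $a <=> $b } keys %rules_at) {
      $score += $_ for @{ $rules_at{$priority} };    # run every rule at this level
      if ($score >= $required) {                     # single check per priority level
          print "bailing out after priority $priority (score $score)\n";
          last;
      }
  }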

I made some comments about this kind of thing in
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3109 and envisioned SA
auto-prioritizing rules for short circuiting for things like what I mentioned
in c7, but there was some strong disagreement about things like SC based on
score and so it didn't get implemented in the current code.


On Fri, Jan 18, 2008 at 11:22:55PM -0500, Matt Kettler wrote:
 You can't run the rules in score-order without driving SA's performance 
 into the ground.
 
 The key here is SA doesn't run tests sequentially, it runs them in 
 parallel as it works its way through the body. this allows for good, 
 efficient use of memory cache.
 
 By running rules in score-order, you break this, forcing SA to run 
 through the body multiple times, degrading performance.
 
 
 George Georgalis wrote:
 Noticed today (again) how long some messages take to test.  The
 first thing that comes to mind is some dns is getting overloaded
 answering joe-job rbldns backscatter, causing timeouts or slow
 response times.
 
 Then I was thinking about how some tests are excluded because they
 generate too much regex load, which can be problematic even if
 it's a good test.
 
 Some time back I recall a thread amounting to "why not quit the
 remaining tests if the spam threshold is reached"; the answer was that
 some tests have negative scores and could change the result.
 
 So, here are two ideas. On startup, after all the conf files are
 parsed, create a hash that has tests sorted by score, with the
 largest positive tests starting after zero, ordered like this:
 
 -5
 -5
 -2
 -1
 0
 6
 5
 4
 2
 2
 1
 
 then test in that order; whenever a test brings the message
 to the spam score level, exit with the result (and add a switch to
 optionally run all tests).
 
 Another approach might be simpler to integrate than the above: simply
 do all the negative-score tests first and pull out if the score
 gets to spam level.
 
 // George
 
 
   

-- 
Randomly Selected Tagline:
No one can feel as helpless as the owner of a sick goldfish.




RE: more efficient big scoring

2008-01-18 Thread Robert - elists
 
 You can't run the rules in score-order without driving SA's performance
 into the ground.
 
 The key here is SA doesn't run tests sequentially, it runs them in
 parallel as it works its way through the body. this allows for good,
 efficient use of memory cache.
 
 By running rules in score-order, you break this, forcing SA to run
 through the body multiple times, degrading performance.
 

Mr K

SA is an awesome, incredible product and tool.

Wonderful Job!

I am not an expert on the programming theory, design, and implementation
behind SA.

So... are you saying SA takes a single email and breaks it apart into
several pieces and scans those pieces via multiple processing threads and
comes back with an additive single end result for that single email's
multiple scan processing threads?

I do admit that I am respectfully optimistic about your team's ability to
design code that would run just as fast if not faster with a score order
end result.

Maybe you could let us make that decision with local.cf knob?

I mean, most processors are so fast nowadays..

I am thinking we would brute force it under some circumstances 'till you
folks come forth with even more brilliant design and implementation
breakthroughs.

What think?

Is there somewhere you recommend that we can view discussions on making
processing faster?

:-)

 - rh



Re: more efficient big scoring

2008-01-18 Thread jdow

From: Robert - elists [EMAIL PROTECTED]
Sent: Friday, 2008, January 18 21:14




You can't run the rules in score-order without driving SA's performance
into the ground.

The key here is SA doesn't run tests sequentially, it runs them in
parallel as it works its way through the body. this allows for good,
efficient use of memory cache.

By running rules in score-order, you break this, forcing SA to run
through the body multiple times, degrading performance.



Mr K

SA is an awesome, incredible product and tool.

Wonderful Job!

I am not an expert on the programming theory, design, and implementation
behind SA.

So... are you saying SA takes a single email and breaks it apart into
several pieces and scans those pieces via multiple processing threads and
comes back with an additive single end result for that single email's
multiple scan processing threads?


Before going further you should try to find a really good discussion of
how perl parses regular expressions. Oversimplifications can lead to
massive pessimization of the code in the name of optimization.
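
One classic illustration of the kind of pessimization being warned about
here; this is a toy example, not an actual SA rule:

  # A "simple-looking" pattern that backtracks catastrophically on a
  # near-miss, versus an equivalent that does not.
  use strict;
  use warnings;
  my $text = ('a' x 30) . 'b';

  # Nested quantifiers make this failing match take an absurdly long time;
  # uncomment it to watch perl grind:
  #   $text =~ /^(a+)+$/;

  # The equivalent pattern without nested quantifiers fails immediately:
  print "no match, and it was cheap to find out\n" unless $text =~ /^a+$/;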

{^_^}