Re: more efficient big scoring
To clarify -- here's how the current code orders rule evaluation:

- message metadata is extracted
- header DNSBL tests are started
- the decoded forms of the body text are extracted and cached
- the URIs in the message body are extracted and cached
- it iterates through each known priority level, defined in the active ruleset, from lowest to highest, and:
  - checks to see if it's shortcircuited; if it has, it breaks out of the loop
  - calls the 'check_rules_at_priority' plugin hook
  - runs all head and head-eval rules defined for that priority level
  - checks to see if there are network rules to harvest
  - runs all body and body-eval rules defined for that priority level
  - checks to see if there are network rules to harvest
  - runs all uri rules defined for that priority level
  - checks to see if there are network rules to harvest
  - runs all rawbody and rawbody-eval rules defined for that priority level
  - checks to see if there are network rules to harvest
  - runs all full and full-eval rules defined for that priority level
  - checks to see if there are network rules to harvest
  - runs all meta rules defined for that priority level (note: if the meta rules depend on a network rule, this may block until that rule completes)
  - checks to see if there are network rules to harvest
  - calls the check_tick plugin hook
- finally, it waits for any remaining unharvested network rules (if it hasn't shortcircuited)
- calls the check_post_dnsbl plugin hook
- auto-learns from the message, if applicable
- calls the check_post_learn plugin hook
- and returns

In 3.2.x and 3.3.0 this is all in the Check plugin, in the check_main() method, so it can be redefined or overridden with alternative orderings quite easily.

--j.

Loren Wilton writes:

maybe if there was some way to establish a hierarchy at startup which groups rule processing into nodes. some nodes finish quickly, some have dependencies, some are negative, etc.

Just wanted to point out, this topic came out when site dns cache service started to fail due to excessive dnsbl queries. My slowdown was due to multiple timeouts and/or delay, probably related to answering joe-job rbldns backscatter -- that's the reason I was looking for early exit on scans in process.

There is a little splitting of rules into processing-speed groups done already. Specifically, the net-based tests, being dependent on external events for completion, are split out from the other tests and are processed in two phases. The first phase issues the request for information over the net, and the second phase then waits for an answer. There is a background routine that harvests incoming net results while other rules are processed, so when a net result is required it may already be present and no delay will be incurred.

This is not an area I understand at all fully, but reading moderately recent comments on Bugzilla leads me to believe it is an area where some improvement is still possible; there are some net tests that (I think) end up waiting immediately for an answer rather than doing the two-phase processing. How much that slows down the result for the overall email probably depends on many factors.

Also note that even issuing the requests and then waiting for the result only when it is needed doesn't guarantee that the mail will not have to wait for results. It could be that one of the very first rules processed (due to priority or meta dependency, for instance) will need a net result, and so the entire rule process will be forced to wait on it.
As far as splitting non-net rules up based on speed, that isn't very practical. Regex rules should in general be quite fast, and all of them are going to require the use of the processor full-time anyway. The speed of a rule will depend on how it is written and on the exact content of the email it is processing, so a rule that is dog slow on one email may be blindingly fast on most other emails. I don't know that there is any good way to estimate the speed of a regex simply by looking at it.

Loren
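To make the two-phase pattern concrete, here is a minimal sketch using Net::DNS's bgsend/bgread -- this is not SA's actual harvesting code, and the DNSBL zone and addresses are purely illustrative:

    use strict;
    use warnings;
    use Net::DNS;

    my $res = Net::DNS::Resolver->new;

    # Phase 1: issue all DNSBL queries without waiting for answers.
    my @ips = qw(192.0.2.1 192.0.2.2);    # RFC 5737 test addresses
    my %pending;
    for my $ip (@ips) {
        my $name = join('.', reverse split /\./, $ip) . '.dnsbl.example.org';
        $pending{$ip} = $res->bgsend($name, 'A');   # returns a socket, not an answer
    }

    # ... CPU-bound rules would run here while the answers travel ...

    # Phase 2: harvest whatever has arrived; ideally nothing blocks by now.
    for my $ip (keys %pending) {
        next unless $res->bgisready($pending{$ip}); # a real harvester would poll/select
        my $packet = $res->bgread($pending{$ip});
        print "$ip is listed\n" if $packet && $packet->answer;
    }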
RE: more efficient big scoring
Just wanted to point out, this topic came out when site dns cache service started to fail due to excessive dnsbl queries. My slowdown was due to multiple timeouts and/or delay, probably related to answering joe-job rbldns backscatter -- that's the reason I was looking for early exit on scans in process.

// George
George Georgalis, information system scientist  IXOYE

George,

That is correct! I still maintain that the SA Team is more than bright and talented enough that, over time, they will come up with new algorithms to allow this behavior w/o a substantial SA processing speed decrease. I just cannot imagine that the theoretical and real-world limits of these functions have been met yet. And bottom line is, even if a function initially slows down the processing, if the theory behind it is the truth, shouldn't it be pursued until it can be implemented properly? ... or is that just wishful thinking?

jdow has suggested some reading to enlighten me that I will be getting to in short order.

- rh
Re: more efficient big scoring
On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote:

On Sat, 19 Jan 2008, Loren Wilton wrote: I would not be terribly surprised to find out that on average there was no appreciable difference in running all rules of all types in priority order, over the current method;

Neither am I. Another thing to consider is that the fraction of defined rules that actually hit and affect the score is rather small. The greatest optimization would be to not test REs you know will fail; but how do you do *that*?

thanks for all the followups on my inquiry. I'm glad the topic is/was considered and it looks like there is some room for development, but I now realize it is not as simple as I thought it might have been.

In answer to the above question, maybe the tests need their own scoring? eg fast tests with big spam scores get a higher test score than slow tests with low spam scores. maybe if there was some way to establish a hierarchy at startup which groups rule processing into nodes. some nodes finish quickly, some have dependencies, some are negative, etc. Then utilize some sort of looping test (eg check every .5 second) which can kill remaining tests and short circuit. eg anytime in the hierarchy the score is above what the negative tests can fix, etc.

Appreciate the discussion thus far; unfortunately discussion is all I'm able to contribute at this time.

Thanks,
// George
--
George Georgalis, information system scientist  IXOYE
Re: more efficient big scoring
On Tue, 22 Jan 2008, George Georgalis wrote:

On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote: Neither am I. Another thing to consider is that the fraction of defined rules that actually hit and affect the score is rather small. The greatest optimization would be to not test REs you know will fail; but how do you do *that*?

thanks for all the followups on my inquiry. I'm glad the topic is/was considered and it looks like there is some room for development, but I now realize it is not as simple as I thought it might have been. In answer to the above question, maybe the tests need their own scoring? eg fast tests with big spam scores get a higher test score than slow tests with low spam scores. maybe if there was some way to establish a hierarchy at startup which groups rule processing into nodes. some nodes finish quickly, some have dependencies, some are negative, etc.

Loren mentioned to me in a private email: common subexpressions. It would be theoretically possible to analyze all the rules in a given set (e.g. body rules) to extract common subexpressions and develop a processing/pruning tree based on that. You'd probably gain some performance scanning messages, but at the cost of how much startup/compiling time?

--
John Hardin KA7OHZ  http://www.impsec.org/~jhardin/
[EMAIL PROTECTED]  FALaholic #11174
pgpk -a [EMAIL PROTECTED]  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
To prevent conflict and violence from undermining development, effective disarmament programmes are vital... -- the UN, who doesn't want to confiscate guns
---
5 days until the 41st anniversary of the loss of Apollo 1
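For a sense of what common-subexpression folding buys, here is a toy sketch with the CPAN module Regexp::Assemble, which factors a set of patterns into one trie-backed regex usable as a cheap prefilter (the rule names and patterns are invented, and this is not how sa-compile works internally):

    use strict;
    use warnings;
    use Regexp::Assemble;

    # Toy stand-ins for a set of body rules.
    my %rules = (
        TOY_PILLS => 'cheap (?:pills|meds)',
        TOY_ROLEX => 'genuine rolex',
        TOY_MORTG => 'mortgage rates?',
    );

    # Fold all patterns into one regex; shared prefixes are factored
    # out, so a single scan covers the whole set.
    my $ra = Regexp::Assemble->new(flags => 'i');
    $ra->add($_) for values %rules;
    my $prefilter = $ra->re;

    for my $line ('get your genuine rolex here', 'an innocent line') {
        next if $line !~ $prefilter;    # cheap rejection for most lines
        # Only on a prefilter hit do we pay for the per-rule matches.
        for my $name (keys %rules) {
            print "$name hit: $line\n" if $line =~ /$rules{$name}/i;
        }
    }

The startup cost Hardin asks about is exactly the assembly step: building the combined regex is paid once per rule set, per process.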
Re: more efficient big scoring
John D. Hardin writes:

On Tue, 22 Jan 2008, George Georgalis wrote: On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote: Neither am I. Another thing to consider is that the fraction of defined rules that actually hit and affect the score is rather small. The greatest optimization would be to not test REs you know will fail; but how do you do *that*?

thanks for all the followups on my inquiry. I'm glad the topic is/was considered and it looks like there is some room for development, but I now realize it is not as simple as I thought it might have been. In answer to the above question, maybe the tests need their own scoring? eg fast tests with big spam scores get a higher test score than slow tests with low spam scores. maybe if there was some way to establish a hierarchy at startup which groups rule processing into nodes. some nodes finish quickly, some have dependencies, some are negative, etc.

Loren mentioned to me in a private email: common subexpressions. It would be theoretically possible to analyze all the rules in a given set (e.g. body rules) to extract common subexpressions and develop a processing/pruning tree based on that. You'd probably gain some performance scanning messages, but at the cost of how much startup/compiling time?

I experimented with this concept in my sa-compile work, but I could achieve any speedup on real-world mixed spam/ham datasets. Feel free to give it a try though ;)

--j.
Re: more efficient big scoring
Justin Mason wrote:

John D. Hardin writes: On Tue, 22 Jan 2008, George Georgalis wrote: On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote: Neither am I. Another thing to consider is that the fraction of defined rules that actually hit and affect the score is rather small. The greatest optimization would be to not test REs you know will fail; but how do you do *that*?

thanks for all the followups on my inquiry. I'm glad the topic is/was considered and it looks like there is some room for development, but I now realize it is not as simple as I thought it might have been. In answer to the above question, maybe the tests need their own scoring? eg fast tests with big spam scores get a higher test score than slow tests with low spam scores. maybe if there was some way to establish a hierarchy at startup which groups rule processing into nodes. some nodes finish quickly, some have dependencies, some are negative, etc.

Loren mentioned to me in a private email: common subexpressions. It would be theoretically possible to analyze all the rules in a given set (e.g. body rules) to extract common subexpressions and develop a processing/pruning tree based on that. You'd probably gain some performance scanning messages, but at the cost of how much startup/compiling time?

I experimented with this concept in my sa-compile work, but I could achieve any speedup on real-world mixed spam/ham datasets. Feel free to give it a try though ;) --j.

You do mean *couldn't* achieve any speedup, correct?

-Jim
Re: more efficient big scoring
Jim Maul writes:

Justin Mason wrote: John D. Hardin writes: On Tue, 22 Jan 2008, George Georgalis wrote: On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote: Neither am I. Another thing to consider is that the fraction of defined rules that actually hit and affect the score is rather small. The greatest optimization would be to not test REs you know will fail; but how do you do *that*?

thanks for all the followups on my inquiry. I'm glad the topic is/was considered and it looks like there is some room for development, but I now realize it is not as simple as I thought it might have been. In answer to the above question, maybe the tests need their own scoring? eg fast tests with big spam scores get a higher test score than slow tests with low spam scores. maybe if there was some way to establish a hierarchy at startup which groups rule processing into nodes. some nodes finish quickly, some have dependencies, some are negative, etc.

Loren mentioned to me in a private email: common subexpressions. It would be theoretically possible to analyze all the rules in a given set (e.g. body rules) to extract common subexpressions and develop a processing/pruning tree based on that. You'd probably gain some performance scanning messages, but at the cost of how much startup/compiling time?

I experimented with this concept in my sa-compile work, but I could achieve any speedup on real-world mixed spam/ham datasets. Feel free to give it a try though ;) --j.

You do mean *couldn't* achieve any speedup, correct?

yep
Re: more efficient big scoring
John D. Hardin writes:

Loren mentioned to me in a private email: common subexpressions.

Whoops! Matt Kettler mentioned it to me, not Loren. Sorry!

--
John Hardin KA7OHZ  http://www.impsec.org/~jhardin/
[EMAIL PROTECTED]  FALaholic #11174
pgpk -a [EMAIL PROTECTED]  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
The difference is that Unix has had thirty years of technical types demanding basic functionality of it. And the Macintosh has had fifteen years of interface fascist users shaping its progress. Windows has the hairpin turns of the Microsoft marketing machine and that's all. -- Red Drag Diva
---
5 days until Wolfgang Amadeus Mozart's 252nd Birthday
Re: more efficient big scoring
On Tue, Jan 22, 2008 at 05:24:00PM +0000, Justin Mason wrote:

Jim Maul writes: Justin Mason wrote: John D. Hardin writes: On Tue, 22 Jan 2008, George Georgalis wrote: On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote: Neither am I. Another thing to consider is that the fraction of defined rules that actually hit and affect the score is rather small. The greatest optimization would be to not test REs you know will fail; but how do you do *that*?

thanks for all the followups on my inquiry. I'm glad the topic is/was considered and it looks like there is some room for development, but I now realize it is not as simple as I thought it might have been. In answer to the above question, maybe the tests need their own scoring? eg fast tests with big spam scores get a higher test score than slow tests with low spam scores. maybe if there was some way to establish a hierarchy at startup which groups rule processing into nodes. some nodes finish quickly, some have dependencies, some are negative, etc.

Loren mentioned to me in a private email: common subexpressions. It would be theoretically possible to analyze all the rules in a given set (e.g. body rules) to extract common subexpressions and develop a processing/pruning tree based on that. You'd probably gain some performance scanning messages, but at the cost of how much startup/compiling time?

I experimented with this concept in my sa-compile work, but I could achieve any speedup on real-world mixed spam/ham datasets. Feel free to give it a try though ;) --j.

You do mean *couldn't* achieve any speedup, correct?

yep

Just wanted to point out, this topic came out when site dns cache service started to fail due to excessive dnsbl queries. My slowdown was due to multiple timeouts and/or delay, probably related to answering joe-job rbldns backscatter -- that's the reason I was looking for early exit on scans in process.

// George
--
George Georgalis, information system scientist  IXOYE
Re: more efficient big scoring
John D. Hardin writes:

Loren mentioned to me in a private email: common subexpressions. Whoops! Matt Kettler mentioned it to me, not Loren. Sorry!

I was going to mention that I didn't think that had been me. Unless I was asleep when I wrote the reply. Which could have been the case. :-)

Loren
Re: more efficient big scoring
maybe if there was some way to establish a hierarchy at startup which groups rule processing into nodes. some nodes finish quickly, some have dependencies, some are negative, etc.

Just wanted to point out, this topic came out when site dns cache service started to fail due to excessive dnsbl queries. My slowdown was due to multiple timeouts and/or delay, probably related to answering joe-job rbldns backscatter -- that's the reason I was looking for early exit on scans in process.

There is a little splitting of rules into processing-speed groups done already. Specifically, the net-based tests, being dependent on external events for completion, are split out from the other tests and are processed in two phases. The first phase issues the request for information over the net, and the second phase then waits for an answer. There is a background routine that harvests incoming net results while other rules are processed, so when a net result is required it may already be present and no delay will be incurred.

This is not an area I understand at all fully, but reading moderately recent comments on Bugzilla leads me to believe it is an area where some improvement is still possible; there are some net tests that (I think) end up waiting immediately for an answer rather than doing the two-phase processing. How much that slows down the result for the overall email probably depends on many factors.

Also note that even issuing the requests and then waiting for the result only when it is needed doesn't guarantee that the mail will not have to wait for results. It could be that one of the very first rules processed (due to priority or meta dependency, for instance) will need a net result, and so the entire rule process will be forced to wait on it.

As far as splitting non-net rules up based on speed, that isn't very practical. Regex rules should in general be quite fast, and all of them are going to require the use of the processor full-time anyway. The speed of a rule will depend on how it is written and on the exact content of the email it is processing, so a rule that is dog slow on one email may be blindingly fast on most other emails. I don't know that there is any good way to estimate the speed of a regex simply by looking at it.

Loren
Re: more efficient big scoring
Loren Wilton wrote:

Well, it looks like I need to spend some time reading the code to study exactly how SA runs rules, and see if it's doing something that pollutes the memory cache, which would cause the over-sorting to not matter..

As best I recall, it runs rules by type, and sorted by priority within type. There is also code to resolve meta-chaining ordering; I don't recall that I've seen that code since Justin wrote that.

I read the code; it runs by priority, then type within priority. The first loop can be found in Check.pm, sub check_main:

    foreach my $priority (sort { $a <=> $b } keys %{$pms->{conf}->{priorities}}) {

and within that loop, after some code for DNSBL handling, you've got:

    $self->do_head_tests($pms, $priority);
    $self->do_head_eval_tests($pms, $priority);
    $self->do_body_tests($pms, $priority, $decoded);
    $self->do_uri_tests($pms, $priority, @uris);
    ...

Since it runs rules by type, I don't think it's guaranteed that a -1000 rule will run before a -900 rule if they aren't the same rule type. (Maybe it is; I'd have to look at the code again. From what I remember that wouldn't have been guaranteed.)

It's guaranteed.

There is (at least in theory) a cache advantage to doing things like running all the head tests and then all the body tests, rather than some of each. OTOH, both head and body are probably in memory, and the headers are generally not huge. The body of course may be even smaller on many spams. So I'm not *convinced* that the cache locality argument will hold up under actual testing, albeit the theory sounds good.

Well, I was thinking about the performance in large messages, which has to do with how it handles lines in the body. (Even though linewraps are removed for body rules, SA does break the body up into largeish chunks the code calls "lines".) This part of the code runs the entire body on one rule at a time, not all the rules on each line at a time.

What is useful is starting the net tests as early as possible, and harvesting them as late as possible.

It already does that. Or at least it harvests them late.

However, net tests can be started early regardless of priority or short-circuiting, with (probably) minimal performance loss. If you decide the case before all the net results arrive, you just ignore the stragglers. I would not be terribly surprised to find out that on average there was no appreciable difference in running all rules of all types in priority order, over the current method; at least if this didn't push a lot of net rule mandatory result checking too early.

Of course it resulted in no difference. The code as it stands makes zero effort to take advantage of cache locality at all. Well, I guess you could say it's maximizing locality of the rule code, while minimizing locality of the message data.

And even if that happened, it would slow throughput per item, but it wouldn't necessarily increase processor overhead. Indeed, it might in some cases reduce processor overhead. Doing something like you did of assigning a priority to every rule that doesn't already have one, with the rule based on the score pretty much in the order the OP suggested, then sorting the rules by priority regardless of rule type, and running all of them that way I *suspect* will be about the same performance as the current algorithm.

Well, that's roughly what my test did. The code doesn't group by type.

A reduction in performance would I suspect most likely occur from the code having to switch on rule type for each rule it runs.
There are probably clever tricks (like the current eval-ed compiled procedures) that would eliminate this switch overhead.

This still doesn't necessarily check for bailing on score. But note that short-circuiting is already present. I think it is based on a 'short circuit rule' hitting rather than a score comparison. But it is still potentially a per-rule bailout test. An extra numeric comparison after each rule that evaluates true would likely be trivial compared to the other tracking that is done for rules that hit.

Agreed.. or you could do it for each priority... This would give you flexibility to control how often the check is actually made.
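A rough sketch of that per-priority variant in plain Perl (toy rules and an arbitrary bail threshold; the real Check.pm loop is considerably more involved):

    use strict;
    use warnings;

    # Invented rules grouped by priority: [ name, score, test ].
    my %rules_at_priority = (
        -100 => [ [ 'TOY_WHITELIST', -10, sub { 0 } ] ],
           0 => [ [ 'TOY_SUBJECT',     3, sub { 1 } ],
                  [ 'TOY_BODY',        4, sub { 1 } ] ],
         500 => [ [ 'TOY_EXPENSIVE',   2, sub { 1 } ] ],
    );
    my $bail_at = 6;    # arbitrary "hopelessly spammy" cutoff

    my $score = 0;
    PRIO: for my $prio (sort { $a <=> $b } keys %rules_at_priority) {
        for my $r (@{ $rules_at_priority{$prio} }) {
            my ($name, $points, $test) = @$r;
            $score += $points if $test->();
        }
        # One comparison per priority level instead of one per rule.
        if ($score >= $bail_at) {
            print "bailed after priority $prio with score $score\n";
            last PRIO;
        }
    }
    print "score: $score\n";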
Re: more efficient big scoring
On Sat, 19 Jan 2008, Loren Wilton wrote:

I would not be terribly surprised to find out that on average there was no appreciable difference in running all rules of all types in priority order, over the current method;

Neither am I. Another thing to consider is that the fraction of defined rules that actually hit and affect the score is rather small. The greatest optimization would be to not test REs you know will fail; but how do you do *that*?

--
John Hardin KA7OHZ  http://www.impsec.org/~jhardin/
[EMAIL PROTECTED]  FALaholic #11174
pgpk -a [EMAIL PROTECTED]  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
How can you reason with someone who thinks we're on a glidepath to a police state and yet their solution is to grant the government a monopoly on force? They are insane.
---
7 days until Wolfgang Amadeus Mozart's 252nd Birthday
Re: more efficient big scoring
Theo Van Dinter writes:

Yes and no. There aren't many negative scored rules, which could easily be put into a low priority to run first. The issue, which is where Matt was going I believe, is that the reason score-based short circuiting was removed is that it's horribly slow to keep checking the score after each rule runs. You can do it at the end of a priority's run, but then you have to split the rules across multiple priorities, which does impact performance.

I made some comments about this kind of thing in http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3109 and envisioned SA auto-prioritizing rules for short circuiting for things like what I mentioned in c7, but there was some strong disagreement about things like SC based on score and so it didn't get implemented in the current code.

Well, there's room to implement it given current plugin hooks, but nobody's done it yet.

--j.
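Presumably something along these lines would do it -- an untested sketch of a plugin answering the have_shortcircuited hook, assuming get_score() and get_required_score() behave as their names suggest (load it with a loadplugin line in a .pre file):

    package ScoreBail;
    use strict;
    use warnings;
    use Mail::SpamAssassin::Plugin;
    our @ISA = qw(Mail::SpamAssassin::Plugin);

    sub new {
        my ($class, $mailsa) = @_;
        my $self = $class->SUPER::new($mailsa);
        bless($self, $class);
        return $self;
    }

    # Check.pm consults this hook between passes; a true return ends
    # the scan early, much as the Shortcircuit plugin does.
    sub have_shortcircuited {
        my ($self, $params) = @_;
        my $pms = $params->{permsgstatus};
        # The +10 margin over the threshold is arbitrary; negative
        # rules rarely claw back more than that.
        return $pms->get_score() >= $pms->get_required_score() + 10 ? 1 : 0;
    }

    1;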
Re: more efficient big scoring
Robert - elists wrote:

You can't run the rules in score-order without driving SA's performance into the ground. The key here is SA doesn't run tests sequentially; it runs them in parallel as it works its way through the body. This allows for good, efficient use of memory cache. By running rules in score-order, you break this, forcing SA to run through the body multiple times, degrading performance.

Mr K

SA is an awesome, incredible product and tool. Wonderful Job! I am not an expert on the programming theory, design, and implementation behind SA. So... are you saying SA takes a single email and breaks it apart into several pieces and scans those pieces via multiple processing threads and comes back with an additive single end result for that single email's multiple scan processing threads?

No, I'm saying it breaks the email into pieces, then for the first piece, it runs all the rules. Then it runs all the rules on the second piece, and the third, and the fourth, etc. Forcing score order causes it to run the whole message on one rule, then the whole message on the next rule, etc. Looping through the entire message body multiple times is slow because you defeat the benefits of processor memory caches. It's much better to loop through pieces that fit in the cache.

Think of it like an assembly line. Even with one worker, assembly line methods are overall considerably faster. Think of a worker building 100 things. That worker can get the tool he needs for the first task, do it 100 times, then get the next tool, do the next task 100 times, etc. Overall he'll finish much faster than making one thing, switching tools many times, then the next one, again switching tools many times, etc.

I do admit that I am respectfully optimistic about your team's ability to design code that would run just as fast if not faster with a score-order end result. Maybe you could let us make that decision with a local.cf knob?

Well, you can't do the score-checks with a local.cf knob. Believe me, it's been tested, *YEARS* ago.. it *killed* SA's performance. However, if you wanted to see the effects, you could use the priority setting on each rule in local.cf. You can cause them all to run in score order and see what happens...

I mean, most processors are so fast nowadays..

Wait, if they're so fast, why are you trying to optimize? It doesn't help you to try and make things faster by hoping to bail out halfway through processing, when the cost is making SA run at less than half its normal speed.

I am thinking we would brute force it under some circumstances 'till you folks come forth with even more brilliant design and implementation breakthroughs. What think?

This is a VERY, Very, Very old idea. http://issues.apache.org/SpamAssassin/show_bug.cgi?id=304 If it didn't work in 2002, it's not going to work any better now. And yes, there was the run-in-score-order idea advocated there too. See comment #10. Well, when we finally got arbitrary-order ability, doing a strict score order and checking for threshold on every rule, the performance sucked.

The better way is the current way: priority and short-circuiting. If you configure this, you can at least control the number of passes SA makes at the message, and bail out on certain trusted rules, rather than total scores. See the docs for shortcircuit: http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Plugin_Shortcircuit.html

Is there somewhere you recommend that we can view discussions on making processing faster?

Archives of sa-dev, and the bugzilla.. :-)

- rh
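For the record, the priority/shortcircuit route looks like this in local.cf, per the Shortcircuit docs linked above (rule names are from the stock ruleset; the priority values here are arbitrary):

    # Requires the Shortcircuit plugin, enabled in v320.pre:
    #   loadplugin Mail::SpamAssassin::Plugin::Shortcircuit

    # Run a cheap, trusted rule first...
    priority     USER_IN_WHITELIST   -1000
    # ...and stop scanning immediately when it fires, forcing a ham verdict.
    shortcircuit USER_IN_WHITELIST   ham

    # Push an expensive rule toward the end of the run.
    priority     RAZOR2_CHECK        900

This bails on a trusted rule hitting, not on the running total -- which is exactly the distinction Theo and Matt draw elsewhere in the thread.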
Re: more efficient big scoring
Matt Kettler wrote:

No, I'm saying it breaks the emails into pieces, then for the first piece, it runs all the rules. Then it runs all the rules on the second piece, and the third, and the fourth, etc. Forcing score order causes it to run the whole message on one rule, then the whole message on the next rule, etc. Looping through the entire message body multiple times is slow because you defeat the benefits of processor memory caches. It's much better to loop through pieces that fit in the cache. Think of it like an assembly line. Even with one worker, assembly line methods are overall considerably faster. Think of a worker building 100 things. That worker can get the tool he needs for the first task, do it 100 times, then

Ok, I've just proven myself wrong.. However, this does mean I'm concerned that there are other problems with how SA processes messages that cause the over-sorting to not matter.

What I did: as a quick, crude demonstration of how using priorities affects SA, I went and hacked the 50_scores.cf of a vanilla SA 3.2.3 into a priority.cf:

    grep -P "^score [A-Z]" 50_scores.cf | sed "s/score/priority/" | cut -d ' ' -f -3 | sed "s/\.//" > priority.cf

Note: 6 rules end up with no priority value, because these rules have lots of spaces before their scores, ie: "score RDNS_NONE 0.1". You can lint the file and fix it.

This creates a score-based priority for every rule. It's not strictly score order, as rules scored 1.0 run at priority 10, and rules scored 1.000 run at 1000, but it's close. It's certainly close enough to create lots of different priorities for the rules to run at.

In my testing, I grabbed a corpus file:
http://spamassassin.apache.org/publiccorpus/20021010_easy_ham.tar.bz2

And ran:

    time /home/mkettler/spamassassin-3.2/masses/mass-check --file easy_ham/* > test.out

Due to disk caching, I ran it 3 times on a vanilla 3.2.3 install. The first run was noticeably slower than the other 2, which gained from the disk cache. I've discarded the first run. The times I came up with were:

    real    2m25.432s
    user    2m12.880s
    sys     0m11.521s

    real    2m25.571s
    user    2m12.818s
    sys     0m11.672s

I then installed my priority.cf into /etc/mail/spamassassin, and re-ran:

    real    2m25.212s
    user    2m12.507s
    sys     0m11.694s

    real    2m25.435s
    user    2m12.852s
    sys     0m11.596s

No significant difference.

Well, it looks like I need to spend some time reading the code to study exactly how SA runs rules, and see if it's doing something that pollutes the memory cache, which would cause the over-sorting to not matter.

Note: by default, mass-check runs without network tests.
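Incidentally, the whitespace breakage could be sidestepped by generating priority.cf with a perl one-liner instead of the grep/cut pipeline -- a hypothetical equivalent using the same dot-stripping trick (-a splits on any amount of whitespace, so "score RDNS_NONE        0.1" parses fine):

    perl -ane 'next unless @F >= 3 && $F[0] eq "score" && $F[1] =~ /^[A-Z]/;
               my ($rule, $sc) = @F[1, 2]; $sc =~ s/\.//;
               print "priority $rule $sc\n"' 50_scores.cf > priority.cf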
Re: more efficient big scoring
Well, it looks like I need to spend some time reading the code to study exactly how SA runs rules, and see if it's doing something that pollutes the memory cache, which would cause the over-sorting to not matter..

As best I recall, it runs rules by type, and sorted by priority within type. There is also code to resolve meta-chaining ordering; I don't recall that I've seen that code since Justin wrote that. Since it runs rules by type, I don't think it's guaranteed that a -1000 rule will run before a -900 rule if they aren't the same rule type. (Maybe it is; I'd have to look at the code again. From what I remember that wouldn't have been guaranteed.)

There is (at least in theory) a cache advantage to doing things like running all the head tests and then all the body tests, rather than some of each. OTOH, both head and body are probably in memory, and the headers are generally not huge. The body of course may be even smaller on many spams. So I'm not *convinced* that the cache locality argument will hold up under actual testing, albeit the theory sounds good.

What is useful is starting the net tests as early as possible, and harvesting them as late as possible. However, net tests can be started early regardless of priority or short-circuiting, with (probably) minimal performance loss. If you decide the case before all the net results arrive, you just ignore the stragglers.

I would not be terribly surprised to find out that on average there was no appreciable difference in running all rules of all types in priority order, over the current method; at least if this didn't push a lot of net rule mandatory result checking too early. And even if that happened, it would slow throughput per item, but it wouldn't necessarily increase processor overhead. Indeed, it might in some cases reduce processor overhead.

Doing something like you did of assigning a priority to every rule that doesn't already have one, with the rule based on the score pretty much in the order the OP suggested, then sorting the rules by priority regardless of rule type, and running all of them that way I *suspect* will be about the same performance as the current algorithm. A reduction in performance would I suspect most likely occur from the code having to switch on rule type for each rule it runs. There are probably clever tricks (like the current eval-ed compiled procedures) that would eliminate this switch overhead.

This still doesn't necessarily check for bailing on score. But note that short-circuiting is already present. I think it is based on a 'short circuit rule' hitting rather than a score comparison. But it is still potentially a per-rule bailout test. An extra numeric comparison after each rule that evaluates true would likely be trivial compared to the other tracking that is done for rules that hit.

Loren
Re: more efficient big scoring
You can't run the rules in score-order without driving SA's performance into the ground. The key here is SA doesn't run tests sequentially; it runs them in parallel as it works its way through the body. This allows for good, efficient use of memory cache. By running rules in score-order, you break this, forcing SA to run through the body multiple times, degrading performance.

George Georgalis wrote:

Noticed today (again) how long some messages take to test. The first thing that comes to mind is some dns is getting overloaded answering joe-job rbldns backscatter, causing timeouts or slow response times. Then I was thinking about how some tests are excluded because they generate too much regex load, which can be problematic even if it's a good test.

Some time back I recall a thread amounting to: why not quit remaining tests if the spam threshold is reached? The answer was that some tests have negative scores and could change the result.

So, here are two ideas. On startup, after all the conf files are parsed, create a hash that has tests sorted by score, with the largest positive tests starting after zero, ordered like this: -5 -5 -2 -1 0 6 5 4 2 2 1. Then test in that order; whenever a test brings the message to a spam score level, exit with the result. (And add a switch to optionally run all tests.)

Another approach might be simpler to integrate than the above: simply do all the negative-score tests first and pull out if the score gets to spam level.

// George
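In toy Perl, the first idea amounts to this (invented rules; as discussed elsewhere in the thread, real SA avoids this ordering for cache reasons):

    use strict;
    use warnings;

    my $required = 5;    # spam threshold

    # Toy rules: [ name, score, test ].
    my @rules = (
        [ 'TOY_NEG_BIG', -5, sub { 0 } ],
        [ 'TOY_NEG_SML', -1, sub { 0 } ],
        [ 'TOY_POS_BIG',  6, sub { 1 } ],
        [ 'TOY_POS_MED',  4, sub { 1 } ],
        [ 'TOY_POS_SML',  1, sub { 0 } ],
    );

    # George's ordering: negatives first (most negative leading),
    # then positives from largest to smallest.
    my @ordered = sort {
        ($a->[1] >= 0) <=> ($b->[1] >= 0)
            || ($a->[1] >= 0 ? $b->[1] <=> $a->[1] : $a->[1] <=> $b->[1])
    } @rules;

    my $score = 0;
    for my $r (@ordered) {
        my ($name, $points, $test) = @$r;
        $score += $points if $test->();
        # Early exit is sound here only because every negative rule
        # has already run by the time the score can cross the line.
        if ($score >= $required) {
            print "spam ($score) -- stopped at $name\n";
            last;
        }
    }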
Re: more efficient big scoring
Yes and no. There aren't many negative scored rules, which could easily be put into a low priority to run first. The issue, which is where Matt was going I believe, is that the reason score-based short circuiting was removed is that it's horribly slow to keep checking the score after each rule runs. You can do it at the end of a priority's run, but then you have to split the rules across multiple priorities, which does impact performance.

I made some comments about this kind of thing in http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3109 and envisioned SA auto-prioritizing rules for short circuiting for things like what I mentioned in c7, but there was some strong disagreement about things like SC based on score and so it didn't get implemented in the current code.

On Fri, Jan 18, 2008 at 11:22:55PM -0500, Matt Kettler wrote:

You can't run the rules in score-order without driving SA's performance into the ground. The key here is SA doesn't run tests sequentially; it runs them in parallel as it works its way through the body. This allows for good, efficient use of memory cache. By running rules in score-order, you break this, forcing SA to run through the body multiple times, degrading performance.

George Georgalis wrote:

Noticed today (again) how long some messages take to test. The first thing that comes to mind is some dns is getting overloaded answering joe-job rbldns backscatter, causing timeouts or slow response times. Then I was thinking about how some tests are excluded because they generate too much regex load, which can be problematic even if it's a good test.

Some time back I recall a thread amounting to: why not quit remaining tests if the spam threshold is reached? The answer was that some tests have negative scores and could change the result.

So, here are two ideas. On startup, after all the conf files are parsed, create a hash that has tests sorted by score, with the largest positive tests starting after zero, ordered like this: -5 -5 -2 -1 0 6 5 4 2 2 1. Then test in that order; whenever a test brings the message to a spam score level, exit with the result. (And add a switch to optionally run all tests.)

Another approach might be simpler to integrate than the above: simply do all the negative-score tests first and pull out if the score gets to spam level.

// George

--
Randomly Selected Tagline: No one can feel as helpless as the owner of a sick goldfish.
RE: more efficient big scoring
You can't run the rules in score-order without driving SA's performance into the ground. The key here is SA doesn't run tests sequentially; it runs them in parallel as it works its way through the body. This allows for good, efficient use of memory cache. By running rules in score-order, you break this, forcing SA to run through the body multiple times, degrading performance.

Mr K

SA is an awesome, incredible product and tool. Wonderful Job!

I am not an expert on the programming theory, design, and implementation behind SA. So... are you saying SA takes a single email and breaks it apart into several pieces and scans those pieces via multiple processing threads and comes back with an additive single end result for that single email's multiple scan processing threads?

I do admit that I am respectfully optimistic about your team's ability to design code that would run just as fast if not faster with a score-order end result. Maybe you could let us make that decision with a local.cf knob? I mean, most processors are so fast nowadays..

I am thinking we would brute force it under some circumstances 'till you folks come forth with even more brilliant design and implementation breakthroughs. What think?

Is there somewhere you recommend that we can view discussions on making processing faster?

:-)

- rh
Re: more efficient big scoring
From: Robert - elists [EMAIL PROTECTED]
Sent: Friday, 2008, January 18 21:14

You can't run the rules in score-order without driving SA's performance into the ground. The key here is SA doesn't run tests sequentially; it runs them in parallel as it works its way through the body. This allows for good, efficient use of memory cache. By running rules in score-order, you break this, forcing SA to run through the body multiple times, degrading performance.

Mr K

SA is an awesome, incredible product and tool. Wonderful Job! I am not an expert on the programming theory, design, and implementation behind SA. So... are you saying SA takes a single email and breaks it apart into several pieces and scans those pieces via multiple processing threads and comes back with an additive single end result for that single email's multiple scan processing threads?

Before going further you should try to find a really good discussion of how perl parses regular expressions. Oversimplifications can lead to massive pessimization of the code in the name of optimization.

{^_^}
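A concrete instance of Loren's earlier point that the same regex can be blindingly fast on one message and dog slow on another -- a nested-quantifier pattern whose cost hinges entirely on the input (toy code; timings are illustrative):

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $re = qr/^(a+)+$/;    # nested quantifiers: a textbook backtracking trap

    for my $input ('a' x 24, ('a' x 24) . 'b') {
        my $t0  = time;
        my $hit = $input =~ $re;
        printf "%d chars, ends '%s' -> %-8s %.3fs\n",
            length($input), substr($input, -1),
            ($hit ? 'match' : 'no match'), time - $t0;
    }
    # The all-'a' string matches almost instantly; the trailing 'b'
    # forces the engine to try on the order of 2**24 ways to split the
    # a's before giving up. (Recent perls may partially defuse this
    # particular pattern internally; if so, lengthen the string.)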