Re: more efficient big scoring
From: "Robert - elists" <[EMAIL PROTECTED]>
Sent: Friday, 2008, January 18 21:14

> > You can't run the rules in score-order without driving SA's performance
> > into the ground.
> >
> > The key here is SA doesn't run tests sequentially, it runs them in
> > parallel as it works its way through the body. This allows for good,
> > efficient use of memory cache.
> >
> > By running rules in score-order, you break this, forcing SA to run
> > through the body multiple times, degrading performance.
>
> Mr K
>
> SA is an awesome, incredible product and tool. Wonderful Job!
>
> I am not an expert on the programming theory, design, and implementation
> behind SA.
>
> So... are you saying SA takes a single email and breaks it apart into
> several pieces and scans those pieces via multiple processing threads and
> comes back with an additive single end result for that single email's
> multiple scan processing threads?

Before going further you should try to find a really good discussion of how Perl parses regular expressions. Oversimplifications can lead to massive pessimization of the code in the name of optimization.

{^_^}
RE: more efficient big scoring
> > You can't run the rules in score-order without driving SA's performance
> > into the ground.
> >
> > The key here is SA doesn't run tests sequentially, it runs them in
> > parallel as it works its way through the body. This allows for good,
> > efficient use of memory cache.
> >
> > By running rules in score-order, you break this, forcing SA to run
> > through the body multiple times, degrading performance.
>

Mr K

SA is an awesome, incredible product and tool. Wonderful Job!

I am not an expert on the programming theory, design, and implementation behind SA.

So... are you saying SA takes a single email and breaks it apart into several pieces and scans those pieces via multiple processing threads and comes back with an additive single end result for that single email's multiple scan processing threads?

I do admit that I am respectfully optimistic about your team's ability to design code that would run just as fast, if not faster, with a "score order" end result.

Maybe you could let us make that decision with a local.cf knob?

I mean, most processors are so fast nowadays.. I am thinking we would brute force it under some circumstances until you folks come forth with even more brilliant design and implementation breakthroughs.

What think?

Is there somewhere you recommend that we can view discussions on making processing faster?

:-)

- rh
Re: more efficient big scoring
Yes and no. There aren't many negative-scored rules, and they could easily be put into a low priority to run first. The issue, which is where Matt was going I believe, is that score-based short-circuiting was removed because it's horribly slow to keep checking the score after each rule runs. You can do it at the end of a priority's run, but then you have to split the rules across multiple priorities, which does impact performance.

I made some comments about this kind of thing in http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3109 and envisioned SA auto-prioritizing rules for short circuiting for things like what I mentioned in c7, but there was some strong disagreement about things like SC based on score and so it didn't get implemented in the current code.

On Fri, Jan 18, 2008 at 11:22:55PM -0500, Matt Kettler wrote:
> You can't run the rules in score-order without driving SA's performance
> into the ground.
>
> The key here is SA doesn't run tests sequentially, it runs them in
> parallel as it works its way through the body. This allows for good,
> efficient use of memory cache.
>
> By running rules in score-order, you break this, forcing SA to run
> through the body multiple times, degrading performance.
>
>
> George Georgalis wrote:
> >Noticed today (again) how long some messages take to test. The
> >first thing that comes to mind is some DNS is getting overloaded
> >answering joe-job rbldns backscatter, causing timeouts or slow
> >response times.
> >
> >Then I was thinking about how some tests are excluded because they
> >generate too much regex load, which can be problematic even if
> >it's a good test.
> >
> >Some time back I recall a thread, amounting to why not quit
> >remaining tests if spam threshold is reached, the answer was some
> >tests have negative scores and could change the result.
> >
> >So, here are two ideas. On startup, after all the conf files are
> >parsed, create a hash that has tests sorted by score, with the
> >largest positive tests starting after zero, ordered like this
> >
> >-5
> >-5
> >-2
> >-1
> >0
> >6
> >5
> >4
> >2
> >2
> >1
> >
> >then test in that order, whenever a test brings the message
> >to a spam score level, exit with result. (and add a switch to
> >optionally run all tests)
> >
> >Another approach might be simpler to integrate than above, simply
> >do all the negative score tests first and pull out if the score
> >gets to spam level.
> >
> >// George

--
Randomly Selected Tagline:
No one can feel as helpless as the owner of a sick goldfish.
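The trade-off described above (checking the score only at the end of each priority's run, rather than after every rule) can be sketched roughly as follows. This is an illustration only, not SpamAssassin's code: the rule tuples, the `scan_by_priority` name, and the 5.0 threshold are all assumptions made up for the example, and it assumes the caller has placed every negative-scored rule in the lowest priority so that stopping early cannot miss a score reduction.

```python
from collections import defaultdict

def scan_by_priority(message, rules, threshold=5.0):
    """Run rules grouped by priority; compare the score to the threshold
    only once per priority group, not after every single rule.

    rules: iterable of (name, priority, score, test) where test(message)
    returns True on a hit. Lower priority values run first. All names
    here are hypothetical, not SpamAssassin internals.
    """
    groups = defaultdict(list)
    for name, priority, score, test in rules:
        groups[priority].append((name, score, test))

    total = 0.0
    hits = []
    threshold_checks = 0
    for priority in sorted(groups):
        for name, score, test in groups[priority]:
            if test(message):
                total += score
                hits.append(name)
        # One comparison per group: cheap, but rules must be split into groups.
        threshold_checks += 1
        if total >= threshold:
            break
    return total, hits, threshold_checks
```

With three groups (negative rules, high-scoring rules, everything else), a message that crosses the threshold in the second group never reaches the third, at the cost of splitting the rule set across several passes.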
Re: more efficient big scoring
You can't run the rules in score-order without driving SA's performance into the ground.

The key here is SA doesn't run tests sequentially, it runs them in parallel as it works its way through the body. This allows for good, efficient use of memory cache.

By running rules in score-order, you break this, forcing SA to run through the body multiple times, degrading performance.

George Georgalis wrote:
> Noticed today (again) how long some messages take to test. The
> first thing that comes to mind is some DNS is getting overloaded
> answering joe-job rbldns backscatter, causing timeouts or slow
> response times.
>
> Then I was thinking about how some tests are excluded because they
> generate too much regex load, which can be problematic even if
> it's a good test.
>
> Some time back I recall a thread, amounting to why not quit
> remaining tests if spam threshold is reached, the answer was some
> tests have negative scores and could change the result.
>
> So, here are two ideas. On startup, after all the conf files are
> parsed, create a hash that has tests sorted by score, with the
> largest positive tests starting after zero, ordered like this
>
> -5
> -5
> -2
> -1
> 0
> 6
> 5
> 4
> 2
> 2
> 1
>
> then test in that order, whenever a test brings the message
> to a spam score level, exit with result. (and add a switch to
> optionally run all tests)
>
> Another approach might be simpler to integrate than above, simply
> do all the negative score tests first and pull out if the score
> gets to spam level.
>
> // George
Re: sa-learn error message
> Hello Craig,
>
> I recently ran into this problem myself. The solution,
> after being a dolt and not running a backup first, was
> the following sequence followed by line definitions:
>
>   /etc/init.d/mailserver stop
>   sa-learn --backup > /etc/mail/spamassassin/database.bak
>   sa-learn --dump magic
>   sa-learn --no-sync --ham --progress --mbox /export/home/brian/Ham
>   sa-learn --sync
>   sa-learn --no-sync --spam --progress --mbox /export/home/brian/Spam
>   sa-learn --sync
>   sa-learn --dump magic
>   spamassassin -D --lint
>   /etc/init.d/mailserver start
>
> 1) Shutdown Sendmail/ClamAV/MIMEDefang/Spamassassin.
> 2) Backup the database.
> 3) View current statistics which will also display the
>    current bayes database version.
> 4) Do a ham learn.
> 5) This one was key! Even after everything was parsed
>    and the command line came back, the database was still
>    not in a happy place. Doing the --sync brings it to that
>    happy place.
> 6) Do a spam learn.
> 7) See #5.
> 8) View current statistics and note nham and nspam
>    increases.
> 9) Run through the rules to make sure
>    everything is still cool and no errors occur.
> 10) Start Sendmail/ClamAV/MIMEDefang/Spamassassin.
>
> Notes:
>
> - Doing a --sync on the sa-learn learning process didn't
>   work. I'm not sure why the system doesn't learn the file
>   and then just resync the database when it's done. Maybe
>   Theo has an idea.
> - Shutting down the MTA isn't ideal but
>   it prevents lock file conflicts which don't seem to work
>   too well under Solaris 8. Mail queues in the ether for
>   about 30 minutes while all of this is going on. I've
>   even thought about automating the process which would
>   help keep the Ham and Spam files at a reasonable size and
>   shorten that to about 5 minutes.
>
> -BE
>

Why do you put that --no-sync argument after each learning command in the first place?

I have used it when learning several messages one at a time, and then later running --sync.

But in your script, I see no reason for first learning with --no-sync and then running --sync right after it.
Re: sa-learn error message
Hello Craig,

I recently ran into this problem myself. The solution, after being a dolt and not running a backup first, was the following sequence followed by line definitions:

  /etc/init.d/mailserver stop
  sa-learn --backup > /etc/mail/spamassassin/database.bak
  sa-learn --dump magic
  sa-learn --no-sync --ham --progress --mbox /export/home/brian/Ham
  sa-learn --sync
  sa-learn --no-sync --spam --progress --mbox /export/home/brian/Spam
  sa-learn --sync
  sa-learn --dump magic
  spamassassin -D --lint
  /etc/init.d/mailserver start

1) Shutdown Sendmail/ClamAV/MIMEDefang/Spamassassin.
2) Backup the database.
3) View current statistics which will also display the current bayes database version.
4) Do a ham learn.
5) This one was key! Even after everything was parsed and the command line came back, the database was still not in a happy place. Doing the --sync brings it to that happy place.
6) Do a spam learn.
7) See #5.
8) View current statistics and note nham and nspam increases.
9) Run through the rules to make sure everything is still cool and no errors occur.
10) Start Sendmail/ClamAV/MIMEDefang/Spamassassin.

Notes:

- Doing a --sync on the sa-learn learning process didn't work. I'm not sure why the system doesn't learn the file and then just resync the database when it's done. Maybe Theo has an idea.
- Shutting down the MTA isn't ideal but it prevents lock file conflicts which don't seem to work too well under Solaris 8. Mail queues in the ether for about 30 minutes while all of this is going on. I've even thought about automating the process which would help keep the Ham and Spam files at a reasonable size and shorten that to about 5 minutes.

-BE

> Hi again SA experts,
>
> Note the error message in the 2nd-last line of the following transcript:
>
> animalhead:~/sj $ sa-learn --no-rebuild --spam --mbox savejunk
> The --no-rebuild option has been deprecated. Please use --no-sync instead.
> Learned tokens from 3025 message(s) (3047 message(s) examined)
> animalhead:~/sj $ sa-learn --no-sync --spam thruJunk
> bayes: bayes db version 0 is not able to be used, aborting! at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/BayesStore/DBM.pm line 196.
> Learned tokens from 170 message(s) (170 message(s) examined)
>
> There are 171 messages in directory thruJunk. The largest is 495K,
> the next largest is 137K.
>
> $ sa-learn -V yields "spamassassin v 3.2.1"
>
> What should I do about this? I still have another directory with ham
> to go. It includes lots of large files. Should I delete those over a
> certain size?
>
> Thanks,
> Craig MacKenna
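Since Brian mentions wanting to automate the procedure, here is a minimal sketch of how the ten steps might be assembled into an ordered command list for review before anything is run. The function name `relearn_commands` and its parameters are hypothetical; the commands themselves are only the ones quoted in the message, and the sketch deliberately builds strings rather than executing them.

```python
import shlex

def relearn_commands(ham_mbox, spam_mbox, backup_path,
                     service="/etc/init.d/mailserver"):
    """Return the relearn sequence from the message as shell command strings.

    Nothing is executed here; the caller can print the list for a dry run
    or hand each string to a shell. Paths are quoted so that names with
    spaces survive.
    """
    q = shlex.quote
    return [
        f"{q(service)} stop",                      # 1) stop the mail stack
        f"sa-learn --backup > {q(backup_path)}",   # 2) back up the database
        "sa-learn --dump magic",                   # 3) statistics before
        f"sa-learn --no-sync --ham --progress --mbox {q(ham_mbox)}",    # 4)
        "sa-learn --sync",                         # 5) sync after ham
        f"sa-learn --no-sync --spam --progress --mbox {q(spam_mbox)}",  # 6)
        "sa-learn --sync",                         # 7) sync after spam
        "sa-learn --dump magic",                   # 8) statistics after
        "spamassassin -D --lint",                  # 9) check the rules
        f"{q(service)} start",                     # 10) restart the mail stack
    ]
```

Printing the result of `relearn_commands("/export/home/brian/Ham", "/export/home/brian/Spam", "/etc/mail/spamassassin/database.bak")` reproduces the sequence above line for line, which makes it easy to compare against what was actually typed.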
Disabling eval rules (was: Re: Testing Botnet)
On Sat, 2008-01-12 at 12:23 -0800, Robert - elists wrote:
> > Sounds like you've been hit by bug 5519 [1] before the upgrade in Oct.
> > Setting rules scores to 0 did *not* prevent these tests from being
> > evaluated for SA 3.2.x before 3.2.3.
> >
> > Fixed since 3.2.3. Plugin eval rules with 0 scores are meant to not be
> > evaluated, and of course to not show up in the report.
>
> Interesting, does this mean that we should be changing scores we care about
> and want to see eval'd in the reports to .01 or something similar?
>
> Any other implications in the bug and current or future fix methods?

AFAIK, nope. That should be all.

However... I noticed that even my SA 3.2.4 still evaluates my URICountry plugin rules, which are set to a score of 0.0 [1]. Which actually should *not* happen since 3.2.3.

Anyone got a guess why? Devs?

  guenther

[1] originally set up for exactly this testing purpose, btw

--
char *t="[EMAIL PROTECTED]";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
more efficient big scoring
Noticed today (again) how long some messages take to test. The first thing that comes to mind is some DNS is getting overloaded answering joe-job rbldns backscatter, causing timeouts or slow response times.

Then I was thinking about how some tests are excluded because they generate too much regex load, which can be problematic even if it's a good test.

Some time back I recall a thread, amounting to why not quit remaining tests if spam threshold is reached, the answer was some tests have negative scores and could change the result.

So, here are two ideas. On startup, after all the conf files are parsed, create a hash that has tests sorted by score, with the largest positive tests starting after zero, ordered like this

-5
-5
-2
-1
0
6
5
4
2
2
1

then test in that order, whenever a test brings the message to a spam score level, exit with result. (and add a switch to optionally run all tests)

Another approach might be simpler to integrate than above, simply do all the negative score tests first and pull out if the score gets to spam level.

// George

--
George Georgalis, information system scientist <
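The first idea can be sketched roughly as below. This only illustrates the proposed ordering and early exit; it says nothing about how SpamAssassin actually schedules rules, and the replies in this thread argue that running rules this way would hurt performance. The `Rule` tuple, the function names, and the 5.0 threshold are assumptions invented for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical rule shape: (name, score, test function returning True on a hit)
Rule = Tuple[str, float, Callable[[str], bool]]

def order_rules(rules: List[Rule]) -> List[Rule]:
    """Negative and zero scores first (most negative first),
    then positive scores from largest to smallest."""
    non_positive = sorted((r for r in rules if r[1] <= 0), key=lambda r: r[1])
    positive = sorted((r for r in rules if r[1] > 0), key=lambda r: -r[1])
    return non_positive + positive

def scan(message: str, rules: List[Rule], threshold: float = 5.0,
         run_all: bool = False):
    """Evaluate rules in the proposed order and stop once the threshold is hit.

    Stopping is only considered while evaluating positive rules: by then
    every negative rule has already run, so the total can no longer drop.
    run_all=True is the proposed switch to run every test regardless.
    """
    total = 0.0
    hits = []
    for name, score, test in order_rules(rules):
        if test(message):
            total += score
            hits.append(name)
        if not run_all and score > 0 and total >= threshold:
            break
    return total, hits
```

The second idea in the message (run all negative tests first, then bail out at the threshold) falls out of the same ordering, since the early exit never triggers before the non-positive rules have finished.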
Re: disable all network test except ...
On Friday 18 January 2008 13:46, you wrote:
> Stefan Jakobs wrote:
> > Hello list,
> >
> > I'm using amavisd-new with spamassassin and for some tests I have to
> > disable all network tests in spamassassin except for sorbs, njabl, uribl
> > and maybe some other blackhole lists.
> > I guess I can comment out the corresponding header lines in the files
> > 20_dnsbl_tests.cf and 25_uribl.cf.
>
> Don't do that. Your changes will get clobbered whenever you run
> sa-update or upgrade SA versions.

I know. I will run spamassassin with DNS queries disabled only for performance tests, and during that time I will not change the system.

> > And also deactivate the plugins for razor,
> > pyzor and so on. But is this enough, or is there an easier way to disable
> > most of the network tests?
>
> Set their score to 0 in your local.cf. Note for RBLs you'll need to set
> a 0 score for the normally un-scored "root" rule for that RBL, which is
> the one using check_rbl, not check_rbl_sub.
>
> For example, to disable all the spamhaus tests:
>
> score __RCVD_IN_ZEN 0
> score RCVD_IN_SBL 0
> score RCVD_IN_XBL 0
> score RCVD_IN_PBL 0
>
> The only one you *really* need is the first one, as that one disables
> the DNS query. However, disabling the sub-tests will save you a little
> CPU and prevent SA from constantly checking an empty result to see if
> different IPs match it.

OK, that's good to know. Are there any other network tests that are not mentioned in the following files?

20_dnsbl_tests.cf
25_uribl.cf
50_scores.cf

Thanks guys.
Stefan
Re: disable all network test except ...
Stefan Jakobs wrote:
> Hello list,
>
> I'm using amavisd-new with spamassassin and for some tests I have to
> disable all network tests in spamassassin except for sorbs, njabl, uribl
> and maybe some other blackhole lists.
> I guess I can comment out the corresponding header lines in the files
> 20_dnsbl_tests.cf and 25_uribl.cf. And also deactivate the plugins for
> razor, pyzor and so on. But is this enough, or is there an easier way to
> disable most of the network tests?

Create a scores.cf file in the directory where you have local.cf, and set the scores to zero for any rule you want to disable. Look at 50_scores.cf in the spamassassin "core" rules directory for the names of the rules.
Re: The googolbees are getting craftier
Quoting Justin Mason <[EMAIL PROTECTED]>:

> the redirect detection should have no problem finding that...

And the redirected-to domain is on two SURBL blacklists, so it should be hitting.

Jeff C.

> Loren Wilton writes:
> > I guess btnl is no longer working. Now they are doing a redirect:
> >
> > http://google.co.uk///pagead/iclk?sa=l&ai=livermore&num=970&adurl=http://-low-rate.tw?beast
> >
> > Loren
Re: disable all network test except ...
Stefan Jakobs wrote:
> Hello list,
>
> I'm using amavisd-new with spamassassin and for some tests I have to
> disable all network tests in spamassassin except for sorbs, njabl, uribl
> and maybe some other blackhole lists.
> I guess I can comment out the corresponding header lines in the files
> 20_dnsbl_tests.cf and 25_uribl.cf.

Don't do that. Your changes will get clobbered whenever you run sa-update or upgrade SA versions.

> And also deactivate the plugins for razor, pyzor and so on. But is this
> enough, or is there an easier way to disable most of the network tests?

Set their score to 0 in your local.cf. Note for RBLs you'll need to set a 0 score for the normally un-scored "root" rule for that RBL, which is the one using check_rbl, not check_rbl_sub.

For example, to disable all the spamhaus tests:

score __RCVD_IN_ZEN 0
score RCVD_IN_SBL 0
score RCVD_IN_XBL 0
score RCVD_IN_PBL 0

The only one you *really* need is the first one, as that one disables the DNS query. However, disabling the sub-tests will save you a little CPU and prevent SA from constantly checking an empty result to see if different IPs match it.
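For a longer list of rules, the score lines can be generated rather than typed by hand. The helper below is a hypothetical convenience, not part of SpamAssassin; it only emits `score NAME 0` lines in the form shown above, and the name check (letters, digits and underscores, not starting with a digit) is an assumption about valid rule names used here as a sanity check.

```python
import re

# Assumed shape of a rule name; used only to catch obvious typos.
_RULE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def zero_score_lines(rule_names):
    """Return local.cf text that sets each named rule's score to 0."""
    lines = []
    for name in rule_names:
        if not _RULE_NAME.match(name):
            raise ValueError(f"suspicious rule name: {name!r}")
        lines.append(f"score {name} 0")
    return "\n".join(lines) + "\n"
```

Calling `zero_score_lines(["__RCVD_IN_ZEN", "RCVD_IN_SBL", "RCVD_IN_XBL", "RCVD_IN_PBL"])` yields the four Spamhaus lines from the message, ready to append to local.cf and check with `spamassassin -D --lint`.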
Re: The googolbees are getting craftier
the redirect detection should have no problem finding that...

Loren Wilton writes:
> I guess btnl is no longer working. Now they are doing a redirect:
>
> http://google.co.uk///pagead/iclk?sa=l&ai=livermore&num=970&adurl=http://-low-rate.tw?beast
>
> Loren
disable all network test except ...
Hello list,

I'm using amavisd-new with spamassassin and for some tests I have to disable all network tests in spamassassin except for sorbs, njabl, uribl and maybe some other blackhole lists.
I guess I can comment out the corresponding header lines in the files 20_dnsbl_tests.cf and 25_uribl.cf. And also deactivate the plugins for razor, pyzor and so on. But is this enough, or is there an easier way to disable most of the network tests?

Thanks for your help.
Stefan
The googolbees are getting craftier
I guess btnl is no longer working. Now they are doing a redirect:

http://google.co.uk///pagead/iclk?sa=l&ai=livermore&num=970&adurl=http://christmas-low-rate.tw?beast

Loren