sa-learn and modern spam sizes
Hi all,

I found out today that the reason a spammer was giving me pain when I tried to learn his spam mails as spam for Bayes is that they are too big. There are a couple of hits on Google from various people with the same problem; I didn't find many answers, but it appears the hardcoded limit is somewhere between 256 and 512 kilobytes. These spam mails are all between 1 and 2 MB, so is there no way to learn them as spam with Bayes?

On top of that he's sending via Gmail, making it hard to use RBLs.

Looking forward to suggestions and clarification of the sa-learn limit (if it's hardcoded, I would strongly suggest making it configurable).

Med venlig hilsen / Best regards

Jonas Akrouh Larsen
TechBiz ApS
Laplandsgade 4, 2. sal
2300 København S
Office: 7020 0979
Direct: 3336 9974
Mobile: 5120 1096
Fax: 7020 0978
Web: www.techbiz.dk
Re: sa-learn and modern spam sizes
On 16.12.2011 13:30, Jonas wrote:
> I found out today that the reason a spammer was giving me pain when I
> tried to learn his spam mails as spam for Bayes is that they are too
> big. [...] it appears the hardcoded limit is somewhere between 256 and
> 512 kilobytes.

man spamc:

    -s max_size, --max-size=max_size
        Set the maximum message size which will be sent to spamd -- any
        bigger than this threshold and the message will be returned
        unprocessed (default: 500 KB). If spamc gets handed a message
        bigger than this, it won't be passed to spamd. The maximum
        message size is 256 MB.

> These spam mails are all between 1 and 2 MB, so is there no way to
> learn them as spam with Bayes?

I guess this isn't a big problem; it may consume machine power and time, but the gurus should know more.

--
Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
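[For readers who do use spamd/spamc, the limit quoted above can be raised per invocation. A minimal sketch based on that man page excerpt; the message file name is a placeholder:

    # let spamc pass messages of up to 2 MB (value in bytes) to spamd,
    # instead of the 500 KB default; message.eml is a placeholder
    spamc --max-size=2097152 < message.eml
]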
Re: Apply Bayes learning to all users?
On Fri, 16 Dec 2011 08:54:36 +0100, Benny Pedersen wrote:
> On Fri, 16 Dec 2011 06:30:31 +0000, Martin Hepworth wrote:
> > Create a shared IMAP or similar email account with a spam and a ham
> > folder for users to drag email into (not forward, as that breaks
> > headers in things like Outlook).
> yes, here I found dovecot-antispam helpful in that way

I think you've both misread the question. The OP wants to use spamtrap mail to train the individual user Bayes accounts.

The best way to do this would be to use the global database to adjust the probabilities for low-count tokens in the user databases. Nothing like that is supported.

Doing it via sa-learn (roughly as sketched below) sounds like more trouble than it's worth. It's probably a good thing for high-volume accounts, but swamping low-volume accounts may make things worse.
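[For anyone who wants to try the sa-learn route anyway, it would look roughly like the following. A hedged sketch: the usernames and the spamtrap mbox path are invented for illustration, and the -u switch only selects a per-user database in setups (e.g. SQL Bayes) that key the Bayes store on a username:

    # replay a spamtrap mailbox into each user's Bayes database;
    # alice/bob and /var/spamtrap/trap.mbox are placeholders
    for u in alice bob; do
        sa-learn -u "$u" --spam --mbox /var/spamtrap/trap.mbox
    done
]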
RE: sa-learn and modern spam sizes
> man spamc:
>     -s max_size, --max-size=max_size
>         Set the maximum message size which will be sent to spamd
>         (default: 500 KB). [...] The maximum message size is 256 MB.

I do not use spamd/spamc, but the Perl module of SpamAssassin. So is there no way to get around this?

Med venlig hilsen / Best regards

Jonas Akrouh Larsen
TechBiz ApS
SA Sorbs Usage/Rules
I know some of the discussions in the past about the usage of SORBS RBLs in SpamAssassin. The scores today are as follows:

    score RCVD_IN_SORBS_BLOCK  0                # n=0 n=1 n=2 n=3
    score RCVD_IN_SORBS_DUL    0 0.001 0 0.001  # n=0 n=2
    score RCVD_IN_SORBS_HTTP   0 2.499 0 0.001  # n=0 n=2
    score RCVD_IN_SORBS_MISC   0                # n=0 n=1 n=2 n=3
    score RCVD_IN_SORBS_SMTP   0                # n=0 n=1 n=2 n=3
    score RCVD_IN_SORBS_SOCKS  0 2.443 0 1.927  # n=0 n=2
    score RCVD_IN_SORBS_WEB    0 0.614 0 0.770  # n=0 n=2
    score RCVD_IN_SORBS_ZOMBIE 0                # n=0 n=1 n=2 n=3

The zero scores for DUL were set because a lot of people thought there were too many false positives in it (I don't see that, but OK). Another argument for zero-scoring or not using SORBS was that the RBL contains a lot of old (i.e. no longer current) entries in its spam section, given the delisting policy. OK.

But today I took a deeper look at the SORBS RBLs and found that there is a very simple misconfiguration in the SA rules. The RBL check is done against the big 'dnsbl.sorbs.net' zone:

    eval:check_rbl('sorbs', 'dnsbl.sorbs.net.')

And _that_, in my opinion, is wrong. The RBL lookup should be done against 'safe.dnsbl.sorbs.net' instead. This RBL is a compilation of most of the same SORBS partial lists as dnsbl.sorbs.net, with one simple difference: unlike dnsbl.sorbs.net, it does not contain the 'recent.spam' and 'old.spam' partial lists. The only spam listed in 'safe.dnsbl.sorbs.net' is spam of the last 24 hours, so the arguments against using SORBS specifically because of its spam delisting policy do not apply. One could simply change the RBL lookup to the right zone and then also give spam hits within that RBL a (low) score.

A description of the different SORBS partial zones and the aggregate zones is here: https://www.sorbs.net/using.shtml
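[For anyone wanting to test this proposal locally rather than wait for a rules change, an override along these lines in local.cf should do it. A hedged sketch: the subrule name __RCVD_IN_SORBS is assumed from the stock 20_dnsbl_tests.cf and may differ between SA versions:

    # local.cf sketch: point the shared SORBS lookup at the 'safe'
    # aggregate zone instead of the full dnsbl.sorbs.net zone
    header __RCVD_IN_SORBS eval:check_rbl('sorbs', 'safe.dnsbl.sorbs.net.')
]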
Re: SA Sorbs Usage/Rules
Interesting. Will cross-post to dev and see if anyone has some input.

On 12/16/2011 12:22 PM, Lutz Petersen wrote:
> [...]

--
Kevin A. McGrail
President
Peregrine Computer Consultants Corporation
3927 Old Lee Highway, Suite 102-C
Fairfax, VA 22030-2422
http://www.pccc.com/
703-359-9700 x50 / 800-823-8402 (Toll-Free)
703-359-8451 (fax)
kmcgr...@pccc.com
Re: sa-learn and modern spam sizes
On Fri, 16 Dec 2011 12:06:15 -0500, Kevin A. McGrail wrote:
> > I do not use spamd/spamc, but the Perl module of SpamAssassin. So is
> > there no way to get around this?
>
> Hmm. I didn't think SA had a limit internally. Normally you apply a
> limit in spamc (-s/--max-size) or in procmail, with a condition such
> as "* < 524288". But if you call SA directly as an API, I don't think
> there is a limit. You might want to post this on the dev list.

It's an optional limit in ArchiveIterator.pm. It's turned on in sa-learn.
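[To illustrate that last point: if you drive the learner through the Perl API yourself instead of through sa-learn, no ArchiveIterator (and hence no size cap) is involved. A minimal, untested sketch; the file name is a placeholder and error handling is kept to a bare die:

    # learn one large message as spam via the Mail::SpamAssassin API,
    # bypassing the ArchiveIterator size limit used by sa-learn
    use strict;
    use warnings;
    use Mail::SpamAssassin;

    my $sa = Mail::SpamAssassin->new();
    $sa->init_learner();                  # set up Bayes for learning

    # slurp the raw message (big-spam.eml is a placeholder path)
    open(my $fh, '<', 'big-spam.eml') or die "open: $!";
    my $raw = do { local $/; <$fh> };
    close($fh);

    my $mail   = $sa->parse($raw);        # parse into a message object
    my $status = $sa->learn($mail, undef, 1, 0);  # isspam=1, forget=0
    $status->finish();
    $mail->finish();
    $sa->finish_learner();
]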
Re: SA Sorbs Usage/Rules
On 12/16, Lutz Petersen wrote:
> [...]
> The only spam listed in 'safe.dnsbl.sorbs.net' is spam of the last 24
> hours, so the arguments against using SORBS specifically because of
> its spam delisting policy do not apply.

After digging into this a bit, I believe your entire objection is to the default rule set not handling the 127.0.0.6 return code, used by the following lists?

    new.spam.dnsbl.sorbs.net       127.0.0.6
    recent.spam.dnsbl.sorbs.net    127.0.0.6
    old.spam.dnsbl.sorbs.net       127.0.0.6
    spam.dnsbl.sorbs.net           127.0.0.6
    escalations.dnsbl.sorbs.net    127.0.0.6

The rule for that return code is commented out in the default rule set with this comment:

    # delist: $50 fee for RCVD_IN_SORBS_SPAM, others have free retest on request

Which seems likely to have resulted from this bug:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=2221

The lists returning the 127.0.0.6 code in the safe.dnsbl.sorbs.net aggregate zone are:

    new.spam.dnsbl.sorbs.net
    recent.spam.dnsbl.sorbs.net
    escalations.dnsbl.sorbs.net

new.spam is only hosts from the last 48 hours. recent.spam is hosts from the last 28 days. escalations doesn't seem to have a time limit. So your statement that the only spam listed in 'safe.dnsbl.sorbs.net' is spam of the last 24 hours seems to be incorrect.

Basically, without evidence that money is not charged for delisting from any of those three lists, they're going to stay out of the default rule set.

With the currently enabled default rules, there would be *no* difference if you changed from dnsbl.sorbs.net to safe.dnsbl.sorbs.net, because we're not using the lists as an aggregate (we don't have only a single RCVD_IN_SORBS rule) but have separate rules for each of the return codes. And there is no difference in which lists provide which return codes between those two aggregate zones, other than the 127.0.0.6 (spam) value, which is disabled; a sketch for re-enabling it locally follows after this message.
Also, I wouldn't say the 0 scores were set because a lot of people thought there were too many false positives. The scores are flagged as mutable, meaning optimal scores are generated daily using masscheck data. Related statistics can be seen here:

http://ruleqa.spamassassin.org/?daterev=20111210&rule=%2Fsorbs

RCVD_IN_SORBS_DUL seems to have a decent hit rate for both spam and ham, so somehow the score generator decided that the most spam would be caught without exceeding 1 false positive in 2,500 hams with that score. It's not always clear what exactly it's thinking. It could be, for example, that almost all of the spam hits from RCVD_IN_SORBS_DUL overlapped with another blacklist, and the SORBS_DUL list caused more false positives than that other blacklist, so the other blacklist got a decent score and SORBS_DUL didn't. But these scores do not come from the whims of humans.

--
"Anarchy is based on the observation that since few are fit to rule
themselves, even fewer are fit to rule others." - Edward Abbey
http://www.ChaosReigns.com
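[For admins who disagree with that default and accept the delisting-policy concern, the disabled return-code rule discussed above can be re-enabled locally. A hedged sketch for local.cf: the rule name matches the commented-out stock rule, and the score is a deliberately low, purely illustrative value:

    # re-enable the SORBS 127.0.0.6 (spam) return code locally;
    # 0.5 is an arbitrary, illustrative score
    header   RCVD_IN_SORBS_SPAM eval:check_rbl_sub('sorbs', '127.0.0.6')
    describe RCVD_IN_SORBS_SPAM Listed in SORBS spam zone
    score    RCVD_IN_SORBS_SPAM 0.5
]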
Re: sa-learn and modern spam sizes
> The maximum message size is 256 MB.

I've never seen spam larger than 3 MB.

Joseph Brennan
Lead Email Systems Engineer
Columbia University Information Technology
Re: SA Sorbs Usage/Rules
On Fri, 2011-12-16 at 13:57 -0500, dar...@chaosreigns.com wrote:
> Basically, without evidence that money is not charged for delisting
> from any of those three lists, they're going to stay out of the
> default rule set.

Plenty of people can attest to the fact that there is no payment taking place; it's just a scare tactic to coerce admins into acting rather than ignoring the listing and hoping it sorts itself out. I don't use DNSBLs in SA myself; I use them in the MTA (frankly, where they belong).

At least under the control of its original owner there wasn't any payment. And yes, like most large ISPs we a couple of times had the odd outbound SMTP server listed with them. Typically we were alerted to the listing quickly (by use of mon); after a login to the SORBS site for info, the culprit was identified and we were delisted within hours. Only once did it take about 24 hours, and, IIRC, that was during a holiday season. I'm happy to say none of my servers have been listed anywhere that I know of since 2005.

Lastly, I would have thought the SA dev team would want to see hard evidence that someone was _forced_ to pay the $50 donation to be delisted, because all I hear is "the web site says so", which frankly doesn't cut it with me. We were nobody special to SORBS, so I can't see why they'd remove us for free but forcibly demand payment from others; the only common ground we had with Matt back then was that we were located in the same city, along with 2 million others.
Re: sa-learn and modern spam sizes
On Fri, 2011-12-16 at 18:17 -0500, Joseph Brennan wrote:
> > The maximum message size is 256 MB.
>
> I've never seen spam larger than 3 MB.

About three years ago, remember all the PDF spam? SA caught some that were about 5 MB, but yes, on the whole it is rather rare for spam to be more than a few KB.
Re: Apply Bayes learning to all users?
On 12/16/11 05:53, RW wrote:
> I think you've both misread the question. The OP wants to use spamtrap
> mail to train the individual user Bayes accounts.
> [...]

Thanks RW, you understood the question correctly. I'll take a look at those suggestions.

Steve