Re: Spamassassin Bayes... "why give that spam that score???"
On Thu, 25 Feb 2016, RW wrote:

> On Thu, 25 Feb 2016 13:58:03 -0800 (PST) John Hardin wrote:
>
> > On Thu, 25 Feb 2016, Steve wrote:
> >
> > > b) Configure spamc -C report (run as any user) to initiate
> > > training of the amavis bayes database (in ~amavis/.spamassassin) ?
> >
> > That would probably be a code change, unless you want to write a
> > wrapper script that calls the real spamc and then sa-learn...
> > Probably not a good idea.
>
> I don't see why it would require a code change if ~amavis is a real
> unix home directory. It does require an instance of spamd that does
> nothing else since AFAIK it's not needed by amavisd.

Sorry, I was thinking in terms of "learning" at all rather than
"learning to a specific database".

{refreshes memory of spamc command line}

spamc -L ham|spam is for learning. I expect if you configured the
correct database (as Reindl suggested) then -L would do what you want.
Having -C report do that as well would be a code change, and I'm not
sure that's a good idea.

Apologies for not mentioning -L initially, I had forgotten about it.

> You can either run spamd -u amavis, or leave it as root and run
> spamc -u amavis. Either way spamd will drop to the user amavis and
> look for its files in ~amavis/.spamassassin
>
> I think you do need to use both the -C and -L options to spamc though.
> The alternative for both training and reporting/revoking would be to
> use the spamassassin script, but that's inefficient from the Dovecot
> plugin.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
  The Constitution is a written instrument. As such its meaning does
  not alter. That which it meant when adopted, it means now.
  -- U.S. Supreme Court
     SOUTH CAROLINA v. US, 199 U.S. 437, 448 (1905)
---
 66 days since the first successful real return to launch site (SpaceX)
Re: Spamassassin Bayes... "why give that spam that score???"
On Thu, 25 Feb 2016 13:58:03 -0800 (PST) John Hardin wrote:

> On Thu, 25 Feb 2016, Steve wrote:
>
> > b) Configure spamc -C report (run as any user) to initiate
> > training of the amavis bayes database (in ~amavis/.spamassassin) ?
>
> That would probably be a code change, unless you want to write a
> wrapper script that calls the real spamc and then sa-learn...
> Probably not a good idea.

I don't see why it would require a code change if ~amavis is a real
unix home directory. It does require an instance of spamd that does
nothing else since AFAIK it's not needed by amavisd.

You can either run spamd -u amavis, or leave it as root and run
spamc -u amavis. Either way spamd will drop to the user amavis and
look for its files in ~amavis/.spamassassin

I think you do need to use both the -C and -L options to spamc though.

The alternative for both training and reporting/revoking would be to
use the spamassassin script, but that's inefficient from the Dovecot
plugin.

> That's probably the easiest to do.
>
> https://wiki.apache.org/spamassassin/SiteWideBayesSetup

It's presumably already site-wide with a database in
~amavis/.spamassassin

> Also, if you are going to leave autolearn on, reduce the learn-as-ham
> threshold!

Autotraining and the Dovecot plugin aren't a good combination, since
they are both very poor at learning ham. If you really must use them
together, train a few thousand hams manually and then set the threshold
low enough that it won't get screwed up by autotraining.
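
A minimal sketch of the learning call discussed above, for anyone wiring it into a wrapper. It only uses the spamc options named in this thread (-u for the user, -L for the learn type). The function names are hypothetical, the "forget" learn type comes from the spamc documentation rather than from this thread, and the sketch assumes spamd has been started so that it accepts learning requests, which is worth checking against the spamd documentation. It does not cover the -C reporting step that RW suspects is needed as well.

```python
import shlex
import subprocess


def build_spamc_learn_command(learn_type, user="amavis"):
    """Return the argv that asks spamd to learn one message for `user`.

    `user` defaults to the amavis account from the thread; the function
    name and signature are illustrative only.
    """
    if learn_type not in ("ham", "spam", "forget"):
        raise ValueError(f"unsupported learn type: {learn_type!r}")
    return ["spamc", "-u", user, "-L", learn_type]


def learn_message(raw_message, learn_type, user="amavis"):
    """Pipe one raw message (bytes) to spamc and return its exit status."""
    argv = build_spamc_learn_command(learn_type, user)
    result = subprocess.run(argv, input=raw_message, check=False)
    return result.returncode


if __name__ == "__main__":
    # Show the command line without actually running spamc.
    print(shlex.join(build_spamc_learn_command("spam")))
```

Keeping the argv construction separate from the subprocess call makes it easy to log or inspect the exact command before trusting it with real training.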
Re: Spamassassin Bayes... "why give that spam that score???"
On 25.02.2016 at 22:58, John Hardin wrote:

> > b) Configure spamc -C report (run as any user) to initiate training
> > of the amavis bayes database (in ~amavis/.spamassassin) ?
>
> That would probably be a code change, unless you want to write a
> wrapper script that calls the real spamc and then sa-learn...
> Probably not a good idea

why?

spamc --help
-F, --config path    Use this configuration file
Re: Spamassassin Bayes... "why give that spam that score???"
On Thu, 25 Feb 2016, Steve wrote:

Please keep the discussion on-list so others may help/benefit.

> On 25/02/2016 01:14, John Hardin wrote:
>
> > The second one has autolearn=yes, so I would say that autolearn is
> > probably the cause of this behavior.
>
> You're right... Manual training wasn't working - and autolearn became
> self-reinforcing as a result.
>
> I had been misinterpreting my logs (face-palm)! I now see that the
> training initiated by spamc (behind dovecot antispam) was trying to
> train the bayes database in ~/.spamassassin/bayes* - but amavis was
> using the bayes database in ~amavis/.spamassassin/bayes* - and was
> failing as a result (which I had overlooked.)

Yeah, "are you training the right database?" is a standard initial
troubleshooting question; I apologize for not asking that up front.

> I can now refine my question: Is there an easy way to:
>
> a) Configure amavisd to use the spamassassin configuration
> (~/.spamassassin/user_prefs and bayes_*) for the intended mailbox's
> account? (As far as I can tell, this isn't supported...)

Not sure, I'm unfamiliar with the details of amavisd. Sorry.

> b) Configure spamc -C report (run as any user) to initiate training
> of the amavis bayes database (in ~amavis/.spamassassin) ?

That would probably be a code change, unless you want to write a
wrapper script that calls the real spamc and then sa-learn... Probably
not a good idea.

> c) Configure everything to use a single site-wide database? (I've
> found how-to documents suggesting that I set "bayes_path" and
> "bayes_file_mode" - but when I try this, this part of the
> configuration seems to be ignored.)

That's probably the easiest to do.

https://wiki.apache.org/spamassassin/SiteWideBayesSetup

Also, if you are going to leave autolearn on, reduce the learn-as-ham
threshold!

> > Have you considered greylisting to give domains a chance to be
> > added to URIBLs before you see them?
>
> I have - but I quickly lost patience with it. It is important to me
> that - if I'm having a phone conversation with someone, and they send
> me an email "there and then" - that I get to see it before hanging up.
> Greylisting is incompatible with this wish.

It doesn't work for everyone.

> I'm not comfortable increasing the URIBL_BLACK score (as you appear
> to have done) as I don't want to risk any block-list ever being a
> single point of failure for false positives.

URIBL_BLACK wouldn't become a poison pill by itself unless you score it
over 5. I don't necessarily recommend trusting it *that* much, but 3.0
seems reasonable to me.

> I am, however, very curious about IXHASH - which looks as if it is
> useful. How does this compare with (or relate to) RAZOR/PYZOR/DCC?
> What's the best way to install it (on Ubuntu - if the distro is
> relevant to the answer...)?

Dunno, maybe somebody else will chime in.
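
The "are you training the right database?" check can be made concrete with a small sketch. It assumes the default per-user location ~/.spamassassin/bayes* that Steve's logs point to, and the two home directories used below are hypothetical examples, not paths taken from this thread.

```python
from pathlib import PurePosixPath


def default_bayes_prefix(home):
    """Path prefix of the per-user Bayes files (bayes_toks, bayes_seen, ...).

    Assumes the default layout ~/.spamassassin/bayes* with no bayes_path
    override in the configuration.
    """
    return PurePosixPath(home) / ".spamassassin" / "bayes"


def same_bayes_database(trainer_home, scanner_home):
    """True when the training user and the scanning user share one database."""
    return default_bayes_prefix(trainer_home) == default_bayes_prefix(scanner_home)


if __name__ == "__main__":
    # Hypothetical homes: the mailbox owner trains, amavis scans.
    print(same_bayes_database("/home/steve", "/var/lib/amavis"))
```

If the two prefixes differ, manual training never reaches the database the scanner reads, which matches the failure described above.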
Re: Spamassassin Bayes... "why give that spam that score???"
On 24 Feb 2016, at 20:14, John Hardin wrote:

> On Thu, 25 Feb 2016, Steve wrote:
>
> > On 24/02/2016 22:59, John Hardin wrote:
> >
> > > On Wed, 24 Feb 2016, Steve wrote:
> > >
> > > > I've used spamassassin for many years - on Ubuntu, using
> > > > amavisd - with great success. In recent months, I've been
> > > > receiving several spam messages each day that evade the filters.
> > >
> > > Can you provide samples? (e.g. three or four on Pastebin)
> >
> > One of each of the most common forms:
> >
> > http://pastebin.com/Wk2KD1Q1
> > http://pastebin.com/QCQ9Ymw7
> > http://pastebin.com/wgkmiJLt
>
> The second one has autolearn=yes, so I would say that autolearn is
> probably the cause of this behavior.
>
> Note that the bayes score doesn't contribute to the autolearning
> decision to avoid positive feedback, but if there are no non-Bayes
> spam signs and the message scores lightly negative like that one does,
> it can be learned as ham. That would make any subsequent similar
> messages score even lower, possibly offsetting actual spam hits.
> Subsequently training those messages as spam will offset that effect,
> but you're to a degree playing whack-a-mole that way.
>
> I misspoke a bit when I said there are no knobs to twiddle. I forgot
> about the autolearn thresholds, but they aren't strictly part of how
> bayes itself works, they are (again) training. If you want to use
> autolearn, you might want to reduce the learn-as-ham threshold even
> further. View autolearn as a not-quite-trustworthy user making
> submissions, and the thresholds are a way to limit the effects of poor
> judgement. :)

I'm much more certain that you should reduce your
bayes_auto_learn_threshold_nonspam. Everyone should. The default is 0.1,
and it looks like you've left that as-is. I use -0.2 because I really
don't want the autolearner to assume mail is ham without at least 2
minor or one substantial indicator of hamminess. Maybe giving mail the
benefit of the doubt made sense circa v3.1, but it definitely does not
today.

In the case of your 2nd example, it was autolearned as ham because its
non-bayes score was -0.101, based on rules that only have independent
scores at all for strategic UI (some might even say political) purposes.
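
The effect of that threshold can be shown with a simplified sketch. It models only the comparison of the non-Bayes score against the two autolearn thresholds. The 0.1 default and the -0.2 value come from this message, while the 12.0 spam threshold is the stock default as I recall it and should be treated as an assumption. Real SpamAssassin applies further conditions before learning, so this is an illustration and not the actual implementation.

```python
def autolearn_decision(non_bayes_score, ham_threshold=0.1, spam_threshold=12.0):
    """Classify a message for autolearning from its non-Bayes score.

    Returns "ham", "spam", or None when the score falls between the
    thresholds. Simplified: the extra checks SpamAssassin performs before
    learning are deliberately left out.
    """
    if non_bayes_score < ham_threshold:
        return "ham"
    if non_bayes_score >= spam_threshold:
        return "spam"
    return None


if __name__ == "__main__":
    # The second pastebin sample had a non-Bayes score of -0.101.
    print(autolearn_decision(-0.101))                      # default threshold
    print(autolearn_decision(-0.101, ham_threshold=-0.2))  # stricter threshold
```

With the default threshold the -0.101 sample is learned as ham; with -0.2 it is left alone.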
Re: Spamassassin Bayes... "why give that spam that score???"
On Thu, 25 Feb 2016 00:41:04 + Steve wrote:

> On 24/02/2016 22:59, John Hardin wrote:
>
> > How do you train your Bayes? Autolearn? General user submissions?
> > Trusted user submissions? Only you, from only your personal mail?
>
> Only my personal mailbox *really* matters to me. I train from it
> using the dovecot antispam plugin... which feeds mail I shift to/from
> a spam folder through a pipe involving "spamc -C".

I think that might be your problem. The equivalent option in the
spamassassin script trains Bayes as a side-effect of reporting or
revoking. I don't think "spamc -C" does.
Re: Spamassassin Bayes... "why give that spam that score???"
On 25.02.2016 at 02:14, John Hardin wrote:

> On Thu, 25 Feb 2016, Steve wrote:
>
> > On 24/02/2016 22:59, John Hardin wrote:
> >
> > > On Wed, 24 Feb 2016, Steve wrote:
> > >
> > > > I've used spamassassin for many years - on Ubuntu, using
> > > > amavisd - with great success. In recent months, I've been
> > > > receiving several spam messages each day that evade the filters.
> > >
> > > Can you provide samples? (e.g. three or four on Pastebin)
> >
> > One of each of the most common forms:
> >
> > http://pastebin.com/Wk2KD1Q1
> > http://pastebin.com/QCQ9Ymw7
> > http://pastebin.com/wgkmiJLt
>
> The second one has autolearn=yes, so I would say that autolearn is
> probably the cause of this behavior

autolearn is the root of all evil. it's nice for a "fire and forget"
setup with no manual training, but that's it

got hit by it in the past multiple times in both directions (false
negatives and false positives), with the result of purging the whole
bayes (commercial appliance using SpamAssassin as one part)

after building up my own spamfilter solution, keeping the whole corpus
and training *only* by hand with no autolearning/autoexpire, the bayes
is 100% trustworthy and can be scored as nearly a poison pill for spam
as well as -3.5 for BAYES_00

given that 99% of junk is killed long before SA on MTA-level, 30% is
shortcircuit ham and over 70% of the messages making it through to
bayes are BAYES_00, the setup is proven to be right

0      61132  SPAM
0      21786  HAM
0    2540731  TOKEN

total 73M
-rw--- 1 sa-milt sa-milt 10M 2016-02-25 02:24 bayes_seen
-rw--- 1 sa-milt sa-milt 81M 2016-02-25 02:24 bayes_toks

BAYES_00        25445   73.52 %
BAYES_05          671    1.93 %
BAYES_20          780    2.25 %
BAYES_40          720    2.08 %
BAYES_50         2519    7.27 %
BAYES_60          370    1.06 %    7.90 % (OF TOTAL BLOCKED)
BAYES_80          288    0.83 %    6.15 % (OF TOTAL BLOCKED)
BAYES_95          284    0.82 %    6.06 % (OF TOTAL BLOCKED)
BAYES_99         3529   10.19 %   75.38 % (OF TOTAL BLOCKED)
BAYES_999        3185    9.20 %   68.04 % (OF TOTAL BLOCKED)

DNSWL 4 90.78 %
SPF             33608   65.37 %
SPF/DKIM WL     14653   28.50 %
SHORTCIRCUIT    16744   32.57 %

BLOCKED          4681    9.10 %
SPAMMY           4471    8.69 %   95.51 % (OF TOTAL BLOCKED)
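
For readers unfamiliar with the BAYES_nn rule names in these statistics, here is a rough sketch of how a Bayes probability maps onto them. Only the BAYES_50 range (40 to 60%) is confirmed elsewhere in this thread; the other ranges and the exclusive-lower, inclusive-upper boundary handling are my recollection of the stock ruleset and should be treated as assumptions.

```python
# (rule name, lower bound, upper bound); ranges other than BAYES_50 are
# assumed from the stock rules, not taken from this thread.
BAYES_RULES = [
    ("BAYES_00", 0.00, 0.01),
    ("BAYES_05", 0.01, 0.05),
    ("BAYES_20", 0.05, 0.20),
    ("BAYES_40", 0.20, 0.40),
    ("BAYES_50", 0.40, 0.60),
    ("BAYES_60", 0.60, 0.80),
    ("BAYES_80", 0.80, 0.95),
    ("BAYES_95", 0.95, 0.99),
    ("BAYES_99", 0.99, 1.00),
    ("BAYES_999", 0.999, 1.00),
]


def bayes_rules_for(probability):
    """Return the BAYES_* rule names that would fire for a probability."""
    hits = []
    for name, low, high in BAYES_RULES:
        # A lower bound of exactly 0 is treated as inclusive so that a
        # probability of 0.0 still lands in BAYES_00.
        if (low == 0.0 or probability > low) and probability <= high:
            hits.append(name)
    return hits


if __name__ == "__main__":
    print(bayes_rules_for(0.5002))
```

Note that BAYES_99 and BAYES_999 overlap, which is why the two rows in the table above cannot simply be added together.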
Re: Spamassassin Bayes... "why give that spam that score???"
On Thu, 25 Feb 2016, Reindl Harald wrote:

> 7.0 URIBL_BLACK    Contains an URL listed in the URIBL blacklist
>                    [URIs: leslie-bib***b.org]

That, too. Steve, you might consider boosting your local score for
URIBL_BLACK. :)
Re: Spamassassin Bayes... "why give that spam that score???"
On Thu, 25 Feb 2016, Steve wrote:

> On 24/02/2016 22:59, John Hardin wrote:
>
> > On Wed, 24 Feb 2016, Steve wrote:
> >
> > > I've used spamassassin for many years - on Ubuntu, using amavisd -
> > > with great success. In recent months, I've been receiving several
> > > spam messages each day that evade the filters.
> >
> > Can you provide samples? (e.g. three or four on Pastebin)
>
> One of each of the most common forms:
>
> http://pastebin.com/Wk2KD1Q1
> http://pastebin.com/QCQ9Ymw7
> http://pastebin.com/wgkmiJLt

The second one has autolearn=yes, so I would say that autolearn is
probably the cause of this behavior.

Note that the bayes score doesn't contribute to the autolearning
decision to avoid positive feedback, but if there are no non-Bayes spam
signs and the message scores lightly negative like that one does, it can
be learned as ham. That would make any subsequent similar messages score
even lower, possibly offsetting actual spam hits. Subsequently training
those messages as spam will offset that effect, but you're to a degree
playing whack-a-mole that way.

I misspoke a bit when I said there are no knobs to twiddle. I forgot
about the autolearn thresholds, but they aren't strictly part of how
bayes itself works, they are (again) training. If you want to use
autolearn, you might want to reduce the learn-as-ham threshold even
further. View autolearn as a not-quite-trustworthy user making
submissions, and the thresholds are a way to limit the effects of poor
judgement. :)

> I note that they tend to come from different mail servers each time -
> the URLs in the body tend to be unique, too.

Have you considered greylisting to give domains a chance to be added to
URIBLs before you see them?

> > > * The false negatives all match BAYES_00 - attracting a default
> > > score of -1.9. BAYES_00 seems to be at the crux of the
> > > misclassification.
> > >
> > > Is there a way to delve into why these messages have been
> > > allocated such a low bayes score - while (to a human) appearing
> > > blatant, simple, spam on "vanilla" spam topics? Has my bayes data
> > > been "poisoned" somehow?
> >
> > Poisoning is less likely than mistraining.
> >
> > How large is your userbase and mail volume?
>
> One user - me - several email addresses. 10,000 mails per month -
> several mailing lists where I read only a tiny fraction of the posts.

Heh. For once it's someone pretty much like me. :)

> ~1,500 spams (that survive mail server RBLs). Autolearn is on - I
> don't think about it, it is automatic. :)
>
> > How do you train your Bayes? Autolearn? General user submissions?
> > Trusted user submissions? Only you, from only your personal mail?
>
> Only my personal mailbox *really* matters to me. I train from it using
> the dovecot antispam plugin... which feeds mail I shift to/from a spam
> folder through a pipe involving "spamc -C".

And I assume there's a similar ham folder? You need both.

> > Do you keep base training corpora so you can wipe and retrain if it
> > goes off the rails for some reason?
>
> (In principle) I've got multi-gigabyte-scale spam/ham corpora. I'm yet
> to [ever] do anything with it. :)

I have base bayes corpora of a few thousand messages each of spam and
ham, kept in aged corpora files. I add a handful to that every month,
mostly on the spam side. SA is trained nightly from the current corpora
files and I can retrain from scratch from all of them if needed, but I
haven't needed to do that yet.

> > If all the FNs are getting BAYES_00, make sure you're (re)training
> > them as spam.
>
> I believe I'm doing that - but it isn't easy to prove that the
> training 'worked'.

If you look at the output from the training you'll be able to see how
many "new" messages it learned from.

It will have an effect, in that it will remove a specific mistraining,
but in the meantime autolearn may be making bad decisions about other
messages.

> > Review how you're training. If your users aren't really trustworthy
> > you should be manually reviewing submissions.
>
> When spam arrives in my primary inbox, I hand classify - I'm less
> obsessive about mailing lists. Dovecot initiates training
> automatically when I shift messages to a special spam folder.

OK, good. If you had a userbase, their judgement (or lack thereof) could
be an issue.

> > I feel autolearn can be problematic, particularly if things are
> > already going off the rails.
>
> I expect Autolearn (assisted by Razor, Pyzor and DCC) has done the
> vast majority of my training. This year, I've hand-trained 216
> false-negatives and 0 false positives.

For the size of your install, I'd recommend turning off autolearn and
going with purely hand-collected corpora. It serves me well.

> > If you have base training corpora, review it for misclassifications
> > (FNs), wipe and retrain.
>
> I guess I could do that... My expectation is that - if I train with
> the corpora I can pick easily (without changing configuration) I'll
> get the same bayes database I currently have... which will give the
> same scores.

No, autolearning would n
Re: Spamassassin Bayes... "why give that spam that score???"
On 25.02.2016 at 01:41, Steve wrote:

> On 24/02/2016 22:59, John Hardin wrote:
>
> > On Wed, 24 Feb 2016, Steve wrote:
> >
> > > I've used spamassassin for many years - on Ubuntu, using amavisd -
> > > with great success. In recent months, I've been receiving several
> > > spam messages each day that evade the filters.
> >
> > Can you provide samples? (e.g. three or four on Pastebin)
>
> One of each of the most common forms:

none of those 3 messages should make it into your inbox, and they should
certainly never get BAYES_00 - looks like bad training!

i tried to obfuscate the URIBL hits because otherwise even the
mailing-list would reject my message

> http://pastebin.com/Wk2KD1Q1

/var/www/uploadtemp/ac5a53b19de9a182194b8e94cb6724eb4b3ce574.eml: Sanesecurity.Junk.52024.UNOFFICIAL FOUND
/var/www/uploadtemp/ac5a53b19de9a182194b8e94cb6724eb4b3ce574.eml: Sanesecurity.Blurl.6a2ebd.UNOFFICIAL FOUND
/var/www/uploadtemp/ac5a53b19de9a182194b8e94cb6724eb4b3ce574.eml: Sanesecurity.Blurl.6a2ebd.UNOFFICIAL FOUND
/var/www/uploadtemp/ac5a53b19de9a182194b8e94cb6724eb4b3ce574.eml: Sanesecurity.Blurl.6a2ebd.UNOFFICIAL FOUND
/var/www/uploadtemp/ac5a53b19de9a182194b8e94cb6724eb4b3ce574.eml: Sanesecurity.Blurl.6a2ebd.UNOFFICIAL FOUND
/var/www/uploadtemp/ac5a53b19de9a182194b8e94cb6724eb4b3ce574.eml: Sanesecurity.Blurl.6a2ebd.UNOFFICIAL FOUND

--- VIRUS-SCAN SUMMARY ---
Infected files: 1
Time: 0.009 sec (0 m 0 s)

Content analysis details:   (20.6 points, 5.5 required)

 pts rule name              description
---- ---------------------- --------------------------------------------
 1.0 GENERIC_IXHASH         DIGEST: generic.ixhash.net
-0.3 RCVD_IN_MSPIKE_H4      RBL: Very Good reputation (+4)
                            [108.62.157.149 listed in wl.mailspike.net]
 7.0 URIBL_BLACK            Contains an URL listed in the URIBL blacklist
                            [URIs: leslie-bib***b.org]
 1.5 SPF_HELO_FAIL          SPF: HELO does not match SPF record (fail)
                            [SPF failed: Please see http://www.openspf.org/Why?s=helo;id=gw.shic.co.uk;ip=192.168.42.2;r=mail-gw.thelounge.net]
 3.0 INVESTMENT_ADVICE      BODY: Message mentions investment advice
 1.5 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.5002]
 0.0 HTML_MESSAGE           BODY: HTML included in message
-0.1 DKIM_VALID_AU          Message has a valid DKIM or DK signature from author's domain
-0.1 DKIM_VALID             Message has at least one valid DKIM or DK signature
 0.5 PYZOR_CHECK            Listed in Pyzor (http://pyzor.sf.net/)
 0.1 DKIM_SIGNED            Message has a DKIM or DK signature, not necessarily valid
 1.5 IXHASH_CHECK           Message hits one ore more IXHASH digest-sources
 2.5 RDNS_NONE              Delivered to internal network by a host with no rDNS
-0.0 RCVD_IN_MSPIKE_WL      Mailspike good senders
 2.5 DIGEST_MULTIPLE_LOCAL  Message hits more than one network digest check (razor, pyzor, ixhash)

> http://pastebin.com/QCQ9Ymw7

/var/www/uploadtemp/cb2bd7249493a618230fc12473f311ee092a9c6a.eml: Sanesecurity.Blurl.56d5c1.UNOFFICIAL FOUND
/var/www/uploadtemp/cb2bd7249493a618230fc12473f311ee092a9c6a.eml: Sanesecurity.Blurl.56d5c1.UNOFFICIAL FOUND
/var/www/uploadtemp/cb2bd7249493a618230fc12473f311ee092a9c6a.eml: Sanesecurity.Blurl.56d5c1.UNOFFICIAL FOUND

--- VIRUS-SCAN SUMMARY ---
Infected files: 1
Time: 0.007 sec (0 m 0 s)

Content analysis details:   (18.5 points, 5.5 required)

 pts rule name                 description
---- ------------------------- -----------------------------------------
 7.0 URIBL_BLACK               Contains an URL listed in the URIBL blacklist
                               [URIs: pinkhand***print.com]
 3.5 URIBL_DBL_SPAM            Contains a spam URL listed in the DBL blocklist
                               [URIs: pinkhand***print.com]
-0.1 CUST_DNSWL_2              RBL: score.senderscore.com (Low Trust)
                               [85.195.78.13 listed in score.senderscore.com]
-0.3 RCVD_IN_MSPIKE_H4         RBL: Very Good reputation (+4)
                               [85.195.78.13 listed in wl.mailspike.net]
 1.5 SPF_HELO_FAIL             SPF: HELO does not match SPF record (fail)
                               [SPF failed: Please see http://www.openspf.org/Why?s=helo;id=gw.shic.co.uk;ip=192.168.42.2;r=mail-gw.thelounge.net]
 1.5 BAYES_50                  BODY: Bayes spam probability is 40 to 60%
                               [score: 0.5000]
 0.0 HTML_MESSAGE              BODY: HTML included in message
 2.0 RAZOR2_CF_RANGE_E8_51_100 Razor2 gives engine 8 confidence level above 50%
                               [cf: 100]
-0.1 DKIM_VALID_AU             Message has a valid DKIM or DK signature from author's domain
 0.5 RAZOR2_CHECK              Listed in Razor2 (http://razor.sf.net/)
 0.5 RAZOR2_CF_RANGE_51_100    Razor2 gives confidence level above 50%
                               [cf: 100]
-0.1 DKIM_VAL
Re: Spamassassin Bayes... "why give that spam that score???"
On 24/02/2016 22:59, John Hardin wrote:

> On Wed, 24 Feb 2016, Steve wrote:
>
> > I've used spamassassin for many years - on Ubuntu, using amavisd -
> > with great success. In recent months, I've been receiving several
> > spam messages each day that evade the filters.
>
> Can you provide samples? (e.g. three or four on Pastebin)

One of each of the most common forms:

http://pastebin.com/Wk2KD1Q1
http://pastebin.com/QCQ9Ymw7
http://pastebin.com/wgkmiJLt

I note that they tend to come from different mail servers each time -
the URLs in the body tend to be unique, too.

> > * The false negatives all match BAYES_00 - attracting a default
> > score of -1.9. BAYES_00 seems to be at the crux of the
> > misclassification.
> >
> > Is there a way to delve into why these messages have been allocated
> > such a low bayes score - while (to a human) appearing blatant,
> > simple, spam on "vanilla" spam topics? Has my bayes data been
> > "poisoned" somehow?
>
> Poisoning is less likely than mistraining.
>
> How large is your userbase and mail volume?

One user - me - several email addresses. 10,000 mails per month -
several mailing lists where I read only a tiny fraction of the posts.
~1,500 spams (that survive mail server RBLs). Autolearn is on - I don't
think about it, it is automatic. :)

> How do you train your Bayes? Autolearn? General user submissions?
> Trusted user submissions? Only you, from only your personal mail?

Only my personal mailbox *really* matters to me. I train from it using
the dovecot antispam plugin... which feeds mail I shift to/from a spam
folder through a pipe involving "spamc -C".

> Do you keep base training corpora so you can wipe and retrain if it
> goes off the rails for some reason?

(In principle) I've got multi-gigabyte-scale spam/ham corpora. I'm yet
to [ever] do anything with it. :)

> > It is worth noting that I get a lot of correctly identified spam -
> > and much of that matches BAYES_99 and BAYES_999... and my ham gets
> > BAYES_00... so, for many messages, bayes is working. Is it likely
> > that I am suffering poor performance (for these specific messages)
> > as a result of some tunable parameter?
>
> Probably not. There's not a lot to tune in Bayes. It's pretty much
> solely dependent on what you've trained it with.
>
> > What is the most effective way to tackle this?
>
> If all the FNs are getting BAYES_00, make sure you're (re)training
> them as spam.

I believe I'm doing that - but it isn't easy to prove that the training
'worked'.

> Review how you're training. If your users aren't really trustworthy
> you should be manually reviewing submissions.

When spam arrives in my primary inbox, I hand classify - I'm less
obsessive about mailing lists. Dovecot initiates training automatically
when I shift messages to a special spam folder.

> I feel autolearn can be problematic, particularly if things are
> already going off the rails.

I expect Autolearn (assisted by Razor, Pyzor and DCC) has done the vast
majority of my training. This year, I've hand-trained 216
false-negatives and 0 false positives.

> If you have base training corpora, review it for misclassifications
> (FNs), wipe and retrain.

I guess I could do that... My expectation is that - if I train with the
corpora I can pick easily (without changing configuration) I'll get the
same bayes database I currently have... which will give the same scores.

Really, I'd like to understand why my current bayes database makes the
classifications it does.
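
On proving that training 'worked': one option is to read the summary that sa-learn prints, since it says how many messages it actually learned from. The sketch below parses that summary. The exact wording of the line ("Learned tokens from N message(s) (M message(s) examined)") is quoted from memory and may differ between versions, so the pattern is an assumption to verify against real output, and the function name is hypothetical.

```python
import re

# Assumed sa-learn summary format; verify against your installed version.
_LEARNED = re.compile(
    r"Learned tokens from (\d+) message\(s\) \((\d+) message\(s\) examined\)"
)


def parse_sa_learn_summary(output):
    """Return (learned, examined) from sa-learn output, or None if absent."""
    match = _LEARNED.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    sample = "Learned tokens from 0 message(s) (3 message(s) examined)"
    print(parse_sa_learn_summary(sample))
```

A learned count of zero for messages that were supposedly new is a hint that they were already in the database being written to, or that a different database is being trained than the one used for scanning.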
Re: Spamassassin Bayes... "why give that spam that score???"
On Wed, 24 Feb 2016, Steve wrote:

> I've used spamassassin for many years - on Ubuntu, using amavisd -
> with great success. In recent months, I've been receiving several spam
> messages each day that evade the filters.

Can you provide samples? (e.g. three or four on Pastebin)

> * The false negatives all match BAYES_00 - attracting a default score
> of -1.9. BAYES_00 seems to be at the crux of the misclassification.
>
> Is there a way to delve into why these messages have been allocated
> such a low bayes score - while (to a human) appearing blatant, simple,
> spam on "vanilla" spam topics? Has my bayes data been "poisoned"
> somehow?

Poisoning is less likely than mistraining.

How large is your userbase and mail volume?

How do you train your Bayes? Autolearn? General user submissions?
Trusted user submissions? Only you, from only your personal mail?

Do you keep base training corpora so you can wipe and retrain if it goes
off the rails for some reason?

> It is worth noting that I get a lot of correctly identified spam - and
> much of that matches BAYES_99 and BAYES_999... and my ham gets
> BAYES_00... so, for many messages, bayes is working. Is it likely that
> I am suffering poor performance (for these specific messages) as a
> result of some tunable parameter?

Probably not. There's not a lot to tune in Bayes. It's pretty much
solely dependent on what you've trained it with.

> What is the most effective way to tackle this?

If all the FNs are getting BAYES_00, make sure you're (re)training them
as spam.

Review how you're training. If your users aren't really trustworthy you
should be manually reviewing submissions.

I feel autolearn can be problematic, particularly if things are already
going off the rails.

If you have base training corpora, review it for misclassifications
(FNs), wipe and retrain.

If you *don't* have base training corpora, start building them.
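
To illustrate why mistraining, rather than deliberate poisoning, can drag blatant spam down to BAYES_00, here is a toy model. It uses a simplified Graham-style per-token estimate, which is not SpamAssassin's actual computation (that combines many tokens and differs in detail); every count below is invented for illustration.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Toy estimate of how spammy one token looks.

    spam_hits / ham_hits: learned spam / ham messages containing the token.
    n_spam / n_ham: total learned spam / ham messages.
    Simplified Graham-style ratio, not SpamAssassin's real formula.
    """
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


if __name__ == "__main__":
    # Invented counts: the token appears in 40 of 1000 spams, in no ham.
    before = token_spam_probability(40, 0, 1000, 1000)
    # 60 similar spams are then wrongly learned as ham.
    after = token_spam_probability(40, 60, 1000, 1060)
    print(before, after)
```

Once enough similar messages have been learned as ham, the same tokens start to count as evidence of ham, so each later message of that form scores lower still. Retraining the affected messages as spam moves the counts back.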
Spamassassin Bayes... "why give that spam that score???"
I've used spamassassin for many years - on Ubuntu, using amavisd - with
great success. In recent months, I've been receiving several spam
messages each day that evade the filters.

* These false negatives conform to a handful of simple, formulaic,
  textual forms - on common subjects.
* The emails consist of fairly plain HTML and appear not to employ any
  significant obfuscation.
* I have tried to train spamassassin with many of these spam samples -
  without any effect.
* The bayes database is updated. The bayes_journal (37k), bayes_seen
  (5.2mb) and bayes_toks (5.4mb) files all have recent timestamps.
* The false negatives all match BAYES_00 - attracting a default score of
  -1.9. BAYES_00 seems to be at the crux of the misclassification.

Is there a way to delve into why these messages have been allocated such
a low bayes score - while (to a human) appearing blatant, simple, spam
on "vanilla" spam topics? Has my bayes data been "poisoned" somehow?

It is worth noting that I get a lot of correctly identified spam - and
much of that matches BAYES_99 and BAYES_999... and my ham gets
BAYES_00... so, for many messages, bayes is working.

Is it likely that I am suffering poor performance (for these specific
messages) as a result of some tunable parameter?

What is the most effective way to tackle this?