On Wed, 24 Feb 2016, Steve wrote:

I've used SpamAssassin for many years - on Ubuntu, using amavisd - with great success. In recent months, I've been receiving several spam messages each day that evade the filters.

Can you provide samples? (e.g. three or four on Pastebin)

* The false negatives all match BAYES_00 - attracting a default score of -1.9. BAYES_00 seems to be at the crux of the misclassification.

Is there a way to delve into why these messages have been allocated such a low bayes score - while (to a human) appearing blatant, simple, spam on "vanilla" spam topics? Has my bayes data been "poisoned" somehow?

Poisoning is less likely than mistraining.

How large is your userbase and mail volume?

How do you train your Bayes? Autolearn? General user submissions? Trusted user submissions? Only you, from only your personal mail?

Do you keep base training corpora so you can wipe and retrain if it goes off the rails for some reason?

It is worth noting that I get a lot of correctly identified spam - and much of that matches BAYES_99 and BAYES_999... and my ham gets BAYES_00... so, for many messages, bayes is working. Is it likely that I am suffering poor performance (for these specific messages) as a result of some tunable parameter?

Probably not. There's not a lot to tune in Bayes. It's pretty much solely dependent on what you've trained it with.

What is the most effective way to tackle this?

If all the FNs are getting BAYES_00, make sure you're (re)training them as spam.
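
To confirm which missed messages really hit BAYES_00 before retraining them, something like the following rough Python sketch may help. It assumes the messages still carry an X-Spam-Status header with a "tests=" list (plain SpamAssassin style or the bracketed amavisd style), and that rule names are upper-case; the function name spam_tests is made up for illustration.

```python
import email
import re


def spam_tests(raw_message):
    """Return the set of rule names listed in a message's X-Spam-Status header.

    Assumes rule names are upper-case letters, digits and underscores, and
    that everything after "tests=" other than rule names is lower-case or
    numeric (e.g. "autolearn=ham", "=-1.9").
    """
    msg = email.message_from_string(raw_message)
    status = str(msg.get("X-Spam-Status", ""))
    if "tests=" not in status:
        return set()
    tail = status.split("tests=", 1)[1]
    return set(re.findall(r"\b[A-Z][A-Z0-9_]*\b", tail))


def hits_bayes_00(raw_message):
    """True if the message was scored with BAYES_00."""
    return "BAYES_00" in spam_tests(raw_message)


if __name__ == "__main__":
    sample = (
        "X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,HTML_MESSAGE\n"
        "\tautolearn=ham\n"
        "Subject: example\n"
        "\n"
        "body\n"
    )
    print(sorted(spam_tests(sample)))
```

Run over a folder of missed spam, this gives a quick list of the messages that need to be fed back as spam.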

Review how you're training. If your users aren't really trustworthy you should be manually reviewing submissions.

I feel autolearn can be problematic, particularly if things are already going off the rails.

If you have base training corpora, review them for misclassifications (FNs), wipe and retrain.

If you *don't* have base training corpora, start building them.
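
A wipe-and-retrain can be scripted roughly as below. This is only a sketch: the corpus directory names are placeholders, and under amavisd the commands would normally have to run as the user that owns the Bayes database, which is not handled here. It relies only on the standard sa-learn options --clear, --spam, --ham and --dump magic. Taking a copy of the database first (sa-learn --backup) is prudent.

```python
import subprocess


def retrain_commands(ham_dir, spam_dir):
    """Build the sa-learn invocations for a wipe and retrain.

    ham_dir and spam_dir are placeholder paths to reviewed corpora.
    """
    return [
        ["sa-learn", "--clear"],           # wipe the existing Bayes database
        ["sa-learn", "--spam", spam_dir],  # learn the reviewed spam corpus
        ["sa-learn", "--ham", ham_dir],    # learn the reviewed ham corpus
        ["sa-learn", "--dump", "magic"],   # show message and token counts afterwards
    ]


def retrain(ham_dir, spam_dir, dry_run=True):
    """Print the commands, or run them in order when dry_run is False."""
    for argv in retrain_commands(ham_dir, spam_dir):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)  # stop at the first failure


if __name__ == "__main__":
    # Dry run only: prints the commands without touching the database.
    retrain("corpus/ham", "corpus/spam")
```

The dry-run default is deliberate, since --clear discards everything Bayes has learned.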


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Maxim XXIX: The enemy of my enemy is my enemy's enemy.
              No more. No less.
-----------------------------------------------------------------------
 65 days since the first successful real return to launch site (SpaceX)
