On 8 Aug 2014, at 12:05, Justin Edmands wrote:
> Aug 8 12:00:53.067 [19948] dbg: learn: auto-learn: message score:
> 13.934, computed score for autolearn: 17.583
> Aug 8 12:00:53.067 [19948] dbg: learn: auto-learn? ham=0, spam=7,
> body-points=7.448, head-points=5.511, learned-points=-1.9
> Aug 8 12:00:53.067 [19948] dbg: learn: auto-learn: autolearn_force
> not flagged for a rule. Body Only Points: 7.448 (3 req'd) / Head Only
> Points: 5.511 (3 req'd)
> Aug 8 12:00:53.067 [19948] dbg: learn: auto-learn? no: scored as spam
> but learner indicated ham (-1.9 < -1)
This is really a SpamAssassin issue rather than a MIMEDefang issue, so
you probably could get a better answer from the broader SA community,
but I'll offer a vague rambling one :)
The SA auto-learn subsystem is designed to be very cautious in what it
learns because it carries diverse mistraining risks. The obvious part of
the caution is the spam/non-spam thresholds for auto-learning, but there
are also less prominent safeguards: the message is rescored for the
threshold check using scoreset 0 or 1, the learner demands a minimum of
3 points each from body and header/network rules to score as spam unless
a matched rule has the autolearn_force tflag set, and other per-rule
'tflags' can modify how the learner acts on a matching message. As a
result, a message
actually has 5 scores tallied by SA: the normal score using scoreset 2
or 3, the score using scoreset 0 or 1 that gets compared to the spam &
nonspam autolearn threshold settings, the body-only score, the
header-only score, and the score using only rules with the "learn" tflag
(by default, that's only BAYES_* rules) which is reported in debug
messages as "learned-points". By default, that last value is used as a
backstop to prevent wildly divergent auto-learning. If the Bayes rules
score a message <-1 or >1 (by default: a Bayes probability below 1% or
above 50%) in dissent from the overall score, the message will not be
autolearned.
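
To make the order of those checks concrete, here is a rough Python sketch
of the decision as I have described it. It is not SA's actual code: the
names are made up for illustration, it models only the checks mentioned
above (the real plugin has more), and it treats autolearn_force as doing
nothing beyond waiving the per-category minimums. The two thresholds are
the stock defaults for bayes_auto_learn_threshold_nonspam and
bayes_auto_learn_threshold_spam.

```python
from dataclasses import dataclass

# Stock defaults; a site may have changed any of these.
HAM_THRESHOLD = 0.1        # bayes_auto_learn_threshold_nonspam
SPAM_THRESHOLD = 12.0      # bayes_auto_learn_threshold_spam
MIN_BODY_POINTS = 3.0      # "(3 req'd)" in the debug log
MIN_HEAD_POINTS = 3.0
LEARNER_HAM_LIMIT = -1.0   # learned-points below this: Bayes says ham
LEARNER_SPAM_LIMIT = 1.0   # learned-points above this: Bayes says spam


@dataclass
class Scores:
    """Hypothetical container for the values shown in the debug log."""
    autolearn: float               # rescored with scoreset 0 or 1
    body: float                    # body-points
    head: float                    # head-points
    learned: float                 # learned-points (BAYES_* rules by default)
    autolearn_force: bool = False  # some matched rule carries the tflag


def autolearn_decision(s: Scores) -> str:
    """Return "spam", "ham", or a "no: ..." reason (traditional mode)."""
    if s.autolearn >= SPAM_THRESHOLD:
        too_few = s.body < MIN_BODY_POINTS or s.head < MIN_HEAD_POINTS
        if too_few and not s.autolearn_force:
            return "no: too few body or header points"
        if s.learned < LEARNER_HAM_LIMIT:
            return "no: scored as spam but learner indicated ham"
        return "spam"
    if s.autolearn < HAM_THRESHOLD:
        if s.learned > LEARNER_SPAM_LIMIT:
            return "no: scored as ham but learner indicated spam"
        return "ham"
    return "no: between thresholds"


# The message from the quoted log: well past the spam threshold, plenty of
# body and header points, but Bayes dissents at -1.9.
print(autolearn_decision(Scores(17.583, 7.448, 5.511, -1.9)))
```

Fed the numbers from your log, it stops at the same place your debug
output does: the message clears the threshold and the body/header
minimums, and is then vetoed by the learned-points backstop.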
> Is this something that I can fix? I want stuff to be trained as spam
> but it doesn't seem to make it. I am thinking it's either a setting I
> am not aware of or I need to retrain my bayes DB ham. Any help would
> be great.
The real question is whether it is a problem at all, i.e. whether it's a
thing that merits fixing rather than a thing that is working as designed
and, at least in aggregate, for your benefit. Probably that particular
message was spam, given the very high score spread across rule types,
but it is certain that learning it as spam would change the way your
Bayes DB interprets similar messages and possible (absent other
evidence) that it was not spam at all. Unless you do intensive periodic
score adjustments of your non-Bayes rules based on a carefully
human-classified corpus of messages that are representative of the
actual mailstream seen by SA, a well-fed Bayes DB is going to be a
better judge than the other (static and mostly default) rules. As of SA
v3.4 (which you apparently have, as autolearn_force is new) you can
switch bayes_auto_learn_on_error to "1" to flip the auto-learner into a
mode where it *ONLY* learns a message when its learned-points
classification (i.e. the judgment of the existing Bayes DB) disagrees
with classification based on surpassing an autolearn threshold.
Whether you leave bayes_auto_learn_on_error at its default "0" for the
traditional behavior or switch it to "1" depends on what you believe to
be true about the relative accuracy of your Bayes and non-Bayes SA
rules. The traditional behavior expresses an assumption that the Bayes
DB is less likely to make a large classification error than the rules
used for the autolearn score, while the "learn on error" behavior
assumes that your Bayes DB is probably in error when it disagrees with
the other SA rules. Which way is better is site-specific, as that is
influenced by a site's particular mailstream idiosyncrasies, the
autolearn thresholds, local rules, local score adjustments to standard
rules, the exclusion of messages from SA scoring by other anti-spam
measures, and the nature of what gets fed to the Bayes DB after explicit
human classification.
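
The difference between the two modes can be put in a few lines of
Python. This is only a model of the behavior as I described it above,
not SA's implementation: the function name and arguments are invented,
the +/-1 limits are the learned-points limits discussed earlier, and
treating a lukewarm Bayes score (between -1 and 1) as "not yet agreeing"
in learn-on-error mode is my own assumption.

```python
def should_learn(threshold_verdict: str, learned_points: float,
                 learn_on_error: bool = False) -> bool:
    """Model of the two autolearn modes (hypothetical helper).

    threshold_verdict is "spam" or "ham": what the autolearn score said
    after crossing one of the autolearn thresholds.
    """
    if threshold_verdict not in ("spam", "ham"):
        raise ValueError("threshold_verdict must be 'spam' or 'ham'")

    # What the existing Bayes DB thinks, judged by learned-points.
    if learned_points > 1.0:
        bayes_verdict = "spam"
    elif learned_points < -1.0:
        bayes_verdict = "ham"
    else:
        bayes_verdict = "unsure"

    if learn_on_error:
        # Assumption: learn unless Bayes already agrees with the rules.
        return bayes_verdict != threshold_verdict

    # Traditional mode: learn unless Bayes strongly dissents.
    return bayes_verdict in (threshold_verdict, "unsure")


# The message from the original log: rules say spam, Bayes says ham.
print(should_learn("spam", -1.9))                       # traditional
print(should_learn("spam", -1.9, learn_on_error=True))  # learn on error
```

The point of the sketch is that the same message flips from "skip" to
"learn" purely on which of the two classifiers you decide to trust.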
Another way to increase autolearning without going all the way to the
"learn on error" behavior is to flag rules that you trust highly as
"autolearn_force" so that messages matching them won't ever be excluded
from autolearning based on the existing Bayes DB disagreeing with the
deterministic rules. I have started doing this for locally-defined
meta-rules that match on multiple hits on "net" rules such as the URIBL
family. My reasoning there is that an identical message can get
autolearned as ham at 12:00 because the spammer filled it with
Bayes-busting garbage and freshly minted payload URLs and sent from a
fresh "snowshoe" range but score well past the autolearn spam threshold
at 12:05 because by then multiple network services checked by SA rules
have switched their opinions. In short: there are