On 8/15/2012 2:24 PM, John Hardin wrote:
> On Wed, 15 Aug 2012, Ben Johnson wrote:
> 
>> Some 99% of the spam that I receive, which is grossly spammy (we're
>> talking auto loans, cash advances, dink pills, the whole lot) contains
>> "BAYES_00=-1.9" in the tests portion of the X-Spam-Status header.
>>
>> Might anyone know why?
> 
> Poor training.

John, I can't thank you enough for the thoroughness of your response.

> Apart from the Bayes score, what kind of scores are those spams getting?

Here are a few examples (the first two are among the VERY few in which
the BAYES_* value is above 00):

-----------------
No, score=0.192 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RDNS_NONE=0.793,
SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7] autolearn=no

No, score=2.241 tag=-999 tag2=3 kill=13 tests=[BAYES_20=-0.001,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RDNS_NONE=0.793,
SPF_PASS=-0.001] autolearn=no

No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no

No, score=1.256 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3,
RDNS_NONE=0.793, SPF_PASS=-0.001, URIBL_DBL_SPAM=1.7,
URIBL_RHS_DOB=1.514] autolearn=no
-----------------

It bears mention that the RCVD_IN_DNSWL_MED test is having even more of
a negative impact (pardon the pun) than BAYES_*. I am already working
with the dnswl.org folks (off-list, for privacy reasons) to get to the
bottom of that issue.
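
As a rough illustration of how much those two negative rules matter, the following Python sketch re-adds the rule scores from the third example above, once as reported and once with the BAYES_* and RCVD_IN_DNSWL_* rules left out. The function names are made up for this illustration; only the header text comes from the report above.

```python
import re

# X-Spam-Status value copied from the third example above.
STATUS = (
    "No, score=-0.836 tag=-999 tag2=3 kill=13 tests=[BAYES_00=-1.9, "
    "HTML_MESSAGE=0.001, RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_DNSWL_MED=-2.3, "
    "RDNS_NONE=0.793, SPF_PASS=-0.001, URI_HEX=1.122] autolearn=no"
)

NUMBER = r"-?\d+(?:\.\d+)?"


def parse_tests(status):
    """Return {rule_name: score} from the tests=[...] part of the header."""
    inside = re.search(r"tests=\[(.*?)\]", status, re.S).group(1)
    pairs = re.findall(r"([A-Z0-9_]+)=(" + NUMBER + ")", inside)
    return {name: float(score) for name, score in pairs}


def total_without(tests, prefixes):
    """Sum the rule scores, skipping rules whose name starts with a prefix."""
    skip = tuple(prefixes)
    return round(sum(s for n, s in tests.items() if not n.startswith(skip)), 3)


tests = parse_tests(STATUS)
tag2 = float(re.search(r"tag2=(" + NUMBER + ")", STATUS).group(1))

full = total_without(tests, [])
adjusted = total_without(tests, ["BAYES_", "RCVD_IN_DNSWL_"])

print(f"score as reported:       {full}")
print(f"without Bayes and DNSWL: {adjusted} (tag2 level is {tag2})")
```

With those two rules excluded, the same message would score 3.364, which is above the tag2 level of 3 shown in the header.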

>> While I have not trained the Bayesian filter manually to date,
> 
> Is there any provision for any manual training in your environment? Have
> you set up training folders where your users can submit messages for
> training? Do you run sa-learn at all?

No, there is no provision. No, I have not set up training folders, and
no, I have not run sa-learn manually at all.

Most of the list is probably laughing, but given the complexity of
SpamAssassin, this crucial requirement was lost on me amidst the sea of
information and instructions. For example,
http://wiki.apache.org/spamassassin/StartUsing makes no mention of the
fact that SA is essentially useless without Bayesian training.

>> how is it that the spammiest of the spam is being classified with
>> BAYES_00 (thereby receiving the score -1.9)? Doesn't BAYES_00 imply
>> that the message is almost certainly not spam?
> 
> BAYES_00 implies that the message in question looks very similar to
> messages the Bayes system has been told are not spam. It depends solely
> on how it has been trained.
> 
> I wasn't aware that autolearning could do a cold-start of Bayes, can
> anyone confirm whether this is the case?
> 
> If it can't then someone somewhere trained bayes up to the default
> minimum 200 hams and 200 spams needed for it to start classifying.
> 
> Before we offer suggestions, some more data from you please:
> 
> What version of SA is this?

# spamassassin --version
SpamAssassin version 3.3.1
  running on Perl version 5.10.1

> What does "sa-learn --dump magic" report about your current Bayes database?

# sa-learn --dump magic
ERROR: Bayes dump returned an error, please re-run with -D for more information

# su amavis -c 'sa-learn --dump magic'
0.000          0          3          0  non-token data: bayes db version
0.000          0      11499          0  non-token data: nspam
0.000          0      39412          0  non-token data: nham
0.000          0     197769          0  non-token data: ntokens
0.000          0 1344331893          0  non-token data: oldest atime
0.000          0 1345056746          0  non-token data: newest atime
0.000          0 1345053771          0  non-token data: last journal sync atime
0.000          0 1345023550          0  non-token data: last expiry atime
0.000          0     345600          0  non-token data: last expire atime delta
0.000          0       6482          0  non-token data: last expire reduction count
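
To make those numbers easier to read, here is a minimal Python sketch that parses the "non-token data" lines from the dump above, compares nspam and nham against the 200-message minimum mentioned earlier, and converts the atime values to UTC dates. The helper names are invented for this example, and the regular expression only assumes the column layout visible in the output above.

```python
import re
from datetime import datetime, timezone

# Output of "sa-learn --dump magic" from above, with wrapped lines rejoined.
DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0      11499          0  non-token data: nspam
0.000          0      39412          0  non-token data: nham
0.000          0     197769          0  non-token data: ntokens
0.000          0 1344331893          0  non-token data: oldest atime
0.000          0 1345056746          0  non-token data: newest atime
0.000          0 1345053771          0  non-token data: last journal sync atime
0.000          0 1345023550          0  non-token data: last expiry atime
0.000          0     345600          0  non-token data: last expire atime delta
0.000          0       6482          0  non-token data: last expire reduction count
"""

# The third numeric column carries the value; the label follows the colon.
LINE = re.compile(r"^\s*[\d.]+\s+\d+\s+(\d+)\s+\d+\s+non-token data: (.+?)\s*$")

MIN_MESSAGES = 200  # default minimum of hams and of spams quoted earlier


def parse_magic(text):
    """Return {label: integer value} for every non-token data line."""
    stats = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            stats[match.group(2)] = int(match.group(1))
    return stats


def as_utc(timestamp):
    """Convert a Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


stats = parse_magic(DUMP)
nspam, nham = stats["nspam"], stats["nham"]
bayes_active = nspam >= MIN_MESSAGES and nham >= MIN_MESSAGES
ratio = round(nham / nspam, 2)

print(f"spam learned: {nspam}, ham learned: {nham}, ham per spam: {ratio}")
print(f"past the {MIN_MESSAGES}/{MIN_MESSAGES} minimum: {bayes_active}")
print(f"oldest token access: {as_utc(stats['oldest atime']).date()}")
print(f"newest token access: {as_utc(stats['newest atime']).date()}")
```

Both counts are far above the minimum even though sa-learn was never run by hand, which suggests (but does not prove) that the database was filled by autolearning.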

> What are all of the bayes_* configuration options in your local config?

None are defined there. There are a few defaults/examples, but they are
commented-out.

> 
> What will probably end up happening is this:
> (1) wipe your Bayes database
> (2) turn off autolearn
> (3) collect several hundred hams and spams for an initial training corpus
> (4) train using that corpus
> (5) evaluate results
> 
> Depending on your mail volume, once Bayes is working well after manual
> training, you may then want to reenable autolearn; I personally suggest
> it only where the volume is high enough and/or the character of mail is
> varied enough to prohibit manual training. You might also want to adjust
> the autolearn thresholds.

That makes sense; thank you for the suggestion.
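
For reference, the wipe-and-retrain steps above could be sketched roughly as follows. This Python snippet only builds and prints the commands; it does not run them. It assumes the corpus has already been sorted into one directory of ham and one of spam (the paths below are hypothetical), and that sa-learn has to run as the amavis user, as in the dump command earlier in this message. Step (2) is not covered by the script; it would be a `bayes_auto_learn 0` line in the local SpamAssassin config.

```python
import shlex


def retrain_commands(ham_dir, spam_dir, user="amavis"):
    """Build the shell commands for steps (1) and (4), plus a final check."""
    steps = [
        ["sa-learn", "--clear"],          # (1) wipe the Bayes database
        ["sa-learn", "--ham", ham_dir],   # (4) train on known-good mail
        ["sa-learn", "--spam", spam_dir], # (4) train on known spam
        ["sa-learn", "--dump", "magic"],  # check the nspam/nham counts
    ]
    # Wrap each command so it runs as the user that owns the Bayes database.
    return [f"su {user} -c {shlex.quote(shlex.join(argv))}" for argv in steps]


# Hypothetical corpus locations; adjust to wherever the messages were saved.
commands = retrain_commands("/var/tmp/corpus/ham", "/var/tmp/corpus/spam")
for command in commands:
    print(command)
```

Printing rather than executing is deliberate: `sa-learn --clear` discards the existing database, so the commands are worth reviewing before anything is run.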

> You may also want to set up some mechanism for users to submit
> misclassified messages for training. Depending on how much you trust
> their judgement the learning from these can be automatic or can go
> through you as a reviewer.

That sounds like a good idea. Is there a particular HOW TO or tutorial
that you recommend? If it depends on the environment/configuration, this
server runs Ubuntu 10.04 with Dovecot, Amavis, Sieve, and SpamAssassin.

> Recommendation: keep your manual training corpus around in case you need
> to do the above again for some reason.

Again, sound advice. Thanks.

-Ben
