newbie questions: sought, sa-learn, rule weights

frederik Sat, 17 Oct 2015 21:37:14 -0700

Hello Users!

Apologies for asking multiple questions; I've just been reading
https://wiki.apache.org/spamassassin/ and have a few things I wanted
to ask.


I'm getting a lot of spam, perhaps 25 messages/day, and about half of
it gets through SpamAssassin. I'm trying to figure out how to fix the
situation.

I tried using the "sought" ruleset following instructions from
http://taint.org/2007/08/15/004348a.html, but didn't see much
difference.

I'm concerned that the BAYES_* rules aren't showing up in my spam
headers, and would like to know if there's a good way to look at the
tokens in the database. When I do "sa-learn --dump data", I see a file
with lines like this:

0.987          1          0 1436496897  0315e1da7f
0.016          0          1 1410284743  0320ba06ef
0.987          1          0 1393199297  0329ec4e6e
0.003          0          5 1268403253  03541effbc
0.008          0          2 1398222936  038d6e997d
0.016          0          1 1429567309  041cabf4ef
0.016          0          1 1431638107  041d441c1b

Is that normal? How do I get at the actual tokens? How do I see how
just the Bayesian part scores a test message? I get a lot of spam
with exactly the same lines in the body of the message, and the
Bayesian classifier doesn't seem to register it.
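
For reference, the dump lines above can be pulled apart with a short
script. This is a minimal sketch, assuming the five columns are spam
probability, spam count, ham count, last access time (Unix epoch) and
a token identifier; the names TokenRow and parse_dump_line are
hypothetical, not part of sa-learn.

```python
from typing import NamedTuple

class TokenRow(NamedTuple):
    prob: float   # assumed: estimated spam probability for the token
    nspam: int    # assumed: times seen in mail learned as spam
    nham: int     # assumed: times seen in mail learned as ham
    atime: int    # assumed: last access time, seconds since the epoch
    token: str    # token identifier exactly as printed in the dump

def parse_dump_line(line: str) -> TokenRow:
    """Split one whitespace-separated dump line into typed fields."""
    prob, nspam, nham, atime, token = line.split(None, 4)
    return TokenRow(float(prob), int(nspam), int(nham), int(atime),
                    token.strip())

# Two lines copied from the dump shown above.
SAMPLE = """\
0.987          1          0 1436496897  0315e1da7f
0.003          0          5 1268403253  03541effbc
"""

rows = [parse_dump_line(l) for l in SAMPLE.splitlines() if l.strip()]
# Tokens the database currently considers strongly spammy.
spammy = [r for r in rows if r.prob > 0.9]
print(spammy)
```

Sorting or filtering the parsed rows by prob, nspam or nham is then a
one-liner, which may help when checking whether training has had any
effect.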

Here's the output of sa-learn --dump magic:

0.000          0          3          0  non-token data: bayes db version
0.000          0      15466          0  non-token data: nspam
0.000          0      30317          0  non-token data: nham
0.000          0    1733267          0  non-token data: ntokens
0.000          0 1098575745          0  non-token data: oldest atime
0.000          0 1441160002          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1441160455          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

I couldn't find sample output on your wiki to compare this with; I'm
worried about the 0.000 lines and the other zeroes.
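
To make the "magic" dump easier to read, here is a minimal sketch that
maps each label to its value and prints the atime entries as UTC
dates. It assumes, from the listing above, that the third column
carries the value and the other numeric columns are placeholders;
parse_magic and as_utc are hypothetical helper names.

```python
from datetime import datetime, timezone

# Lines copied from the "sa-learn --dump magic" output shown above.
MAGIC = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0      15466          0  non-token data: nspam
0.000          0      30317          0  non-token data: nham
0.000          0    1733267          0  non-token data: ntokens
0.000          0 1098575745          0  non-token data: oldest atime
0.000          0 1441160002          0  non-token data: newest atime
0.000          0 1441160455          0  non-token data: last expiry atime
"""

def parse_magic(text):
    """Map each 'non-token data' label to the integer in column three."""
    values = {}
    for line in text.splitlines():
        fields = line.split(None, 4)
        if len(fields) == 5 and fields[4].startswith("non-token data:"):
            label = fields[4].split(":", 1)[1].strip()
            values[label] = int(fields[2])
    return values

def as_utc(epoch):
    """Convert seconds since the Unix epoch to an aware UTC datetime."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)

magic = parse_magic(MAGIC)
print("spam learned:", magic["nspam"], "ham learned:", magic["nham"])
for key in ("oldest atime", "newest atime", "last expiry atime"):
    print(key, "=", as_utc(magic[key]).isoformat())
```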

I'm also thinking that I should employ some kind of sender address
whitelisting using e.g. TxRep. Most of my spam is stuff that I'm
receiving for the first time from a particular sender, and there are a
lot of strings that I can say for sure I'd never find in a Subject
line of a message from a friend who is emailing me for the first time:
"ATTN", "stock tip"... All of the mail I send is Bcc'ed to myself. Is
there a way to get SpamAssassin to notice when this comes in and
automatically whitelist the recipients for me?
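
One way the Bcc idea could be approached outside SpamAssassin is a
small script that reads each self-Bcc'ed copy and emits whitelist_from
lines for the recipients. This is a minimal sketch under stated
assumptions: the function name whitelist_lines and the example
addresses are hypothetical, and how the script gets fed the messages
(e.g. from a delivery filter) is left open.

```python
from email import message_from_string
from email.utils import getaddresses

def whitelist_lines(raw_message, my_address):
    """Return whitelist_from lines for the To/Cc recipients of a message."""
    msg = message_from_string(raw_message)
    fields = msg.get_all("To", []) + msg.get_all("Cc", [])
    # getaddresses yields (display name, address) pairs.
    addrs = {addr.lower() for _name, addr in getaddresses(fields)
             if "@" in addr}
    addrs.discard(my_address.lower())  # do not whitelist myself
    return ["whitelist_from " + a for a in sorted(addrs)]

# Hypothetical outgoing message, as it would arrive via the Bcc copy.
RAW = """\
From: me@example.org
To: Alice <alice@example.com>, bob@example.net
Cc: me@example.org
Subject: hello

body
"""

for line in whitelist_lines(RAW, "me@example.org"):
    print(line)
```

The printed lines could be appended to a user_prefs file. Note that
whitelist_from matches on the sender address alone, which can be
forged, so this may be too permissive for some setups.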

Relatedly, if I create rules for e.g. "ATTN" and "stock tip", then I'd
also like to generate my own rule weights using my own spam/ham
corpora. Does generating the weights still take a week? Why did
SpamAssassin go back to using a genetic algorithm (GA) for this
process? Aren't there some much faster algorithms around?
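
To make the rule-weight question concrete, here is a minimal sketch of
fitting weights for two Subject rules with plain logistic regression,
as one example of a fast alternative to a GA. The rule names, the toy
corpus and the learning parameters are all hypothetical, and this is
not SpamAssassin's own score-generation procedure.

```python
import math
import re

# Hypothetical Subject rules; names and patterns are illustrative only.
RULES = {
    "LOCAL_SUBJ_ATTN": re.compile(r"\bATTN\b"),
    "LOCAL_SUBJ_STOCK_TIP": re.compile(r"\bstock tip\b", re.IGNORECASE),
}

def rule_hits(subject):
    """Return a 0/1 list saying which rules fire on a Subject line."""
    return [1 if rx.search(subject) else 0 for rx in RULES.values()]

def fit_weights(samples, epochs=500, rate=0.5):
    """Fit one weight per rule plus a bias by stochastic gradient descent.

    samples is a list of (subject, is_spam) pairs.
    """
    weights = [0.0] * len(RULES)
    bias = 0.0
    data = [(rule_hits(s), 1.0 if spam else 0.0) for s, spam in samples]
    for _ in range(epochs):
        for hits, label in data:
            z = bias + sum(w * h for w, h in zip(weights, hits))
            pred = 1.0 / (1.0 + math.exp(-z))   # predicted spam probability
            err = label - pred
            bias += rate * err
            weights = [w + rate * err * h for w, h in zip(weights, hits)]
    return dict(zip(RULES, weights)), bias

# Toy corpus of (Subject, is_spam) pairs, invented for this sketch.
CORPUS = [
    ("ATTN: claim your prize", True),
    ("Hot stock tip inside", True),
    ("ATTN stock tip", True),
    ("lunch on friday?", False),
    ("Re: patch review", False),
]

weights, bias = fit_weights(CORPUS)
print(weights, bias)
```

On a corpus this small the fit finishes instantly; whether such a
method scales to a full rule set with sensible scores is a separate
question.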

Thank you in advance,

Frederick
