Hello Users! Apologies for asking multiple questions; I've just been reading https://wiki.apache.org/spamassassin/ and have some things I wanted to ask.
I'm getting a lot of spam, perhaps 25 messages/day, and about half of it gets through SpamAssassin. I'm trying to figure out how to fix the situation. I tried using the "sought" ruleset following the instructions at http://taint.org/2007/08/15/004348a.html, but didn't see much difference.

I'm concerned that the BAYES_* rules aren't showing up in my spam headers, and would like to know if there's a good way to look at the tokens in the database. When I do "sa-learn --dump data", I see lines like this:

    0.987 1 0 1436496897 0315e1da7f
    0.016 0 1 1410284743 0320ba06ef
    0.987 1 0 1393199297 0329ec4e6e
    0.003 0 5 1268403253 03541effbc
    0.008 0 2 1398222936 038d6e997d
    0.016 0 1 1429567309 041cabf4ef
    0.016 0 1 1431638107 041d441c1b

Is that normal? How do I get at the actual tokens? How do I see how it scores a test message, just the Bayesian part? I find that I get a lot of spam with exactly the same lines in the body of the message, and the Bayesian classifier doesn't seem to register it.

Here's the output of "sa-learn --dump magic":

    0.000 0 3 0 non-token data: bayes db version
    0.000 0 15466 0 non-token data: nspam
    0.000 0 30317 0 non-token data: nham
    0.000 0 1733267 0 non-token data: ntokens
    0.000 0 1098575745 0 non-token data: oldest atime
    0.000 0 1441160002 0 non-token data: newest atime
    0.000 0 0 0 non-token data: last journal sync atime
    0.000 0 1441160455 0 non-token data: last expiry atime
    0.000 0 0 0 non-token data: last expire atime delta
    0.000 0 0 0 non-token data: last expire reduction count

I couldn't find a sample output on your Wiki with which to compare this; I'm worried about the 0.000 lines and other zeroes.

I'm also thinking that I should employ some kind of sender address whitelisting using e.g. TxRep.
Most of my spam is stuff that I'm receiving for the first time from a particular sender, and there are a lot of strings that I can say for sure I'd never find in the Subject line of a message from a friend who is emailing me for the first time: "ATTN", "stock tip"...

All of the mail I send is Bcc'ed to myself. Is there a way to get SpamAssassin to notice when this comes in and automatically whitelist the recipients for me?

Relatedly, if I create rules for e.g. "ATTN" and "stock tip", then I'd also like to generate my own rule weights using my own spam/ham corpora. Does it still take a week to do? Why did SpamAssassin go back to using a GA for this process? Aren't there some much faster algorithms around?

Thank you in advance,
Frederick
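
P.S. In case it helps anyone reading the dump numbers, here is a rough Python sketch of how I have been trying to decode them. It rests on assumptions I have not confirmed: that each "sa-learn --dump" line has the columns probability, spam count, ham count, atime (Unix epoch seconds) and token, and that the "non-token data" lines carry their value in the third column. The function names (parse_magic, parse_token_line, epoch_to_date) are my own and not part of SpamAssassin.

```python
from datetime import datetime, timezone

# A few lines copied from my "sa-learn --dump magic" output.
MAGIC_DUMP = """\
0.000 0 3 0 non-token data: bayes db version
0.000 0 15466 0 non-token data: nspam
0.000 0 30317 0 non-token data: nham
0.000 0 1733267 0 non-token data: ntokens
0.000 0 1098575745 0 non-token data: oldest atime
0.000 0 1441160002 0 non-token data: newest atime
"""


def parse_magic(text):
    """Map each 'non-token data' label to its value.

    Assumption: the value sits in the third whitespace-separated column.
    """
    values = {}
    for line in text.splitlines():
        parts = line.split(None, 4)
        if len(parts) != 5 or not parts[4].startswith("non-token data:"):
            continue  # not a magic line, skip it
        label = parts[4].split(":", 1)[1].strip()
        values[label] = int(parts[2])
    return values


def parse_token_line(line):
    """Split one "--dump data" line into named fields.

    Assumption: columns are probability, nspam, nham, atime, token.
    """
    prob, nspam, nham, atime, token = line.split()
    return {
        "prob": float(prob),
        "nspam": int(nspam),
        "nham": int(nham),
        "atime": int(atime),
        "token": token,
    }


def epoch_to_date(seconds):
    """Render Unix epoch seconds as a UTC calendar date (YYYY-MM-DD)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")


if __name__ == "__main__":
    magic = parse_magic(MAGIC_DUMP)
    print("spam messages learned:", magic["nspam"])
    print("ham messages learned: ", magic["nham"])
    print("newest token access:  ", epoch_to_date(magic["newest atime"]))
    print(parse_token_line("0.987 1 0 1436496897 0315e1da7f"))
```

If that reading is right, the leading 0.000 on the magic lines would just be an unused probability column rather than a sign of trouble, but I'd appreciate confirmation.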