Re: Why shouldn't I set the score for SPAM_99 and SPAM_999 higher?

2022-05-05 Thread Thomas Cameron

On 5/5/22 14:28, Dave Wreski wrote:
No, that's how you train your corpora. If you manually look through 
the headers of mail that's already been processed by your mail system, 
the ham should be as close to BAYES_00 as possible, and spam should be 
at BAYES_99. If that's not the case, then it's been trained incorrectly.


/etc/mail/spamassassin/local.cf:
bayes_auto_learn  0
bayes_auto_expire 0

I'd also recommend disabling auto-learn, if you have that enabled.

If you've gone through your corpus manually, and are certain the ham 
is all good mail and the spam emails are all bad mail, then it might 
be worth it to dump the existing bayes database and just retrain it 
with the corresponding mboxes.


I also typically add --progress to sa-learn.

Best,
Dave



Thanks, I appreciate it. I'll tune it a bit.

Thomas



Re: Why shouldn't I set the score for SPAM_99 and SPAM_999 higher?

2022-05-05 Thread Dave Wreski



That's a great call, thanks. I grepped my mail files and didn't find 
any SPAM_99 headers in any of them.


You should be looking for BAYES_99 and BAYES_999 in your corpus.



Thanks, Dave. I use my various mailboxes (sa-learn --ham --mbox 
/home/thomas.cameron/mail/INBOX/[mailbox file] and then sa-learn --spam 
--mbox /home/thomas.cameron/mail/INBOX/spam) to train SA, doesn't that 
mean that I've already checked my corpora?


No, that's how you train your corpora. If you manually look through the 
headers of mail that's already been processed by your mail system, the 
ham should be as close to BAYES_00 as possible, and spam should be at 
BAYES_99. If that's not the case, then it's been trained incorrectly.


/etc/mail/spamassassin/local.cf:
bayes_auto_learn  0
bayes_auto_expire 0

I'd also recommend disabling auto-learn, if you have that enabled.

If you've gone through your corpus manually, and are certain the ham is 
all good mail and the spam emails are all bad mail, then it might be 
worth it to dump the existing bayes database and just retrain it with 
the corresponding mboxes.


I also typically add --progress to sa-learn.

Best,
Dave
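
For reference, the dump-and-retrain step described above could look roughly like the sketch below. It is not a definitive procedure: the function name retrain_bayes, the SA_LEARN override variable, and the backup file are made up for illustration, and it assumes sa-learn runs as the user that owns the Bayes database. The sa-learn options used (--backup, --clear, --ham, --spam, --mbox, --progress, --dump magic) are standard ones.

```shell
# Sketch: back up, wipe, and retrain the Bayes database from two mbox files.
# SA_LEARN is a hypothetical override (e.g. SA_LEARN=echo gives a dry run).
SA_LEARN="${SA_LEARN:-sa-learn}"

retrain_bayes() {
    ham_mbox="$1"; spam_mbox="$2"; backup="$3"
    # Keep a copy of the old database; it can be loaded again with --restore.
    "$SA_LEARN" --backup > "$backup" || return 1
    # Wipe the existing Bayes database.
    "$SA_LEARN" --clear || return 1
    # Retrain from the hand-verified corpora, showing a progress bar.
    "$SA_LEARN" --ham  --mbox --progress "$ham_mbox"  || return 1
    "$SA_LEARN" --spam --mbox --progress "$spam_mbox" || return 1
    # Print the database counters (number of spam and ham messages learned).
    "$SA_LEARN" --dump magic
}

# Example call; all three paths are placeholders:
# retrain_bayes /path/to/ham.mbox /path/to/spam.mbox /path/to/bayes-backup.txt
```

Taking the backup first means a bad retraining run can be undone with sa-learn --restore.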





Thomas


Re: Why shouldn't I set the score for SPAM_99 and SPAM_999 higher?

2022-05-05 Thread Thomas Cameron

On 5/5/22 11:59, Dave Wreski wrote:



You should probably check that none of your ham (i.e. non-spam)
messages contains SPAM_99 or SPAM_999. It can happen when spammers
poison your bayes database, and increased score in that case might
lead to legitimate mail being misclassified as spam.


That's a great call, thanks. I grepped my mail files and didn't find 
any SPAM_99 headers in any of them.


You should be looking for BAYES_99 and BAYES_999 in your corpus.



Thanks, Dave. I use my various mailboxes (sa-learn --ham --mbox 
/home/thomas.cameron/mail/INBOX/[mailbox file] and then sa-learn --spam 
--mbox /home/thomas.cameron/mail/INBOX/spam) to train SA, doesn't that 
mean that I've already checked my corpora?


Thomas



Re: Why shouldn't I set the score for SPAM_99 and SPAM_999 higher?

2022-05-05 Thread Dave Wreski




You should probably check that none of your ham (i.e. non-spam)
messages contains SPAM_99 or SPAM_999. It can happen when spammers
poison your bayes database, and increased score in that case might
lead to legitimate mail being misclassified as spam.


That's a great call, thanks. I grepped my mail files and didn't find any 
SPAM_99 headers in any of them.


You should be looking for BAYES_99 and BAYES_999 in your corpus.

Best,
Dave
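
A minimal sketch of that check, for anyone following along: it counts how many X-Spam-Status headers in an mbox list a given rule. The function name count_rule_hits is made up for illustration. It assumes the stock "X-Spam-Status: ... tests=RULE_A,RULE_B" header (possibly folded across continuation lines) and does not distinguish real headers from header-like lines inside a message body.

```shell
# Count X-Spam-Status headers in an mbox that list a given rule name.
# Matches whole rule names only, so BAYES_99 does not also match BAYES_999.
count_rule_hits() {
    rule="$1"; mbox="$2"
    awk -v rule="$rule" '
        # Evaluate the header collected so far, then reset it.
        function tally() {
            if (hdr != "" && hdr ~ ("(^|[^A-Z0-9_])" rule "([^A-Z0-9_]|$)")) hits++
            hdr = ""
        }
        /^X-Spam-Status:/     { tally(); hdr = $0; next }   # start of the header
        hdr != "" && /^[ \t]/ { hdr = hdr $0; next }        # folded continuation line
                              { tally() }                   # any other line ends it
        END                   { tally(); print hits + 0 }
    ' "$mbox"
}

# Example (the ham path is a placeholder; the spam path is the one from this thread):
# count_rule_hits BAYES_99  /path/to/ham.mbox
# count_rule_hits BAYES_999 /home/thomas.cameron/mail/INBOX/spam
```

A well-trained setup should give a count at or near zero for BAYES_99 on ham, and a high count on the spam mbox.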




Re: Why shouldn't I set the score for SPAM_99 and SPAM_999 higher?

2022-05-05 Thread Thomas Cameron

On 5/5/22 11:47, Matija Nalis wrote:

On Thu, May 05, 2022 at 10:37:40AM -0500, Thomas Cameron wrote:

I understand that turning knobs without understanding the consequences can
do bad things, but almost all of the spam that gets through SA on my server
has SPAM_99 or SPAM_999 set in the headers. It is obviously spam, so I don't
really get how it wasn't flagged, but it wasn't. What are the risks of
giving more weight to SPAM_99 and/or SPAM_999? Explain it like I'm five,
sorry, it's probably something simple that I just don't understand.

Thomas


You should probably check that none of your ham (i.e. non-spam)
messages contains SPAM_99 or SPAM_999. It can happen when spammers
poison your bayes database, and increased score in that case might
lead to legitimate mail being misclassified as spam.


That's a great call, thanks. I grepped my mail files and didn't find any 
SPAM_99 headers in any of them.


Thomas



Re: Why shouldn't I set the score for SPAM_99 and SPAM_999 higher?

2022-05-05 Thread Matija Nalis

You should probably check that none of your ham (i.e. non-spam)
messages contains SPAM_99 or SPAM_999. It can happen when spammers
poison your bayes database, and increased score in that case might
lead to legitimate mail being misclassified as spam.

On Thu, May 05, 2022 at 10:37:40AM -0500, Thomas Cameron wrote:
> I understand that turning knobs without understanding the consequences can
> do bad things, but almost all of the spam that gets through SA on my server
> has SPAM_99 or SPAM_999 set in the headers. It is obviously spam, so I don't
> really get how it wasn't flagged, but it wasn't. What are the risks of
> giving more weight to SPAM_99 and/or SPAM_999? Explain it like I'm five,
> sorry, it's probably something simple that I just don't understand.
> 
> Thomas
> 

-- 
Opinions above are GNU-copylefted.


Re: Why shouldn't I set the score for SPAM_99 and SPAM_999 higher?

2022-05-05 Thread Thomas Cameron

On 5/5/22 10:46, Reindl Harald wrote:



On 05.05.22 at 17:37, Thomas Cameron wrote:
I understand that turning knobs without understanding the 
consequences can do bad things, but almost all of the spam that gets 
through SA on my server has SPAM_99 or SPAM_999 set in the headers. 
It is obviously spam, so I don't really get how it wasn't flagged, 
but it wasn't. What are the risks of giving more weight to SPAM_99 
and/or SPAM_999? Explain it like I'm five, sorry, it's probably 
something simple that I just don't understand


when your bayes is well trained just raise it

the risk is simple: when your bayes isn't trained well or is poisoned 
(autolearning is the root of all evil) you risk FPs


we milter-reject at 8.0 points, and BAYES_99 + BAYES_999 have added up 
to 7.5 points since 2014; most junk collects the remaining 0.5 points 
from other rules, and the few FPs typically hit some DNSWL/SPF rules 
with negative scores


well, our bayes has 160k messages
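
The arithmetic behind that setup can be sketched as below. Only the 8.0 threshold, the combined 7.5 Bayes score, and the 0.5 remainder come from the message above; the verdict function and the -1.0 whitelist score are hypothetical, picked purely for illustration.

```shell
# Sum the given rule scores and compare the total against a reject threshold.
# Usage: verdict THRESHOLD SCORE [SCORE ...]
verdict() {
    threshold="$1"; shift
    printf '%s\n' "$@" |
        awk -v t="$threshold" '{ s += $1 } END { print ((s >= t) ? "reject" : "accept") }'
}

# Spam: 7.5 from BAYES_99 + BAYES_999, plus 0.5 collected from other rules.
verdict 8.0 7.5 0.5     # prints "reject"

# False positive on Bayes, offset by a hypothetical -1.0 DNSWL/SPF score.
verdict 8.0 7.5 -1.0    # prints "accept"
```

The gap left below the threshold is what lets a mistaken Bayes hit on ham be offset by negative-scoring rules.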



Many thanks! I appreciate the response!

Thomas