Re: Training spamassassin past 5,000 emails

2021-03-09 Thread Kris Deugau

RW wrote:
> On Tue, 09 Mar 2021 08:52:28 -0500
> Steve Dondley wrote:
>
> > I will also be allowing users to flag their own spam using the
> > roundcube webmail client.
>
> If you do that you should review the submissions.


This.  SO much this.  ALL THE THIS.

If you're using the "Mark as Junk" or "Mark as Junk 2" plugin you will 
get a LOT of mail mistakenly marked as spam when the user intended to 
just delete it.  The icon in the "classic" theme/skin is VERY easy to 
mistake for "Delete".  It was so bad here we had to patch in a little 
Javascript confirmation popup when we first added Roundcube to our 
webmail stable.


Aside from that you will *also* get people who deliberately mark 
anything they don't want as spam.  This is not terribly healthy for the 
Bayes DB, and if you do any other local processing or deconstruction it 
will poison those processes as well.


-kgd


Re: Training spamassassin past 5,000 emails

2021-03-09 Thread RW


On Tue, 09 Mar 2021 08:52:28 -0500
Steve Dondley wrote:

> On 2021-03-09 08:42 AM, RW wrote:

> > 
> > If you keep a full archive of what's been trained, I think it makes
> > sense to trim out old mail occasionally and recreate the database -
> > particularly if it's a single user Bayes.  
> 
> I'm harvesting spam/ham across multiple servers from many different 
> users on each server. Is there anything I should be aware of or
> worried about doing something like this? Do I risk the effectiveness
> of SA if it's not tailored to a specific user?

I was really thinking more of an individual running SA for their
own mail. It would be unusual for an admin to keep a full archive of
trained mail for each account.

Per-user Bayes can be more accurate, but only if users take the
training seriously.

> I will also be allowing users to flag their own spam using the
> roundcube webmail client.

If you do that you should review the submissions. 

> I'm not clear how the individual SA
> database works when there is also a server-wide database.

It's one or the other.






Re: Training spamassassin past 5,000 emails

2021-03-09 Thread Bill Cole

On 9 Mar 2021, at 7:49, Steve Dondley wrote:

> I've read through
> https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which
> states that "anything over about 5000 messages does not improve
> accuracy significantly in our tests."

Did you read the section on expiration?
https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html#expiration

> So once I hit 5,000, what do?

Be happy that you've reached near-optimal Bayes accuracy.

> Do I run --forget on say the 500 oldest emails, delete those from my
> ham/spam folders and then add in a batch of 500 newer ham/spam emails
> and then run sa-learn on all the emails in my spam/ham folders?


There are edge cases where using --force-expire periodically is 
necessary to get expiration to run often enough to avoid bloat, but 
unless you have autolearn on and high volume you are unlikely to run 
into that problem. If you are only doing manual learning, all should be 
well.
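
For anyone who wants to see where a database stands relative to the 5,000-message figure, the message and token counters can be read from the output of "sa-learn --dump magic". The following is a minimal sketch, not something taken from the replies above. It assumes the usual layout of that dump (the value in the third column, the label after "non-token data:"). The sample text and its numbers are invented for illustration, and parse_dump_magic and force_expire_command are hypothetical helper names. The second helper only builds the --force-expire command line; it does not run anything.

```python
# Invented sample in the layout of `sa-learn --dump magic`; the numbers are
# illustrative only and do not come from any real database.
SAMPLE_DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       5120          0  non-token data: nspam
0.000          0       4870          0  non-token data: nham
0.000          0     148213          0  non-token data: ntokens
"""

PREFIX = "non-token data: "


def parse_dump_magic(text):
    """Return {label: value} for the 'non-token data' lines of a magic dump.

    Assumption: the counter value sits in the third whitespace-separated
    column and the label follows the 'non-token data: ' prefix.
    """
    counters = {}
    for line in text.splitlines():
        fields = line.split(None, 4)
        if len(fields) == 5 and fields[4].startswith(PREFIX):
            label = fields[4][len(PREFIX):].strip()
            counters[label] = int(fields[2])
    return counters


def force_expire_command():
    """Build (but do not run) the command that forces a token expiry pass."""
    return ["sa-learn", "--force-expire"]


if __name__ == "__main__":
    counts = parse_dump_magic(SAMPLE_DUMP)
    print("spam learned:", counts["nspam"], "ham learned:", counts["nham"])
    print("tokens:", counts["ntokens"])
    print("expiry command:", " ".join(force_expire_command()))
```

In practice the text would come from running the dump as the user that owns the Bayes database. Whether a forced expiry is worth scheduling depends on how quickly the token count grows, which is the edge case described above.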



--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire


Re: Training spamassassin past 5,000 emails

2021-03-09 Thread Steve Dondley

On 2021-03-09 08:28 AM, Greg Troxel wrote:

> Steve Dondley writes:
>
> > I've read through
> > https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which
> > states that "anything over about 5000 messages does not improve
> > accuracy significantly in our tests."
>
> I would take that with a grain of salt.  Based on my experience running
> SA for many years, I'd say that if you have new spam that isn't like
> the spam you already have, learning on it will help.
>
> Also, I take it as a comment about "there's no need to try hard to get
> more than 5K messages".  It doesn't say, "if you train on more than 5000
> bad things will happen".
>
> > So once I hit 5,000, what do? Do I run --forget on say the 500 oldest
> > emails, delete those from my ham/spam folders and then add in a batch
> > of 500 newer ham/spam emails and then run sa-learn on all the emails
> > in my spam/ham folders?
>
> I've been running sa-learn daily over my ham folders and my spam folders
> for years.  I refile spam and ham so that it will be learned.  I find
> the bayes scoring is quite good except for novel spam.  My bayes_* files
> are about 83M in total.
>
> So I don't think you necessarily have a problem to solve.


OK, thanks for the advice. Appreciated.



Re: Training spamassassin past 5,000 emails

2021-03-09 Thread RW
On Tue, 09 Mar 2021 07:49:38 -0500
Steve Dondley wrote:

> I've read through 
> https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which 
> states that "anything over about 5000 messages does not improve
> accuracy significantly in our tests."
> 
> So once I hit 5,000, what do? Do I run --forget on say the 500 oldest 
> emails, delete those from my ham/spam folders and then add in a batch
> of 500 newer ham/spam emails and then run sa-learn on all the emails
> in my spam/ham folders?


You don't *need* to do anything, that figure is about diminishing
returns. 

If you keep a full archive of what's been trained, I think it makes
sense to trim out old mail occasionally and recreate the database -
particularly if it's a single user Bayes.
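
The trim-and-recreate idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the archive is taken to be plain directories with one message per file, file modification time stands in for message age, and stale_files and rebuild_commands are hypothetical helper names. The commands are built as argument lists and are not executed here, since "sa-learn --clear" wipes the existing database.

```python
import os
import time

SECONDS_PER_DAY = 86400


def stale_files(archive_dir, max_age_days, now=None):
    """List files under archive_dir older than max_age_days.

    Assumption: file mtime is a good enough proxy for the age of the
    message. The returned paths are candidates for trimming; nothing is
    deleted by this function.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for root, _dirs, files in os.walk(archive_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return stale


def rebuild_commands(spam_dir, ham_dir):
    """Build the commands for recreating the database from the archive.

    Order matters: clear the old database first, then relearn spam and
    ham from whatever remains in the (already trimmed) archive.
    """
    return [
        ["sa-learn", "--clear"],
        ["sa-learn", "--spam", spam_dir],
        ["sa-learn", "--ham", ham_dir],
    ]


if __name__ == "__main__":
    # Hypothetical archive locations, shown for illustration only.
    for cmd in rebuild_commands("/srv/bayes-archive/spam", "/srv/bayes-archive/ham"):
        print(" ".join(cmd))
```

Reviewing the list returned by stale_files before removing anything keeps the trimming step reversible up to the point where the database is cleared.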

 


Re: Training spamassassin past 5,000 emails

2021-03-09 Thread Greg Troxel

Steve Dondley  writes:

> I've read through
> https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which
> states that "anything over about 5000 messages does not improve
> accuracy significantly in our tests."

I would take that with a grain of salt.   Based on my experience running
SA for many years, I'd say that if you have new spam  that isn't like
the spam you already have, learning on it will help.

Also, I take it as a comment about "there's no need to try hard to get
more than 5K messages".  It doesn't say, "if you train on more than 5000
bad things will happen".

> So once I hit 5,000, what do? Do I run --forget on say the 500 oldest
> emails, delete those from my ham/spam folders and then add in a batch
> of 500 newer ham/spam emails and then run sa-learn on all the emails
> in my spam/ham folders?

I've been running sa-learn daily over my ham folders and my spam folders
for years.  I refile spam and ham so that it will be learned.  I find
the bayes scoring is quite good except for novel spam.  My bayes_* files
are about 83M in total.

So I don't think you necessarily have a problem to solve.
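
A daily run over ham and spam folders like the one described above might look something like this. It is a minimal sketch under assumptions, not the poster's actual setup: the folder paths are hypothetical, training_commands and run_training are made-up helper names, and the script defaults to a dry run that only prints the commands. As far as I know sa-learn keeps track of messages it has already learned, so pointing it at the same folders every day should mostly process the newly refiled mail.

```python
import subprocess


def training_commands(spam_dir, ham_dir):
    """Build one sa-learn invocation per folder; nothing is executed here."""
    return [
        ["sa-learn", "--spam", spam_dir],
        ["sa-learn", "--ham", ham_dir],
    ]


def run_training(spam_dir, ham_dir, dry_run=True):
    """Print the commands, or run them when dry_run is False.

    Returns the list of commands so a caller (or a log) can record them.
    """
    commands = training_commands(spam_dir, ham_dir)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if sa-learn exits non-zero
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical Maildir locations; adjust to the real folder layout.
    run_training("/home/user/Maildir/.Spam/cur", "/home/user/Maildir/.Ham/cur")
```

Scheduling it once a day from cron, as the user who owns the Bayes database, would match the routine described in the message.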




Training spamassassin past 5,000 emails

2021-03-09 Thread Steve Dondley
I've read through 
https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which 
states that "anything over about 5000 messages does not improve accuracy 
significantly in our tests."


So once I hit 5,000, what do? Do I run --forget on say the 500 oldest 
emails, delete those from my ham/spam folders and then add in a batch of 
500 newer ham/spam emails and then run sa-learn on all the emails in my 
spam/ham folders?