Re: Training spamassassin past 5,000 emails
RW wrote:
> On Tue, 09 Mar 2021 08:52:28 -0500 Steve Dondley wrote:
> > I will also be allowing users to flag their own spam using the
> > roundcube webmail client.
>
> If you do that you should review the submissions.

This. SO much this. ALL THE THIS.

If you're using the "Mark as Junk" or "Mark as Junk 2" plugin you will
get a LOT of mail mistakenly marked as spam when the user intended to
just delete it. The icon in the "classic" theme/skin is VERY easy to
mistake for "Delete". It was so bad here we had to patch in a little
JavaScript confirmation popup when we first added Roundcube to our
webmail stable.

Aside from that you will *also* get people who deliberately mark
anything they don't want as spam. This is not terribly healthy for the
Bayes DB, and if you do any other local processing or deconstruction it
will poison those processes as well.

-kgd
Re: Training spamassassin past 5,000 emails
On Tue, 09 Mar 2021 08:52:28 -0500 Steve Dondley wrote:
> On 2021-03-09 08:42 AM, RW wrote:
> >
> > If you keep a full archive of what's been trained, I think it makes
> > sense to trim out old mail occasionally and recreate the database -
> > particularly if it's a single user Bayes.
>
> I'm harvesting spam/ham across multiple servers from many different
> users on each server. Is there anything I should be aware of or
> worried about doing something like this? Do I risk the effectiveness
> of SA if it's not tailored to a specific user?

I was really thinking more of an individual running SA for their own
mail. It would be unusual for an admin to keep a full archive of
trained mail for each account.

Per user Bayes can be more accurate, but only if users take the
training seriously.

> I will also be allowing users to flag their own spam using the
> roundcube webmail client.

If you do that you should review the submissions.

> I'm not clear how the individual SA
> database works when there is also a server-wide database.

It's one or the other.
Re: Training spamassassin past 5,000 emails
On 9 Mar 2021, at 7:49, Steve Dondley wrote:

> I've read through
> https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which
> states that "anything over about 5000 messages does not improve
> accuracy significantly in our tests."

Did you read the section on expiration?
https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html#expiration

> So once I hit 5,000, what do?

Be happy that you've reached near-optimal Bayes accuracy.

> Do I run --forget on say the 500 oldest emails, delete those from my
> ham/spam folders and then add in a batch of 500 newer ham/spam emails
> and then run sa-learn on all the emails in my spam/ham folders?

There are edge cases where using --force-expire periodically is
necessary to get expiration to run often enough to avoid bloat, but
unless you have autolearn on and high volume you are unlikely to run
into that problem. If you are only doing manual learning, all should be
well.

--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire
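
For anyone who wants to see where their database stands before deciding
whether a scheduled --force-expire is worth the bother, here is a rough
Python sketch that reads the counters printed by "sa-learn --dump magic".
It assumes the usual column layout of that output (probability, spam
count, ham count, atime, then a "non-token data:" label, with the value
in the third column). The function names and the numbers in SAMPLE are
made up for illustration, so check them against what your own install
prints.

```python
import subprocess


def parse_magic(dump_text):
    """Extract the 'non-token data' counters from `sa-learn --dump magic` text.

    Assumes each magic line has at least three numeric columns before the
    label and that the counter value sits in the third column.
    """
    counters = {}
    for line in dump_text.splitlines():
        head, sep, name = line.partition("non-token data:")
        if not sep:
            continue  # not a magic line
        fields = head.split()
        if len(fields) >= 3:
            counters[name.strip()] = int(fields[2])
    return counters


def read_magic():
    """Run sa-learn for the current user and return the parsed counters."""
    result = subprocess.run(
        ["sa-learn", "--dump", "magic"],
        capture_output=True, text=True, check=True,
    )
    return parse_magic(result.stdout)


# Hypothetical sample in the assumed layout; the values are invented.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       5120          0  non-token data: nspam
0.000          0       4870          0  non-token data: nham
0.000          0     148211          0  non-token data: ntokens
"""

if __name__ == "__main__":
    print(parse_magic(SAMPLE))
```

If ntokens keeps climbing on a high-volume, autolearning setup, that is
the case where running "sa-learn --force-expire" from cron may be worth
trying.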
Re: Training spamassassin past 5,000 emails
On 2021-03-09 08:28 AM, Greg Troxel wrote:
> Steve Dondley writes:
>
> > I've read through
> > https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which
> > states that "anything over about 5000 messages does not improve
> > accuracy significantly in our tests."
>
> I would take that with a grain of salt. Based on my experience running
> SA for many years, I'd say that if you have new spam that isn't like
> the spam you already have, learning on it will help.
>
> Also, I take it as a comment about "there's no need to try hard to get
> more than 5K messages". It doesn't say, "if you train on more than 5000
> bad things will happen".
>
> > So once I hit 5,000, what do? Do I run --forget on say the 500 oldest
> > emails, delete those from my ham/spam folders and then add in a batch
> > of 500 newer ham/spam emails and then run sa-learn on all the emails
> > in my spam/ham folders?
>
> I've been running sa-learn daily over my ham folders and my spam
> folders for years. I refile spam and ham so that it will be learned.
> I find the bayes scoring is quite good except for novel spam. My
> bayes_* files are about 83M in total.
>
> So I don't think you necessarily have a problem to solve.

OK, thanks for the advice. Appreciated.
Re: Training spamassassin past 5,000 emails
On Tue, 09 Mar 2021 07:49:38 -0500 Steve Dondley wrote:
> I've read through
> https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which
> states that "anything over about 5000 messages does not improve
> accuracy significantly in our tests."
>
> So once I hit 5,000, what do? Do I run --forget on say the 500 oldest
> emails, delete those from my ham/spam folders and then add in a batch
> of 500 newer ham/spam emails and then run sa-learn on all the emails
> in my spam/ham folders?

You don't *need* to do anything, that figure is about diminishing
returns.

If you keep a full archive of what's been trained, I think it makes
sense to trim out old mail occasionally and recreate the database -
particularly if it's a single user Bayes.
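
The trim-and-recreate idea above could look something like the Python
sketch below. It is only an outline under a few assumptions: the archive
is one message per file in a spam directory and a ham directory, file
modification time stands in for message age, and the helper names
(scan_folder, split_by_age, rebuild_commands) are invented here. The
sa-learn options used (--clear, --spam, --ham) are the documented ones.
Nothing is deleted or executed by the sketch itself.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def scan_folder(folder):
    """Yield (path, mtime) for every regular file directly inside `folder`."""
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            yield path, path.stat().st_mtime


def split_by_age(entries, max_age_days, now=None):
    """Split (item, timestamp) pairs into (keep, stale) lists.

    An item is stale when its timestamp is more than `max_age_days`
    older than `now` (default: the current time).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    keep, stale = [], []
    for item, stamp in entries:
        (keep if stamp >= cutoff else stale).append(item)
    return keep, stale


def rebuild_commands(spam_dir, ham_dir):
    """Commands that wipe the Bayes database and retrain from the archive."""
    return [
        ["sa-learn", "--clear"],
        ["sa-learn", "--spam", str(spam_dir)],
        ["sa-learn", "--ham", str(ham_dir)],
    ]
```

A typical use would be to call split_by_age(scan_folder(spam_dir), 365),
move the stale files out of the archive, do the same for ham, and then
run the commands from rebuild_commands in order. The 365-day window is
an arbitrary example, not a recommendation from this thread.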
Re: Training spamassassin past 5,000 emails
Steve Dondley writes:

> I've read through
> https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which
> states that "anything over about 5000 messages does not improve
> accuracy significantly in our tests."

I would take that with a grain of salt. Based on my experience running
SA for many years, I'd say that if you have new spam that isn't like the
spam you already have, learning on it will help.

Also, I take it as a comment about "there's no need to try hard to get
more than 5K messages". It doesn't say, "if you train on more than 5000
bad things will happen".

> So once I hit 5,000, what do? Do I run --forget on say the 500 oldest
> emails, delete those from my ham/spam folders and then add in a batch
> of 500 newer ham/spam emails and then run sa-learn on all the emails
> in my spam/ham folders?

I've been running sa-learn daily over my ham folders and my spam folders
for years. I refile spam and ham so that it will be learned. I find
the bayes scoring is quite good except for novel spam. My bayes_* files
are about 83M in total.

So I don't think you necessarily have a problem to solve.
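
A daily run over fixed ham and spam folders, as described above, might
be wrapped along these lines. This is a sketch, not the setup used by
anyone in this thread: the function names and the dry_run switch are
invented for illustration, and the folder paths are whatever you pass
in. It leans on two documented sa-learn behaviours: messages that were
already learned are skipped on later runs, and --no-sync followed by a
single --sync defers the journal sync to the end of the batch.

```python
import subprocess


def daily_learn_commands(spam_folders, ham_folders):
    """Build one sa-learn call per folder, then a single final sync."""
    cmds = [["sa-learn", "--no-sync", "--spam", str(f)] for f in spam_folders]
    cmds += [["sa-learn", "--no-sync", "--ham", str(f)] for f in ham_folders]
    cmds.append(["sa-learn", "--sync"])
    return cmds


def run_all(cmds, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Because relearning is skipped, pointing this at the same folders every
day only costs the time needed to scan them; refiling a message from
spam to ham (or back) is what makes sa-learn pick up the correction.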
Training spamassassin past 5,000 emails
I've read through https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which states that "anything over about 5000 messages does not improve accuracy significantly in our tests." So once I hit 5,000, what do? Do I run --forget on say the 500 oldest emails, delete those from my ham/spam folders and then add in a batch of 500 newer ham/spam emails and then run sa-learn on all the emails in my spam/ham folders?