Re: Bayes expiration with Redis backend
On Fri, 27 Feb 2015 15:45:46 +0100 Matteo Dessalvi wrote:

> Thanks a lot for the explanation Mark, it was very clear.
> It would be a good idea to consider adding that to the
> perldoc of the BayesStore/Redis.pm module.

It's not particular to Redis: the counts you quoted are simply the total number of emails learned by Bayes.
Re: Bayes expiration with Redis backend
Thanks a lot for the explanation Mark, it was very clear. It would be a good idea to consider adding that to the perldoc of the BayesStore/Redis.pm module.

Regards,
Matteo

On 27.02.2015 14:55, Mark Martinec wrote:
> When redis automatically expires tokens internally based on their TTL,
> this operation does not affect the nspam and nham counts. These counts
> just grow all the time (as there is no explicit expiration that
> SpamAssassin would know about), reflecting the count of (auto)learning
> operations. Don't worry about large nspam and/or nham counts when redis
> is in use; all that matters is that these counts are above 200
> (otherwise bayes is disabled).
>
> You may get the number of tokens that are actually in the redis database
> (not expired) by counting the number of lines produced on stdout by
> 'sa-learn --backup' or 'sa-learn --dump data'. The format of fields
> produced by --dump data is:
>
>   probability spam_count ham_count atime token
>
> The --backup format is similar, but does not provide probabilities, just
> spam and ham counts. To get some estimate on the number of hammy vs.
> spammy tokens (not messages) currently in a database, try something like:
>
>   sa-learn --dump data | \
>     awk '$1<0.1 {h++}; $1>0.9 {s++}; END{printf("h=%d, s=%d\n",h,s)}'
>
> (Caveat: sa-learn --backup or --dump data may not work on a huge
> database, as they need all the tokens (redis keys) to fit into memory.)
>
> Mark
Re: Bayes expiration with Redis backend
On 27.02.2015 13:54, Axb wrote:
> Is it possible you reject so much spam that SA sees very little spam?

I believe that's the case. A combination of Postfix policies, blacklists, ClamAV plus additional signatures, etc. greatly reduces the amount of email sent through the filtering pipeline.

> If classification makes sense, and you're not getting FPs or tons of
> spam with BAYES_00, you have nothing to worry about. Sometimes you have
> to trust your gut feeling... don't let wild bayes theories misguide
> you... watch your logs, listen to your users and build up a feel for it.

I will certainly do that. So far the feedback from our users is quite good, so I can say I am happy with our current setup.

Regards,
Matteo
Re: Bayes expiration with Redis backend
Matteo Dessalvi wrote:
> I am using a centralized Redis instance to host the bayesian data for a
> bunch of MTAs. AFAICS the SA filter is working quite well and the
> BAYES_* rules are triggered correctly, no false positives so far.
>
> But I am concerned about the expiration of the bayesian data. sa-learn
> reports the following:
>
>   0.000          0          3          0  non-token data: bayes db version
>   0.000          0       8437          0  non-token data: nspam
>   0.000          0     495000          0  non-token data: nham
>
> As stated here:
> search.cpan.org/dist/Mail-SpamAssassin/lib/Mail/SpamAssassin/BayesStore/Redis.pm
>
> "Expiry is done internally in Redis using *_ttl settings (...) This is
> why --force-expire etc does nothing, and token counts and atime values
> are shown as zero in statistics."
>
> So, why have the nham counts grown so much? It looks like they were
> never 'pruned'.

When redis automatically expires tokens internally based on their TTL, this operation does not affect the nspam and nham counts. These counts just grow all the time (as there is no explicit expiration that SpamAssassin would know about), reflecting the count of (auto)learning operations. Don't worry about large nspam and/or nham counts when redis is in use; all that matters is that these counts are above 200 (otherwise bayes is disabled).

You may get the number of tokens that are actually in the redis database (not expired) by counting the number of lines produced on stdout by 'sa-learn --backup' or 'sa-learn --dump data'. The format of fields produced by --dump data is:

  probability spam_count ham_count atime token

The --backup format is similar, but does not provide probabilities, just spam and ham counts. To get some estimate on the number of hammy vs. spammy tokens (not messages) currently in a database, try something like:

  sa-learn --dump data | \
    awk '$1<0.1 {h++}; $1>0.9 {s++}; END{printf("h=%d, s=%d\n",h,s)}'

(Caveat: sa-learn --backup or --dump data may not work on a huge database, as they need all the tokens (redis keys) to fit into memory.)

Mark
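Mark's awk one-liner can be tried out on canned input. A minimal sketch, with three invented --dump data lines standing in for real sa-learn output (fields: probability spam_count ham_count atime token; the token values here are made up):

```shell
# Count hammy (p < 0.1) vs. spammy (p > 0.9) tokens by the probability
# in field 1. In practice the input would come from 'sa-learn --dump data'
# instead of printf; these three sample tokens are for illustration only.
printf '%s\n' \
  '0.995         42          0 1425045346 deadbeef01' \
  '0.020          0         87 1425045346 deadbeef02' \
  '0.500          5          5 1425045346 deadbeef03' |
awk '$1 < 0.1 {h++}; $1 > 0.9 {s++}; END {printf("h=%d, s=%d\n", h, s)}'
# prints: h=1, s=1
```

The middle band (0.1 to 0.9) is deliberately ignored by both patterns, so neutral tokens such as the third line count toward neither total.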
Re: Bayes expiration with Redis backend
On 02/27/2015 01:38 PM, Matteo Dessalvi wrote:
> Hi all.
>
> I am using a centralized Redis instance to host the bayesian data for a
> bunch of MTAs. AFAICS the SA filter is working quite well and the
> BAYES_* rules are triggered correctly, no false positives so far.
>
> But I am concerned about the expiration of the bayesian data. sa-learn
> reports the following:
>
>   0.000          0          3          0  non-token data: bayes db version
>   0.000          0       8437          0  non-token data: nspam
>   0.000          0     495000          0  non-token data: nham
>
> As stated here:
> search.cpan.org/dist/Mail-SpamAssassin/lib/Mail/SpamAssassin/BayesStore/Redis.pm
>
> "Expiry is done internally in Redis using *_ttl settings (...) This is
> why --force-expire etc does nothing, and token counts and atime values
> are shown as zero in statistics."
>
> So, why have the nham counts grown so much? It looks like they were
> never 'pruned'.

Is it possible you reject so much spam that SA sees very little spam?

> I am using the following configuration for the expiration:
>
>   bayes_token_ttl 21d
>   bayes_seen_ttl 8d
>   bayes_auto_expire 1
>
> I have also left bayes_expiry_max_db_size undefined.

You can remove that entry; Redis doesn't use it.

> My other concern is about the proportion between spam and ham tokens.
> Should I be worried about it?

If classification makes sense, and you're not getting FPs or tons of spam with BAYES_00, you have nothing to worry about. Sometimes you have to trust your gut feeling... don't let wild bayes theories misguide you... watch your logs, listen to your users and build up a feel for it. Your traffic is NEVER similar to anybody else's.

If you want to manually train some missed spam, do it; there are lots of methods, but without knowing what your setup looks like it's guesswork. The list archives are full of valuable infos and howtos.

h2h
Axb
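The "above 200" threshold Mark mentioned can be checked mechanically against the nspam/nham lines that sa-learn prints. A sketch, using the two lines from this thread as canned input (in practice you would pipe the real sa-learn output instead of printf):

```shell
# Extract nspam/nham from 'sa-learn --dump magic'-style output (field 3)
# and report whether both exceed the 200-message minimum below which
# bayes stays disabled. Sample input copied from the numbers in this thread.
printf '%s\n' \
  '0.000          0       8437          0  non-token data: nspam' \
  '0.000          0     495000          0  non-token data: nham' |
awk '/non-token data: nspam/ {ns = $3}
     /non-token data: nham/  {nh = $3}
     END {
       if (ns > 200 && nh > 200) print "bayes active"
       else                      print "bayes disabled (needs >200 of each)"
     }'
# prints: bayes active
```

With nspam=8437 and nham=495000 both well above 200, the lopsided ratio is harmless, which matches Axb's advice above.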