Matteo Dessalvi wrote:
I am using a centralized Redis instance to
host the bayesian data for a bunch of MTAs.

AFAICS the SA filter is working quite well
and the BAYES_* rules are triggered correctly,
no false positives so far.

But I am concerned about the expiration of the
bayesian data. sa-learn reports the following:

0.000          0          3          0  non-token data: bayes db version
0.000          0       8437          0  non-token data: nspam
0.000          0     495000          0  non-token data: nham

As stated here:

search.cpan.org/dist/Mail-SpamAssassin/lib/Mail/SpamAssassin/BayesStore/Redis.pm

"Expiry is done internally in Redis using *_ttl settings (...)
This is why --force-expire etc does nothing, and token counts
and atime values are shown as zero in statistics."

So why has the nham count grown so much? It looks
like it was never 'pruned'.

When Redis automatically expires tokens internally based on their
TTL, this operation does not affect the nspam and nham counts. These
counts just grow all the time (as there is no explicit expiration
that SpamAssassin would know about), reflecting the total number of
(auto)learning operations.
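
For reference, the TTL-based expiry is driven by the Bayes/Redis
settings in local.cf; a minimal sketch, with illustrative values
and an assumed local Redis instance (adjust the DSN and TTLs to
your setup):

  # hypothetical local.cf excerpt, values for illustration only
  bayes_store_module  Mail::SpamAssassin::BayesStore::Redis
  bayes_sql_dsn       server=127.0.0.1:6379;database=2
  bayes_token_ttl     21d
  bayes_seen_ttl      8d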

Don't worry about large nspam and/or nham counts when Redis is
in use; all that matters is that these counts stay above 200
(otherwise Bayes is disabled).
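
The 200 threshold comes from the bayes_min_spam_num and
bayes_min_ham_num settings; if I recall correctly, these are
the defaults:

  bayes_min_spam_num  200
  bayes_min_ham_num   200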

You may get the number of tokens that are actually in the Redis
database (not expired) by counting the number of lines produced
on stdout by 'sa-learn --backup' or 'sa-learn --dump data'.
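
A quick way to do that count, skipping the few 'non-token data'
header lines shown above:

  # rough count of non-expired tokens; header lines excluded
  sa-learn --dump data | grep -vc 'non-token'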

The format of fields produced by --dump data is:

  probability  spam_count  ham_count  atime  token

The --backup format is similar, but does not provide
probabilities, just spam and ham counts.

To get some estimate on the number of hammy vs. spammy tokens
(not messages) currently in a database, try something like:

  sa-learn --dump data | \
    awk '$1<0.1 {h++}; $1>0.9 {s++}; END{printf("h=%d, s=%d\n",h,s)}'


(caveat: sa-learn --backup or --dump data may not work on a huge
database, as they need all the tokens (Redis keys) to fit into memory)


  Mark
