On 12.12.2015 at 17:13, RW wrote:
On Sat, 12 Dec 2015 13:29:40 +0100
Axb wrote:

On 12/12/2015 01:08 PM, Reindl Harald wrote:

I hate stale data... that's all

But you do keep stale data in the retained tokens, what you are getting
rid of is the contribution from old mails that's least likely to make a
difference to any classifications.  Expiry is about managing database
size; if it were about expiring stale information it would be
implemented differently.

correct

practical reasons?
it's a computer
performance... If I keep accessing X years of stale data, my scanning
times go through the roof.

The time taken to look up n tokens from a database containing m tokens
shouldn't strongly depend on m. There's something wrong if it does.

correct

a message has a fixed number of tokens, which are queried against the database via its primary key - it doesn't matter whether that database holds 150 thousand or 2 million tokens. That is proven by the automated mass-test passing every corpus message against spamd: per-message performance doesn't change, the run only takes longer because of the number of messages to test

that's how databases work by design
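to illustrate the point, here is a minimal sketch in Python - the schema is invented for illustration and is not SpamAssassin's actual Bayes schema:

# toy_lookup.py - sketch: per-message lookup cost vs. table size
# (invented schema, for illustration only)
import random
import sqlite3
import time

def build(rows):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE tokens (token TEXT PRIMARY KEY, nspam INT, nham INT)")
    db.executemany("INSERT INTO tokens VALUES (?, ?, ?)",
                   ((f"tok{i}", i % 7, i % 5) for i in range(rows)))
    return db

def scan_one_message(db, rows, ntokens=150):
    # one message contributes a fixed number of tokens,
    # each looked up by primary key
    wanted = [f"tok{random.randrange(rows)}" for _ in range(ntokens)]
    start = time.perf_counter()
    for tok in wanted:
        db.execute("SELECT nspam, nham FROM tokens WHERE token = ?",
                   (tok,)).fetchone()
    return time.perf_counter() - start

for rows in (150_000, 2_000_000):
    db = build(rows)
    print(rows, "tokens ->", scan_one_message(db, rows))

both runs take near-identical time for the lookups themselves: a B-tree primary-key lookup is O(log m), so growing m from 150 thousand to 2 million rows barely registers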

financial reasons?
if you mean performance

no... money... If I see 15 million msgs/day and keep the Bayes data
which those millions provided over a decade or more, I'd be in the TB
range of data... I couldn't really justify requesting servers with
TBs of RAM. Accounting would put me in the loony house.

The number of tokens depends on how many you train, not on how many you
scan

correct, and to say it clearly:

the need to train goes down when you don't lose data that will simply re-appear two months later - seasonal data and so on

I only need to train around 15-20 ham messages each week which are not already BAYES_00, and there is not much more spam below the milter-reject score

what I currently do is train milter-rejects which are not BAYES_99 by passing them through spamc in the feed-script, ignoring anything that already hits BAYES_99/999 (a sketch follows below) - most likely I could even stop that now, after a year of training, *because* it catches practically everything

so the real *need* for training has gone down to around 50 mails per week, and it doesn't matter whether you see 1000, 10000 or 15 million msgs/day - the need only increases with the number of users, and the 75000 samples cover them all in a site-wide setup
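a rough sketch of such a feed-script in Python - the reject-spool path is my assumption for illustration, the real script may look different:

#!/usr/bin/env python3
# feed.py - sketch: train milter-rejects as spam unless Bayes
# already scores them BAYES_99/BAYES_999
import subprocess
from pathlib import Path

REJECT_SPOOL = Path("/var/spool/milter-rejects")  # hypothetical location

def bayes_already_catches(msg: bytes) -> bool:
    # spamc -R prints the full report including the names of hit rules
    report = subprocess.run(["spamc", "-R"], input=msg,
                            capture_output=True).stdout
    return b"BAYES_99" in report  # substring also matches BAYES_999

for eml in REJECT_SPOOL.glob("*.eml"):
    msg = eml.read_bytes()
    if bayes_already_catches(msg):
        continue  # Bayes already catches it, nothing to learn
    # feed the message to the Bayes store as spam
    subprocess.run(["sa-learn", "--spam"], input=msg, check=True)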

what's the difference compared to a default setup:

* you need to invest time at the beginning
* it catches "new" campaigns from the first message on,
  which are in fact not really new - spam topics stay
  the same over the years
* it doesn't produce false positives on seasonal ham because
  you did not lose the ham tokens from the last season

summary: you can score Bayes much higher without false positives and so hit messages before the sending servers are on enough blacklists or URIBL hits them - finally, your overall system works much more accurately after you have paid the price of the initial feeding
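for illustration only - the values below are hypothetical, not a recommendation, and only safe with a well-trained Bayes - scoring Bayes higher is a one-line override per rule in local.cf:

# local.cf - hypothetical scores, tune to your own false-positive tolerance
score BAYES_99  4.5
score BAYES_999 5.0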
