On 12.12.2015 at 17:13, RW wrote:
> On Sat, 12 Dec 2015 13:29:40 +0100 Axb wrote:
>> On 12/12/2015 01:08 PM, Reindl Harald wrote:
>> I hate stale data... that's all
> But you do keep stale data in the retained tokens; what you are getting rid of is the contribution from old mails that's least likely to make a difference to any classifications. Expiry is about managing database size; if it were about expiring stale information it would be implemented differently.
correct
>>> practical reasons? it's a computer
>> performance... If I keep accessing X years of stale data my scanning times go through the roof.
> The time taken to look-up n tokens from a database containing m tokens shouldn't strongly depend on m. There's something wrong if it does.
correct

a message has a fixed number of tokens which are queried against the database by its primary key - it doesn't matter whether that database has 150 thousand or 2 million tokens. this is proven by the automated mass-test which passes every corpus message against spamd: there is no change in per-message performance, the run only takes longer because there are more messages to test

that's how databases work by design
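to illustrate the point, here is a minimal sketch using Python's built-in sqlite3 module. the table name, column names and token values are made up for the example and are not the real SpamAssassin Bayes schema; it only shows that a message triggers one primary-key probe per token and that the query plan is an index search, whatever the table size is

```python
import sqlite3


def make_db(n_tokens):
    """Build an in-memory token table with n_tokens rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bayes_token ("
        " token TEXT PRIMARY KEY,"
        " spam_count INTEGER NOT NULL,"
        " ham_count INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO bayes_token VALUES (?, ?, ?)",
        ((f"tok{i}", i % 7, i % 5) for i in range(n_tokens)),
    )
    conn.commit()
    return conn


def lookup(conn, message_tokens):
    """Run one primary-key probe per message token; untrained tokens are skipped."""
    found = {}
    for tok in message_tokens:
        row = conn.execute(
            "SELECT spam_count, ham_count FROM bayes_token WHERE token = ?",
            (tok,),
        ).fetchone()
        if row is not None:
            found[tok] = row
    return found


def query_plan(conn):
    """Return SQLite's plan description for the per-token lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT spam_count, ham_count FROM bayes_token WHERE token = ?",
        ("tok1",),
    ).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in rows)


if __name__ == "__main__":
    message = [f"tok{i}" for i in range(0, 500, 5)] + ["never-trained"]
    for size in (1_000, 50_000):
        db = make_db(size)
        print(size, len(lookup(db, message)), query_plan(db))
```

the number of probes is fixed by the message, and each probe walks the primary-key index instead of scanning the table, so the table size has only a weak (roughly logarithmic) influence on lookup time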
>>> financial reasons? if you mean performance
>> no... money.. If I see 15 million msgs/day and keep the Bayes data which those millions provided over a decade or more, I'd be in the TB amount of data... I couldn't really justify requesting servers with TBs of RAM. Accounting would put me in the looney house.
> The number of tokens depends on how many you train, not on how many you scan
correct, and to say it clearly: the need to train goes down when you don't lose data which re-appears two months later - seasonal data and so on
i only need to train around 15-20 ham messages each week which are not BAYES_00 and there is not that much more spam below the milter-reject score
what i currently do is train milter-rejects which are not BAYES_99 by passing them through spamc in the feed-script and ignoring anything which already has BAYES_99/999 - most likely i could even stop that now after a year of training *because* it catches practically everything
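the decision made in the feed-script can be sketched roughly like this. it is not the actual script: the function names are hypothetical, and it assumes the rule names are read from the "tests=" field of an X-Spam-Status style header as returned by the scan

```python
import re

# Rules that mean bayes already knows this kind of message well enough.
ALREADY_LEARNED = {"BAYES_99", "BAYES_999"}


def parse_tests(status_header):
    """Extract the rule names from the 'tests=' field of a status header.

    Assumes a comma-separated list that may be folded over several lines.
    """
    unfolded = re.sub(r",\s+", ",", status_header)  # undo header folding
    match = re.search(r"tests=(\S*)", unfolded)
    if not match:
        return set()
    # "none" is treated as "no rule hit" (assumption for this sketch).
    return {name for name in match.group(1).split(",") if name and name != "none"}


def should_train_as_spam(rejected_by_milter, tests):
    """Train only milter-rejects that bayes did not already flag as BAYES_99/999."""
    return bool(rejected_by_milter) and not (ALREADY_LEARNED & set(tests))


if __name__ == "__main__":
    header = (
        "Yes, score=9.1 required=5.0 tests=BAYES_50,\n"
        "\tHTML_MESSAGE,URIBL_BLACK autolearn=no"
    )
    print(should_train_as_spam(True, parse_tests(header)))
```

anything for which this returns true would then be handed to the training step, everything else is skipped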
so the real *need* for training has gone down to around 50 mails per week, and it doesn't matter if it's 1000, 10000 or 15 million msg/day - the number only increases with users, and the 75000 samples cover them all in a site-wide setup
what's the difference to a default setup:

* you need to invest time at the beginning
* it catches "new" campaigns from the first message on; they are in fact not really new, spam topics stay the same over the years
* it doesn't get false positives on seasonal ham because you did not lose the ham tokens from the last season

summary: you can score bayes much higher without false positives and so hit messages before the sending servers are on enough blacklists or URIBL hits them - finally: your overall system works much more accurately after you have paid the price of the initial feeding