On Sat, 21 Mar 2015 15:10:19 +0000 RW <rwmailli...@googlemail.com> wrote:
> The only token probabilities that can be skewed by token expiry are
> those that get expired and are then subsequently relearned.

Yup. But they might turn out to be important.

> Even then when those tokens are relearned the probabilities will end
> up more or less correct provided that the ham/spam ratio in
> subsequent training is similar to the overall ratio in the database.

So I discovered something... I did the math, and it turns out that if
you expire a token and then start relearning it, the relearned
probability for that token works out *exactly* the same as if you had
started your Bayes training immediately after expiry. The different
total message counts cancel out.

The net effect is that your time window for token probabilities depends
on how often you see tokens. Often-seen tokens represent training going
way back, while seldom-seen tokens represent only a recent training
window.

This suggests an obvious strategy for spammers: don't use weird text
for your Bayes poison. Use very commonly-seen tokens that are likely to
have been trained over a very long time; Bayes is much less likely to
quickly change its opinion of those tokens.

I don't know how the token-dependent training windows affect accuracy.
But here's a nice topic for a term project for a math student. :)

Regards,

David.
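P.S. For anyone who wants to check the cancellation themselves, here is a quick numerical sketch. It uses a Graham-style per-token probability as a simplified model (the counts and the helper function are illustrative, not SpamAssassin's actual implementation): as long as the spam:ham ratio of the totals is the same, the totals cancel and the relearned token gets the same probability in an old database as in one started fresh at expiry.

```python
# Simplified Graham-style per-token spam probability:
#     p(token) = (s / Ns) / (s / Ns + h / Nh)
# where s, h are the token's spam/ham counts and Ns, Nh are the
# total numbers of spam/ham messages trained. Hypothetical numbers.

def token_prob(s, ns, h, nh):
    """Per-token spam probability from token and message counts."""
    return (s / ns) / (s / ns + h / nh)

# A token is expired, then relearned from 30 spams and 10 hams.
s, h = 30, 10

# Case 1: long-running database; totals include pre-expiry training.
p_old_db = token_prob(s, 50_000, h, 25_000)    # 2:1 spam:ham

# Case 2: database started fresh at the moment of expiry,
# trained with the same 2:1 spam:ham ratio.
p_fresh_db = token_prob(s, 2_000, h, 1_000)

# The different totals cancel; both come out the same.
print(p_old_db, p_fresh_db)
```

If the post-expiry ham/spam ratio drifts from the overall ratio, the two values diverge, which is exactly the proviso in RW's message.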