On Sat, 21 Mar 2015 13:13:13 -0400
David F. Skoll wrote:

> On Sat, 21 Mar 2015 15:10:19 +0000
> RW <rwmailli...@googlemail.com> wrote:
>
> > The only token probabilities that can be skewed by token expiry are
> > those that get expired and are then subsequently relearned.
>
> Yup. But they might turn out to be important.
>
> > Even then when those tokens are relearned the probabilities will end
> > up more or less correct provided that the ham/spam ratio in
> > subsequent training is similar to the overall ratio in the database.
>
> So I discovered something... I did the math and it turns out that if
> you expire a token and then start relearning it, the relearned
> probability for that token works out *exactly* the same as if you'd
> only ever started your Bayes learning immediately after expiry. The
> different total message counts cancel out.

I think that's a special case. If a token expires when there are Nh1
hams and Ns1 spams in the database, and its probability is computed
when the database contains Nh2 hams and Ns2 spams, Bayes will use Nh2
and Ns2 in the calculation when it should really be using Nh2-Nh1 and
Ns2-Ns1, i.e. only the messages learned since the expiry.

The two calculations produce the same result when

    Ns2/Nh2 = (Ns2-Ns1)/(Nh2-Nh1)

i.e. if spam and ham are being added in the same ratio in which they
already occur in the database.

> This suggests an obvious strategy for spammers:

<rest of message redacted>
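
The ratio condition above can be checked numerically. What follows is only a rough sketch: it assumes the simplified per-token estimate p = (s/Ns) / (s/Ns + h/Nh), where s and h are the spam and ham hits for the token since it was relearned. That formula, the function name token_prob and all the counts are illustrative assumptions, not SpamAssassin's actual code or data.

```python
from fractions import Fraction

def token_prob(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified per-token spam probability (assumed form, not SpamAssassin's code)."""
    s = Fraction(spam_hits, n_spam)  # fraction of spam containing the token
    h = Fraction(ham_hits, n_ham)    # fraction of ham containing the token
    return s / (s + h)

# Hypothetical counts: the token expired at (Ns1, Nh1) and has since been
# relearned from 10 spams and 5 hams.
Ns1, Nh1 = 1000, 1000
s, h = 10, 5

# Case 1: mail keeps arriving in the database's existing 1:1 ratio.
Ns2, Nh2 = 2000, 2000
used_same = token_prob(s, h, Ns2, Nh2)                # what Bayes computes
ideal_same = token_prob(s, h, Ns2 - Ns1, Nh2 - Nh1)   # counts since expiry only
print(used_same, ideal_same)   # 2/3 2/3

# Case 2: after expiry the mix shifts to 3000 new spams and 1000 new hams.
Ns2, Nh2 = 4000, 2000
used_diff = token_prob(s, h, Ns2, Nh2)
ideal_diff = token_prob(s, h, Ns2 - Ns1, Nh2 - Nh1)
print(used_diff, ideal_diff)   # 1/2 2/5
```

Under this assumed formula the estimate depends on the totals only through Ns/Nh, which is why the two calculations agree exactly when the post-expiry ratio matches the overall ratio and drift apart otherwise.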