On Sat, 21 Mar 2015 15:10:19 +0000
RW <rwmailli...@googlemail.com> wrote:

> The only token probabilities that can be skewed by token expiry are
> those that get expired and are then subsequently relearned.

Yup.  But they might turn out to be important.

> Even then when those tokens are relearned the probabilities will end
> up more or less correct provided that the ham/spam ratio in
> subsequent training is similar to the overall ratio in the database.

So I discovered something... I did the math, and it turns out that if
you expire a token and then start relearning it, the relearned
probability for that token works out *exactly* the same as if you had
started your Bayes training from scratch immediately after the expiry
(assuming the overall ham/spam ratio stays the same, as you note
above).  The different total message counts cancel out.
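The cancellation is easy to check numerically.  Here's a minimal
sketch using the Graham-style per-token estimate p = (s/S) / (s/S + h/H),
where s/h are the token's spam/ham hit counts and S/H the total
spam/ham messages learned; the function name and the example counts
are mine, not SpamAssassin's, but the ratio argument is the same:

```python
import math

def token_prob(spam_hits, ham_hits, nspam, nham):
    # Graham-style per-token spam probability (the form SpamAssassin's
    # Bayes estimate is built on, before chi-squared combining):
    #     p = (s/S) / (s/S + h/H)
    ps = spam_hits / nspam
    ph = ham_hits / nham
    return ps / (ps + ph)

# A token expired and then relearned: 3 spam hits and 1 ham hit,
# all accumulated after the expiry.  Compare using the totals since
# the very beginning (1000 spam / 2000 ham) against totals counted
# only from the expiry onward (100 spam / 200 ham) -- same 1:2 ratio.
p_full  = token_prob(3, 1, 1000, 2000)
p_fresh = token_prob(3, 1, 100, 200)
print(math.isclose(p_full, p_fresh))  # True: the totals cancel out
```

As long as the spam:ham ratio of the totals matches, scaling S and H
by any common factor divides out of both terms, so only the token's
own post-expiry counts matter.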

The net effect is that your time window for token probabilities depends
on how often you see each token.  Often-seen tokens represent training
going way back, while seldom-seen tokens represent only a recent
training window.
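A toy illustration of that token-dependent window (this is a
simplified stand-in for atime-based expiry, not SpamAssassin's actual
expiry code: I assume one expiry pass that drops any token not seen
within the last `window` days):

```python
def reflected_sightings(sightings, expire_at, window, end):
    # Simplified expiry model: one expiry pass runs at day `expire_at`
    # and drops any token not seen in the preceding `window` days.
    # A surviving token keeps its whole count history; a dropped one
    # is rebuilt only from sightings after the expiry pass.
    survived = any(expire_at - window <= d <= expire_at for d in sightings)
    if survived:
        return [d for d in sightings if d <= end]
    return [d for d in sightings if expire_at < d <= end]

common = list(range(1, 101))   # token seen every day for 100 days
rare = [10, 80]                # token seen only on days 10 and 80

print(len(reflected_sightings(common, 60, 14, 100)))  # 100: full history
print(len(reflected_sightings(rare, 60, 14, 100)))    # 1: only day 80
```

The frequent token's counts span the entire 100 days, while the rare
token's counts start over at the expiry: its probability reflects only
the recent window.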

This suggests an obvious strategy for spammers: Don't use weird text for
your Bayes poison.  Use very commonly-seen tokens that are likely to have
been trained over a very long time; Bayes is much less likely to quickly
change its opinion of those tokens.

I don't know how the token-dependent training windows affect accuracy.
But here's a nice topic for a term project for a math student. :)

Regards,

David.
