Interesting, but what happens when someone like me gets 250+ spam a day and only about ten legitimate emails? That is not counting the account my mailing lists go to, where I get far better Bayes performance (a 1:100 spam/ham ratio instead of 10:1 or lower on my other accounts). With autotraining turned on, that means far more spam than ham gets trained. Even if I turned autotraining off and trained only the ham that came through, changes in spam would gradually begin to defeat the Bayes filter, is that not so? Doesn't that mean the expiration system that SA employs solves that problem?
Tom

Thomas Arend wrote:

Hello,

A lot of questions on this list are about the spam : ham ratio to train with and how many mails should be trained. One myth that keeps coming up is the 1 : 1 ratio.

I read an article that gave 1 : 1 as the best ratio; it was found by an experiment and later derived from Bayes' theorem. Unfortunately I didn't keep a copy of the article and can't remember enough to find it by googling.

The problem is that the article's conclusion was wrong.

What I will try to show in the next steps, which unfortunately require a little algebra, is: train the Bayes filter in accordance with your real spam : ham ratio and train as much as possible. But never train too little ham or train only spam!

Here is my argument.

In short, Bayes' theorem says

P(Spam|Token) = P(Token|Spam)*P(Spam)/P(Token)

That means: the probability of a message being spam, given that a token is in the message, equals the probability of the token appearing in a spam message multiplied by the probability of a message being spam, divided by the probability of any message containing the token.

So if you have received s spam messages and h ham messages where the token is in S spams and in H ham messages then you get:

s = number of spam messages
h = number of ham messages
S = number of spam messages containing the token
H = number of ham messages containing the token
s+h = number of messages
S+H = number of messages containing the token

Therefore

P(Spam) = s/(s+h)

is an approximation of the probability of a random message being spam.

Likewise:

        P(Token) = (S+H)/(s+h)
        P(Token|Spam) = S/s

which leads to

        P(Spam|Token) = S/s * s/(s+h) / ((S+H)/(s+h))
                      = S / (S+H)

That means that the probability of a given message being spam when it contains a token is independent of the total number of messages trained (s and h cancel out); it depends only on the token counts S and H.
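
The cancellation can be checked numerically. The following is a minimal Python sketch, not part of the original argument; the function name `spam_given_token` and the sample counts are illustrative assumptions. It evaluates Bayes' theorem term by term with exact fractions and compares the result with S / (S+H).

```python
from fractions import Fraction


def spam_given_token(S, H, s, h):
    """P(Spam|Token) computed term by term from Bayes' theorem.

    S, H: trained spam / ham messages containing the token
    s, h: trained spam / ham messages in total
    """
    p_spam = Fraction(s, s + h)            # P(Spam)
    p_token = Fraction(S + H, s + h)       # P(Token)
    p_token_given_spam = Fraction(S, s)    # P(Token|Spam)
    return p_token_given_spam * p_spam / p_token


# Illustrative counts (assumed, not taken from any real corpus).
full = spam_given_token(S=30, H=10, s=200, h=400)
shortcut = Fraction(30, 30 + 10)           # S / (S+H)
print(full, shortcut, full == shortcut)
```

Changing s and h while keeping S and H fixed leaves the result unchanged, which is the point of the derivation.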

Let's say your real spam : ham ratio is 10 to 1 and your message corpus contains 1100 messages. 100 spam and 50 ham messages contain a certain token, let's say "[EMAIL PROTECTED]@".

Total Messages: 1100
Spam (trained): 1000
Ham: 100
[EMAIL PROTECTED]@: in 100 spam and 50 ham

If you train all messages you get a probability of 100 / (100+50) = 66.7% that the next message containing the token is spam. That isn't a high probability, but it works fine for this example.

If you train only 10% of your spam to get a spam : ham ratio of 1:1, you will presumably count only 10 spam messages with the token.

Spam (trained): 100
Ham: 100
[EMAIL PROTECTED]@: in 10 (=10% of 100) spam and 50 ham

Which leads to a spam probability of only 10 / (10+50) = 16.7%, which is a little low.
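
The two cases above can be reproduced with a small Python sketch. It assumes, as the text does, that training only a fraction of a class scales that class's token count by the same fraction; `token_spam_probability` and its parameters are hypothetical names used for illustration only.

```python
def token_spam_probability(spam_with_token, ham_with_token,
                           spam_fraction=1.0, ham_fraction=1.0):
    """Estimate S / (S+H) when only a fraction of each class is trained.

    Assumption: the token count of a class shrinks in proportion to the
    fraction of that class that is trained.
    """
    S = spam_with_token * spam_fraction
    H = ham_with_token * ham_fraction
    return S / (S + H)


# Real ratio 10:1, token in 100 of 1000 spam and in 50 of 100 ham.
all_trained = token_spam_probability(100, 50)
one_to_one = token_spam_probability(100, 50, spam_fraction=0.1)

print(f"all messages trained: {all_trained:.1%}")
print(f"spam cut to 1:1:      {one_to_one:.1%}")
```

The same function can be called with `ham_fraction` below 1 to explore what happens when ham is undertrained instead.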

What happens when you train less ham?

Let's assume you train only 50% of your ham but all your spam. You will presumably count only 25 ham messages with the token.

Spam (trained): 1000
Ham (50% trained): 50
[EMAIL PROTECTED]@: in 100 spam and 25 (= 50% of 50) ham

Which leads to a spam probability of 100 /(100+25) = 80%.

What happens when your ham : spam ratio is 10 to 1?

Ham = 1000
Spam = 100
[EMAIL PROTECTED]@: in 100 ham and 50 spam

=> 50 / (50+100) = 33.3%

Ham (10% trained) = 100
Spam = 100
[EMAIL PROTECTED]@: in 10 (=10% of 100) ham and 50 spam

=> 50 / (50+10) = 83.3%
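
To see how quickly the estimate drifts in this ham-heavy case, here is a hedged Python sketch that sweeps the trained share of ham. The intermediate percentages (50 and 25) are added for illustration and are not in the example above; the proportional-scaling assumption is the same one the text uses.

```python
def spam_probability(S, H):
    """Token spam probability from trained counts: S / (S+H)."""
    return S / (S + H)


# Ham-heavy mailbox from the example: token in 100 of 1000 ham
# and in 50 of 100 spam. All spam is trained; ham is subsampled.
SPAM_WITH_TOKEN = 50
HAM_WITH_TOKEN = 100

results = {}
for ham_percent in (100, 50, 25, 10):
    # Assumption: token counts scale with the trained share of ham.
    H = HAM_WITH_TOKEN * ham_percent / 100
    results[ham_percent] = spam_probability(SPAM_WITH_TOKEN, H)
    print(f"ham trained {ham_percent:3d}%: "
          f"P(Spam|Token) = {results[ham_percent]:.1%}")
```

The first and last rows correspond to the 33.3% and 83.3% figures above; the rows in between show the probability rising steadily as less ham is trained.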

OOps!!!

So if you train too little spam you will get a higher False Negative rate; if you train too little ham you will get a higher False Positive rate.

Because a False Positive is more harmful than a False Negative, my conclusion is:
train in accordance with your real spam : ham ratio, train as much as possible (= train all messages), but never train too little ham or train only spam!

(BTW: The risk of False Positives is the reason why Paul Graham multiplied his token counts for ham by 2)
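
As a rough illustration of what such a weighting does, the Python sketch below applies a ham weight of 2 to the simplified estimate S / (S+H) used in this message. This is an assumption made for illustration: it is not Graham's complete formula, and `weighted_spam_probability` is a hypothetical name.

```python
def weighted_spam_probability(S, H, ham_weight=2):
    """S / (S + ham_weight*H): each ham occurrence counts ham_weight times."""
    return S / (S + ham_weight * H)


# Token seen in 100 trained spam and 50 trained ham messages.
unweighted = weighted_spam_probability(100, 50, ham_weight=1)
doubled = weighted_spam_probability(100, 50)

print(f"ham weight 1: {unweighted:.1%}")
print(f"ham weight 2: {doubled:.1%}")
```

Doubling the ham count pulls the estimate toward ham, so a token needs stronger spam evidence before it pushes a message over the threshold.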

Another lesson should be: Never train whitelisted mails as ham!!!


Best regards


Thomas Arend

PS: I hope I made no mistakes.

