Hello,

a lot of questions on this list are about the spam : ham ratio to be trained 
and how many mails should be trained. One myth that keeps coming up is the 
1 : 1 ratio.

I read an article claiming that the best ratio is 1 : 1. It was found in a 
test and later derived from Bayes' theorem. Unfortunately I didn't keep a 
copy of this article and can't remember enough to find it by googling.

The problem is that the conclusion of the article was wrong.

What I will try to show in the next steps - which unfortunately require a 
little bit of algebra - is: Train the Bayes filter in accordance with your 
real spam : ham ratio and train as much as possible. But never train too 
little ham or train only spam!

Here is my argument:

In short, Bayes' theorem says 

P(Spam|Token) = P(Token|Spam)*P(Spam)/P(Token) 

That means: the probability of a message being spam under the condition that 
a token is in the message 
is equal to 
the probability of the token being contained in a spam message 
multiplied by 
the probability of a message being spam 
divided by the probability of any message containing the token.

So if you have received s spam messages and h ham messages, and the token 
occurs in S of the spam messages and in H of the ham messages, then you get: 

s = number of spam messages
h = number of ham messages
S = number of spam messages containing the token
H = number of ham messages containing the token
s+h = number of messages
S+H = number of messages containing the token

Therefore 

        P(Spam) = s/(s+h) 

is an approximation of the probability of a random message being spam.

And likewise:
        P(Token) = (S+H)/(s+h)
        P(Token|Spam) = S/s


That leads to

        P(Spam|Token) = (S/s) * (s/(s+h)) / ((S+H)/(s+h))
                      = S / (S+H)

That means that the probability of a given message being spam when it 
contains a token does not depend on the total number of messages trained: 
s and h cancel out, and only the raw counts S and H remain. 
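
The cancellation can be checked numerically. The following is only a minimal 
sketch: the function name and the counts are made up for illustration, and 
exact fractions are used so that no rounding gets in the way.

```python
from fractions import Fraction

def p_spam_given_token(s, h, S, H):
    """Bayes' theorem written out with the counts s, h, S, H from above."""
    p_spam = Fraction(s, s + h)          # P(Spam)
    p_token = Fraction(S + H, s + h)     # P(Token)
    p_token_given_spam = Fraction(S, s)  # P(Token|Spam)
    return p_token_given_spam * p_spam / p_token

# Illustrative counts only, not taken from any real mailbox.
long_form = p_spam_given_token(s=400, h=600, S=30, H=10)
short_form = Fraction(30, 30 + 10)       # S / (S+H)
print(long_form, short_form)             # both print as 3/4
```

Changing s and h while keeping S and H fixed should leave the result 
untouched, which is exactly the point of the derivation.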

Let's say your real spam : ham ratio is 10 to 1 and your message corpus 
contains 1100 messages. Assume 100 spam messages and 50 ham messages contain 
a certain token, let's say "[EMAIL PROTECTED]@". 

Total Messages: 1100
Spam (trained): 1000
Ham: 100
[EMAIL PROTECTED]@: in 100 spam and 50 ham 

If you train all messages you will get a probability of 100 / (100+50) = 66.7% 
that the next message containing the token is spam. That isn't a high 
probability, but it works fine for this example.

If you train only 10% of your spam to get a spam : ham ratio of 1:1 you will 
presumably count only 10 spam messages with the token. 

Spam (trained): 100
Ham: 100
[EMAIL PROTECTED]@: in 10 (=10% of 100) spam and 50 ham 

This leads to a spam probability of only 10 / (10+50) = 16.7%, 
which is rather low.
 
What happens when you train less ham? 

Let's assume you train only 50% of your ham but all your spam. You will 
presumably count only 25 ham messages with the token. 

Spam (trained): 1000
Ham (50% trained): 50
[EMAIL PROTECTED]@: in 100 spam and 25 (= 50% of 50) ham 

This leads to a spam probability of 100 / (100+25) = 80%.
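
The three cases above can be recomputed in a few lines. This is a rough 
sketch, not a real filter: the function `estimate` and its parameters 
`spam_trained` and `ham_trained` are hypothetical names, and it assumes the 
token is spread evenly, so that training a fraction of a class shows the same 
fraction of the messages containing the token.

```python
from fractions import Fraction

def estimate(S, H, spam_trained=Fraction(1), ham_trained=Fraction(1)):
    """S / (S+H), using the counts expected after partial training.

    S, H: messages containing the token in the complete mailbox.
    spam_trained, ham_trained: fraction of each class that is trained.
    """
    S_seen = S * spam_trained
    H_seen = H * ham_trained
    return S_seen / (S_seen + H_seen)

scenarios = {
    "train everything": estimate(100, 50),
    "train 10% of spam": estimate(100, 50, spam_trained=Fraction(1, 10)),
    "train 50% of ham": estimate(100, 50, ham_trained=Fraction(1, 2)),
}
for name, p in scenarios.items():
    print(f"{name:18s} {float(p):.1%}")
```

The exact values are 2/3, 1/6 and 4/5, matching the percentages above.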

What happens when your ham : spam ratio is 10 to 1?

Ham = 1000
Spam = 100
[EMAIL PROTECTED]@: in 100 ham and 50 spam 

=> 50 / (50+100) = 33.3%

Ham (10% trained) = 100
Spam = 100
[EMAIL PROTECTED]@: in 10 (=10% of 100) ham and 50 spam 

=> 50 / (50+10) = 83.3%

OOps!!!

So if you train too little spam you will get a higher False Negative rate, and 
if you train too little ham you will get a higher False Positive rate.
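
To see how quickly undertraining ham pushes the estimate up, here is a small 
sweep over the ham-heavy example (50 spam and 100 ham messages with the 
token). It is only a sketch: the constant and function names are mine, and it 
assumes the trained ham is a representative sample of all ham.

```python
from fractions import Fraction

SPAM_WITH_TOKEN = 50    # from the ham-heavy example above
HAM_WITH_TOKEN = 100

def estimate(ham_fraction):
    """Estimated P(Spam|Token) when only ham_fraction of the ham is trained."""
    ham_seen = HAM_WITH_TOKEN * ham_fraction
    return SPAM_WITH_TOKEN / (SPAM_WITH_TOKEN + ham_seen)

for percent in (100, 75, 50, 25, 10):
    p = estimate(Fraction(percent, 100))
    print(f"ham trained {percent:3d}% -> P(Spam|Token) = {float(p):.1%}")
```

In this example the estimate reaches 50% once only half of the ham is 
trained, and it keeps rising from there, although the token is really twice 
as common in ham as in spam.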

Because a False Positive is more harmful than a False Negative my conclusion 
is:
        train in accordance with your real spam : ham ratio, train as much as 
        possible (= train all messages), but never train too little ham or 
        train only spam!

(BTW: The risk of False Positives is the reason why Paul Graham multiplied 
his token counts for ham by 2)
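
As a rough illustration of what such a bias does, the sketch below applies a 
ham weight of 2 to the simplified estimate S / (S+H) from this mail. This is 
not Graham's complete formula, and the function and parameter names are 
assumptions of mine.

```python
from fractions import Fraction

def biased_estimate(S, H, ham_weight=2):
    """S / (S + ham_weight*H): ham counts weigh more, so fewer False Positives."""
    return Fraction(S, S + ham_weight * H)

# Counts taken from the examples in this mail.
print(float(biased_estimate(100, 50)))  # all messages trained
print(float(biased_estimate(50, 10)))   # ham-heavy mailbox, 10% of ham trained
```

With the weight the two estimates become 1/2 and 5/7 instead of 2/3 and 5/6, 
so the bias softens the effect of undertrained ham but does not remove it.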

Another lesson should be: Never train whitelisted mails as ham!!!

 
Best regards 


Thomas Arend

PS: I hope I made no mistakes.
-- 
icq:133073900
http://www.t-arend.de
