OK - trying to make this really simple. Just talking about the concept now.
Let's say I get an email where the subject is "I have adenocarcinoma of
the lung".
Right off you know it's ham, because spammers never use the word
"adenocarcinoma" and normal people do. Spammers also never use:
"of the lung"
"the lung"
"adenocarcinoma of"
....
So - tell me you follow this so far. Spammers don't spam about
adenocarcinoma.
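To make the idea of "phrases" concrete, here is a rough sketch of pulling every short run of consecutive words out of a subject line. The function name `phrases` and the 3-word limit are my own assumptions for illustration, not a description of any particular filter.

```python
def phrases(text, max_len=3):
    """Return every run of 1..max_len consecutive words in text, lowercased.

    Hypothetical helper: the name and the max_len default are assumptions.
    """
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


subject = "I have adenocarcinoma of the lung"
for p in sorted(phrases(subject)):
    print(p)
```

The output includes "of the lung", "the lung" and "adenocarcinoma of" along with the single words, so each of those can be looked up on its own.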
In this case I'm identifying ham because in some previous email people
were talking about lung cancer and those phrases were learned as ham.
But what really makes it ham is not just that it matches previous ham,
but that it doesn't match previous spam.
A word like "Viagra", for example, would produce no score because it is in
both sets. However, "cheapest viagra online" would match spam and not
match ham, indicating it's spam.
The magic here is that this detects both spam and ham. And it is
especially good at detecting ham, which greatly reduces false positives.
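Here is a minimal sketch of the whole concept, assuming the learned phrases are kept in two plain sets and that a phrase only counts when it appears in exactly one of them. The class `PhraseFilter`, its method names, the 3-word phrase limit and the simple count-based score are all assumptions made for illustration; a real filter would likely weight and store things differently.

```python
def phrases(text, max_len=3):
    """Every run of 1..max_len consecutive words, lowercased (hypothetical helper)."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


class PhraseFilter:
    """Toy two-set filter: one set of phrases seen in ham, one seen in spam."""

    def __init__(self):
        self.ham = set()
        self.spam = set()

    def learn(self, text, is_spam):
        # Add every phrase in the message to the matching set.
        (self.spam if is_spam else self.ham).update(phrases(text))

    def score(self, text):
        # Phrases found in both sets cancel out and contribute nothing.
        seen = phrases(text)
        spam_only = seen & (self.spam - self.ham)
        ham_only = seen & (self.ham - self.spam)
        # Positive leans spam, negative leans ham, zero means no evidence.
        return len(spam_only) - len(ham_only)


f = PhraseFilter()
f.learn("we talked about adenocarcinoma of the lung", is_spam=False)
f.learn("my doctor prescribed viagra", is_spam=False)
f.learn("cheapest viagra online", is_spam=True)

print(f.score("I have adenocarcinoma of the lung"))  # negative: ham
print(f.score("viagra"))                             # zero: in both sets
print(f.score("cheapest viagra online"))             # positive: spam
```

The point of subtracting one set from the other before matching is that a ham verdict comes from phrases spammers have never used, not just from phrases that happen to have appeared in ham.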