On 08/17/16 03:51, Antony Stone wrote:
On Wednesday 17 August 2016 at 05:06:50, Marc Perkel wrote:
What I'm doing is looking for fingerprints in an email that match HAM
and do not match SPAM - which would be a HAM result.
If it matches SPAM and does NOT match HAM - then it's SPAM.
The magic is in the NOT matching on the other side.
So if I say to you, "Let's get some lunch" that's ham because spammers
never say that, but normal people do. So the way to test what "spammers
never say" is to store what they do say and see if it's NOT in the list.
(Thus the infinite set)
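As a rough illustration of the rule above, here is a minimal Python sketch. The function name, the use of plain in-memory sets, and the sample phrases are all assumptions made for the example; the thread does not describe how the real filter stores or looks up its fingerprints.

```python
# Hypothetical sketch: names and sample data are illustrative only.
def classify_fingerprint(fp, ham_seen, spam_seen):
    """Return 'ham', 'spam', or None for a single fingerprint."""
    in_ham = fp in ham_seen
    in_spam = fp in spam_seen
    if in_ham and not in_spam:
        return "ham"   # normal people say it, spammers never do
    if in_spam and not in_ham:
        return "spam"  # spammers say it, normal people never do
    return None        # seen on both sides or on neither: no evidence


# Everything ever seen in known-good and known-bad mail (toy data).
ham_seen = {"let's get some lunch", "see you tomorrow", "click here"}
spam_seen = {"click here", "cheap meds online"}

print(classify_fingerprint("let's get some lunch", ham_seen, spam_seen))
print(classify_fingerprint("cheap meds online", ham_seen, spam_seen))
print(classify_fingerprint("click here", ham_seen, spam_seen))
```

The point of the sketch is the "not on the other side" test: a fingerprint only counts when it is present in one set and absent from the other.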
What length are the tokens you store in the list? Single words (so the above
lunch example would contain 4 tokens)? Entire phrases (so the above would be
just 1 token)? Also how do you deal with spam which contains random cuttings
from legitimate texts (generally along with a graphic attachment and/or a URL
to get across the "real" message)?
I tokenize a lot of different things, but the fingerprints are at most 3
to 4 tokens long. If you go longer than that, the database gets too big.
In the body I only look at the first 50 words, plus a "concept parser"
that looks at the whole body.
http://wiki.junkemailfilter.com/index.php/Concept_Parsing_Spam_Filter
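To make the size limits concrete, here is a small Python sketch that builds fingerprints of 1 to 4 consecutive words from the first 50 words of a body. The regex tokenizer, the lowercasing, and the choice of contiguous word n-grams are assumptions for illustration; the real filter tokenizes many other things and its exact fingerprint format is not given here.

```python
import re


# Hypothetical sketch: the tokenizer and fingerprint format are assumed.
def fingerprints(body, max_words=50, max_len=4):
    """Build word n-grams (1..max_len) from the first max_words words."""
    words = re.findall(r"[a-z0-9']+", body.lower())[:max_words]
    result = set()
    for n in range(1, max_len + 1):
        # Slide a window of n words across the truncated word list.
        for i in range(len(words) - n + 1):
            result.add(" ".join(words[i:i + n]))
    return result


fps = fingerprints("Let's get some lunch")
print(len(fps))        # 10 fingerprints: 4 + 3 + 2 + 1
print(sorted(fps))
```

Capping both the window length and the number of words keeps the count of fingerprints per message small, which is what keeps the database from growing too large.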
Similarly, there are only so many ways to misspell viagra, and good email
wouldn't have it spelled wrong.
Does this mean that people with bad spelling will more likely get classified as
spam, because they do not match the 'ham' group very well?
No - unless they misspell a lot of words the same way spammers misspell
them. If normal people misspell a word a certain way and spammers don't,
it can count as ham - or be ignored.
Also, what happens to mail that contains lots of tokens which match neither set
(for example, perfectly legitimate email which happens to be in a language the
system hasn't been trained with)?
Mail that doesn't match either side produces no score.
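A minimal Python sketch of that behaviour at the message level follows. The +1/-1 weighting, the function name, and the sample data are assumptions for the example; the thread does not say how the real filter weights or combines matches.

```python
# Hypothetical sketch: the scoring weights are assumed, not the filter's.
def score_message(fps, ham_seen, spam_seen):
    """Return spam hits minus ham hits, or None if nothing matched."""
    ham_hits = sum(1 for fp in fps if fp in ham_seen and fp not in spam_seen)
    spam_hits = sum(1 for fp in fps if fp in spam_seen and fp not in ham_seen)
    if ham_hits == 0 and spam_hits == 0:
        return None  # matches neither side exclusively: no score
    return spam_hits - ham_hits


ham_seen = {"get some lunch", "see you tomorrow", "click here"}
spam_seen = {"click here", "cheap meds online"}

print(score_message({"get some lunch", "click here"}, ham_seen, spam_seen))
print(score_message({"cheap meds online"}, ham_seen, spam_seen))
print(score_message({"guten tag", "click here"}, ham_seen, spam_seen))
```

In the last call, one fingerprint is unknown and the other appears on both sides, so the message gets no score at all rather than being pushed toward ham or spam.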
Antony.
--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400