On Wednesday 17 August 2016 at 05:06:50, Marc Perkel wrote:

> What I'm doing is looking for fingerprints in email that intersect HAM
> and not in SPAM - which would be a HAM result.
> If it matches SPAM and does NOT match HAM - then it's SPAM.
> 
> The magic is in the NOT matching on the other side.
> 
> So if I say to you, "Let's get some lunch" that's ham because spammers
> never say that, but normal people do. So the way to test what "spammers
> never say" is to store what they do say and see if it's NOT in the list.
> (Thus the infinite set)

What length are the tokens you store in the list? Single words (so the above
lunch example would contain 4 tokens)? Entire phrases (so the above would be
just 1 token)? Also, how do you deal with spam which contains random cuttings
from legitimate texts (generally along with a graphic attachment and/or a URL
to get across the "real" message)?
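
To make the question concrete, here is a minimal sketch of how I read the
scheme. It is only my assumption about how it might work, not a description
of your implementation; the names tokens() and classify() and the parameter
n (words per fingerprint) are hypothetical.

```python
def tokens(text, n):
    """Return the set of n-word fingerprints in text.

    Words are lowercased and stripped of surrounding punctuation.
    n=1 gives single words; n equal to the sentence length gives one phrase.
    """
    words = [w.strip('.,!?"\'').lower() for w in text.split()]
    words = [w for w in words if w]
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def classify(text, ham, spam, n=1):
    """Classify by fingerprints seen on one side and NOT on the other."""
    found = tokens(text, n)
    ham_only = found & (ham - spam)    # seen in ham, never in spam
    spam_only = found & (spam - ham)   # seen in spam, never in ham
    if ham_only and not spam_only:
        return "ham"
    if spam_only and not ham_only:
        return "spam"
    return "unknown"                   # both sides matched, or neither


# Hypothetical training sets of single-word fingerprints.
ham = {"let's", "get", "some", "lunch", "cheap"}
spam = {"viagra", "cheap"}

print(len(tokens("Let's get some lunch", 1)))            # 4 tokens
print(len(tokens("Let's get some lunch", 4)))            # 1 token
print(classify("Let's get some lunch", ham, spam))       # ham
print(classify("cheap viagra", ham, spam))               # spam
```

With single words, spam padded with cuttings from legitimate text would
pick up ham-only fingerprints and land in "unknown" at best, which is why
the token length matters.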

> Similarly, there's only so many ways to misspell viagra, and good email
> wouldn't have it spelled wrong.

Does this mean that people with bad spelling are more likely to be classified
as spam, because they do not match the 'ham' group very well?

Also, what happens to mail that contains lots of tokens which match neither
set (for example, perfectly legitimate email which happens to be in a language
the system hasn't been trained with)?
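
One way to see the size of that problem would be to measure how much of a
message the system has seen at all. The sketch below assumes single-word
tokens; coverage() and both training sets are hypothetical and only
illustrate the question.

```python
def coverage(text, ham, spam):
    """Fraction of a message's distinct words present in either training set."""
    words = {w.strip('.,!?"\'').lower() for w in text.split()}
    words.discard("")
    if not words:
        return 0.0
    known = words & (ham | spam)
    return len(known) / len(words)


# Hypothetical training sets, English only.
ham_words = {"get", "some", "lunch"}
spam_words = {"viagra"}

print(coverage("Get some lunch", ham_words, spam_words))         # 1.0
print(coverage("Gehen wir etwas essen", ham_words, spam_words))  # 0.0
```

A message with coverage near zero gives the classifier nothing to decide
on, whichever side it really belongs to.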


Antony.

-- 
Wanted: telepath.   You know where to apply.

                                                   Please reply to the list;
                                                         please *don't* CC me.
