On 08/22/16 07:37, Antony Stone wrote:
On Monday 22 August 2016 at 16:34:09, Marc Perkel wrote:

OK - trying to make this really simple. Just talking about the concept now.

Let's say I get an email where the subject is "I have aednocarsonoma of
the lung".

Right off you know it's ham because spammers never use the word
"aednocarsonoma" and normal people do. Spammers also never use:

"of the lung"
"the lung"
"aednocarsonoma of"

How do you create those boundaries to define the tokens?

Here's an example:

"the quick brown fox jumps over the lazy dog"

becomes ...

"the"
"quick" "the quick"
"brown" "quick brown" "the quick brown"
"fox" "brown fox" "quick brown fox" "the quick brown fox"
"jumps" "fox jumps" "brown fox jumps" "quick brown fox jumps"
"over" "jumps over" "fox jumps over" "brown fox jumps over"
"the" "over the" "jumps over the" "fox jumps over the"
"lazy" "the lazy" "over the lazy" "jumps over the lazy"
"dog" "lazy dog" "the lazy dog" "over the lazy dog"
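
The expansion above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code the filter actually runs: the function name window_tokens is made up, and the limit of four words per token is an assumption read off the example (no token in it is longer than four words). Words are split on whitespace with no punctuation handling.

```python
def window_tokens(text, max_words=4):
    """Return every run of 1..max_words consecutive words, grouped by last word.

    max_words=4 is an assumption inferred from the example in this message.
    """
    words = text.split()
    tokens = []
    for end in range(len(words)):
        # Emit the 1-word, 2-word, ... tokens that end at this word.
        for n in range(1, max_words + 1):
            start = end - n + 1
            if start < 0:
                break  # not enough preceding words for a longer token
            tokens.append(" ".join(words[start:end + 1]))
    return tokens


tokens = window_tokens("the quick brown fox jumps over the lazy dog")
for token in tokens:
    print(repr(token))
```

Every word starts a new group, so there are no boundaries to choose: each word and each short phrase ending at that word becomes a token.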

....

So - tell me you follow this so far. Spammers don't spam about
aednocarsonoma.

In this case I'm identifying ham because in some previous email people
were talking about lung cancer and those phrases were learned as ham.
But what makes it really ham is not just that it matches previous ham,
but that it doesn't match previous spam.

A word like "Viagra", for example, would produce no score because it is in
both sets. However, "cheapest viagra online" would match spam and not
match ham, indicating it's spam.

So what makes "cheapest Viagra online" a token, such that "cheapest" and
"online" are not tokens?

They would all be tokens. I was just pointing out one that would match spam and not match ham. "cheapest" and "online" would likely be in both sets and would be ignored.
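
Here is a rough Python sketch of that comparison, under stated assumptions: the learned ham and spam sets are plain Python sets, the training messages are toy data invented for this illustration, tokens are limited to four words, and the names window_tokens, learn and one_sided_matches are hypothetical. It shows only the concept of ignoring tokens found in both sets, not a real scoring formula.

```python
def window_tokens(text, max_words=4):
    """All runs of 1..max_words consecutive words (lowercased, whitespace split)."""
    words = text.lower().split()
    return [
        " ".join(words[start:end])
        for end in range(1, len(words) + 1)
        for start in range(max(0, end - max_words), end)
    ]


def learn(messages):
    """Collect every token seen in a list of training messages."""
    seen = set()
    for message in messages:
        seen.update(window_tokens(message))
    return seen


def one_sided_matches(message, ham_set, spam_set):
    """Return (ham_only, spam_only): tokens found in exactly one learned set.

    Tokens present in both sets carry no signal and are ignored.
    """
    tokens = set(window_tokens(message))
    ham_only = (tokens & ham_set) - spam_set
    spam_only = (tokens & spam_set) - ham_set
    return ham_only, spam_only


# Toy training data, invented purely for illustration.
ham_set = learn([
    "my father has aednocarsonoma of the lung",
    "is viagra safe to order online",
    "cheapest flights to boston",
])
spam_set = learn(["cheapest viagra online no prescription needed"])

spam_ham_only, spam_spam_only = one_sided_matches(
    "cheapest viagra online", ham_set, spam_set)
lung_ham_only, lung_spam_only = one_sided_matches(
    "I have aednocarsonoma of the lung", ham_set, spam_set)

print(sorted(spam_spam_only))
print(sorted(lung_ham_only))
```

With this toy data, "cheapest", "viagra" and "online" appear in both sets and drop out, while the multi-word phrases match only the spam set. The lung message matches only ham tokens such as "of the lung".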

--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400
