On Monday 22 August 2016 at 16:34:09, Marc Perkel wrote:

> OK - Trying to make this really simple. Just talking about the concept now.
> 
> Let's say I get an email where the subject is "I have aednocarsonoma of
> the lung".
> 
> Right off you know it's ham because spammers never use the word
> "aednocarsonoma" and normal people do. Spammers also never use:
> 
> "of the lung"
> "the lung"
> "aednocarsonoma of"

How do you create those boundaries to define the tokens?

> ....
> 
> So - tell me you follow this so far. Spammers don't spam about
> aednocarsonoma.
> 
> In this case I'm identifying ham because in some previous email people
> were talking about lung cancer and those phrases were learned as ham.
> But what makes it really ham is not just that it matches previous ham,
> but it doesn't match previous spam.
> 
> A word like Viagra, for example, would produce no score because it is in
> both sets. However, "cheapest viagra online" would match spam and not
> match ham, indicating it's spam.

So what makes "cheapest Viagra online" a token, such that "cheapest" and 
"online" are not tokens?

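To make the question concrete, here is a minimal sketch of one possible
reading of the scheme. It is an assumption for illustration, not necessarily
what Marc has in mind: every run of 1 to 3 consecutive words is treated as a
token, and a token counts only if it was seen in exactly one of the two
learned sets. The names ngrams, learn and classify, the max_n limit and the
sample messages are all hypothetical.

```python
def ngrams(text, max_n=3):
    """Return the set of all word n-grams (1..max_n words) in text."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def learn(messages, max_n=3):
    """Collect every n-gram seen in a list of training messages."""
    seen = set()
    for message in messages:
        seen |= ngrams(message, max_n)
    return seen


def classify(subject, ham_tokens, spam_tokens, max_n=3):
    """Count tokens seen only in ham and tokens seen only in spam.

    Tokens present in both sets (or in neither) contribute nothing.
    """
    tokens = ngrams(subject, max_n)
    ham_only = tokens & (ham_tokens - spam_tokens)
    spam_only = tokens & (spam_tokens - ham_tokens)
    return len(ham_only), len(spam_only)


# Hypothetical training data, invented purely for this example.
ham = learn(["aednocarsonoma of the lung is treatable",
             "my doctor prescribed viagra"])
spam = learn(["cheapest viagra online"])

print(classify("I have aednocarsonoma of the lung", ham, spam))
print(classify("cheapest viagra online", ham, spam))
```

Under this reading, "cheapest" and "online" are tokens too. They count
towards spam here only because the tiny ham sample never contains them; with
a realistic corpus they would presumably turn up in both sets and cancel out,
just as "viagra" does, leaving the longer phrases to carry the signal.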

Antony.

-- 
The words "e pluribus unum" on the Great Seal of the United States are from a 
poem by Virgil entitled "Moretum", which is about cheese and garlic salad 
dressing.

                                                   Please reply to the list;
                                                         please *don't* CC me.
