On Monday 22 August 2016 at 16:34:09, Marc Perkel wrote:
> OK - Trying to make this really simple. Just talking about the concept now.
>
> Let's say I get an email where the subject is "I have aednocarsonoma of
> the lung".
>
> Right off you know it's ham because spammers never use the word
> "aednocarsonoma" and normal people do. Spammers also never use:
>
> "of the lung"
> "the lung"
> "aednocarsonoma of"
How do you create those boundaries to define the tokens?
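To make the question concrete: one possible answer is that no boundaries are chosen at all, and every contiguous run of words counts as a token. The sketch below assumes exactly that. The name `word_ngrams`, the `max_len` limit of three words, and the whitespace/lowercase splitting are illustrative guesses, not a description of the scheme being proposed.

```python
def word_ngrams(text, max_len=3):
    """Return every contiguous run of 1..max_len words as a token.

    Hypothetical helper: whitespace splitting, lowercasing and the
    max_len cap are assumptions made for this illustration only.
    """
    words = text.lower().split()
    tokens = []
    for n in range(1, max_len + 1):           # token length in words
        for i in range(len(words) - n + 1):   # start position of the run
            tokens.append(" ".join(words[i:i + n]))
    return tokens


tokens = word_ngrams("I have aednocarsonoma of the lung")
```

Under this reading, "of the lung", "the lung" and "aednocarsonoma of" all fall out automatically, but so do single words such as "lung" and phrases such as "i have".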
> ....
>
> So - tell me you follow this so far. Spammers don't spam about
> aednocarsonoma.
>
> In this case I'm identifying ham because in some previous email people
> were talking about lung cancer and those phrases were learned as ham.
> But what makes it really ham is not just that it matches previous ham,
> but it doesn't match previous spam.
>
> A word like Viagra for example would produce no score because it is in
> both sets. However "cheapest viagra online" would match spam and not
> match ham indicating it's spam.
So what makes "cheapest Viagra online" a token, such that "cheapest" and
"online" are not tokens?
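As a sketch of why this matters: if every run of words is a token (an assumption, as are all names below), then "cheapest" and "online" are tokens too, and whether they stay neutral depends entirely on what has been learned. The training messages here are invented toy data, not real mail.

```python
def ngrams(text, max_len=3):
    """All contiguous runs of 1..max_len words (assumed tokenisation)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)}


def learn(messages):
    """Collect every token seen in a list of messages."""
    seen = set()
    for message in messages:
        seen |= ngrams(message)
    return seen


# Invented toy training data, for illustration only.
ham = learn(["my doctor prescribed viagra", "book flights online"])
spam = learn(["cheapest viagra online", "buy viagra now"])


def score(subject):
    """Return (ham-only tokens, spam-only tokens) found in the subject."""
    tokens = ngrams(subject)
    return tokens & (ham - spam), tokens & (spam - ham)


ham_hits, spam_hits = score("cheapest viagra online")
```

With this toy data "viagra" and "online" drop out because both sets contain them, but "cheapest" still counts towards spam on its own, simply because no learned ham happened to use it.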
Antony.
--
The words "e pluribus unum" on the Great Seal of the United States are from a
poem by Virgil entitled "Moretum", which is about cheese and garlic salad
dressing.
Please reply to the list;
please *don't* CC me.