On Monday 22 August 2016 at 16:34:09, Marc Perkel wrote:

> OK - Trying to make the really simple. Just talking about concept now.
>
> Let's say I get an email where the subject is "I have aednocarsonoma of
> the lung".
>
> Right off you know it's ham because spammers never use the word
> "aednocarsonoma" and normal people do. Spammer also never use:
>
> "of the lung"
> "the lung"
> "aednocarsonoma of"

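To make the question concrete: one possible reading of the scheme quoted above is that every contiguous run of words, up to some length, becomes a token. The sketch below assumes that reading; the function name `word_ngrams` and the `max_len` cutoff of 3 are hypothetical and not taken from Marc's description.

```python
def word_ngrams(text, max_len=3):
    """Return every contiguous run of 1..max_len words in text, lowercased.

    Assumption: tokens are split on whitespace only; punctuation handling
    is left out of this sketch.
    """
    words = text.lower().split()
    tokens = []
    for n in range(1, max_len + 1):
        # Slide a window of n words across the subject line.
        for start in range(len(words) - n + 1):
            tokens.append(" ".join(words[start:start + n]))
    return tokens


subject = "I have aednocarsonoma of the lung"
tokens = word_ngrams(subject)
for token in tokens:
    print(token)
```

Under this reading "of the lung", "the lung" and "aednocarsonoma of" all come out as tokens, but so do the single words and every other run, which is why the boundary question matters.
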
How do you create those boundaries to define the tokens?

> ....
>
> So - tell me you follow this so far. Spammers don't spam about
> aednocarsonoma.
>
> In this case I'm identifying ham because in some previous email people
> were talking about lung cancer and those phrases were learned as ham.
> But what makes it really ham is not just that it matches previous ham,
> but it doesn't match previous spam.
>
> A word like Viagra for example would produce no score because it is in
> both sets. However "cheapest viagra online" would match spam and not
> match ham indicating it's spam.

So what makes "cheapest Viagra online" a token, such that "cheapest" and
"online" are not tokens?


Antony.

-- 
The words "e pluribus unum" on the Great Seal of the United States are from a
poem by Virgil entitled "Moretum", which is about cheese and garlic salad
dressing.

Please reply to the list; please *don't* CC me.