['yahoo.de']
> Ok, I'll take a look at it later.

If you took a look at it now, you might not need to ask this <0.5 wink>.
> But there is a question regarding whitespace
> and token building.
> Let's consider this sample:
> I get an email with only this paragraph in the body:
> Sun is shining.
> If you say, because of whitespace, there are only:
> 1-sun
> 2-is
> 3-shining
> to be checked,

In short: yes.  In reality, we skip any tokens less than three characters
in length, and there are also many tokens from the headers.

> I will ask what is with the substrings in sun and shining
>
> 1-sun
> 2-su
> 3-un
>
> and all combinations for shining like
> 4-shining
> 5-hining
> 6-ining
> 7-ning
> 8-ing
> 9-ng
>
> ?
> Because the spam email could contain in this paragraph spam words
> like this:
> sunBuy is shiningViagra
> I hope the sample is understandable :-)

Look for mention of "character n-grams" in the comments in tokenizer.py
for discussion about this.  In short, 'words' work better and have the
added bonus of resulting in (mostly) human-understandable tokens.

Your example (assuming there are no header tokens) would either be spam
(another spam using these embedded words has already been trained), or
unsure (the tokens have never been seen before).

Your example is also extremely unclear - it does a very poor job of
selling, which is the whole point, after all.  So a spammer gains little,
and has lost a lot.

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
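A rough sketch of the contrast being discussed, purely illustrative and not
the actual tokenizer.py code (the function names and the character n-gram
variant are made up for this example; the three-character minimum matches
what is described above):

    # Illustrative only -- not the real SpamBayes tokenizer.py.
    import re

    def word_tokens(text, min_len=3):
        # Split on whitespace and skip tokens shorter than min_len
        # characters, as described in the reply above.
        for word in text.split():
            word = word.lower()
            if len(word) >= min_len:
                yield word

    def char_ngrams(text, n=3):
        # The alternative mentioned in tokenizer.py's comments:
        # overlapping character n-grams.
        stripped = re.sub(r"\s+", "", text.lower())
        for i in range(len(stripped) - n + 1):
            yield stripped[i:i + n]

    body = "sunBuy is shiningViagra"
    print(list(word_tokens(body)))  # ['sunbuy', 'shiningviagra']
    print(list(char_ngrams(body)))  # ['sun', 'unb', 'nbu', 'buy', ...]

With word tokens, the disguised spam words produce tokens that have most
likely never been trained on, so the message scores as unsure (or as spam,
if similar embedded words have already been trained), which is the
behaviour described in the reply above.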
