['yahoo.de']
> Ok, I'll take a look at it later.

If you took a look at it now, you might not need to ask this <0.5 wink>.
> But there is a question regarding whitespace
> and token building.
> Let's consider this sample:
> I get an email with only this paragraph in the body:
> Sun is shining.
> If you say, because of whitespace, there are only:
> 1-sun
> 2-is
> 3-shining
> to be checked,

In short: yes.  In reality, we skip any tokens less than three characters
in length, and there are also many tokens from the headers.

> I will ask what is with the substrings in sun and shining
>
> 1-sun
> 2-su
> 3-un
>
> and all combinations for shining like
> 4-shining
> 5-hining
> 6-ining
> 7-ning
> 8-ing
> 9-ng
>
> ?
> Because the spam email could contain in this paragraph spam words
> like this:
> sunBuy is shiningViagra
> I hope the sample is understandable :-)

Look for mention of "character n-grams" in the comments in tokenizer.py
for discussion about this.  In short, 'words' work better and have the
added bonus of resulting in (mostly) human-understandable tokens.

Your example (assuming there are no header tokens) would either be spam
(another spam using these embedded words has already been trained), or
unsure (the tokens have never been seen before).

Your example is also extremely unclear - it does a very poor job of
selling, which is the whole point, after all.  So a spammer gains little,
and has lost a lot.

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
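A rough sketch of the contrast being discussed, purely illustrative and not
the actual tokenizer.py code (the function names and the character n-gram
variant are made up for this example; the three-character minimum matches
what is described above):

    # Illustrative only -- not the real SpamBayes tokenizer.py.
    import re

    def word_tokens(text, min_len=3):
        # Split on whitespace and skip tokens shorter than min_len
        # characters, as described in the reply above.
        for word in text.split():
            word = word.lower()
            if len(word) >= min_len:
                yield word

    def char_ngrams(text, n=3):
        # The alternative mentioned in tokenizer.py's comments:
        # overlapping character n-grams.
        stripped = re.sub(r"\s+", "", text.lower())
        for i in range(len(stripped) - n + 1):
            yield stripped[i:i + n]

    body = "sunBuy is shiningViagra"
    print(list(word_tokens(body)))  # ['sunbuy', 'shiningviagra']
    print(list(char_ngrams(body)))  # ['sun', 'unb', 'nbu', 'buy', ...]

With word tokens, the disguised spam words produce tokens that have most
likely never been trained on, so the message scores as unsure (or as spam,
if similar embedded words have already been trained), which is the
behaviour described in the reply above.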
