I'm looking to modify spambayes to tokenize into 5-grams rather than splitting on whitespace. We have a few Asian customers, and the default spambayes setup has not been very effective for them, so we want to test 5-grams and see whether they improve its effectiveness.
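To make the idea concrete, here is a rough sketch of what I mean by a 5-gram tokenizer: overlapping character windows instead of whitespace-delimited words. The function name `char_ngrams` is made up for illustration and is not part of spambayes, and the whitespace normalization is just an assumption about what would be sensible.

```python
def char_ngrams(text, n=5):
    """Yield overlapping character n-grams from text (hypothetical helper)."""
    # Collapse runs of whitespace so line wrapping does not change the tokens.
    text = " ".join(text.split())
    if len(text) < n:
        # Input shorter than n: emit it whole rather than dropping it.
        if text:
            yield text
        return
    # Slide a window of n characters across the text, one character at a time.
    for i in range(len(text) - n + 1):
        yield text[i:i + n]


if __name__ == "__main__":
    # Works the same on text with no spaces between words.
    for token in char_ngrams("垃圾邮件过滤器"):
        print(token)
```

This only covers the windowing step. How it would plug into the existing spambayes tokenizer (headers, URLs, and so on) is not addressed here.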
I know that n-grams have been tested several times before. So, if anyone has an n-gram tokenizer that they can share, I would appreciate a copy. Otherwise, I'll dive in and write it myself.

Thanks.
Richard Coleman
