On Mon, 2007-02-26 at 16:07 -0800, Peter Bishop wrote:
> I was looking into why "Schwarzenegger" was not recognized as a token,
> when I discovered that you had determined that it was good to have a
> 12 character limit on tokens. Is this really better than a 15 char
> limit?
For languages with short words, like English, increasing the token length limit will give only marginally better results, if any. On the other hand, if a lot of your correspondence is in a language with long words (like German - and Schwarzenegger is a German/Austrian name), then increasing the limit might give better results.

I presume the devs chose a limit of 12 chars based on experience (they have tested with thousands of messages). I think there must be some balance between the efficiency of the algorithm and the size of the token database.

There is only one way to find out if a token limit of 15 is better _for_ _you_: try it out.

-- 
Amedee
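To make the effect of the limit concrete, here is a minimal sketch of a length-capped tokenizer. It is not the SpamBayes tokenizer: the function name `tokenize`, the word regex, the 3-character minimum and the "skip:" placeholder for over-long words are all assumptions made for illustration. It only shows why a 14-letter name survives as its own token under a limit of 15 but not under 12.

```python
import re

def tokenize(text, max_len=12, min_len=3):
    """Hypothetical tokenizer (not SpamBayes code): keep words whose
    length is between min_len and max_len characters inclusive."""
    tokens = []
    for word in re.findall(r"[A-Za-z']+", text.lower()):
        n = len(word)
        if min_len <= n <= max_len:
            tokens.append(word)
        elif n > max_len:
            # Assumed behaviour: replace an over-long word with a coarse
            # placeholder (first letter plus length rounded down to 10),
            # so the database does not grow with every long string.
            tokens.append("skip:%s %d" % (word[0], n // 10 * 10))
    return tokens

sample = "Governor Schwarzenegger signed the bill"
for limit in (12, 15):
    print(limit, tokenize(sample, max_len=limit))
```

Running the same comparison over your own ham and spam corpus, and comparing both the error rates and the number of distinct tokens stored, is the kind of test that would show whether 15 is better for your mail.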
_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
