Re: [Spambayes] 12 char limit on tokens

Amedee Van Gasse Mon, 26 Feb 2007 23:22:15 -0800

On ma, 2007-02-26 at 16:07 -0800, Peter Bishop wrote:
> I was looking into why "Schwarzenegger" was not recognized as a token,
> when I discovered that you had determined that it was good to have a
> 12 character limit on tokens.  Is this really better than a 15 char
> limit?


For languages with short words like English, increasing the token length
will only give marginally better results (if any).
On the other hand, if a lot of your correspondence is in a language with
long words (like German - and Schwarzenegger is a German/Austrian name),
then increasing the token length might give better results.

I presume the devs chose a limit of 12 chars based on experience (they
have tested with thousands of messages). I think there must be some
trade-off between the accuracy of the algorithm and the size of the
token database.

There is only one way to find out if a token limit of 15 is better _for_
_you_: try it out.
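
To get a feel for what the limit does before changing anything, here is
a minimal sketch in Python. It is NOT the SpamBayes tokenizer: the
function name tokenize(), its parameters, and the sample sentence are
made up for illustration. I am assuming that over-long words are
replaced by a short "skip" summary token (first character plus length
rounded down to a multiple of 10), which is how I understand the real
tokenizer to behave, but check the source before relying on that.

```python
import re

def tokenize(text, max_len=12, min_len=3):
    """Hypothetical, simplified word tokenizer (not the SpamBayes code).

    Words of min_len..max_len characters are kept as tokens. Longer
    words are replaced by a summary token, so the word itself never
    reaches the token database.
    """
    tokens = []
    for word in re.findall(r"\S+", text.lower()):
        n = len(word)
        if min_len <= n <= max_len:
            tokens.append(word)
        elif n > max_len:
            # Assumed summary format: first char + length rounded down to 10s.
            tokens.append("skip:%s %d" % (word[0], n // 10 * 10))
        # Words shorter than min_len are dropped.
    return tokens

text = "Governor Schwarzenegger signed it"
print(tokenize(text, max_len=12))  # ['governor', 'skip:s 10', 'signed']
print(tokenize(text, max_len=15))  # ['governor', 'schwarzenegger', 'signed']
```

"Schwarzenegger" has 14 characters, so it only survives as a token of
its own once the limit is 14 or higher. Whether that extra token
actually improves classification on your mail is something only a test
on your own ham and spam can tell you.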

-- 
Amedee


_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
