I absolutely suck at math, so I'm not even gonna think about it. But I've seen token databases with 3+ million tokens.
I've got Bayes code running now using hashes, using Sidney's substr(sha1($token), -5) value. It provides a slight speedup (maybe 10-20%) on scanning. With the corpus I'm using I see a ~7% decrease in DB size (for DBM; I haven't looked at the SQL DBs yet), but my average token size has been around 9 chars, so we're not exactly shrinking things a ton.

Question is, is using that value gonna work in the long run for DBs with 3-4 million tokens?

Michael

PS Interesting factoid: on my benchmark corpus (6000 ham and 6000 spam) we extract ~120k more tokens in 3.0.0 than we did in 2.63.
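
PPS For anyone who wants to plug numbers in, here's a minimal back-of-the-envelope sketch. It assumes sha1() here is Digest::SHA1's raw 20-byte digest (so the last 5 bytes make a 40-bit key) and uses the standard birthday approximation; the token and the 4-million figure are just placeholders:

    use strict;
    use warnings;
    use Digest::SHA1 qw(sha1);

    # Truncated key: the last 5 bytes (40 bits) of the raw SHA-1 digest,
    # stored in place of the token text itself.
    my $token = 'example-token';
    my $key   = substr(sha1($token), -5);
    printf "key for '%s': %s\n", $token, unpack("H*", $key);

    # Birthday approximation: n tokens spread over 2^40 possible keys
    # collide roughly n*(n-1) / (2 * 2^40) times.
    my $n = 4_000_000;
    printf "expected collisions for %d tokens: ~%.1f\n",
        $n, $n * ($n - 1) / (2 * 2**40);

Under those assumptions it comes out to something like 7 colliding token pairs at 4 million tokens, but I'd want someone who doesn't suck at math to check that.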
