I absolutely suck at math so I'm not even gonna think about it.  But
I've seen token databases with 3+ million tokens.

I've got bayes code running now using hashes, specifically Sidney's
substr(sha1($token), -5) value.  It provides a slight speedup (maybe
10-20%) on scanning.  With the corpus I'm using I see a ~7% decrease in
db size (for DBM; I haven't looked at the SQL dbs yet), but my average
token size has been around 9 chars, so we're not exactly shrinking
things a ton.
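
Here's a rough sketch of what I'm testing -- not the actual Bayes.pm
patch, just the idea, with a made-up hash_token() helper, and assuming
sha1() is the binary digest from Digest::SHA1:

    use Digest::SHA1 qw(sha1);

    sub hash_token {
        my ($token) = @_;
        # keep only the last 5 bytes (40 bits) of the SHA1 digest,
        # so every db key is a fixed 5 bytes instead of the raw token
        return substr(sha1($token), -5);
    }

    # e.g. the key stored for "viagra" becomes a 5-byte binary string
    my $key = hash_token("viagra");
    printf "key = %s (%d bytes)\n", unpack("H*", $key), length($key);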

Question is, is a truncated hash like that gonna work in the long run
for dbs with 3-4 million tokens, or do collisions become a problem?
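
FWIW, here's a quick birthday-problem estimate (again assuming sha1()
is the binary digest, so the last 5 bytes give a 40-bit key space):

    # back-of-the-envelope collision estimate for 5-byte (40-bit) keys:
    # expected colliding pairs ~= n*(n-1) / (2 * 2**40)
    my $keyspace = 2 ** 40;
    for my $n (1_000_000, 3_000_000, 4_000_000) {
        my $expected = $n * ($n - 1) / (2 * $keyspace);
        printf "%9d tokens -> ~%.1f expected collisions\n", $n, $expected;
    }

That works out to roughly 0.5, 4, and 7 expected colliding pairs, which
seems like noise for bayes purposes.  If sha1() there were actually the
hex form, though, the last 5 chars are only 20 bits (~1M values), and
you can't even fit 3 million tokens without collisions.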

Michael

PS Interesting factoid, on my benchmark corpus (6000 ham and 6000
spam) we extract ~120k more tokens in 3.0.0 than we did in 2.63.
