Is there a particular problem you are trying to solve?
Yes, I'm trying to figure out why Kelsey sees the very high I/O requirements that block him from scaling up to multiple tens of thousands of users, while DSPAM claims to be running at a site with 125,000 users.
One thing I noticed is that each time a message is processed, the entire set of Bayes data for that user would have to be read in order to process the tokens parsed out of that message. Suppose a typical user has 200,000 tokens in the database, and each token record contains 10 bytes for the username, 14 bytes for the token, and a 4-byte integer for each of spam count, ham count, and atime, indexed by username as the primary key. That's 36 bytes per record, so 7,200,000 bytes, plus whatever the index takes, that has to be read in for each message? And then the next message is for a different user, so it has to do it all over again?
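To make that back-of-the-envelope estimate concrete, here is a small sketch of the per-message read cost. The token count and field sizes are just the assumptions stated above, not measured numbers:

    # Rough per-message read estimate if the whole per-user token set is scanned.
    # All figures are assumptions from the discussion above, not measurements.
    TOKENS_PER_USER = 200_000
    USERNAME_BYTES  = 10       # username stored in every row
    TOKEN_BYTES     = 14       # raw token text
    COUNTER_BYTES   = 4 * 3    # spam count, ham count, atime (4-byte ints each)

    row_bytes = USERNAME_BYTES + TOKEN_BYTES + COUNTER_BYTES   # 36 bytes/row
    per_message_read = TOKENS_PER_USER * row_bytes             # 7,200,000 bytes

    print(f"{row_bytes} bytes/row, ~{per_message_read:,} bytes read per message")
    # -> 36 bytes/row, ~7,200,000 bytes read per message (plus index overhead)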
That's why I think it might help to convert the username to an integer uid before it gets to the database, and perhaps store a hash of the token instead of the token itself; a rough sketch of the possible saving follows below.
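A minimal sketch of what that could look like, assuming a 4-byte integer uid and an 8-byte token hash (the sizes and the SHA-1 choice are only illustrative guesses, not what any particular backend actually does):

    import hashlib

    def token_key(token: str, nbytes: int = 8) -> bytes:
        """Hypothetical fixed-width token key: first nbytes of the SHA-1 digest."""
        return hashlib.sha1(token.encode("utf-8")).digest()[:nbytes]

    # Row size today vs. with an integer uid and a hashed token
    # (field sizes are the assumptions from the message above).
    current_row  = 10 + 14 + 4 * 3   # username + raw token + 3 counters = 36 bytes
    proposed_row = 4 + 8 + 4 * 3     # 4-byte uid + 8-byte hash + counters = 24 bytes

    print(token_key("somelongtokenparsedfromamessage").hex())
    print(f"{current_row} -> {proposed_row} bytes per token record")

With those assumed sizes the records shrink by about a third, on top of whatever the database saves by indexing on a fixed-width integer instead of a variable-length username.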
-- sidney
