On Thu, Mar 11, 2004 at 07:36:07PM +1300, Sidney Markowitz wrote:
> Michael Parker wrote:
> >Is there a particular problem you are trying to solve?
>
> Yes, I'm trying to figure out why Kelsey sees the very high I/O
> requirements that he does that blocks him from scaling up to the
> multiple tens of thousands of users, while DSPAM claims to be running on
> a site with 125,000 users.

Kelsey will have to correct me if I'm wrong, but he's not seeing the
high I/O with the MySQL bayes storage; he's seeing it with the DB_File
solution.

> One thing that I noticed is that each time a message is processed, the
> entire set of Bayes data for that user would have to be read in order to
> process the tokens that are parsed out of that message. If a typical
> user has 200,000 tokens in the database and each token record contains
> 10 bytes for their username, 14 bytes for the token, and 4 byte integers
> for each of spam count, ham count, and atime, and is indexed by username
> as primary key, then is that 7,200,000 bytes plus whatever the index
> takes that has to be read in for each message? And then the next message
> is for a different user so it has to do it all over again?

The index would have to be loaded into memory, not the entire table.
The index on bayes_token should be fairly efficient. Given enough
memory, MySQL can hold a good bit of the index in memory. Practically
every single one of my queries is served out of memory. Recently I've
been peaking somewhere around 200 queries per second, with performance
about equal to, if not slightly better than, DB_File on the same
hardware.

> That's why I think it might help to convert to an integer uid before
> getting to the database, and perhaps use a hash instead of the actual token.

I'm not a database theory expert, so I won't pretend to have all the
answers. I think you are trying to solve a problem that doesn't need to
be solved. Yes, fixed-length char columns have a slight advantage over
varchars, so hashing the token might help. An integer uid would mean
you'd have to join to the userpref table, which I don't think you'd
want to do unless you had to.

That said, I wouldn't mind seeing a benchmark: something that we can
easily throw at a bunch of different configurations to see which one
works out the best.

Michael
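
For anyone who wants to check the numbers in the quoted estimate, here
is a minimal Python sketch. The field sizes and the 200,000-token figure
come straight from the quote; the function and constant names are made
up for illustration, and rows_touched() assumes each token parsed from a
message is looked up individually through the index, which is an
assumption about the access pattern rather than a description of the
actual storage code.

```python
# Field sizes as assumed in the quoted estimate (not measured values).
USERNAME_BYTES = 10
TOKEN_BYTES = 14
INT_BYTES = 4              # one each for spam count, ham count, atime
TOKENS_PER_USER = 200_000  # "typical user" figure from the quote


def row_bytes():
    """Size of one token record under the quoted assumptions."""
    return USERNAME_BYTES + TOKEN_BYTES + 3 * INT_BYTES


def table_bytes_per_user(tokens=TOKENS_PER_USER):
    """Bytes read if every row for one user were scanned."""
    return row_bytes() * tokens


def rows_touched(message_tokens):
    """Rows fetched if each distinct token is looked up via the index."""
    return len(set(message_tokens))


if __name__ == "__main__":
    print("bytes per row:", row_bytes())
    print("full scan per user:", table_bytes_per_user())
    print("rows for a 3-token message:", rows_touched(["free", "offer", "free"]))
```

The 7,200,000-byte figure only applies if the whole per-user table is
scanned. With indexed lookups the row count is bounded by the number of
distinct tokens in the message, not by the size of the user's database.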
