On Thu, Mar 11, 2004 at 07:36:07PM +1300, Sidney Markowitz wrote:
> Michael Parker wrote:
> >Is there a particular problem you are trying to solve?
> 
> Yes, I'm trying to figure out why Kelsey sees the very high I/O 
> requirements that he does that blocks him from scaling up to the 
> multiple tens of thousands of users, while DSPAM claims to be running on 
> a site with 125,000 users.

Kelsey will have to correct me if I'm wrong, but he's not seeing the
high I/O with the MySQL Bayes storage; he's seeing it with the DB_File
solution.

> One thing that I noticed is that each time a message is processed, the 
> entire set of Bayes data for that user would have to be read in order to 
> process the tokens that are parsed out of that message. If a typical 
> user has 200,000 tokens in the database and each token record contains 
> 10 bytes for their username, 14 bytes for the token, and 4 byte integers 
> for each of spam count, ham count, and atime, and is indexed by username 
> as primary key, then is that 7,200,000 bytes plus whatever the index 
> takes that has to be read in for each message? And then the next message 
> is for a different user so it has to do it all over again?

The index would have to be loaded into memory, not the entire table.
The index on bayes_token should be fairly efficient.  Given enough
memory MySQL can hold a good bit of the index in memory.  Practically
every single one of my queries is served out of memory.  Recently I've
been peaking somewhere around 200 queries per second, with performance
about equal to, if not slightly better than, DB_File on the same hardware.
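
To put rough numbers on that, here is a small sketch of the arithmetic.
The field sizes and the 200,000 tokens per user are the ones from the
quoted estimate; TOKENS_PER_MESSAGE is purely an assumed figure for
illustration, and the whole thing assumes lookups go through an index
on (username, token) rather than a scan of all of a user's rows.

```python
# Field sizes come from the quoted estimate; the names are illustrative only.
USERNAME_BYTES = 10
TOKEN_BYTES = 14
INT_BYTES = 4               # one each for spam count, ham count, atime
TOKENS_PER_USER = 200_000

row_bytes = USERNAME_BYTES + TOKEN_BYTES + 3 * INT_BYTES
table_bytes_per_user = row_bytes * TOKENS_PER_USER

# ASSUMPTION: number of distinct tokens in one message (not from this thread).
TOKENS_PER_MESSAGE = 500

# With an index on (username, token), only the rows for tokens that
# actually appear in the message need to be fetched.
rows_touched_bytes = row_bytes * TOKENS_PER_MESSAGE

print(row_bytes, table_bytes_per_user, rows_touched_bytes)
```

Under those assumptions the row data touched per message is a small
fraction of the 7,200,000 bytes; index pages still have to be read, but
those are what MySQL tends to keep cached.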

> That's why I think it might help to convert to an integer uid before 
> getting to the database, and perhaps use a hash instead of the actual token.

I'm not a database theory expert, so I won't pretend to have all the
answers.  I think you are trying to solve a problem that doesn't need
to be solved.  Yes, fixed-length CHAR columns have a slight advantage
over VARCHARs, so hashing the token might help.  An integer uid would
mean you'd have to join to the userpref table, which I don't think
you'd want to do unless you had to.
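
For what it's worth, the token-hashing idea could be sketched roughly
like this.  This is not what the current code does; the function name,
the choice of SHA-1, and the 5-byte width are all assumptions made up
for the example.

```python
import hashlib

# ASSUMPTION: 5 bytes is an arbitrary illustrative width, not a value
# taken from the existing schema.
HASH_BYTES = 5


def token_key(token: str) -> bytes:
    """Map a token of any length to a fixed-length binary key.

    Hypothetical helper: keeps the last HASH_BYTES bytes of the SHA-1
    digest so every key fits a fixed-width column.
    """
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return digest[-HASH_BYTES:]


if __name__ == "__main__":
    for tok in ("viagra", "a" * 200):
        print(len(tok), token_key(tok).hex())
```

The trade-off is that a truncated hash can collide, so two different
tokens would occasionally share counts, and the original token text is
no longer recoverable from the table.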

That said, I wouldn't mind seeing a benchmark.  Something that we can
easily throw at a bunch of different configurations to see which one
works out the best.

Michael


