On average a message is broken down into 262 tokens (this is based on Sidney's mail flow), and our target for deployment is ~2-3k msg/min capacity. For DB_File, the worst case works out to 25-52 MB/sec of read I/O (4k read blocks * msg/sec * # of tokens). Our benchmarking is pretty much in line with the theoretical numbers. This doesn't take into account database updates.
This really crystallizes it for me. Even though I said I wasn't going to say more until we saw some actual numbers [ :-) ], I'm willing to go out on a limb now and predict that switching to SQL will not make a difference for you. Here is my reasoning, which I think is completely independent of whether we use DB_File or SQL or any other disk-based storage:
For each of the average 262 tokens in a message you have to access a database element to get the Bayes info for that user for that token. Either that element is on the disk or else it is already cached in memory. If it is on the disk, that means a minimum of one 4k disk block being read in.
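As a rough sanity check of that arithmetic, here is a back-of-envelope sketch in Python. It assumes, as above, one 4 KiB block read for every token that isn't already cached, and the message rates quoted at the top; the exact figures will depend on the block size and mail-flow assumptions.

    # Worst-case read bandwidth if every token lookup costs one 4 KiB block read.
    TOKENS_PER_MSG = 262       # average tokens per message (from the mail flow above)
    BLOCK_SIZE = 4 * 1024      # bytes read per uncached token lookup

    def read_bandwidth_mib(msgs_per_min):
        """Worst-case read I/O in MiB/sec at the given message rate."""
        msgs_per_sec = msgs_per_min / 60.0
        return msgs_per_sec * TOKENS_PER_MSG * BLOCK_SIZE / (1024 * 1024)

    for rate in (2000, 3000):
        print(f"{rate} msg/min -> ~{read_bandwidth_mib(rate):.0f} MiB/sec of reads")
    # ~34 MiB/sec at 2k msg/min, ~52 MiB/sec at 3k msg/min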
The only ways I can think of right now to read less than 4k * 262 bytes per message are: 1) cache all of the users' Bayes dbs in between messages, i.e., have tens or hundreds of GB of RAM cache; 2) find a way to not require the database info for a large percentage of the tokens; or 3) get more than one of the tokens that are in the message in each 4k block that is read.
The first one isn't feasible, and the second is a radical approach that isn't likely to be feasible either, though I will think about it.
The third one is not going to happen by itself. Even storing all of one user's tokens together is not enough. When I run the numbers for 262 tokens per message and the 135,000 tokens in my Bayes db, I find that we would have to squeeze the database elements down to an impossible 5 bytes per token just to cut the average requirement in half, to 2k * 262 bytes per message. The only way I see to do it is to sort the records on the disk so the most common tokens come first _and_ squeeze the record size down as small as possible. This isn't something that SQL can do without some explicit help from us.
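Here is a minimal sketch of that back-of-envelope in Python. The model (which is mine, for illustration) assumes a message's 262 tokens land uniformly at random across the 4k blocks holding one user's 135,000-token db, and asks how tightly the records would have to be packed before the expected number of distinct blocks read per message drops to half.

    # How small would records have to be before packing alone halves the reads?
    DB_TOKENS = 135_000    # tokens in one user's Bayes db
    MSG_TOKENS = 262       # tokens looked up per message
    BLOCK = 4 * 1024       # block size in bytes

    def expected_blocks_read(total_blocks):
        """Expected number of distinct blocks touched by one message's lookups."""
        return total_blocks * (1 - (1 - 1 / total_blocks) ** MSG_TOKENS)

    # Find the largest db (in blocks) whose expected reads per message are at
    # most half of 262, i.e. ~131 blocks.
    blocks = 1
    while expected_blocks_read(blocks + 1) <= MSG_TOKENS / 2:
        blocks += 1

    print(f"db must fit in ~{blocks} blocks, "
          f"i.e. ~{blocks * BLOCK / DB_TOKENS:.1f} bytes per token")
    # db must fit in ~164 blocks, i.e. ~5.0 bytes per token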
But it does point to some approaches. Separate the updating of the database from the reads that are done during the scan. Perhaps have a complete database that is updated, and from that generate a read-only version that is used for scans. Store that database sorted by user id and, within each user, with the most frequently seen tokens first. Make that db as small as possible by eliminating what isn't needed for the Bayes calculations in the scan. Perhaps the common tokens that are so poorly ranked don't even need the numbers stored with them, allowing us to fit even more in the first few blocks. Is the expiry date used in the Bayes calculation? If not, we don't need it in that db. That db can use 48-bit hashes for the tokens. Since we group the records by username, no individual record has to contain the username. Can we get away with less than a full 4-byte integer for the hit counts when we do the Bayes calculations, perhaps by storing the hit counts as a log? We might be able to get the db that is actually used during the scans down to just 10 bytes per element and perhaps 7 or 8 for the common tokens that are poorly scored discriminators.
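Purely as an illustration of what such a packed scan-time record might look like, here is a Python sketch. The field widths, the log base, and the helper names are all assumptions of mine, not an existing format.

    import math
    import struct

    def log_count(n, base=1.2):
        """Squeeze a hit count into a single byte by storing a rounded log."""
        return 0 if n <= 0 else min(255, int(round(math.log(n, base))) + 1)

    def pack_record(token_hash_48, spam_hits, ham_hits):
        """48-bit token hash plus two log-scaled counts: 8 bytes per record.

        The username is implied by the record's position (records are grouped
        by user), and the expiry date is left out of the scan-time db entirely.
        """
        hi, lo = token_hash_48 >> 32, token_hash_48 & 0xFFFFFFFF
        return struct.pack(">HIBB", hi, lo,
                           log_count(spam_hits), log_count(ham_hits))

    # A common but poorly ranked token could drop the counts and keep only the
    # 6-byte hash, since its probability is near 0.5 anyway.
    rec = pack_record(0x1A2B3C4D5E6F, spam_hits=1200, ham_hits=980)
    print(len(rec), "bytes")   # 8 bytes

At roughly 8 bytes per record, a single 4k block holds around 500 of them, so sorting each user's records by token frequency would put the bulk of a typical message's lookups into the first handful of blocks read.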
By the way, this should work for DB_File as well as for SQL.
-- sidney
