http://bugzilla.spamassassin.org/show_bug.cgi?id=3225

------- Additional Comments From [EMAIL PROTECTED]  2004-04-12 15:00 -------
Created an attachment (id=1890)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=1890&action=view)
Patch File

Here is a version that does several things:

1) Implements Sidney's tok_get_all method for SQL and DBM.  Right now the SQL
version fetches the tokens from the DB in chunks (100, 50, 25, 5, 1); the
chunk sizes still need to be benchmarked and tweaked to find what works best.

2) Removes several full table scans to find the token_count and newest/oldest
token atimes by moving those values into the bayes_vars table.

3) Removes some code that is no longer called.

4) Adds some basic caching to avoid multiple lookups.
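
To make the chunking in item 1 concrete, here is a minimal Python sketch of the
idea, not the patch itself (which is Perl).  It assumes a bayes_token table with
token, spam_count, ham_count and atime columns; the helper name
split_into_chunks and the use of sqlite3 are illustrative assumptions only.
Using a fixed set of chunk sizes means only a handful of distinct query shapes
are ever issued, which is presumably why those sizes were chosen.

```python
import sqlite3

# Chunk sizes quoted in item 1; the best values are still to be benchmarked.
CHUNK_SIZES = (100, 50, 25, 5, 1)


def split_into_chunks(tokens, sizes=CHUNK_SIZES):
    """Greedily split tokens into chunks whose lengths all come from sizes.

    sizes must be in descending order and end with 1 so nothing is left over.
    """
    chunks, pos = [], 0
    for size in sizes:
        while len(tokens) - pos >= size:
            chunks.append(tokens[pos:pos + size])
            pos += size
    return chunks


def tok_get_all(conn, tokens):
    """Fetch the rows for the given tokens, one IN (...) query per chunk.

    Table and column names are assumptions made for this sketch.
    """
    rows = []
    for chunk in split_into_chunks(list(tokens)):
        placeholders = ",".join("?" * len(chunk))
        sql = ("SELECT token, spam_count, ham_count, atime "
               "FROM bayes_token WHERE token IN (%s)" % placeholders)
        rows.extend(conn.execute(sql, chunk).fetchall())
    return rows
```

With these sizes a message with 183 tokens would be fetched in seven queries
(100, 50, 25, 5, 1, 1, 1) instead of 183 single-token lookups.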

This patch does change the SQL database version, so you cannot use it without
wiping your existing data and starting from scratch (for the DB savvy it is
possible to alter the bayes_vars table to add the new columns, populate them
with the right values, and bump the db version, but I'll leave that as an
exercise for the reader).  I'm hoping to get the backup/restore stuff done
before checking this in, to help folks who are already using this do the
upgrade without too much grief.
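
For anyone attempting the manual route, the following is only a rough Python
sketch of the "add the columns and populate them" step, not the actual upgrade
procedure.  The column names token_count, oldest_token_age and
newest_token_age are hypothetical stand-ins for whatever the patch adds, the
sketch assumes a single user's data in bayes_vars and bayes_token, and it uses
sqlite3 purely so it can be run standalone.  It does not bump the db version;
that step depends on the patch and is left out.

```python
import sqlite3

# Hypothetical names for the new summary columns; check the patch for the
# real ones before trying anything like this on live data.
NEW_COLUMNS = ("token_count", "oldest_token_age", "newest_token_age")


def add_summary_columns(conn):
    """Add the summary columns to bayes_vars and fill them from bayes_token.

    This does one last full scan of bayes_token so later reads of these
    values no longer need one.  Assumes bayes_token has an atime column and
    that both tables hold a single user's data.
    """
    for col in NEW_COLUMNS:
        conn.execute(
            "ALTER TABLE bayes_vars ADD COLUMN %s INTEGER NOT NULL DEFAULT 0"
            % col
        )
    conn.execute(
        """
        UPDATE bayes_vars SET
            token_count      = (SELECT COUNT(*) FROM bayes_token),
            oldest_token_age = (SELECT COALESCE(MIN(atime), 0) FROM bayes_token),
            newest_token_age = (SELECT COALESCE(MAX(atime), 0) FROM bayes_token)
        """
    )
    conn.commit()
```

Back up the database first; a real multi-user setup would need the same
update done per user.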

Performance-wise, my tests (via the benchmark) show a 2-3x speedup over the
old code.  Compared to DBM, SQL is about twice as slow for sa-learn operations
and statistically even for scanning.  The IO requirements should be much
smaller, and in my casual testing they are much lower for SQL than for DBM.

I'd love to hear some feedback from folks as to how well this works in your
setup.  Once I get some time I'd like to get folks using the benchmark I made
and hopefully extending it (for instance I'd love to start doing some
concurrent sa-learns and scanning to see how we do there).



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
