https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6444
--- Comment #20 from Mark Martinec <[email protected]> 2011-02-17 14:37:01 EST ---

> It looks like this bug has been sitting for a while. I was considering moving
> Bayes and possibly AWL to PostgreSQL but ran across this and bug 4400.
> Are the possible performance issues that great to not put this into
> production?

The current code in trunk (to become 3.4.0) has my last patch incorporated
(comment 9). It implements what Bradley Kieser proposed, but factors out
some invariant code. So, token times are updated one at a time (not using
an IN operator). In effect, this is in production now (for those of us
running the trunk code).

The old routine (sub tok_touch_all_old) is still there in this module
(Mail::SpamAssassin::BayesStore::PgSQL), so it is possible to test each,
just by renaming tok_touch_all to something like tok_touch_all_6444OFF
and tok_touch_all_old to tok_touch_all.

Actually this is what I'm doing here, i.e. using tok_touch_all_old.
It is faster here for some reason. I will attach a diagram to illustrate it.

Using PostgreSQL 8.3.14 on FreeBSD, with 230,000 records in table bayes_token:

\d bayes_token
          Table "public.bayes_token"
   Column   |  Type   |          Modifiers
------------+---------+----------------------------
 id         | integer | not null default 0
 token      | bytea   | not null default ''::bytea
 spam_count | integer | not null default 0
 ham_count  | integer | not null default 0
 atime      | integer | not null default 0
Indexes:
    "bayes_token_pkey" PRIMARY KEY, btree (id, token)
    "bayes_token_idx1" btree (token)

The feature:
  bayes_auto_learn_on_error 1
is of great help here: it keeps the growth of the number of tokens
in a database manageable.

The attached diagram shows the b_tok_touch_all elapsed time for the last
couple of hours on our production mailer (as measured by SpamAssassin and
logged by amavisd-new in its TIMING-SA syslog entries, or by enabling
a timing debug in spamd). At 18h I switched the code back to what is in
trunk now (i.e. it has been using tok_touch_all since 18h, and was using
tok_touch_all_old earlier). It clearly shows a substantial increase in
elapsed time, although in absolute terms the change is still small
(perhaps 20 ms) compared to the total elapsed time in SpamAssassin.

> With that, is there any additional testing I may be able to assist with?
> Unfortunately, outside of production I have somewhat small datasets to work
> with which may not help with the performance benchmarking needed.

It would certainly help to get an independent measurement report.
If it turns out that different sites (or different versions of PostgreSQL?)
see one or the other code variant as faster, it should be quite easy to add
a configuration setting to choose one or the other.
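
For anyone wanting to experiment with the two update strategies discussed above outside of SpamAssassin, here is a minimal sketch in Python. It is not the code from Mail::SpamAssassin::BayesStore::PgSQL: it uses an in-memory SQLite database as a stand-in for PostgreSQL, the function names touch_one_at_a_time and touch_with_in are hypothetical, and the exact SQL (including the "atime < ?" guard) is an assumption made for illustration. It only contrasts one UPDATE per token with a single UPDATE using an IN operator.

```python
import sqlite3


def make_db(tokens, atime=100, user_id=1):
    """Create an in-memory table shaped like bayes_token (simplified)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bayes_token ("
        " id INTEGER NOT NULL DEFAULT 0,"
        " token BLOB NOT NULL,"
        " spam_count INTEGER NOT NULL DEFAULT 0,"
        " ham_count INTEGER NOT NULL DEFAULT 0,"
        " atime INTEGER NOT NULL DEFAULT 0,"
        " PRIMARY KEY (id, token))"
    )
    conn.executemany(
        "INSERT INTO bayes_token (id, token, atime) VALUES (?, ?, ?)",
        [(user_id, tok, atime) for tok in tokens],
    )
    conn.commit()
    return conn


def touch_one_at_a_time(conn, user_id, tokens, atime):
    """One UPDATE per token (assumed shape of the per-token variant)."""
    cur = conn.cursor()
    updated = 0
    for tok in tokens:
        cur.execute(
            "UPDATE bayes_token SET atime = ?"
            " WHERE id = ? AND token = ? AND atime < ?",
            (atime, user_id, tok, atime),
        )
        updated += cur.rowcount
    conn.commit()
    return updated


def touch_with_in(conn, user_id, tokens, atime):
    """A single UPDATE listing all tokens in an IN operator."""
    if not tokens:
        return 0
    placeholders = ",".join("?" * len(tokens))
    cur = conn.execute(
        "UPDATE bayes_token SET atime = ?"
        f" WHERE id = ? AND token IN ({placeholders}) AND atime < ?",
        (atime, user_id, *tokens, atime),
    )
    conn.commit()
    return cur.rowcount


def snapshot(conn):
    """Return (token, atime) rows in a stable order for comparison."""
    return conn.execute(
        "SELECT token, atime FROM bayes_token ORDER BY token"
    ).fetchall()
```

Both functions return the number of rows whose atime was actually changed, so the two variants can be checked against each other before timing them (for example with time.perf_counter). Timings taken on SQLite say nothing about PostgreSQL; a meaningful comparison needs the real module and a production-sized bayes_token table.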
