https://issues.apache.org/SpamAssassin/show_bug.cgi?id=4400
--- Comment #17 from Mark Martinec <[email protected]> 2010-06-14 14:22:54 EDT ---

> [...] smallish set of 5000 tokens that accumulated over [...]

Sorry, the figure was 'messages', not 'tokens'. The rest stands.

Now that our bayes database has grown to 10,000 learned messages and 200,000 tokens, I repeated the measurements, switching between the original and the hereby suggested index scheme on table bayes_token, and back. I observed elapsed times in milliseconds for the tok_get_all (read) and tok_touch_all (update) operations, which together account for all SQL I/O in the Bayes plugin. The rest is tokenization and computing probabilities, both of which are just Perl processing with no I/O. Messages were just our regular mail traffic. Results were plotted as a scattergram of elapsed ms vs. time of day.

I must say that the change hardly makes any difference. It is interesting that both the tok_get_all times and the tok_touch_all times are multimodal, i.e. the elapsed times are grouped in two or three regions. The tok_get_all clusters are roughly at 8, 25 and 50 milliseconds, while the tok_touch_all times are near 5 and 35 ms. Switching the index scheme affects only the upper cluster, and only by very little: 1 or 2 ms out of 35 ms, and perhaps 4 ms out of 50 ms. The suggested index scheme saves a little when updating and loses a little when reading (select). Compared to the total time spent in bayes processing, the change is insignificant.

It is interesting that the ratio of time spent in SQL I/O to the total time spent in bayes falls almost entirely in the 45% .. 55% range, i.e. about half of the time is due to SQL, and the other half is spent in tokenization and computing the probability.
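To put the quoted figures in proportion, the short Python sketch below expresses each change as a percentage of its upper cluster. It uses only the numbers given above; the helper name `relative_change` is made up for illustration, and the last step is a rough upper bound that assumes the 55% SQL share applies to the whole Bayes run.

```python
def relative_change(delta_ms: float, cluster_ms: float) -> float:
    """Return a change in elapsed time as a percentage of the cluster time."""
    return 100.0 * delta_ms / cluster_ms


# Figures quoted in the comment, upper clusters only (milliseconds).
touch_low = relative_change(1, 35)   # tok_touch_all, smaller estimate
touch_high = relative_change(2, 35)  # tok_touch_all, larger estimate
get_change = relative_change(4, 50)  # tok_get_all

# Assumption: SQL I/O is at most 55% of total Bayes time, so a change in one
# SQL operation cannot exceed this fraction of the total.
sql_share_max = 0.55
total_upper_bound = get_change * sql_share_max

print(f"tok_touch_all: {touch_low:.1f}% to {touch_high:.1f}% of 35 ms")
print(f"tok_get_all:   {get_change:.1f}% of 50 ms")
print(f"bound on total Bayes time: {total_upper_bound:.1f}%")
```

Even in the slowest cluster the change stays in the single-digit percent range of one SQL operation, which fits the observation that it is not noticeable overall.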
In summary: dropping the unnecessary index and swapping the fields of the primary key (as suggested here) can save the SQL server some unnecessary work and some space, but makes no difference to SpamAssassin performance, at least in my case of a 200,000-token database and a PostgreSQL 8.3.11 server.

Since the suggested change is just to a documentation file (sql/bayes_pg.sql), my choice would still be to go for it; it can't hurt. Other benchmarking on a larger database is welcome...
