> I've tried all sorts of things (latest versions, upping shared mem,
> etc etc), nothing seems to work, most changes seem to make it worse.
> I can only imagine that I'm missing something really obvious. The
> benchmark script I use to test changes take ~2mins to learn 2000 mails
> for DBM, ~6mins for MySQL and I finally stopped the PostgreSQL test
> last night when it had run for 6 hrs and still hadn't processed even
> 1000 mails.
I can duplicate that to some extent: time(1) reports on some occasions
that it takes 5 minutes to learn 70 emails, and other times those same
messages take 12 minutes (measuring wall-clock time). In both cases,
system time is closer to 2 minutes, which roughly corresponds to the
amount of time reported by the PostgreSQL server (via syslog, logging
elapsed query time from PostgreSQL).

I tried running sa-learn through the Perl profiler, but the output of
dprofpp was garbled and I couldn't get much useful information from it.
(I've never used it before -- maybe I'm doing it wrong.)

Looking at the SQL that is generated, I have two immediate thoughts:

1) Why is the token column defined as char(200) if we're only using the
lowest 40 bits of the SHA1 as the token in the database? When I reduced
the column to match the size of the field used for MySQL, my processing
time dropped significantly: what took anywhere from 5-12 minutes now
takes about 1.75 minutes.

2) Why is token a char instead of a bytea? I hacked SQL.pm to record
the actual SQL being executed, and it looks like binary values are
being inserted.

http://www.php-editors.com/postgres_manual/p_datatype-binary.html

I'll poke around with bytea next.

> I'm interested in your test failures, if you are still experiencing
> them can you please open a bug (http://bugzilla.spamassassin.org/) and
> include the output from a t/bayessql.t run along with your setup.

I'll try that.

-ron
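For anyone following along, here is a rough Python sketch of point 1 --
why char(200) is oversized and why the values look binary. It assumes
the token is the low-order 5 bytes (40 bits) of the word's SHA-1
digest; which end of the digest SpamAssassin actually keeps, and the
function name bayes_token, are my assumptions, not taken from SQL.pm:

```python
import hashlib

def bayes_token(word):
    # Assumption: keep the low-order 40 bits (last 5 bytes) of the
    # SHA-1 digest, as the discussion above describes.
    return hashlib.sha1(word.encode("utf-8")).digest()[-5:]

tok = bayes_token("viagra")
print(len(tok))    # always 5 bytes -- a char(200) column is 40x wider
print(tok.hex())   # raw bytes, not text: can hold NULs etc., hence bytea
```

Since the 5-byte value is an arbitrary slice of a digest, it can contain
any byte value, which is exactly the kind of data PostgreSQL's bytea
type (linked above) is meant for, rather than a character type.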
