> I've tried all sorts of things (latest versions, upping shared mem,
> etc etc), nothing seems to work, most changes seem to make it worse.
> I can only imagine that I'm missing something really obvious. The
> benchmark script I use to test changes take ~2mins to learn 2000 mails
> for DBM, ~6mins for MySQL and I finally stopped the PostgreSQL test
> last night when it had run for 6 hrs and still hadn't processed even
> 1000 mails.
I can duplicate that to some extent: time(1) reports on some occasions
that it takes 5 minutes to learn 70 emails, and other times those same
messages take 12 minutes (measuring wall-clock time). In both cases,
system time is closer to 2 minutes, which roughly corresponds to the
amount of time reported by the PostgreSQL server (via syslog, logging
elapsed query time from PostgreSQL).

I tried running sa-learn through the Perl profiler, but the output of
dprofpp was garbled and I couldn't get much useful information from it.
(I've never used it before -- maybe I'm doing it wrong.)

Looking at the SQL that is generated, I have two immediate thoughts:

1) Why is the token column defined as char(200) if we're only using the
lowest 40 bits of the SHA1 as the token in the database? When I reduced
the column to match the size of the field used for MySQL, my processing
time dropped significantly: what took anywhere from 5-12 minutes now
takes about 1.75 minutes.

2) Why is token a char instead of a bytea? I hacked SQL.pm to record
the actual SQL being executed, and it looks like binary values are
being inserted.

http://www.php-editors.com/postgres_manual/p_datatype-binary.html

I'll poke around with bytea next.

> I'm interested in your test failures, if you are still experiencing
> them can you please open a bug (http://bugzilla.spamassassin.org/) and
> include the output from a t/bayessql.t run along with your setup.

I'll try that.

-ron
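For anyone following along, here is a rough Python sketch of point 1 --
why char(200) is oversized and why the values look binary. It assumes
the token is the low-order 5 bytes (40 bits) of the word's SHA-1
digest; which end of the digest SpamAssassin actually keeps, and the
function name bayes_token, are my assumptions, not taken from SQL.pm:

```python
import hashlib

def bayes_token(word):
    # Assumption: keep the low-order 40 bits (last 5 bytes) of the
    # SHA-1 digest, as the discussion above describes.
    return hashlib.sha1(word.encode("utf-8")).digest()[-5:]

tok = bayes_token("viagra")
print(len(tok))    # always 5 bytes -- a char(200) column is 40x wider
print(tok.hex())   # raw bytes, not text: can hold NULs etc., hence bytea
```

Since the 5-byte value is an arbitrary slice of a digest, it can contain
any byte value, which is exactly the kind of data PostgreSQL's bytea
type (linked above) is meant for, rather than a character type.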
