Bayes rules taking minutes - solved by moving to innodb?

2006-03-23 Thread Andrew Donkin

Hi, people.  This started as a plea for help but ended as a report of
an investigation, so hopefully it will be a useful addition to the
archives.

About 1% of my scans were taking more than 300 seconds.  Extra
debugging in spamd showed me that the Bayes checks were the culprit:

13:38:05 spamd[16852]: slow: run_eval_tests BAYES_40 took 773 seconds 
13:45:18 spamd[16852]: slow: run_eval_tests BAYES_80 took 427 seconds 
13:45:20 spamd[16852]: slow: do_body_eval_tests(0) took 1212 seconds 

I am using per-user Bayes (on the recommendation of half this list,
and against the recommendation of the other half :-), and perform
about 100,000 scans per day.  Bayes_seen was ~ 150M, and bayes_token ~
1.5G.  The bayes_token index was 4.7G.

MySQL's slow query log showed that the queries did not take long to
execute after they achieved a lock, but I suspected they were not
getting their locks in reasonable time:

 mysql> SHOW STATUS LIKE 'Table%';
 +-----------------------+--------+
 | Variable_name         | Value  |
 +-----------------------+--------+
 | Table_locks_immediate | 171036 |
 | Table_locks_waited    | 220999 |
 +-----------------------+--------+

In a healthy database, table_locks_waited is a small fraction of
table_locks_immediate.
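
For reference, the slow query log mentioned above is enabled in
my.cnf along these lines.  The option names are as in MySQL 4.x/5.0,
so check them against your version; the threshold here is only an
example:

 [mysqld]
 log-slow-queries = /var/log/mysql-slow.log
 long_query_time  = 5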

I turned off bayes_auto_expire in case the expiry runs were causing
the contention, but it made no difference.  I do need
bayes_auto_expire turned on, though: as we've discussed here before,
there is no other way to expire tokens for every user in an SQL Bayes
database.
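
(The test itself is a single local.cf line; the default is 1:)

 bayes_auto_expire 0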

Well, I started this email a week ago and now I've found that at peak
times, "SHOW PROCESSLIST" shows many threads -- like 100 -- locked on
"SELECT FROM bayes_token" and "INSERT INTO bayes_token".

So I tried to convert bayes_token to InnoDB to take advantage of its
row-level locking (this is advised by the developers but not reflected
in bayes_mysql.sql).  After MySQL worked on that for a few days I
stopped it, dropped the database (innodb was very confused), and
recreated the database and all tables using innodb and two-byte IDs.
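
If you try the same conversion, a bulk copy may go faster than the
straight ALTER I aborted.  A sketch, assuming MySQL 4.1 or later,
with spamd stopped so nothing writes to the table mid-copy:

 CREATE TABLE bayes_token_new LIKE bayes_token;
 ALTER TABLE bayes_token_new ENGINE=InnoDB;
 INSERT INTO bayes_token_new SELECT * FROM bayes_token;
 RENAME TABLE bayes_token TO bayes_token_old,
              bayes_token_new TO bayes_token;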

It's early days, with only 7.6M tokens seen and few accounts over the
activation mark of 200 ham.  But I'm hoping my timeout problems are
over.

So my advice is:

 SHOW STATUS LIKE 'Table%';
 SHOW PROCESSLIST;
 ALTER TABLE bayes_token ENGINE=InnoDB;
 ALTER TABLE bayes_token MODIFY id SMALLINT UNSIGNED NOT NULL,
                         MODIFY spam_count SMALLINT UNSIGNED NOT NULL,
                         MODIFY ham_count SMALLINT UNSIGNED NOT NULL;
 ALTER TABLE bayes_expire MODIFY id SMALLINT UNSIGNED NOT NULL;
 ALTER TABLE bayes_seen MODIFY id SMALLINT UNSIGNED NOT NULL;
 ALTER TABLE bayes_vars MODIFY id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT;

-- 
_____
Andrew Donkin  Waikato University, Hamilton,  New Zealand


Re: SQL Bayes - MyISAM locks a problem?

2006-03-21 Thread Andrew Donkin

Duane Hill has:

> per-user [...] just over 10 gig [...] InnoDB [...]
> http://wiki.apache.org/spamassassin/DBIPlugin [...] bayes_vars table
> has 14,102 rows

Jason Frisvold:

> I'll have to give innodb a try..  :)  Thanks for the tip...

Jason, if you haven't moved to innodb already, try "SHOW PROCESSLIST"
in mysql.  Do you have many threads locked on "SELECT FROM
bayes_token" and "INSERT INTO bayes_token"?

I had about 100 threads locked, so I am changing to InnoDB for its
fine-grained locking.  About three days ago I issued "ALTER TABLE
bayes_token ENGINE innodb".  I'll let you know when it finishes.

-- 
_________
Andrew Donkin  Waikato University, Hamilton,  New Zealand


Re: prefork: server reached --max-clients setting, consider raising it messages

2006-03-06 Thread Andrew Donkin

> After upgrading to 3.1 from 3.0 we are starting to see the following
> error messages in our logs "prefork: server reached --max-clients
> setting, consider raising it"

Short version: try --round-robin on spamd.

We scan about 100k messages a day balanced (with -H) between two spamd
hosts.  Traffic is bursty, and during the bursts a lot of spam leaks
through unchecked because spamc reaches its 120s timeout.

The really annoying thing is that a spamd child would continue to chew
on its message for a further few hundred seconds before classifying
it, only to find that spamc had already given up.  That child could
have been working for another spamc.  I wonder if there is a way for
spamd to catch SIGPIPE or some other message from the client, and
abort.

So I added --round-robin and things improved markedly.  The logging
isn't nearly so good ("grep prefork:" without --round-robin draws you
a great load histogram) but far less spam is leaking through.

One theory is that spamd doesn't spawn children quickly enough to cope
with rapidly-ramping load.  I was thinking of ripping out spamd's
one-new-child-per-second throttling to see if it improved matters, but
that experiment is way down the task list now.

Try --round-robin.  Scale it up until your spamd hosts are maximising
the use of their RAM.
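
For the record, the sort of invocation I mean is below; the child
count is only an illustration, so size it to your RAM:

 spamd --daemonize --round-robin --max-children 60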

Note that your spamd hosts should be similarly capable - spamc will
split the load evenly between all of them, even when all children are
busy on one.

-- 
_________
Andrew Donkin  Waikato University, Hamilton,  New Zealand


Re: Question on long scan times - 4500 seconds with a blip at 300.

2006-02-19 Thread Andrew Donkin

Kevin Gagel:

>> I'm finding that scans are taking as long as 798 seconds

Daryl O'Shea:

> Scan times of 798 seconds are probably a result of a bayes expiry.  If
> auto expiry is enabled (default) I'd disable it and run a manually
> expiry as a cron job.

This is a tally of SA scan times at my site from January 14th, rounded
down to the nearest 100 seconds, ignoring those below 1000 seconds:

 seconds  count
    1000     30
    1100     41
    1200     14
    1300      5
    3000      3
    3100     35
    3200     11
    4400      1
    4500      3

Those are combined figures from two load-balanced spamd hosts.  Both
hosts exhibited similar behaviour, so the problem is caused by
something they share: a flood of incoming email, MySQL Bayes, or DNS
lookups.  I doubt it's Bayes because I have per-user stats, which
makes the expiry runs rare and quick.  And the RBL timeout is 10
seconds.

More interestingly, here are the times and counts for all the scan
times under 1000 seconds on that same day:

 seconds  count
 100-140     10
 300-310    128
 950-980      5

So what is it about the five-minute mark that is so attractive to SA?

I'm going to start putting "gosh this is taking a long time" debugging
into spamd, but I would love to hear from anyone who has any ideas
about it.

Thanks in advance!

-- 
_____
Andrew Donkin  Waikato University, Hamilton,  New Zealand


OT: what greylisting system do you use?

2006-02-15 Thread Andrew Donkin

Ladies and gents, I'm about to start some gentle investigations into
greylisting, but I need some advice on which policy server to use.
It's off topic, so if you choose to reply privately I'll summarise.

I'm asking here because I think you're closer to the leading edge of
spam control than the postfix-users list, and I'm sure that asking
there won't receive an objective response.  Actually asking here
probably won't either but at least it will be varied :-)

It will be working with Postfix on (sadly) multiple MX hosts for a
couple of hundred thousand messages per day.

I see two contenders:

 Policy daemon, http://policyd.sourceforge.net/download.html
 SQLgrey, http://sqlgrey.sourceforge.net/

I would love to hear from anyone who has experience with these
packages, positive or negative.  Thanks in advance!


-- 
Andrew


Patch to spamc for debugging w.r.t. spamd

2006-02-09 Thread Andrew Donkin

While working on scaling problems I needed to match spamc's debugging
output with spamd's.  That is laborious when you've got multiple MX
hosts connecting to multiple spamd hosts.

This patch to spamc.c reveals the port number from which spamc
connects to spamd.  Spamd already logs the same, so matching them up
is a doddle.

The patch isn't portable, and you'll need a fair bit of fuzz to get it
to apply.  But I expect you will already have worked on spamc.c a fair
bit to get it to compile with DO_CONNECT_DEBUG_SYSLOGS, so muscling
this patch in won't be a challenge to you.  I felt it should be in the
archive.

@@ -438,8 +440,18 @@
     }
     else {
 #ifdef DO_CONNECT_DEBUG_SYSLOGS
-       libspamc_log(tp->flags, DEBUG_LEVEL,
+       struct sockaddr_in namebuf;
+       socklen_t namesize = sizeof(namebuf);
+
+       memset(&namebuf, 0, sizeof(namebuf));
+       if (getsockname(mysock, (struct sockaddr *) &namebuf, &namesize)) {
+           libspamc_log(tp->flags, LOG_DEBUG,
                "dbg: connect(AF_INET) to spamd at %s done", ipaddr);
+       } else {
+           libspamc_log(tp->flags, LOG_DEBUG,
+               "dbg: connect(AF_INET) to spamd at %s done, local port %d",
+               ipaddr, ntohs(namebuf.sin_port));
+       }
 #endif
        *sockptr = mysock;


Re: Spamd Child Forks, and scaling problems

2006-02-02 Thread Andrew Donkin

Ray wrote:

>> spamd: handled cleanup of child pid 19888 due to SIGCHLD
>> spamd[5262]: prefork: child states: II

Ray, the one you want to look for is "prefork: server reached
--max-clients setting, consider raising it".  If you see that, you may
want to look at your free RAM and, well, consider using
--max-CHILDREN.  There have been posts here about how to do that
properly (to summarise: gradually).

Mike replied:

> It's log noise. Nothing to be alarmed about.

The signal notification could be relegated to a debug level, but the
second of those lines can be useful when diagnosing spamd's problems.
Example (one letter per child: I is idle, B is busy):

20:39:53 prefork: child states:  
20:39:59 prefork: child states: IBBI 
20:40:06 prefork: child states:  
20:40:07 prefork: child states: BBIBB 
20:40:09 prefork: child states: BBBIBB 
20:40:10 prefork: child states: IBBBIBB 
20:40:13 prefork: child states: BIBBBIBB 
20:40:15 prefork: child states: IBIBIBBBIIIBI 
20:40:18 prefork: child states: BBBIBIBIBBIIIB 
20:40:22 prefork: child states: BBB 
20:40:31 prefork: child states: BBBIBBIB 
20:40:36 prefork: child states: BBBIBBIB 

This continues, until (columns removed to reduce line length):

20:51:41  
20:51:43 BIBBB 
20:51:43 BB 
20:51:43 BBB 
20:51:44 IBBB 
20:51:45 B 
20:51:47 BB 
20:51:49 BBB 
20:51:50 BIBBIBIB 
20:51:52 BIIIBBBIBIBBB 

While this is going on, spamc is timing out and letting email through
unchecked.  spamd goes to some effort to not panic-spawn children, so
I have circumvented it by reverting to the old algorithm via
--round-robin.

-- 
_________
Andrew Donkin  Waikato University, Hamilton,  New Zealand


Re: Gain an extra 25%! (was "Purging the Spamassassin Database")

2006-01-16 Thread Andrew Donkin

> ALTER TABLE bayes_vars   MODIFY id SMALLINT UNSIGNED NOT NULL;

Whoops - doing that will prevent any new users from being created in
your Bayes database.  Add auto_increment:

ALTER TABLE bayes_vars MODIFY id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT;

-- 
_____
Andrew Donkin  Waikato University, Hamilton,  New Zealand


Gain an extra 25%! (was "Purging the Spamassassin Database")

2006-01-15 Thread Andrew Donkin
Jason Frisvold  writes:

> I've been investigating some recent slowness issues with our mail
> servers and I noticed that the spamassassin database is getting rather
> large.  We process approximately 300,000 mails a day (or more).  The
> bayes_token database is over 1.8 Gig at the moment. (Actually, 1.8 Gig
> for the data, and 1.3 Gig for the index)

If you have fewer than 65,000 accounts you could halve the size of the
id columns.

Those folks *not* storing per-user Bayes statistics -- i.e. with only
one line in bayes_vars -- could take the id column down to one byte, a
TINYINT.

  ALTER TABLE bayes_token  MODIFY id SMALLINT UNSIGNED NOT NULL;
  ALTER TABLE bayes_expire MODIFY id SMALLINT UNSIGNED NOT NULL;
  ALTER TABLE bayes_seen   MODIFY id SMALLINT UNSIGNED NOT NULL;
  ALTER TABLE bayes_vars   MODIFY id SMALLINT UNSIGNED NOT NULL;

The last three won't recover much space, but id columns that are
compared across tables should be kept to the same type and size.
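
Before shrinking, confirm the highest id fits; if this returns
anything near 65535 (the SMALLINT UNSIGNED ceiling), don't do it:

  SELECT MAX(id) FROM bayes_vars;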

Also, try the following.  Since you are keeping Bayes per user, your
spam_count and ham_count probably never get very high, so you can
take those down to SMALLINTs as well.  Check the maxima first:

  SELECT MAX(ham_count), MAX(spam_count) FROM bayes_token;

  ALTER TABLE bayes_token MODIFY spam_count SMALLINT UNSIGNED NOT NULL,
  MODIFY ham_count SMALLINT UNSIGNED NOT NULL;

That took 25% off my bayes_token data file and index.  Since MySQL
tries to keep its indexes cached in memory, I think it was worthwhile.


-- 
Ard


Making the hashcash plugin more useful

2006-01-12 Thread Andrew Donkin

A while ago the hashcash list was lamenting the lack of support in
transfer agents.  They are stuck in a chicken and egg situation where
no user agents will add hashcash headers because no spam checkers are
looking for them.

I think we could bootstrap that process very easily with a simple
addition to SA.

Configuring my own SA setup to benefit from hashcash headers was as
simple as adding "hashcash_accept [EMAIL PROTECTED]".  Voila, instant
potential benefit.  Zero actual benefit, since nobody is adding
hashcash headers to their outgoing email, but that's a whole new
crusade.

I think SA distributions should contain a comment block in
"rules/local.cf" along the lines of:

#   If you use the Hashcash plugin, uncomment this and change it
#   to suit your domain:
#
# hashcash_accept [EMAIL PROTECTED]

Better still, the Hashcash plugin could determine the domain
algorithmically, and use it in the absence of any other
hashcash_accept lines.  Schemes for doing that based on the hostname
are all half-baked and non-portable, involving trimming components off
the host's DNS domain.  However, just about anything is better than
nothing.

How about this: use the last three components of the DNS domain if the
last component has two letters, otherwise use the last two components.
Prefix that with %u@(?:.*\.)?
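
Here is a sketch of that heuristic in Perl, outside the plugin.
Net::Domain's hostdomain() is only one way to get the local DNS
domain, and everything else here is illustrative, not plugin code:

 use Net::Domain qw(hostdomain);

 # Keep three name components when the TLD is two letters
 # (waikato.ac.nz), otherwise keep two (example.com).
 my @parts = split /\./, hostdomain();
 my $keep  = (length($parts[-1]) == 2) ? 3 : 2;
 splice(@parts, 0, @parts - $keep) if @parts > $keep;

 # Escape the dots and prefix as described above.
 my $domain = quotemeta(join '.', @parts);
 print "%u\@(?:.*\\.)?$domain\n";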

At worst, it will cause the plugin to accept hashcash payments
computed for an address with the same username at a different domain.
Hashcash's double-spend protection will prevent the same payment
working twice, so I don't see this as much of a loophole.  Spammers
still won't have enough CPU.

What do you think?  At the very least, I would like to see a comment
in local.cf or init.pre; there is no real argument against it.  It is
a shame to see thousands of SA installations a tiny, tiny step away
from taking advantage of hashcash.


Minor points, for Perl programmers and pedants:

- I actually use [EMAIL PROTECTED].  I think the hashcash plugin
  should anchor the pattern at both ends.

- More paranoid sites, or those inside .com which is easily polluted,
  should use "^%u@(?:.*\.)?theirdomain.com$" or just
  "[EMAIL PROTECTED]"

-- 
_____
Andrew Donkin  Waikato University, Hamilton,  New Zealand


Scaling SA for 100k/day: (was Purging the Spamassassin Database)

2006-01-12 Thread Andrew Donkin

> I've been investigating some recent slowness issues with our mail
> servers and I noticed that the spamassassin database is getting
> rather large.  We process approximately 300,000 mails a day (or
> more).

We do only a third of that, Jason, but I'm still having problems with
capacity.  I have filled spamc with debugging to help me figure out
where the problem lies, and I think it has come down to database
contention, or slow DNS, DCC, and Razor lookups.  Whatever it is, I'm
running out of spamd children.

I have two boxes running spamd --max-children 60, and in bursty times
every child is busy and spam leaks through unchecked.  We receive far
more "leaked" spam than false negatives:  5,000 out of 171,000
attempted in the past 48 hours.

Could Jason, and others on the list who handle a large amount of
email, report back on their setups?  It might be quite a useful
resource to have in the archives.  I don't think it has been covered
on this list before, but please set me straight if it has.

In particular I am interested in:

- how many boxes running spamd?
- how many spamd children per box (spamd --max-children)
- if Bayes is SQL, is it on the same or separate server as spamd, and
  are you replicating it to balance read load?
- spamc timeout (spamc -t)
- rbl_timeout
- bayes_expiry_max_db_size
- bayes_journal_max_size

Until this morning, when something calamitous happened to it, I was
using Berkeley DB for Bayes.  That is why I am interested in the load on
MySQL, since my database is on a separate box and is already handling
the per-user configs (one select per message) and the statistics (one
update per message).

With autolearning on, and the default bayes_journal_max_size, the
journal filled and was flushed every couple of minutes.  Approximately
how often should the journal flush itself?  Is there any harm in
having it happen every few minutes, or should I tune it up to an hour
or so?
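
For reference, the knob in question is a local.cf setting; if I
remember right the 3.1 default is 102400 bytes, so a ten-fold
increase would look like this:

 bayes_journal_max_size 1024000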

> The bayes_token database is over 1.8 Gig at the moment. (Actually,
> 1.8 Gig for the data, and 1.3 Gig for the index)

Yikes.  That is the kind of thing I need to avoid!

Many, many thanks in advance.

-- 
_________
Andrew Donkin  Waikato University, Hamilton,  New Zealand