Bayes rules taking minutes - solved by moving to innodb?
Hi, people.  This started as a plea for help but ended as a report of
an investigation, so hopefully it will be a useful addition to the
archives.

About 1% of my scans were taking more than 300 seconds.  Extra
debugging in spamd showed me that the Bayes checks were the culprit:

  13:38:05 spamd[16852]: slow: run_eval_tests BAYES_40 took 773 seconds
  13:45:18 spamd[16852]: slow: run_eval_tests BAYES_80 took 427 seconds
  13:45:20 spamd[16852]: slow: do_body_eval_tests(0) took 1212 seconds

I am using per-user Bayes (on the recommendation of half this list,
and against the recommendation of the other half :-), and perform
about 100,000 scans per day.  bayes_seen was ~150M and bayes_token
~1.5G; the bayes_token index was 4.7G.

MySQL's slow query log showed that the queries did not take long to
execute once they had acquired a lock, but I suspected they were not
getting their locks in reasonable time:

  mysql> SHOW STATUS LIKE 'Table%';
  +-----------------------+--------+
  | Variable_name         | Value  |
  +-----------------------+--------+
  | Table_locks_immediate | 171036 |
  | Table_locks_waited    | 220999 |
  +-----------------------+--------+

In a healthy database, Table_locks_waited is a small fraction of
Table_locks_immediate.

I turned off bayes_auto_expire in case it was the expiry causing the
contention, but saw no change.  I need bayes_auto_expire turned on
because, as we've discussed before, there is no other way to perform
expiration for every user in an SQL Bayes database.

Well, I started this email a week ago, and since then I've found that
at peak times "SHOW PROCESSLIST" shows many threads -- like 100 --
locked on "SELECT FROM bayes_token" and "INSERT INTO bayes_token".
So I tried to convert bayes_token to InnoDB to take advantage of its
row-level locking (this is advised by the developers but not
reflected in bayes_mysql.sql).  After MySQL had worked on that for a
few days I stopped it, dropped the database (InnoDB was very
confused), and recreated the database and all tables using InnoDB and
two-byte IDs.
It's early days, with only 7.6M tokens seen and few accounts over the
activation mark of 200 ham, but I'm hoping my timeout problems are
over.  So my advice is:

  SHOW STATUS LIKE 'Table%';
  SHOW PROCESSLIST;

Change to InnoDB, then:

  ALTER TABLE bayes_token
    MODIFY id SMALLINT UNSIGNED NOT NULL,
    MODIFY spam_count SMALLINT UNSIGNED NOT NULL,
    MODIFY ham_count SMALLINT UNSIGNED NOT NULL;
  ALTER TABLE bayes_expire MODIFY id SMALLINT UNSIGNED NOT NULL;
  ALTER TABLE bayes_seen   MODIFY id SMALLINT UNSIGNED NOT NULL;
  ALTER TABLE bayes_vars   MODIFY id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT;

-- 
Andrew Donkin                  Waikato University, Hamilton, New Zealand
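To put a number on the contention, the two counters from SHOW STATUS can be turned into a wait ratio. A rough sketch in Python, using the values quoted above; the "few percent" threshold is my own rule of thumb, not anything from the MySQL docs:

```python
# Values copied from the SHOW STATUS output earlier in this post.
immediate = 171036   # locks granted without waiting
waited = 220999      # locks that had to wait

# Fraction of table-lock requests that blocked.  On a healthy MyISAM
# database this should be a few percent at most; here more than half
# of all lock requests had to wait.
wait_ratio = waited / (immediate + waited)
print(f"{wait_ratio:.0%} of table lock requests had to wait")
```

With these numbers the ratio comes out around 56%, which is why the queries looked fast in the slow query log (execution time is counted after the lock is acquired) while the scans themselves crawled.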
Re: SQL Bayes - MyISAM locks a problem?
Duane Hill wrote:
> per-user [...] just over 10 gig [...] InnoDB [...]
> http://wiki.apache.org/spamassassin/DBIPlugin [...] bayes_vars table
> has 14,102 rows

Jason Frisvold wrote:
> I'll have to give innodb a try.. :)  Thanks for the tip...

Jason, if you haven't moved to InnoDB already, try "SHOW PROCESSLIST"
in mysql.  Do you have many threads locked on "SELECT FROM
bayes_token" and "INSERT INTO bayes_token"?  I had about 100 threads
locked, so I am changing to InnoDB for its fine-grained locking.

About three days ago I issued "ALTER TABLE bayes_token ENGINE innodb".
I'll let you know when it finishes.

-- 
Andrew Donkin                  Waikato University, Hamilton, New Zealand
Re: prefork: server reached --max-clients setting, consider raising it messages
> After upgrading to 3.1 from 3.0 we are starting to see the following
> error messages in our logs: "prefork: server reached --max-clients
> setting, consider raising it"

Short version: try --round-robin on spamd.

We scan about 100k messages a day balanced (with -H) between two
spamd hosts.  Traffic is bursty, and during the bursts a lot of spam
leaks through unchecked because spamc reaches its 120s timeout.

The really annoying thing is that a spamd child will continue to chew
on its message for a further few hundred seconds before classifying
it, only to find that spamc had already given up.  That child could
have been working for another spamc.  I wonder if there is a way for
spamd to catch SIGPIPE or some other notification from the client,
and abort.

So I added --round-robin and things improved markedly.  The logging
isn't nearly so good ("grep prefork:" without --round-robin draws you
a great load histogram) but far less spam is leaking through.

One theory is that spamd doesn't spawn children quickly enough to
cope with rapidly-ramping load.  I was thinking of ripping out
spamd's one-new-child-per-second throttling to see if it improved
matters, but that experiment is way down the task list now.

Try --round-robin.  Scale it up until your spamd hosts are maximising
the use of their RAM.  Note that your spamd hosts should be similarly
capable: spamc will split the load evenly between all of them, even
when all children are busy on one.

-- 
Andrew Donkin                  Waikato University, Hamilton, New Zealand
Re: Question on long scan times - 4500 seconds with a blip at 300.
Kevin Gagel wrote:
>> I'm finding that scans are taking as long as 798 seconds

Daryl O'Shea wrote:
> Scan times of 798 seconds are probably a result of a bayes expiry.
> If auto expiry is enabled (default) I'd disable it and run a manual
> expiry as a cron job.

This is a tally of SA scan times at my site from January 14th,
rounded down to the nearest 100 seconds, ignoring those below 1000
seconds:

  1000   30
  1100   41
  1200   14
  1300    5
  3000    3
  3100   35
  3200   11
  4400    1
  4500    3

Those are combined figures from two load-balanced spamd hosts.  Both
hosts exhibited similar behaviour, so the problem is caused by
something they share: a flood of incoming email, MySQL Bayes, or DNS
lookups.  I doubt it's Bayes because I have per-user stats, which
makes the expiry runs rare and quick.  And the RBL timeout is 10
seconds.

More interestingly, here are the buckets and counts for all the scan
times under 1000 seconds on that same day:

  100-140    10
  300-310   128
  950-980     5

So what is it about the five-minute mark that is so attractive to SA?
I'm going to start putting "gosh this is taking a long time"
debugging into spamd, but I would love to hear from anyone who has
any ideas about it.  Thanks in advance!

-- 
Andrew Donkin                  Waikato University, Hamilton, New Zealand
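A tally like the one above can be built from the logs with a few lines of Python. This is only a sketch; the durations list stands in for whatever your log parser extracts from spamd's output:

```python
from collections import Counter

# Hypothetical scan durations in seconds, extracted from spamd logs.
durations = [1020, 1150, 1180, 1190, 3105, 3190, 4460, 310, 305, 130]

# Round each duration down to the nearest 100 seconds and tally,
# ignoring scans under 1000 seconds, as in the table above.
tally = Counter(d // 100 * 100 for d in durations if d >= 1000)
for bucket, count in sorted(tally.items()):
    print(bucket, count)
```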
OT: what greylisting system do you use?
Ladies and gents,

I'm about to start some gentle investigations into greylisting, but I
need some advice on which policy server to use.  It's off topic, so
if you choose to reply privately I'll summarise.

I'm asking here because I think you're closer to the leading edge of
spam control than the postfix-users list, and I'm sure that asking
there won't receive an objective response.  Actually, asking here
probably won't either, but at least the responses will be varied :-)

It will be working with Postfix on (sadly) multiple MX hosts for a
couple of hundred thousand messages per day.  I see two contenders:

  Policy daemon:  http://policyd.sourceforge.net/download.html
  SQLgrey:        http://sqlgrey.sourceforge.net/

I would love to hear from anyone who has experience with these
packages, positive or negative.  Thanks in advance!

-- 
Andrew
Patch to spamc for debugging w.r.t. spamd
While working on scaling problems I needed to match spamc's debugging
output with spamd's.  That is laborious when you've got multiple MX
hosts connecting to multiple spamd hosts.

This patch to spamc.c reveals the local port number from which spamc
connects to spamd.  Spamd already logs the same, so matching them up
is a doddle.

The patch isn't portable, and you'll need a fair bit of fuzz to get
it to apply.  But I expect you will already have worked on spamc.c a
fair bit to get it to compile with DO_CONNECT_DEBUG_SYSLOGS, so
muscling this patch in won't be a challenge to you.  I felt it should
be in the archive.

@@ -438,8 +440,18 @@
     } else {
 #ifdef DO_CONNECT_DEBUG_SYSLOGS
-        libspamc_log(tp->flags, DEBUG_LEVEL,
+        struct sockaddr_in namebuf;
+        socklen_t namesize = sizeof(namebuf);
+
+        memset(&namebuf, 0, sizeof(namebuf));
+        if (getsockname(mysock, (struct sockaddr *) &namebuf, &namesize)) {
+            libspamc_log(tp->flags, LOG_DEBUG,
                 "dbg: connect(AF_INET) to spamd at %s done", ipaddr);
+        } else {
+            libspamc_log(tp->flags, LOG_DEBUG,
+                "dbg: connect(AF_INET) to spamd at %s done, local port %d",
+                ipaddr, ntohs(namebuf.sin_port));
+        }
 #endif
         *sockptr = mysock;
Re: Spamd Child Forks, and scaling problems
Ray wrote:
>> spamd: handled cleanup of child pid 19888 due to SIGCHLD
>> spamd[5262]: prefork: child states: II

Ray, the one you want to look for is "prefork: server reached
--max-clients setting, consider raising it".  If you see that, you
may want to look at your free RAM and, well, consider using
--max-CHILDREN.  There have been posts here about how to do that
properly (to summarise: gradually).

Mike replied:
> It's log noise.  Nothing to be alarmed about.

The signal notification could be relegated to a debug level, but the
second of those lines can be useful when diagnosing spamd's problems.
Example:

  20:39:53 prefork: child states:
  20:39:59 prefork: child states: IBBI
  20:40:06 prefork: child states:
  20:40:07 prefork: child states: BBIBB
  20:40:09 prefork: child states: BBBIBB
  20:40:10 prefork: child states: IBBBIBB
  20:40:13 prefork: child states: BIBBBIBB
  20:40:15 prefork: child states: IBIBIBBBIIIBI
  20:40:18 prefork: child states: BBBIBIBIBBIIIB
  20:40:22 prefork: child states: BBB
  20:40:31 prefork: child states: BBBIBBIB
  20:40:36 prefork: child states: BBBIBBIB

This continues, until (columns removed to reduce line length):

  20:51:41
  20:51:43 BIBBB
  20:51:43 BB
  20:51:43 BBB
  20:51:44 IBBB
  20:51:45 B
  20:51:47 BB
  20:51:49 BBB
  20:51:50 BIBBIBIB
  20:51:52 BIIIBBBIBIBBB

While this is going on, spamc is timing out and letting email through
unchecked.  spamd goes to some effort not to panic-spawn children, so
I have circumvented it by reverting to the old algorithm via
--round-robin.

-- 
Andrew Donkin                  Waikato University, Hamilton, New Zealand
Re: Gain an extra 25%! (was "Purging the Spamassassin Database")
> ALTER TABLE bayes_vars MODIFY id SMALLINT UNSIGNED NOT NULL;

Whoops -- doing that will prevent any new users from being created in
your Bayes database.  Add AUTO_INCREMENT:

  ALTER TABLE bayes_vars
    MODIFY id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT;

-- 
Andrew Donkin                  Waikato University, Hamilton, New Zealand
Gain an extra 25%! (was "Purging the Spamassassin Database")
Jason Frisvold writes:
> I've been investigating some recent slowness issues with our mail
> servers and I noticed that the spamassassin database is getting
> rather large.  We process approximately 300,000 mails a day (or
> more).  The bayes_token database is over 1.8 Gig at the moment.
> (Actually, 1.8 Gig for the data, and 1.3 Gig for the index)

If you have fewer than 65,000 accounts you could halve the size of
the id columns.  Those folks *not* storing per-user Bayes statistics
-- i.e. with only one row in bayes_vars -- could take the id column
down to one byte, a TINYINT.

  ALTER TABLE bayes_token  MODIFY id SMALLINT UNSIGNED NOT NULL;
  ALTER TABLE bayes_expire MODIFY id SMALLINT UNSIGNED NOT NULL;
  ALTER TABLE bayes_seen   MODIFY id SMALLINT UNSIGNED NOT NULL;
  ALTER TABLE bayes_vars   MODIFY id SMALLINT UNSIGNED NOT NULL;

The last three won't recover much space, but table indexes should be
kept the same type and size.

Also, try the following.  Since you are keeping Bayes per user, your
spam_count and ham_count probably don't get very high, so you can
take those down to SMALLINTs too:

  SELECT MAX(ham_count), MAX(spam_count) FROM bayes_token;

  ALTER TABLE bayes_token
    MODIFY spam_count SMALLINT UNSIGNED NOT NULL,
    MODIFY ham_count SMALLINT UNSIGNED NOT NULL;

That took 25% off my bayes_token data file and index.  Since MySQL
likes to keep indexes in VM, I think it was worthwhile.

-- Ard
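The arithmetic behind the saving, assuming the stock schema's columns start as 4-byte INTs (MySQL stores an INT in 4 bytes and a SMALLINT in 2); the row count here is a made-up figure for illustration, not Jason's actual table:

```python
# MySQL storage sizes in bytes: INT is 4, SMALLINT is 2.
INT_BYTES, SMALLINT_BYTES = 4, 2

# bayes_token shrinks three columns: id, spam_count and ham_count,
# so each row loses 3 * 2 = 6 bytes of data (the indexes covering
# the id column shrink as well, which is where much of the win is).
saved_per_row = 3 * (INT_BYTES - SMALLINT_BYTES)

rows = 50_000_000  # hypothetical row count for illustration
saved_gib = saved_per_row * rows / 2**30
print(f"{saved_per_row} bytes/row, about {saved_gib:.2f} GiB over {rows:,} rows")
```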
Making the hashcash plugin more useful
A while ago the hashcash list was lamenting the lack of support in
transfer agents.  They are stuck in a chicken-and-egg situation where
no user agents will add hashcash headers because no spam checkers are
looking for them.  I think we could bootstrap that process very
easily with a simple addition to SA.

Configuring my own SA setup to benefit from hashcash headers was as
simple as adding "hashcash_accept [EMAIL PROTECTED]".  Voila, instant
potential benefit.  Zero actual benefit, since nobody is adding
hashcash headers to their outgoing email, but that's a whole new
crusade.

I think SA distributions should contain a comment block in
"rules/local.cf" along the lines of:

  # If you use the Hashcash plugin, uncomment this and change it
  # to suit your domain:
  #
  # hashcash_accept [EMAIL PROTECTED]

Better still, the Hashcash plugin could determine the domain
algorithmically, and use it in the absence of any other
hashcash_accept lines.  Schemes for doing that based on the hostname
are all half-baked and non-portable, involving trimming components
off the host's DNS domain.  However, just about anything is better
than nothing.  How about this: use the last three components of the
DNS domain if the last component has two letters, otherwise use the
last two components.  Prefix that with %u@(?:.*\.)?

At worst, it will cause the plugin to accept hashcash payments
computed for an address with the same username at a different domain.
Hashcash's double-spend protection will prevent the same payment
working twice, so I don't see this as much of a loophole.  Spammers
still won't have enough CPU.

What do you think?  At the very least, I would like to see a comment
in local.cf or init.pre.  There is no argument not to.  It is a shame
to see thousands of SA installations a tiny, tiny step away from
taking advantage of hashcash.

Minor points, for Perl programmers and pedants:

- I actually use "[EMAIL PROTECTED]".  I think the hashcash plugin
  should anchor the pattern at both ends.

- More paranoid sites, or those inside .com which is easily polluted,
  should use "^%u@(?:.*\.)?theirdomain.com$" or just
  "[EMAIL PROTECTED]".

-- 
Andrew Donkin                  Waikato University, Hamilton, New Zealand
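For what it's worth, the heuristic could be sketched like this. This is only an illustration: the plugin itself is Perl, and the function names here are mine, not anything in SpamAssassin:

```python
def default_accept_domain(fqdn):
    # Heuristic from the post: keep the last three components of the
    # DNS domain when the top-level domain has two letters (e.g.
    # waikato.ac.nz), otherwise the last two (e.g. example.com).
    parts = fqdn.rstrip(".").split(".")
    keep = 3 if len(parts[-1]) == 2 else 2
    return ".".join(parts[-keep:])

def default_hashcash_accept(fqdn):
    # Prefix with %u@(?:.*\.)? as proposed, anchored at both ends.
    domain = default_accept_domain(fqdn).replace(".", r"\.")
    return r"^%u@(?:.*\.)?" + domain + "$"

print(default_hashcash_accept("mx1.cs.waikato.ac.nz"))
```

As the post notes, this is half-baked for multi-label registries under three-letter TLDs, but it errs on the side of accepting payments, which double-spend protection makes safe.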
Scaling SA for 100k/day: (was Purging the Spamassassin Database)
> I've been investigating some recent slowness issues with our mail
> servers and I noticed that the spamassassin database is getting
> rather large.  We process approximately 300,000 mails a day (or
> more).

We do only a third of that, Jason, but I'm still having problems with
capacity.  I have filled spamc with debugging to help me figure out
where the problem lies, and I think it comes down to database
contention, or slow DNS, DCC, and Razor lookups.  Whatever it is, I'm
running out of spamd children.  I have two boxes running spamd
--max-children 60, and in bursty times every child is busy and spam
leaks through unchecked.  We receive far more "leaked" spam than
false negatives: 5,000 out of 171,000 attempted in the past 48 hours.

Could Jason, and others on the list who handle a large amount of
email, report back on their setups?  It might be quite a useful
resource to have in the archives.  I don't think it has been covered
on this list before, but please set me straight if it has.  In
particular I am interested in:

- how many boxes running spamd?
- how many spamd children per box (spamd --max-children)?
- if Bayes is SQL, is it on the same server as spamd or a separate
  one, and are you replicating it to balance read load?
- spamc timeout (spamc -t)
- rbl_timeout
- bayes_expiry_max_db_size
- bayes_journal_max_size

Until this morning, when something calamitous happened to it, I was
using Berkeley for Bayes.  That is why I am interested in the load on
MySQL, since my database is on a separate box and is already handling
the per-user configs (one select per message) and the statistics (one
update per message).

With autolearning on, and the default bayes_journal_max_size, the
journal filled and was flushed every couple of minutes.
Approximately how often should the journal flush itself?  Is there
any harm in having it happen every few minutes, or should I tune it
up to an hour or so?

> The bayes_token database is over 1.8 Gig at the moment.  (Actually,
> 1.8 Gig for the data, and 1.3 Gig for the index)

Yikes.  That is the kind of thing I need to avoid!

Many, many thanks in advance.

-- 
Andrew Donkin                  Waikato University, Hamilton, New Zealand