I've got a production amavisd-new array of servers handling roughly 16,000,000 messages per day. Users can toggle quarantining. With the quarantines that are enabled and a 2 week retention we have roughly 12 million message in quarantine at any given time. We send a summary email to subscribers based on their preference (daily, weekly, monthly) containing the list of messages they have in quarantine. My problem comes into play generating these summary emails. The SQL server just seems to hack and cough and take too long on the individual queries to keep up any longer. Notably, this is a fairly new server. Data and comments follow:
Server Stats AMD Quad Core Opteron 2.3Ghz 8GB ECC RAM 4 x 300GB WD 10krpm Raptors in RAID-10 array Vital my.conf configuration # Innodb settings innodb_file_per_table innodb_buffer_pool_size = 5120M innodb_log_buffer_size = 4M innodb_flush_log_at_trx_commit = 0 innodb_flush_method = O_DIRECT thread_concurrency = 8 When running the list of status reports, CPU usage hovers around an aggregate 2-3%, and that's with the CPUs clocked down by cpuspeed to 1100mhz. CPU doesn't seem to be an issue. Disk I/O does not seem to be maxxed. Disk I/O seems to hover around 30-50%. Output from iostat -kx 60 for relevant disks: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0.25 8.23 45.09 12.93 617.70 272.89 30.70 0.41 7.09 6.49 37.64 sdb 0.25 8.23 44.98 12.93 617.03 272.89 30.74 0.39 6.79 6.32 36.59 sdc 0.43 8.10 46.46 13.61 641.56 280.95 30.71 0.37 6.17 5.71 34.31 sdd 0.63 8.10 46.46 13.61 635.83 280.95 30.52 0.38 6.40 5.84 35.08 RAM get's pretty well used up by InnoDB, here's details of RAM after the server's been online for about 45 minutes. Mem: 7986040k total, 7636416k used, 349624k free, 73244k buffers Swap: 3951952k total, 164k used, 3951788k free, 1352148k cached That tables I'm querying on look as follows (not exact because innodb doesn't report exact numbers in summary): maddr 9045756 rows, 0.8GB data, 1.8GB index msgrcpt 26639273 rows, 3.6GB data, 2.2GB index msgs 24832832 rows, 10.7GB data, 6.1GB index quarantine 11776200 rows, 93.8GB data, 302MB index The query..... SELECT maddr.email AS email, bspam_level, time_num, from_addr, subject, msgs.mail_id, secret_id FROM amavisd_stats.msgrcpt, amavisd_stats.msgs, amavisd_stats.maddr WHERE (maddr.email = 'j...@smith.com') AND msgrcpt.rid = maddr.id AND msgrcpt.rs != 'R' AND (msgrcpt.ds = 'D' OR msgrcpt.ds = 'R' OR msgrcpt.ds = 'B') AND quar_type = 'Q' AND time_num >= '1278027742' AND msgrcpt.mail_id = msgs.mail_id ORDER BY time_num DESC EXPLAIN QUERY output (I hope this formats OK via email), if not, it can be found at http://locu.net/temp/amavis_explain.txt. I've also appended the output of SHOW ENGINE INNODB STATUS to the URL. +----+-------------+---------+------+-------------------------------------+-----------------+---------+-------------------------------+------+-----------------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+---------+------+-------------------------------------+-----------------+---------+-------------------------------+------+-----------------------------------------------------------+ | 1 | SIMPLE | maddr | ref | PRIMARY,email | email | 257 | const | 55 | Using where; Using index; Using temporary; Using filesort | | 1 | SIMPLE | msgrcpt | ref | msgrcpt_idx_mail_id,msgrcpt_idx_rid | msgrcpt_idx_rid | 4 | amavisd_stats.maddr.id | 14 | Using where | | 1 | SIMPLE | msgs | ref | PRIMARY,msgs_idx_time_num | PRIMARY | 14 | amavisd_stats.msgrcpt.mail_id | 1 | Using where | +----+-------------+---------+------+-------------------------------------+-----------------+---------+-------------------------------+------+-----------------------------------------------------------+ Provided I'm reading the EXPLAIN output correctly it shows it's successfully using indexes, and searching a limited number of rows. More data from the slow query log on one of the same queries. # Time: 100702 17:11:39 # Query_time: 264.682337 Lock_time: 0.000319 Rows_sent: 217 Rows_examined: 16933 SET timestamp=1278115899; SELECT maddr.email AS email, bspam_level, time_num, from_addr, subject, msgs.mail_id, secret_id FROM amavisd_stats.msgrcpt, amavisd_stats.msgs, amavisd_stats.maddr WHERE (maddr.email = 'j...@smith.com') AND msgrcpt.rid = maddr.id AND msgrcpt.rs != 'R' AND (msgrcpt.ds = 'D' OR msgrcpt.ds = 'R' OR msgrcpt.ds = 'B') AND quar_type = 'Q' AND time_num >= '1278029235' AND msgrcpt.mail_id = msgs.mail_id ORDER BY time_num DESC; The thing that gets me is the Query time being 264 seconds that when watching status sits in status 'Copying to tmp table'. I know InnoDB doesn't report lock time accurately in the slow query log; however, there should be no locks affecting a SELECT query anyways. If the same query is run using a fake user that doesn't exist, it returns swiftly with zero results. Users with more results take longer to query. Also running the same query a 2nd time produces results almost immediately. RAM would be my first guess; however, if I recall correctly, these queries were just as slow when the combined total of indexes were less than the available RAM dedicated to innodb buffer pool. I'm hoping there's somebody on the list running a high capacity setup with some advice. Thanks, N ------------------------------------------------------------------------------ This SF.net email is sponsored by Sprint What will you do first with EVO, the first 4G phone? Visit sprint.com/first -- http://p.sf.net/sfu/sprint-com-first _______________________________________________ AMaViS-user mailing list AMaViS-user@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/amavis-user Please visit http://www.ijs.si/software/amavisd/ regularly For administrativa requests please send email to rainer at openantivirus dot org