So, our MySQL master database crashes about once a week, then immediately recovers. We're running a Dell 2850 -- a 64-bit Fedora Core 3 box with 6 GB of memory and 4 Intel Xeon processors at 3.60 GHz (per /proc/cpuinfo), each with a 2048 KB cache. It replicates to 2 slaves with the same hardware and memory (the slaves don't crash).
I've already been through everything at http://dev.mysql.com/doc/refman/4.1/en/crashing.html
uname -a
Linux dbhotsl1.manhunt.net 2.6.12-1.1381_FC3smp #1 SMP Fri Oct 21 04:22:48 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux
cat /proc/meminfo
MemTotal:        6142460 kB
MemFree:           26564 kB
Buffers:           15396 kB
Cached:           805128 kB
SwapCached:         1336 kB
Active:          5503352 kB
Inactive:         505792 kB
HighTotal:             0 kB
HighFree:              0 kB
LowTotal:        6142460 kB
LowFree:           26564 kB
SwapTotal:       2096472 kB
SwapFree:        2088036 kB
Dirty:              1996 kB
Writeback:             0 kB
Mapped:          5195364 kB
Slab:              78348 kB
CommitLimit:     5167700 kB
Committed_AS:    5532772 kB
PageTables:        12384 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      263636 kB
VmallocChunk: 34359474295 kB
HugePages_Total:       0
HugePages_Free:        0
Hugepagesize:       2048 kB

The server runs at 20-30 MB of free memory all the time, so it's not (necessarily) a low-memory issue (I walk through the worst-case math further down). We get the dreaded "Signal 11" error, and no core dumps even though we have core-file set in the [mysqld] section of my.cnf. Speaking of the my.cnf, here it is:

-----------------------------------------------------------------------
[mysqld]
core-file
old-passwords
tmpdir             = /tmp/
datadir            = /var/lib/mysql
socket             = /var/lib/mysql/mysql.sock
port               = 3306
key_buffer         = 320M
max_allowed_packet = 16M
table_cache        = 10240
thread_cache       = 80
ft_min_word_len    = 3

# Query Cache Settings - OFF due to overload of Session table
query_cache_size = 32M
query_cache_type = 2

# Log queries taking longer than "long_query_time" seconds
long_query_time  = 4
log-slow-queries = /var/log/mysql/slow-queries.log
log-error        = /var/log/mysql/mysqld.err

# Try number of CPU's*2 for thread_concurrency
thread_concurrency = 12

interactive_timeout = 28800
wait_timeout        = 30

# up to 15 Apache Servers with 256 connections each = 3840
# 5.8 G of memory = 2200 cxns
# when you change this recalculate total possible mysqld memory usage!!
# innodb_buffer_pool_size + key_buffer_size
#   + max_connections*(sort_buffer_size+read_buffer_size+binlog_cache_size)
#   + max_connections*2MB
max_connections    = 2200
max_connect_errors = 128

# Replication Master Server (default)
# binary logging is required for replication
log-bin         = /var/log/mysql/dbhotsl1-bin
server-id       = 18
binlog-do-db    = db1
binlog-do-db    = db2
binlog-do-db    = db3
max_binlog_size = 2G

# InnoDB tables
innodb_data_home_dir            = /var/lib/mysql/
innodb_data_file_path           = ibdata1:3G;ibdata2:3G;ibdata3:3G;ibdata4:3G;
innodb_log_group_home_dir       = /var/log/mysql/
innodb_log_files_in_group       = 2
innodb_log_arch_dir             = /var/log/mysql/
innodb_buffer_pool_size         = 4G
innodb_additional_mem_pool_size = 40M
innodb_log_file_size            = 160M
innodb_log_buffer_size          = 80M
innodb_flush_log_at_trx_commit  = 0
innodb_lock_wait_timeout        = 50
innodb_thread_concurrency       = 8
innodb_file_io_threads          = 4

##################################################
[mysql.server]
user    = mysql
basedir = /var/lib

##################################################
[safe_mysqld]
err-log  = /var/log/mysql/mysqld.log
pid-file = /var/run/mysqld/mysqld.pid
-----------------------------------------------------------------------

And then the error file -- pretty standard, not really telling me anything (and no stack trace):

--------------------------------------------------------------------------
mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.
key_buffer_size=335544320
read_buffer_size=131072
max_used_connections=2201
max_connections=2200
threads_connected=152
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 5114862 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

060427 23:56:44  InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
060427 23:56:44  InnoDB: Starting log scan based on checkpoint at
InnoDB: log sequence number 752 3907332354.
InnoDB: Doing recovery: scanned up to log sequence number 752 3912574976
InnoDB: Doing recovery: scanned up to log sequence number 752 3917817856
[...more of the same]
InnoDB: Doing recovery: scanned up to log sequence number 752 4144467558
060427 23:57:09  InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
98 99
InnoDB: Apply batch completed
InnoDB: In a MySQL replication slave the last master binlog file
InnoDB: position 0 53262417, file name swan-bin.003989
InnoDB: Last MySQL binlog file position 0 933891534, file name /var/log/mysql/dbhotsl1-bin.001193
060427 23:59:08  InnoDB: Flushing modified pages from the buffer pool...
060427 23:59:39  InnoDB: Started; log sequence number 752 4144467558
/usr/sbin/mysqld: ready for connections.
Version: '4.1.12-standard-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Edition - Standard (GPL)
-------------------------------------------------------------------------

Our binary logs grow 1.1G in 2 hours, so enabling the general log isn't really an option. I'm looking for answers that are not "check the slow query log", because 1) I've done that and changes have been put in place, and 2) there are no slow queries before a crash but plenty after (because of all the queries queued up while the server was down).

Can anyone help shed some light, even if it's just further ways to debug or places to look for debug output? It's a really bizarre problem; it shouldn't be happening, and it's not acceptable in our environment.

We run about 3000 queries per second under normal load and 6000 at peak times and when we crash; roughly 1000 of those queries are DML under normal load, and about 2000 under peak load and when we crash. We hit peak loads AFTER we crash, which makes sense (we're a web-based application). We have tons of monitoring, both on the system and things like the InnoDB monitor. We deadlock only right AFTER a crash, and will go for days at a time without a deadlock (i.e., the "last detected deadlock" before a crash was 5 days earlier). Load and memory usage are normal for hours, and only increase AFTER a crash.

There are no core files generated. Just today I added core-file-size=unlimited to the my.cnf in the hope that that will work (more on the core-limit checks below). If not, I have permission to restart MySQL running as the root user to see if it will dump core, even though it's a security risk.

One thing we did do was CHECK, ANALYZE and OPTIMIZE all the tables at the beginning of the month -- this helped stop the crashing under high load, but it still crashes about once a week, at NON-peak times.
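Two more data points while I'm at it. First, on memory: here's the worst-case math from the comment in my.cnf above, plugged through with the numbers from the crash report (read_buffer_size comes straight from the report; the 5114862 K figure works out to sort_buffer_size sitting at its ~2M default, and I'm assuming binlog_cache_size is at its 32K default since we don't set it). Just a back-of-the-envelope sketch:

# Worst-case mysqld memory, per the formula in the my.cnf comment (all in KB).
# sort_buffer_size and binlog_cache_size are assumed defaults -- adjust if yours differ.
innodb_buffer_pool=$((4096 * 1024))   # innodb_buffer_pool_size = 4G
key_buffer=$((320 * 1024))            # key_buffer = 320M
sort_buffer=2048                      # assumed ~2M default
read_buffer=128                       # 131072 bytes, from the crash report
binlog_cache=32                       # assumed 32K default
max_connections=2200

per_connection=$((sort_buffer + read_buffer + binlog_cache))
total=$((innodb_buffer_pool + key_buffer + max_connections * per_connection + max_connections * 2048))
echo "worst case: ${total} KB, against 6142460 KB physical"

That comes out to roughly 13.9 million KB in the absolute worst case -- more than twice physical RAM. We obviously never allocate all of it at once (threads_connected was only 152 at crash time), but max_used_connections did hit 2201, which is why I only say it's not *necessarily* a memory issue.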
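Second, on the missing core files: as far as I can tell, core-file-size is an option for safe_mysqld/mysqld_safe (it just does a ulimit -c) rather than for mysqld itself, so before the next crash I'm also checking the limit by hand. A rough sketch -- the init-script path is from this box and may differ on yours:

# 1) The core limit in the shell that launches mysqld; children inherit it,
#    so a "0" here means no core regardless of what mysqld is told.
ulimit -c

# 2) Raise it explicitly in the init script (e.g. /etc/init.d/mysqld here),
#    right before safe_mysqld/mysqld_safe is invoked:
ulimit -c unlimited

# 3) The core should land in the datadir (mysqld chdirs there), so make sure
#    the mysql user can write to it and there's room for a multi-gigabyte dump:
ls -ld /var/lib/mysql
df -h /var/lib/mysql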
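And on the table-maintenance front: if we end up scripting a regular pass, it would probably just be mysqlcheck out of cron -- something like the sketch below (not exactly what we ran for the monthly pass; the account is a placeholder, and OPTIMIZE rebuilds InnoDB tables, so it belongs in an off-peak window):

#!/bin/sh
# Weekly CHECK / ANALYZE / OPTIMIZE pass, meant for cron during the off-peak window.
MYSQL_USER="maint"       # hypothetical maintenance account
MYSQL_PASS="XXXXXXXX"    # placeholder

mysqlcheck --all-databases --check    -u "$MYSQL_USER" -p"$MYSQL_PASS"
mysqlcheck --all-databases --analyze  -u "$MYSQL_USER" -p"$MYSQL_PASS"
mysqlcheck --all-databases --optimize -u "$MYSQL_USER" -p"$MYSQL_PASS"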
Given that the monthly pass helped, is it possible we're doing so many updates that we're slowly corrupting our db? I'm also going to try doing the CHECK, ANALYZE and OPTIMIZE weekly (cron'ing something like the sketch above) to see if that helps... but I really feel like we shouldn't need to do that. We've looked everywhere we can think of and there are no clues. Any advice/help?

Thanx!

-Sheeri

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]