Re: 5.1.51 Database Replica Slows Down Suddenly, Lags For Days, and Recovers Without Intervention
The data size is about 200 GB. I would have noticed an increase in writes. No backup activity is running (actually, I don't do conventional backups). Any theories?

Thank you for your interest.

Kind regards,
--
Luis Motta Campos

On 23 Oct 2011, at 14:06, Tyler Poland tpol...@engineyard.com wrote:

Luis,

How large is your database? Have you checked for an increase in write activity on the master leading up to this? Are you running a backup against the replica?

Thank you,
Tyler

On Oct 23, 2011 5:40 AM, Luis Motta Campos luismottacam...@yahoo.co.uk wrote:

Fellow DBAs and MySQL Users

[apologies for any duplicates - I've also posted this to percona-discuss...@googlegroups.com]

I've been hunting an issue with my database cluster for several months now without much success. Maybe I'm overlooking something here.

I've been observing the database slowing down and lagging behind by thousands of seconds (sometimes over the course of several days), even without any query load besides replication itself.

I am running Percona MySQL 5.1.51 (InnoDB plug-in version 1.12) on a Dell R710 (6 x 3.5 inch 15K RPM disks in RAID10; 24GB RAM; 2x quad-core Intel processors) running Debian Lenny. MySQL data, binary logs, relay logs, and InnoDB log files are on separate partitions from each other, on a RAID system separate from the operating system disks. The default storage engine is InnoDB, and the usual InnoDB memory structures are stable and look healthy.

I have about 500 (read) queries per second on average, and about 10% of this as writes on the master. I've been observing what looks like between 6 and 10 pending reads per second, uniformly, on my Cacti graphs.

The issue is characterized by the server suddenly slowing down writes without any previous warning or change, and lagging behind by several thousand seconds (triggering all sorts of alerts on my monitoring system). I don't observe extra CPU activity, just a reduced disk access rate (from about 5-6MB/s to 500KB/s) and replication lagging. I could correlate it neither with InnoDB hashing activity, nor with long-running queries, nor with background read/write thread activity.

I have no clue what is causing this behavior, and I'm unable to reproduce it under controlled conditions. I've observed the issue on servers both with and without workload (apart from the usual replication load). I am sure no changes were applied to the server or to the cluster.

I'm looking forward to suggestions and theories on the issue - all ideas are welcome.

Thank you for your time and attention,
Kind regards,
--
Luis Motta Campos is a DBA, Foodie, and Photographer

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=tpol...@engineyard.com
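To put the numbers from the original post in perspective, here is a rough sketch of how quickly lag accumulates when a replica applies changes more slowly than they arrive. It rests on a loud assumption that is not established anywhere in this thread: that the drop in disk throughput (about 5-6MB/s down to 500KB/s) is a fair proxy for the drop in the replica's apply rate. The function names are made up for illustration.

```python
def lag_change_per_second(apply_rate: float, arrival_rate: float) -> float:
    """Seconds of lag gained (positive) or recovered (negative) per wall-clock second.

    Both rates must use the same unit (events/s, MB/s, ...). Treating disk
    throughput as a stand-in for the apply rate is an assumption.
    """
    if arrival_rate <= 0:
        raise ValueError("arrival_rate must be positive")
    return 1.0 - apply_rate / arrival_rate


def seconds_to_reach_lag(target_lag: float, apply_rate: float, arrival_rate: float) -> float:
    """Wall-clock seconds until the replica is target_lag seconds behind."""
    growth = lag_change_per_second(apply_rate, arrival_rate)
    if growth <= 0:
        raise ValueError("replica is keeping up; lag does not grow")
    return target_lag / growth


if __name__ == "__main__":
    # Hypothetical reading of the post: apply rate ~0.5 MB/s against ~5.5 MB/s arriving.
    print(lag_change_per_second(0.5, 5.5))
    print(seconds_to_reach_lag(5000, 0.5, 5.5))
```

Under that assumption the replica falls behind by roughly nine tenths of a second for every second that passes, so a lag of several thousand seconds needs only an hour or two of slowdown to build up.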
Re: 5.1.51 Database Replica Slows Down Suddenly, Lags For Days, and Recovers Without Intervention
Claudio,

Thank you for your interest. I will wait for the issue to happen again and see what kind of information I can get with strace. This is indeed something I hadn't thought of trying yet.

I'll keep you people posted on this. I much appreciate the new approaches and fresh ideas.

Kind regards,
--
Luis Motta Campos

On 23 Oct 2011, at 23:27, Claudio Nanni claudio.na...@gmail.com wrote:

Luis,

Very hard to tackle. In my experience, once you exclude bottlenecks external to MySQL (hardware, operating system, etc.), the usual suspects are shared resources guarded by a single mutex, such as the query cache or the key cache. Since you do not use MyISAM, it cannot be the key cache. Since you use Percona, the query cache is disabled by default.

You should go a bit lower level and capture the system calls with one of the tools you surely know, to see whether there are waits on the semaphores.

I would also like to point out that the 'seconds behind master' value reported by the slave is not reliable.

Good luck!

Claudio
Re: 5.1.51 Database Replica Slows Down Suddenly, Lags For Days, and Recovers Without Intervention
Thank you for sharing your experience, Howard.

As those are replica servers, I don't care much about losing a second's worth of data in case of a power failure. I believe the data centre has two independent power sources, and my hardware man assured me that if the power goes down at the data centre we'll have bigger issues to worry about. As a result, I can run with innodb_flush_log_at_trx_commit = 2 without worrying too much about it.

The most interesting thing that came out of this whole conversation is that everybody seems to agree this is some sort of contention on a lock that only gets hot under certain conditions, and that doesn't seem to be part of the usual set of locks monitored by Cacti. I shall start buying beers for the MySQL developers I know again...

Thank you very much once more for sharing your experiences. This is invaluable, and I hope I can do the same for you in the future.

Kind regards,
--
Luis Motta Campos
Re: 5.1.51 Database Replica Slows Down Suddenly, Lags For Days, and Recovers Without Intervention
Luis,

How large is your database? Have you checked for an increase in write activity on the master leading up to this? Are you running a backup against the replica?

Thank you,
Tyler
Re: 5.1.51 Database Replica Slows Down Suddenly, Lags For Days, and Recovers Without Intervention
Luis,

Very hard to tackle. In my experience, once you exclude bottlenecks external to MySQL (hardware, operating system, etc.), the usual suspects are shared resources guarded by a single mutex, such as the query cache or the key cache. Since you do not use MyISAM, it cannot be the key cache. Since you use Percona, the query cache is disabled by default.

You should go a bit lower level and capture the system calls with one of the tools you surely know, to see whether there are waits on the semaphores.

I would also like to point out that the 'seconds behind master' value reported by the slave is not reliable.

Good luck!

Claudio
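Claudio's remark that 'seconds behind master' is unreliable is often addressed with a heartbeat: a job on the master writes the current timestamp into a small table every second, the row replicates like any other write, and the lag is the replica's clock minus the newest replicated timestamp. Tools such as pt-heartbeat work along these lines. The sketch below shows only the arithmetic; the function name is hypothetical, fetching the heartbeat row from the replica is left out, and both hosts are assumed to have synchronized clocks.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_lag_seconds(heartbeat_written_at: datetime, replica_now: datetime) -> float:
    """Lag = replica's current time minus the newest replicated heartbeat timestamp.

    heartbeat_written_at is the timestamp the master stored in the (hypothetical)
    heartbeat row, as read back on the replica.
    """
    lag = (replica_now - heartbeat_written_at).total_seconds()
    # A small negative value means clock skew between the hosts; clamp it to zero.
    return max(lag, 0.0)


if __name__ == "__main__":
    now = datetime(2011, 10, 23, 12, 0, 0, tzinfo=timezone.utc)
    newest_heartbeat = now - timedelta(seconds=4200)  # made-up sample value
    print(heartbeat_lag_seconds(newest_heartbeat, now))
```

Because the measurement depends only on a row that travelled through replication, it keeps growing while the replica is stalled, whatever the SQL thread itself reports.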
RE: 5.1.51 Database Replica Slows Down Suddenly, Lags For Days, and Recovers Without Intervention
One cause of heavy replication lag we noticed was a misbehaving application blasting updates (and commits) onto the master's InnoDB tables from multiple clients. Since slave replication is single-threaded, it couldn't keep up I/O-wise, while the master showed reasonably low load throughout.

The temporary fix was to set innodb_flush_log_at_trx_commit = 2, so that the log file is flushed to disk only once every second. As a result, the lag went from 5,000 seconds behind and climbing to 0 in literally seconds, and the slave load dropped well below 1 again.

The catch (there's always one, of course) is that if the server loses power or the operating system crashes, you could lose up to about one second's worth of committed transactions.

Howard
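The effect Howard describes can be illustrated with a rough model of how many redo-log flushes per second each setting of innodb_flush_log_at_trx_commit implies. This is a simplified sketch, not a measurement: the function names are hypothetical, the 50 commits per second is an assumption derived from the roughly 10% writes out of 500 queries per second mentioned in the original post (treating each write as its own transaction), and the 5 ms flush latency is invented for illustration.

```python
def log_flushes_per_second(commits_per_second: float, flush_log_at_trx_commit: int) -> float:
    """Approximate redo-log flushes (fsync calls) per second for a given setting."""
    if flush_log_at_trx_commit == 1:
        # Setting 1: the log is written and flushed to disk at every commit.
        return commits_per_second
    if flush_log_at_trx_commit in (0, 2):
        # Settings 0 and 2: the flush to disk happens roughly once per second,
        # however many transactions commit within that second.
        return 1.0
    raise ValueError("innodb_flush_log_at_trx_commit must be 0, 1 or 2")


def max_serial_commits_per_second(flush_latency_ms: float) -> float:
    """Upper bound for a single thread that waits for one flush per commit."""
    if flush_latency_ms <= 0:
        raise ValueError("flush_latency_ms must be positive")
    return 1000.0 / flush_latency_ms


if __name__ == "__main__":
    commits = 50.0  # assumption: ~10% of 500 queries/s, one transaction per write
    for setting in (1, 2):
        print(setting, log_flushes_per_second(commits, setting))
    # Assumed 5 ms per flush: ceiling for a single-threaded applier under setting 1.
    print(max_serial_commits_per_second(5.0))
```

The second function is the part that matters for a replica: a single-threaded applier that waits for a flush on every commit cannot exceed one commit per flush latency, while the master spreads the same waits across many client threads.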