Re: 5.1.51 Database Replica Slows Down Suddenly, Lags For Days, and Recovers Without Intervention

2011-10-24 Thread Luis Motta Campos
The data size is about 200 GB. I would have noticed an increase in writes. No 
backup activity is running (actually, I don't do conventional backups). 

Any theories?
Thank you for your interest. 
Kind regards,
--
Luis Motta Campos

On 23 Oct 2011, at 14:06, Tyler Poland tpol...@engineyard.com wrote:

 Luis,
 
 How large is your database?  Have you checked for an increase in write
 activity on the master leading up to this? Are you running a backup against
 the replica?
 
 Thank you,
 Tyler
 
 Sent from my Droid Bionic
 On Oct 23, 2011 5:40 AM, Luis Motta Campos luismottacam...@yahoo.co.uk
 wrote:
 
 Fellow DBAs and MySQL Users
 
 [apologies for possible duplicates - I've also posted this to
 percona-discuss...@googlegroups.com]
 
 I've been hunting an issue with my database cluster for several months now
 without much success. Maybe I'm overlooking something here.
 
 I've been observing the database slowing down and lagging behind for
 thousands of seconds (sometimes over the course of several days) even
 without any query load besides replication itself.
 
 I am running Percona MySQL 5.1.51 (InnoDB plug-in version 1.12) on Dell
 R710 (6 x 3.5 inch 15K RPM disks in RAID10; 24GB RAM; 2x Quad-core Intel
 processors) running Debian Lenny. MySQL data, binary logs, relay logs, and
 InnoDB log files are each on a separate partition, on a RAID
 system separate from the operating system disks.
 
 Default Storage Engine is InnoDB, and the usual InnoDB memory structures
 are stable and look healthy.
 
 I have about 500 (read) queries per second on average, and about 10% of
 this as writes on the master.
 
 I've been observing something that looks like between 6 and 10 pending
 reads per second uniformly on my cacti graphs.
 
 The issue is characterized by the server suddenly slowing down writes
 without any warning or change, and lagging behind by several
 thousand seconds (triggering all sorts of alerts on my monitoring system). I
 don't observe extra CPU activity, just a reduced disk access rate (from
 about 5-6MB/s to 500KB/s) and replication lag. I could correlate it
 neither with InnoDB hashing activity, nor with long-running queries, nor with
 background read/write thread activity.
 
 I don't have any clue what is causing this behavior, and I'm unable to
 reproduce it under controlled conditions. I've observed the issue both on
 servers with and without workload (apart from the usual replication load). I
 am sure no changes were applied to the server or to the cluster.
 
 I'm looking forward to suggestions and theories on the issue - all ideas
 are welcome.
 Thank you for your time and attention,
 Kind regards,
 --
 Luis Motta Campos
 is a DBA, Foodie, and Photographer
 
 
 --
 MySQL General Mailing List
 For list archives: http://lists.mysql.com/mysql
 To unsubscribe:
 http://lists.mysql.com/mysql?unsub=tpol...@engineyard.com
 
 


Re: 5.1.51 Database Replica Slows Down Suddenly, Lags For Days, and Recovers Without Intervention

2011-10-24 Thread Luis Motta Campos
Claudio, 

Thank you for your interest. 
I will wait for the issue to happen again and will see what kind of information 
I can get back with strace. This is indeed something I didn't think of trying 
yet. 

I'll keep you all posted on this. 
The new approaches and fresh ideas are much appreciated. 
Kind regards,
--
Luis Motta Campos

On 23 Oct 2011, at 23:27, Claudio Nanni claudio.na...@gmail.com wrote:

 Luis,
 
 Very hard to tackle.
 In my experience, excluding bottlenecks external to MySQL (hardware,
 OS, etc.), the usual 'suspects' are shared resources 'guarded' by a single
 mutex, such as the query cache or the key cache.
 Since you do not use MyISAM, it cannot be the key cache. Since you use Percona,
 the query cache is disabled by default.
 You should go a bit lower level and trace the system calls with one of the
 tools you surely know, to see whether there are waits on semaphores.
 
 I would also like to point out that the 'seconds behind master' value reported
 by the slave is not reliable.
 
 Good luck!
 
 Claudio
 


Re: 5.1.51 Database Replica Slows Down Suddenly, Lags For Days, and Recovers Without Intervention

2011-10-24 Thread Luis Motta Campos
Thank you for sharing your experience, Howard. 

As those are replica servers, I don't care much about losing a second worth of 
data in case of power failure. I believe the data centre has double independent 
power sources, and my hardware man assured me if the power goes down at the 
data centre we'll have bigger issues to worry about. As a result, I can run 
with innodb_flush_log_at_trx_commit = 2 without worrying too much about it. 

The most interesting thing that came out of this whole conversation is that 
everybody seems to agree this is some sort of contention on a lock that 
only gets hot under certain conditions, and that doesn't seem to be part of the 
usual set of locks monitored by Cacti. 

I shall start buying beers for the MySQL developers I know again...

Thank you very much once more for sharing your experiences. This is invaluable 
and I hope I can do the same for you in the future. 

Kind regards,
--
Luis Motta Campos

Re: 5.1.51 Database Replica Slows Down Suddenly, Lags For Days, and Recovers Without Intervention

2011-10-23 Thread Tyler Poland
Luis,

How large is your database?  Have you checked for an increase in write
activity on the master leading up to this? Are you running a backup against
the replica?

Thank you,
Tyler

Sent from my Droid Bionic

Re: 5.1.51 Database Replica Slows Down Suddenly, Lags For Days, and Recovers Without Intervention

2011-10-23 Thread Claudio Nanni
Luis,

Very hard to tackle.
In my experience, excluding bottlenecks external to MySQL (hardware,
OS, etc.), the usual 'suspects' are shared resources 'guarded' by a single
mutex, such as the query cache or the key cache.
Since you do not use MyISAM, it cannot be the key cache. Since you use Percona,
the query cache is disabled by default.
You should go a bit lower level and trace the system calls with one of the
tools you surely know, to see whether there are waits on semaphores.
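
As a complement to system-call tracing (and a different technique from it),
InnoDB itself lists threads that are waiting on a mutex or latch in the
SEMAPHORES section of SHOW ENGINE INNODB STATUS. The following is a minimal
sketch that pulls the long waits out of that text. The sample excerpt is
made up for illustration and is not output from the server discussed here,
the function name long_semaphore_waits is hypothetical, and the exact line
format is an assumption that may differ between InnoDB versions.

```python
import re

# Illustrative excerpt only; NOT captured from the server in this thread.
SAMPLE_STATUS = """\
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 4242, signal count 4100
--Thread 1157658944 has waited at btr/btr0cur.c line 443 for 3.00 seconds the semaphore:
S-lock on RW-latch at 0x2aaab42120b8 created in file dict/dict0dict.c line 1356
--Thread 1158191424 has waited at row/row0sel.c line 3326 for 1.00 seconds the semaphore:
Mutex at 0x2aaaac6a1a38 created file btr/btr0sea.c line 167, lock var 1
Mutex spin waits 0, rounds 120000, OS waits 2900
------------
TRANSACTIONS
------------
"""

# Assumed format of a "thread is waiting" line in the SEMAPHORES section.
WAIT_RE = re.compile(
    r"^--Thread (?P<thread>\d+) has waited at (?P<where>\S+) line (?P<line>\d+) "
    r"for (?P<seconds>\d+(?:\.\d+)?) seconds the semaphore:",
    re.MULTILINE,
)


def long_semaphore_waits(status_text, min_seconds=1.0):
    """Return (seconds, 'file:line') pairs for waits >= min_seconds, longest first."""
    waits = []
    for match in WAIT_RE.finditer(status_text):
        seconds = float(match.group("seconds"))
        if seconds >= min_seconds:
            location = f"{match.group('where')}:{match.group('line')}"
            waits.append((seconds, location))
    return sorted(waits, reverse=True)


if __name__ == "__main__":
    for seconds, location in long_semaphore_waits(SAMPLE_STATUS):
        print(f"{seconds:6.2f}s waiting at {location}")
```

Sampling this every few seconds during a lag episode and counting which
source locations recur would point at the contended lock, if there is one.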

I would also like to point out that the 'seconds behind master' value reported
by the slave is not reliable.
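
A common alternative is a heartbeat row: the master writes its current
time into a small table at a fixed interval, the row replicates like any
other write, and the replica compares the replicated timestamp with its own
clock. This is the idea behind tools such as pt-heartbeat. Below is a
minimal sketch of the lag computation only, assuming the clocks on both
hosts are synchronized and that the heartbeat timestamp has already been
read from the replica; the function name heartbeat_lag_seconds is
hypothetical.

```python
from datetime import datetime, timezone


def heartbeat_lag_seconds(heartbeat_written_at, now=None):
    """Replica lag = replica's current time minus the last replicated heartbeat.

    heartbeat_written_at: timezone-aware time the master wrote into the
    heartbeat row, as read back on the replica.
    now: replica's current time; defaults to the current UTC time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # Clamp at zero so small clock skew never reports negative lag.
    return max(0.0, (now - heartbeat_written_at).total_seconds())


if __name__ == "__main__":
    written = datetime(2011, 10, 23, 12, 0, 0, tzinfo=timezone.utc)
    read_at = datetime(2011, 10, 23, 12, 0, 42, tzinfo=timezone.utc)
    print(heartbeat_lag_seconds(written, read_at))
```

Unlike the slave's own figure, this number keeps growing while the SQL
thread is stalled, because it depends only on which heartbeat row has
actually been applied.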

Good luck!

Claudio



RE: 5.1.51 Database Replica Slows Down Suddenly, Lags For Days, and Recovers Without Intervention

2011-10-23 Thread Howard Hart
One cause of heavy replication lag we noticed was a misbehaving 
application blasting updates (and commits) onto the master InnoDB tables from 
multiple clients. Since slave replication is single-threaded, it couldn't keep 
up I/O-wise, while the master seemed to show reasonably low load throughout. 

The temporary fix was to set innodb_flush_log_at_trx_commit = 2 so that the 
log file is flushed to disk only once every second. As a result, the lag went 
from 5,000 seconds behind and climbing to 0 in literally seconds, and the slave 
load dropped way below 1 again.

The catch (there's always one, of course) is that if the server crashes, you 
could lose up to one second's worth of recently committed transactions.
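
A rough back-of-the-envelope model of why the setting helps so much: if
every commit applied by the single replication thread has to wait for one
log flush to disk, the flush latency caps the apply rate, whereas the master
can overlap flushes across many client connections. The sketch below
assumes that simplified model; the function names and all figures (5 ms per
flush, the backlog and rates) are hypothetical and not measurements from
this thread, and a battery-backed RAID write cache would change the numbers
considerably.

```python
def max_serial_commits_per_second(flush_latency_ms):
    """Upper bound on commits/s when each commit waits for one log flush."""
    return 1000.0 / flush_latency_ms


def seconds_to_catch_up(backlog_commits, apply_rate, incoming_rate):
    """Time to drain a backlog; None if the replica cannot outpace the master."""
    if apply_rate <= incoming_rate:
        return None
    return backlog_commits / (apply_rate - incoming_rate)


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    flush_bound = max_serial_commits_per_second(5.0)
    # Master commits slower than the bound: the backlog drains.
    print(seconds_to_catch_up(3000, flush_bound, 50.0))
    # Master commits faster than the bound: the backlog keeps growing.
    print(seconds_to_catch_up(3000, flush_bound, 250.0))
```

With innodb_flush_log_at_trx_commit = 2 the per-commit flush drops out of
the apply path, so the bound above no longer applies and the replica can
drain its backlog at whatever rate the rest of the I/O path allows.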

Howard
