Greetings to all,

I am having a weird issue with MySQL that I can't solve. We are getting
intermittent client connection errors (error code 2003) against the database
server. They last about 10 minutes at a time, seemingly at random, and
started after 20+ days of uptime. Unfortunately, I have not been able to
correlate these connection problems with any queries, jobs, etc., so I was
hoping someone here might be able to help me out.

The problem is as follows. Seemingly at random, the master suddenly stops
accepting connections, and the clients return connection error 2003,
indicating the master did not respond in a timely manner. This goes on for
about 10 minutes, at which point the master starts accepting connections
again, without any human input. This happened at 4am on Sunday morning for
example, so it healed itself before I could get myself out of bed and
comprehend the situation, let alone connect somewhere and try and fix it.
We have seen this happen about 4 or 5 times a week for the last 2 weeks,
with no apparent pattern in the time or date. Sometimes it happens twice in
one day and then disappears for 4 days. There was no spike in activity as
far as we can tell, and CPU and network usage were stable at about 2% and 4%
of capacity respectively. We also have the slow query log turned on with a
1-second threshold, and it shows no queries anywhere near the gaps in
connection.
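
Since error 2003 only says the client could not connect (the number in
parentheses at the end of the message is the OS-level errno), one thing that
might narrow this down is a probe that records whether the TCP connect is
refused outright or simply times out. Below is a minimal sketch of that idea
in Python, not something we run today. The names probe and watch, the
5-second timeout, and the 10-second interval are all assumptions made for
illustration.

```python
import errno
import socket
import time


def probe(host, port, timeout=5.0):
    """Try one TCP connect and classify the outcome.

    Returns (status, elapsed_seconds), where status is "ok", "refused",
    "timeout", or "error:<errno name>".
    """
    start = time.monotonic()
    try:
        # Only the TCP handshake is tested; no MySQL login is attempted.
        with socket.create_connection((host, port), timeout=timeout):
            status = "ok"
    except (socket.timeout, TimeoutError):
        status = "timeout"
    except ConnectionRefusedError:
        status = "refused"
    except OSError as exc:
        status = "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
    return status, time.monotonic() - start


def watch(host, port, interval=10.0, samples=None):
    """Print one timestamped probe result per interval.

    Runs forever unless samples is given (hypothetical helper).
    """
    taken = 0
    while samples is None or taken < samples:
        status, elapsed = probe(host, port)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {status} {elapsed:.3f}s", flush=True)
        taken += 1
        time.sleep(interval)
```

Run from one of the web servers as watch("db-master.example.internal", 3306)
(the host name is a placeholder), this would leave a timestamped trail. A run
of "refused" results would suggest nothing was listening on the port, while
"timeout" results would point more toward the network or a connection queue
that is not being drained.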

We are running MySQL 5.0.44 on a single master on its own hardware, with a
replication slave on a different machine. We have a write-through memcached
setup in front of MySQL, which handles the majority of the requests, so
MySQL is seeing about 20 to 30 ops (selects, inserts, updates) per second on
average. All of this is running on Amazon EC2 instances, and the boxes are
dedicated (we are running the 64-bit Large Instance, which is supposed to be
a dedicated virtual box with 2 CPUs, 2 cores apiece, and 8G of RAM, with
1.5/2G free). We then have two other machines that run the load balancers
and the front-end web servers (PHP 5.1.6), which connect to the database
when the cache doesn't have the required information. I did not post this to
the PHP section since it seems like a more general issue with the server as
opposed to the clients.

After the second time it happened, we switched out our AWS hardware in hopes
that it was a hardware fluke, but to no avail. The problem reared its ugly
head again 3 days later. We doubt it is the internal Amazon network, since
the external monitoring of the box continues to work and report information,
and no other box is showing similar connection symptoms. Also, all of our
boxes are in the same Amazon Zone, which implies that they are in the same
colo. This makes me think that some combination of our configuration and
queries is causing the trouble.

I checked the archives, but it seems that the people who encountered this
error saw it during setup/configuration, and not randomly after 30 days of
uptime. I doubt anyone has the answer, so I was hoping someone could help me
understand the best way to debug this problem in order to find the reason
for these random outages.
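
One approach I can think of is to collapse timestamped connection checks
into discrete outage windows, so that each window can be lined up against
cron schedules, replication logs, and the like. The sketch below assumes a
list of (timestamp, ok) samples sorted by time. The function name
outage_windows and the 30-second gap are made up for illustration.

```python
from datetime import datetime, timedelta


def outage_windows(samples, max_gap=timedelta(seconds=30)):
    """Collapse consecutive failed samples into (start, end, count) windows.

    samples: iterable of (datetime, ok) pairs, sorted by time.
    A window closes at the first successful sample, or when the next
    failure comes more than max_gap after the previous one.
    """
    windows = []
    current = None  # [start, end, failed_sample_count]
    for ts, ok in samples:
        if ok:
            if current:
                windows.append(tuple(current))
                current = None
            continue
        if current and ts - current[1] <= max_gap:
            # Still inside the same outage: extend it.
            current[1] = ts
            current[2] += 1
        else:
            if current:
                windows.append(tuple(current))
            current = [ts, ts, 1]
    if current:
        windows.append(tuple(current))
    return windows
```

With checks every 10 seconds, a 10-minute outage would show up as a single
window of roughly 60 failed samples, which is easier to compare against
other logs than a long list of individual errors.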

Thanks in advance for any and all help!

Pieter de Zwart
