Greetings to all. I am having a weird issue with MySQL that I can't solve. We are getting intermittent client connection errors (code 2003) to the database server, lasting about 10 minutes at a time, seemingly at random and after 20+ days of uptime. Unfortunately, I have not been able to correlate these connection problems with any queries, jobs, etc., so I was hoping someone here might be able to help me out.
The problem is as follows. Seemingly at random, the master suddenly stops accepting connections, and the clients return connection error 2003, indicating the master did not respond in a timely manner. This goes on for about 10 minutes, at which point the master starts accepting connections again without any human input. This happened at 4am on a Sunday morning, for example, so it healed itself before I could get out of bed and comprehend the situation, let alone connect somewhere and try to fix it.

We have been seeing this happen about 4 or 5 times a week for the last 2 weeks, and there seems to be no pattern as to the time or date. Sometimes it happens twice in one day and then disappears for 4 days. There was no spike in activity as far as we can tell, and CPU and network usage were stable at about 2% and 4% of capacity respectively. Also, we have the slow query log turned on and set to 1 second, and there are no logged queries anywhere near the gaps in connectivity.

We are running MySQL 5.0.44 on a single master on its own hardware, with a replication slave on a different machine. We have a write-through memcached setup in front of MySQL, which handles the majority of the requests, so MySQL is seeing about 20 to 30 operations (selects, inserts, updates) per second on average. All of this is running on Amazon EC2 instances, and we have dedicated boxes (we are running the 64-bit Large Instance, which is supposed to be a dedicated virtual box with 2 CPUs, 2 cores apiece, and 8G of RAM, with 1.5/2G free). We then have two other machines that run the load balancers and the front-end web servers with PHP 5.1.6, which connect to the database when the cache doesn't have the required information. I did not post this to the PHP section since it seems like a more general issue with the server as opposed to the clients.

After the second time it happened, we switched out our AWS hardware in the hope that it was a hardware fluke, but to no avail. The problem reared its ugly head again 3 days later.
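To pin down what the clients actually see during an outage, a small probe like the one below could run from one of the web servers and log one line per attempt. This is only a minimal sketch under assumptions of mine: the host name `db-master.internal`, the 10-second interval, and the status labels are placeholders, not part of our setup. It checks only the TCP connect to port 3306, not the MySQL handshake or login.

```python
import errno
import socket
import time


def classify(exc):
    """Map a socket exception to a short, log-friendly label (labels are my own)."""
    if isinstance(exc, socket.timeout):
        return "timeout"        # no answer within the timeout
    if isinstance(exc, ConnectionRefusedError):
        return "refused"        # host answered, but nothing accepted on the port
    if isinstance(exc, socket.gaierror):
        return "dns"            # host name did not resolve
    if isinstance(exc, OSError) and exc.errno in (errno.ENETUNREACH, errno.EHOSTUNREACH):
        return "unreachable"    # routing problem between client and server
    return "other"


def probe(host, port=3306, timeout=5.0):
    """Attempt one TCP connect; return (status, seconds elapsed)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            status = "ok"
    except OSError as exc:
        status = classify(exc)
    return status, time.monotonic() - start


def format_line(status, elapsed, now=None):
    """Build one UTC-timestamped log line; `now` is epoch seconds (None = current time)."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    return f"{stamp} {status} {elapsed:.3f}s"


def run(host, interval=10.0, count=None):
    """Probe repeatedly; count=None loops until interrupted."""
    done = 0
    while count is None or done < count:
        status, elapsed = probe(host)
        print(format_line(status, elapsed), flush=True)
        done += 1
        time.sleep(interval)


# Example (hypothetical host name):
# run("db-master.internal", interval=10.0)
```

Lining these timestamps up against an outage window would at least show whether connects were refused outright or simply timed out, which point in different directions. One caveat I am not fully sure about: connections that open and close without completing the handshake may count toward the server's `max_connect_errors` limit for that client host, so this is probably safest from a machine that also makes normal connections.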
We doubt it is the internal Amazon network, since the external monitoring of the box continues to work and report data during the outages, and no other box is showing similar connection symptoms. Also, all of our boxes are in the same Amazon zone, which implies that they are in the same colo. This makes me think that some combination of our configuration and queries is causing the trouble.

I checked the archives, but it seems that the people who encountered this error saw it during setup/configuration, and not randomly after 30 days of uptime. I doubt anyone has the answer offhand, so I was hoping someone could help me understand the best way to debug this problem and find the reason for these random outages.

Thanks in advance for any and all help!

Pieter de Zwart