On Mar 25, 2014, at 8:46 AM, Matthew Spilich wrote:

> The symptom: The database machine (running postgres 9.1.9 on CentOS 6.4) is
> running a low utilization most of the time, but once every day or two, it
> will appear to slow down to the point where queries back up and clients are
> unable to connect. Once this event occurs, there are lots of concurrent
> queries, I see slow queries appear in the logs, but there doesn't appear to
> be anything abnormal that I have been able to see that causes this behavior.
...
> Has any on the forum seen something similar? Any suggestions on what to
> look at next? If it is helpful to describe the server hardware, it's got 2
> E5-2670 cpu and 256 GB of ram, and the database is hosted on 1.6TB raid 10
> local storage (15K 300 GB drives).

I could be way off here, but years ago I experienced something like this (in Oracle land), and after some stressful chasing, a marginal failure of the RAID controller revealed itself. It was the same kind of event: steady traffic, and then some I/O would not complete and normal operations would stack up. What you report reminded me of that. The E5 is a few years old; I wonder whether the RAID controller firmware needs a patch. I suppose a marginal power supply might cause a similar "hang." Marginal failures are very painful. Have you checked sar or the OS logs for the time of the event?