On Mon, Jul 20, 2009 at 7:00 AM, Nicholas Sherlock <n.sherl...@gmail.com> wrote:
> Matthew Tice wrote:
>
>> Currently we're migrating our static node cluster from 32-bit openSUSE 10.3
>> using the disk_cache module on a 2G tmpfs to a 64-bit CentOS 5.3 using the
>> disk_cache module on a 9G tmpfs. After pushing these CentOS nodes into
>> production (and consequently adding many more requests) we started seeing a
>> load spike on these systems. Preliminary tests have shown that using a 2G
>> (maybe 3G - still testing that one) tmpfs on the same CentOS node doesn't
>> have the same high load. I'm not sure if this is a bug with tmpfs,
>> Apache/disk_cache, CentOS, or what. Any insight into this strange problem
>> would be appreciated.
>
> I had this problem on my server, where the system service "mlocate" was
> scheduled to run every day. It basically scans every file on the system, and
> with the huge number of files generated by disk_cache, it took more than a
> day to finish one scan. So the next day there were two mlocate instances
> running. Then three. Eventually no legitimate IO requests were being serviced
> and the whole server ground to a halt. The load average skyrocketed because
> of all the waiting processes. "mlocate" didn't show up in 'top' because it
> used almost no CPU time. I diagnosed the problem with 'iotop', which gives
> per-process IO stats.
>
> This is probably not the same problem you're having, but iotop is still a
> useful tool for identifying IO contention when you can't find the culprit
> based on CPU time.
>
> Cheers,
> Nicholas Sherlock

Thanks Nicholas, I'll take a look at that.

I had htcacheclean running every 5 min., which could have caused the bulk of
my problems. I changed the daemon to kick off every 30 min. instead, which
seems to have helped - a little. The machine isn't quite as sluggish, but the
load is still hovering around 2 (5 min. average).

Matt
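
[Editor's note, not from the original thread: a minimal sketch of the daemon-mode htcacheclean invocation being described, assuming Apache 2.2's htcacheclean; the cache path and size limit are placeholders, not values from the thread, and only the 30-minute interval reflects the change Matt mentions.]

    # -d30 runs htcacheclean as a daemon, waking every 30 minutes;
    # -n yields to other disk activity, -t removes empty directories,
    # -i only cleans when the cache has actually changed since last run.
    # Path and limit below are hypothetical examples.
    htcacheclean -d30 -n -t -i -p /var/cache/mod_disk_cache -l 7000M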