Brian Buhrow's note sparked another thought.

> Stop darkstat.  Machine locks.

Is it possible the machine is not, strictly, hung, just doing something
that renders it unresponsive for a human-perceptible time?  You wrote
of having to get remote hands to poke an unresponsive machine; how long
did that take?  Did your remote hands notice whether the disk light was
lit (if there is such a light)?

I've had machines appear to lock up hard when what's actually going on
is that a large process is dumping core.  If the machine has what I
consider insane amounts of core (say, 64G), if darkstat's rlimit lets
it eat most of that, if there's enough free disk to store a substantial
fraction of that, and darkstat has a bug that leads it to core when
it's killed, this could look very similar - trying to write a 40-50 gig
coredump will not be fast, even on the kind of machine that has 40-50
gigs of core to write.  Especially if you've configured swap and it's
thrashing between reading swap and writing the coredump.

Another possibility is that it does not truly leak VM, but balloons
until its attempts to grab more memory are rejected and then does some
kind of management of the memory it's been granted.  If its rlimit is
set high enough, the symptoms could be similar.  (Still need to posit
something that makes it try to drop core, though.)

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B