Ineiev wrote:
> For the record: git.sv.gnu.org has broken again.
The VMs seem to be developing a repeating problem where their root file
system, which resides on the Ceph SAN (storage area network), gets
stuck. The Linux kernel then hits its hung_task timeout. Meanwhile
processes keep getting started which try to read from the disk and
can't. They get stuck in the uninterruptible D state waiting for DMA
that never completes.
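For reference, and not copied from the affected machines, this state is
easy to spot by listing the processes stuck in D state and looking for
the kernel's hung_task warnings, something like:

    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'   # processes in uninterruptible sleep
    dmesg | grep -i 'blocked for more than'          # hung_task timeout messages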
This results in processes stacking up until the machine runs out of
process slots or runs out of memory, which produces "can't fork"
errors. The load average hits 120, 214, various very large numbers
that I can see either in my terminal running htop or in munin.
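These limits can be checked directly; these are generic commands, not
the exact ones from our monitoring, but they show how close a system is
to the "can't fork" point:

    ps -e --no-headers | wc -l        # current number of processes
    cat /proc/sys/kernel/pid_max      # kernel-wide pid limit
    cat /proc/sys/kernel/threads-max  # kernel-wide thread limit
    free -m                           # memory pressure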
The only mitigation is to reboot the system. After the reboot the
newly booted system is then okay. This has been a very unusual
situation, but it's been hitting us multiple times lately. It's no
longer unusual. It's becoming situation normal. Which is terrible!
And worse, it has been anywhere from one to four hours before one of
us becomes available and can take action to reboot the system.
I have set up a monit rule to try to automatically detect this bad
state and reboot the system before it becomes completely unresponsive.
    check system loadavg
        if loadavg (1min) > 75 for 2 cycles
            then exec "/bin/systemctl reboot --force"
A cycle is 2 minutes. For most triggers I will wait for 3 cycles to
be sure, but in this case, if the system load is over 75 for two hits,
then it is heading toward the unresponsive "can't fork" problems
pretty quickly. The system is usually still working until the load
gets over 100.
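The 2-minute cycle comes from the polling interval in monitrc, which on
this setup is something like the following (value from memory, not
copied from the machine):

    set daemon 120    # poll every 120 seconds, so one cycle = 2 minutes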
The question is whether the system will hit the unresponsive point
before those 2 minutes are over. Maybe. In which case it won't be
able to fork off the reboot action. But if the system can still load
and run systemctl then this should reboot the system. This will
hopefully recover more quickly than us humans have been able to react
to it so far. We will see.
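If it turns out that even forking systemctl is too much for the wedged
state, a more drastic fallback, which I have not set up and mention
only as a sketch, would be the kernel's magic SysRq interface, which
triggers an immediate reboot from the kernel side without any clean
shutdown:

    echo 1 > /proc/sys/kernel/sysrq    # enable SysRq if it isn't already
    echo b > /proc/sysrq-trigger       # immediate reboot: no sync, no unmount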
Bob