Hi guys, how are you?
We are having a problem with ACS when a primary storage fails.
We have several primary storages, each a Linux NFS server serving KVM
images. Every host mounts all of the NFS servers, because a single host
can be running VMs from different storages. The main problem is that
when one storage fails for any reason, the whole cluster goes crazy and
starts rebooting the hosts to reconnect to that storage, and all the VMs
in the cluster (including the VMs that were working fine) go down
because the connection to one storage failed.
If the problem with the storage is permanent, the cluster never starts
again and the hosts reboot indefinitely.
When this problem appears, the logs say this:
host heartbeat: kvmheartbeat.sh will reboot system because it was unable
to write the heartbeat to the storage.
Many users edit the kvmheartbeat.sh script to avoid the host reboots, or
restart the agent on the host, but I am not sure that this is the real
solution.
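To make the idea concrete, here is a rough sketch of the behaviour those
edits seem to be after: try to write a heartbeat to each storage with a
time limit, and only report a failure instead of rebooting the host. This
is NOT the real kvmheartbeat.sh. The function names (write_heartbeat,
check_storage), the hb-<hostname> file name and the 5 second default are
all made up for illustration, and it assumes timeout(1) from GNU
coreutils is available.

```shell
#!/bin/sh
# Hypothetical sketch only, not the actual kvmheartbeat.sh logic.

# write_heartbeat DIR [SECONDS]
# Try to write the current timestamp to DIR/hb-<hostname>.
# Returns 0 on success, non-zero if the write fails or times out.
write_heartbeat() {
    dir="$1"
    limit="${2:-5}"
    hb_file="$dir/hb-$(uname -n)"
    # Bound the write with timeout(1) so a hung NFS mount does not
    # block the check forever. SIGKILL is used because a plain TERM
    # may not interrupt a process waiting on NFS I/O.
    timeout -s KILL "$limit" sh -c 'date +%s > "$1"' _ "$hb_file" 2>/dev/null
}

# check_storage DIR
# Report the result for one storage; on failure it only prints a
# message and returns 1, it does not reboot anything.
check_storage() {
    if write_heartbeat "$1"; then
        echo "OK $1"
    else
        echo "FAILED $1"
        return 1
    fi
}
```

Called once per mounted primary storage (for example check_storage
followed by the mount point), this would at least show which storage is
the one failing. Whether skipping the reboot is safe for VMs that still
live on the failed storage is exactly the part I am unsure about.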
Can someone help propose a better solution to this high-risk problem?
Regards,
Mauro