Hi,
I experienced this problem myself. In our KVM + Ceph + NFS-Ganesha
environment, the Ganesha NFS server could hang under full Ceph load.
Hosts would then restart at random due to the loss of NFS access, which
magnified the problem and cascaded into a restart of the entire
environment.
We currently have the restart line removed from kvmheartbeat.sh;
instead, we report the restart attempt via Prometheus.
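Roughly, the relevant part of our script now looks like this (a sketch
of our local setup; the metric name and the node_exporter textfile
collector path are our own choices, not anything CloudStack ships):

    # Instead of rebooting, publish the attempt for the Prometheus
    # node_exporter textfile collector (path is site-specific).
    TEXTFILE_DIR=/var/lib/node_exporter/textfile_collector
    /usr/bin/logger -t heartbeat "kvmheartbeat.sh: heartbeat write failed, reboot suppressed"
    printf 'cloudstack_kvmheartbeat_reboot_attempt_timestamp_seconds %s\n' "$(date +%s)" \
        > "${TEXTFILE_DIR}/kvmheartbeat.prom.$$"
    mv "${TEXTFILE_DIR}/kvmheartbeat.prom.$$" "${TEXTFILE_DIR}/kvmheartbeat.prom"

An alert rule on that metric then pages us instead of the host fencing
itself.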

Regards,
Piotr


-----Original Message-----
From: Sina Kashipazha <s.kashipa...@protonmail.com.INVALID> 
Sent: Wednesday, October 20, 2021 10:35 AM
To: users@cloudstack.apache.org
Subject: Re: All cluster reboot when a Primary storage fails

Hey Daniel,

PR #4586 (https://github.com/apache/cloudstack/pull/4586) addresses your
issue as well. I'm currently working on it. Could you share how I can
reproduce your reboot problem?

Kind regards,
Sina

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Saturday, October 16th, 2021 at 05:40, Daniel Augusto Veronezi Salvador 
<dvsalvador...@gmail.com> wrote:

> Hi Mauro,
>
> In KVM's monitor, when there is an inconsistency in the heartbeat file,
> or the heartbeat timeout is exceeded several times, the host is
> restarted by default.
>
> PR 4586 (https://github.com/apache/cloudstack/pull/4586) already
> addresses this issue by externalizing a property, which allows the
> operator to decide whether the host must be restarted (the default is
> 'true', meaning that the host will be restarted). However, this
> feature will only be available after release 4.16.
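>
> For reference, a sketch of how this looks in the agent configuration
> once the PR lands (the property name here is from memory; please
> verify it against the agent.properties shipped with your release):
>
>     # /etc/cloudstack/agent/agent.properties
>     # 'true' (the default) keeps today's behaviour and reboots the host
>     # when the heartbeat cannot be written; 'false' only raises an alert
>     reboot.host.and.alert.management.on.heartbeat.timeout=false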
>
> Best regards,
>
> Daniel Salvador
>
> On 15/10/2021 20:43, Mauro Ferraro - G2K Hosting wrote:
> > Hi guys, how are you?
> >
> > We are having problems with ACS when a primary storage fails.
> >
> > We have several primary storages (Linux NFS servers) serving KVM
> > images, so every host has all of the NFS servers mounted, because a
> > single host can run VMs from different storages. The main problem
> > with this is that when a storage fails for any reason, the whole
> > cluster goes crazy and starts rebooting hosts to reconnect to that
> > storage, and all the VMs in the cluster (including the VMs that were
> > working fine) go down because the connection to one storage failed.
> >
> > If the problem with the storage is permanent, the cluster never
> > comes back up and the hosts reboot indefinitely.
> >
> > When this problem appears, the logs say this:
> >
> > host heartbeat: kvmheartbeat.sh will reboot system because it was
> > unable to write the heartbeat to the storage.
> >
> > Many users edit the kvmheartbeat.sh script to avoid the host reboot,
> > or restart the agent on the host, but I am really not sure that this
> > is the real solution.
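> >
> > For context, the part of the script they comment out is, if I
> > remember correctly, something like the following (paraphrased from
> > memory, not the exact upstream code):
> >
> >     # logged right before the script forces a reboot
> >     /usr/bin/logger -t heartbeat "kvmheartbeat.sh will reboot system because it was unable to write the heartbeat to the storage."
> >     sync &
> >     sleep 5
> >     echo b > /proc/sysrq-trigger   # immediate reboot via magic sysrq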
> >
> > Can someone help propose a better solution to this high-risk problem?
> >
> > Regards,
> >
> > Mauro