Hi all,

We've tweaked some parameters such as ping.timeout and ping.interval, and
Apache CloudStack now notices that the host is down in 12-15 minutes, which
I still feel is a long time.
However, we're unable to start the VMs on a different host because of the
error below:
"Insufficient capacity to restart VM, name: testvm1, id: 541 which was
running on host name: <FQDN>(id:85), availability zone: <ZONE>, pod: <POD>"

We're testing the HA feature of ACS where, in the case of a hypervisor
failure, ACS should notice the failure and immediately start the affected
VMs on a different hypervisor (which would be part of the HA pair). So
far, we're not getting the results we expect.

The HA partner has more than enough CPU and memory resources available.
What's causing this inability to restart the VM on the HA partner?

Kind regards,

Jeroen Kleijer

On Sat, Feb 15, 2025 at 12:18 PM Jimmy Huybrechts <ji...@linservers.com>
wrote:

> Hi,
>
> Before I got sick I was trying to do the same thing and noticed the same
> as you. Here it noticed the server being offline after 10 minutes or so, but
> even when the server came back it still thought all those VMs were running,
> while a virsh list on the hypervisor clearly showed it had no running VMs.
>
> I see the same when doing a shutdown -h on a VM: it takes CloudStack up
> to 10 minutes before it sees that the VM is offline. How can we shorten
> this time drastically? As it stands, when I need to make a change to a VM I
> shut it down and then have to wait all that time before I can make the
> change and boot it again.
>
> --
> Jimmy
>
> From: Jeroen Kleijer <jeroen.klei...@gmail.com>
> Date: Saturday, 15 February 2025 at 08:40
> To: users@cloudstack.apache.org <users@cloudstack.apache.org>
> Subject: Hypervisor failure
> Hi all,
>
> We're running tests on our Apache CloudStack (4.19.1.2) environment, where
> our hypervisors are running KVM. We've noticed that when we pull the plug
> on a hypervisor, it can take ACS up to an hour(!) before it finally notices
> that the hypervisor is down and changes the state to DOWN. In the
> meantime, it considers the VMs that were running on it still available.
> This leads us to two questions:
> 1) Which variables need to be tweaked to make ACS notice something like
> this in just a couple of minutes instead of more than an hour?
> 2) Why are these values so high? An hour before ACS declares an agent
> offline seems very long.
>
> Kind regards,
>
> Jeroen Kleijer
>
