Hi Daan
It seems CloudStack did know the host had died, because it tried to fence the 
host but couldn't, as we have host HA disabled. It also reported that an OOB 
stop had occurred on the HA-enabled VMs and started them all again on the same 
host. We then had to put the host into maintenance mode because the iDRAC logs 
were showing issues with two memory DIMMs.

All I know is that, whichever host the corrupt VR was running on, we could not 
console to it or to any other running VM on the same host, because the agent 
comms were messed up.

We have found a line in the agent log on the host stating that PublicKey 
authentication to the VR had failed (because the VR was corrupt at the guest OS 
level). At the time we did not see this, and any command sent from within ACS 
mgmt. to either reboot the VR or restart the VPC with cleanup resulted in the 
host agent not servicing that request or any other request, such as viewing the 
console of any VM or live migrating any VM to another host. We're still sifting 
through both agent and mgmt. logs to try to determine exactly what was causing 
this behaviour. All other running VMs on the host were actually fine, as we 
could connect to them by external methods.

We are hoping to upgrade the environment ASAP so we can get better Host HA with 
StorPool Primary storage.
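
For the log sifting mentioned above, here is a rough sketch of how an agent log 
could be scanned for the public key authentication failure. The exact wording 
of the log line is not quoted in this thread, so the pattern below is an 
assumption, and the function name is only illustrative.

```python
import re
from typing import Iterable, List, Tuple

# Assumed wording: the exact agent log message is not quoted in this thread,
# so this pattern is a guess and should be adjusted to the real log line.
AUTH_FAILURE = re.compile(r"public\s*key.*fail", re.IGNORECASE)


def find_auth_failures(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for each line matching AUTH_FAILURE."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if AUTH_FAILURE.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

It can be fed an open file handle for the agent log (whatever its path is on 
the host), and the line numbers it returns can then be lined up against the 
timestamps in the mgmt. log.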

BR

Gary


Gary Dixon
Quadris Cloud Manager
0161 537 4980 +44 7989717661
gary.di...@quadris.co.uk
www.quadris.com
Innovation House, 12-13 Bredbury Business Park
Bredbury Park Way, Bredbury, Stockport, SK6 2SN
-----Original Message-----
From: Daan Hoogland <daan.hoogl...@gmail.com>
Sent: Monday, February 26, 2024 1:03 PM
To: users@cloudstack.apache.org
Subject: Re: corrupt RVR causing host agent issues

Gary, the mail does not display the screenshot for me. Also, this is an old 
version (4.15); I think you should upgrade.

What might be the root of your issue is that *you* saw that the physical host 
had crashed, but CloudStack could not determine that. To prevent starting the 
same VM twice, it withholds any action in such situations.

You may call this a bug or a "lack of feature", but the bottom line is that 
this is expected behaviour.

I do not think a corrupt VR would crash a host.


On Mon, Feb 26, 2024 at 1:25 PM Gary Dixon <gary.di...@quadris.co.uk.invalid>
wrote:

> ACS 4.15.2
>
> KVM
>
> Ubuntu 20.04
>
>
>
> Hi all
>
>
>
> We had a physical host crash on Friday due to hardware failure. This
> appeared to have caused issues with some RVRs going into an ‘unknown’
> state.
>
>
>
> The strange thing was that on any host where an RVR in an unknown state
> was running, we could not console onto any VMs on that host, nor
> could we SSH directly to the RVR from the host.
>
> The UI was showing the agent state of all hosts as ‘UP’.
>
>
>
> Only when we restarted the ACS mgmt. service did we notice that the
> agent on the host where an RVR was running in an ‘unknown’ state was
> then in a ‘connecting’ state for some time. There were no networking
> issues either: the host was pingable from the mgmt. server.
>
>
>
> We were then briefly able to console onto one of the RVRs in an
> unknown state and discovered that the RVR was indeed corrupt.
> This is the screenshot of the RVR terminal:
>
>
>
> We then marked the RVR in the DB as ‘stopped’ and destroyed it with
> virsh directly on the host. We were then able to restart the VPC with
> cleanup, which re-created the RVR that had been corrupt.
>
> It then appeared that once the corrupt RVR had gone, all other RVRs
> in an unknown state transitioned to the ‘backup’ state.
>
>
>
> We are wondering if we have encountered a bug where a corrupt RVR
> crashes the host's CloudStack agent if ACS tries to do anything with
> the RVR, such as restarting it.
>
>
>
> BR
>
>
>
> Gary
>
>
>
>
>
>
> Gary Dixon
> Quadris Cloud Manager
> 0161 537 4980
> +44 7989717661
> gary.di...@quadris.co.uk
> http://www.quadris.com/
> Innovation House, 12-13 Bredbury Business Park Bredbury Park Way,
> Bredbury, Stockport, SK6 2SN
>


--
Daan
