Storpool have looked into it and determined that 'fencing' is causing the corruption: we are seeing VM instances running on two hosts at the same time. Here is a log excerpt:

Jul 16 12:37:04 server25311 java[962152]: INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-eb111af1 work-670) (logid:07a47ffd) Ovm3Investigator could not find VM[User|i-2-393-VM]
Jul 16 12:37:04 server25311 java[962152]: DEBUG [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-eb111af1 work-670) (logid:07a47ffd) Fencing off VM that we don't know the state of
Jul 16 12:37:04 server25311 java[962152]: DEBUG [c.c.o.h.OvmFencer] (HA-Worker-1:ctx-eb111af1 work-670) (logid:07a47ffd) Don't know how to fence non Ovm hosts KVM
Jul 16 12:37:04 server25311 java[962152]: INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-eb111af1 work-670) (logid:07a47ffd) Fencer OvmFenceBuilder returned null
Jul 16 12:37:04 server25311 java[962152]: DEBUG [c.c.h.o.r.Ovm3FenceBuilder] (HA-Worker-1:ctx-eb111af1 work-670) (logid:07a47ffd) Don't know how to fence non Ovm3 hosts KVM
Jul 16 12:37:04 server25311 java[962152]: INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-eb111af1 work-670) (logid:07a47ffd) Fencer Ovm3FenceBuilder returned null
Jul 16 12:37:04 server25311 java[962152]: DEBUG [c.c.h.ManagementIPSystemVMInvestigator] (HA-Worker-2:ctx-f405f7dd work-669) (logid:5cd9b357) Unable to find a management nic, cannot ping this system VM, unable to determine state of VM[User|i-2-393-VM] returning null
Jul 16 12:37:04 server25311 java[962152]: INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f405f7dd work-669) (logid:5cd9b357) ManagementIPSysVMInvestigator could not find VM[User|i-2-393-VM]
Jul 16 12:37:04 server25311 java[962152]: DEBUG [c.c.h.Ovm3Investigator] (HA-Worker-2:ctx-f405f7dd work-669) (logid:5cd9b357) isVmAlive: CTXDC02 on qcloud-s1-p1-c1-kvm3
Jul 16 12:37:04 server25311 java[962152]: INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f405f7dd work-669) (logid:5cd9b357) Ovm3Investigator could not find VM[User|i-2-393-VM]
Jul 16 12:37:04 server25311 java[962152]: DEBUG [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f405f7dd work-669) (logid:5cd9b357) Fencing off VM that we don't know the state of
Jul 16 12:37:04 server25311 java[962152]: DEBUG [c.c.o.h.OvmFencer] (HA-Worker-2:ctx-f405f7dd work-669) (logid:5cd9b357) Don't know how to fence non Ovm hosts KVM
Jul 16 12:37:04 server25311 java[962152]: INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f405f7dd work-669) (logid:5cd9b357) Fencer OvmFenceBuilder returned null
Jul 16 12:37:04 server25311 java[962152]: DEBUG [c.c.h.o.r.Ovm3FenceBuilder] (HA-Worker-2:ctx-f405f7dd work-669) (logid:5cd9b357) Don't know how to fence non Ovm3 hosts KVM
Jul 16 12:37:04 server25311 java[962152]: INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f405f7dd work-669) (logid:5cd9b357) Fencer Ovm3FenceBuilder returned null

Gary Dixon
Technical Consultant
T: 0161 537 4980
W: www.quadris.co.uk

The information contained in this e-mail from Quadris may be confidential and privileged for the private use of the named recipient. The contents of this e-mail may not necessarily represent the official views of Quadris. If you have received this information in error you must not copy, distribute or take any action or reliance on its contents. Please destroy any hard copies and delete this message.

-----Original Message-----
From: Simon Weller <swel...@ena.com.INVALID>
Sent: 20 July 2022 22:10
To: users@cloudstack.apache.org
Subject: Re: Virtual Router filesystem corruption

Gary,

No prob with the info, thanks for providing it.

Since you're using Storpool, I'd suggest you reach out to them on this directly and see whether they have any information that could be helpful.

There was an issue a while ago (Storpool actually reported it) where a kernel commit introduced a bug that caused file corruption.
That was back in about 2018:
https://storpool.com/blog/beware-silent-data-corruption-discovered-in-linux-kernels-4-10-4-17/

I believe ACS 4.15.x uses Debian 10.5 (Buster) for the VR images (dates to August 2020). That release is based on kernel 4.19.0-10.

-Si
________________________________
From: Gary Dixon <gary.di...@quadris.co.uk.INVALID>
Sent: Wednesday, July 20, 2022 3:00 PM
To: users@cloudstack.apache.org <users@cloudstack.apache.org>
Subject: Re: Virtual Router filesystem corruption

EXTERNAL EMAIL: This message originated outside of ENA. Use caution when clicking links, opening attachments, or complying with requests. Click the "Phish Alert Report" button above the email, or contact MIS, regarding any suspicious message.

Hi Si

Sure. Sorry for the lack of info. First time posting on the forum.

We are using the KVM hypervisor on Ubuntu 20.04 hosts. Primary storage is Storpool.
Let me know if you need more info

Best regards
Gary
________________________________
From: Simon Weller <swel...@ena.com.INVALID>
Sent: Wednesday, July 20, 2022 8:55:22 PM
To: users@cloudstack.apache.org <users@cloudstack.apache.org>
Subject: Re: Virtual Router filesystem corruption

Gary,

Can you provide some information about the OS, underlying hypervisor and primary storage in use?
-Si
________________________________
From: Gary Dixon <gary.di...@quadris.co.uk.INVALID>
Sent: Wednesday, July 20, 2022 11:15 AM
To: users@cloudstack.apache.org <users@cloudstack.apache.org>
Subject: Virtual Router filesystem corruption

Hi All

Recently we have been seeing ext4 filesystem corruption on a number of virtual routers, and manually running fsck doesn’t appear to help at all in fixing the issue (corrupt inode bitmap). We end up having to restart the associated VPC with cleanup enabled to rebuild a new VR.

Is this a common issue with ACS 4.15.1? Or are there specific circumstances causing the VR filesystem corruption that we could perhaps mitigate?

Kind regards
Gary
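
For anyone wanting to check their own management-server log for the pattern in the excerpt at the top of this thread, here is a minimal Python sketch. It groups the "could not find" and "returned null" messages by HA work item, so it is easy to see when every investigator and fencer gave up on a VM. It assumes the log line layout shown in that excerpt; the function name `summarise`, the regular expression and the `SAMPLE` list are illustrative only and are not part of CloudStack.

```python
import re
from collections import defaultdict

# Layout assumed from the excerpt in this thread:
#   LEVEL [logger] (HA-Worker-N:ctx-XXXX work-NNN) (logid:XXXX) message
ENTRY = re.compile(
    r"(?P<level>INFO|DEBUG)\s+\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<worker>HA-Worker-\d+):ctx-\w+ (?P<work>work-\d+)\)\s+"
    r"\(logid:(?P<logid>\w+)\)\s+(?P<msg>.*)"
)


def summarise(lines):
    """Group investigator misses and null fencer results by HA work item."""
    summary = defaultdict(lambda: {"investigators": [], "fencers": []})
    for line in lines:
        entry = ENTRY.search(line)
        if not entry:
            continue  # not an HA worker line in the assumed layout
        msg = entry.group("msg")
        work = entry.group("work")
        investigator = re.match(r"(\w+) could not find ", msg)
        fencer = re.match(r"Fencer (\w+) returned null", msg)
        if investigator:
            summary[work]["investigators"].append(investigator.group(1))
        elif fencer:
            summary[work]["fencers"].append(fencer.group(1))
    return dict(summary)


# Four entries copied from the excerpt (HA-Worker-1, work-670).
PREFIX = "Jul 16 12:37:04 server25311 java[962152]: "
CTX = "(HA-Worker-1:ctx-eb111af1 work-670) (logid:07a47ffd) "
SAMPLE = [
    PREFIX + "INFO [c.c.h.HighAvailabilityManagerImpl] " + CTX
    + "Ovm3Investigator could not find VM[User|i-2-393-VM]",
    PREFIX + "DEBUG [c.c.h.HighAvailabilityManagerImpl] " + CTX
    + "Fencing off VM that we don't know the state of",
    PREFIX + "INFO [c.c.h.HighAvailabilityManagerImpl] " + CTX
    + "Fencer OvmFenceBuilder returned null",
    PREFIX + "INFO [c.c.h.HighAvailabilityManagerImpl] " + CTX
    + "Fencer Ovm3FenceBuilder returned null",
]

if __name__ == "__main__":
    for work, result in summarise(SAMPLE).items():
        print(work, result)
```

To run it against a real log, pass an open file object instead of `SAMPLE`, for example `summarise(open(path))` with `path` pointing at your management-server log. A work item whose fencers all returned null is one where the VM was never fenced.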