We have a number of clusters connected to ovirt-engine. Some of these are single-host clusters (running ovirt-release43-4.3.5.2-1 on CentOS 7) with local storage. Recently, ovirt-engine started reporting one of these hosts as NonResponsive. The VMs were still running on the host, but oVirt seems unable to communicate with it. Testing shows no issues connecting from the engine to VDSM on the host, and likewise the host can communicate with the engine on ports 80 and 443.
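For anyone wanting to repeat the connectivity test described above, here is a minimal sketch of a plain TCP reachability check. It only shows that a port accepts connections, not that VDSM or the engine answers correctly at the application level. The function name `tcp_port_open` is made up for illustration, and port 54321 is assumed to be the VDSM port (the usual default; it may differ on a customized host).

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This is only a transport-level check: it does not speak JSON-RPC or
    validate TLS certificates, so a True result does not prove that the
    engine can actually talk to VDSM.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


# Example usage (hostnames are assumptions, adjust to your environment):
#   on the engine:  tcp_port_open("compute01.ovirt.local", 54321)
#   on the host:    tcp_port_open("<engine-fqdn>", 443)
#                   tcp_port_open("<engine-fqdn>", 80)
```

A check like this passing while the engine still logs "Message timeout" would suggest the problem is above the TCP layer rather than a firewall or routing issue.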
The host in question cannot be managed via IPMI for power management, but we are able to perform an SSH reboot via the engine interface. We opted to log in to the running virtual machines, shut them down and issue the SSH reboot from the engine. The server changes to Rebooting status for some time and then reports the NonResponsive state. We are unable to put the host into maintenance or to confirm that the host has been rebooted manually, as we are presented with the following:

"Error while executing action: Cannot perform confirm 'Host has been rebooted'. Another power management action is already in progress."

The VDSM logs on the host in question are continually showing:

2020-04-16 08:23:51,478+0000 INFO (vmrecovery) [vds] recovery: waiting for storage pool to go up (clientIF:711)
2020-04-16 08:23:52,332+0000 INFO (jsonrpc/7) [vdsm.api] FINISH getStoragePoolInfo error=Unknown pool id, pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',) from=::ffff:10.10.1.252,33680, task_id=420249a4-55c0-436d-92c7-ea1286a0e287 (api:52)
2020-04-16 08:23:52,332+0000 ERROR (jsonrpc/7) [storage.TaskManager.Task] (Task='420249a4-55c0-436d-92c7-ea1286a0e287') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in getStoragePoolInfo
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2550, in getStoragePoolInfo
    pool = self.getPool(spUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 351, in getPool
    raise se.StoragePoolUnknown(spUUID)
StoragePoolUnknown: Unknown pool id, pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',)
2020-04-16 08:23:52,333+0000 INFO (jsonrpc/7) [storage.TaskManager.Task] (Task='420249a4-55c0-436d-92c7-ea1286a0e287') aborting: Task is aborted: "Unknown pool id, pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',)" - code 309 (task:1181)

During this period, the following is observed in the engine logs:

2020-04-16 08:23:52,307Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM compute01.ovirt.local command SpmStatusVDS failed: Message timeout which can be caused by communication issues
2020-04-16 08:23:52,307Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [] Command 'SpmStatusVDSCommand(HostName = compute01.ovirt.local, SpmStatusVDSCommandParameters:{hostId='67dc53da-d5ee-461e-87de-2ca6dd78637f', storagePoolId='6baea5dc-b049-47c2-a94f-5229c37c62d0'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2020-04-16 08:23:52,346Z ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [] Failed in 'GetStoragePoolInfoVDS' method
2020-04-16 08:23:52,355Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [] EVENT_ID: IRS_BROKER_COMMAND_FAILURE(10,803), VDSM command GetStoragePoolInfoVDS failed: Unknown pool id, pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',)
2020-04-16 08:23:52,356Z ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [] IrsBroker::Failed::GetStoragePoolInfoVDS: IRSGenericException: IRSErrorException: Failed to GetStoragePoolInfoVDS, error = Unknown pool id, pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',), code = 309

The metadata file for the local storage domain looks fine?

ALIGNMENT=1048576
BLOCK_SIZE=512
CLASS=Data
DESCRIPTION=compute01_local_storage
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=1
POOL_DESCRIPTION=compute01_local
POOL_DOMAINS=1cc26dea-688c-40cc-bda6-38b00054001e:Active
POOL_SPM_ID=-1
POOL_SPM_LVER=-1
POOL_UUID=6baea5dc-b049-47c2-a94f-5229c37c62d0
REMOTE_PATH=/mnt/ovirt_datastore
ROLE=Master
SDUUID=1cc26dea-688c-40cc-bda6-38b00054001e
TYPE=LOCALFS
VERSION=5
_SHA_CKSUM=24c85256b889d0b3384e7975c660f4a5cbb58d33

I would assume this has happened because ovirt was unable to power cycle the machine and now can't confirm the SPM state? Normally in a case like this we would confirm the host has been manually rebooted, but we're unable to do that. How can I clear the power management action that ovirt-engine thinks is in progress?

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/LLORHPLPNGECTDGZVHHKQPK2NGW5JLDB/
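
As a rough cross-check of the storage domain metadata quoted above, the sketch below parses the KEY=VALUE lines and compares them with the pool UUID that appears in the "Unknown pool id" errors. The helper names `parse_metadata` and `check_pool_membership` are hypothetical, and the comma-separated `uuid:status` layout of POOL_DOMAINS is an assumption based on the single entry shown in the post. It does not verify `_SHA_CKSUM`, and a clean result only means the file is internally consistent, not that VDSM will accept it.

```python
def parse_metadata(text):
    """Parse KEY=VALUE lines into a dict, skipping lines without '='."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        meta[key] = value
    return meta


def check_pool_membership(meta, expected_pool_uuid):
    """Return a list of inconsistencies; an empty list means none were found."""
    problems = []
    if meta.get("POOL_UUID") != expected_pool_uuid:
        problems.append("POOL_UUID does not match the pool the engine asks for")
    # Assumed format: "sd_uuid:Status[,sd_uuid:Status...]"
    domains = dict(
        item.split(":", 1)
        for item in meta.get("POOL_DOMAINS", "").split(",")
        if ":" in item
    )
    if domains.get(meta.get("SDUUID")) != "Active":
        problems.append("SDUUID is not listed as Active in POOL_DOMAINS")
    if meta.get("ROLE") != "Master":
        problems.append("domain is not marked as Master")
    return problems


# Subset of the metadata from the post that is relevant to the check.
SAMPLE = """\
POOL_DOMAINS=1cc26dea-688c-40cc-bda6-38b00054001e:Active
POOL_UUID=6baea5dc-b049-47c2-a94f-5229c37c62d0
ROLE=Master
SDUUID=1cc26dea-688c-40cc-bda6-38b00054001e
"""

# Pool UUID taken from the VDSM and engine error messages.
POOL_FROM_LOGS = "6baea5dc-b049-47c2-a94f-5229c37c62d0"
problems = check_pool_membership(parse_metadata(SAMPLE), POOL_FROM_LOGS)
print(problems or "no inconsistencies found")
```

With the values from the post this finds no inconsistencies, which fits the impression that the metadata itself is fine and that the pool is simply not connected on the host.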