We have a number of clusters connected to ovirt-engine. Some of these are 
single-host clusters (running ovirt-release43-4.3.5.2-1 on CentOS 7) with local 
storage. Recently, ovirt-engine started reporting one of these hosts as 
NonResponsive. VMs were still running on the host, but oVirt seemed unable to 
communicate with it. Testing shows no issues connecting engine -> host:vdsm, and 
likewise the host can communicate with the engine on ports 80 and 443.
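
A TCP-level check of this kind can be sketched as follows. This is only a 
minimal sketch: the port_open helper is hypothetical, 54321 is assumed to be 
the VDSM port (its default), and engine.ovirt.local is a placeholder name. A 
successful connect only proves the TCP handshake works, not that the engine's 
TLS/JSON-RPC session with VDSM is healthy.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example usage (hostnames are placeholders, ports as described above):
#   on the engine:  port_open("compute01.ovirt.local", 54321)
#   on the host:    port_open("engine.ovirt.local", 443)
```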

The host in question cannot be managed via IPMI for power management, but we are 
able to perform an SSH reboot via the engine interface. We opted to log in to 
the running virtual machines, shut them down, and issue the SSH reboot from the 
engine. The host changed to Rebooting status for some time and then went back 
to the NonResponsive state.

We are unable to put the host into maintenance or confirm that the host has 
been rebooted manually, as we are presented with the following error:

"Error while executing action: Cannot perform confirm 'Host has been rebooted'. 
Another power management action is already in progress."

The VDSM logs on the host in question are continually showing:

2020-04-16 08:23:51,478+0000 INFO  (vmrecovery) [vds] recovery: waiting for 
storage pool to go up (clientIF:711)
2020-04-16 08:23:52,332+0000 INFO  (jsonrpc/7) [vdsm.api] FINISH 
getStoragePoolInfo error=Unknown pool id, pool not connected: 
(u'6baea5dc-b049-47c2-a94f-5229c37c62d0',) from=::ffff:10.10.1.252,33680, 
task_id=420249a4-55c0-436d-92c7-ea1286a0e287 (api:52)
2020-04-16 08:23:52,332+0000 ERROR (jsonrpc/7) [storage.TaskManager.Task] 
(Task='420249a4-55c0-436d-92c7-ea1286a0e287') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in 
_run
    return fn(*args, **kargs)
  File "<string>", line 2, in getStoragePoolInfo
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2550, in 
getStoragePoolInfo
    pool = self.getPool(spUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 351, in 
getPool
    raise se.StoragePoolUnknown(spUUID)
StoragePoolUnknown: Unknown pool id, pool not connected: 
(u'6baea5dc-b049-47c2-a94f-5229c37c62d0',)
2020-04-16 08:23:52,333+0000 INFO  (jsonrpc/7) [storage.TaskManager.Task] 
(Task='420249a4-55c0-436d-92c7-ea1286a0e287') aborting: Task is aborted: 
"Unknown pool id, pool not connected: 
(u'6baea5dc-b049-47c2-a94f-5229c37c62d0',)" - code 309 (task:1181)

During this period, the following is observed in the engine logs:

2020-04-16 08:23:52,307Z ERROR 
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(EE-ManagedThreadFactory-engineScheduled-Thread-31) [] EVENT_ID: 
VDS_BROKER_COMMAND_FAILURE(10,802), VDSM compute01.ovirt.local command 
SpmStatusVDS failed: Message timeout which can be caused by communication issues
2020-04-16 08:23:52,307Z ERROR 
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] 
(EE-ManagedThreadFactory-engineScheduled-Thread-31) [] Command 
'SpmStatusVDSCommand(HostName = compute01.ovirt.local, 
SpmStatusVDSCommandParameters:{hostId='67dc53da-d5ee-461e-87de-2ca6dd78637f', 
storagePoolId='6baea5dc-b049-47c2-a94f-5229c37c62d0'})' execution failed: 
VDSGenericException: VDSNetworkException: Message timeout which can be caused 
by communication issues
2020-04-16 08:23:52,346Z ERROR 
[org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand] 
(EE-ManagedThreadFactory-engineScheduled-Thread-31) [] Failed in 
'GetStoragePoolInfoVDS' method
2020-04-16 08:23:52,355Z ERROR 
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(EE-ManagedThreadFactory-engineScheduled-Thread-31) [] EVENT_ID: 
IRS_BROKER_COMMAND_FAILURE(10,803), VDSM command GetStoragePoolInfoVDS failed: 
Unknown pool id, pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',)
2020-04-16 08:23:52,356Z ERROR 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(EE-ManagedThreadFactory-engineScheduled-Thread-31) [] 
IrsBroker::Failed::GetStoragePoolInfoVDS: IRSGenericException: 
IRSErrorException: Failed to GetStoragePoolInfoVDS, error = Unknown pool id, 
pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',), code = 309
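
To confirm that both logs complain about the same storage pool, the pool IDs 
can be pulled out of the "pool not connected" messages. This is a rough sketch: 
the disconnected_pools helper and its regular expression are hypothetical and 
only match the message format shown in the excerpts above (including the line 
wrapping introduced by email).

```python
import re

# Matches e.g.  pool not connected: (u'6baea5dc-...-5229c37c62d0',)
# \s* tolerates a line break between the message and the tuple.
POOL_RE = re.compile(
    r"pool not connected:\s*"
    r"\(u?'([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})',\)"
)


def disconnected_pools(log_text):
    """Return the set of pool UUIDs named in 'pool not connected' errors."""
    return set(POOL_RE.findall(log_text))
```

Running this over the VDSM and engine excerpts and comparing the two sets shows 
whether both sides refer to the same pool.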

As far as we can tell, the metadata file for the local storage domain looks 
fine:

ALIGNMENT=1048576
BLOCK_SIZE=512
CLASS=Data
DESCRIPTION=compute01_local_storage
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=1
POOL_DESCRIPTION=compute01_local
POOL_DOMAINS=1cc26dea-688c-40cc-bda6-38b00054001e:Active
POOL_SPM_ID=-1
POOL_SPM_LVER=-1
POOL_UUID=6baea5dc-b049-47c2-a94f-5229c37c62d0
REMOTE_PATH=/mnt/ovirt_datastore
ROLE=Master
SDUUID=1cc26dea-688c-40cc-bda6-38b00054001e
TYPE=LOCALFS
VERSION=5
_SHA_CKSUM=24c85256b889d0b3384e7975c660f4a5cbb58d33
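
The kind of sanity check meant by "looks fine" can be sketched like this. It is 
a minimal sketch under assumptions: parse_metadata and check_pool are 
hypothetical helpers, POOL_DOMAINS is assumed to be a comma-separated list of 
SDUUID:status pairs, and the _SHA_CKSUM value is not verified here.

```python
def parse_metadata(text):
    """Parse KEY=VALUE lines into a dict, skipping anything else."""
    meta = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:
            meta[key] = value
    return meta


def check_pool(meta, expected_pool):
    """Return a list of problems found; an empty list means the checks pass."""
    problems = []
    if meta.get("POOL_UUID") != expected_pool:
        problems.append("POOL_UUID does not match the pool the engine queries")

    # Assumed format: "sduuid1:Active,sduuid2:Active"
    domains = {}
    for item in meta.get("POOL_DOMAINS", "").split(","):
        sd_uuid, sep, status = item.partition(":")
        if sep:
            domains[sd_uuid.strip()] = status.strip()
    if domains.get(meta.get("SDUUID")) != "Active":
        problems.append("SDUUID is not listed as Active in POOL_DOMAINS")

    if meta.get("ROLE") != "Master":
        problems.append("domain is not the Master domain")
    return problems
```

For the metadata above, check_pool(parse_metadata(text), 
"6baea5dc-b049-47c2-a94f-5229c37c62d0") should return an empty list, which only 
shows the file is self-consistent, not that VDSM has connected the pool.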

I assume this has happened because oVirt was unable to power cycle the machine 
and now cannot confirm the SPM state. Is that right? Normally in a case like 
this we would confirm that the host has been manually rebooted, but we are 
unable to do that.

How can I clear the power management action that ovirt-engine thinks is in 
progress?