Hi, I have a hyperconverged ovirt 4.3.10 installation. One host is unresponsive. This has happened in the past but most often resolved itself within a few minutes. This time it is a different beast. The host does not come back. Putting the host in maintenance is not possible because "Host is unresponsive but has running virtual machines". Putting the host into local maintenance on the local CLI does not help. Global maintenance does not have any effect either. The host has been rebooted several times both locally and via the web interface on the engine, which worked but did not resolve the issue. Logs: /var/log/ovirt-engine/engine.log 2022-06-07 06:40:17,606+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-34) [26f286ba-58d0-44be-96e0-f874057fe603] EVENT_ID: GENERIC_ERROR_MESSAGE(14,001), Cannot switch Host to Maintenance mode. Host still has running VMs on it and is in Non Responsive state. 2022-06-07 06:40:17,606+02 WARN [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-34) [26f286ba-58d0-44be-96e0-f874057fe603] Validation of action 'MaintenanceNumberOfVdss' failed for user admin@internal-authz. Reasons: VAR__TYPE__HOST,VAR__ACTION__MAINTENANCE,VDS_CANNOT_MAINTENANCE_VDS_IS_NOT_RESPONDING_WITH_VMS,$HostNotResponding ovirt3.os-s.de 2022-06-07 06:41:23,974+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-33) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM ovirt3.os-s.de command Get Host Capabilities failed: Message timeout which can be caused by communication issues 2022-06-07 06:41:23,974+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-33) [] Unable to RefreshCapabilities: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues 2022-06-07 06:44:47,033+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM ovirt3.os-s.de command Get Host Capabilities failed: Message timeout which can be caused by communication issues 2022-06-07 06:44:47,034+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [] Unable to RefreshCapabilities: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
vdsmd status: Jun 06 08:51:09 ovirt3.os-s.de vdsm[3567]: WARN Worker blocked: <Worker name=jsonrpc/1 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getStats', 'id': u'ce7ab5d1-3034-4b5a-91c5-9446e17341a2'} at 0x7fe1646563d0> timeout=60, duration=120.00 at 0x7fe1646562d0> task#=15 at 0x7fe165f47e50>, traceback: File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 765, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 260, in run ret = func(*args, **kwargs) File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run self._execute_task() File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task task() File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__ self._callable() File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__ self._handler(self._ctx, self._req) File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest response = self._handle_request(req, ctx) File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request res = method(**params) File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod result = fn(*methodArgs) File: "<string>", line 2, in getStats File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method ret = func(*args, **kwargs) File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1435, in getStats multipath=True)} File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 79, in get_stats ret['haStats'] = _getHaInfo() File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 182, in _getHaInfo stats = instance.get_all_stats(timeout=5) File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 94, in get_all_stats stats = broker.get_stats_from_storage() File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 146, in get_stats_from_storage result = self._proxy.get_stats() File: "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__ return self.__send(self.__name, args) File: "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request verbose=self.__verbose File: "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request return self.single_request(host, handler, request_body, verbose) File: "/usr/lib64/python2.7/xmlrpclib.py", line 1301, in single_request self.send_content(h, request_body) File: "/usr/lib64/python2.7/xmlrpclib.py", line 1448, in send_content connection.endheaders(request_body) File: "/usr/lib64/python2.7/httplib.py", line 1052, in endheaders self._send_output(message_body) File: "/usr/lib64/python2.7/httplib.py", line 890, in _send_output self.send(msg) File: "/usr/lib64/python2.7/httplib.py", line 852, in send self.connect() File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 60, in connect self.sock.connect(base64.b16decode(self.host)) File: "/usr/lib64/python2.7/socket.py", line 224, in meth return getattr(self._sock,name)(*args) Jun 06 08:51:14 ovirt3.os-s.de vdsm[3567]: WARN Worker blocked: <Worker name=periodic/0 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7fe16457ba90> at 0x7fe16457bad0> timeout=15, duration=120.00 at 0x7fe1644c3910> task#=22 at 0x7fe1646e4550>, traceback: File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap self.__bootstrap_inner() File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner self.run() File: "/usr/lib64/python2.7/threading.py", line 765, in run self.__target(*self.__args, **self.__kwargs) File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 260, in run ret = func(*args, **kwargs) File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run self._execute_task() File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task task() File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__ self._callable() File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 186, in __call__ self._func() File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 481, in __call__ stats = hostapi.get_stats(self._cif, self._samples.stats()) File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 79, in get_stats ret['haStats'] = _getHaInfo() File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 182, in _getHaInfo stats = instance.get_all_stats(timeout=5) File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 94, in get_all_stats stats = broker.get_stats_from_storage() File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 146, in get_stats_from_storage result = self._proxy.get_stats() File: "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__ return self.__send(self.__name, args) File: "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request verbose=self.__verbose File: "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request return self.single_request(host, handler, request_body, verbose) File: "/usr/lib64/python2.7/xmlrpclib.py", line 1301, in single_request self.send_content(h, request_body) File: "/usr/lib64/python2.7/xmlrpclib.py", line 1448, in send_content connection.endheaders(request_body) File: "/usr/lib64/python2.7/httplib.py", line 1052, in endheaders self._send_output(message_body) File: "/usr/lib64/python2.7/httplib.py", line 890, in _send_output self.send(msg) File: "/usr/lib64/python2.7/httplib.py", line 852, in send self.connect() File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 60, in connect self.sock.connect(base64.b16decode(self.host)) File: "/usr/lib64/python2.7/socket.py", line 224, in meth return getattr(self._sock,name)(*args) vdsmd log: 2022-06-07 05:02:13,717+0000 INFO (jsonrpc/3) [vdsm.api] FINISH multipath_health return={} from=::1,54704, task_id=169b447f-fd52-4a6f-9277-eff9870ffb81 (api:54) 2022-06-07 05:02:13,725+0000 INFO (jsonrpc/3) [api.host] FINISH getStats return={'status': {'message': 'Done', 'code': 0}, 'info': {'cpuStatistics': {'20': {'cpuUser': '0.07', 'nodeIndex': 1, 'cpuSys': '0.00', 'cpuIdle': '99.93'}, '21': {'cpuUser': '0.00', 'nodeIndex': 1, 'cpuSys': '0.00', 'cpuIdle': '100.00'}, '22': {'cpuUser': '0.00', 'nodeIndex': 1, 'cpuSys': '0.00', 'cpuIdle': '100.00'}, '23': {'cpuUser': '0.00', 'nodeIndex': 1, 'cpuSys': '0.07', 'cpuIdle': '99.93'}, '1': {'cpuUser': '0.87', 'nodeIndex': 0, 'cpuSys': '0.60', 'cpuIdle': '98.53'}, '0': {'cpuUser': '0.40', 'nodeIndex': 0, 'cpuSys': '0.27', 'cpuIdle': '99.33'}, '3': {'cpuUser': '0.27', 'nodeIndex': 0, 'cpuSys': '0.33', 'cpuIdle': '99.40'}, '2': {'cpuUser': '0.53', 'nodeIndex': 0, 'cpuSys': '0.40', 'cpuIdle': '99.07'}, '5': {'cpuUser': '0.33', 'nodeIndex': 0, 'cpuSys': '0.33', 'cpuIdle': '99.34'}, '4': {'cpuUser': '0.33', 'nodeIndex': 0, 'cpuSys': '0.40', 'cpuIdle': '99.27'}, '7': {'cpuUser': '0.13', 'nodeIndex': 1, 'cpuSys': '0.07', 'cpuIdle': '99.80'}, '6': {'cpuUser': '0.13', 'nodeIndex': 1, 'cpuSys': '0.07', 'cpuIdle': '99.80'}, '9': {'cpuUser': '0.13', 'nodeIndex': 1, 'cpuSys': '0.13', 'cpuIdle': '99.74'}, '8': {'cpuUser': '0.07', 'nodeIndex': 1, 'cpuSys': '0.13', 'cpuIdle': '99.80'}, '11': {'cpuUser': '0.20', 'nodeIndex': 1, 'cpuSys': '0.13', 'cpuIdle': '99.67'}, '10': {'cpuUser': '0.20', 'nodeIndex': 1, 'cpuSys': '0.27', 'cpuIdle': '99.53'}, '13': {'cpuUser': '0.80', 'nodeIndex': 0, 'cpuSys': '0.47', 'cpuIdle': '98.73'}, '12': {'cpuUser': '0.00', 'nodeIndex': 0, 'cpuSys': '0.00', 'cpuIdle': '100.00'}, '15': {'cpuUser': '0.07', 'nodeIndex': 0, 'cpuSys': '0.00', 'cpuIdle': '99.93'}, '14': {'cpuUser': '0.20', 'nodeIndex': 0, 'cpuSys': '0.27', 'cpuIdle': '99.53'}, '17': {'cpuUser': '0.07', 'nodeIndex': 0, 'cpuSys': '0.00', 'cpuIdle': '99.93'}, '16': {'cpuUser': '0.07', 'nodeIndex': 0, 'cpuSys': '0.00', 'cpuIdle': '99.93'}, '19': {'cpuUser': '1.46', 'nodeIndex': 1, 'cpuSys': '0.33', 'cpuIdle' : '98.21'}, '18': {'cpuUser': '0.27', 'nodeIndex': 1, 'cpuSys': '0.07', 'cpuIdle': '99.66'}}, 'numaNodeMemFree': {'1': {'memPercent': 18, 'memFree': '52935'}, '0': {'memPercent': 17, 'memFree': '53551'}}, 'memShared': 0, 'haScore': 0, 'thpState': 'always', 'ksmMergeAcrossNodes': True, 'vmCount': 3, 'memUsed': '3', 'storageDomains': {}, 'incomingVmMigrations': 0, 'network': {'bond0': {'rxErrors': '0', 'txErrors': '0', 'speed': '2000', 'rxDropped': '4', 'name': 'bond0', 'tx': '84766340728', 'txDropped': '0', 'duplex': 'full', 'sampleTime': 1654578133.717023, 'rx': '102755561238', 'state': 'up'}, 'eno50': {'rxErrors': '0', 'txErrors': '0', 'speed': '1000', 'rxDropped': '0', 'name': 'eno50', 'tx': '0', 'txDropped': '0', 'duplex': 'unknown', 'sampleTime': 1654578133.717023, 'rx': '0', 'state': 'down'}, ';vdsmdummy;': {'rxErrors': '0', 'txErrors': '0', 'speed': '1000', 'rxDropped': '0', 'name': ';vdsmdummy;', 'tx': '0', 'txDropped': '0', 'duplex': 'unknown', 'sampleTime': 1654578133.71702 3, 'rx': '0', 'state': 'down'}, 'ovirtmgmt': {'rxErrors': '0', 'txErrors': '0', 'speed': '1000', 'rxDropped': '122', 'name': 'ovirtmgmt', 'tx': '81789466633', 'txDropped': '0', 'duplex': 'unknown', 'sampleTime': 1654578133.717023, 'rx': '99028291927', 'state': 'up'}, 'lo': {'rxErrors': '0', 'txErrors': '0', 'speed': '1000', 'rxDropped': '0', 'name': 'lo', 'tx': '10214215104', 'txDropped': '0', 'duplex': 'unknown', 'sampleTime': 1654578133.717023, 'rx': '10214215104', 'state': 'up'}, 'ovs-system': {'rxErrors': '0', 'txErrors': '0', 'speed': '1000', 'rxDropped': '0', 'name': 'ovs-system', 'tx': '0', 'txDropped': '0', 'duplex': 'unknown', 'sampleTime': 1654578133.717023, 'rx': '0', 'state': 'down'}, 'eno49': {'rxErrors': '0', 'txErrors': '0', 'speed': '1000', 'rxDropped': '0', 'name': 'eno49', 'tx': '0', 'txDropped': '0', 'duplex': 'unknown', 'sampleTime': 1654578133.717023, 'rx': '0', 'state': 'down'}, 'br-int': {'rxErrors': '0', 'txErrors': '0', 'speed': '1000', 'rxDropped': '0', 'na me': 'br-int', 'tx': '0', 'txDropped': '0', 'duplex': 'unknown', 'sampleTime': 1654578133.717023, 'rx': '0', 'state': 'down'}, 'eno1': {'rxErrors': '0', 'txErrors': '0', 'speed': '1000', 'rxDropped': '0', 'name': 'eno1', 'tx': '0', 'txDropped': '0', 'duplex': 'unknown', 'sampleTime': 1654578133.717023, 'rx': '0', 'state': 'down'}, 'eno2': {'rxErrors': '0', 'txErrors': '0', 'speed': '1000', 'rxDropped': '0', 'name': 'eno2', 'tx': '0', 'txDropped': '0', 'duplex': 'unknown', 'sampleTime': 1654578133.717023, 'rx': '0', 'state': 'down'}, 'eno3': {'rxErrors': '0', 'txErrors': '0', 'speed': '1000', 'rxDropped': '1', 'name': 'eno3', 'tx': '42493183527', 'txDropped': '0', 'duplex': 'full', 'sampleTime': 1654578133.717023, 'rx': '20793574', 'state': 'up'}, 'eno4': {'rxErrors': '0', 'txErrors': '0', 'speed': '1000', 'rxDropped': '1', 'name': 'eno4', 'tx': '42273157201', 'txDropped': '0', 'duplex': 'full', 'sampleTime': 1654578133.717023, 'rx': '102734767664', 'state': 'up'}, 'genev_sys_6081': {'rxErrors': '0', 'txErrors': '8', 'speed': '1000', 'rxDropped': '0', 'name': 'genev_sys_6081', 'tx': '0', 'txDropped': '0', 'duplex': 'unknown', 'sampleTime': 1654578133.717023, 'rx': '0', 'state': 'up'}}, 'txDropped': '128', 'anonHugePages': '326', 'ksmPages': 100, 'elapsedTime': '391.12', 'cpuLoad': '0.16', 'cpuSys': '0.18', 'diskStats': {'/var/log': {'free': '6478'}, '/var/run/vdsm/': {'free': '64294'}, '/tmp': {'free': '906'}}, 'cpuUserVdsmd': '0.47', 'netConfigDirty': 'False', 'memCommitted': 28672, 'ksmState': False, 'vmMigrating': 0, 'ksmCpu': 0, 'memAvailable': 126975, 'bootTime': '1654505219', 'haStats': {'active': True, 'configured': True, 'score': 0, 'localMaintenance': True, 'globalMaintenance': False}, 'momStatus': 'active', 'multipathHealth': {}, 'rxDropped': '0', 'outgoingVmMigrations': 0, 'swapTotal': 32191, 'swapFree': 32191, 'hugepages': defaultdict(<type 'dict'>, {1048576: {'resv_hugepages': 0, 'free_hugepages': 0, 'nr_overcommit_hugepages': 0, 'surplus_hugepages ': 0, 'vm.free_hugepages': 0, 'nr_hugepages': 0, 'nr_hugepages_mempolicy': 0}, 2048: {'resv_hugepages': 0, 'free_hugepages': 0, 'nr_overcommit_hugepages': 0, 'surplus_hugepages': 0, 'vm.free_hugepages': 0, 'nr_hugepages': 0, 'nr_hugepages_mempolicy': 0}}), 'dateTime': '2022-06-07T05:02:13 GMT', 'cpuUser': '0.27', 'memFree': 126719, 'cpuIdle': '99.55', 'vmActive': 0, 'v2vJobs': {}, 'cpuSysVdsmd': '0.13'}} from=::1,54704 (api:54) 2022-06-07 05:02:13,726+0000 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getStats succeeded in 0.03 seconds (__init__:312) 2022-06-07 05:02:13,973+0000 INFO (vmrecovery) [vdsm.api] START getConnectedStoragePoolsList(options=None) from=internal, task_id=65629949-c74d-4f66-846a-1e2c40be9157 (api:48) 2022-06-07 05:02:13,973+0000 INFO (vmrecovery) [vdsm.api] FINISH getConnectedStoragePoolsList return={'poollist': []} from=internal, task_id=65629949-c74d-4f66-846a-1e2c40be9157 (api:54) 2022-06-07 05:02:13,973+0000 INFO (vmrecovery) [vds] recovery: waiting for storage pool to go up (clientIF:723) 2022-06-07 05:02:14,468+0000 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:312) 2022-06-07 05:02:14,471+0000 INFO (jsonrpc/5) [vdsm.api] START repoStats(domains=[u'5d6276cc-08ab-47f6-81e7-4e64aac3d386']) from=::1,54704, task_id=6918f715-bbf4-4c7f-af00-1764294ab665 (api:48) 2022-06-07 05:02:14,472+0000 INFO (jsonrpc/5) [vdsm.api] FINISH repoStats return={} from=::1,54704, task_id=6918f715-bbf4-4c7f-af00-1764294ab665 (api:54) 2022-06-07 05:02:14,472+0000 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getStorageRepoStats succeeded in 0.00 seconds (__init__:312) Gluster volumes are all online. systemctl status ovirt-ha-broker: Jun 06 08:47:08 ovirt3.os-s.de systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker. Jun 06 08:47:59 ovirt3.os-s.de ovirt-ha-broker[1524]: ovirt-ha-broker mgmt_bridge.MgmtBridge ERROR Failed to getVdsStats: No 'network' in result Any ideas? Ralf _______________________________________________ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-le...@ovirt.org Privacy Statement: https://www.ovirt.org/privacy-policy.html oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/XT5WJCZKLIEDNI65IDTBY56VO7X7JOLT/