I have a three-node oVirt cluster where one node has stale data for the hosted engine, but the other two nodes do not.
Output of `hosted-engine --vm-status` on a good node:

```
!! Cluster is in GLOBAL MAINTENANCE mode !!

--== Host ovirt2.low.mdds.tcs-sec.com (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt2.low.mdds.tcs-sec.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : f91f57e4
local_conf_timestamp               : 9915242
Host timestamp                     : 9915241
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=9915241 (Fri Mar 27 14:38:14 2020)
        host-id=1
        score=3400
        vm_conf_refresh_time=9915242 (Fri Mar 27 14:38:14 2020)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False

--== Host ovirt1.low.mdds.tcs-sec.com (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt1.low.mdds.tcs-sec.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 48f9c0fc
local_conf_timestamp               : 9218845
Host timestamp                     : 9218845
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=9218845 (Fri Mar 27 14:38:22 2020)
        host-id=2
        score=3400
        vm_conf_refresh_time=9218845 (Fri Mar 27 14:38:22 2020)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False

--== Host ovirt3.low.mdds.tcs-sec.com (id: 3) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : ovirt3.low.mdds.tcs-sec.com
Host ID                            : 3
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 620c8566
local_conf_timestamp               : 1208310
Host timestamp                     : 1208310
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1208310 (Mon Dec 16 21:14:24 2019)
        host-id=3
        score=3400
        vm_conf_refresh_time=1208310 (Mon Dec 16 21:14:24 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False

!! Cluster is in GLOBAL MAINTENANCE mode !!
```

I tried the steps in https://access.redhat.com/discussions/3511881, but `hosted-engine --vm-status` on the node with stale data shows:

```
The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.
```

On the stale node, ovirt-ha-agent and ovirt-ha-broker are continually restarting.
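For reference, the restart loop on the stale node is visible with standard systemd tooling, along these lines:

```
# Both HA services cycle between activating and failed
systemctl status ovirt-ha-agent ovirt-ha-broker

# Follow the broker while systemd restarts it
journalctl -fu ovirt-ha-broker
```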
Since the agent appears to depend on the broker, I looked at the broker log first; it contains this snippet, repeating roughly every 3 seconds:

```
MainThread::INFO::2020-03-27 15:01:06,584::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started
MainThread::INFO::2020-03-27 15:01:06,584::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
MainThread::INFO::2020-03-27 15:01:06,585::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
MainThread::INFO::2020-03-27 15:01:06,585::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
MainThread::INFO::2020-03-27 15:01:06,585::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
MainThread::INFO::2020-03-27 15:01:06,587::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
MainThread::INFO::2020-03-27 15:01:06,587::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
MainThread::INFO::2020-03-27 15:01:06,587::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
MainThread::INFO::2020-03-27 15:01:06,588::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
MainThread::INFO::2020-03-27 15:01:06,588::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
MainThread::INFO::2020-03-27 15:01:06,589::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
MainThread::INFO::2020-03-27 15:01:06,589::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
MainThread::INFO::2020-03-27 15:01:06,589::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
MainThread::INFO::2020-03-27 15:01:06,589::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
MainThread::INFO::2020-03-27 15:01:06,590::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
MainThread::INFO::2020-03-27 15:01:06,590::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
MainThread::INFO::2020-03-27 15:01:06,590::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
MainThread::INFO::2020-03-27 15:01:06,678::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect) Connecting the storage
MainThread::INFO::2020-03-27 15:01:06,678::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2020-03-27 15:01:06,717::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2020-03-27 15:01:06,732::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::WARNING::2020-03-27 15:01:08,940::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: [Errno 5] Input/output error: '/rhev/data-center/mnt/glusterSD/ovirt2:_engine/182a4a94-743f-4941-89c1-dc2008ae1cf5/ha_agent/hosted-engine.lockspace'
```

I restarted the stale node yesterday, but it still shows stale data from December of last year. What is the recommended way to recover from this? (This came to my attention when warnings about space on the /var/log partition began popping up.)

Thank you,
Randall
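P.S. Since the `[Errno 5]` above points at the gluster fuse mount, a rough way to narrow it down on the stale node might be something like the following (the volume name `engine` is only inferred from the `ovirt2:_engine` mount path, so adjust as needed):

```
# Does the fuse mount itself return I/O errors for the ha_agent directory?
ls -l /rhev/data-center/mnt/glusterSD/ovirt2:_engine/182a4a94-743f-4941-89c1-dc2008ae1cf5/ha_agent/

# Is the volume up, and do the hosted-engine files need healing?
# (volume name "engine" is an assumption based on the mount path)
gluster volume status engine
gluster volume heal engine info
```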