Hmmm, virsh tells me the HE is running but it hasn't come up and the agent.log is full of the same errors.
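To see which errors are actually repeating, here is a minimal sketch that tallies the distinct ERROR messages in agent.log. It assumes the `::`-separated layout visible in the quoted logs (thread, level, timestamp, module, line number, logger, then function and message). The name `count_errors` is hypothetical and not part of ovirt-hosted-engine-ha.

```python
from collections import Counter


def count_errors(lines):
    """Tally distinct ERROR messages from ovirt-hosted-engine-ha style log lines."""
    counts = Counter()
    for line in lines:
        # Assumed layout:
        # thread::LEVEL::timestamp::module::lineno::logger::(func) message
        parts = line.rstrip("\n").split("::", 6)
        if len(parts) < 7 or parts[1] != "ERROR":
            # Skips other levels and traceback continuation lines.
            continue
        counts[parts[6]] += 1
    return counts
```

Something like `count_errors(open("/var/log/ovirt-hosted-engine-ha/agent.log")).most_common(5)` should then show the handful of messages that keep coming back.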
On Wed, Apr 8, 2020 at 11:31 PM Shareef Jalloq <shar...@jalloq.co.uk> wrote:

> Ah hah! Ok, so I've managed to start it using virsh on the second host
> but my first host is still dead.
>
> First of all, what are these 56,317 .prob- files that get dumped to the
> NFS mounts?
>
> Secondly, why doesn't the node mount the NFS directories at boot? Is that
> the issue with this particular node?
>
> On Wed, Apr 8, 2020 at 11:12 PM <eev...@digitaldatatechs.com> wrote:
>
>> Did you try virsh list --inactive
>>
>> Eric Evans
>> Digital Data Services LLC.
>> 304.660.9080
>>
>> *From:* Shareef Jalloq <shar...@jalloq.co.uk>
>> *Sent:* Wednesday, April 8, 2020 5:58 PM
>> *To:* Strahil Nikolov <hunter86...@yahoo.com>
>> *Cc:* Ovirt Users <users@ovirt.org>
>> *Subject:* [ovirt-users] Re: ovirt-engine unresponsive - how to rescue?
>>
>> I've now shut down the VMs on one host and rebooted it but the agent
>> service doesn't start. If I run 'hosted-engine --vm-status' I get:
>>
>> The hosted engine configuration has not been retrieved from shared
>> storage. Please ensure that ovirt-ha-agent is running and the storage
>> server is reachable.
>>
>> and indeed if I list the mounts under /rhev/data-center/mnt, only one of
>> the directories is mounted. I have 3 NFS mounts, one ISO Domain and two
>> Data Domains. Only one Data Domain has mounted and this has lots of .prob
>> files in. So why haven't the other NFS exports been mounted?
>>
>> Manually mounting them doesn't seem to have helped much either. I can
>> start the broker service but the agent service says no. Same error as the
>> one in my last email.
>>
>> Shareef.
>>
>> On Wed, Apr 8, 2020 at 9:57 PM Shareef Jalloq <shar...@jalloq.co.uk>
>> wrote:
>>
>> Right, still down. I've run virsh and it doesn't know anything about the
>> engine vm.
>>
>> I've restarted the broker and agent services and I still get nothing in
>> virsh->list.
>>
>> In the logs under /var/log/ovirt-hosted-engine-ha I see lots of errors:
>>
>> broker.log:
>>
>> MainThread::INFO::2020-04-08 20:56:20,138::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started
>> MainThread::INFO::2020-04-08 20:56:20,138::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
>> MainThread::INFO::2020-04-08 20:56:20,138::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
>> MainThread::INFO::2020-04-08 20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
>> MainThread::INFO::2020-04-08 20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
>> MainThread::INFO::2020-04-08 20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
>> MainThread::INFO::2020-04-08 20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
>> MainThread::INFO::2020-04-08 20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
>> MainThread::INFO::2020-04-08 20:56:20,197::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect) Connecting the storage
>> MainThread::INFO::2020-04-08 20:56:20,197::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
>> MainThread::INFO::2020-04-08 20:56:20,414::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
>> MainThread::INFO::2020-04-08 20:56:20,628::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
>> MainThread::WARNING::2020-04-08 20:56:21,057::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Command StorageDomain.getInfo with args {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed:
>> (code=350, message=Error in storage domain action: (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
>> MainThread::INFO::2020-04-08 20:56:21,901::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started
>> MainThread::INFO::2020-04-08 20:56:21,901::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
>>
>> agent.log:
>>
>> MainThread::ERROR::2020-04-08 20:57:00,799::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
>> MainThread::INFO::2020-04-08 20:57:00,799::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
>> MainThread::INFO::2020-04-08 20:57:11,144::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.3.6 started
>> MainThread::INFO::2020-04-08 20:57:11,182::hosted_engine::234::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: ovirt-node-01.phoelex.com
>> MainThread::INFO::2020-04-08 20:57:11,294::hosted_engine::543::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
>> MainThread::INFO::2020-04-08 20:57:11,296::brokerlink::80::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'tcp_t_address': '', 'network_test': 'dns', 'tcp_t_port': '', 'addr': '192.168.1.99'}
>> MainThread::ERROR::2020-04-08 20:57:11,296::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
>> MainThread::ERROR::2020-04-08 20:57:11,297::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
>>     return action(he)
>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
>>     return he.start_monitoring()
>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring
>>     self._initialize_broker()
>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 556, in _initialize_broker
>>     m.get('options', {}))
>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 89, in start_monitor
>>     ).format(t=type, o=options, e=e)
>> RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': '', 'network_test': 'dns', 'tcp_t_port': '', 'addr': '192.168.1.99'}]
>>
>> MainThread::ERROR::2020-04-08 20:57:11,297::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
>> MainThread::INFO::2020-04-08 20:57:11,297::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
>>
>> On Wed, Apr 8, 2020 at 6:10 PM Strahil Nikolov <hunter86...@yahoo.com> wrote:
>>
>> On April 8, 2020 7:47:20 PM GMT+03:00, "Maton, Brett" <mat...@ltresources.co.uk> wrote:
>> >On the host you tried to restart the engine on:
>> >
>> >Add an alias to virsh (authenticates with virsh_auth.conf)
>> >
>> >alias virsh='virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf'
>> >
>> >Then run virsh:
>> >
>> >virsh
>> >
>> >virsh # list
>> > Id    Name                           State
>> >----------------------------------------------------
>> > xx    HostedEngine                   Paused
>> > xx    **********                     running
>> > ...
>> > xx    **********                     running
>> >
>> >HostedEngine should be in the list, try and resume the engine:
>> >
>> >virsh # resume HostedEngine
>> >
>> >On Wed, 8 Apr 2020 at 17:28, Shareef Jalloq <shar...@jalloq.co.uk> wrote:
>> >
>> >> Thanks!
>> >>
>> >> The status hangs due to, I guess, the VM being down....
>> >>
>> >> [root@ovirt-node-01 ~]# hosted-engine --vm-start
>> >> VM exists and is down, cleaning up and restarting
>> >> VM in WaitForLaunch
>> >>
>> >> but this doesn't seem to do anything. OK, after a while I get a status of
>> >> it being barfed...
>> >>
>> >> --== Host ovirt-node-00.phoelex.com (id: 1) status ==--
>> >>
>> >> conf_on_shared_storage             : True
>> >> Status up-to-date                  : False
>> >> Hostname                           : ovirt-node-00.phoelex.com
>> >> Host ID                            : 1
>> >> Engine status                      : unknown stale-data
>> >> Score                              : 3400
>> >> stopped                            : False
>> >> Local maintenance                  : False
>> >> crc32                              : 9c4a034b
>> >> local_conf_timestamp               : 523362
>> >> Host timestamp                     : 523608
>> >> Extra metadata (valid at timestamp):
>> >>     metadata_parse_version=1
>> >>     metadata_feature_version=1
>> >>     timestamp=523608 (Wed Apr 8 16:17:11 2020)
>> >>     host-id=1
>> >>     score=3400
>> >>     vm_conf_refresh_time=523362 (Wed Apr 8 16:13:06 2020)
>> >>     conf_on_shared_storage=True
>> >>     maintenance=False
>> >>     state=EngineDown
>> >>     stopped=False
>> >>
>> >> --== Host ovirt-node-01.phoelex.com (id: 2) status ==--
>> >>
>> >> conf_on_shared_storage             : True
>> >> Status up-to-date                  : True
>> >> Hostname                           : ovirt-node-01.phoelex.com
>> >> Host ID                            : 2
>> >> Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
>> >> Score                              : 0
>> >> stopped                            : False
>> >> Local maintenance                  : False
>> >> crc32                              : 5045f2eb
>> >> local_conf_timestamp               : 1737037
>> >> Host timestamp                     : 1737283
>> >> Extra metadata (valid at timestamp):
>> >>     metadata_parse_version=1
>> >>     metadata_feature_version=1
>> >>     timestamp=1737283 (Wed Apr 8 16:16:17 2020)
>> >>     host-id=2
>> >>     score=0
>> >>     vm_conf_refresh_time=1737037 (Wed Apr 8 16:12:11 2020)
>> >>     conf_on_shared_storage=True
>> >>     maintenance=False
>> >>     state=EngineUnexpectedlyDown
>> >>     stopped=False
>> >>
>> >> On Wed, Apr 8, 2020 at 5:09 PM Maton, Brett <mat...@ltresources.co.uk> wrote:
>> >>
>> >>> First steps, on one of your hosts as root:
>> >>>
>> >>> To get information:
>> >>> hosted-engine --vm-status
>> >>>
>> >>> To start the engine:
>> >>> hosted-engine --vm-start
>> >>>
>> >>> On Wed, 8 Apr 2020 at 17:00, Shareef Jalloq <shar...@jalloq.co.uk> wrote:
>> >>>
>> >>>> So my engine has gone down and I can't ssh into it either. If I try to
>> >>>> log into the web-ui of the node it is running on, I get redirected because
>> >>>> the node can't reach the engine.
>> >>>>
>> >>>> What are my next steps?
>> >>>>
>> >>>> Shareef.
>> >>>> _______________________________________________
>> >>>> Users mailing list -- users@ovirt.org
>> >>>> To unsubscribe send an email to users-le...@ovirt.org
>> >>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>> >>>> oVirt Code of Conduct:
>> >>>> https://www.ovirt.org/community/about/community-guidelines/
>> >>>> List Archives:
>> >>>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/W7BP57OCIRSW5CDRQWR5MIKJUH3ISLCQ/
>>
>> This has to be resolved:
>>
>> Engine status : unknown stale-data
>>
>> Run again 'hosted-engine --vm-status'. If it remains the same, restart
>> ovirt-ha-broker.service & ovirt-ha-agent.service
>>
>> Verify that the engine's storage is available. Then monitor the broker &
>> agent logs in /var/log/ovirt-hosted-engine-ha
>>
>> Best Regards,
>> Strahil Nikolov
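
On the "verify that the engine's storage is available" point, a rough sketch for checking which directories under /rhev/data-center/mnt are actually mounted. It assumes each NFS domain shows up as a subdirectory there, as described earlier in the thread. `mount_status` is a made-up name, and the only real call it relies on is `os.path.ismount` from the standard library.

```python
import os


def mount_status(root="/rhev/data-center/mnt"):
    """Map each subdirectory of root to True if it is a mount point."""
    status = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isdir(path):
            # ismount() is True only when something is mounted on the
            # directory itself, not when it is a plain empty directory.
            status[name] = os.path.ismount(path)
    return status
```

Any entry that comes back False is a directory that exists but has nothing mounted on it, which would match the symptom of only one of the three domains being mounted.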
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/LAGON5G75LBSDQ7A2WOUEARW7ANL5GNL/