OK, let's go through this. I'm looking at the node that at least still has some VMs running. virsh also tells me that the HostedEngine VM is running but it's unresponsive and I can't shut it down.
1. All storage domains exist and are mounted.

2. The ha_agent exists:

[root@ovirt-node-01 ovirt-hosted-engine-ha]# ls /rhev/data-center/mnt/nas-01.phoelex.com\:_volume2_vmstore/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/
dom_md  ha_agent  images  master

3. There are two links:

[root@ovirt-node-01 ovirt-hosted-engine-ha]# ll /rhev/data-center/mnt/nas-01.phoelex.com\:_volume2_vmstore/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/ha_agent/
total 8
lrwxrwxrwx. 1 vdsm kvm 132 Apr  2 14:50 hosted-engine.lockspace -> /var/run/vdsm/storage/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/ffb90b82-42fe-4253-85d5-aaec8c280aaf/90e68791-0c6f-406a-89ac-e0d86c631604
lrwxrwxrwx. 1 vdsm kvm 132 Apr  2 14:50 hosted-engine.metadata -> /var/run/vdsm/storage/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/2161aed0-7250-4c1d-b667-ac94f60af17e/6b818e33-f80a-48cc-a59c-bba641e027d4

4. The services exist but all seem to have some sort of warning:

a) Apr 08 18:10:55 ovirt-node-01.phoelex.com sanlock[1728]: *2020-04-08 18:10:55 1744152 [36796]: s16 delta_renew long write time 10 sec*

b) Mar 23 18:02:59 ovirt-node-01.phoelex.com supervdsmd[29409]: *failed to load module nvdimm: libbd_nvdimm.so.2: cannot open shared object file: No such file or directory*

c) Apr 09 08:05:13 ovirt-node-01.phoelex.com vdsm[4801]: *ERROR failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished?*

d) Apr 08 22:48:27 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-08 22:48:27.134+0000: 29309: warning : qemuGetProcessInfo:1404 : cannot parse process status data
Apr 08 22:48:27 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-08 22:48:27.134+0000: 29309: error : virNetDevTapInterfaceStats:764 : internal error: /proc/net/dev: Interface not found
Apr 08 23:09:39 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-08 23:09:39.844+0000: 29307: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Apr 09 01:05:26 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-09 01:05:26.660+0000: 29307: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error

5 & 6. The broker log is continually printing this error:

MainThread::INFO::2020-04-09 08:07:31,438::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started
MainThread::DEBUG::2020-04-09 08:07:31,438::broker::55::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Running broker
MainThread::DEBUG::2020-04-09 08:07:31,438::broker::120::ovirt_hosted_engine_ha.broker.broker.Broker::(_get_monitor) Starting monitor
MainThread::INFO::2020-04-09 08:07:31,438::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
MainThread::INFO::2020-04-09 08:07:31,439::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
MainThread::INFO::2020-04-09 08:07:31,440::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
MainThread::INFO::2020-04-09 08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
MainThread::INFO::2020-04-09 08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
MainThread::INFO::2020-04-09 08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
MainThread::INFO::2020-04-09 08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
MainThread::INFO::2020-04-09 08:07:31,442::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
MainThread::INFO::2020-04-09 08:07:31,442::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
MainThread::INFO::2020-04-09 08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
MainThread::INFO::2020-04-09 08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
MainThread::INFO::2020-04-09 08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
MainThread::INFO::2020-04-09 08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
MainThread::INFO::2020-04-09 08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
MainThread::INFO::2020-04-09 08:07:31,444::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
MainThread::INFO::2020-04-09 08:07:31,444::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
MainThread::DEBUG::2020-04-09 08:07:31,444::broker::128::ovirt_hosted_engine_ha.broker.broker.Broker::(_get_storage_broker) Starting storage broker
MainThread::DEBUG::2020-04-09 08:07:31,444::storage_backends::369::ovirt_hosted_engine_ha.lib.storage_backends::(connect) Connecting to VDSM
MainThread::DEBUG::2020-04-09 08:07:31,444::util::384::ovirt_hosted_engine_ha.lib.storage_backends::(__log_debug) Creating a new json-rpc connection to VDSM
Client localhost:54321::DEBUG::2020-04-09 08:07:31,453::concurrent::258::root::(run) START thread <Thread(Client localhost:54321, started daemon 139992488138496)> (func=<bound method Reactor.process_requests of <yajsonrpc.betterAsyncore.Reactor object at 0x7f528acabc90>>, args=(), kwargs={})
Client localhost:54321::DEBUG::2020-04-09 08:07:31,459::stompclient::138::yajsonrpc.protocols.stomp.AsyncClient::(_process_connected) Stomp connection established
MainThread::DEBUG::2020-04-09 08:07:31,467::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::INFO::2020-04-09 08:07:31,530::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect) Connecting the storage
MainThread::INFO::2020-04-09 08:07:31,531::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::DEBUG::2020-04-09 08:07:31,531::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2020-04-09 08:07:31,534::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2020-04-09 08:07:32,199::storage_server::158::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(_validate_pre_connected_path) Storage domain a6cea67d-dbfb-45cf-a775-b4d0d47b26f2 is not available
MainThread::INFO::2020-04-09 08:07:32,199::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::DEBUG::2020-04-09 08:07:32,199::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2020-04-09 08:07:32,814::storage_server::363::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) [{u'status': 0, u'id': u'e29cf818-5ee5-46e1-85c1-8aeefa33e95d'}]
MainThread::INFO::2020-04-09 08:07:32,814::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::DEBUG::2020-04-09 08:07:32,815::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2020-04-09 08:07:33,129::storage_server::420::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Error refreshing storage domain: Command StorageDomain.getStats with args {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed: (code=350, message=Error in storage domain action: (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
MainThread::DEBUG::2020-04-09 08:07:33,130::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2020-04-09 08:07:33,795::storage_backends::208::ovirt_hosted_engine_ha.lib.storage_backends::(_get_sector_size) Command StorageDomain.getInfo with args {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed: (code=350, message=Error in storage domain action: (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
MainThread::WARNING::2020-04-09 08:07:33,795::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Command StorageDomain.getInfo with args {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed: (code=350, message=Error in storage domain action: (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))

The UUID it is moaning about is indeed the one that the HA sits on and is the one I listed the contents of in step 2 above. So why can't it see this domain?

Thanks, Shareef.

On Thu, Apr 9, 2020 at 6:12 AM Strahil Nikolov <hunter86...@yahoo.com> wrote:

> On April 9, 2020 1:51:05 AM GMT+03:00, Shareef Jalloq <shar...@jalloq.co.uk> wrote:
> >Don't know if this is useful or not, but I just tried to shut down and start
> >another VM on one of the hosts and get the following error:
> >
> >virsh # start scratch
> >
> >error: Failed to start domain scratch
> >
> >error: Network not found: no network with matching name 'vdsm-ovirtmgmt'
> >
> >Is this not referring to the interface name as the network is called 'ovirtmgmt'.
> >
> >On Wed, Apr 8, 2020 at 11:35 PM Shareef Jalloq <shar...@jalloq.co.uk> wrote:
> >
> >> Hmmm, virsh tells me the HE is running but it hasn't come up and the
> >> agent.log is full of the same errors.
> >>
> >> On Wed, Apr 8, 2020 at 11:31 PM Shareef Jalloq <shar...@jalloq.co.uk> wrote:
> >>
> >>> Ah hah! Ok, so I've managed to start it using virsh on the second host
> >>> but my first host is still dead.
> >>>
> >>> First of all, what are these 56,317 .prob- files that get dumped to the
> >>> NFS mounts?
> >>>
> >>> Secondly, why doesn't the node mount the NFS directories at boot? Is
> >>> that the issue with this particular node?
> >>>
> >>> On Wed, Apr 8, 2020 at 11:12 PM <eev...@digitaldatatechs.com> wrote:
> >>>
> >>>> Did you try virsh list --inactive
> >>>>
> >>>> Eric Evans
> >>>> Digital Data Services LLC.
> >>>> 304.660.9080
> >>>>
> >>>> *From:* Shareef Jalloq <shar...@jalloq.co.uk>
> >>>> *Sent:* Wednesday, April 8, 2020 5:58 PM
> >>>> *To:* Strahil Nikolov <hunter86...@yahoo.com>
> >>>> *Cc:* Ovirt Users <users@ovirt.org>
> >>>> *Subject:* [ovirt-users] Re: ovirt-engine unresponsive - how to rescue?
> >>>>
> >>>> I've now shut down the VMs on one host and rebooted it but the agent
> >>>> service doesn't start. If I run 'hosted-engine --vm-status' I get:
> >>>>
> >>>> The hosted engine configuration has not been retrieved from shared
> >>>> storage. Please ensure that ovirt-ha-agent is running and the storage
> >>>> server is reachable.
> >>>>
> >>>> and indeed if I list the mounts under /rhev/data-center/mnt, only one of
> >>>> the directories is mounted. I have 3 NFS mounts, one ISO Domain and two
> >>>> Data Domains. Only one Data Domain has mounted and this has lots of .prob
> >>>> files in. So why haven't the other NFS exports been mounted?
> >>>>
> >>>> Manually mounting them doesn't seem to have helped much either. I can
> >>>> start the broker service but the agent service says no. Same error as the
> >>>> one in my last email.
> >>>>
> >>>> Shareef.
> >>>>
> >>>> On Wed, Apr 8, 2020 at 9:57 PM Shareef Jalloq <shar...@jalloq.co.uk> wrote:
> >>>>
> >>>> Right, still down. I've run virsh and it doesn't know anything about
> >>>> the engine vm.
> >>>>
> >>>> I've restarted the broker and agent services and I still get nothing in
> >>>> virsh->list.
> >>>>
> >>>> In the logs under /var/log/ovirt-hosted-engine-ha I see lots of errors:
> >>>>
> >>>> broker.log:
> >>>>
> >>>> MainThread::INFO::2020-04-08 20:56:20,138::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started
> >>>> MainThread::INFO::2020-04-08 20:56:20,138::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
> >>>> MainThread::INFO::2020-04-08 20:56:20,138::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
> >>>> MainThread::INFO::2020-04-08 20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
> >>>> MainThread::INFO::2020-04-08 20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
> >>>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
> >>>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
> >>>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
> >>>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
> >>>> MainThread::INFO::2020-04-08 20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
> >>>> MainThread::INFO::2020-04-08 20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
> >>>> MainThread::INFO::2020-04-08 20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
> >>>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
> >>>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
> >>>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
> >>>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
> >>>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
> >>>> MainThread::INFO::2020-04-08 20:56:20,197::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect) Connecting the storage
> >>>> MainThread::INFO::2020-04-08 20:56:20,197::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
> >>>> MainThread::INFO::2020-04-08 20:56:20,414::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
> >>>> MainThread::INFO::2020-04-08 20:56:20,628::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
> >>>> MainThread::WARNING::2020-04-08 20:56:21,057::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Command StorageDomain.getInfo with args {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed: (code=350, message=Error in storage domain action: (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
> >>>> MainThread::INFO::2020-04-08 20:56:21,901::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started
> >>>> MainThread::INFO::2020-04-08 20:56:21,901::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
> >>>>
> >>>> agent.log:
> >>>>
> >>>> MainThread::ERROR::2020-04-08 20:57:00,799::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
> >>>> MainThread::INFO::2020-04-08 20:57:00,799::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
> >>>> MainThread::INFO::2020-04-08 20:57:11,144::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.3.6 started
> >>>> MainThread::INFO::2020-04-08 20:57:11,182::hosted_engine::234::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: ovirt-node-01.phoelex.com
> >>>> MainThread::INFO::2020-04-08 20:57:11,294::hosted_engine::543::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
> >>>> MainThread::INFO::2020-04-08 20:57:11,296::brokerlink::80::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'tcp_t_address': '', 'network_test': 'dns', 'tcp_t_port': '', 'addr': '192.168.1.99'}
> >>>> MainThread::ERROR::2020-04-08 20:57:11,296::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
> >>>> MainThread::ERROR::2020-04-08 20:57:11,297::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
> >>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
> >>>>     return action(he)
> >>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
> >>>>     return he.start_monitoring()
> >>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring
> >>>>     self._initialize_broker()
> >>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 556, in _initialize_broker
> >>>>     m.get('options', {}))
> >>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 89, in start_monitor
> >>>>     ).format(t=type, o=options, e=e)
> >>>> RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': '', 'network_test': 'dns', 'tcp_t_port': '', 'addr': '192.168.1.99'}]
> >>>>
> >>>> MainThread::ERROR::2020-04-08 20:57:11,297::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
> >>>> MainThread::INFO::2020-04-08 20:57:11,297::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
> >>>>
> >>>> On Wed, Apr 8, 2020 at 6:10 PM Strahil Nikolov <hunter86...@yahoo.com> wrote:
> >>>>
> >>>> On April 8, 2020 7:47:20 PM GMT+03:00, "Maton, Brett" <mat...@ltresources.co.uk> wrote:
> >>>> >On the host you tried to restart the engine on:
> >>>> >
> >>>> >Add an alias to virsh (authenticates with virsh_auth.conf)
> >>>> >
> >>>> >alias virsh='virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf'
> >>>> >
> >>>> >Then run virsh:
> >>>> >
> >>>> >virsh
> >>>> >
> >>>> >virsh # list
> >>>> > Id    Name                           State
> >>>> >----------------------------------------------------
> >>>> > xx    HostedEngine                   Paused
> >>>> > xx    **********                     running
> >>>> > ...
> >>>> > xx    **********                     running
> >>>> >
> >>>> >HostedEngine should be in the list, try and resume the engine:
> >>>> >
> >>>> >virsh # resume HostedEngine
> >>>> >
> >>>> >On Wed, 8 Apr 2020 at 17:28, Shareef Jalloq <shar...@jalloq.co.uk> wrote:
> >>>> >
> >>>> >> Thanks!
> >>>> >>
> >>>> >> The status hangs due to, I guess, the VM being down....
> >>>> >>
> >>>> >> [root@ovirt-node-01 ~]# hosted-engine --vm-start
> >>>> >> VM exists and is down, cleaning up and restarting
> >>>> >> VM in WaitForLaunch
> >>>> >>
> >>>> >> but this doesn't seem to do anything. OK, after a while I get a status of
> >>>> >> it being barfed...
> >>>> >>
> >>>> >> --== Host ovirt-node-00.phoelex.com (id: 1) status ==--
> >>>> >>
> >>>> >> conf_on_shared_storage             : True
> >>>> >> Status up-to-date                  : False
> >>>> >> Hostname                           : ovirt-node-00.phoelex.com
> >>>> >> Host ID                            : 1
> >>>> >> Engine status                      : unknown stale-data
> >>>> >> Score                              : 3400
> >>>> >> stopped                            : False
> >>>> >> Local maintenance                  : False
> >>>> >> crc32                              : 9c4a034b
> >>>> >> local_conf_timestamp               : 523362
> >>>> >> Host timestamp                     : 523608
> >>>> >> Extra metadata (valid at timestamp):
> >>>> >>         metadata_parse_version=1
> >>>> >>         metadata_feature_version=1
> >>>> >>         timestamp=523608 (Wed Apr 8 16:17:11 2020)
> >>>> >>         host-id=1
> >>>> >>         score=3400
> >>>> >>         vm_conf_refresh_time=523362 (Wed Apr 8 16:13:06 2020)
> >>>> >>         conf_on_shared_storage=True
> >>>> >>         maintenance=False
> >>>> >>         state=EngineDown
> >>>> >>         stopped=False
> >>>> >>
> >>>> >> --== Host ovirt-node-01.phoelex.com (id: 2) status ==--
> >>>> >>
> >>>> >> conf_on_shared_storage             : True
> >>>> >> Status up-to-date                  : True
> >>>> >> Hostname                           : ovirt-node-01.phoelex.com
> >>>> >> Host ID                            : 2
> >>>> >> Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
> >>>> >> Score                              : 0
> >>>> >> stopped                            : False
> >>>> >> Local maintenance                  : False
> >>>> >> crc32                              : 5045f2eb
> >>>> >> local_conf_timestamp               : 1737037
> >>>> >> Host timestamp                     : 1737283
> >>>> >> Extra metadata (valid at timestamp):
> >>>> >>         metadata_parse_version=1
> >>>> >>         metadata_feature_version=1
> >>>> >>         timestamp=1737283 (Wed Apr 8 16:16:17 2020)
> >>>> >>         host-id=2
> >>>> >>         score=0
> >>>> >>         vm_conf_refresh_time=1737037 (Wed Apr 8 16:12:11 2020)
> >>>> >>         conf_on_shared_storage=True
> >>>> >>         maintenance=False
> >>>> >>         state=EngineUnexpectedlyDown
> >>>> >>         stopped=False
> >>>> >>
> >>>> >> On Wed, Apr 8, 2020 at 5:09 PM Maton, Brett <mat...@ltresources.co.uk> wrote:
> >>>> >>
> >>>> >>> First steps, on one of your hosts as root:
> >>>> >>>
> >>>> >>> To get information:
> >>>> >>> hosted-engine --vm-status
> >>>> >>>
> >>>> >>> To start the engine:
> >>>> >>> hosted-engine --vm-start
> >>>> >>>
> >>>> >>> On Wed, 8 Apr 2020 at 17:00, Shareef Jalloq <shar...@jalloq.co.uk> wrote:
> >>>> >>>
> >>>> >>>> So my engine has gone down and I can't ssh into it either. If I try to
> >>>> >>>> log into the web-ui of the node it is running on, I get redirected because
> >>>> >>>> the node can't reach the engine.
> >>>> >>>>
> >>>> >>>> What are my next steps?
> >>>> >>>>
> >>>> >>>> Shareef.
> >>>> >>>> _______________________________________________
> >>>> >>>> Users mailing list -- users@ovirt.org
> >>>> >>>> To unsubscribe send an email to users-le...@ovirt.org
> >>>> >>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> >>>> >>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> >>>> >>>> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/W7BP57OCIRSW5CDRQWR5MIKJUH3ISLCQ/
> >>>> >>>>
> >>>> >>>
> >>>>
> >>>> This has to be resolved:
> >>>>
> >>>> Engine status : unknown stale-data
> >>>>
> >>>> Run 'hosted-engine --vm-status' again. If it remains the same, restart
> >>>> ovirt-ha-broker.service & ovirt-ha-agent.service
> >>>>
> >>>> Verify that the engine's storage is available. Then monitor the broker
> >>>> & agent logs in /var/log/ovirt-hosted-engine-ha
> >>>>
> >>>> Best Regards,
> >>>> Strahil Nikolov
> >>>>
>
> Hi Shareef,
>
> The activation flow in oVirt is more complex than in plain KVM.
> Mounting of the domains happens during the activation of the node (the
> HostedEngine is activating everything needed).
>
> Focus on the HostedEngine VM.
> Is it running properly?
>
> If not, try:
> 1. Verify that the storage domain exists
> 2. Check if it has an 'ha_agent' directory
> 3. Check if the links are OK; if not, you can safely remove the links
>
> 4. Next, check that the services are running:
> A) sanlock
> B) supervdsmd
> C) vdsmd
> D) libvirtd
>
> 5. Increase the log level for the broker and agent services:
>
> cd /etc/ovirt-hosted-engine-ha
> vim *-log.conf
>
> systemctl restart ovirt-ha-broker ovirt-ha-agent
>
> 6. Check what they are complaining about
> Keep in mind that the agent will keep throwing errors until the broker stops
> doing so (the agent depends on the broker), so the broker must be OK before
> proceeding with the agent log.
>
> About the manual VM start, you need 2 things:
>
> 1. Define the VM network
> # cat vdsm-ovirtmgmt.xml
> <network>
>   <name>vdsm-ovirtmgmt</name>
>   <uuid>8ded486e-e681-4754-af4b-5737c2b05405</uuid>
>   <forward mode='bridge'/>
>   <bridge name='ovirtmgmt'/>
> </network>
>
> [root@ovirt1 HostedEngine-RECOVERY]# virsh net-define vdsm-ovirtmgmt.xml
>
> 2. Get an XML definition, which can be found in the vdsm log. Every VM has
> its configuration printed out at startup in the vdsm log on the host it
> starts on.
> Save it to a file and then:
> A) virsh define myvm.xml
> B) virsh start myvm
>
> It seems there is/was a problem with your NFS shares.
>
>
> Best Regards,
> Strahil Nikolov
>
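
For reference, steps 2, 3 and 6 of the quoted checklist could be scripted roughly as below. This is only a sketch under a few assumptions: the helper names check_ha_links and ha_log_problems are made up for illustration and are not part of oVirt, it assumes bash, and it relies on the ha_agent link names and the "Thread::LEVEL::timestamp::..." log layout shown in this thread. It only reports what it finds; it does not remove or recreate any links.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of oVirt.

# Report whether the two hosted-engine symlinks in an ha_agent directory
# exist and resolve to something. Returns non-zero if any check fails.
check_ha_links() {
    local dir="$1" name path rc=0
    if [ ! -d "$dir" ]; then
        echo "MISSING_DIR $dir"
        return 1
    fi
    for name in hosted-engine.lockspace hosted-engine.metadata; do
        path="$dir/$name"
        if [ ! -L "$path" ]; then
            echo "NOT_A_LINK $name"
            rc=1
        elif [ ! -e "$path" ]; then
            # -e follows the symlink, so it fails when the target is gone
            echo "DANGLING $name -> $(readlink "$path")"
            rc=1
        else
            echo "OK $name -> $(readlink "$path")"
        fi
    done
    return "$rc"
}

# Print only WARNING and ERROR entries from a broker.log or agent.log,
# matching on the "::LEVEL::" field of each entry.
ha_log_problems() {
    grep -E '::(WARNING|ERROR)::' "$1"
}
```

Usage would be along the lines of `check_ha_links <storage domain mount>/<sd_uuid>/ha_agent` followed by `ha_log_problems /var/log/ovirt-hosted-engine-ha/broker.log`. A DANGLING result would mean the link exists on the NFS share but its target under /var/run/vdsm/storage is not there, which a plain `ll` of the directory does not reveal.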
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/43C2BK2GOWXFD65E4ZRSKBN6D36VZ7GZ/