On April 9, 2020 11:12:30 AM GMT+03:00, Shareef Jalloq <shar...@jalloq.co.uk> wrote:
>OK, let's go through this. I'm looking at the node that at least still has
>some VMs running. virsh also tells me that the HostedEngine VM is running
>but it's unresponsive and I can't shut it down.
>
>1. All storage domains exist and are mounted.
>2. The ha_agent exists:
>
>[root@ovirt-node-01 ovirt-hosted-engine-ha]# ls /rhev/data-center/mnt/nas-01.phoelex.com\:_volume2_vmstore/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/
>
>dom_md ha_agent images master
>
>3. There are two links
>
>[root@ovirt-node-01 ovirt-hosted-engine-ha]# ll /rhev/data-center/mnt/nas-01.phoelex.com\:_volume2_vmstore/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/ha_agent/
>
>total 8
>
>lrwxrwxrwx. 1 vdsm kvm 132 Apr 2 14:50 hosted-engine.lockspace -> /var/run/vdsm/storage/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/ffb90b82-42fe-4253-85d5-aaec8c280aaf/90e68791-0c6f-406a-89ac-e0d86c631604
>
>lrwxrwxrwx. 1 vdsm kvm 132 Apr 2 14:50 hosted-engine.metadata -> /var/run/vdsm/storage/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/2161aed0-7250-4c1d-b667-ac94f60af17e/6b818e33-f80a-48cc-a59c-bba641e027d4
>
>4. The services exist but all seem to have some sort of warning:
>
>a) Apr 08 18:10:55 ovirt-node-01.phoelex.com sanlock[1728]: *2020-04-08 18:10:55 1744152 [36796]: s16 delta_renew long write time 10 sec*
>
>b) Mar 23 18:02:59 ovirt-node-01.phoelex.com supervdsmd[29409]: *failed to load module nvdimm: libbd_nvdimm.so.2: cannot open shared object file: No such file or directory*
>
>c) Apr 09 08:05:13 ovirt-node-01.phoelex.com vdsm[4801]: *ERROR failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished?*
>
>d) Apr 08 22:48:27 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-08 22:48:27.134+0000: 29309: warning : qemuGetProcessInfo:1404 : cannot parse process status data
>
>Apr 08 22:48:27 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-08 22:48:27.134+0000: 29309: error : virNetDevTapInterfaceStats:764 : internal error: /proc/net/dev: Interface not found
>
>Apr 08 23:09:39 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-08 23:09:39.844+0000: 29307: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
>
>Apr 09 01:05:26 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-09 01:05:26.660+0000: 29307: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
>
>5 & 6.
The broker log is continually printing this error: > >MainThread::INFO::2020-04-09 >08:07:31,438::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) >ovirt-hosted-engine-ha broker 2.3.6 started > >MainThread::DEBUG::2020-04-09 >08:07:31,438::broker::55::ovirt_hosted_engine_ha.broker.broker.Broker::(run) >Running broker > >MainThread::DEBUG::2020-04-09 >08:07:31,438::broker::120::ovirt_hosted_engine_ha.broker.broker.Broker::(_get_monitor) >Starting monitor > >MainThread::INFO::2020-04-09 >08:07:31,438::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Searching for submonitors in >/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker > >/submonitors > >MainThread::INFO::2020-04-09 >08:07:31,439::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Loaded submonitor network > >MainThread::INFO::2020-04-09 >08:07:31,440::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Loaded submonitor cpu-load-no-engine > >MainThread::INFO::2020-04-09 >08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Loaded submonitor mgmt-bridge > >MainThread::INFO::2020-04-09 >08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Loaded submonitor network > >MainThread::INFO::2020-04-09 >08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Loaded submonitor cpu-load > >MainThread::INFO::2020-04-09 >08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Loaded submonitor engine-health > >MainThread::INFO::2020-04-09 >08:07:31,442::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Loaded submonitor mgmt-bridge > >MainThread::INFO::2020-04-09 >08:07:31,442::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Loaded submonitor 
cpu-load-no-engine > >MainThread::INFO::2020-04-09 >08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Loaded submonitor cpu-load > >MainThread::INFO::2020-04-09 >08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Loaded submonitor mem-free > >MainThread::INFO::2020-04-09 >08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Loaded submonitor storage-domain > >MainThread::INFO::2020-04-09 >08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Loaded submonitor storage-domain > >MainThread::INFO::2020-04-09 >08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Loaded submonitor mem-free > >MainThread::INFO::2020-04-09 >08:07:31,444::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Loaded submonitor engine-health > >MainThread::INFO::2020-04-09 >08:07:31,444::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >Finished loading submonitors > >MainThread::DEBUG::2020-04-09 >08:07:31,444::broker::128::ovirt_hosted_engine_ha.broker.broker.Broker::(_get_storage_broker) >Starting storage broker > >MainThread::DEBUG::2020-04-09 >08:07:31,444::storage_backends::369::ovirt_hosted_engine_ha.lib.storage_backends::(connect) >Connecting to VDSM > >MainThread::DEBUG::2020-04-09 >08:07:31,444::util::384::ovirt_hosted_engine_ha.lib.storage_backends::(__log_debug) >Creating a new json-rpc connection to VDSM > >Client localhost:54321::DEBUG::2020-04-09 >08:07:31,453::concurrent::258::root::(run) START thread <Thread(Client >localhost:54321, started daemon 139992488138496)> (func=<bound method >Reactor.process_requests of <yajsonrpc.betterAsyncore.Reactor object at >0x7f528acabc90>>, args=(), kwargs={}) > >Client localhost:54321::DEBUG::2020-04-09 
>08:07:31,459::stompclient::138::yajsonrpc.protocols.stomp.AsyncClient::(_process_connected) >Stomp connection established > >MainThread::DEBUG::2020-04-09 >08:07:31,467::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending >response > >MainThread::INFO::2020-04-09 >08:07:31,530::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect) >Connecting the storage > >MainThread::INFO::2020-04-09 >08:07:31,531::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) >Connecting storage server > >MainThread::DEBUG::2020-04-09 >08:07:31,531::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending >response > >MainThread::DEBUG::2020-04-09 >08:07:31,534::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending >response > >MainThread::DEBUG::2020-04-09 >08:07:32,199::storage_server::158::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(_validate_pre_connected_path) >Storage domain a6cea67d-dbfb-45cf-a775-b4d0d47b26f2 is not available > >MainThread::INFO::2020-04-09 >08:07:32,199::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) >Connecting storage server > >MainThread::DEBUG::2020-04-09 >08:07:32,199::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending >response > >MainThread::DEBUG::2020-04-09 >08:07:32,814::storage_server::363::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) >[{u'status': 0, u'id': u'e29cf818-5ee5-46e1-85c1-8aeefa33e95d'}] > >MainThread::INFO::2020-04-09 >08:07:32,814::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) >Refreshing the storage domain > >MainThread::DEBUG::2020-04-09 >08:07:32,815::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending >response > >MainThread::DEBUG::2020-04-09 >08:07:33,129::storage_server::420::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) >Error refreshing 
storage domain: Command StorageDomain.getStats with args {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed:
>
>(code=350, message=Error in storage domain action: (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
>
>MainThread::DEBUG::2020-04-09 08:07:33,130::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending response
>
>MainThread::DEBUG::2020-04-09 08:07:33,795::storage_backends::208::ovirt_hosted_engine_ha.lib.storage_backends::(_get_sector_size) Command StorageDomain.getInfo with args {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed:
>
>(code=350, message=Error in storage domain action: (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
>
>MainThread::WARNING::2020-04-09 08:07:33,795::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Command StorageDomain.getInfo with args {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed:
>
>(code=350, message=Error in storage domain action: (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
>
>
>The UUID it is moaning about is indeed the one that the HA sits on and is
>the one I listed the contents of in step 2 above.
>
>
>So why can't it see this domain?
>
>
>Thanks, Shareef.
>
>On Thu, Apr 9, 2020 at 6:12 AM Strahil Nikolov <hunter86...@yahoo.com> wrote:
>
>> On April 9, 2020 1:51:05 AM GMT+03:00, Shareef Jalloq <shar...@jalloq.co.uk> wrote:
>> >Don't know if this is useful or not, but I just tried to shut down and start
>> >another VM on one of the hosts and get the following error:
>> >
>> >virsh # start scratch
>> >
>> >error: Failed to start domain scratch
>> >
>> >error: Network not found: no network with matching name 'vdsm-ovirtmgmt'
>> >
>> >Is this not referring to the interface name, as the network is called
>> >'ovirtmgmt'?
>> >
>> >On Wed, Apr 8, 2020 at 11:35 PM Shareef Jalloq <shar...@jalloq.co.uk>
>> >wrote:
>> >
>> >> Hmmm, virsh tells me the HE is running but it hasn't come up and the
>> >> agent.log is full of the same errors.
>> >>
>> >> On Wed, Apr 8, 2020 at 11:31 PM Shareef Jalloq <shar...@jalloq.co.uk>
>> >> wrote:
>> >>
>> >>> Ah hah! Ok, so I've managed to start it using virsh on the second host
>> >>> but my first host is still dead.
>> >>>
>> >>> First of all, what are these 56,317 .prob- files that get dumped to the
>> >>> NFS mounts?
>> >>>
>> >>> Secondly, why doesn't the node mount the NFS directories at boot? Is
>> >>> that the issue with this particular node?
>> >>>
>> >>> On Wed, Apr 8, 2020 at 11:12 PM <eev...@digitaldatatechs.com> wrote:
>> >>>
>> >>>> Did you try virsh list --inactive
>> >>>>
>> >>>> Eric Evans
>> >>>>
>> >>>> Digital Data Services LLC.
>> >>>>
>> >>>> 304.660.9080
>> >>>>
>> >>>> *From:* Shareef Jalloq <shar...@jalloq.co.uk>
>> >>>> *Sent:* Wednesday, April 8, 2020 5:58 PM
>> >>>> *To:* Strahil Nikolov <hunter86...@yahoo.com>
>> >>>> *Cc:* Ovirt Users <users@ovirt.org>
>> >>>> *Subject:* [ovirt-users] Re: ovirt-engine unresponsive - how to rescue?
>> >>>>
>> >>>> I've now shut down the VMs on one host and rebooted it but the agent
>> >>>> service doesn't start. If I run 'hosted-engine --vm-status' I get:
>> >>>>
>> >>>> The hosted engine configuration has not been retrieved from shared
>> >>>> storage. Please ensure that ovirt-ha-agent is running and the storage
>> >>>> server is reachable.
>> >>>>
>> >>>> and indeed if I list the mounts under /rhev/data-center/mnt, only one of
>> >>>> the directories is mounted. I have 3 NFS mounts, one ISO Domain and two
>> >>>> Data Domains. Only one Data Domain has mounted and this has lots of .prob
>> >>>> files in. So why haven't the other NFS exports been mounted?
>> >>>>
>> >>>> Manually mounting them doesn't seem to have helped much either. I can
>> >>>> start the broker service but the agent service says no. Same error as the
>> >>>> one in my last email.
>> >>>>
>> >>>> Shareef.
>> >>>>
>> >>>> On Wed, Apr 8, 2020 at 9:57 PM Shareef Jalloq <shar...@jalloq.co.uk>
>> >>>> wrote:
>> >>>>
>> >>>> Right, still down. I've run virsh and it doesn't know anything about
>> >>>> the engine vm.
>> >>>>
>> >>>> I've restarted the broker and agent services and I still get nothing in
>> >>>> virsh->list.
>> >>>>
>> >>>> In the logs under /var/log/ovirt-hosted-engine-ha I see lots of errors:
>> >>>>
>> >>>> broker.log:
>> >>>>
>> >>>> MainThread::INFO::2020-04-08 20:56:20,138::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started
>> >>>>
>> >>>> MainThread::INFO::2020-04-08 20:56:20,138::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
>> >>>>
>> >>>> MainThread::INFO::2020-04-08 20:56:20,138::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
>> >>>>
>> >>>> MainThread::INFO::2020-04-08 20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
>> >>>>
>> >>>> MainThread::INFO::2020-04-08 20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>>20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >> >>>> Loaded submonitor network >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >> >>>> Loaded submonitor cpu-load >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >> >>>> Loaded submonitor engine-health >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >> >>>> Loaded submonitor mgmt-bridge >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >> >>>> Loaded submonitor cpu-load-no-engine >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >> >>>> Loaded submonitor cpu-load >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >> >>>> Loaded submonitor mem-free >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >> >>>> Loaded submonitor storage-domain >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >> >>>> Loaded submonitor storage-domain >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >> >>>> Loaded submonitor mem-free >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> 
>>20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >> >>>> Loaded submonitor engine-health >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:20,143::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >> >>>> Finished loading submonitors >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:20,197::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect) >> >>>> Connecting the storage >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:20,197::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) >> >>>> Connecting storage server >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:20,414::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) >> >>>> Connecting storage server >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:20,628::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) >> >>>> Refreshing the storage domain >> >>>> >> >>>> MainThread::WARNING::2020-04-08 >> >>>> >> >> >>20:56:21,057::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) >> >>>> Can't connect vdsm storage: Command StorageDomain.getInfo with >args >> >>>> {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} >failed: >> >>>> >> >>>> (code=350, message=Error in storage domain action: >> >>>> (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',)) >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:21,901::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) >> >>>> ovirt-hosted-engine-ha broker 2.3.6 started >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:56:21,901::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) >> >>>> Searching for submonitors in >> 
>>>> >> >>/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors >> >>>> >> >>>> >> >>>> >> >>>> agent.log: >> >>>> >> >>>> >> >>>> >> >>>> MainThread::ERROR::2020-04-08 >> >>>> >> >> >>20:57:00,799::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) >> >>>> Trying to restart agent >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >>20:57:00,799::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) >> >>>> Agent shutting down >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >>20:57:11,144::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) >> >>>> ovirt-hosted-engine-ha agent 2.3.6 started >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:57:11,182::hosted_engine::234::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) >> >>>> Found certificate common name: ovirt-node-01.phoelex.com >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:57:11,294::hosted_engine::543::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) >> >>>> Initializing ha-broker connection >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >> >>20:57:11,296::brokerlink::80::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) >> >>>> Starting monitor network, options {'tcp_t_address': '', >> >'network_test': >> >>>> 'dns', 'tcp_t_port': '', 'addr': '192.168.1.99'} >> >>>> >> >>>> MainThread::ERROR::2020-04-08 >> >>>> >> >> >>20:57:11,296::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) >> >>>> Failed to start necessary monitors >> >>>> >> >>>> MainThread::ERROR::2020-04-08 >> >>>> >> >> >>20:57:11,297::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) >> >>>> Traceback (most recent call last): >> >>>> >> >>>> File >> >>>> >> >>"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", >> >>>> line 131, in _run_agent >> >>>> >> >>>> return action(he) >> >>>> >> >>>> 
File >> >>>> >> >>"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", >> >>>> line 55, in action_proper >> >>>> >> >>>> return he.start_monitoring() >> >>>> >> >>>> File >> >>>> >> >> >>"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", >> >>>> line 432, in start_monitoring >> >>>> >> >>>> self._initialize_broker() >> >>>> >> >>>> File >> >>>> >> >> >>"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", >> >>>> line 556, in _initialize_broker >> >>>> >> >>>> m.get('options', {})) >> >>>> >> >>>> File >> >>>> >> >> >>"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", >> >>>> line 89, in start_monitor >> >>>> >> >>>> ).format(t=type, o=options, e=e) >> >>>> >> >>>> RequestError: brokerlink - failed to start monitor via >> >ovirt-ha-broker: >> >>>> [Errno 2] No such file or directory, [monitor: 'network', >options: >> >>>> {'tcp_t_address': '', 'network_test': 'dns', 'tcp_t_port': '', >> >'addr': >> >>>> '192.168.1.99'}] >> >>>> >> >>>> >> >>>> >> >>>> MainThread::ERROR::2020-04-08 >> >>>> >> >> >>20:57:11,297::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) >> >>>> Trying to restart agent >> >>>> >> >>>> MainThread::INFO::2020-04-08 >> >>>> >> >>20:57:11,297::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) >> >>>> Agent shutting down >> >>>> >> >>>> >> >>>> >> >>>> On Wed, Apr 8, 2020 at 6:10 PM Strahil Nikolov >> ><hunter86...@yahoo.com> >> >>>> wrote: >> >>>> >> >>>> On April 8, 2020 7:47:20 PM GMT+03:00, "Maton, Brett" < >> >>>> mat...@ltresources.co.uk> wrote: >> >>>> >On the host you tried to restart the engine on: >> >>>> > >> >>>> >Add an alias to virsh (authenticates with virsh_auth.conf) >> >>>> > >> >>>> >alias virsh='virsh -c >> >>>> >>qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf' >> >>>> > >> >>>> >Then run virsh: >> >>>> > >> >>>> >virsh >> >>>> > >> >>>> >virsh # list >> >>>> > Id Name State >> 
>>>> >----------------------------------------------------
>> >>>> > xx    HostedEngine    Paused
>> >>>> > xx    **********      running
>> >>>> > ...
>> >>>> > xx    **********      running
>> >>>> >
>> >>>> >HostedEngine should be in the list, try and resume the engine:
>> >>>> >
>> >>>> >virsh # resume HostedEngine
>> >>>> >
>> >>>> >On Wed, 8 Apr 2020 at 17:28, Shareef Jalloq <shar...@jalloq.co.uk>
>> >>>> >wrote:
>> >>>> >
>> >>>> >> Thanks!
>> >>>> >>
>> >>>> >> The status hangs due to, I guess, the VM being down....
>> >>>> >>
>> >>>> >> [root@ovirt-node-01 ~]# hosted-engine --vm-start
>> >>>> >> VM exists and is down, cleaning up and restarting
>> >>>> >> VM in WaitForLaunch
>> >>>> >>
>> >>>> >> but this doesn't seem to do anything. OK, after a while I get a status of
>> >>>> >> it being barfed...
>> >>>> >>
>> >>>> >> --== Host ovirt-node-00.phoelex.com (id: 1) status ==--
>> >>>> >>
>> >>>> >> conf_on_shared_storage : True
>> >>>> >> Status up-to-date      : False
>> >>>> >> Hostname               : ovirt-node-00.phoelex.com
>> >>>> >> Host ID                : 1
>> >>>> >> Engine status          : unknown stale-data
>> >>>> >> Score                  : 3400
>> >>>> >> stopped                : False
>> >>>> >> Local maintenance      : False
>> >>>> >> crc32                  : 9c4a034b
>> >>>> >> local_conf_timestamp   : 523362
>> >>>> >> Host timestamp         : 523608
>> >>>> >> Extra metadata (valid at timestamp):
>> >>>> >>     metadata_parse_version=1
>> >>>> >>     metadata_feature_version=1
>> >>>> >>     timestamp=523608 (Wed Apr 8 16:17:11 2020)
>> >>>> >>     host-id=1
>> >>>> >>     score=3400
>> >>>> >>     vm_conf_refresh_time=523362 (Wed Apr 8 16:13:06 2020)
>> >>>> >>     conf_on_shared_storage=True
>> >>>> >>     maintenance=False
>> >>>> >>     state=EngineDown
>> >>>> >>     stopped=False
>> >>>> >>
>> >>>> >>
>> >>>> >> --== Host ovirt-node-01.phoelex.com (id: 2) status ==--
>> >>>> >>
>> >>>> >> conf_on_shared_storage : True
>> >>>> >> Status up-to-date      : True
>> >>>> >> Hostname               : ovirt-node-01.phoelex.com
>> >>>> >> Host ID                : 2
>> >>>> >> Engine status          : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
>> >>>> >> Score                  : 0
>> >>>> >> stopped                : False
>> >>>> >> Local maintenance      : False
>> >>>> >> crc32                  : 5045f2eb
>> >>>> >> local_conf_timestamp   : 1737037
>> >>>> >> Host timestamp         : 1737283
>> >>>> >> Extra metadata (valid at timestamp):
>> >>>> >>     metadata_parse_version=1
>> >>>> >>     metadata_feature_version=1
>> >>>> >>     timestamp=1737283 (Wed Apr 8 16:16:17 2020)
>> >>>> >>     host-id=2
>> >>>> >>     score=0
>> >>>> >>     vm_conf_refresh_time=1737037 (Wed Apr 8 16:12:11 2020)
>> >>>> >>     conf_on_shared_storage=True
>> >>>> >>     maintenance=False
>> >>>> >>     state=EngineUnexpectedlyDown
>> >>>> >>     stopped=False
>> >>>> >>
>> >>>> >> On Wed, Apr 8, 2020 at 5:09 PM Maton, Brett <mat...@ltresources.co.uk>
>> >>>> >> wrote:
>> >>>> >>
>> >>>> >>> First steps, on one of your hosts as root:
>> >>>> >>>
>> >>>> >>> To get information:
>> >>>> >>> hosted-engine --vm-status
>> >>>> >>>
>> >>>> >>> To start the engine:
>> >>>> >>> hosted-engine --vm-start
>> >>>> >>>
>> >>>> >>>
>> >>>> >>> On Wed, 8 Apr 2020 at 17:00, Shareef Jalloq <shar...@jalloq.co.uk>
>> >>>> >>> wrote:
>> >>>> >>>
>> >>>> >>>> So my engine has gone down and I can't ssh into it either. If I try to
>> >>>> >>>> log into the web-ui of the node it is running on, I get redirected because
>> >>>> >>>> the node can't reach the engine.
>> >>>> >>>>
>> >>>> >>>> What are my next steps?
>> >>>> >>>>
>> >>>> >>>> Shareef.
>> >>>> >>>> _______________________________________________
>> >>>> >>>> Users mailing list -- users@ovirt.org
>> >>>> >>>> To unsubscribe send an email to users-le...@ovirt.org
>> >>>> >>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>> >>>> >>>> oVirt Code of Conduct:
>> >>>> >>>> https://www.ovirt.org/community/about/community-guidelines/
>> >>>> >>>> List Archives:
>> >>>> >>>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/W7BP57OCIRSW5CDRQWR5MIKJUH3ISLCQ/
>> >>>>
>> >>>> This has to be resolved:
>> >>>>
>> >>>> Engine status : unknown stale-data
>> >>>>
>> >>>> Run 'hosted-engine --vm-status' again. If it remains the same, restart
>> >>>> ovirt-ha-broker.service & ovirt-ha-agent.service
>> >>>>
>> >>>> Verify that the engine's storage is available. Then monitor the broker
>> >>>> & agent logs in /var/log/ovirt-hosted-engine-ha
>> >>>>
>> >>>> Best Regards,
>> >>>> Strahil Nikolov
>>
>> Hi Shareef,
>>
>> The activation flow in oVirt is more complex than in plain KVM.
>> The domains are mounted during the activation of the node (the
>> HostedEngine activates everything needed).
>>
>> Focus on the HostedEngine VM.
>> Is it running properly?
>>
>> If not, try:
>> 1. Verify that the storage domain exists
>> 2. Check if it has an 'ha_agent' directory
>> 3. Check if the links are OK; if not, you can safely remove the links
>>
>> 4. Next, check that the services are running:
>> A) sanlock
>> B) supervdsmd
>> C) vdsmd
>> D) libvirtd
>>
>> 5. Increase the log level for the broker and agent services:
>>
>> cd /etc/ovirt-hosted-engine-ha
>> vim *-log.conf
>>
>> systemctl restart ovirt-ha-broker ovirt-ha-agent
>>
>> 6. Check what they are complaining about.
>> Keep in mind that the agent will keep throwing errors until the broker stops
>> doing so (the agent depends on the broker), so the broker must be OK before
>> proceeding with the agent log.
>>
>> About the manual VM start, you need 2 things:
>>
>> 1. Define the VM network
>> # cat vdsm-ovirtmgmt.xml
>> <network>
>>   <name>vdsm-ovirtmgmt</name>
>>   <uuid>8ded486e-e681-4754-af4b-5737c2b05405</uuid>
>>   <forward mode='bridge'/>
>>   <bridge name='ovirtmgmt'/>
>> </network>
>>
>> [root@ovirt1 HostedEngine-RECOVERY]# virsh net-define vdsm-ovirtmgmt.xml
>>
>> 2. Get an XML definition, which can be found in the vdsm log. Every VM
>> has its configuration printed in the vdsm log on the host where it
>> starts.
>> Save it to a file and then:
>> A) virsh define myvm.xml
>> B) virsh start myvm
>>
>> It seems there is/was a problem with your NFS shares.
>>
>>
>> Best Regards,
>> Strahil Nikolov
>>
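The link check from step 3 of the list quoted above could be sketched roughly like this. It is only an illustration, assuming Python 3 is available on the host; `find_broken_links` is a made-up helper name (not part of oVirt or vdsm), and the directory you pass it is whatever 'ha_agent' path applies to your storage domain.

```python
import os
import sys


def find_broken_links(directory):
    """Return (link_path, target) pairs for symlinks whose target is missing.

    Hypothetical helper for illustration only; not part of oVirt or vdsm.
    """
    broken = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # os.path.exists() follows the link, so a False result on a symlink
        # means the link is dangling.
        if os.path.islink(path) and not os.path.exists(path):
            broken.append((path, os.readlink(path)))
    return broken


if __name__ == "__main__":
    # Usage: python3 check_links.py <path-to-ha_agent-directory> [...]
    for directory in sys.argv[1:]:
        for link, target in find_broken_links(directory):
            print("broken: %s -> %s" % (link, target))
```

An empty result only says that the link targets exist; it does not say that vdsm can read them.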
Hey Shareef,

Check if there are any files or folders not owned by vdsm:kvm (UID 36, GID 36).
Something like this, which prints anything whose owner or group differs:

find . \( -not -user 36 -o -not -group 36 \) -print

Also check if vdsm can access the images in the '<vol-mount-point>/images' directories.

Best Regards,
Strahil Nikolov
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/N42KAKSIBDYWAUTDNEHMSSARE3OQWM7M/
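If a script is handier than find, the same ownership check can be sketched in Python. This is a rough illustration, assuming Python 3 on the host; `find_wrong_owner` is a made-up name, and the defaults of 36 for both IDs are simply the values from the find command above.

```python
import os
import sys


def find_wrong_owner(root, uid=36, gid=36):
    """Yield paths at or under root whose owner UID or group GID differs.

    Hypothetical helper for illustration only; the defaults of 36 are taken
    from the find command in this message.
    """
    def mismatched(path):
        # lstat() inspects a symlink itself instead of following it.
        st = os.lstat(path)
        return st.st_uid != uid or st.st_gid != gid

    if mismatched(root):
        yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if mismatched(path):
                yield path


if __name__ == "__main__":
    # Usage: python3 check_owner.py <vol-mount-point> [...]
    for root in sys.argv[1:]:
        for path in find_wrong_owner(root):
            print(path)
```

Run it as root against the storage domain mount point; anything it prints is a candidate for a chown back to 36:36.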