On April 9, 2020 11:12:30 AM GMT+03:00, Shareef Jalloq <shar...@jalloq.co.uk> 
wrote:
>OK, let's go through this.  I'm looking at the node that at least still has
>some VMs running.  virsh also tells me that the HostedEngine VM is running
>but it's unresponsive and I can't shut it down.
>
>1. All storage domains exist and are mounted.
>2. The ha_agent exists:
>
>[root@ovirt-node-01 ovirt-hosted-engine-ha]# ls /rhev/data-center/mnt/nas-01.phoelex.com\:_volume2_vmstore/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/
>
>dom_md  ha_agent  images  master
>
>3.  There are two links
>
>[root@ovirt-node-01 ovirt-hosted-engine-ha]# ll /rhev/data-center/mnt/nas-01.phoelex.com\:_volume2_vmstore/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/ha_agent/
>
>total 8
>
>lrwxrwxrwx. 1 vdsm kvm 132 Apr  2 14:50 hosted-engine.lockspace ->
>/var/run/vdsm/storage/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/ffb90b82-42fe-4253-85d5-aaec8c280aaf/90e68791-0c6f-406a-89ac-e0d86c631604
>
>lrwxrwxrwx. 1 vdsm kvm 132 Apr  2 14:50 hosted-engine.metadata ->
>/var/run/vdsm/storage/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/2161aed0-7250-4c1d-b667-ac94f60af17e/6b818e33-f80a-48cc-a59c-bba641e027d4
>
>4. The services exist but all seem to have some sort of warning:
>
>a) Apr 08 18:10:55 ovirt-node-01.phoelex.com sanlock[1728]: *2020-04-08
>18:10:55 1744152 [36796]: s16 delta_renew long write time 10 sec*
>
>b) Mar 23 18:02:59 ovirt-node-01.phoelex.com supervdsmd[29409]: *failed
>to
>load module nvdimm: libbd_nvdimm.so.2: cannot open shared object file:
>No
>such file or directory*
>
>c) Apr 09 08:05:13 ovirt-node-01.phoelex.com vdsm[4801]: *ERROR failed
>to
>retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is
>the
>Hosted Engine setup finished?*
>
>d)Apr 08 22:48:27 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-08
>22:48:27.134+0000: 29309: warning : qemuGetProcessInfo:1404 : cannot
>parse
>process status data
>
>Apr 08 22:48:27 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-08
>22:48:27.134+0000: 29309: error : virNetDevTapInterfaceStats:764 :
>internal
>error: /proc/net/dev: Interface not found
>
>Apr 08 23:09:39 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-08
>23:09:39.844+0000: 29307: error : virNetSocketReadWire:1806 : End of
>file
>while reading data: Input/output error
>
>Apr 09 01:05:26 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-09
>01:05:26.660+0000: 29307: error : virNetSocketReadWire:1806 : End of
>file
>while reading data: Input/output error
>
>5 & 6.  The broker log is continually printing this error:
>
>MainThread::INFO::2020-04-09
>08:07:31,438::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run)
>ovirt-hosted-engine-ha broker 2.3.6 started
>
>MainThread::DEBUG::2020-04-09
>08:07:31,438::broker::55::ovirt_hosted_engine_ha.broker.broker.Broker::(run)
>Running broker
>
>MainThread::DEBUG::2020-04-09
>08:07:31,438::broker::120::ovirt_hosted_engine_ha.broker.broker.Broker::(_get_monitor)
>Starting monitor
>
>MainThread::INFO::2020-04-09
>08:07:31,438::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Searching for submonitors in
>/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
>
>MainThread::INFO::2020-04-09
>08:07:31,439::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Loaded submonitor network
>
>MainThread::INFO::2020-04-09
>08:07:31,440::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Loaded submonitor cpu-load-no-engine
>
>MainThread::INFO::2020-04-09
>08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Loaded submonitor mgmt-bridge
>
>MainThread::INFO::2020-04-09
>08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Loaded submonitor network
>
>MainThread::INFO::2020-04-09
>08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Loaded submonitor cpu-load
>
>MainThread::INFO::2020-04-09
>08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Loaded submonitor engine-health
>
>MainThread::INFO::2020-04-09
>08:07:31,442::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Loaded submonitor mgmt-bridge
>
>MainThread::INFO::2020-04-09
>08:07:31,442::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Loaded submonitor cpu-load-no-engine
>
>MainThread::INFO::2020-04-09
>08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Loaded submonitor cpu-load
>
>MainThread::INFO::2020-04-09
>08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Loaded submonitor mem-free
>
>MainThread::INFO::2020-04-09
>08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Loaded submonitor storage-domain
>
>MainThread::INFO::2020-04-09
>08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Loaded submonitor storage-domain
>
>MainThread::INFO::2020-04-09
>08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Loaded submonitor mem-free
>
>MainThread::INFO::2020-04-09
>08:07:31,444::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Loaded submonitor engine-health
>
>MainThread::INFO::2020-04-09
>08:07:31,444::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>Finished loading submonitors
>
>MainThread::DEBUG::2020-04-09
>08:07:31,444::broker::128::ovirt_hosted_engine_ha.broker.broker.Broker::(_get_storage_broker)
>Starting storage broker
>
>MainThread::DEBUG::2020-04-09
>08:07:31,444::storage_backends::369::ovirt_hosted_engine_ha.lib.storage_backends::(connect)
>Connecting to VDSM
>
>MainThread::DEBUG::2020-04-09
>08:07:31,444::util::384::ovirt_hosted_engine_ha.lib.storage_backends::(__log_debug)
>Creating a new json-rpc connection to VDSM
>
>Client localhost:54321::DEBUG::2020-04-09
>08:07:31,453::concurrent::258::root::(run) START thread <Thread(Client
>localhost:54321, started daemon 139992488138496)> (func=<bound method
>Reactor.process_requests of <yajsonrpc.betterAsyncore.Reactor object at
>0x7f528acabc90>>, args=(), kwargs={})
>
>Client localhost:54321::DEBUG::2020-04-09
>08:07:31,459::stompclient::138::yajsonrpc.protocols.stomp.AsyncClient::(_process_connected)
>Stomp connection established
>
>MainThread::DEBUG::2020-04-09
>08:07:31,467::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending
>response
>
>MainThread::INFO::2020-04-09
>08:07:31,530::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect)
>Connecting the storage
>
>MainThread::INFO::2020-04-09
>08:07:31,531::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
>Connecting storage server
>
>MainThread::DEBUG::2020-04-09
>08:07:31,531::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending
>response
>
>MainThread::DEBUG::2020-04-09
>08:07:31,534::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending
>response
>
>MainThread::DEBUG::2020-04-09
>08:07:32,199::storage_server::158::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(_validate_pre_connected_path)
>Storage domain a6cea67d-dbfb-45cf-a775-b4d0d47b26f2 is not available
>
>MainThread::INFO::2020-04-09
>08:07:32,199::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
>Connecting storage server
>
>MainThread::DEBUG::2020-04-09
>08:07:32,199::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending
>response
>
>MainThread::DEBUG::2020-04-09
>08:07:32,814::storage_server::363::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
>[{u'status': 0, u'id': u'e29cf818-5ee5-46e1-85c1-8aeefa33e95d'}]
>
>MainThread::INFO::2020-04-09
>08:07:32,814::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
>Refreshing the storage domain
>
>MainThread::DEBUG::2020-04-09
>08:07:32,815::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending
>response
>
>MainThread::DEBUG::2020-04-09
>08:07:33,129::storage_server::420::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
>Error refreshing storage domain: Command StorageDomain.getStats with
>args
>{'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed:
>
>(code=350, message=Error in storage domain action:
>(u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
>
>MainThread::DEBUG::2020-04-09
>08:07:33,130::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending
>response
>
>MainThread::DEBUG::2020-04-09
>08:07:33,795::storage_backends::208::ovirt_hosted_engine_ha.lib.storage_backends::(_get_sector_size)
>Command StorageDomain.getInfo with args {'storagedomainID':
>'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed:
>
>(code=350, message=Error in storage domain action:
>(u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
>
>MainThread::WARNING::2020-04-09
>08:07:33,795::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__)
>Can't connect vdsm storage: Command StorageDomain.getInfo with args
>{'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed:
>
>(code=350, message=Error in storage domain action:
>(u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
>
>
>The UUID it is moaning about is indeed the one that the HA sits on and is
>the one I listed the contents of in step 2 above.
>
>
>So why can't it see this domain?
>
>
>Thanks, Shareef.
>
>On Thu, Apr 9, 2020 at 6:12 AM Strahil Nikolov <hunter86...@yahoo.com>
>wrote:
>
>> On April 9, 2020 1:51:05 AM GMT+03:00, Shareef Jalloq <
>> shar...@jalloq.co.uk> wrote:
>> >Don't know if this is useful or not, but I just tried to shutdown
>and
>> >start
>> >another VM on one of the hosts and get the following error:
>> >
>> >virsh # start scratch
>> >
>> >error: Failed to start domain scratch
>> >
>> >error: Network not found: no network with matching name
>> >'vdsm-ovirtmgmt'
>> >
>> >Is this not referring to the interface name as the network is called
>> >'ovirtmgnt'.
>> >
>> >On Wed, Apr 8, 2020 at 11:35 PM Shareef Jalloq
><shar...@jalloq.co.uk>
>> >wrote:
>> >
>> >> Hmmm, virsh tells me the HE is running but it hasn't come up and
>the
>> >> agent.log is full of the same errors.
>> >>
>> >> On Wed, Apr 8, 2020 at 11:31 PM Shareef Jalloq
><shar...@jalloq.co.uk>
>> >> wrote:
>> >>
>> >>> Ah hah!  Ok, so I've managed to start it using virsh on the
>second
>> >host
>> >>> but my first host is still dead.
>> >>>
>> >>> First of all, what are these 56,317 .prob- files that get dumped
>to
>> >the
>> >>> NFS mounts?
>> >>>
>> >>> Secondly, why doesn't the node mount the NFS directories at boot?
>> >Is
>> >>> that the issue with this particular node?
>> >>>
>> >>> On Wed, Apr 8, 2020 at 11:12 PM <eev...@digitaldatatechs.com>
>wrote:
>> >>>
>> >>>> Did you try virsh list --inactive
>> >>>>
>> >>>>
>> >>>>
>> >>>> Eric Evans
>> >>>>
>> >>>> Digital Data Services LLC.
>> >>>>
>> >>>> 304.660.9080
>> >>>>
>> >>>>
>> >>>>
>> >>>> *From:* Shareef Jalloq <shar...@jalloq.co.uk>
>> >>>> *Sent:* Wednesday, April 8, 2020 5:58 PM
>> >>>> *To:* Strahil Nikolov <hunter86...@yahoo.com>
>> >>>> *Cc:* Ovirt Users <users@ovirt.org>
>> >>>> *Subject:* [ovirt-users] Re: ovirt-engine unresponsive - how to
>> >rescue?
>> >>>>
>> >>>>
>> >>>>
>> >>>> I've now shut down the VMs on one host and rebooted it but the
>> >agent
>> >>>> service doesn't start.  If I run 'hosted-engine --vm-status' I
>get:
>> >>>>
>> >>>>
>> >>>>
>> >>>> The hosted engine configuration has not been retrieved from
>shared
>> >>>> storage. Please ensure that ovirt-ha-agent is running and the
>> >storage
>> >>>> server is reachable.
>> >>>>
>> >>>>
>> >>>>
>> >>>> and indeed if I list the mounts under /rhev/data-center/mnt,
>only
>> >one of
>> >>>> the directories is mounted.  I have 3 NFS mounts, one ISO Domain
>> >and two
>> >>>> Data Domains.  Only one Data Domain has mounted and this has
>lots
>> >of .prob
>> >>>> files in.  So why haven't the other NFS exports been mounted?
>> >>>>
>> >>>>
>> >>>>
>> >>>> Manually mounting them doesn't seem to have helped much either. 
>I
>> >can
>> >>>> start the broker service but the agent service says no.  Same
>error
>> >as the
>> >>>> one in my last email.
>> >>>>
>> >>>>
>> >>>>
>> >>>> Shareef.
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Wed, Apr 8, 2020 at 9:57 PM Shareef Jalloq
>> ><shar...@jalloq.co.uk>
>> >>>> wrote:
>> >>>>
>> >>>> Right, still down.  I've run virsh and it doesn't know anything
>> >about
>> >>>> the engine vm.
>> >>>>
>> >>>>
>> >>>>
>> >>>> I've restarted the broker and agent services and I still get
>> >nothing in
>> >>>> virsh->list.
>> >>>>
>> >>>>
>> >>>>
>> >>>> In the logs under /var/log/ovirt-hosted-engine-ha I see lots of
>> >errors:
>> >>>>
>> >>>>
>> >>>>
>> >>>> broker.log:
>> >>>>
>> >>>>
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,138::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run)
>> >>>> ovirt-hosted-engine-ha broker 2.3.6 started
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,138::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Searching for submonitors in
>> >>>>
>>
>>/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,138::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Loaded submonitor network
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Loaded submonitor cpu-load-no-engine
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Loaded submonitor mgmt-bridge
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Loaded submonitor network
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Loaded submonitor cpu-load
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Loaded submonitor engine-health
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Loaded submonitor mgmt-bridge
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Loaded submonitor cpu-load-no-engine
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Loaded submonitor cpu-load
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Loaded submonitor mem-free
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Loaded submonitor storage-domain
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Loaded submonitor storage-domain
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Loaded submonitor mem-free
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Loaded submonitor engine-health
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,143::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Finished loading submonitors
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,197::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect)
>> >>>> Connecting the storage
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,197::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
>> >>>> Connecting storage server
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,414::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
>> >>>> Connecting storage server
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:20,628::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
>> >>>> Refreshing the storage domain
>> >>>>
>> >>>> MainThread::WARNING::2020-04-08
>> >>>>
>>
>>
>>20:56:21,057::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__)
>> >>>> Can't connect vdsm storage: Command StorageDomain.getInfo with
>args
>> >>>> {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'}
>failed:
>> >>>>
>> >>>> (code=350, message=Error in storage domain action:
>> >>>> (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:21,901::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run)
>> >>>> ovirt-hosted-engine-ha broker 2.3.6 started
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:56:21,901::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>> >>>> Searching for submonitors in
>> >>>>
>>
>>/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
>> >>>>
>> >>>>
>> >>>>
>> >>>> agent.log:
>> >>>>
>> >>>>
>> >>>>
>> >>>> MainThread::ERROR::2020-04-08
>> >>>>
>>
>>
>>20:57:00,799::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
>> >>>> Trying to restart agent
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>20:57:00,799::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run)
>> >>>> Agent shutting down
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>20:57:11,144::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run)
>> >>>> ovirt-hosted-engine-ha agent 2.3.6 started
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:57:11,182::hosted_engine::234::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname)
>> >>>> Found certificate common name: ovirt-node-01.phoelex.com
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:57:11,294::hosted_engine::543::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
>> >>>> Initializing ha-broker connection
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>
>>20:57:11,296::brokerlink::80::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor)
>> >>>> Starting monitor network, options {'tcp_t_address': '',
>> >'network_test':
>> >>>> 'dns', 'tcp_t_port': '', 'addr': '192.168.1.99'}
>> >>>>
>> >>>> MainThread::ERROR::2020-04-08
>> >>>>
>>
>>
>>20:57:11,296::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
>> >>>> Failed to start necessary monitors
>> >>>>
>> >>>> MainThread::ERROR::2020-04-08
>> >>>>
>>
>>
>>20:57:11,297::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
>> >>>> Traceback (most recent call last):
>> >>>>
>> >>>>   File
>> >>>>
>>
>>"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py",
>> >>>> line 131, in _run_agent
>> >>>>
>> >>>>     return action(he)
>> >>>>
>> >>>>   File
>> >>>>
>>
>>"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py",
>> >>>> line 55, in action_proper
>> >>>>
>> >>>>     return he.start_monitoring()
>> >>>>
>> >>>>   File
>> >>>>
>>
>>
>>"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
>> >>>> line 432, in start_monitoring
>> >>>>
>> >>>>     self._initialize_broker()
>> >>>>
>> >>>>   File
>> >>>>
>>
>>
>>"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
>> >>>> line 556, in _initialize_broker
>> >>>>
>> >>>>     m.get('options', {}))
>> >>>>
>> >>>>   File
>> >>>>
>>
>>
>>"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
>> >>>> line 89, in start_monitor
>> >>>>
>> >>>>     ).format(t=type, o=options, e=e)
>> >>>>
>> >>>> RequestError: brokerlink - failed to start monitor via
>> >ovirt-ha-broker:
>> >>>> [Errno 2] No such file or directory, [monitor: 'network',
>options:
>> >>>> {'tcp_t_address': '', 'network_test': 'dns', 'tcp_t_port': '',
>> >'addr':
>> >>>> '192.168.1.99'}]
>> >>>>
>> >>>>
>> >>>>
>> >>>> MainThread::ERROR::2020-04-08
>> >>>>
>>
>>
>>20:57:11,297::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
>> >>>> Trying to restart agent
>> >>>>
>> >>>> MainThread::INFO::2020-04-08
>> >>>>
>>
>>20:57:11,297::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run)
>> >>>> Agent shutting down
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Wed, Apr 8, 2020 at 6:10 PM Strahil Nikolov
>> ><hunter86...@yahoo.com>
>> >>>> wrote:
>> >>>>
>> >>>> On April 8, 2020 7:47:20 PM GMT+03:00, "Maton, Brett" <
>> >>>> mat...@ltresources.co.uk> wrote:
>> >>>> >On the host you tried to restart the engine on:
>> >>>> >
>> >>>> >Add an alias to virsh (authenticates with virsh_auth.conf)
>> >>>> >
>> >>>> >alias virsh='virsh -c
>> >>>>
>>qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf'
>> >>>> >
>> >>>> >Then run virsh:
>> >>>> >
>> >>>> >virsh
>> >>>> >
>> >>>> >virsh # list
>> >>>> > Id    Name                           State
>> >>>> >----------------------------------------------------
>> >>>> > xx    HostedEngine                   Paused
>> >>>> > xx    **********                     running
>> >>>> > ...
>> >>>> > xx     **********                     running
>> >>>> >
>> >>>> >HostedEngine should be in the list, try and resume the engine:
>> >>>> >
>> >>>> >virsh # resume HostedEngine
>> >>>> >
>> >>>> >On Wed, 8 Apr 2020 at 17:28, Shareef Jalloq
><shar...@jalloq.co.uk>
>> >>>> >wrote:
>> >>>> >
>> >>>> >> Thanks!
>> >>>> >>
>> >>>> >> The status hangs due to, I guess, the VM being down....
>> >>>> >>
>> >>>> >> [root@ovirt-node-01 ~]# hosted-engine --vm-start
>> >>>> >> VM exists and is down, cleaning up and restarting
>> >>>> >> VM in WaitForLaunch
>> >>>> >>
>> >>>> >> but this doesn't seem to do anything.  OK, after a while I
>get a
>> >>>> >status of
>> >>>> >> it being barfed...
>> >>>> >>
>> >>>> >> --== Host ovirt-node-00.phoelex.com (id: 1) status ==--
>> >>>> >>
>> >>>> >> conf_on_shared_storage             : True
>> >>>> >> Status up-to-date                  : False
>> >>>> >> Hostname                           :
>ovirt-node-00.phoelex.com
>> >>>> >> Host ID                            : 1
>> >>>> >> Engine status                      : unknown stale-data
>> >>>> >> Score                              : 3400
>> >>>> >> stopped                            : False
>> >>>> >> Local maintenance                  : False
>> >>>> >> crc32                              : 9c4a034b
>> >>>> >> local_conf_timestamp               : 523362
>> >>>> >> Host timestamp                     : 523608
>> >>>> >> Extra metadata (valid at timestamp):
>> >>>> >> metadata_parse_version=1
>> >>>> >> metadata_feature_version=1
>> >>>> >> timestamp=523608 (Wed Apr  8 16:17:11 2020)
>> >>>> >> host-id=1
>> >>>> >> score=3400
>> >>>> >> vm_conf_refresh_time=523362 (Wed Apr  8 16:13:06 2020)
>> >>>> >> conf_on_shared_storage=True
>> >>>> >> maintenance=False
>> >>>> >> state=EngineDown
>> >>>> >> stopped=False
>> >>>> >>
>> >>>> >>
>> >>>> >> --== Host ovirt-node-01.phoelex.com (id: 2) status ==--
>> >>>> >>
>> >>>> >> conf_on_shared_storage             : True
>> >>>> >> Status up-to-date                  : True
>> >>>> >> Hostname                           :
>ovirt-node-01.phoelex.com
>> >>>> >> Host ID                            : 2
>> >>>> >> Engine status                      : {"reason": "bad vm
>status",
>> >>>> >"health":
>> >>>> >> "bad", "vm": "down_unexpected", "detail": "Down"}
>> >>>> >> Score                              : 0
>> >>>> >> stopped                            : False
>> >>>> >> Local maintenance                  : False
>> >>>> >> crc32                              : 5045f2eb
>> >>>> >> local_conf_timestamp               : 1737037
>> >>>> >> Host timestamp                     : 1737283
>> >>>> >> Extra metadata (valid at timestamp):
>> >>>> >> metadata_parse_version=1
>> >>>> >> metadata_feature_version=1
>> >>>> >> timestamp=1737283 (Wed Apr  8 16:16:17 2020)
>> >>>> >> host-id=2
>> >>>> >> score=0
>> >>>> >> vm_conf_refresh_time=1737037 (Wed Apr  8 16:12:11 2020)
>> >>>> >> conf_on_shared_storage=True
>> >>>> >> maintenance=False
>> >>>> >> state=EngineUnexpectedlyDown
>> >>>> >> stopped=False
>> >>>> >>
>> >>>> >> On Wed, Apr 8, 2020 at 5:09 PM Maton, Brett
>> >>>> ><mat...@ltresources.co.uk>
>> >>>> >> wrote:
>> >>>> >>
>> >>>> >>> First steps, on one of your hosts as root:
>> >>>> >>>
>> >>>> >>> To get information:
>> >>>> >>> hosted-engine --vm-status
>> >>>> >>>
>> >>>> >>> To start the engine:
>> >>>> >>> hosted-engine --vm-start
>> >>>> >>>
>> >>>> >>>
>> >>>> >>> On Wed, 8 Apr 2020 at 17:00, Shareef Jalloq
>> ><shar...@jalloq.co.uk>
>> >>>> >wrote:
>> >>>> >>>
>> >>>> >>>> So my engine has gone down and I can't ssh into it either. 
>If
>> >I
>> >>>> >try to
>> >>>> >>>> log into the web-ui of the node it is running on, I get
>> >redirected
>> >>>> >because
>> >>>> >>>> the node can't reach the engine.
>> >>>> >>>>
>> >>>> >>>> What are my next steps?
>> >>>> >>>>
>> >>>> >>>> Shareef.
>> >>>> >>>> _______________________________________________
>> >>>> >>>> Users mailing list -- users@ovirt.org
>> >>>> >>>> To unsubscribe send an email to users-le...@ovirt.org
>> >>>> >>>> Privacy Statement:
>https://www.ovirt.org/privacy-policy.html
>> >>>> >>>> oVirt Code of Conduct:
>> >>>> >>>> https://www.ovirt.org/community/about/community-guidelines/
>> >>>> >>>> List Archives:
>> >>>> >>>>
>> >>>> >
>> >>>>
>> >
>>
>https://lists.ovirt.org/archives/list/users@ovirt.org/message/W7BP57OCIRSW5CDRQWR5MIKJUH3ISLCQ/
>> >>>> >>>>
>> >>>> >>>
>> >>>>
>> >>>> This has  to be resolved:
>> >>>>
>> >>>> Engine status                      : unknown stale-data
>> >>>>
>> >>>> Run again 'hosted-engine --vm-status'. If it remains the same,
>> >restart
>> >>>> ovirt-ha-broker.service & ovirt-ha-agent.service
>> >>>>
>> >>>> Verify that the engine's storage is available. Then monitor the
>> >broker
>> >>>> & agent logs in /var/log/ovirt-hosted-engine-ha
>> >>>>
>> >>>> Best Regards,
>> >>>> Strahil Nikolov
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>>
>> Hi Shareef,
>>
>> The activation flow in oVirt is more complex than in plain KVM.
>> Mounting of the domains happens during the activation of the node (the
>> HostedEngine activates everything needed).
>>
>> Focus on the HostedEngine VM.
>> Is it running properly ?
>>
>> If not, try:
>> 1. Verify that the storage domain exists
>> 2. Check that it has an 'ha_agent' directory
>> 3. Check that the links are OK; if not, you can safely remove them
>>
>> 4. Next check the services are running:
>> A) sanlock
>> B) supervdsmd
>> C) vdsmd
>> D) libvirtd
>>
>> 5. Increase the log level for broker  and agent services:
>>
>> cd  /etc/ovirt-hosted-engine-ha
>> vim *-log.conf
>>
>> systemctl restart ovirt-ha-broker ovirt-ha-agent
>>
>> 6. Check what they are complaining about.
>> Keep in mind that the agent will keep throwing errors until the broker
>> stops doing so (the agent depends on the broker), so the broker must be
>> OK before proceeding with the agent log.
>>
>> About the manual VM start, you need  2 things:
>>
>> 1.  Define the VM network
>> # cat vdsm-ovirtmgmt.xml
>> <network>
>>   <name>vdsm-ovirtmgmt</name>
>>   <uuid>8ded486e-e681-4754-af4b-5737c2b05405</uuid>
>>   <forward mode='bridge'/>
>>   <bridge name='ovirtmgmt'/>
>> </network>
>>
>> [root@ovirt1 HostedEngine-RECOVERY]# virsh net-define vdsm-ovirtmgmt.xml
>> [root@ovirt1 HostedEngine-RECOVERY]# virsh net-start vdsm-ovirtmgmt
>>
>> 2. Get an XML definition, which can be found in the vdsm log. Every VM has
>> its configuration printed in the vdsm log on the host where it starts.
>> Save it to a file and then:
>> A) virsh define myvm.xml
>> B) virsh start myvm
>>
>> It seems there is/was a problem with your NFS shares.
>>
>>
>> Best Regards,
>> Strahil Nikolov
>>

Hey Shareef,

Check if there are any files or folders not owned by vdsm:kvm. Something like 
this:

find . \( -not -user 36 -o -not -group 36 \) -print
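
As a small sketch of that ownership check (an illustration only, not an oVirt tool), the helper below reports a path when either its owner or its group differs from the expected IDs. The function name list_not_owned and the example path are made up for illustration; 36:36 is vdsm:kvm, as in the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every path under DIR whose owner is not UID
# or whose group is not GID (a mismatch on either one is reported).
list_not_owned() {
    local dir=$1 uid=$2 gid=$3
    find "$dir" \( ! -user "$uid" -o ! -group "$gid" \) -print
}

# Example call (the path is a placeholder for the storage domain mount):
#   list_not_owned '<vol-mount-point>' 36 36
```

No output means everything under the directory is owned by the given user and group.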

Also check if vdsm can access the images in the  '<vol-mount-point>/images' 
directories.
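
One way to test that access is a read check run as the vdsm user. The sketch below is an assumption on my side, not an oVirt tool: the function name list_unreadable is made up, it relies on the -readable test of GNU find, and it only reports what the user running it cannot read, so it has to be executed as vdsm (for example from a script started with sudo -u vdsm).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every path under DIR that the *current* user
# cannot read. Relies on the -readable test of GNU find.
list_unreadable() {
    local dir=$1
    find "$dir" ! -readable -print
}

# Example: put the function in a script that ends with
#   list_unreadable "$1"
# and run that script as vdsm (script name and path are placeholders):
#   sudo -u vdsm bash check_access.sh '<vol-mount-point>/images'
```

An empty result means the user running the check can read everything under the directory. Running it as root proves nothing, since root can read everything.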

Best Regards,
Strahil Nikolov
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/N42KAKSIBDYWAUTDNEHMSSARE3OQWM7M/
