The hosted-engine storage domain is mounted for sure, but the issue is here:

Exception: Failed to start monitoring domain
(sd_uuid=7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96, host_id=3): timeout during domain acquisition
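The repoStats output below can also be checked programmatically; a minimal sketch in Python (the dict shape mirrors the VDSM log excerpt in this thread, trimmed to a few keys; the helper name is illustrative, not a VDSM API):

```python
# Sketch: given a repoStats-style response (as logged by VDSM),
# list the storage domains whose sanlock lease was not acquired.
# The function name is illustrative, not part of any VDSM API.

def unacquired_domains(repo_stats):
    """Return the sd_uuids with 'acquired': False, sorted."""
    return sorted(
        sd_uuid for sd_uuid, stats in repo_stats.items()
        if not stats.get('acquired', False)
    )

# Values taken from the repoStats log excerpt below (trimmed):
stats = {
    '5d99af76-33b5-47d8-99da-1f32413c7bb0':
        {'code': 0, 'version': 4, 'acquired': True, 'valid': True},
    '7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96':
        {'code': 0, 'version': 4, 'acquired': False, 'valid': True},
}

print(unacquired_domains(stats))  # only the hosted-engine storage domain
```

With the values from this thread the only UUID reported is the hosted-engine storage domain, matching the diagnosis below.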
The point is that in the VDSM logs I see just something like:

2017-02-02 21:05:22,283 INFO (jsonrpc/1) [dispatcher] Run and protect: repoStats(options=None) (logUtils:49)
2017-02-02 21:05:22,285 INFO (jsonrpc/1) [dispatcher] Run and protect: repoStats, Return response: {
 u'a7fbaaad-7043-4391-9523-3bedcdc4fb0d': {'code': 0, 'actual': True, 'version': 0, 'acquired': True, 'delay': '0.000748727', 'lastCheck': '0.1', 'valid': True},
 u'2b2a44fc-f2bd-47cd-b7af-00be59e30a35': {'code': 0, 'actual': True, 'version': 0, 'acquired': True, 'delay': '0.00082529', 'lastCheck': '0.1', 'valid': True},
 u'5d99af76-33b5-47d8-99da-1f32413c7bb0': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.000349356', 'lastCheck': '5.3', 'valid': True},
 u'7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96': {'code': 0, 'actual': True, 'version': 4, 'acquired': False, 'delay': '0.000377052', 'lastCheck': '0.6', 'valid': True}} (logUtils:52)

where the other storage domains have 'acquired': True, while it is always 'acquired': False for the hosted-engine storage domain.

Could you please share your /var/log/sanlock.log from the same host and the output of "sanlock client status"?

On Fri, Feb 3, 2017 at 3:52 PM, Ralf Schenk <r...@databay.de> wrote:

> Hello,
>
> I also put the host in Maintenance and restarted vdsm while ovirt-ha-agent is
> running. I can mount the gluster volume "engine" manually on the host.
>
> I get this repeatedly in /var/log/vdsm.log:
>
> 2017-02-03 15:29:28,891 INFO (MainThread) [vds] Exiting (vdsm:167)
> 2017-02-03 15:29:30,974 INFO (MainThread) [vds] (PID: 11456) I am the actual vdsm 4.19.4-1.el7.centos microcloud27 (3.10.0-514.6.1.el7.x86_64) (vdsm:145)
> 2017-02-03 15:29:30,974 INFO (MainThread) [vds] VDSM will run with cpu affinity: frozenset([1]) (vdsm:251)
> 2017-02-03 15:29:31,013 INFO (MainThread) [storage.check] Starting check service (check:91)
> 2017-02-03 15:29:31,017 INFO (MainThread) [storage.Dispatcher] Starting StorageDispatcher... (dispatcher:47)
> 2017-02-03 15:29:31,017 INFO (check/loop) [storage.asyncevent] Starting <EventLoop running=True closed=False at 0x37480464> (asyncevent:122)
> 2017-02-03 15:29:31,156 INFO (MainThread) [dispatcher] Run and protect: registerDomainStateChangeCallback(callbackFunc=<functools.partial object at 0x2881fc8>) (logUtils:49)
> 2017-02-03 15:29:31,156 INFO (MainThread) [dispatcher] Run and protect: registerDomainStateChangeCallback, Return response: None (logUtils:52)
> 2017-02-03 15:29:31,160 INFO (MainThread) [MOM] Preparing MOM interface (momIF:49)
> 2017-02-03 15:29:31,161 INFO (MainThread) [MOM] Using named unix socket /var/run/vdsm/mom-vdsm.sock (momIF:58)
> 2017-02-03 15:29:31,162 INFO (MainThread) [root] Unregistering all secrets (secret:91)
> 2017-02-03 15:29:31,164 INFO (MainThread) [vds] Setting channels' timeout to 30 seconds. (vmchannels:223)
> 2017-02-03 15:29:31,165 INFO (MainThread) [vds.MultiProtocolAcceptor] Listening at :::54321 (protocoldetector:185)
> 2017-02-03 15:29:31,354 INFO (vmrecovery) [vds] recovery: completed in 0s (clientIF:495)
> 2017-02-03 15:29:31,371 INFO (BindingXMLRPC) [vds] XMLRPC server running (bindingxmlrpc:63)
> 2017-02-03 15:29:31,471 INFO (periodic/1) [dispatcher] Run and protect: repoStats(options=None) (logUtils:49)
> 2017-02-03 15:29:31,472 INFO (periodic/1) [dispatcher] Run and protect: repoStats, Return response: {} (logUtils:52)
> 2017-02-03 15:29:31,472 WARN (periodic/1) [MOM] MOM not available. (momIF:116)
> 2017-02-03 15:29:31,473 WARN (periodic/1) [MOM] MOM not available, KSM stats will be missing. (momIF:79)
> 2017-02-03 15:29:31,474 ERROR (periodic/1) [root] failed to retrieve Hosted Engine HA info (api:252)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 231, in _getHaInfo
>     stats = instance.get_all_stats()
>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
>     self._configure_broker_conn(broker)
>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
>     dom_type=dom_type)
>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain
>     .format(sd_type, options, e))
> RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'glusterfs', 'sd_uuid': '7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96'}: Request failed: <class 'ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException'>
> 2017-02-03 15:29:35,920 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:49506 (protocoldetector:72)
> 2017-02-03 15:29:35,929 INFO (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::1:49506 (protocoldetector:127)
> 2017-02-03 15:29:35,930 INFO (Reactor thread) [Broker.StompAdapter] Processing CONNECT request (stompreactor:102)
> 2017-02-03 15:29:35,930 INFO (JsonRpc (StompReactor)) [Broker.StompAdapter] Subscribe command received (stompreactor:129)
> 2017-02-03 15:29:36,067 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.ping succeeded in 0.00 seconds (__init__:515)
> 2017-02-03 15:29:36,071 INFO (jsonrpc/1) [throttled] Current getAllVmStats: {} (throttledlog:105)
> 2017-02-03 15:29:36,071 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:515)
> 2017-02-03 15:29:46,435 INFO (periodic/0) [dispatcher] Run and protect: repoStats(options=None) (logUtils:49)
> 2017-02-03 15:29:46,435 INFO (periodic/0) [dispatcher] Run and protect: repoStats, Return response: {} (logUtils:52)
> 2017-02-03 15:29:46,439 ERROR (periodic/0) [root] failed to retrieve Hosted Engine HA info (api:252)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 231, in _getHaInfo
>     stats = instance.get_all_stats()
>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
>     self._configure_broker_conn(broker)
>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
>     dom_type=dom_type)
>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain
>     .format(sd_type, options, e))
> RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'glusterfs', 'sd_uuid': '7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96'}: Request failed: <class 'ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException'>
> 2017-02-03 15:29:51,095 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:515)
> 2017-02-03 15:29:51,219 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.setKsmTune succeeded in 0.00 seconds (__init__:515)
> 2017-02-03 15:30:01,444 INFO (periodic/1) [dispatcher] Run and protect: repoStats(options=None) (logUtils:49)
> 2017-02-03 15:30:01,444 INFO (periodic/1) [dispatcher] Run and protect: repoStats, Return response: {} (logUtils:52)
> 2017-02-03 15:30:01,448 ERROR (periodic/1) [root] failed to retrieve Hosted Engine HA info (api:252)
>
> On 03.02.2017 at 13:39, Simone Tiraboschi wrote:
>
> I see there an ERROR on stopMonitoringDomain but I cannot see the
> corresponding startMonitoringDomain; could you please look for it?
> On Fri, Feb 3, 2017 at 1:16 PM, Ralf Schenk <r...@databay.de> wrote:
>
>> Hello,
>>
>> attached is my vdsm.log from the host with hosted-engine HA, around the
>> time frame of the agent timeout. The host is no longer working for the
>> engine (it is active and works in oVirt). It simply isn't working for
>> engine HA anymore after the update.
>>
>> At 2017-02-02 19:25:34,248 you'll find an error corresponding to the
>> agent timeout error.
>>
>> Bye
>>
>> On 03.02.2017 at 11:28, Simone Tiraboschi wrote:
>>
>>>> 3. Three of my hosts have the hosted engine deployed for HA. At first all
>>>> three were marked by a crown (the running one was gold, the others silver).
>>>> After upgrading, hosted-engine HA is no longer active on the three
>>>> deployed hosts.
>>>>
>>>> I can't get this host back with a working ovirt-ha-agent/broker. I
>>>> already rebooted and manually restarted the services, but it isn't able
>>>> to get the cluster state according to "hosted-engine --vm-status". The
>>>> other hosts report the host status as "unknown stale-data".
>>>>
>>>> I already shut down all agents on all hosts and issued a "hosted-engine
>>>> --reinitialize-lockspace", but that didn't help.
>>>>
>>>> The agent stops working after a timeout error, according to the log:
>>>>
>>>> MainThread::INFO::2017-02-02 19:24:52,040::hosted_engine::841::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> MainThread::INFO::2017-02-02 19:24:59,185::hosted_engine::841::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> MainThread::INFO::2017-02-02 19:25:06,333::hosted_engine::841::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> MainThread::INFO::2017-02-02 19:25:13,554::hosted_engine::841::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> MainThread::INFO::2017-02-02 19:25:20,710::hosted_engine::841::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> MainThread::INFO::2017-02-02 19:25:27,865::hosted_engine::841::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> MainThread::ERROR::2017-02-02 19:25:27,866::hosted_engine::815::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96, host_id=3): timeout during domain acquisition
>>>> MainThread::WARNING::2017-02-02 19:25:27,866::hosted_engine::469::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Failed to start monitoring domain (sd_uuid=7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96, host_id=3): timeout during domain acquisition
>>>> MainThread::WARNING::2017-02-02 19:25:27,866::hosted_engine::472::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
>>>> Traceback (most recent call last):
>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 443, in start_monitoring
>>>>     self._initialize_domain_monitor()
>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 816, in _initialize_domain_monitor
>>>>     raise Exception(msg)
>>>> Exception: Failed to start monitoring domain (sd_uuid=7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96, host_id=3): timeout during domain acquisition
>>>> MainThread::ERROR::2017-02-02 19:25:27,866::hosted_engine::485::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Shutting down the agent because of 3 failures in a row!
>>>> MainThread::INFO::2017-02-02 19:25:32,087::hosted_engine::841::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> MainThread::INFO::2017-02-02 19:25:34,250::hosted_engine::769::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor) Failed to stop monitoring domain (sd_uuid=7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96): Storage domain is member of pool: u'domain=7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96'
>>>> MainThread::INFO::2017-02-02 19:25:34,254::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
>>>
>>> Simone, Martin, can you please follow up on this?
>>
>> Ralf, could you please attach vdsm logs from one of your hosts for the
>> relevant time frame?
>>
>> --
>> Ralf Schenk
>> fon +49 (0) 24 05 / 40 83 70
>> fax +49 (0) 24 05 / 40 83 759
>> mail r...@databay.de
>>
>> Databay AG
>> Jens-Otto-Krag-Straße 11
>> D-52146 Würselen
>> www.databay.de
>>
>> Sitz/Amtsgericht Aachen • HRB:8437 • USt-IdNr.: DE 210844202
>> Vorstand: Ralf Schenk, Dipl.-Ing. Jens Conze, Aresch Yavari, Dipl.-Kfm. Philipp Hermanns
>> Aufsichtsratsvorsitzender: Wilhelm Dohmen
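The agent behaviour seen in the log above, shutting itself down after three consecutive monitoring failures, can be sketched roughly as follows. This is a simplified illustration, not the actual ovirt-ha-agent code; the names, the MAX_FAILURES constant, and the simulated failing monitor are all assumptions for the example:

```python
# Simplified sketch of the "Shutting down the agent because of
# 3 failures in a row!" behaviour from the log above.
# Illustrative only; not the real ovirt-ha-agent implementation.

MAX_FAILURES = 3  # assumption, matching "3 failures in a row" in the log


def run_agent(monitor_once):
    """Call monitor_once() repeatedly; a success resets the failure
    streak, MAX_FAILURES consecutive failures shut the agent down."""
    failures = 0
    while True:
        try:
            monitor_once()
            failures = 0  # success resets the streak
        except Exception:
            failures += 1
            if failures >= MAX_FAILURES:
                return 'shutting down after %d failures in a row' % failures


# Simulate a monitor that always times out, as in the log:
def always_times_out():
    raise Exception('timeout during domain acquisition')


print(run_agent(always_times_out))
# -> shutting down after 3 failures in a row
```

This also explains why restarting the services alone does not help here: as long as the domain lease can never be acquired, every restart just runs into the same three timeouts and shuts down again.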
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users