On Fri, Feb 3, 2017 at 7:20 PM, Simone Tiraboschi <stira...@redhat.com> wrote:
> On Fri, Feb 3, 2017 at 5:22 PM, Ralf Schenk <r...@databay.de> wrote:
>> Hello,
>>
>> of course:
>>
>> [root@microcloud27 mnt]# sanlock client status
>> daemon 8a93c9ea-e242-408c-a63d-a9356bb22df5.microcloud
>> p -1 helper
>> p -1 listener
>> p -1 status
>>
>> sanlock.log attached. (Beginning 2017-01-27, where everything was fine)
>
> Thanks, the issue is here:
>
> 2017-02-02 19:01:22+0100 4848 [1048]: s36 lockspace 7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96:3:/rhev/data-center/mnt/glusterSD/glusterfs.rxmgmt.databay.de:_engine/7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96/dom_md/ids:0
> 2017-02-02 19:03:42+0100 4988 [12983]: s36 delta_acquire host_id 3 busy1 3 15 13129 7ad427b1-fbb6-4cee-b9ee-01f596fddfbb.microcloud
> 2017-02-02 19:03:43+0100 4989 [1048]: s36 add_lockspace fail result -262
>
> Could you please check if you have other hosts contending for the same ID (id=3 in this case).
>
> Another option is to manually force a sanlock renewal on that host and check what happens, something like:
>
> sanlock client renewal -s 7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96:3:/rhev/data-center/mnt/glusterSD/glusterfs.rxmgmt.databay.de:_engine/7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96/dom_md/ids:0
>
>> Bye
>>
>> On 03.02.2017 at 16:12, Simone Tiraboschi wrote:
>>
>> The hosted-engine storage domain is mounted for sure, but the issue is here:
>> Exception: Failed to start monitoring domain (sd_uuid=7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96, host_id=3): timeout during domain acquisition
>>
>> The point is that in the VDSM logs I see just something like:
>> 2017-02-02 21:05:22,283 INFO (jsonrpc/1) [dispatcher] Run and protect: repoStats(options=None) (logUtils:49)
>> 2017-02-02 21:05:22,285 INFO (jsonrpc/1) [dispatcher] Run and protect: repoStats, Return response: {u'a7fbaaad-7043-4391-9523-3bedcdc4fb0d': {'code': 0, 'actual': True, 'version': 0, 'acquired': True, 'delay': '0.000748727', 'lastCheck': '0.1', 'valid': True}, u'2b2a44fc-f2bd-47cd-b7af-00be59e30a35': {'code': 0,
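For reference, the lockspace string in the sanlock log above encodes `<sd_uuid>:<host_id>:<path-to-ids-file>:<offset>`, which is where the contended host_id (3 here) comes from. A minimal Python sketch (illustrative only, not part of any oVirt tooling) of pulling the fields out of such a spec, e.g. to cross-check which ID each host is trying to acquire:

```python
# Sketch: split a sanlock lockspace spec of the form
# <sd_uuid>:<host_id>:<ids-path>:<offset> into its fields.
def parse_lockspace(spec):
    # The ids path itself contains ':' (e.g. "glusterfs...databay.de:_engine"),
    # so split only the first two fields, then take the offset off the end.
    sd_uuid, host_id, rest = spec.split(":", 2)
    path, offset = rest.rsplit(":", 1)
    return {"sd_uuid": sd_uuid, "host_id": int(host_id),
            "path": path, "offset": int(offset)}

spec = ("7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96:3:"
        "/rhev/data-center/mnt/glusterSD/glusterfs.rxmgmt.databay.de:_engine/"
        "7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96/dom_md/ids:0")
print(parse_lockspace(spec)["host_id"])  # -> 3
```

If two hosts end up with the same host_id in their hosted-engine configuration, they will fight over the same delta lease slot, which matches the `delta_acquire host_id 3 busy1` line above.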
>> 'actual': True, 'version': 0, 'acquired': True, 'delay': '0.00082529', 'lastCheck': '0.1', 'valid': True}, u'5d99af76-33b5-47d8-99da-1f32413c7bb0': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.000349356', 'lastCheck': '5.3', 'valid': True}, u'7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96': {'code': 0, 'actual': True, 'version': 4, 'acquired': False, 'delay': '0.000377052', 'lastCheck': '0.6', 'valid': True}} (logUtils:52)
>>
>> The other storage domains have 'acquired': True, while it is always 'acquired': False for the hosted-engine storage domain.
>>
>> Could you please share /var/log/sanlock.log from the same host and the output of
>> sanlock client status
>> ?
>>
>> On Fri, Feb 3, 2017 at 3:52 PM, Ralf Schenk <r...@databay.de> wrote:
>>> Hello,
>>>
>>> I also put the host in Maintenance and restarted vdsm while ovirt-ha-agent is running. I can mount the gluster volume "engine" manually on the host.
>>>
>>> I get this repeatedly in /var/log/vdsm.log:
>>>
>>> 2017-02-03 15:29:28,891 INFO (MainThread) [vds] Exiting (vdsm:167)
>>> 2017-02-03 15:29:30,974 INFO (MainThread) [vds] (PID: 11456) I am the actual vdsm 4.19.4-1.el7.centos microcloud27 (3.10.0-514.6.1.el7.x86_64) (vdsm:145)
>>> 2017-02-03 15:29:30,974 INFO (MainThread) [vds] VDSM will run with cpu affinity: frozenset([1]) (vdsm:251)
>>> 2017-02-03 15:29:31,013 INFO (MainThread) [storage.check] Starting check service (check:91)
>>> 2017-02-03 15:29:31,017 INFO (MainThread) [storage.Dispatcher] Starting StorageDispatcher...
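The repoStats response quoted above is just a dict keyed by storage-domain UUID, so the offending domain can be spotted mechanically. A small sketch (a hypothetical helper, not VDSM code) of filtering such a response for domains whose lease was not acquired:

```python
# Sketch: given a repoStats-style response (domain UUID -> stats dict),
# list the domains whose sanlock lease is not acquired.
def unacquired_domains(repo_stats):
    return sorted(uuid for uuid, stats in repo_stats.items()
                  if not stats.get("acquired"))

# Trimmed example modeled on the repoStats output quoted above.
stats = {
    "5d99af76-33b5-47d8-99da-1f32413c7bb0":
        {"code": 0, "acquired": True, "valid": True},
    "7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96":
        {"code": 0, "acquired": False, "valid": True},
}
print(unacquired_domains(stats))  # -> ['7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96']
```

Note that 'valid': True with 'acquired': False is exactly the pattern described here: the domain is reachable and monitored, but the host never obtained its sanlock lease on it.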
>>> (dispatcher:47)
>>> 2017-02-03 15:29:31,017 INFO (check/loop) [storage.asyncevent] Starting <EventLoop running=True closed=False at 0x37480464> (asyncevent:122)
>>> 2017-02-03 15:29:31,156 INFO (MainThread) [dispatcher] Run and protect: registerDomainStateChangeCallback(callbackFunc=<functools.partial object at 0x2881fc8>) (logUtils:49)
>>> 2017-02-03 15:29:31,156 INFO (MainThread) [dispatcher] Run and protect: registerDomainStateChangeCallback, Return response: None (logUtils:52)
>>> 2017-02-03 15:29:31,160 INFO (MainThread) [MOM] Preparing MOM interface (momIF:49)
>>> 2017-02-03 15:29:31,161 INFO (MainThread) [MOM] Using named unix socket /var/run/vdsm/mom-vdsm.sock (momIF:58)
>>> 2017-02-03 15:29:31,162 INFO (MainThread) [root] Unregistering all secrets (secret:91)
>>> 2017-02-03 15:29:31,164 INFO (MainThread) [vds] Setting channels' timeout to 30 seconds. (vmchannels:223)
>>> 2017-02-03 15:29:31,165 INFO (MainThread) [vds.MultiProtocolAcceptor] Listening at :::54321 (protocoldetector:185)
>>> 2017-02-03 15:29:31,354 INFO (vmrecovery) [vds] recovery: completed in 0s (clientIF:495)
>>> 2017-02-03 15:29:31,371 INFO (BindingXMLRPC) [vds] XMLRPC server running (bindingxmlrpc:63)
>>> 2017-02-03 15:29:31,471 INFO (periodic/1) [dispatcher] Run and protect: repoStats(options=None) (logUtils:49)
>>> 2017-02-03 15:29:31,472 INFO (periodic/1) [dispatcher] Run and protect: repoStats, Return response: {} (logUtils:52)
>>> 2017-02-03 15:29:31,472 WARN (periodic/1) [MOM] MOM not available. (momIF:116)
>>> 2017-02-03 15:29:31,473 WARN (periodic/1) [MOM] MOM not available, KSM stats will be missing.
>>> (momIF:79)
>>> 2017-02-03 15:29:31,474 ERROR (periodic/1) [root] failed to retrieve Hosted Engine HA info (api:252)
>>> Traceback (most recent call last):
>>>   File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 231, in _getHaInfo
>>>     stats = instance.get_all_stats()
>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
>>>     self._configure_broker_conn(broker)
>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
>>>     dom_type=dom_type)
>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain
>>>     .format(sd_type, options, e))
>>> RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'glusterfs', 'sd_uuid': '7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96'}: Request failed: <class 'ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException'>
>>> 2017-02-03 15:29:35,920 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:49506 (protocoldetector:72)
>>> 2017-02-03 15:29:35,929 INFO (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::1:49506 (protocoldetector:127)
>>> 2017-02-03 15:29:35,930 INFO (Reactor thread) [Broker.StompAdapter] Processing CONNECT request (stompreactor:102)
>>> 2017-02-03 15:29:35,930 INFO (JsonRpc (StompReactor)) [Broker.StompAdapter] Subscribe command received (stompreactor:129)
>>> 2017-02-03 15:29:36,067 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.ping succeeded in 0.00 seconds (__init__:515)
>>> 2017-02-03 15:29:36,071 INFO (jsonrpc/1) [throttled] Current getAllVmStats: {} (throttledlog:105)
>>> 2017-02-03 15:29:36,071 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:515)
>>> 2017-02-03 15:29:46,435 INFO (periodic/0)
>>> [dispatcher] Run and protect: repoStats(options=None) (logUtils:49)
>>> 2017-02-03 15:29:46,435 INFO (periodic/0) [dispatcher] Run and protect: repoStats, Return response: {} (logUtils:52)
>>> 2017-02-03 15:29:46,439 ERROR (periodic/0) [root] failed to retrieve Hosted Engine HA info (api:252)
>>> Traceback (most recent call last):
>>>   File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 231, in _getHaInfo
>>>     stats = instance.get_all_stats()
>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
>>>     self._configure_broker_conn(broker)
>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
>>>     dom_type=dom_type)
>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain
>>>     .format(sd_type, options, e))
>>> RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'glusterfs', 'sd_uuid': '7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96'}: Request failed: <class 'ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException'>
>>> 2017-02-03 15:29:51,095 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:515)
>>> 2017-02-03 15:29:51,219 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.setKsmTune succeeded in 0.00 seconds (__init__:515)
>>> 2017-02-03 15:30:01,444 INFO (periodic/1) [dispatcher] Run and protect: repoStats(options=None) (logUtils:49)
>>> 2017-02-03 15:30:01,444 INFO (periodic/1) [dispatcher] Run and protect: repoStats, Return response: {} (logUtils:52)
>>> 2017-02-03 15:30:01,448 ERROR (periodic/1) [root] failed to retrieve Hosted Engine HA info (api:252)
>>>
>>> On 03.02.2017 at 13:39, Simone Tiraboschi wrote:
>>>
>>> I see there an ERROR on stopMonitoringDomain but I cannot see the corresponding
>>> startMonitoringDomain; could you please look for it?
>>>
>>> On Fri, Feb 3, 2017 at 1:16 PM, Ralf Schenk <r...@databay.de> wrote:
>>>> Hello,
>>>>
>>>> attached is my vdsm.log from the host with hosted-engine-ha, around the time frame of the agent timeout. The host is not working anymore for the engine (it works in oVirt and is active). It simply isn't working for engine-ha anymore after the update.
>>>>
>>>> At 2017-02-02 19:25:34,248 you'll find an error corresponding to the agent timeout error.
>>>>
>>>> Bye
>>>>
>>>> On 03.02.2017 at 11:28, Simone Tiraboschi wrote:
>>>>
>>>>>> 3. Three of my hosts have the hosted engine deployed for HA. At first all three were marked by a crown (the running one was gold and the others were silver). After the upgrade, hosted engine HA is not active anymore on the 3 hosts it is deployed on.
>>>>>>
>>>>>> I can't get this host back with a working ovirt-ha-agent/broker. I already rebooted and manually restarted the services, but it isn't able to get the cluster state according to "hosted-engine --vm-status". The other hosts report this host's status as "unknown stale-data".
>>>>>>
>>>>>> I already shut down all agents on all hosts and issued a "hosted-engine --reinitialize-lockspace", but that didn't help.
>>>>>>
>>>>>> Agents stop working after a timeout error, according to the log:
>>>>>>
>>>>>> MainThread::INFO::2017-02-02 19:24:52,040::hosted_engine::841::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>>>> MainThread::INFO::2017-02-02 19:24:59,185::hosted_engine::841::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>>>> MainThread::INFO::2017-02-02 19:25:06,333::hosted_engine::841::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>>>> MainThread::INFO::2017-02-02 19:25:13,554::hosted_engine::841::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>>>> MainThread::INFO::2017-02-02 19:25:20,710::hosted_engine::841::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>>>> MainThread::INFO::2017-02-02 19:25:27,865::hosted_engine::841::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>>>> MainThread::ERROR::2017-02-02 19:25:27,866::hosted_engine::815::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96, host_id=3): timeout during domain acquisition
>>>>>> MainThread::WARNING::2017-02-02 19:25:27,866::hosted_engine::469::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Failed to start monitoring domain (sd_uuid=7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96, host_id=3): timeout during domain acquisition
>>>>>> MainThread::WARNING::2017-02-02 19:25:27,866::hosted_engine::472::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
>>>>>> Traceback (most recent call last):
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 443, in start_monitoring
>>>>>>     self._initialize_domain_monitor()
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 816, in _initialize_domain_monitor
>>>>>>     raise Exception(msg)
>>>>>> Exception: Failed to start monitoring domain (sd_uuid=7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96, host_id=3): timeout during domain acquisition
>>>>>> MainThread::ERROR::2017-02-02 19:25:27,866::hosted_engine::485::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Shutting down the agent because of 3 failures in a row!
>>>>>> MainThread::INFO::2017-02-02 19:25:32,087::hosted_engine::841::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>>>> MainThread::INFO::2017-02-02 19:25:34,250::hosted_engine::769::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor) Failed to stop monitoring domain (sd_uuid=7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96): Storage domain is member of pool: u'domain=7c8deaa8-be02-4aaf-b9b4-ddc8da99ad96'
>>>>>> MainThread::INFO::2017-02-02 19:25:34,254::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
>>>>>
>>>>> Simone, Martin, can you please follow up on this?
>>>>
>>>> Ralf, could you please attach vdsm logs from one of your hosts for the relevant time frame?
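The "Shutting down the agent because of 3 failures in a row!" line above reflects a plain consecutive-failure counter. A simplified sketch of that pattern (illustrative only, not the actual ovirt-ha-agent code), where the monitor loop resets the counter on success and gives up after a threshold:

```python
# Simplified sketch of a "shut down after N consecutive failures" monitor
# loop, modeled on the agent behavior quoted above (not the real agent code).
MAX_FAILURES = 3

def run_monitor(attempts):
    """attempts: iterable of callables; returns ('shutdown', n_runs) if the
    failure threshold is hit, else ('ok', n_runs) when attempts run out."""
    failures = 0
    runs = 0
    for attempt in attempts:
        runs += 1
        try:
            attempt()
            failures = 0          # any success resets the streak
        except Exception:
            failures += 1
            if failures >= MAX_FAILURES:
                return ("shutdown", runs)   # 3 failures in a row -> give up
    return ("ok", runs)

def fail():
    raise Exception("timeout during domain acquisition")

print(run_monitor([fail, fail, fail]))  # -> ('shutdown', 3)
```

This is why a persistent lease-acquisition timeout (as in this thread) reliably takes the agent down rather than leaving it retrying forever.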
>>>>
>>>> --
>>>> Ralf Schenk
>>>> fon +49 (0) 24 05 / 40 83 70
>>>> fax +49 (0) 24 05 / 40 83 759
>>>> mail r...@databay.de
>>>>
>>>> Databay AG
>>>> Jens-Otto-Krag-Straße 11
>>>> D-52146 Würselen
>>>> www.databay.de
>>>>
>>>> Sitz/Amtsgericht Aachen • HRB:8437 • USt-IdNr.: DE 210844202
>>>> Vorstand: Ralf Schenk, Dipl.-Ing. Jens Conze, Aresch Yavari, Dipl.-Kfm. Philipp Hermanns
>>>> Aufsichtsratsvorsitzender: Wilhelm Dohmen
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users