On Thu, Apr 20, 2017 at 2:42 PM, Nir Soffer <nsof...@redhat.com> wrote:
> On Thu, Apr 20, 2017 at 13:51, Yaniv Kaul <yk...@redhat.com> wrote:
>
>> On Thu, Apr 20, 2017 at 1:32 PM, Nir Soffer <nsof...@redhat.com> wrote:
>>
>>> On Thu, Apr 20, 2017 at 13:05, Piotr Kliczewski <piotr.kliczew...@gmail.com> wrote:
>>>
>>>> On Thu, Apr 20, 2017 at 11:49 AM, Yaniv Kaul <yk...@redhat.com> wrote:
>>>> >
>>>> > On Thu, Apr 20, 2017 at 11:55 AM, Piotr Kliczewski <piotr.kliczew...@gmail.com> wrote:
>>>> >>
>>>> >> On Thu, Apr 20, 2017 at 10:32 AM, Yaniv Kaul <yk...@redhat.com> wrote:
>>>> >> > No, that's not the issue.
>>>> >> > I've seen it happen a few times.
>>>> >> >
>>>> >> > 1. It's always with the ISO domain (which we don't use anyway in o-s-t).
>>>> >> > 2. Apparently, only one host is asking for a mount:
>>>> >> > authenticated mount request from 192.168.201.4:713 for /exports/nfs/iso (/exports/nfs/iso)
>>>> >> >
>>>> >> > (/var/log/messages of the NFS server)
>>>> >> >
>>>> >> > And indeed, you can see in [1] that host1 made the request and all is well on it.
>>>> >> >
>>>> >> > However, there are connection issues with host0 which cause a timeout in connectStorageServer().
>>>> >> > From [2]:
>>>> >> >
>>>> >> > 2017-04-19 18:58:58,465-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received:
>>>> >> > {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds timeout occured"},"id":null}
>>>> >> >
>>>> >> > 2017-04-19 18:58:58,475-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-37) [755b908a] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM lago-basic-suite-master-host0 command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues
>>>> >> > 2017-04-19 18:58:58,475-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-37) [755b908a] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand' return value 'ServerConnectionStatusReturn:{status='Status [code=5022, message=Message timeout which can be caused by communication issues]'}
>>>> >> >
>>>> >> > I wonder why, but in /var/log/messages [3] I'm seeing:
>>>> >> > Apr 19 18:56:58 lago-basic-suite-master-host0 journal: vdsm Executor WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'id': u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1', u'connection': u'192.168.201.3:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.2', u'password': '********', u'port': u''}], u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'domainType': 1}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60, duration=60 at 0x2f44310> task#=9 at 0x2ac11d0>
>>>> >> > ...
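As an aside, the "Worker blocked" message above is the executor's watchdog reporting a task that is still running past its timeout. A minimal sketch of that pattern, for readers unfamiliar with it (hypothetical names, not vdsm's actual code):

    # Rough sketch of the watchdog pattern behind the "Worker blocked"
    # warning above. All names here are hypothetical, not vdsm code.
    import logging
    import threading
    import time

    logging.basicConfig(level=logging.WARNING)
    log = logging.getLogger("Executor")

    class Task(object):
        def __init__(self, name, timeout):
            self.name = name
            self.timeout = timeout
            self.started = time.time()

        @property
        def duration(self):
            return time.time() - self.started

    def watchdog(workers, interval=1.0):
        """Periodically report workers whose task exceeded its timeout."""
        def check():
            while True:
                time.sleep(interval)
                for name, task in list(workers.items()):
                    if task is not None and task.duration > task.timeout:
                        # Same kind of report as in the log: the worker is
                        # still busy long past its timeout.
                        log.warning(
                            "Worker blocked: <Worker name=%s running %s "
                            "timeout=%d, duration=%d>",
                            name, task.name, task.timeout, task.duration)
        t = threading.Thread(target=check)
        t.daemon = True
        t.start()
        return t

    # Example: a worker stuck on a connect call gets reported repeatedly.
    workers = {"jsonrpc/3": Task("StoragePool.connectStorageServer", timeout=1)}
    watchdog(workers)
    time.sleep(2.5)  # in real code the worker would clear its slot when done

The point is that the warning only surfaces the stuck task; it does not interrupt it, which is why the duration keeps growing (60, then 180) in the log below.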
>>>> >>
>>>> >> I see the following sequence:
>>>> >>
>>>> >> The message is sent:
>>>> >>
>>>> >> 2017-04-19 18:55:58,020-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (org.ovirt.thread.pool-7-thread-37) [755b908a] Message sent: SEND
>>>> >> destination:jms.topic.vdsm_requests
>>>> >> content-length:381
>>>> >> ovirtCorrelationId:755b908a
>>>> >> reply-to:jms.topic.vdsm_responses
>>>> >>
>>>> >> <JsonRpcRequest id: "057da9c2-1e67-4c2f-9511-7d9de250386b", method: StoragePool.connectStorageServer, params: {storagepoolID=00000000-0000-0000-0000-000000000000, domainType=1, connectionParams=[{password=, protocol_version=4.2, port=, iqn=, connection=192.168.201.3:/exports/nfs/share1, id=4ca8fc84-d872-4a7f-907f-9445bda7b6d1, user=, tpgt=1}]}>
>>>> >>
>>>> >> There is no response for the specified amount of time, and we time out:
>>>> >>
>>>> >> 2017-04-19 18:58:58,465-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received:
>>>> >> {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds timeout occured"},"id":null}
>>>> >>
>>>> >> As Yaniv pointed out, here is why we never get the response:
>>>> >>
>>>> >> Apr 19 18:58:58 lago-basic-suite-master-host0 journal: vdsm Executor WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'id': u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1', u'connection': u'192.168.201.3:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.2', u'password': '********', u'port': u''}], u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'domainType': 1}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60, duration=180 at 0x2f44310> task#=9 at 0x2ac11d0>
>>>
>>> This means the connection attempt was stuck for 180 seconds. We need to check whether the mount was stuck, or whether there is some issue in supervdsm running it.
>>>
>>> This is a warning introduced recently; before it, a stuck worker was hidden.
>>>
>>> Do we check that the NFS server is up before we start the tests?
>>
>> We do not check, but it is up - it's installed before anything else, and happens to be on the Engine. It has several minutes to be up and ready.
>> Moreover, a command to the same NFS server, same params, only a different mount point, succeeded a second earlier:
>>
>> 18:55:56,300::supervdsm_server::92::SuperVdsm.ServerCallback::(wrapper) call mount with (u'192.168.201.3:/exports/nfs/share2', u'/rhev/data-center/mnt/192.168.201.3:_exports_nfs_share2') {'vfstype': 'nfs', 'mntOpts': 'soft,nosharecache,timeo=600,retrans=6,nfsvers=4,minorversion=1', 'timeout': None, 'cgroup': None}
>>
>> ...
>>
>> 18:55:56,338::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) return mount with None
>>
>> But:
>>
>> 18:55:58,076::supervdsm_server::92::SuperVdsm.ServerCallback::(wrapper) call mount with (u'192.168.201.3:/exports/nfs/share1', u'/rhev/data-center/mnt/192.168.201.3:_exports_nfs_share1') {'vfstype': 'nfs', 'mntOpts': 'soft,nosharecache,timeo=600,retrans=6,nfsvers=4,minorversion=2', 'timeout': None, 'cgroup': None}
>>
>> got stuck.
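For what it's worth, this should be reproducible by hand on host0, outside of supervdsm, using the exact options from the log. A minimal sketch, assuming the mount point exists and the script runs as root (the target directory is illustrative):

    # Sketch: replay the mount from the supervdsm log outside vdsm, to see
    # whether the NFS 4.2 mount itself hangs. Options are copied verbatim
    # from the log; the mount point is illustrative (mkdir it first).
    import subprocess

    cmd = [
        "mount", "-t", "nfs",
        "-o", "soft,nosharecache,timeo=600,retrans=6,nfsvers=4,minorversion=2",
        "192.168.201.3:/exports/nfs/share1",
        "/mnt/share1-test",
    ]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    print("rc=%d out=%r err=%r" % (proc.returncode, out, err))

If this hangs too, the problem is in the kernel NFS client / server negotiation rather than in supervdsm; if it returns promptly, supervdsm is the suspect.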
>
> The stuck mount is using NFS 4.2; the successful one is 4.1.

On the other host it succeeded [1], but the order was different (it runs in parallel threads). Perhaps we should add some random sleep between them?
Y.

[1] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host1/_var_log/vdsm/supervdsm.log

>> See https://bugzilla.redhat.com/show_bug.cgi?id=1443913
>>
>> Y.
>>
>>>> >> >
>>>> >> > 3. Also, there is still the infamous "unable to update response" issue.
>>>> >>
>>>> >> When we see a timeout on a call, our default behavior is to reconnect, at which point we clean the pending messages.
>>>> >> As a result, when we reconnect and then receive the response to a message sent before the disconnect, we say it is unknown to us.
>>>> >
>>>> > But the example I've given was earlier than the storage issue?
>>>>
>>>> The specific message that you refer to was a ping command, but it timed out (3 secs) before the response arrived, so it was removed from tracking. When the response finally arrived, it was not tracked anymore.
>>>>
>>>> We may want to increase the timeout to give it more time to arrive.
>>>>
>>>> > Y.
>>>> >
>>>> >> > {"jsonrpc":"2.0","method":"Host.ping","params":{},"id":"7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0"}
>>>> >> > 2017-04-19 18:54:27,843-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (org.ovirt.thread.pool-7-thread-15) [62d198cc] Message sent: SEND
>>>> >> > destination:jms.topic.vdsm_requests
>>>> >> > content-length:94
>>>> >> > ovirtCorrelationId:62d198cc
>>>> >> > reply-to:jms.topic.vdsm_responses
>>>> >> >
>>>> >> > <JsonRpcRequest id: "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", method: Host.ping, params: {}>
>>>> >> > 2017-04-19 18:54:27,885-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (org.ovirt.thread.pool-7-thread-16) [1f9aac13] SEND
>>>> >> > ovirtCorrelationId:1f9aac13
>>>> >> > destination:jms.topic.vdsm_requests
>>>> >> > reply-to:jms.topic.vdsm_responses
>>>> >> > content-length:94
>>>> >> >
>>>> >> > ...
>>>> >> >
>>>> >> > {"jsonrpc": "2.0", "id": "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", "result": true}
>>>> >> > 2017-04-19 18:54:32,132-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "id": "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", "result": true}
>>>> >> > 2017-04-19 18:54:32,133-04 ERROR [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) [] Not able to update response for "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0"
>>>> >> >
>>>> >> > Would be nice to understand why.
>>>> >> >
>>>> >> > 4. Lastly, MOM is not running. Why?
>>>> >> >
>>>> >> > Please open a bug with the details from item #2 above.
>>>> >> > Y.
>>>> >> >
>>>> >> > [1] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host1/_var_log/vdsm/supervdsm.log
>>>> >> > [2] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
>>>> >> > [3] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host0/_var_log/messages
>>>> >> >
>>>> >> > On Thu, Apr 20, 2017 at 9:27 AM, Gil Shinar <gshi...@redhat.com> wrote:
>>>> >> >>
>>>> >> >> Test failed: add_secondary_storage_domains
>>>> >> >> Link to suspected patches:
>>>> >> >> Link to Job: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403
>>>> >> >> Link to all logs: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py
>>>> >> >>
>>>> >> >> The error seems to be:
>>>> >> >> 2017-04-19 18:58:58,774-0400 ERROR (jsonrpc/2) [storage.TaskManager.Task] (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') Unexpected error (task:871)
>>>> >> >> Traceback (most recent call last):
>>>> >> >>   File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 878, in _run
>>>> >> >>     return fn(*args, **kargs)
>>>> >> >>   File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 52, in wrapper
>>>> >> >>     res = f(*args, **kwargs)
>>>> >> >>   File "/usr/share/vdsm/storage/hsm.py", line 2709, in getStorageDomainInfo
>>>> >> >>     dom = self.validateSdUUID(sdUUID)
>>>> >> >>   File "/usr/share/vdsm/storage/hsm.py", line 298, in validateSdUUID
>>>> >> >>     sdDom = sdCache.produce(sdUUID=sdUUID)
>>>> >> >>   File "/usr/share/vdsm/storage/sdc.py", line 112, in produce
>>>> >> >>     domain.getRealDomain()
>>>> >> >>   File "/usr/share/vdsm/storage/sdc.py", line 53, in getRealDomain
>>>> >> >>     return self._cache._realProduce(self._sdUUID)
>>>> >> >>   File "/usr/share/vdsm/storage/sdc.py", line 136, in _realProduce
>>>> >> >>     domain = self._findDomain(sdUUID)
>>>> >> >>   File "/usr/share/vdsm/storage/sdc.py", line 153, in _findDomain
>>>> >> >>     return findMethod(sdUUID)
>>>> >> >>   File "/usr/share/vdsm/storage/sdc.py", line 178, in _findUnfetchedDomain
>>>> >> >>     raise se.StorageDomainDoesNotExist(sdUUID)
>>>> >> >> StorageDomainDoesNotExist: Storage domain does not exist: (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)
>>>> >> >> 2017-04-19 18:58:58,777-0400 INFO (jsonrpc/2) [storage.TaskManager.Task] (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') aborting: Task is aborted: 'Storage domain does not exist' - code 358 (task:1176)
>>>> >> >> 2017-04-19 18:58:58,777-0400 ERROR (jsonrpc/2) [storage.Dispatcher] {'status': {'message': "Storage domain does not exist: (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)", 'code': 358}} (dispatcher:78)
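As an aside on item #3 above: the "Not able to update response" error matches the tracking behavior Piotr describes. The request is evicted from the client's pending map when its timeout expires, so a response arriving later has no entry to attach to. A minimal sketch of that race, assuming hypothetical names (this is not the actual oVirt jsonrpc client code):

    # Sketch of the request-tracking race: an entry is dropped once its
    # timeout expires, so a late response finds nothing to update.
    import time

    class Tracker(object):
        def __init__(self):
            self._pending = {}  # request id -> deadline

        def track(self, request_id, timeout):
            self._pending[request_id] = time.time() + timeout

        def expire(self):
            now = time.time()
            for request_id, deadline in list(self._pending.items()):
                if deadline < now:
                    del self._pending[request_id]  # request times out here

        def on_response(self, request_id):
            if request_id not in self._pending:
                # Late response after expiry -- this corresponds to the
                # "Not able to update response" error in engine.log.
                print('Not able to update response for "%s"' % request_id)
                return
            del self._pending[request_id]

    tracker = Tracker()
    tracker.track("7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", timeout=3)
    time.sleep(4)       # response takes longer than the 3s ping timeout
    tracker.expire()    # cleanup removes the pending entry
    tracker.on_response("7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0")

That is consistent with the ~4.2 second gap between the ping SEND at 18:54:27 and the response at 18:54:32 in the log above, versus the 3 second ping timeout; increasing the timeout, as Piotr suggests, would close that window.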
_______________________________________________
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel