On Tue, 26 Nov 2019, 10:19 Dominik Holler, <dhol...@redhat.com> wrote:
On Thu, Nov 21, 2019 at 11:24 PM Vojtech Juranek <vjura...@redhat.com> wrote:
> Hi,
> OST fails (see e.g. [1]) in 002_bootstrap.check_update_host. It fails with:
>     FAILED! => {"changed": false, "failures": [], "msg": "Depsolve Error occured:
>      Problem 1: cannot install the best update candidate for package vdsm-network-4.40.0-1236.git63ea8cb8b.el8.x86_64
>       - nothing provides nmstate needed by vdsm-network-4.40.0-1271.git524e08c8a.el8.x86_64
>      Problem 2: package vdsm-python-4.40.0-1271.git524e08c8a.el8.noarch requires vdsm-network = 4.40.0-1271.git524e08c8a.el8, but none of the providers can be installed
>       - cannot install the best update candidate for package vdsm-python-4.40.0-1236.git63ea8cb8b.el8.noarch
>       - nothing provides nmstate needed by vdsm-network-4.40.0-1271.git524e08c8a.el8.x86_64"
>
> [...]
>
> See [2] for full error.
> Can someone please take a look?
> Thanks
> Vojta
>
> [1] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6128/
> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6128/artifact/exported-artifacts/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log

Nir Soffer wrote:
> nmstate should be provided by the copr repo enabled by ovirt-release-master.
> Who installs this rpm in OST?

Dominik Holler wrote:
> I do not understand the question.
>
> I re-triggered as https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6131
> maybe https://gerrit.ovirt.org/#/c/104825/ was missing.
>
> Looks like https://gerrit.ovirt.org/#/c/104825/ is ignored by OST.

Vojtech Juranek wrote:
> Maybe not. You re-triggered with [1], which really missed this patch. I did a
> rebase and it is now running with this patch in build #6132 [2]. Let's wait
> for it to see if gerrit #104825 helps.
>
> [1] https://jenkins.ovirt.org/job/standard-manual-runner/909/
> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6132/
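A quick way to triage the "nothing provides nmstate" half of this on a host is to ask dnf which enabled repository, if any, provides it. A minimal sketch, assuming python3 and dnf on the host; illustrative only, not part of any of the patches discussed here:

    # Ask dnf which enabled repository provides nmstate. Empty output
    # means the copr repo from ovirt-release-master is not enabled on
    # this host.
    import subprocess

    result = subprocess.run(
        ["dnf", "repoquery", "--whatprovides", "nmstate"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # Python 3.6 on el8: no text= kwarg
    )
    print(result.stdout or "no provider of nmstate in the enabled repos")
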
Dominik Holler wrote:
> Miguel, do you think merging
> https://gerrit.ovirt.org/#/c/104495/15/common/yum-repos/ovirt-master-host-cq.repo.in
> would solve this?

Miguel Duarte de Mora Barroso wrote:
> I've split the patch Dominik mentions above in two, one of them adding
> the nmstate / networkmanager copr repos - [3]. Let's see if it fixes it.
>
> [3] https://gerrit.ovirt.org/#/c/104897/

Vojtech Juranek wrote:
> It fixes the original issue, but OST still fails in
> 098_ovirt_provider_ovn.use_ovn_provider:
> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134

Miguel Duarte de Mora Barroso wrote:
> I think Dominik was looking into this issue; +Dominik Holler please confirm.
> Let me know if you need any help, Dominik.

Dominik Holler wrote:
> Thanks.
> The problem is that the hosts lost connection to storage:
> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/exported-artifacts/test_logs/basic-suite-master/post-098_ovirt_provider_ovn.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log
>
>     2019-11-22 05:39:12,326-0500 DEBUG (jsonrpc/5) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405107ea8b4e3ac4ddeb3e19890f$|^/dev/mapper/360014054924c91df75e41178e4b8a80c$|^/dev/mapper/3600140561c0d02829924b77ab7323f17$|^/dev/mapper/3600140582feebc04ca5409a99660dbbc$|^/dev/mapper/36001405c3c53755c13c474dada6be354$|", "r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:153)
>     2019-11-22 05:39:12,415-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata (monitor:501)
>     Traceback (most recent call last):
>       File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
>         delay = result.delay()
>       File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
>         raise exception.MiscFileReadException(self.path, self.rc, self.err)
>     vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata', 1, 'Read timeout')
>     2019-11-22 05:39:12,416-0500 INFO (check/loop) [storage.Monitor] Domain d10879c6-8de1-40ba-87fa-f447844eed2a became INVALID (monitor:472)
>
> I failed to reproduce this locally to analyze it; I will try again. Any hints
> welcome.

Dominik Holler wrote:
> https://gerrit.ovirt.org/#/c/104925/1/ shows that 008_basic_ui_sanity.py
> triggers the problem.
> Is there someone with knowledge about the basic_ui_sanity around?

Martin Perina wrote:
> Marcin, could you please take a look?

Marcin Sobczyk wrote:
> How do you think it's related? By commenting out the ui sanity tests and
> seeing OST finish successfully?
>
> Looking at the 6134 run you were discussing:
>
> - timing of the ui sanity set-up [1]:
>
>     11:40:20 @ Run test: 008_basic_ui_sanity.py:
>
> - timing of the first encountered storage error [2]:
>
>     2019-11-22 05:39:12,415-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata (monitor:501)
>     [same MiscFileReadException "Read timeout" traceback as above]
>
> Timezone difference aside, it seems to me that these storage errors occurred
> before doing anything ui-related.
>
> I remember talking with Steven Rosenberg on IRC a couple of days ago about
> some storage metadata issues and he said he got a response from Nir, that
> "it's a known issue".
>
> Nir, Amit, can you comment on this?
>
> [1] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/console
> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/exported-artifacts/test_logs/basic-suite-master/post-098_ovirt_provider_ovn.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log

Nir Soffer wrote:
> The error mentioned here is not a vdsm error but a warning about storage
> accessibility. We should convert the tracebacks to warnings.
>
> The reason for such an issue can be a misconfigured network (maybe the
> network team is testing negative flows?),

Dominik Holler wrote:
> No.

Nir Soffer wrote:
> or some issue in the NFS server.

Dominik Holler wrote:
> The only hint I found is
>
>     Exiting Time2Retain handler because session_reinstatement=1
>
> but I have no idea what this means or whether it is relevant at all.

Nir Soffer wrote:
> One read timeout is not an issue. We have a real issue only if we see
> consistent read timeouts or errors for a couple of minutes; after that the
> engine can deactivate the storage domain, or deactivate some hosts if only
> those hosts have trouble accessing the storage.
>
> In OST we never expect such conditions, since we don't test negative flows
> and we should have good connectivity, with the VMs running on the same host.

Dominik Holler wrote:
> Ack, this seems to be the problem.
>
> You are right: a time.sleep(8*60) in https://gerrit.ovirt.org/#/c/104925/2
> triggers the issue in the same way.
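The bisection in that patch boils down to replacing the UI phase with a delay of the same length. A minimal sketch of the idea (module and test names here are hypothetical stand-ins, not the actual content of gerrit 104925):

    # Stand-in for the UI sanity tests: keep the suite's timing the
    # same, but perform no UI actions at all. If storage still breaks,
    # elapsed time (not Selenium traffic) is what triggers the failure.
    import time

    def test_sleep_instead_of_ui_sanity():
        time.sleep(8 * 60)  # roughly how long the UI tests normally take
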
Nir Soffer wrote:
> So this is a test issue, assuming that the UI tests can complete in less
> than 8 minutes? Can you explain how adding an 8-minute sleep instead of the
> UI tests reproduced the issue?

Dominik Holler wrote:
> This shows that the issue is not triggered by the UI test, but maybe just by
> passing time. To my eyes this looks like storage simply stops working after
> some time.

Nir Soffer wrote:
> Do we run the ovn tests after the UI tests?

Dominik Holler wrote:
> Nir or Steve, can you please confirm that this is a storage problem?

Nir Soffer wrote:
> Why do you think we have a storage problem?

Dominik Holler wrote:
> I understand from the posted log snippets that the storage is not accessible
> anymore, while the host is still responsive. This might be triggered by
> something outside storage, e.g. the network providing the storage stopped
> working. But I think a possible next step in analysing this issue would be
> to find the reason why storage is not happy.

Nir Soffer wrote:
> No, so far only one read timeout was reported; this does not mean the
> storage is not available anymore. It can be a temporary issue that does not
> harm anything. And the fact that adding a long sleep reproduces the issue
> means it is not related in any way to storage.
>
> First step is to understand which test fails, and why. This can be done by
> the owner of the test, understanding what the test does and what the
> expected system behavior is.

Dominik Holler wrote:
> Sounds like there was a miscommunication in this thread. I will try to
> address all of your points; please let me know if something is missing or
> not clearly expressed.
>
> > First step is to understand which test fails,
>
> 098_ovirt_provider_ovn.use_ovn_provider
>
> > and why. This can be done by the owner of the test,
>
> The test was added by the network team.
>
> > understanding what the test does
>
> The test tries to add a vNIC.
>
> > and what is the expected system behavior.
>
> It is expected that adding a vNIC works, because the VM should be up.
>
> > If the owner of the test thinks that the test failed because of a storage
> > issue, someone from storage can look at this.
>
> I am not sure who is the owner, but I do. Thanks, I would appreciate this.

Nir Soffer wrote:
> What was the actual behavior? Can you explain how adding a vNIC failed
> because of a storage issue?

Dominik Holler wrote:
> The test fails with:
>
>     Cannot add a Network Interface when VM is not Down, Up or Image-Locked.
>
> engine.log says:
>
>     {"jsonrpc": "2.0", "method": "|virt|VM_status|308bd254-9af9-4570-98ea-822609550acf", "params": {"308bd254-9af9-4570-98ea-822609550acf": {"status": "Paused", "pauseCode": "EOTHER", "ioerror": {"alias": "ua-953dd722-5e8b-4b24-bccd-a2a5d5befeb6", "name": "vda", "path": "/rhev/data-center/38c691d4-8556-4882-8f04-a88dff5d0973/bcd1622c-876b-460c-95a7-d09536c42ffe/images/953dd722-5e8b-4b24-bccd-a2a5d5befeb6/dcb5fec4-f219-4d3f-986c-628b0d00b349"}}, "notify_time": 4298388570}}

Nir Soffer wrote:
> So you think adding a vNIC failed because the VM was paused?

Dominik Holler wrote:
> Yes, because of the error message "Cannot add a Network Interface when VM is
> not Down, Up or Image-Locked."
>
> vdsm.log says:
>
>     2019-11-20 10:51:06,026-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata (monitor:501)
>     Traceback (most recent call last):
>       File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
>         delay = result.delay()
>       File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
>         raise exception.MiscFileReadException(self.path, self.rc, self.err)
>     vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata', 1, 'Read timeout')

Nir Soffer wrote:
> Is this related to the paused VM? You did not provide a timestamp for the
> engine event above.

Dominik Holler wrote:
> The log entry '{"status": "Paused", "pauseCode": "EOTHER", "ioerror"' makes
> me think so. I can't find last week's logs, maybe they have already faded
> out. Please find more recent logs in
> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/6492
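For readers reconstructing the failure: the failing step is a NIC hot-plug through the engine API, which the engine refuses while the VM is paused on an I/O error. A minimal sketch of that interaction with ovirtsdk4 (engine URL, credentials and VM name are hypothetical; the real test is OST's 098_ovirt_provider_ovn.py):

    # Reproduce the check the test trips over: the engine only allows
    # adding a NIC when the VM is Down, Up or Image-Locked, so a VM
    # paused with pauseCode=EOTHER fails here even though the vNIC
    # configuration itself is fine.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url="https://engine/ovirt-engine/api",  # hypothetical URL
        username="admin@internal",
        password="secret",                      # hypothetical password
        insecure=True,
    )
    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search="name=vm0")[0]  # hypothetical VM name

    if vm.status not in (types.VmStatus.DOWN, types.VmStatus.UP,
                         types.VmStatus.IMAGE_LOCKED):
        raise RuntimeError(
            "VM is %s; the engine will refuse to add a NIC" % vm.status)

    vms_service.vm_service(vm.id).nics_service().add(
        types.Nic(name="nic1"))  # vNIC profile omitted for brevity
    connection.close()
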
Dominik Holler wrote:
> [...]
>
>     2019-11-20 10:51:56,249-0500 WARN (check/loop) [storage.check] Checker '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/64daa060-1d83-46b9-b7e8-72a902e1134b/dom_md/metadata' is blocked for 60.00 seconds (check:282)
>     2019-11-20 10:51:56,885-0500 ERROR (monitor/775b710) [storage.Monitor] Error checking domain 775b7102-7f2c-4eee-a4d0-a41b55451f7e (monitor:427)
>     Traceback (most recent call last):
>       File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 408, in _checkDomainStatus
>         self.domain.selftest()
>       File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 710, in selftest
>         self.oop.os.statvfs(self.domaindir)
>       File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 242, in statvfs
>         return self._iop.statvfs(path)
>       File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 479, in statvfs
>         resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
>       File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 442, in _sendCommand
>         raise Timeout(os.strerror(errno.ETIMEDOUT))
>     ioprocess.Timeout: Connection timed out

Nir Soffer wrote:
> This shows that storage was not accessible for 60 seconds (ioprocess uses a
> 60-second timeout).
>
> A 60-second timeout is bad. If we have leases on this storage domain (e.g.
> the SPM lease), they will expire 20 seconds after this event, and vdsm on
> the SPM host will be killed.
>
> Do we have network tests changing the network used by the NFS storage domain
> before this event?
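To make Nir's arithmetic above concrete (both constants are assumptions quoted from this thread, not read from vdsm or sanlock configuration):

    # Timeline of an NFS outage as described above: ioprocess gives up
    # after 60 seconds, and a sanlock-protected lease (e.g. the SPM
    # lease) is assumed to expire after 80 seconds of inaccessible
    # storage, at which point sanlock kills vdsm on the SPM host.
    IOPROCESS_TIMEOUT = 60  # seconds, vdsm's ioprocess call timeout
    LEASE_EXPIRY = 80       # seconds, assumed total outage budget

    margin = LEASE_EXPIRY - IOPROCESS_TIMEOUT
    print("seconds left after the ioprocess timeout fires:", margin)  # 20
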
Dominik Holler wrote:
> No.

Nir Soffer wrote:
> What changed in the network tests or code since OST last passed?

Dominik Holler wrote:
> I am not aware of a change which might be relevant. Maybe the fact that the
> hosts are on CentOS 8 while the engine (which also serves the storage) is on
> CentOS 7 is relevant. Also, the occurrence of this issue does not seem to be
> 100% deterministic; I guess because it is timing-related.
>
> The error is reproducible locally by running OST and just keeping the
> environment alive after basic-suite-master succeeds. After some time, the
> storage becomes inaccessible.

When this happens, does the storage domain change its state and go south, or
is it a temporary glitch that only halts VMs? Do the host or storage server
logs have something suspicious at that time (kernel messages, NFS logs)?
_______________________________________________
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/X3QEO5T37XCOT6LB46OHLPPR3AIFGR2V/