On Wed, Apr 7, 2021 at 12:36 PM Marcin Sobczyk <msobc...@redhat.com> wrote:
>
> On 4/6/21 10:37 AM, Marcin Sobczyk wrote:
> >
> > On 4/6/21 9:55 AM, Yedidyah Bar David wrote:
> >> On Tue, Apr 6, 2021 at 9:24 AM Marcin Sobczyk <msobc...@redhat.com> wrote:
> >>> Hi,
> >>>
> >>> On 4/6/21 7:23 AM, Yedidyah Bar David wrote:
> >>>> On Mon, Apr 5, 2021 at 5:53 AM <jenk...@jenkins.phx.ovirt.org> wrote:
> >>>>> Project: 
> >>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
> >>>>> Build: 
> >>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1974/
> >>>> FYI: This failed twice in a row (1973 and 1974), for the same reason.
> >>>> I reproduced it locally and looked a bit, but failed to find the root
> >>>> cause. When I connected to host-1's console, it was stuck in emergency
> >>>> mode after reboot. I checked a bit, and there was some error about kdump
> >>>> failing to read the kernel image
> >>>> ( /boot/vmlinuz-4.18.0-240.15.1.el8_3.x86_64 ), but when I tried manually
> >>>> as root I did manage to read it. I rebooted, and the VM came up fine. I
> >>>> decided to try OST again, cleaned up and ran it, and opened a
> >>>> 'lago console' on the VM after it was up, but OST passed. Tried again,
> >>>> passed again. Then I manually ran 1975 in CI and it passed, and the
> >>>> nightly 1976 also passed. So I am going to ignore this for now.
> >>>>
> >>>> I think we need a patch to make lago/OST log the consoles of all the VMs.
> >>>> I might try to work on this.
> >>> Also stumbled upon this. Please take a look at
> >>> https://gerrit.ovirt.org/#/c/ovirt-system-tests/+/114050/
> >> Yes, I did notice this change and wondered if it's related...
> >>
> >> But it's not merged yet, and still HE passed at least 4 times (two locally,
> >> two on CI). Obviously this does not prove that the issue is fixed.
> >>
> >> Anyway, in addition to merely fixing it (which perhaps your patch does),
> >> I also wanted to emphasize the importance of making it easier to fix
> >> future such cases. How did you manage to find the root cause?
> > My case was similar - HE suite was failing for me constantly. I noticed
> > host-1 drops to emergency shell, so I just 'virsh console'd inside
> > and went through the logs. That's when I spotted the problem with
> > the additional '/var/tmp' disk. I tried the fix on my machine and HE
> > suite started working again. Moments later I tried running HE suite
> > without the patch and it was successful again.
> >
> > I couldn't figure out what's the real cause behind these problems,
> > but removing the unnecessary additional disk from host-1 seemed
> > to do the trick.
> >
> > +1 for logging consoles of the VMs - that should help with these kinds
> > of problems in the future.
> Yesterday we hit this problem at least 2 times:
>
> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/16183
> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/16184
>
> Didi, please review the patch mentioned above. If you don't have
> any objections let's merge it and work on improving logging later.

+1 from me.

I also pushed this patch to log the consoles, but it's not as easy as I had hoped:

https://gerrit.ovirt.org/c/lago-ost/+/114150

When you have time, please see my comment there and reply...

Thanks and best regards,
-- 
Didi