[ovirt-devel] Re: bz 1915329: [Stream] Add host fails with: Destination /etc/pki/ovirt-engine/requests not writable

2021-01-18 Thread Marcin Sobczyk



On 1/18/21 9:58 AM, Yedidyah Bar David wrote:

On Mon, Jan 18, 2021 at 10:53 AM Martin Perina  wrote:



On Mon, Jan 18, 2021 at 9:08 AM Yedidyah Bar David  wrote:

On Sun, Jan 17, 2021 at 3:11 PM Yedidyah Bar David  wrote:

On Thu, Jan 14, 2021 at 1:41 PM Yedidyah Bar David  wrote:

On Thu, Jan 14, 2021 at 8:35 AM Yedidyah Bar David  wrote:

On Wed, Jan 13, 2021 at 5:34 PM Yedidyah Bar David  wrote:

On Wed, Jan 13, 2021 at 2:48 PM Yedidyah Bar David  wrote:

On Wed, Jan 13, 2021 at 1:57 PM Marcin Sobczyk  wrote:

Hi,

my guess is it's selinux-related.

Unfortunately I can't find any meaningful errors in audit.log in a
scenario where host deployment fails.
However switching selinux to permissive mode before adding hosts makes
the problem go away, so it's probably not an error somewhere in logic.
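
A minimal sketch of how one might scan audit.log for SELinux denials that mention the directory, as a cross-check of the "nothing meaningful in audit.log" observation above. The helper name find_avc_denials and the "requests" filter string are hypothetical, chosen only for illustration. If nothing shows up, one possibility (not confirmed in this thread) is that the denial is covered by a dontaudit rule and is therefore never logged.

```python
import re


def find_avc_denials(lines, needle=None):
    """Yield (permissions, raw_line) for each SELinux AVC denial in `lines`.

    `needle` optionally restricts matches to lines containing that substring,
    for example a directory name or a process name.
    """
    for line in lines:
        # Only AVC records that report a denial are of interest here.
        if "type=AVC" not in line or "denied" not in line:
            continue
        if needle is not None and needle not in line:
            continue
        # The denied permissions are listed between braces, e.g. "{ write }".
        match = re.search(r"\{([^}]*)\}", line)
        perms = match.group(1).split() if match else []
        yield perms, line.rstrip("\n")
```

To use it on the engine, one would pass an open file object for the audit log (commonly /var/log/audit/audit.log, which normally requires root to read) as `lines`, with needle="requests".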

It's getting weirder: Under strace, it succeeds:

https://gerrit.ovirt.org/c/ovirt-system-tests/+/112948

(Can't see the actual log, as I didn't add '-A', so it was overwritten
on restart...)

After updating it to use '-A' it indeed shows that it worked:

43664 14:16:55.997639 access("/etc/pki/ovirt-engine/requests", W_OK
<unfinished ...>
43664 14:16:55.997695 <... access resumed>) = 0

Weird.
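
The strace lines above show an access() call with W_OK returning 0, i.e. the directory was reported as writable. A minimal sketch of the same check in Python follows. It assumes the check is run as the same user and in the same SELinux context as the failing process, which is exactly what may differ in this bug. The helper name check_writable is hypothetical.

```python
import os


def check_writable(path):
    """Report whether `path` exists and is writable by the current process."""
    return {
        "path": path,
        "exists": os.path.exists(path),
        # os.access() with os.W_OK performs the same kind of check as the
        # access(path, W_OK) system call seen in the strace output.
        "writable": os.access(path, os.W_OK),
    }
```

Calling check_writable("/etc/pki/ovirt-engine/requests") on the engine returns a small dict. A result obtained from an interactive root shell does not necessarily match what the failing process sees.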

Now ran in parallel 'ci test' for this patch and another one from
master, for comparison:

Again, the same:


https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14916/

With strace, passed,


https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1883/

Without strace, failed.

Last nightly run that passed [1] used:

ost-images-el8-host-installed-1-202101100446.x86_64
ovirt-engine-appliance-4.4-20210109182828.1.el8.x86_64

Trying now with these - not sure it is possible to put specific versions inside
automation/*packages, let's see:

https://gerrit.ovirt.org/c/ovirt-system-tests/+/112977

Indeed, with a fixed ost-images and removing updates, it passes. network suite
failed, but he-basic passed:

https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14920/artifact/ci_build_summary.html

So I am quite certain this is an OS issue. Not sure how we do not see
this in basic-suite.
Perhaps it's related to nested-kvm, or to load/slowness caused by that? Weird.

When this fails, we do not collect all of the engine's /var/log, only
messages and ovirt-engine/.
So it's not easy to get a list of the packages that were updated.

Pushed now:

https://github.com/oVirt/ovirt-ansible-collection/pull/202

to get all of engine's /var/log, and ran manual HE job with it:

https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/7680/

This one I accidentally ran with the wrong repo, then ran another one
with the correct repo [1]. But:

1. The repo wasn't used. Emailed about this in a separate thread: "manual
job does not use custom repo"

2. It passed! Since this seems to be a heisenbug, I understand why it
behaves differently when run under strace. But does merely collecting
more logs also make it behave differently? :-) This does not mean
"problem solved" - the latest nightly run [2] did fail with the same error.

Status:

1. he-basic-suite is still failing.

2. Patch to collect all of /var/log from the engine merged.

Dana, can you please update? Did you have any progress?

IMO it's an OS bug. If Marcin says it's an selinux issue, I do not argue :-).
So, how do we continue?


Switching to CentOS Stream development/testing is a big effort, I'm not sure we 
can do this and still deliver all the RFEs/bugs planned for 4.4.5 ...

+1

IMO we should now revert appliance and node to CentOS 8.3, and then
continue the discussion.
Having he-basic-suite broken for a week is too much.

+1. The testing infrastructure for Stream is here, but if it doesn't work
yet, then let's stick to the plan and focus on 8.3.







[1] 
https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/7681/
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1887/




[1] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1879/

--
Martin Perina
Manager, Software Engineering
Red Hat Czech s.r.o.




___
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/MHE2YPXUYGDX6IBS265BPXEXLGCEOZWI/


[ovirt-devel] Re: bz 1915329: [Stream] Add host fails with: Destination /etc/pki/ovirt-engine/requests not writable

2021-01-18 Thread Yedidyah Bar David
On Mon, Jan 18, 2021 at 10:53 AM Martin Perina  wrote:
>
>
>
> On Mon, Jan 18, 2021 at 9:08 AM Yedidyah Bar David  wrote:
>>
>> On Sun, Jan 17, 2021 at 3:11 PM Yedidyah Bar David  wrote:
>> >
>> > On Thu, Jan 14, 2021 at 1:41 PM Yedidyah Bar David  wrote:
>> > >
>> > > On Thu, Jan 14, 2021 at 8:35 AM Yedidyah Bar David  
>> > > wrote:
>> > > >
>> > > > On Wed, Jan 13, 2021 at 5:34 PM Yedidyah Bar David  
>> > > > wrote:
>> > > > >
>> > > > > On Wed, Jan 13, 2021 at 2:48 PM Yedidyah Bar David  
>> > > > > wrote:
>> > > > > >
>> > > > > > On Wed, Jan 13, 2021 at 1:57 PM Marcin Sobczyk 
>> > > > > >  wrote:
>> > > > > > >
>> > > > > > > Hi,
>> > > > > > >
>> > > > > > > my guess is it's selinux-related.
>> > > > > > >
>> > > > > > > Unfortunately I can't find any meaningful errors in audit.log in 
>> > > > > > > a
>> > > > > > > scenario where host deployment fails.
>> > > > > > > However switching selinux to permissive mode before adding hosts 
>> > > > > > > makes
>> > > > > > > the problem go away, so it's probably not an error somewhere in 
>> > > > > > > logic.
>> > > > > >
>> > > > > > It's getting weirder: Under strace, it succeeds:
>> > > > > >
>> > > > > > https://gerrit.ovirt.org/c/ovirt-system-tests/+/112948
>> > > > > >
>> > > > > > (Can't see the actual log, as I didn't add '-A', so it was 
>> > > > > > overwritten
>> > > > > > on restart...)
>> > > > >
>> > > > > After updating it to use '-A' it indeed shows that it worked:
>> > > > >
>> > > > > 43664 14:16:55.997639 access("/etc/pki/ovirt-engine/requests", W_OK
>> > > > > <unfinished ...>
>> > > > > 43664 14:16:55.997695 <... access resumed>) = 0
>> > > > >
>> > > > > Weird.
>> > > > >
>> > > > > Now ran in parallel 'ci test' for this patch and another one from
>> > > > > master, for comparison:
>> > > >
>> > > > Again, the same:
>> > > >
>> > > > >
>> > > > > https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14916/
>> > > >
>> > > > With strace, passed,
>> > > >
>> > > > >
>> > > > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1883/
>> > > >
>> > > > Without strace, failed.
>> > > >
>> > > > Last nightly run that passed [1] used:
>> > > >
>> > > > ost-images-el8-host-installed-1-202101100446.x86_64
>> > > > ovirt-engine-appliance-4.4-20210109182828.1.el8.x86_64
>> > > >
>> > > > Trying now with these - not sure it possible to put specific versions 
>> > > > inside
>> > > > automation/*packages, let's see:
>> > > >
>> > > > https://gerrit.ovirt.org/c/ovirt-system-tests/+/112977
>> > >
>> > > Indeed, with a fixed ost-images and removing updates, it passes. network 
>> > > suite
>> > > failed, but he-basic passed:
>> > >
>> > > https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14920/artifact/ci_build_summary.html
>> > >
>> > > So I am quite certain this is an OS issue. Not sure how we do not see
>> > > this in basic-suite.
>> > > Perhaps it's related to nested-kvm, or to load/slowness caused by that? 
>> > > Weird.
>> > >
>> > > when this fails, we do not collect all engine's /var/log, only
>> > > messages and ovirt-engine/ .
>> > > So it's not easy to get a list of the packages that were updated.
>> > >
>> > > Pushed now:
>> > >
>> > > https://github.com/oVirt/ovirt-ansible-collection/pull/202
>> > >
>> > > to get all of engine's /var/log, and ran manual HE job with it:
>> > >
>> > > https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/7680/
>> >
>> > This one I accidentally ran with the wrong repo, then ran another one
>> > with the correct repo [1],
>> > But:
>> >
>> > 1. The repo wasn't used. Emailed about this a separate thread: "manual
>> > job does not use custom repo"
>> >
>> > 2. It passed! Being what seems like a heisenbug, I understand why when
>> > you run it under strace it
>> > works differently. But even if you just intend to collect more logs it
>> > also causes it to behave
>> > differently? :-) This does not mean that "problem solved" - latest
>> > nightly run [2] did fail with
>> > the same error.
>>
>> Status:
>>
>> 1. he-basic-suite is still failing.
>>
>> 2. Patch to collect all of /var/log from the engine merged.
>>
>> Dana, can you please update? Did you have any progress?
>>
>> IMO it's an OS bug. If Marcin says it's an selinux issue, I do not argue :-).
>> So, how do we continue?
>
>
> Switching to CentOS Stream development/testing is a big effort, I'm not sure 
> we can do this and still deliver all the RFEs/bugs planned for 4.4.5 ...

IMO we should now revert appliance and node to CentOS 8.3, and then
continue the discussion.
Having he-basic-suite broken for a week is too much.

>>
>>
>> >
>> > [1] 
>> > https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/7681/
>> > [2] 
>> > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1887/
>> >
>> > >
>> > >
>> > > >
>> > > > [1] 
>> > > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1879/
>> > > --
>> > > Didi
>> >
>> 

[ovirt-devel] Re: bz 1915329: [Stream] Add host fails with: Destination /etc/pki/ovirt-engine/requests not writable

2021-01-18 Thread Martin Perina
On Mon, Jan 18, 2021 at 9:08 AM Yedidyah Bar David  wrote:

> On Sun, Jan 17, 2021 at 3:11 PM Yedidyah Bar David 
> wrote:
> >
> > On Thu, Jan 14, 2021 at 1:41 PM Yedidyah Bar David 
> wrote:
> > >
> > > On Thu, Jan 14, 2021 at 8:35 AM Yedidyah Bar David 
> wrote:
> > > >
> > > > On Wed, Jan 13, 2021 at 5:34 PM Yedidyah Bar David 
> wrote:
> > > > >
> > > > > On Wed, Jan 13, 2021 at 2:48 PM Yedidyah Bar David <
> d...@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Jan 13, 2021 at 1:57 PM Marcin Sobczyk <
> msobc...@redhat.com> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > my guess is it's selinux-related.
> > > > > > >
> > > > > > > Unfortunately I can't find any meaningful errors in audit.log
> in a
> > > > > > > scenario where host deployment fails.
> > > > > > > However switching selinux to permissive mode before adding
> hosts makes
> > > > > > > the problem go away, so it's probably not an error somewhere
> in logic.
> > > > > >
> > > > > > It's getting weirder: Under strace, it succeeds:
> > > > > >
> > > > > > https://gerrit.ovirt.org/c/ovirt-system-tests/+/112948
> > > > > >
> > > > > > (Can't see the actual log, as I didn't add '-A', so it was
> overwritten
> > > > > > on restart...)
> > > > >
> > > > > After updating it to use '-A' it indeed shows that it worked:
> > > > >
> > > > > 43664 14:16:55.997639 access("/etc/pki/ovirt-engine/requests", W_OK
> > > > > 
> > > > > 43664 14:16:55.997695 <... access resumed>) = 0
> > > > >
> > > > > Weird.
> > > > >
> > > > > Now ran in parallel 'ci test' for this patch and another one from
> > > > > master, for comparison:
> > > >
> > > > Again, the same:
> > > >
> > > > >
> > > > >
> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14916/
> > > >
> > > > With strace, passed,
> > > >
> > > > >
> > > > >
> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1883/
> > > >
> > > > Without strace, failed.
> > > >
> > > > Last nightly run that passed [1] used:
> > > >
> > > > ost-images-el8-host-installed-1-202101100446.x86_64
> > > > ovirt-engine-appliance-4.4-20210109182828.1.el8.x86_64
> > > >
> > > > Trying now with these - not sure it possible to put specific
> versions inside
> > > > automation/*packages, let's see:
> > > >
> > > > https://gerrit.ovirt.org/c/ovirt-system-tests/+/112977
> > >
> > > Indeed, with a fixed ost-images and removing updates, it passes.
> network suite
> > > failed, but he-basic passed:
> > >
> > >
> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14920/artifact/ci_build_summary.html
> > >
> > > So I am quite certain this is an OS issue. Not sure how we do not see
> > > this in basic-suite.
> > > Perhaps it's related to nested-kvm, or to load/slowness caused by
> that? Weird.
> > >
> > > when this fails, we do not collect all engine's /var/log, only
> > > messages and ovirt-engine/ .
> > > So it's not easy to get a list of the packages that were updated.
> > >
> > > Pushed now:
> > >
> > > https://github.com/oVirt/ovirt-ansible-collection/pull/202
> > >
> > > to get all of engine's /var/log, and ran manual HE job with it:
> > >
> > >
> https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/7680/
> >
> > This one I accidentally ran with the wrong repo, then ran another one
> > with the correct repo [1],
> > But:
> >
> > 1. The repo wasn't used. Emailed about this a separate thread: "manual
> > job does not use custom repo"
> >
> > 2. It passed! Being what seems like a heisenbug, I understand why when
> > you run it under strace it
> > works differently. But even if you just intend to collect more logs it
> > also causes it to behave
> > differently? :-) This does not mean that "problem solved" - latest
> > nightly run [2] did fail with
> > the same error.
>
> Status:
>
> 1. he-basic-suite is still failing.
>
> 2. Patch to collect all of /var/log from the engine merged.
>
> Dana, can you please update? Did you have any progress?
>
> IMO it's an OS bug. If Marcin says it's an selinux issue, I do not argue
> :-).
> So, how do we continue?
>

Switching to CentOS Stream development/testing is a big effort, I'm not
sure we can do this and still deliver all the RFEs/bugs planned for 4.4.5
...

>
> >
> > [1]
> https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/7681/
> > [2]
> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1887/
> >
> > >
> > >
> > > >
> > > > [1]
> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1879/
> > > --
> > > Didi
> >
> >
> >
> > --
> > Didi
>
>
>
> --
> Didi
>
>

-- 
Martin Perina
Manager, Software Engineering
Red Hat Czech s.r.o.

[ovirt-devel] Re: bz 1915329: [Stream] Add host fails with: Destination /etc/pki/ovirt-engine/requests not writable

2021-01-18 Thread Yedidyah Bar David
On Sun, Jan 17, 2021 at 3:11 PM Yedidyah Bar David  wrote:
>
> On Thu, Jan 14, 2021 at 1:41 PM Yedidyah Bar David  wrote:
> >
> > On Thu, Jan 14, 2021 at 8:35 AM Yedidyah Bar David  wrote:
> > >
> > > On Wed, Jan 13, 2021 at 5:34 PM Yedidyah Bar David  
> > > wrote:
> > > >
> > > > On Wed, Jan 13, 2021 at 2:48 PM Yedidyah Bar David  
> > > > wrote:
> > > > >
> > > > > On Wed, Jan 13, 2021 at 1:57 PM Marcin Sobczyk  
> > > > > wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > my guess is it's selinux-related.
> > > > > >
> > > > > > Unfortunately I can't find any meaningful errors in audit.log in a
> > > > > > scenario where host deployment fails.
> > > > > > However switching selinux to permissive mode before adding hosts 
> > > > > > makes
> > > > > > the problem go away, so it's probably not an error somewhere in 
> > > > > > logic.
> > > > >
> > > > > It's getting weirder: Under strace, it succeeds:
> > > > >
> > > > > https://gerrit.ovirt.org/c/ovirt-system-tests/+/112948
> > > > >
> > > > > (Can't see the actual log, as I didn't add '-A', so it was overwritten
> > > > > on restart...)
> > > >
> > > > After updating it to use '-A' it indeed shows that it worked:
> > > >
> > > > 43664 14:16:55.997639 access("/etc/pki/ovirt-engine/requests", W_OK
> > > > 
> > > > 43664 14:16:55.997695 <... access resumed>) = 0
> > > >
> > > > Weird.
> > > >
> > > > Now ran in parallel 'ci test' for this patch and another one from
> > > > master, for comparison:
> > >
> > > Again, the same:
> > >
> > > >
> > > > https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14916/
> > >
> > > With strace, passed,
> > >
> > > >
> > > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1883/
> > >
> > > Without strace, failed.
> > >
> > > Last nightly run that passed [1] used:
> > >
> > > ost-images-el8-host-installed-1-202101100446.x86_64
> > > ovirt-engine-appliance-4.4-20210109182828.1.el8.x86_64
> > >
> > > Trying now with these - not sure it possible to put specific versions 
> > > inside
> > > automation/*packages, let's see:
> > >
> > > https://gerrit.ovirt.org/c/ovirt-system-tests/+/112977
> >
> > Indeed, with a fixed ost-images and removing updates, it passes. network 
> > suite
> > failed, but he-basic passed:
> >
> > https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14920/artifact/ci_build_summary.html
> >
> > So I am quite certain this is an OS issue. Not sure how we do not see
> > this in basic-suite.
> > Perhaps it's related to nested-kvm, or to load/slowness caused by that? 
> > Weird.
> >
> > when this fails, we do not collect all engine's /var/log, only
> > messages and ovirt-engine/ .
> > So it's not easy to get a list of the packages that were updated.
> >
> > Pushed now:
> >
> > https://github.com/oVirt/ovirt-ansible-collection/pull/202
> >
> > to get all of engine's /var/log, and ran manual HE job with it:
> >
> > https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/7680/
>
> This one I accidentally ran with the wrong repo, then ran another one
> with the correct repo [1],
> But:
>
> 1. The repo wasn't used. Emailed about this a separate thread: "manual
> job does not use custom repo"
>
> 2. It passed! Being what seems like a heisenbug, I understand why when
> you run it under strace it
> works differently. But even if you just intend to collect more logs it
> also causes it to behave
> differently? :-) This does not mean that "problem solved" - latest
> nightly run [2] did fail with
> the same error.

Status:

1. he-basic-suite is still failing.

2. Patch to collect all of /var/log from the engine merged.

Dana, can you please update? Did you have any progress?

IMO it's an OS bug. If Marcin says it's an selinux issue, I do not argue :-).
So, how do we continue?

>
> [1] 
> https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/7681/
> [2] 
> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1887/
>
> >
> >
> > >
> > > [1] 
> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1879/
> > --
> > Didi
>
>
>
> --
> Didi



-- 
Didi
List Archives: 
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/SMAL3FXKOKNZA3N6YDC6EWIXP4U3WWA2/