Re: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now

2019-04-05 Thread Michael Scherer
On Friday 05 April 2019 at 16:55 +0530, Nithya Balachandran wrote:
> On Fri, 5 Apr 2019 at 12:16, Michael Scherer wrote:
> 
> > On Thursday 04 April 2019 at 18:24 +0200, Michael Scherer wrote:
> > > On Thursday 04 April 2019 at 19:10 +0300, Yaniv Kaul wrote:
> > > > I'm not convinced this is solved. Just had what I believe is a
> > > > similar failure:
> > > > 
> > > > *00:12:02.532* A dependency job for rpc-statd.service failed. See
> > > > 'journalctl -xe' for details.
> > > > *00:12:02.532* mount.nfs: rpc.statd is not running but is required
> > > > for remote locking.
> > > > *00:12:02.532* mount.nfs: Either use '-o nolock' to keep locks
> > > > local, or start statd.
> > > > *00:12:02.532* mount.nfs: an incorrect mount option was specified
> > > > 
> > > > (of course, it can always be my patch!)
> > > > 
> > > > https://build.gluster.org/job/centos7-regression/5384/console
> > > 
> > > Same issue, different builder (206). I will check them all, as the
> > > issue is more widespread than I expected (or it popped up since the
> > > last time I checked).
> > 
> > Deepshika did notice that the issue came back on one server
> > (builder202) after a reboot, so the rpcbind issue is not related to
> > the network initscript one, so the RCA continues.
> > 
> > We are looking for another workaround involving fiddling with the
> > socket (until we find out why it uses ipv6 at boot, but not
> > afterwards, when ipv6 is disabled).
> > 
> 
> Could this be relevant?
> https://access.redhat.com/solutions/2798411

Good catch.

So, we already do that; Nigel took care of it (after 2 days of
research). But I didn't know the exact symptoms, and decided to double
check just in case.

And... there is no sysctl.conf in the initrd. Running "dracut -v -f"
does not change anything.
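
For whoever wants to double check on a builder, a minimal way to look
inside the initrd (assuming the stock EL7 initramfs path; adjust for
the kernel version):

  # list the initrd content and look for the sysctl snippet that
  # disables ipv6 (it is missing on the broken builders)
  lsinitrd /boot/initramfs-$(uname -r).img | grep sysctl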

Running "dracut -v -f -H" take care of that (and this fix the problem),
but:
- our ansible script already run that
- -H is hostonly, which is already the default on EL7 according to the
doc.  

However, if dracut-config-generic is installed, dracut doesn't build a
hostonly initrd, and so does not include the sysctl.conf file (which
breaks rpcbind, which in turn breaks the test suite).
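
To spot an affected builder quickly (assuming that package is indeed
what flips the mode, as it seems to be):

  # if this is installed, dracut defaults to a generic initrd and the
  # host's sysctl.conf is left out
  rpm -q dracut-config-generic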

And for some reason, it is installed in the image on EC2 (likely a
default), but not by default on the builders.

So what happens is that after a kernel upgrade, dracut rebuilds a
generic initrd instead of a hostonly one, which breaks things. And the
kernel was likely upgraded recently (and upgrades happen nightly, for
some value of "night"), so we didn't see that earlier, nor on a fresh
system.


So now, we have several solutions:
- be explicit about using hostonly in dracut, so this doesn't happen
again (or not for this reason); see the config sketch at the end of
this mail

- disable ipv6 in rpcbind in a cleaner way (to be tested; a possible
socket override is sketched after this list)

- get the test suite to work with IPv6
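
For the second option, a rough sketch of the socket fiddling I have in
mind. Untested, and the drop-in path plus the exact Listen* defaults
would need checking against the rpcbind.socket unit shipped on EL7:

  # /etc/systemd/system/rpcbind.socket.d/no-ipv6.conf (hypothetical)
  [Socket]
  # clear the shipped listeners, then re-add only the local and IPv4
  # ones, so early-boot socket activation no longer depends on whether
  # ipv6 got disabled in the initrd
  ListenStream=
  ListenDatagram=
  ListenStream=/var/run/rpcbind.sock
  ListenStream=0.0.0.0:111
  ListenDatagram=0.0.0.0:111

followed by a "systemctl daemon-reload && systemctl restart
rpcbind.socket" to apply it.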

In the long term, I also want to monitor the processes, but for that, I
need a VPN between the nagios server and EC2, and that project got
blocked by several issues (like EC2 not supporting ecdsa keys, which we
use for ansible, so we have to fall back to RSA for fully automated
deployment; and openvpn requires certificates, so I need a newer python
openssl to do what I want, and the one in RHEL 7 is too old, etc, etc).

As the weekend approaches for me, I just rebuilt the initrd for the
time being. I guess forcing hostonly is the safest fix for now, but
that will be for Monday.
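
For reference, forcing hostonly should just be a drop-in like this (a
sketch, assuming dracut-config-generic keeps shipping its
hostonly="no" snippet in /etc/dracut.conf.d and that later files
override earlier ones):

  # /etc/dracut.conf.d/99-force-hostonly.conf (hypothetical name)
  # read after the snippet shipped by dracut-config-generic, so it wins
  hostonly="yes"

plus a "dracut -f" to regenerate the initrd.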
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS





Re: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now

2019-04-05 Thread Nithya Balachandran
On Fri, 5 Apr 2019 at 12:16, Michael Scherer  wrote:

> On Thursday 04 April 2019 at 18:24 +0200, Michael Scherer wrote:
> > On Thursday 04 April 2019 at 19:10 +0300, Yaniv Kaul wrote:
> > > I'm not convinced this is solved. Just had what I believe is a
> > > similar failure:
> > >
> > > *00:12:02.532* A dependency job for rpc-statd.service failed. See
> > > 'journalctl -xe' for details.
> > > *00:12:02.532* mount.nfs: rpc.statd is not running but is required
> > > for remote locking.
> > > *00:12:02.532* mount.nfs: Either use '-o nolock' to keep locks
> > > local, or start statd.
> > > *00:12:02.532* mount.nfs: an incorrect mount option was specified
> > >
> > > (of course, it can always be my patch!)
> > >
> > > https://build.gluster.org/job/centos7-regression/5384/console
> >
> > Same issue, different builder (206). I will check them all, as the
> > issue is more widespread than I expected (or it popped up since the
> > last time I checked).
>
> Deepshika did notice that the issue came back on one server
> (builder202) after a reboot, so the rpcbind issue is not related to
> the network initscript one, so the RCA continues.
>
> We are looking for another workaround involving fiddling with the
> socket (until we find out why it uses ipv6 at boot, but not
> afterwards, when ipv6 is disabled).
>

Could this be relevant?
https://access.redhat.com/solutions/2798411


>
> Maybe we could run the test suite on a node without all the ipv6
> disabling, to see if that causes an issue?
>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>

Re: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now

2019-04-05 Thread Yaniv Kaul
On Fri, Apr 5, 2019 at 9:55 AM Deepshikha Khandelwal wrote:

>
>
> On Fri, Apr 5, 2019 at 12:16 PM Michael Scherer wrote:
>
>> On Thursday 04 April 2019 at 18:24 +0200, Michael Scherer wrote:
>> > On Thursday 04 April 2019 at 19:10 +0300, Yaniv Kaul wrote:
>> > > I'm not convinced this is solved. Just had what I believe is a
>> > > similar failure:
>> > >
>> > > *00:12:02.532* A dependency job for rpc-statd.service failed. See
>> > > 'journalctl -xe' for details.
>> > > *00:12:02.532* mount.nfs: rpc.statd is not running but is required
>> > > for remote locking.
>> > > *00:12:02.532* mount.nfs: Either use '-o nolock' to keep locks
>> > > local, or start statd.
>> > > *00:12:02.532* mount.nfs: an incorrect mount option was specified
>> > >
>> > > (of course, it can always be my patch!)
>> > >
>> > > https://build.gluster.org/job/centos7-regression/5384/console
>> >
>> > Same issue, different builder (206). I will check them all, as the
>> > issue is more widespread than I expected (or it popped up since the
>> > last time I checked).
>>
>> Deepshika did notice that the issue came back on one server
>> (builder202) after a reboot, so the rpcbind issue is not related to
>> the network initscript one, so the RCA continues.
>>
>> We are looking for another workaround involving fiddling with the
>> socket (until we find out why it uses ipv6 at boot, but not
>> afterwards, when ipv6 is disabled).
>>
>> Maybe we could run the test suite on a node without all the ipv6
>> disabling, to see if that causes an issue?
>>
> Has our regression test suite started supporting ipv6 now? If not,
> this investigation would lead to further issues.
>

I suspect not yet. But we would certainly like to, at some point,
ensure we run with IPv6 as well!
Y.

>> --
>> Michael Scherer
>> Sysadmin, Community Infrastructure and Platform, OSAS
>>
>>

[Gluster-infra] [Bug 1696518] builder203 does not have a valid hostname set

2019-04-05 Thread bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=1696518



--- Comment #2 from M. Scherer  ---
So, answering myself: rpc.statd didn't start after the reboot, and the
hostname was ip-172-31-38-158.us-east-2.compute.internal. After
"hostnamectl set-hostname builder203.int.aws.gluster.org", that's
better. I guess we need to automate that (as I had used
builder203.aws.gluster.org, which was wrong).
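
Something like this in the provisioning step should do (a sketch; the
builder name would of course come from the inventory):

  # set the internal AWS name persistently, so rpc.statd comes up
  # after a reboot
  hostnamectl set-hostname builder203.int.aws.gluster.org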
