[Gluster-infra] [Bug 1695484] smoke fails with "Build root is locked by another process"

2019-04-03 Thread bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=1695484



--- Comment #3 from M. Scherer  ---
So indeed, https://build.gluster.org/job/devrpm-fedora/15404/ aborted the patch
test, then https://build.gluster.org/job/devrpm-fedora/15405/ failed, but the
next run worked.

Maybe the problem is that it takes more than 30 seconds to clean up the build,
or something similar. Maybe we need to allow some more time, but I can't seem
to find a log to evaluate how long the cleanup takes when a build is cancelled.
Let's keep this bug open and, if the issue arises again, collect the logs and
see if there is a pattern.
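
For what it's worth, a minimal sketch of the kind of pre-build cleanup step
that could be added to the job, assuming the devrpm/smoke jobs build their
RPMs with mock (the chroot name below is only an example):

    #!/bin/bash
    # Hypothetical pre-build cleanup for the devrpm jobs: if the previous
    # run was aborted mid-build, kill anything still holding the build
    # root and scrub it before the new build starts.
    set -euo pipefail

    # Assumed chroot name; the real jobs may use a different mock config.
    MOCK_CONFIG="${MOCK_CONFIG:-fedora-rawhide-x86_64}"

    # Kill leftover processes from an aborted build that still use the chroot.
    mock -r "$MOCK_CONFIG" --orphanskill || true

    # Throw away the (possibly half-built, still locked) build root entirely.
    mock -r "$MOCK_CONFIG" --scrub=chroot

That would make an aborted previous run harmless regardless of how long its
cleanup would otherwise have taken.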



Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?

2019-04-03 Thread Michael Scherer
On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan 
> wrote:
> 
> > Hi,
> > 
> > is_nfs_export_available is just a wrapper around "showmount"
> > command AFAIR.
> > I saw following messages in console output.
> >  mount.nfs: rpc.statd is not running but is required for remote
> > locking.
> > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or
> > start
> > statd.
> > 05:06:55 mount.nfs: an incorrect mount option was specified
> > 
> > For me it looks rpcbind may not be running on the machine.
> > Usually rpcbind starts automatically on machines, don't know
> > whether it
> > can happen or not.
> > 
> 
> That's precisely what the question is. Why suddenly we're seeing this
> happening too frequently. Today I saw atleast 4 to 5 such failures
> already.
> 
> Deepshika - Can you please help in inspecting this?

So we think (though we are not sure yet) that the issue is a bit complex.

What we were investigating was the nightly runs failing on AWS. When a build
crashes, the builder is restarted, since that's the easiest way to clean
everything up (even with a perfect test suite that cleaned up after itself, we
could still end up with the system in a corrupt state, WRT mounts, filesystems,
etc.).

In turn, this seems to cause trouble on AWS, since cloud-init or something
similar renames the eth0 interface to ens5 without cleaning up the network
configuration.
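
A rough way to spot that state on a builder (a sketch only, assuming EL7-style
network scripts; the interface names are just examples):

    #!/bin/bash
    # Compare the interfaces the image is configured to start with the
    # interfaces that actually exist after the reboot.
    configured=$(ls /etc/sysconfig/network-scripts/ifcfg-* 2>/dev/null \
                 | sed 's/.*ifcfg-//' | grep -v '^lo$')
    present=$(ls /sys/class/net | grep -v '^lo$')

    for dev in $configured; do
        if ! grep -qx "$dev" <<< "$present"; then
            echo "WARNING: ifcfg-$dev exists but interface $dev is gone;" \
                 "present interfaces: $present"
        fi
    done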

So the network init script fails (because the image says "start eth0" and that
interface is not present), but it fails in a weird way: the network is
initialised and working (we can connect), but the dhclient process is not in
the right cgroup, and network.service is in a failed state. Restarting the
network didn't work. In turn, this means that rpc-statd refuses to start (due
to systemd dependencies), which seems to impact various NFS tests.
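
For reference, a read-only diagnostic of that failure chain could look like
this (a sketch, not the exact commands we ran):

    #!/bin/bash
    # Inspect the chain: network.service failed -> rpc-statd won't start
    # -> NFS tests fail. This only reads state, it changes nothing.
    systemctl is-active network.service rpcbind.service rpc-statd.service || true

    # How did network.service end up failed, and what does rpc-statd pull in?
    systemctl --no-pager status network.service | head -n 15
    systemctl --no-pager list-dependencies rpc-statd.service

    # Is a rogue dhclient still hanging around, and in which cgroup?
    ps -o pid,cgroup,args -C dhclient || echo "no dhclient running"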

We have also seen that on some builders rpcbind picks up some IPv6
autoconfigured address, but we can't reproduce that, and there is no IPv6 set
up anywhere. I suspect the network.service failure is somehow involved, but I
fail to see how. In turn, rpcbind.socket not starting could cause trouble for
the NFS tests.
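
Checks along those lines (again a sketch, not the exact commands we used):

    #!/bin/bash
    # Did rpcbind.socket start, is anything listening on the portmapper
    # port over IPv6, and is IPv6 really disabled at the kernel level?
    systemctl --no-pager status rpcbind.socket || true
    ss -tulpn | grep ':111 ' || echo "no portmapper listener found"
    sysctl net.ipv6.conf.all.disable_ipv6 net.ipv6.conf.default.disable_ipv6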

Our current stopgap fix was to repair all the builders one by one: remove the
stale config, kill the rogue dhclient, and restart the network service.
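
In shell terms, that stopgap amounts to roughly the following (a sketch; run as
root, and the ifcfg path assumes EL7-style network scripts):

    #!/bin/bash
    # Per-builder stopgap: drop the stale eth0 config, kill the rogue
    # dhclient, and bring networking back under systemd's control.
    set -x
    rm -f /etc/sysconfig/network-scripts/ifcfg-eth0
    pkill dhclient || true
    systemctl restart network.service
    systemctl is-active network.service rpc-statd.service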

However, we can't be sure this is going to fix the problem long term, since it
only manifests after a crash of the test suite, and that doesn't happen very
often. (Plus, it was working at some point in the past before something made it
start failing, and I do not know whether that was a system upgrade, a test
change, or both.)

So we are still looking at it to get a complete understanding of the issue,
but so far we have hacked our way to make it work (or so I think).

Deepshika is working on a long-term fix by addressing the eth0/ens5 issue with
a new base image.
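
One common way to bake such a fix into a base image (an assumption on my part,
not necessarily what the new image will do) is to disable predictable interface
naming so the device keeps its old eth0 name:

    #!/bin/bash
    # Append net.ifnames=0 biosdevname=0 to the kernel command line and
    # regenerate the GRUB config (EL7, BIOS boot assumed).
    sed -i 's/^\(GRUB_CMDLINE_LINUX=.*\)"$/\1 net.ifnames=0 biosdevname=0"/' \
        /etc/default/grub
    grub2-mkconfig -o /boot/grub2/grub.cfg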
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS





Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?

2019-04-03 Thread Michael Scherer
On Wednesday, 3 April 2019 at 15:12 +0300, Yaniv Kaul wrote:
> On Wed, Apr 3, 2019 at 2:53 PM Michael Scherer 
> wrote:
> 
> > On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <
> > > jthot...@redhat.com>
> > > wrote:
> > > 
> > > > Hi,
> > > > 
> > > > is_nfs_export_available is just a wrapper around "showmount"
> > > > command AFAIR.
> > > > I saw following messages in console output.
> > > >  mount.nfs: rpc.statd is not running but is required for remote
> > > > locking.
> > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local,
> > > > or
> > > > start
> > > > statd.
> > > > 05:06:55 mount.nfs: an incorrect mount option was specified
> > > > 
> > > > For me it looks rpcbind may not be running on the machine.
> > > > Usually rpcbind starts automatically on machines, don't know
> > > > whether it
> > > > can happen or not.
> > > > 
> > > 
> > > That's precisely what the question is. Why suddenly we're seeing
> > > this
> > > happening too frequently. Today I saw atleast 4 to 5 such
> > > failures
> > > already.
> > > 
> > > Deepshika - Can you please help in inspecting this?
> > 
> > So in the past, this kind of stuff did happen with ipv6, so this
> > could
> > be a change on AWS and/or a upgrade.
> > 
> 
> We need to enable IPv6, for two reasons:
> 1. IPv6 is common these days, even if we don't test with it, it
> should be
> there.
> 2. We should test with IPv6...
> 
> I'm not sure, but I suspect we do disable IPv6 here and there.
> Example[1].
> Y.
> 
> [1]
> 
https://github.com/gluster/centosci/blob/master/jobs/scripts/glusto/setup-glusto.yml

We do disable IPv6 for sure; Nigel spent 3 days just on that for the AWS
migration, and we have a dedicated playbook, applied on all builders, that
tries to disable it in every possible way:


https://github.com/gluster/gluster.org_ansible_configuration/blob/master/roles/jenkins_builder/tasks/disable_ipv6_linux.yml

According to the comment, that dates from 2016, and I am sure it goes further
back than that; it just wasn't documented before.
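
The playbook linked above is the authoritative version; for reference, the
sysctl core of such a setup looks roughly like this (a sketch, not a copy of
the playbook):

    #!/bin/bash
    # Persistently disable IPv6 on all interfaces and reload sysctl settings.
    printf '%s\n' \
        'net.ipv6.conf.all.disable_ipv6 = 1' \
        'net.ipv6.conf.default.disable_ipv6 = 1' \
        > /etc/sysctl.d/99-disable-ipv6.conf
    sysctl --system   # re-reads /etc/sysctl.d/*, including the new file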


-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS





Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?

2019-04-03 Thread Yaniv Kaul
On Wed, Apr 3, 2019 at 2:53 PM Michael Scherer  wrote:

> On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan 
> > wrote:
> >
> > > Hi,
> > >
> > > is_nfs_export_available is just a wrapper around "showmount"
> > > command AFAIR.
> > > I saw following messages in console output.
> > >  mount.nfs: rpc.statd is not running but is required for remote
> > > locking.
> > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or
> > > start
> > > statd.
> > > 05:06:55 mount.nfs: an incorrect mount option was specified
> > >
> > > For me it looks rpcbind may not be running on the machine.
> > > Usually rpcbind starts automatically on machines, don't know
> > > whether it
> > > can happen or not.
> > >
> >
> > That's precisely what the question is. Why suddenly we're seeing this
> > happening too frequently. Today I saw atleast 4 to 5 such failures
> > already.
> >
> > Deepshika - Can you please help in inspecting this?
>
> So in the past, this kind of stuff did happen with ipv6, so this could
> be a change on AWS and/or a upgrade.
>

We need to enable IPv6, for two reasons:
1. IPv6 is common these days; even if we don't test with it, it should be
there.
2. We should test with IPv6...

I'm not sure, but I suspect we do disable IPv6 here and there. Example[1].
Y.

[1]
https://github.com/gluster/centosci/blob/master/jobs/scripts/glusto/setup-glusto.yml

>
> We are currently investigating a set of failure that happen after
> reboot (resulting in partial network bring up, causing all kind of
> weird issue), but it take some time to verify it, and since we lost 33%
> of the team with Nigel departure, stuff do not move as fast as before.
>
>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>

Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?

2019-04-03 Thread Michael Scherer
On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan 
> wrote:
> 
> > Hi,
> > 
> > is_nfs_export_available is just a wrapper around "showmount"
> > command AFAIR.
> > I saw following messages in console output.
> >  mount.nfs: rpc.statd is not running but is required for remote
> > locking.
> > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or
> > start
> > statd.
> > 05:06:55 mount.nfs: an incorrect mount option was specified
> > 
> > For me it looks rpcbind may not be running on the machine.
> > Usually rpcbind starts automatically on machines, don't know
> > whether it
> > can happen or not.
> > 
> 
> That's precisely what the question is. Why suddenly we're seeing this
> happening too frequently. Today I saw atleast 4 to 5 such failures
> already.
> 
> Deepshika - Can you please help in inspecting this?

So in the past, this kind of thing did happen with IPv6, so this could be a
change on AWS and/or an upgrade.

We are currently investigating a set of failures that happen after reboot
(resulting in partial network bring-up, causing all kinds of weird issues), but
it takes some time to verify, and since we lost 33% of the team with Nigel's
departure, things do not move as fast as before.


-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS





Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?

2019-04-03 Thread Atin Mukherjee
On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan  wrote:

> Hi,
>
> is_nfs_export_available is just a wrapper around "showmount" command AFAIR.
> I saw following messages in console output.
>  mount.nfs: rpc.statd is not running but is required for remote locking.
> 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start
> statd.
> 05:06:55 mount.nfs: an incorrect mount option was specified
>
> For me it looks rpcbind may not be running on the machine.
> Usually rpcbind starts automatically on machines, don't know whether it
> can happen or not.
>

That's precisely the question: why are we suddenly seeing this happening so
frequently? Today I saw at least 4 to 5 such failures already.

Deepshika - Can you please help in inspecting this?
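
(For context, a purely illustrative sketch of a showmount-based check of the
kind Jiffin describes in the quoted message below; the real
is_nfs_export_available lives in nfs.rc in the glusterfs test framework, and
the retry count, sleep, and volume name below are made up for the example.)

    #!/bin/bash
    # Illustrative helper: wait until the NFS server on $host exports the
    # given volume, as reported by showmount -e.
    is_nfs_export_available () {
        local volume=$1
        local host=${2:-127.0.0.1}
        local retries=20

        while (( retries-- > 0 )); do
            if showmount -e "$host" 2>/dev/null | grep -qw "/$volume"; then
                return 0
            fi
            sleep 1
        done
        return 1   # rpcbind / rpc.statd problems typically surface here
    }

    # Example usage (hypothetical volume name):
    is_nfs_export_available patchy || echo "NFS export not available"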


> Regards,
> Jiffin
>
>
> - Original Message -
> From: "Atin Mukherjee" 
> To: "gluster-infra" , "Gluster Devel" <
> gluster-de...@gluster.org>
> Sent: Wednesday, April 3, 2019 10:46:51 AM
> Subject: [Gluster-devel] is_nfs_export_available from nfs.rc failing too
>   often?
>
> I'm observing the above test function failing too often because of which
> arbiter-mount.t test fails in many regression jobs. Such frequency of
> failures wasn't there earlier. Does anyone know what has changed recently
> to cause these failures in regression? I also hear when such failure
> happens a reboot is required, is that true and if so why?
>
> One of the reference :
> https://build.gluster.org/job/centos7-regression/5340/consoleFull
>
>

[Gluster-infra] [Bug 1695484] smoke fails with "Build root is locked by another process"

2019-04-03 Thread bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=1695484

M. Scherer  changed:

   What|Removed |Added

 CC||msche...@redhat.com



--- Comment #2 from M. Scherer  ---
Mhh, then shouldn't we clean up when something does stop the build?



[Gluster-infra] [Bug 1695484] smoke fails with "Build root is locked by another process"

2019-04-03 Thread bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=1695484

Deepshikha khandelwal  changed:

   What|Removed |Added

 CC||dkhan...@redhat.com



--- Comment #1 from Deepshikha khandelwal  ---
It happens mainly because your previously running build was aborted by a new
patchset, and hence no cleanup ran.

Re-triggering might help.



[Gluster-infra] [Bug 1695484] New: smoke fails with "Build root is locked by another process"

2019-04-03 Thread bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=1695484

Bug ID: 1695484
   Summary: smoke fails with "Build root is locked by another
process"
   Product: GlusterFS
   Version: mainline
Status: NEW
 Component: project-infrastructure
  Assignee: b...@gluster.org
  Reporter: pkara...@redhat.com
CC: b...@gluster.org, gluster-infra@gluster.org
  Target Milestone: ---
Classification: Community



Description of problem:
Please check https://build.gluster.org/job/devrpm-fedora/15405/console for more
details. Smoke is failing with the reason mentioned in the subject.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
