Re: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often?

2019-04-24 Thread Yaniv Kaul
On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer  wrote:

> On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:
> > Is this back again? The recent patches are failing regression :-\ .
>
> So, on builder206, it took me a while to find that the issue is that
> nfs (the service) was running.
>
> ./tests/basic/afr/tarissue.t failed, because the nfs initialisation
> failed with a rather cryptic message:
>
> [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-
> socket.nfs-server: process started listening on port (38465)
> [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-
> socket.nfs-server: binding to  failed: Address already in use
> [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-
> socket.nfs-server: Port is already in use
> [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-
> socket.nfs-server: __socket_server_bind failed;closing socket 14
>
> I found where this came from, but a few things surprised me:
>
> - the order of the printed messages differs from the order in the code
>

Indeed strange...

> - the message about "started listening" didn't take into account the fact
> that the bind failed; see:
>

Shouldn't it bail out if it failed to bind?
Some missing 'goto out' around line 975/976?
Y.
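
A minimal C sketch of the control flow in question (hypothetical and
simplified; the function name and log strings are made up for this
example, and it is not the actual socket.c code):

    /* If the error branch lacks a bail-out, the "started listening"
     * message is logged and listen() is attempted even though bind()
     * has already failed. (This sketch does not explain the odd
     * message ordering seen in the log above.) */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <sys/socket.h>

    int server_bind_sketch(int sock, struct sockaddr *addr,
                           socklen_t len, uint16_t port)
    {
        int ret = bind(sock, addr, len);
        if (ret == -1) {
            fprintf(stderr, "E: binding failed: %s\n", strerror(errno));
            if (errno == EADDRINUSE)
                fprintf(stderr, "E: Port is already in use\n");
            /* suspected missing bail-out, e.g.: goto out; */
        }
        fprintf(stderr, "I: process started listening on port (%u)\n",
                (unsigned) port);
        ret = listen(sock, SOMAXCONN);
    /* out: */
        return ret;
    }

With a bail-out ('goto out' or an early return) in the error branch, the
misleading success message and the doomed listen() would both be skipped.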

>
>
>
> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967
>
> The message about port 38465 also threw me off the track. The real
> issue is that the nfs service was already running, and I couldn't find
> anything listening on port 38465.
>
> Once I ran "service nfs stop", it no longer failed.
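
(A quick way to double-check a port in cases like this: a tiny,
hypothetical C probe, not part of the test suite, that simply tries to
bind the port itself. EADDRINUSE means something already holds the port
even if no LISTEN entry shows up; success means this particular port was
not the conflict.)

    /* Try to bind TCP port 38465 (the one from the log above) and
     * report the result. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sin;

        if (s == -1)
            return 1;
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(38465);
        sin.sin_addr.s_addr = htonl(INADDR_ANY);

        if (bind(s, (struct sockaddr *) &sin, sizeof(sin)) == -1)
            printf("bind(38465): %s\n", strerror(errno));
        else
            printf("port 38465 is bindable\n");
        close(s);
        return 0;
    }
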
>
> So far, I do not know why nfs.service was activated.
>
> But at least builder206 should be fixed, and we know a bit more about
> what could be causing some of the failures.
>
>
>
> > On Wed, 3 Apr 2019 at 19:26, Michael Scherer 
> > wrote:
> >
> > > On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <
> > > > jthot...@redhat.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > is_nfs_export_available is just a wrapper around the "showmount"
> > > > > command, AFAIR.
> > > > > I saw the following messages in the console output.
> > > > >  mount.nfs: rpc.statd is not running but is required for remote
> > > > > locking.
> > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local,
> > > > > or
> > > > > start
> > > > > statd.
> > > > > 05:06:55 mount.nfs: an incorrect mount option was specified
> > > > >
> > > > > To me it looks like rpcbind may not be running on the machine.
> > > > > Usually rpcbind starts automatically on machines; I don't know
> > > > > whether it can fail to do so.
> > > > >
> > > >
> > > > That's precisely the question: why are we suddenly seeing this
> > > > happen so frequently? Today I saw at least 4 to 5 such failures
> > > > already.
> > > >
> > > > Deepshika - Can you please help in inspecting this?
> > >
> > > So we think (we are not sure) that the issue is a bit complex.
> > >
> > > What we were investigating was the nightly run failing on AWS. When
> > > the build crashes, the builder is restarted, since that's the easiest
> > > way to clean everything (even with a perfect test suite that cleaned
> > > up after itself, we could always end up in a corrupt state on the
> > > system, WRT mounts, fs, etc.).
> > >
> > > In turn, this seems to cause trouble on AWS, since cloud-init or
> > > something renames the eth0 interface to ens5 without cleaning up the
> > > network configuration.
> > >
> > > So the network init script fails (because the image says "start eth0"
> > > and that interface is not present), but it fails in a weird way. The
> > > network is initialised and working (we can connect), but the dhclient
> > > process is not in the right cgroup, and network.service is in a failed
> > > state. Restarting the network didn't work. In turn, this means that
> > > rpc-statd refuses to start (due to systemd dependencies), which seems
> > > to impact various NFS tests.
> > >
> > > We have also seen that on some builders, rpcbind picks up some IPv6
> > > autoconfiguration, but we can't reproduce that, and there is no IPv6
> > > set up anywhere. I suspect the network.service failure is somehow
> > > involved, but I fail to see how. In turn, rpcbind.socket not starting
> > > could cause NFS test troubles.
> > >
> > > Our current stopgap fix was to repair all the builders one by one:
> > > remove the config, kill the rogue dhclient, and restart the network
> > > service.
> > >
> > > However, we can't be sure this is going to fix the problem long term,
> > > since it only manifests after a crash of the test suite, and that
> > > doesn't happen very often. (Plus, it was working before some day in
> > > the past, when something made this start failing, and I do not know
> > > whether that was a system upgrade, a test change, or both.)
> > >
> > > So we are still looking at it to have a complete understanding of
> > > the issue.

Re: [Gluster-devel] [Gluster-Maintainers] BZ updates

2019-04-24 Thread Niels de Vos
On Wed, Apr 24, 2019 at 08:44:10AM +0530, Nithya Balachandran wrote:
> All,
> 
> When working on a bug, please ensure that you update the BZ with any
> relevant information as well as the RCA. I have seen several BZs in the
> past which report crashes; however, they do not have a backtrace or RCA
> captured.
> Having this information in the BZ makes it much easier to see if a newly
> reported issue has already been fixed.
> 
> I propose that maintainers merge patches only if the BZs are updated with
> required information. It will take some time to make this a habit but it
> will pay off in the end.

Great point! I really hope that most of the contributors know that
debugging steps in bugs are extremely valuable. When documented in a
bug, similar issues can be analyzed with the same techniques. As a
reminder for this, I'm proposing this addition to the Maintainer
Guidelines:

   https://github.com/gluster/glusterdocs/pull/471
   - Ensure the related Bug or GitHub Issue has sufficient details about
     the cause of the problem, or a description of why the change is being
     introduced.

I'd appreciate it if someone could approve and merge that. Of course,
suggestions for rephrasing are welcome too.

Niels


Re: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often?

2019-04-24 Thread Jiffin Thottan
Below, it looks like kernel NFS was started (it may be enabled on the machine).

Did you start rpcbind manually on that machine? If yes, can you please
check the kernel NFS status before and after starting that service?
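
For example, here is a sketch under two assumptions (libtirpc for the
SunRPC API, and rpcbind reachable on localhost; the usual CLI check is
simply "rpcinfo -p"). It lists the rpcbind registrations for the NFS
program (100003), which kernel nfsd registers when it starts, so running
it before and after starting the service shows whether kernel NFS came
along:

    /* Dump rpcbind registrations for the NFS program (100003).
     * Build assumption: cc check_nfs_reg.c -ltirpc */
    #include <stdio.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <rpc/rpc.h>
    #include <rpc/pmap_prot.h>
    #include <rpc/pmap_clnt.h>

    int main(void)
    {
        struct sockaddr_in addr;
        struct pmaplist *m;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(111); /* rpcbind/portmapper */
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        for (m = pmap_getmaps(&addr); m != NULL; m = m->pml_next)
            if (m->pml_map.pm_prog == 100003)
                printf("nfs v%lu proto %lu port %lu\n",
                       m->pml_map.pm_vers, m->pml_map.pm_prot,
                       m->pml_map.pm_port);
        return 0;
    }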

--

Jiffin

- Original Message -
From: "Michael Scherer" 
To: "Atin Mukherjee" 
Cc: "Deepshikha Khandelwal" , "Gluster Devel" 
, "Jiffin Thottan" , 
"gluster-infra" 
Sent: Tuesday, April 23, 2019 7:44:49 PM
Subject: Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from 
nfs.rc failing too often?

On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:
> Is this back again? The recent patches are failing regression :-\ .

So, on builder206, it took me a while to find that the issue is that
nfs (the service) was running.

./tests/basic/afr/tarissue.t failed, because the nfs initialisation
failed with a rather cryptic message:

[2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-
socket.nfs-server: process started listening on port (38465)
[2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-
socket.nfs-server: binding to  failed: Address already in use
[2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-
socket.nfs-server: Port is already in use
[2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-
socket.nfs-server: __socket_server_bind failed;closing socket 14

I found where this came from, but a few things surprised me:

- the order of the printed messages differs from the order in the code
- the message about "started listening" didn't take into account the fact
that the bind failed; see:


https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967

The message about port 38465 also threw me off the track. The real
issue is that the nfs service was already running, and I couldn't find
anything listening on port 38465.

Once I ran "service nfs stop", it no longer failed.

So far, I do not know why nfs.service was activated.

But at least builder206 should be fixed, and we know a bit more about
what could be causing some of the failures.

 
