Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?

2019-04-24 Thread Yaniv Kaul
On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer  wrote:

> On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:
> > Is this back again? The recent patches are failing regression :-\ .
>
> So, on builder206, it took me a while to find that the issue is that
> nfs (the service) was running.
>
> ./tests/basic/afr/tarissue.t failed, because the nfs initialisation
> failed with a rather cryptic message:
>
> [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-
> socket.nfs-server: process started listening on port (38465)
> [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-
> socket.nfs-server: binding to  failed: Address already in use
> [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-
> socket.nfs-server: Port is already in use
> [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-
> socket.nfs-server: __socket_server_bind failed;closing socket 14
>
> I found where this came from, but a few things surprised me:
>
> - the order of the printed messages is different from the order in the code
>

Indeed strange...

> - the message about "started listening" didn't take into account the
> fact that the bind had failed; see:
>

Shouldn't it bail out if it failed to bind?
Some missing 'goto out' around line 975/976?
Y.

>
> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967
>
> The message about port 38465 also threw me off track. The real issue is
> that the nfs service was already running, and I couldn't find anything
> listening on port 38465.
>
> Once I ran "service nfs stop", it no longer failed.
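
(As an aside, a minimal sketch of the kind of checks that point at a kernel NFS
server squatting on the ports the regression run needs; the service/unit names
are assumptions and vary by distribution, this is not the exact procedure used
on builder206.)

  # Is a kernel NFS server registered with rpcbind, and is anything
  # already listening on the ports the test run needs?
  rpcinfo -p localhost | grep -w nfs
  ss -tlnp | grep -E ':2049|:38465'

  # Stop it and keep it from coming back on the next reboot of the builder.
  # "nfs" (SysV-style) and "nfs-server" (systemd) are assumed unit names.
  service nfs stop 2>/dev/null || systemctl stop nfs-server
  systemctl disable nfs-server 2>/dev/null || chkconfig nfs off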
>
> So far, I do not know why nfs.service was activated.
>
> But at least builder206 should be fixed, and we know a bit more about
> what could be causing some of the failures.
>
>
>
> > On Wed, 3 Apr 2019 at 19:26, Michael Scherer 
> > wrote:
> >
> > > On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <
> > > > jthot...@redhat.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > is_nfs_export_available is just a wrapper around the "showmount"
> > > > > command, AFAIR.
> > > > > I saw the following messages in the console output.
> > > > > mount.nfs: rpc.statd is not running but is required for remote locking.
> > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
> > > > > 05:06:55 mount.nfs: an incorrect mount option was specified
> > > > >
> > > > > To me it looks like rpcbind may not be running on the machine.
> > > > > Usually rpcbind starts automatically on machines, so I don't know
> > > > > whether this can happen or not.
> > > > >
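
(An aside on Jiffin's point above: a minimal sketch, assuming a showmount-based
helper along the lines he describes; this is not the actual implementation from
nfs.rc, and the function name and arguments are made up for illustration.)

  # Hypothetical showmount-based export check. A dead rpcbind (or rpc.mountd)
  # makes showmount fail, and with it the check.
  function is_nfs_export_available_sketch () {
      local host=${1:-127.0.0.1}
      local volume=$2
      showmount -e "$host" 2>/dev/null | grep -qw "$volume"
  }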
> > > >
> > > > That's precisely the question: why are we suddenly seeing this
> > > > happen so frequently? Today I saw at least 4 to 5 such failures
> > > > already.
> > > >
> > > > Deepshika - Can you please help in inspecting this?
> > >
> > > So we think (we are not sure) that the issue is a bit complex.
> > >
> > > What we were investigating was the nightly run failures on AWS. When
> > > the build crashes, the builder is restarted, since that's the easiest
> > > way to clean everything (even with a perfect test suite that cleaned up
> > > after itself, we could always end up in a corrupt state on the system,
> > > WRT mounts, filesystems, etc.).
> > >
> > > In turn, this seems to cause trouble on AWS, since cloud-init or
> > > something renames the eth0 interface to ens5 without cleaning up the
> > > network configuration.
> > >
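
(A quick, hedged way to spot that mismatch on a builder; the network-scripts
path is the usual EL layout and is an assumption here.)

  # Interfaces the kernel actually has vs. the ones the image's config expects
  ip -o link show | awk -F': ' '{print $2}'
  ls /etc/sysconfig/network-scripts/ifcfg-*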
> > > So the network init script fails (because the image says "start eth0"
> > > and that interface is not present), but it fails in a weird way. The
> > > network is initialised and working (we can connect), but the dhclient
> > > process is not in the right cgroup, and network.service is in a failed
> > > state. Restarting the network didn't help. In turn, this means that
> > > rpc-statd refuses to start (due to systemd dependencies), which seems
> > > to impact various NFS tests.
> > >
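
(For reference, a rough sketch of how that dependency chain can be inspected on
a builder; the unit names are the standard ones and are assumptions here, not
taken from the actual machines.)

  # What does rpc-statd pull in / depend on?
  systemctl list-dependencies rpc-statd

  # Current state of the pieces mentioned above
  systemctl status network.service rpcbind.socket rpcbind.service rpc-statd --no-pager

  # Why did rpc-statd refuse to start since the last boot?
  journalctl -b -u rpc-statd -u rpcbind --no-pager | tail -n 50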
> > > We have also seen that on some builders, rpcbind picks up some IPv6
> > > autoconfiguration, but we can't reproduce that, and there is no IPv6
> > > set up anywhere. I suspect the network.service failure is somehow
> > > involved, but I fail to see how. In turn, rpcbind.socket not starting
> > > could cause trouble for the NFS tests.
> > >
> > > Our current stop-gap fix was to repair all the builders one by one:
> > > remove the stale config, kill the rogue dhclient, and restart the
> > > network service.
> > >
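
(Roughly the per-builder cleanup described above, as a hedged sketch; the
config path and unit names are assumptions, not the exact commands that were
run.)

  # Drop the stale config for the interface that no longer exists
  rm -f /etc/sysconfig/network-scripts/ifcfg-eth0

  # Kill the dhclient that ended up outside the expected cgroup
  pkill dhclient

  # Bring networking back up so the dependent units (rpcbind, rpc-statd) can start
  systemctl restart network
  systemctl restart rpcbind.socket rpc-statd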
> > > However, we can't be sure this is going to fix the problem long term,
> > > since it only manifests after a crash of the test suite, and that
> > > doesn't happen very often. (Plus, it was working until some day in the
> > > past, when something made this start failing, and I do not know whether
> > > that was a system upgrade, a test change, or both.)
> > >
> > > So we are still looking at it to have a complete understanding of
> > > the
> > > 

Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?

2019-04-24 Thread Jiffin Thottan
That looks like the kernel NFS server was started (it may be enabled on the machine).

Did you start rpcbind manually on that machine? If yes, can you please check
the kernel NFS status before and after starting that service?

--

Jiffin
