Looks like is_nfs_export_available started failing again in recent centos-regression runs.
Michael, can you please check?

On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul <yk...@redhat.com> wrote:
>
> On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer <msche...@redhat.com> wrote:
>
>> On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:
>> > Is this back again? The recent patches are failing regression :-\ .
>>
>> So, on builder206, it took me a while to find that the issue is that
>> nfs (the service) was running.
>>
>> ./tests/basic/afr/tarissue.t failed, because the nfs initialisation
>> failed with a rather cryptic message:
>>
>> [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-socket.nfs-server: process started listening on port (38465)
>> [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-socket.nfs-server: binding to failed: Address already in use
>> [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-socket.nfs-server: Port is already in use
>> [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-socket.nfs-server: __socket_server_bind failed;closing socket 14
>>
>> I found where this came from, but a few things surprised me:
>>
>> - the messages are printed in a different order than they appear in
>>   the code
>
> Indeed strange...
>
>> - the "started listening" message does not take into account the fact
>>   that the bind failed:
>
> Shouldn't it bail out if it failed to bind?
> Some missing 'goto out' around line 975/976?
> Y.
>
>> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967
>>
>> The message about port 38465 also threw me off the track. The real
>> issue is that the nfs service was already running, and I couldn't find
>> anything listening on port 38465.
>>
>> Once I did "service nfs stop", it no longer failed.
>>
>> So far, I do not know why nfs.service was activated.
>>
>> But at least 206 should be fixed, and we know a bit more about what
>> could be causing some failures.
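For illustration, here is a minimal sketch of the bail-out Yaniv suggests, assuming the surrounding code looks roughly like the linked __socket_server_bind. This is not the actual glusterfs code; the function name and messages are invented. The idea is that the bind() result is checked, and on failure the code jumps to cleanup before the success message can ever be printed.

/* Sketch only, not the real socket.c.  The point: the success log line
 * runs only after bind() has actually succeeded, and a failed bind
 * takes the 'goto out' cleanup path immediately. */
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

static int
server_bind_sketch(int sock, uint16_t port)
{
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        fprintf(stderr, "E: binding failed: %s\n", strerror(errno));
        if (errno == EADDRINUSE)
            fprintf(stderr, "E: port is already in use\n");
        goto out;                 /* the suspected missing early exit */
    }

    /* Logged only on the success path, so a failed bind can no longer
     * produce a misleading "started listening" message. */
    fprintf(stderr, "I: process started listening on port (%u)\n",
            (unsigned)port);
    return 0;

out:
    close(sock);
    return -1;
}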
>> > On Wed, 3 Apr 2019 at 19:26, Michael Scherer <msche...@redhat.com> wrote:
>> >
>> > > On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
>> > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <jthot...@redhat.com> wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > is_nfs_export_available is just a wrapper around the "showmount"
>> > > > > command, AFAIR. I saw the following messages in the console output:
>> > > > >
>> > > > > mount.nfs: rpc.statd is not running but is required for remote locking.
>> > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
>> > > > > 05:06:55 mount.nfs: an incorrect mount option was specified
>> > > > >
>> > > > > To me it looks like rpcbind may not be running on the machine.
>> > > > > Usually rpcbind starts automatically on machines; I don't know
>> > > > > whether it can fail to do so or not.
>> > > >
>> > > > That's precisely the question: why are we suddenly seeing this happen
>> > > > so frequently? Today I saw at least 4 or 5 such failures already.
>> > > >
>> > > > Deepshika - Can you please help in inspecting this?
>> > >
>> > > So we think (we are not sure) that the issue is a bit complex.
>> > >
>> > > What we were investigating was a nightly run failure on AWS. When the
>> > > build crashes, the builder is restarted, since that's the easiest way
>> > > to clean everything (even with a perfect test suite that cleaned up
>> > > after itself, we could always end up with the system in a corrupt
>> > > state, WRT mounts, fs, etc.).
>> > >
>> > > In turn, this seems to cause trouble on AWS, since cloud-init or
>> > > something renames the eth0 interface to ens5 without cleaning up the
>> > > network configuration.
>> > >
>> > > So the network init script fails (because the image says "start eth0"
>> > > and that interface is not present), but it fails in a weird way. The
>> > > network is initialised and working (we can connect), but the dhclient
>> > > process is not in the right cgroup, and network.service is in a failed
>> > > state. Restarting the network didn't work. In turn, this means that
>> > > rpc-statd refuses to start (due to systemd dependencies), which seems
>> > > to impact various NFS tests.
>> > >
>> > > We have also seen that on some builders rpcbind picks up some IPv6
>> > > autoconfiguration, but we can't reproduce that, and there is no IPv6
>> > > set up anywhere. I suspect the network.service failure is somehow
>> > > involved, but I fail to see how. In turn, rpcbind.socket not starting
>> > > could cause NFS test troubles.
>> > >
>> > > Our current stopgap fix was to repair all the builders one by one:
>> > > remove the config, kill the rogue dhclient, restart the network
>> > > service.
>> > >
>> > > However, we can't be sure this is going to fix the problem long term,
>> > > since it only manifests after a crash of the test suite, and that
>> > > doesn't happen very often. (Plus, it was working at some point in the
>> > > past before something made it fail, and I do not know if that was a
>> > > system upgrade, a test change, or both.)
>> > >
>> > > So we are still looking at it to get a complete understanding of the
>> > > issue, but so far we have hacked our way to making it work (or so I
>> > > think).
>> > >
>> > > Deepshika is working on a long-term fix, addressing the eth0/ens5
>> > > issue with a new base image.
>> > > --
>> > > Michael Scherer
>> > > Sysadmin, Community Infrastructure and Platform, OSAS
>> >
>> > --
>> > - Atin (atinm)
>> --
>> Michael Scherer
>> Sysadmin, Community Infrastructure
--
Thanks,
Sanju
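As an aside to the rpcbind discussion above: whether rpcbind is actually up can be probed from C with the classic portmapper API. Below is a small diagnostic sketch, not taken from the thread, that asks the local rpcbind for its registered programs, roughly what "rpcinfo -p" does; if rpcbind never started, pmap_getmaps() returns NULL, which is the condition that keeps rpc.statd from starting and makes mount.nfs (and the showmount behind is_nfs_export_available) fail. On modern distros the SunRPC API lives in libtirpc, so building this may need -I/usr/include/tirpc and -ltirpc.

/* Diagnostic sketch: list the RPC programs registered with the local
 * rpcbind/portmapper.  A NULL result means rpcbind is not reachable. */
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <rpc/rpc.h>
#include <rpc/pmap_clnt.h>
#include <rpc/pmap_prot.h>

int main(void)
{
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(111);              /* rpcbind's well-known port */
    addr.sin_addr.s_addr = inet_addr("127.0.0.1");

    struct pmaplist *list = pmap_getmaps(&addr);
    if (list == NULL) {
        fprintf(stderr, "rpcbind not reachable on 127.0.0.1:111\n");
        return 1;
    }
    /* Walk the linked list of program-to-port mappings. */
    for (; list != NULL; list = list->pml_next)
        printf("program %lu version %lu protocol %lu port %lu\n",
               list->pml_map.pm_prog, list->pml_map.pm_vers,
               list->pml_map.pm_prot, list->pml_map.pm_port);
    return 0;
}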