I took a quick look at the builders and noticed that both show the same 'Cannot allocate memory' error, and that it comes up every time a builder is rebooted after an aborted build. The pattern is the same each time, even though there is no corresponding memory consumption visible on the builders.

I'm investigating this further.
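A quick way to check whether a builder actually hit memory pressure around an aborted build (a rough sketch only; it assumes a stock CentOS 7 builder with dmesg and the systemd journal, nothing Gluster-specific):

  # Sketch: look for real memory pressure / OOM-killer activity on a builder.
  # Assumption: standard CentOS 7 image; adjust if the builders log differently.

  free -m                                    # current memory and swap usage

  # OOM-killer activity since the current boot
  dmesg -T | grep -iE 'out of memory|oom-kill' \
      || echo "no OOM events since boot"

  # kernel messages from the boot *before* the post-abort reboot
  journalctl -k -b -1 2>/dev/null | grep -iE 'out of memory|oom-kill' \
      || echo "no OOM events in the previous boot (or no journal for it)"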
On Thu, May 9, 2019 at 10:02 AM Atin Mukherjee <amukh...@redhat.com> wrote:
>
> On Wed, May 8, 2019 at 7:38 PM Atin Mukherjee <amukh...@redhat.com> wrote:
>> builder204 needs to be fixed, too many failures, mostly none of the
>> patches are passing regression.
>
> And with that builder201 joins the pool,
> https://build.gluster.org/job/centos7-regression/5943/consoleFull
>
>> On Wed, May 8, 2019 at 9:53 AM Atin Mukherjee <amukh...@redhat.com> wrote:
>>>
>>> On Wed, May 8, 2019 at 7:16 AM Sanju Rakonde <srako...@redhat.com> wrote:
>>>> Deepshikha,
>>>>
>>>> I see the failure here[1] which ran on builder206. So, we are good.
>>>
>>> Not really,
>>> https://build.gluster.org/job/centos7-regression/5909/consoleFull
>>> failed on builder204 for similar reasons, I believe?
>>>
>>> I am a bit more worried about this issue resurfacing more often these
>>> days. What can we do to fix it permanently?
>>>
>>>> [1] https://build.gluster.org/job/centos7-regression/5901/consoleFull
>>>>
>>>> On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal
>>>> <dkhan...@redhat.com> wrote:
>>>>> Sanju, can you please give us more info about the failures?
>>>>>
>>>>> I see the failures occurring on just one of the builders (builder206).
>>>>> I'm taking it back offline for now.
>>>>>
>>>>> On Tue, May 7, 2019 at 9:42 PM Michael Scherer <msche...@redhat.com> wrote:
>>>>>> On Tuesday, 7 May 2019 at 20:04 +0530, Sanju Rakonde wrote:
>>>>>> > Looks like is_nfs_export_available started failing again in recent
>>>>>> > centos-regressions.
>>>>>> >
>>>>>> > Michael, can you please check?
>>>>>>
>>>>>> I will try, but I am leaving for vacation tonight, so if I find
>>>>>> nothing before I leave, I guess Deepshika will have to look.
>>>>>>
>>>>>> > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul <yk...@redhat.com> wrote:
>>>>>> > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer
>>>>>> > > <msche...@redhat.com> wrote:
>>>>>> > > > On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:
>>>>>> > > > > Is this back again? The recent patches are failing regression :-\ .
>>>>>> > > >
>>>>>> > > > So, on builder206, it took me a while to find that the issue is
>>>>>> > > > that nfs (the service) was running.
>>>>>> > > >
>>>>>> > > > ./tests/basic/afr/tarissue.t failed because the nfs initialisation
>>>>>> > > > failed with a rather cryptic message:
>>>>>> > > >
>>>>>> > > > [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-socket.nfs-server: process started listening on port (38465)
>>>>>> > > > [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-socket.nfs-server: binding to failed: Address already in use
>>>>>> > > > [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-socket.nfs-server: Port is already in use
>>>>>> > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-socket.nfs-server: __socket_server_bind failed;closing socket 14
>>>>>> > > >
>>>>>> > > > I found where this came from, but a few things surprised me:
>>>>>> > > >
>>>>>> > > > - the order of the printed messages is different from the order
>>>>>> > > > in the code
>>>>>> > >
>>>>>> > > Indeed strange...
>>>>>> > >
>>>>>> > > > - the message about "started listening" doesn't take into account
>>>>>> > > > the fact that bind failed on:
>>>>>> > >
>>>>>> > > Shouldn't it bail out if it failed to bind?
>>>>>> > > Some missing 'goto out' around line 975/976?
>>>>>> > > Y.
>>>>>> > >
>>>>>> > > > https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967
>>>>>> > > >
>>>>>> > > > The message about port 38465 also threw me off the track. The
>>>>>> > > > real issue is that the nfs service was already running, and I
>>>>>> > > > couldn't find anything listening on port 38465.
>>>>>> > > >
>>>>>> > > > Once I did "service nfs stop", it no longer failed.
>>>>>> > > >
>>>>>> > > > So far, I do not know why nfs.service was activated.
>>>>>> > > >
>>>>>> > > > But at least 206 should be fixed, and we know a bit more about
>>>>>> > > > what could be causing some of the failures.
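For reference, a rough way to check for that state on a builder before re-running the test (a sketch only; it assumes a CentOS 7 builder with the usual nfs-utils and iproute tools, and uses the port number from the log above, although the conflicting port may well be a different one, since nothing was found listening on 38465):

  # Sketch: is the kernel NFS service holding ports the Gluster NFS server wants?
  # Assumption: CentOS 7 builder; unit names may differ on other images.

  systemctl --no-pager status nfs-server.service || true

  # what, if anything, is listening on the port from the log
  ss -tlnp | grep 38465 || echo "nothing listening on 38465"

  # RPC registrations that the kernel NFS stack would have claimed
  rpcinfo -p | grep -E 'nfs|mountd' || echo "no nfs/mountd registrations"

  # the stopgap used on builder206
  service nfs stop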
>>>>>> > > > >
>>>>>> > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer <msche...@redhat.com> wrote:
>>>>>> > > > > > On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
>>>>>> > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan
>>>>>> > > > > > > <jthot...@redhat.com> wrote:
>>>>>> > > > > > > > Hi,
>>>>>> > > > > > > >
>>>>>> > > > > > > > is_nfs_export_available is just a wrapper around the
>>>>>> > > > > > > > "showmount" command, AFAIR.
>>>>>> > > > > > > >
>>>>>> > > > > > > > I saw the following messages in the console output:
>>>>>> > > > > > > >
>>>>>> > > > > > > > mount.nfs: rpc.statd is not running but is required for remote locking.
>>>>>> > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
>>>>>> > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was specified
>>>>>> > > > > > > >
>>>>>> > > > > > > > To me it looks like rpcbind may not be running on the
>>>>>> > > > > > > > machine. Usually rpcbind starts automatically on machines;
>>>>>> > > > > > > > I don't know whether this can happen or not.
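To make that concrete, here is a rough sketch of what a showmount-based check of this kind looks like. The real is_nfs_export_available lives in the regression framework's include files and may differ in detail, so treat this as an illustration rather than the actual helper; the function name, the "patchy" volume name, and the retry loop are just for the example:

  # Sketch of a showmount-based export check (illustration only, not the
  # real is_nfs_export_available implementation).
  function nfs_export_check ()
  {
          local vol=$1
          local host=${2:-127.0.0.1}

          # showmount -e asks the NFS server (via rpcbind/mountd) for its
          # export list; the export counts as available once the volume
          # shows up. This is why the check breaks when rpcbind is not
          # running on the machine.
          showmount -e "$host" 2>/dev/null | grep -qw "/$vol"
  }

  # Typical use: retry until the export appears, much like the
  # EXPECT_WITHIN-style loops in the .t tests.
  for i in $(seq 1 20); do
          nfs_export_check patchy && break
          sleep 1
  done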
>>>>>> > > > > > >
>>>>>> > > > > > > That's precisely the question. Why are we suddenly seeing
>>>>>> > > > > > > this happen so frequently? Today I saw at least 4 to 5 such
>>>>>> > > > > > > failures already.
>>>>>> > > > > > >
>>>>>> > > > > > > Deepshika - can you please help in inspecting this?
>>>>>> > > > > >
>>>>>> > > > > > So we think (we are not sure) that the issue is a bit complex.
>>>>>> > > > > >
>>>>>> > > > > > What we were investigating was the nightly run failing on AWS.
>>>>>> > > > > > When the build crashes, the builder is restarted, since that's
>>>>>> > > > > > the easiest way to clean everything (even with a perfect test
>>>>>> > > > > > suite that cleaned up after itself, we could always end up with
>>>>>> > > > > > the system in a corrupt state with respect to mounts,
>>>>>> > > > > > filesystems, etc.).
>>>>>> > > > > >
>>>>>> > > > > > In turn, this seems to cause trouble on AWS, since cloud-init
>>>>>> > > > > > or something renames the eth0 interface to ens5 without
>>>>>> > > > > > cleaning up the network configuration.
>>>>>> > > > > >
>>>>>> > > > > > So the network init script fails (because the image says
>>>>>> > > > > > "start eth0" and that interface isn't present), but it fails
>>>>>> > > > > > in a weird way. The network is initialised and working (we can
>>>>>> > > > > > connect), but the dhclient process is not in the right cgroup,
>>>>>> > > > > > and network.service is in a failed state. Restarting the
>>>>>> > > > > > network didn't work. In turn, this means that rpc-statd
>>>>>> > > > > > refuses to start (due to systemd dependencies), which seems to
>>>>>> > > > > > impact various NFS tests.
>>>>>> > > > > >
>>>>>> > > > > > We have also seen that on some builders rpcbind picks up some
>>>>>> > > > > > IPv6 autoconfiguration, but we can't reproduce that, and there
>>>>>> > > > > > is no IPv6 set up anywhere. I suspect the network.service
>>>>>> > > > > > failure is somehow involved, but I fail to see how. In turn,
>>>>>> > > > > > rpcbind.socket not starting could cause NFS test troubles.
>>>>>> > > > > >
>>>>>> > > > > > Our current stopgap fix was to fix all the builders one by
>>>>>> > > > > > one: remove the stale config, kill the rogue dhclient, and
>>>>>> > > > > > restart the network service.
>>>>>> > > > > >
>>>>>> > > > > > However, we can't be sure this is going to fix the problem
>>>>>> > > > > > long term, since it only manifests after a crash of the test
>>>>>> > > > > > suite, and that doesn't happen very often. (Plus, it was
>>>>>> > > > > > working at some point in the past; something made it start
>>>>>> > > > > > failing, and I do not know whether that was a system upgrade,
>>>>>> > > > > > a test change, or both.)
>>>>>> > > > > >
>>>>>> > > > > > So we are still looking at it to get a complete understanding
>>>>>> > > > > > of the issue, but so far we have hacked our way to make it
>>>>>> > > > > > work (or so I think).
>>>>>> > > > > >
>>>>>> > > > > > Deepshika is working on fixing it long term, by fixing the
>>>>>> > > > > > eth0/ens5 issue with a new base image.
>>>>>> > > > > > --
>>>>>> > > > > > Michael Scherer
>>>>>> > > > > > Sysadmin, Community Infrastructure and Platform, OSAS
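For anyone hitting this on another builder, a rough sketch of that per-builder stopgap and the follow-up checks (assumptions: a CentOS 7 builder using the legacy network service, the stale config in the usual /etc/sysconfig/network-scripts location, and the renamed interface being ens5 as described above):

  # Sketch of the stopgap: clean the stale eth0 config, kill the rogue
  # dhclient, restart networking, then make sure rpcbind/rpc-statd can start.
  # Paths and unit names are assumptions based on a stock CentOS 7 image.

  # 1. remove the stale eth0 config left behind after the rename to ens5
  rm -f /etc/sysconfig/network-scripts/ifcfg-eth0

  # 2. kill the dhclient that ended up outside the service's cgroup
  pkill dhclient || true

  # 3. restart the legacy network service and confirm it is no longer failed
  systemctl restart network.service
  systemctl is-active network.service

  # 4. rpc-statd and rpcbind should now be startable again
  systemctl start rpcbind.socket rpcbind.service rpc-statd.service
  systemctl --no-pager status rpc-statd.service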
_______________________________________________
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra