I took a quick look at the builders and noticed that both show the same 'Cannot allocate memory' error, and that it comes up every time a builder is rebooted after an aborted build. The pattern is the same each time, even though there is no corresponding memory consumption visible on the builders.

I'm investigating this further.
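A quick way to check whether a builder actually hit memory pressure around an aborted build (a rough sketch only; it assumes a stock CentOS 7 builder with dmesg and the systemd journal, nothing Gluster-specific):

  # Sketch: look for real memory pressure / OOM-killer activity on a builder.
  # Assumption: standard CentOS 7 image; adjust if the builders log differently.

  free -m                                    # current memory and swap usage

  # OOM-killer activity since the current boot
  dmesg -T | grep -iE 'out of memory|oom-kill' \
      || echo "no OOM events since boot"

  # kernel messages from the boot *before* the post-abort reboot
  journalctl -k -b -1 2>/dev/null | grep -iE 'out of memory|oom-kill' \
      || echo "no OOM events in the previous boot (or no journal for it)"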
On Thu, May 9, 2019 at 10:02 AM Atin Mukherjee <amukh...@redhat.com> wrote:
>
> On Wed, May 8, 2019 at 7:38 PM Atin Mukherjee <amukh...@redhat.com> wrote:
>> builder204 needs to be fixed, too many failures, mostly none of the
>> patches are passing regression.
>
> And with that builder201 joins the pool,
> https://build.gluster.org/job/centos7-regression/5943/consoleFull
>
>> On Wed, May 8, 2019 at 9:53 AM Atin Mukherjee <amukh...@redhat.com> wrote:
>>>
>>> On Wed, May 8, 2019 at 7:16 AM Sanju Rakonde <srako...@redhat.com> wrote:
>>>> Deepshikha,
>>>>
>>>> I see the failure here[1] which ran on builder206. So, we are good.
>>>
>>> Not really,
>>> https://build.gluster.org/job/centos7-regression/5909/consoleFull
>>> failed on builder204 for similar reasons, I believe?
>>>
>>> I am a bit more worried about this issue resurfacing more often these
>>> days. What can we do to fix it permanently?
>>>
>>>> [1] https://build.gluster.org/job/centos7-regression/5901/consoleFull
>>>>
>>>> On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal
>>>> <dkhan...@redhat.com> wrote:
>>>>> Sanju, can you please give us more info about the failures?
>>>>>
>>>>> I see the failures occurring on just one of the builders (builder206).
>>>>> I'm taking it back offline for now.
>>>>>
>>>>> On Tue, May 7, 2019 at 9:42 PM Michael Scherer <msche...@redhat.com> wrote:
>>>>>> On Tuesday, 7 May 2019 at 20:04 +0530, Sanju Rakonde wrote:
>>>>>> > Looks like is_nfs_export_available started failing again in recent
>>>>>> > centos-regressions.
>>>>>> >
>>>>>> > Michael, can you please check?
>>>>>>
>>>>>> I will try, but I am leaving for vacation tonight, so if I find
>>>>>> nothing before I leave, I guess Deepshika will have to look.
>>>>>>
>>>>>> > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul <yk...@redhat.com> wrote:
>>>>>> > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer
>>>>>> > > <msche...@redhat.com> wrote:
>>>>>> > > > On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:
>>>>>> > > > > Is this back again? The recent patches are failing regression :-\ .
>>>>>> > > >
>>>>>> > > > So, on builder206, it took me a while to find that the issue is
>>>>>> > > > that nfs (the service) was running.
>>>>>> > > >
>>>>>> > > > ./tests/basic/afr/tarissue.t failed because the nfs initialisation
>>>>>> > > > failed with a rather cryptic message:
>>>>>> > > >
>>>>>> > > > [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-socket.nfs-server: process started listening on port (38465)
>>>>>> > > > [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-socket.nfs-server: binding to failed: Address already in use
>>>>>> > > > [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-socket.nfs-server: Port is already in use
>>>>>> > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-socket.nfs-server: __socket_server_bind failed;closing socket 14
>>>>>> > > >
>>>>>> > > > I found where this came from, but a few things surprised me:
>>>>>> > > >
>>>>>> > > > - the order of the printed messages is different from the order
>>>>>> > > > in the code
>>>>>> > >
>>>>>> > > Indeed strange...
>>>>>> > >
>>>>>> > > > - the message about "started listening" doesn't take into account
>>>>>> > > > the fact that bind failed on:
>>>>>> > >
>>>>>> > > Shouldn't it bail out if it failed to bind?
>>>>>> > > Some missing 'goto out' around line 975/976?
>>>>>> > > Y.
>>>>>> > >
>>>>>> > > > https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967
>>>>>> > > >
>>>>>> > > > The message about port 38465 also threw me off the track. The
>>>>>> > > > real issue is that the nfs service was already running, and I
>>>>>> > > > couldn't find anything listening on port 38465.
>>>>>> > > >
>>>>>> > > > Once I did "service nfs stop", it no longer failed.
>>>>>> > > >
>>>>>> > > > So far, I do not know why nfs.service was activated.
>>>>>> > > >
>>>>>> > > > But at least 206 should be fixed, and we know a bit more about
>>>>>> > > > what could be causing some of the failures.
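For reference, a rough way to check for that state on a builder before re-running the test (a sketch only; it assumes a CentOS 7 builder with the usual nfs-utils and iproute tools, and uses the port number from the log above, although the conflicting port may well be a different one, since nothing was found listening on 38465):

  # Sketch: is the kernel NFS service holding ports the Gluster NFS server wants?
  # Assumption: CentOS 7 builder; unit names may differ on other images.

  systemctl --no-pager status nfs-server.service || true

  # what, if anything, is listening on the port from the log
  ss -tlnp | grep 38465 || echo "nothing listening on 38465"

  # RPC registrations that the kernel NFS stack would have claimed
  rpcinfo -p | grep -E 'nfs|mountd' || echo "no nfs/mountd registrations"

  # the stopgap used on builder206
  service nfs stop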
>>>>>> > > > >
>>>>>> > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer <msche...@redhat.com> wrote:
>>>>>> > > > > > On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
>>>>>> > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan
>>>>>> > > > > > > <jthot...@redhat.com> wrote:
>>>>>> > > > > > > > Hi,
>>>>>> > > > > > > >
>>>>>> > > > > > > > is_nfs_export_available is just a wrapper around the
>>>>>> > > > > > > > "showmount" command, AFAIR.
>>>>>> > > > > > > >
>>>>>> > > > > > > > I saw the following messages in the console output:
>>>>>> > > > > > > >
>>>>>> > > > > > > > mount.nfs: rpc.statd is not running but is required for remote locking.
>>>>>> > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
>>>>>> > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was specified
>>>>>> > > > > > > >
>>>>>> > > > > > > > To me it looks like rpcbind may not be running on the
>>>>>> > > > > > > > machine. Usually rpcbind starts automatically on machines;
>>>>>> > > > > > > > I don't know whether this can happen or not.
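To make that concrete, here is a rough sketch of what a showmount-based check of this kind looks like. The real is_nfs_export_available lives in the regression framework's include files and may differ in detail, so treat this as an illustration rather than the actual helper; the function name, the "patchy" volume name, and the retry loop are just for the example:

  # Sketch of a showmount-based export check (illustration only, not the
  # real is_nfs_export_available implementation).
  function nfs_export_check ()
  {
          local vol=$1
          local host=${2:-127.0.0.1}

          # showmount -e asks the NFS server (via rpcbind/mountd) for its
          # export list; the export counts as available once the volume
          # shows up. This is why the check breaks when rpcbind is not
          # running on the machine.
          showmount -e "$host" 2>/dev/null | grep -qw "/$vol"
  }

  # Typical use: retry until the export appears, much like the
  # EXPECT_WITHIN-style loops in the .t tests.
  for i in $(seq 1 20); do
          nfs_export_check patchy && break
          sleep 1
  done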
>>>>>> > > > > > >
>>>>>> > > > > > > That's precisely the question. Why are we suddenly seeing
>>>>>> > > > > > > this happen so frequently? Today I saw at least 4 to 5 such
>>>>>> > > > > > > failures already.
>>>>>> > > > > > >
>>>>>> > > > > > > Deepshika - can you please help in inspecting this?
>>>>>> > > > > >
>>>>>> > > > > > So we think (we are not sure) that the issue is a bit complex.
>>>>>> > > > > >
>>>>>> > > > > > What we were investigating was the nightly run failing on AWS.
>>>>>> > > > > > When the build crashes, the builder is restarted, since that's
>>>>>> > > > > > the easiest way to clean everything (even with a perfect test
>>>>>> > > > > > suite that cleaned up after itself, we could always end up with
>>>>>> > > > > > the system in a corrupt state with respect to mounts,
>>>>>> > > > > > filesystems, etc.).
>>>>>> > > > > >
>>>>>> > > > > > In turn, this seems to cause trouble on AWS, since cloud-init
>>>>>> > > > > > or something renames the eth0 interface to ens5 without
>>>>>> > > > > > cleaning up the network configuration.
>>>>>> > > > > >
>>>>>> > > > > > So the network init script fails (because the image says
>>>>>> > > > > > "start eth0" and that interface isn't present), but it fails
>>>>>> > > > > > in a weird way. The network is initialised and working (we can
>>>>>> > > > > > connect), but the dhclient process is not in the right cgroup,
>>>>>> > > > > > and network.service is in a failed state. Restarting the
>>>>>> > > > > > network didn't work. In turn, this means that rpc-statd
>>>>>> > > > > > refuses to start (due to systemd dependencies), which seems to
>>>>>> > > > > > impact various NFS tests.
>>>>>> > > > > >
>>>>>> > > > > > We have also seen that on some builders rpcbind picks up some
>>>>>> > > > > > IPv6 autoconfiguration, but we can't reproduce that, and there
>>>>>> > > > > > is no IPv6 set up anywhere. I suspect the network.service
>>>>>> > > > > > failure is somehow involved, but I fail to see how. In turn,
>>>>>> > > > > > rpcbind.socket not starting could cause NFS test troubles.
>>>>>> > > > > >
>>>>>> > > > > > Our current stopgap fix was to fix all the builders one by
>>>>>> > > > > > one: remove the stale config, kill the rogue dhclient, and
>>>>>> > > > > > restart the network service.
>>>>>> > > > > >
>>>>>> > > > > > However, we can't be sure this is going to fix the problem
>>>>>> > > > > > long term, since it only manifests after a crash of the test
>>>>>> > > > > > suite, and that doesn't happen very often. (Plus, it was
>>>>>> > > > > > working at some point in the past; something made it start
>>>>>> > > > > > failing, and I do not know whether that was a system upgrade,
>>>>>> > > > > > a test change, or both.)
>>>>>> > > > > >
>>>>>> > > > > > So we are still looking at it to get a complete understanding
>>>>>> > > > > > of the issue, but so far we have hacked our way to make it
>>>>>> > > > > > work (or so I think).
>>>>>> > > > > >
>>>>>> > > > > > Deepshika is working on fixing it long term, by fixing the
>>>>>> > > > > > eth0/ens5 issue with a new base image.
>>>>>> > > > > > --
>>>>>> > > > > > Michael Scherer
>>>>>> > > > > > Sysadmin, Community Infrastructure and Platform, OSAS
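For anyone hitting this on another builder, a rough sketch of that per-builder stopgap and the follow-up checks (assumptions: a CentOS 7 builder using the legacy network service, the stale config in the usual /etc/sysconfig/network-scripts location, and the renamed interface being ens5 as described above):

  # Sketch of the stopgap: clean the stale eth0 config, kill the rogue
  # dhclient, restart networking, then make sure rpcbind/rpc-statd can start.
  # Paths and unit names are assumptions based on a stock CentOS 7 image.

  # 1. remove the stale eth0 config left behind after the rename to ens5
  rm -f /etc/sysconfig/network-scripts/ifcfg-eth0

  # 2. kill the dhclient that ended up outside the service's cgroup
  pkill dhclient || true

  # 3. restart the legacy network service and confirm it is no longer failed
  systemctl restart network.service
  systemctl is-active network.service

  # 4. rpc-statd and rpcbind should now be startable again
  systemctl start rpcbind.socket rpcbind.service rpc-statd.service
  systemctl --no-pager status rpc-statd.service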
_______________________________________________
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra