On 2014-07-22 15:12, Joseph Fernandes wrote:
> Hi All,
>
> Further investigation found the following:
>
> 1) I was able to reproduce the issue without running the complete
> regression, just by running bug-1112559.t alone on slave30 (which had
> been rebooted and given a clean gluster setup).
> This rules out any involvement of previous failures from other spurious
> errors like mgmt_v3-locks.t.
> 2) I added some log messages, plus execution of a script (netstat and
> ps -ef | grep gluster), for when binding to a port fails (in
> rpc/rpc-transport/socket/src/socket.c) and found the following:
>
> It is always the snapshot brick on the second node (127.1.1.2) that
> fails to acquire the port (e.g. 127.1.1.2:49155).
>
> Netstat output shows:
> tcp        0      0 127.1.1.2:49155      0.0.0.0:*      LISTEN      3555/glusterfsd

Could this be the time to propose that gluster understand port
reservation a la systemd (LISTEN_FDS), and that the test harness make
sure that random ports do not collide with the set of expected ports?
That would be beneficial when starting from systemd as well.
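For reference, the systemd side of that proposal is a small protocol: the
service manager binds the sockets itself and passes them to the daemon as
inherited file descriptors, announced through the LISTEN_PID and LISTEN_FDS
environment variables (see sd_listen_fds(3)). Below is a minimal sketch of
the receiving end; it assumes nothing about gluster's actual transport
code, and the names are illustrative, not real gluster API:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* First fd passed by the service manager, per sd_listen_fds(3). */
    #define SD_LISTEN_FDS_START 3

    /*
     * Hand-rolled equivalent of sd_listen_fds(): returns the number of
     * pre-bound sockets inherited from systemd, or 0 if the daemon was
     * started normally and has to bind() its ports itself.
     */
    static int listen_fds_count(void)
    {
        const char *pid_s = getenv("LISTEN_PID");
        const char *fds_s = getenv("LISTEN_FDS");

        if (pid_s == NULL || fds_s == NULL)
            return 0;                         /* not socket-activated */
        if ((pid_t) strtol(pid_s, NULL, 10) != getpid())
            return 0;                         /* fds meant for another process */
        return (int) strtol(fds_s, NULL, 10); /* fds 3 .. 3+n-1 are ours */
    }

    int main(void)
    {
        int n = listen_fds_count();

        if (n > 0)
            printf("using %d pre-bound socket(s), fds %d..%d\n",
                   n, SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + n - 1);
        else
            printf("no inherited sockets, falling back to bind()\n");
        return 0;
    }

If glusterfsd accepted its brick port this way, the port would be reserved
by whoever spawns the process before any race window opens, instead of
being fought over at bind() time as in the failure above.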
> and the process that is holding port 49155 is
>
> root 3555 1 0 12:38 ? 00:00:00
> /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id
> patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p
> /d/backends/3/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid
> -S /var/run/ff772f1ff85950660f389b0ed43ba2b7.socket --brick-name
> /d/backends/2/patchy_snap_mnt -l
> /usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log
> --xlator-option *-posix.glusterd-uuid=3af134ec-5552-440f-ad24-1811308ca3a8
> --brick-port 49155 --xlator-option patchy-server.listen-port=49155
>
> Please note that even though it says 127.1.1.2, it shows the
> glusterd-uuid of the third node, the one that was probed when the
> snapshot was created: "3af134ec-5552-440f-ad24-1811308ca3a8"
>
> To clarify things, there is already a volume brick on 127.1.1.2:
>
> root 3446 1 0 12:38 ? 00:00:00
> /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id
> patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p
> /d/backends/2/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid
> -S /var/run/e667c69aa7a1481c7bd567b917cd1b05.socket --brick-name
> /d/backends/2/patchy_snap_mnt -l
> /usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log
> --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f
> --brick-port 49153 --xlator-option patchy-server.listen-port=49153
>
> And the above brick process (3555) is not visible before the snapshot
> creation or after the failure of the snap brick start on 127.1.1.2.
> This means that the process was spawned and died during the creation of
> the snapshot and the probe of the third node (which happen
> simultaneously).
>
> In addition to these processes, we can see multiple snap brick
> processes for the second brick on the second node, which are not seen
> after the failure to start the snap brick on 127.1.1.2:
>
> root 3582 1 0 12:38 ? 00:00:00
> /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id
> /snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d.127.1.1.2.var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2
> -p
> /d/backends/2/glusterd/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d/run/127.1.1.2-var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.pid
> -S /var/run/668f3d4b1c55477fd5ad1ae381de0447.socket --brick-name
> /var/run/gluster/snaps/66ac70130be84b5c9695df8252a56a6d/brick2 -l
> /usr/local/var/log/glusterfs/bricks/var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.log
> --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f
> --brick-port 49155 --xlator-option
> 66ac70130be84b5c9695df8252a56a6d-server.listen-port=49155
> root 3583 3582 0 12:38 ?
> 00:00:00
> /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id
> /snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d.127.1.1.2.var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2
> -p
> /d/backends/2/glusterd/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d/run/127.1.1.2-var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.pid
> -S /var/run/668f3d4b1c55477fd5ad1ae381de0447.socket --brick-name
> /var/run/gluster/snaps/66ac70130be84b5c9695df8252a56a6d/brick2 -l
> /usr/local/var/log/glusterfs/bricks/var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.log
> --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f
> --brick-port 49155 --xlator-option
> 66ac70130be84b5c9695df8252a56a6d-server.listen-port=49155
>
> It looks like the second node tries to start the snap brick
> 1) with the wrong brickinfo and peerinfo (process 3555)
> 2) multiple times with the correct brickinfo (processes 3582, 3583)

3583 is a subprocess of 3582, so it's only one invocation.

> 3) This issue is not seen when snapshots are created and the peer probe
> is NOT done simultaneously.
>
> I will continue the investigation and keep you posted.
>
> Regards,
> Joe
>
> ----- Original Message -----
> From: "Joseph Fernandes" <josfe...@redhat.com>
> To: "Avra Sengupta" <aseng...@redhat.com>
> Cc: "Pranith Kumar Karampuri" <pkara...@redhat.com>, "Gluster Devel"
> <gluster-devel@gluster.org>, "Varun Shastry" <vshas...@redhat.com>,
> "Justin Clift" <jus...@gluster.org>
> Sent: Thursday, July 17, 2014 10:58:14 AM
> Subject: Re: [Gluster-devel] spurious regression failures again!
>
> Hi Avra,
>
> Just to clarify things here:
> 1) When testing with the setup provided by Justin, the only place where
> bug-1112559.t failed was after the failure of mgmt_v3-locks.t in the
> previous regression run. The mail attached with the previous mail was
> just an OBSERVATION and NOT an INFERENCE that the failure of
> mgmt_v3-locks.t was the root cause of the bug-1112559.t failure. I am
> NOT jumping the gun and making any statement/conclusion here; it is
> just an OBSERVATION. And thanks for the clarification on why
> mgmt_v3-locks.t is failing.
>
> 2) I agree with you that the cleanup script needs to kill all gluster*
> processes, and it is also true that the port range used by gluster for
> bricks is unique.
> But bug-1112559.t fails only because of the unavailability of a port to
> start the snap brick. This suggests that some process (gluster or
> non-gluster) is still using the port.
>
> 3) Finally, it is not true that bug-1112559.t always fails on its own:
> the links you provided include cases with previous failures of other
> test cases on the same testing machine (slave26). By this I am not
> claiming that those failures are the root cause of the bug-1112559.t
> failure; as stated earlier, it is a notable OBSERVATION (keeping in
> mind point 2 about ports and cleanup).
>
> I have run nearly 30 runs on slave30, and bug-1112559.t failed only
> once (as stated in point 1). I am continuing to run more runs. The only
> problem is that the occurrence of the bug-1112559.t failure is
> spurious, and there is no deterministic way of reproducing it.
>
> Will keep all posted about the results.
>
> Regards,
> Joe
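As an aside: catching the port holder in the act is exactly what the
netstat/ps instrumentation described at the top of this mail does. In code
it amounts to roughly the sketch below; this is an illustration only, the
function names are made up and it is not the actual
rpc/rpc-transport/socket/src/socket.c code:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    /*
     * Wrapper around bind() that dumps the machine state when the
     * address is already in use, so the offending process is seen at
     * the moment of failure instead of being inferred afterwards.
     */
    static int bind_with_diagnostics(int sock, struct sockaddr_in *addr)
    {
        if (bind(sock, (struct sockaddr *) addr, sizeof(*addr)) == 0)
            return 0;

        if (errno == EADDRINUSE) {
            fprintf(stderr, "bind to port %d failed: %s\n",
                    ntohs(addr->sin_port), strerror(errno));
            /* Same data as the netstat/ps output quoted above, but
             * captured at the exact moment of failure. */
            if (system("netstat -ntlp 2>/dev/null") != 0)
                fprintf(stderr, "netstat failed\n");
            if (system("ps -ef | grep '[g]luster'") != 0)
                fprintf(stderr, "ps/grep found nothing\n");
        }
        return -1;
    }

    int main(void)
    {
        struct sockaddr_in addr = { 0 };
        int a = socket(AF_INET, SOCK_STREAM, 0);
        int b = socket(AF_INET, SOCK_STREAM, 0);

        addr.sin_family      = AF_INET;
        addr.sin_port        = htons(49155);   /* the port from the logs */
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        bind_with_diagnostics(a, &addr);        /* succeeds */
        listen(a, 1);
        bind_with_diagnostics(b, &addr);        /* EADDRINUSE, dumps state */
        return 0;
    }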
> ----- Original Message -----
> From: "Avra Sengupta" <aseng...@redhat.com>
> To: "Joseph Fernandes" <josfe...@redhat.com>, "Pranith Kumar Karampuri"
> <pkara...@redhat.com>
> Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Varun Shastry"
> <vshas...@redhat.com>, "Justin Clift" <jus...@gluster.org>
> Sent: Wednesday, July 16, 2014 1:03:21 PM
> Subject: Re: [Gluster-devel] spurious regression failures again!
>
> Joseph,
>
> I am not sure I understand how this is affecting the spurious failure
> of bug-1112559.t. As per the mail you have attached, and according to
> your analysis, bug-1112559.t fails because a cleanup hasn't happened
> properly after a previous test case failed, and in your case there was
> a crash as well.
>
> Now, out of all the times bug-1112559.t has failed, most of the time it
> is the only test case failing and there isn't any crash. Below are the
> regression runs that Pranith had sent for the same:
>
> http://build.gluster.org/job/rackspace-regression-2GB/541/consoleFull
>
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/173/consoleFull
>
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/172/consoleFull
>
> http://build.gluster.org/job/rackspace-regression-2GB/543/console
>
> In all of the above, bug-1112559.t is the only test case that fails,
> and there is no crash.
>
> So what I fail to understand here is: if this particular test case
> fails independently as well as with other test cases, then how can we
> conclude that some other failing test case is not doing its cleanup
> properly and that this is the reason for bug-1112559.t failing?
>
> mgmt_v3-locks.t fails because glusterd takes more time to register a
> node going down, and hence the peer status doesn't return what the
> test case expects it to. It's a race. The test case ends with a cleanup
> routine, like every other test case, that kills all gluster and
> glusterfsd processes which might be using any brick ports. So could you
> please explain how, or which, process still uses the brick ports that
> the snap bricks are trying to use, leading to the failure of
> bug-1112559.t?
>
> Regards,
> Avra
>
> On 07/15/2014 09:57 PM, Joseph Fernandes wrote:
>> Just pointing out:
>>
>> 2) tests/basic/mgmt_v3-locks.t - Author: Avra
>> http://build.gluster.org/job/rackspace-regression-2GB-triggered/375/consoleFull
>>
>> This is a similar kind of error to what I saw in my testing of the
>> spurious failure of tests/bugs/bug-1112559.t.
>>
>> Please refer to the attached mail.
>>
>> Regards,
>> Joe
>>
>> ----- Original Message -----
>> From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
>> To: "Joseph Fernandes" <josfe...@redhat.com>
>> Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Varun Shastry"
>> <vshas...@redhat.com>
>> Sent: Tuesday, July 15, 2014 9:34:26 PM
>> Subject: Re: [Gluster-devel] spurious regression failures again!
>>
>> On 07/15/2014 09:24 PM, Joseph Fernandes wrote:
>>> Hi Pranith,
>>>
>>> Could you please share the links to the console output of the
>>> failures?
>> Added them inline. Thanks for reminding :-)
>>
>> Pranith
>>> Regards,
>>> Joe
>>>
>>> ----- Original Message -----
>>> From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
>>> To: "Gluster Devel" <gluster-devel@gluster.org>, "Varun Shastry"
>>> <vshas...@redhat.com>
>>> Sent: Tuesday, July 15, 2014 8:52:44 PM
>>> Subject: [Gluster-devel] spurious regression failures again!
>>>
>>> hi,
>>> We have 4 tests failing once in a while, causing problems:
>>> 1) tests/bugs/bug-1087198.t - Author: Varun
>> http://build.gluster.org/job/rackspace-regression-2GB-triggered/379/consoleFull
>>> 2) tests/basic/mgmt_v3-locks.t - Author: Avra
>> http://build.gluster.org/job/rackspace-regression-2GB-triggered/375/consoleFull
>>> 3) tests/basic/fops-sanity.t - Author: Pranith
>> http://build.gluster.org/job/rackspace-regression-2GB-triggered/383/consoleFull
>>> Please take a look at them and post updates.
>>>
>>> Pranith

/Anders

-- 
Anders Blomdell                  Email: anders.blomd...@control.lth.se
Department of Automatic Control
Lund University                  Phone: +46 46 222 4625
P.O. Box 118                     Fax:   +46 46 138118
SE-221 00 Lund, Sweden

_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel