On 2014-07-22 15:12, Joseph Fernandes wrote:
> Hi All,
>
> Further investigation found the following:
>
> 1) I was able to reproduce the issue without running the complete
> regression, just by running bug-1112559.t alone on slave30 (which had
> been rebooted and given a clean gluster setup).
> This rules out any involvement of previous failures from other spurious
> errors like mgmt_v3-locks.t.
> 2) I added some log messages, plus execution of a script (netstat and
> ps -ef | grep gluster), for when binding to a port fails (in
> rpc/rpc-transport/socket/src/socket.c) and found the following:
>
> It is always the snapshot brick on the second node (127.1.1.2) that
> fails to acquire the port (e.g. 127.1.1.2:49155).
>
> Netstat output shows:
> tcp        0      0 127.1.1.2:49155      0.0.0.0:*      LISTEN      3555/glusterfsd

Could this be the time to propose that gluster understand port
reservation a la systemd (LISTEN_FDS), and that the test harness make
sure that random ports do not collide with the set of expected ports?
That would be beneficial when starting from systemd as well.
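For reference, the systemd side of that proposal is a small protocol: the
service manager binds the sockets itself and passes them to the daemon as
inherited file descriptors, announced through the LISTEN_PID and LISTEN_FDS
environment variables (see sd_listen_fds(3)). Below is a minimal sketch of
the receiving end; it assumes nothing about gluster's actual transport
code, and the names are illustrative, not real gluster API:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* First fd passed by the service manager, per sd_listen_fds(3). */
    #define SD_LISTEN_FDS_START 3

    /*
     * Hand-rolled equivalent of sd_listen_fds(): returns the number of
     * pre-bound sockets inherited from systemd, or 0 if the daemon was
     * started normally and has to bind() its ports itself.
     */
    static int listen_fds_count(void)
    {
        const char *pid_s = getenv("LISTEN_PID");
        const char *fds_s = getenv("LISTEN_FDS");

        if (pid_s == NULL || fds_s == NULL)
            return 0;                         /* not socket-activated */
        if ((pid_t) strtol(pid_s, NULL, 10) != getpid())
            return 0;                         /* fds meant for another process */
        return (int) strtol(fds_s, NULL, 10); /* fds 3 .. 3+n-1 are ours */
    }

    int main(void)
    {
        int n = listen_fds_count();

        if (n > 0)
            printf("using %d pre-bound socket(s), fds %d..%d\n",
                   n, SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + n - 1);
        else
            printf("no inherited sockets, falling back to bind()\n");
        return 0;
    }

If glusterfsd accepted its brick port this way, the port would be reserved
by whoever spawns the process before any race window opens, instead of
being fought over at bind() time as in the failure above.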
> and the process that is holding port 49155 is
>
> root 3555 1 0 12:38 ? 00:00:00
> /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id
> patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p
> /d/backends/3/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid
> -S /var/run/ff772f1ff85950660f389b0ed43ba2b7.socket --brick-name
> /d/backends/2/patchy_snap_mnt -l
> /usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log
> --xlator-option *-posix.glusterd-uuid=3af134ec-5552-440f-ad24-1811308ca3a8
> --brick-port 49155 --xlator-option patchy-server.listen-port=49155
>
> Please note that even though it says 127.1.1.2, it shows the
> glusterd-uuid of the third node, the one that was probed when the
> snapshot was created: "3af134ec-5552-440f-ad24-1811308ca3a8"
>
> To clarify things, there is already a volume brick on 127.1.1.2:
>
> root 3446 1 0 12:38 ? 00:00:00
> /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id
> patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p
> /d/backends/2/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid
> -S /var/run/e667c69aa7a1481c7bd567b917cd1b05.socket --brick-name
> /d/backends/2/patchy_snap_mnt -l
> /usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log
> --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f
> --brick-port 49153 --xlator-option patchy-server.listen-port=49153
>
> And the above brick process (3555) is not visible before the snapshot
> creation or after the failure of the snap brick start on 127.1.1.2.
> This means that the process was spawned and died during the creation of
> the snapshot and the probe of the third node (which happen
> simultaneously).
>
> In addition to these processes, we can see multiple snap brick
> processes for the second brick on the second node, which are not seen
> after the failure to start the snap brick on 127.1.1.2:
>
> root 3582 1 0 12:38 ? 00:00:00
> /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id
> /snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d.127.1.1.2.var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2
> -p
> /d/backends/2/glusterd/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d/run/127.1.1.2-var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.pid
> -S /var/run/668f3d4b1c55477fd5ad1ae381de0447.socket --brick-name
> /var/run/gluster/snaps/66ac70130be84b5c9695df8252a56a6d/brick2 -l
> /usr/local/var/log/glusterfs/bricks/var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.log
> --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f
> --brick-port 49155 --xlator-option
> 66ac70130be84b5c9695df8252a56a6d-server.listen-port=49155
> root 3583 3582 0 12:38 ?
> 00:00:00
> /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id
> /snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d.127.1.1.2.var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2
> -p
> /d/backends/2/glusterd/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d/run/127.1.1.2-var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.pid
> -S /var/run/668f3d4b1c55477fd5ad1ae381de0447.socket --brick-name
> /var/run/gluster/snaps/66ac70130be84b5c9695df8252a56a6d/brick2 -l
> /usr/local/var/log/glusterfs/bricks/var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.log
> --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f
> --brick-port 49155 --xlator-option
> 66ac70130be84b5c9695df8252a56a6d-server.listen-port=49155
>
> It looks like the second node tries to start the snap brick
> 1) with the wrong brickinfo and peerinfo (process 3555)
> 2) multiple times with the correct brickinfo (processes 3582, 3583)

3583 is a subprocess of 3582, so it's only one invocation.

> 3) This issue is not seen when snapshots are created and the peer probe
> is NOT done simultaneously.
>
> I will continue the investigation and keep you posted.
>
> Regards,
> Joe
>
> ----- Original Message -----
> From: "Joseph Fernandes" <josfe...@redhat.com>
> To: "Avra Sengupta" <aseng...@redhat.com>
> Cc: "Pranith Kumar Karampuri" <pkara...@redhat.com>, "Gluster Devel"
> <gluster-devel@gluster.org>, "Varun Shastry" <vshas...@redhat.com>,
> "Justin Clift" <jus...@gluster.org>
> Sent: Thursday, July 17, 2014 10:58:14 AM
> Subject: Re: [Gluster-devel] spurious regression failures again!
>
> Hi Avra,
>
> Just to clarify things here:
> 1) When testing with the setup provided by Justin, the only place where
> bug-1112559.t failed was after the failure of mgmt_v3-locks.t in the
> previous regression run. The mail attached with the previous mail was
> just an OBSERVATION and NOT an INFERENCE that the failure of
> mgmt_v3-locks.t was the root cause of the bug-1112559.t failure. I am
> NOT jumping the gun and making any statement/conclusion here; it is
> just an OBSERVATION. And thanks for the clarification on why
> mgmt_v3-locks.t is failing.
>
> 2) I agree with you that the cleanup script needs to kill all gluster*
> processes, and it is also true that the port range used by gluster for
> bricks is unique.
> But bug-1112559.t fails only because of the unavailability of a port to
> start the snap brick. This suggests that some process (gluster or
> non-gluster) is still using the port.
>
> 3) Finally, it is not true that bug-1112559.t always fails on its own:
> the links you provided include cases with previous failures of other
> test cases on the same testing machine (slave26). By this I am not
> claiming that those failures are the root cause of the bug-1112559.t
> failure; as stated earlier, it is a notable OBSERVATION (keeping in
> mind point 2 about ports and cleanup).
>
> I have run nearly 30 runs on slave30, and bug-1112559.t failed only
> once (as stated in point 1). I am continuing to run more runs. The only
> problem is that the occurrence of the bug-1112559.t failure is
> spurious, and there is no deterministic way of reproducing it.
>
> Will keep all posted about the results.
>
> Regards,
> Joe
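As an aside: catching the port holder in the act is exactly what the
netstat/ps instrumentation described at the top of this mail does. In code
it amounts to roughly the sketch below; this is an illustration only, the
function names are made up and it is not the actual
rpc/rpc-transport/socket/src/socket.c code:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    /*
     * Wrapper around bind() that dumps the machine state when the
     * address is already in use, so the offending process is seen at
     * the moment of failure instead of being inferred afterwards.
     */
    static int bind_with_diagnostics(int sock, struct sockaddr_in *addr)
    {
        if (bind(sock, (struct sockaddr *) addr, sizeof(*addr)) == 0)
            return 0;

        if (errno == EADDRINUSE) {
            fprintf(stderr, "bind to port %d failed: %s\n",
                    ntohs(addr->sin_port), strerror(errno));
            /* Same data as the netstat/ps output quoted above, but
             * captured at the exact moment of failure. */
            if (system("netstat -ntlp 2>/dev/null") != 0)
                fprintf(stderr, "netstat failed\n");
            if (system("ps -ef | grep '[g]luster'") != 0)
                fprintf(stderr, "ps/grep found nothing\n");
        }
        return -1;
    }

    int main(void)
    {
        struct sockaddr_in addr = { 0 };
        int a = socket(AF_INET, SOCK_STREAM, 0);
        int b = socket(AF_INET, SOCK_STREAM, 0);

        addr.sin_family      = AF_INET;
        addr.sin_port        = htons(49155);   /* the port from the logs */
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        bind_with_diagnostics(a, &addr);        /* succeeds */
        listen(a, 1);
        bind_with_diagnostics(b, &addr);        /* EADDRINUSE, dumps state */
        return 0;
    }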
> ----- Original Message -----
> From: "Avra Sengupta" <aseng...@redhat.com>
> To: "Joseph Fernandes" <josfe...@redhat.com>, "Pranith Kumar Karampuri"
> <pkara...@redhat.com>
> Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Varun Shastry"
> <vshas...@redhat.com>, "Justin Clift" <jus...@gluster.org>
> Sent: Wednesday, July 16, 2014 1:03:21 PM
> Subject: Re: [Gluster-devel] spurious regression failures again!
>
> Joseph,
>
> I am not sure I understand how this is affecting the spurious failure
> of bug-1112559.t. As per the mail you have attached, and according to
> your analysis, bug-1112559.t fails because a cleanup hasn't happened
> properly after a previous test case failed, and in your case there was
> a crash as well.
>
> Now, out of all the times bug-1112559.t has failed, most of the time it
> is the only test case failing and there isn't any crash. Below are the
> regression runs that Pranith had sent for the same:
>
> http://build.gluster.org/job/rackspace-regression-2GB/541/consoleFull
>
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/173/consoleFull
>
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/172/consoleFull
>
> http://build.gluster.org/job/rackspace-regression-2GB/543/console
>
> In all of the above, bug-1112559.t is the only test case that fails,
> and there is no crash.
>
> So what I fail to understand here is: if this particular test case
> fails independently as well as with other test cases, then how can we
> conclude that some other failing test case is not doing its cleanup
> properly and that this is the reason for bug-1112559.t failing?
>
> mgmt_v3-locks.t fails because glusterd takes more time to register a
> node going down, and hence the peer status doesn't return what the
> test case expects it to. It's a race. The test case ends with a cleanup
> routine, like every other test case, that kills all gluster and
> glusterfsd processes which might be using any brick ports. So could you
> please explain how, or which, process still uses the brick ports that
> the snap bricks are trying to use, leading to the failure of
> bug-1112559.t?
>
> Regards,
> Avra
>
> On 07/15/2014 09:57 PM, Joseph Fernandes wrote:
>> Just pointing out:
>>
>> 2) tests/basic/mgmt_v3-locks.t - Author: Avra
>> http://build.gluster.org/job/rackspace-regression-2GB-triggered/375/consoleFull
>>
>> This is a similar kind of error to what I saw in my testing of the
>> spurious failure of tests/bugs/bug-1112559.t.
>>
>> Please refer to the attached mail.
>>
>> Regards,
>> Joe
>>
>> ----- Original Message -----
>> From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
>> To: "Joseph Fernandes" <josfe...@redhat.com>
>> Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Varun Shastry"
>> <vshas...@redhat.com>
>> Sent: Tuesday, July 15, 2014 9:34:26 PM
>> Subject: Re: [Gluster-devel] spurious regression failures again!
>>
>> On 07/15/2014 09:24 PM, Joseph Fernandes wrote:
>>> Hi Pranith,
>>>
>>> Could you please share the links to the console output of the
>>> failures?
>> Added them inline. Thanks for reminding :-)
>>
>> Pranith
>>> Regards,
>>> Joe
>>>
>>> ----- Original Message -----
>>> From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
>>> To: "Gluster Devel" <gluster-devel@gluster.org>, "Varun Shastry"
>>> <vshas...@redhat.com>
>>> Sent: Tuesday, July 15, 2014 8:52:44 PM
>>> Subject: [Gluster-devel] spurious regression failures again!
>>>
>>> hi,
>>> We have 4 tests failing once in a while, causing problems:
>>> 1) tests/bugs/bug-1087198.t - Author: Varun
>> http://build.gluster.org/job/rackspace-regression-2GB-triggered/379/consoleFull
>>> 2) tests/basic/mgmt_v3-locks.t - Author: Avra
>> http://build.gluster.org/job/rackspace-regression-2GB-triggered/375/consoleFull
>>> 3) tests/basic/fops-sanity.t - Author: Pranith
>> http://build.gluster.org/job/rackspace-regression-2GB-triggered/383/consoleFull
>>> Please take a look at them and post updates.
>>>
>>> Pranith

/Anders

-- 
Anders Blomdell                  Email: anders.blomd...@control.lth.se
Department of Automatic Control
Lund University                  Phone: +46 46 222 4625
P.O. Box 118                     Fax:   +46 46 138118
SE-221 00 Lund, Sweden

_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel