Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]
> 2) N3 tries to start the brick B2. Now the problem lies here. N3 uses
> glusterd_resolve_brick() to resolve the UUID of B2->hostname (N2). In
> glusterd_resolve_brick(), it cannot find N2 in the peerinfo list. It then
> checks whether N2 is a local loopback address. Since N2 (127.1.1.2) starts
> with "127", it decides that it is a local loopback address, and
> glusterd_resolve_brick() fills brickinfo->uuid with [UUID3]. Now, as
> brickinfo->uuid == MY_UUID is true, N3 starts the brick process B2 with
> -s 127.1.1.2 and *-posix.glusterd-uuid=[UUID3]. This process dies
> immediately, but for a short time it holds on to the --brick-port, say for
> example 49155.

This is the part that seems "off" to me. If an address doesn't *exactly* match one on some local interface, it's not local. When we implemented the cluster.rc infrastructure so that we could simulate multi-node testing, we had to root out a bunch of assumptions like this, but apparently some crept back in. If we just fixed the "127.* == local" mistake, would that be adequate to prevent these errors?

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]
On 2014-07-24 08:21, Joseph Fernandes wrote:
> Hi All,
>
> After further investigation we have the root cause for this issue.
> The root cause is the way in which a new node is added to the cluster.
>
> Now we have N1 (127.1.1.1) and N2 (127.1.1.2) as two nodes in the cluster,
> each having a brick: N1:B1 (127.1.1.1:49146) and N2:B2 (127.1.1.2:49147).
>
> Now let's peer-probe N3 (127.1.1.3) from N1:
>
> 1) A friend request is sent from N1 to N3. N3 adds N1 to the peerinfo
>    list, i.e. N1 and its uuid, say [UUID1].
> 2) N3 gets the brick info from N1.
> 3) N3 tries to start the bricks:
>    1) N3 tries to start the brick B1 and finds it's not a local brick,
>       using the logic MY_UUID == brickinfo->uuid, which is false in this
>       case, as the UUID of brickinfo->hostname (N1) is [UUID1] (as
>       suggested by the peerinfo list) and MY_UUID is [UUID3]. Hence it
>       doesn't start it.
>    2) N3 tries to start the brick B2. Now the problem lies here. N3 uses
>       glusterd_resolve_brick() to resolve the UUID of B2->hostname (N2).
>       In glusterd_resolve_brick(), it cannot find N2 in the peerinfo
>       list. It then checks whether N2 is a local loopback address. Since
>       N2 (127.1.1.2) starts with "127", it decides that it is a local
>       loopback address, and glusterd_resolve_brick() fills
>       brickinfo->uuid with [UUID3]. Now, as brickinfo->uuid == MY_UUID is
>       true, N3 starts the brick process B2 with -s 127.1.1.2 and
>       *-posix.glusterd-uuid=[UUID3]. This process dies immediately, but
>       for a short time it holds on to the --brick-port, say 49155.
>
> All the above is observed and inferred from glusterd logs on N3 (by
> adding some extra debug messages).
>
> Now coming back to our test case, i.e. firing snapshot create and peer
> probe together: if N2 has been assigned 49155 as the --brick-port for the
> snapshot brick, it finds that 49155 is already held by some other process
> (the faulty brick process N3:B2 (127.1.1.2:49155), which has -s 127.1.1.2
> and *-posix.glusterd-uuid=[UUID3]), and hence fails to start the snapshot
> brick process.
>
> 1) The error is spurious, as it's all about chance whether N2 and N3 use
>    the same port for their brick processes.
> 2) This issue is possible only in a regression-test scenario, as all the
>    nodes are on the same machine, differentiated only by different
>    loopback addresses (127.1.1.*).
> 3) The logic that "127" means a local loopback address is also not wrong
>    as such, since glusterds are supposed to run on different machines in
>    real usage.
>
> Please do share your thoughts on this, and what would be a possible fix.

Possible solutions (many/all of them probably break important assumptions):

* Use some alias address range instead of 127.*.*.* for testing purposes
* Stop treating localhost as special
* Adopt the systemd LISTEN_FDS approach and have a special program that
  tries to bind to ports and then hands the port over to the proper daemon

/Anders

--
Anders Blomdell                  Email: anders.blomd...@control.lth.se
Department of Automatic Control
Lund University                  Phone: +46 46 222 4625
P.O. Box 118                     Fax:   +46 46 138118
SE-221 00 Lund, Sweden
Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]
On Thu, 24 Jul 2014 02:21:37 -0400 (EDT) Joseph Fernandes wrote:
> Please do share your thoughts on this, and what would be a possible fix.

Any idea if there is a cross-platform way to check whether a port is already in use? If so, that sounds like the first thing to add, along with some way to retry with a different port number. :)

+ Justin
Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]
Hi All,

After further investigation we have the root cause for this issue. The root cause is the way in which a new node is added to the cluster.

Now we have N1 (127.1.1.1) and N2 (127.1.1.2) as two nodes in the cluster, each having a brick: N1:B1 (127.1.1.1:49146) and N2:B2 (127.1.1.2:49147).

Now let's peer-probe N3 (127.1.1.3) from N1:

1) A friend request is sent from N1 to N3. N3 adds N1 to the peerinfo list, i.e. N1 and its uuid, say [UUID1].
2) N3 gets the brick info from N1.
3) N3 tries to start the bricks:
   1) N3 tries to start the brick B1 and finds it's not a local brick, using the logic MY_UUID == brickinfo->uuid, which is false in this case, as the UUID of brickinfo->hostname (N1) is [UUID1] (as suggested by the peerinfo list) and MY_UUID is [UUID3]. Hence it doesn't start it.
   2) N3 tries to start the brick B2. Now the problem lies here. N3 uses glusterd_resolve_brick() to resolve the UUID of B2->hostname (N2). In glusterd_resolve_brick(), it cannot find N2 in the peerinfo list. It then checks whether N2 is a local loopback address. Since N2 (127.1.1.2) starts with "127", it decides that it is a local loopback address, and glusterd_resolve_brick() fills brickinfo->uuid with [UUID3]. Now, as brickinfo->uuid == MY_UUID is true, N3 starts the brick process B2 with -s 127.1.1.2 and *-posix.glusterd-uuid=[UUID3]. This process dies immediately, but for a short time it holds on to the --brick-port, say 49155.

All the above is observed and inferred from glusterd logs on N3 (by adding some extra debug messages).

Now coming back to our test case, i.e. firing snapshot create and peer probe together: if N2 has been assigned 49155 as the --brick-port for the snapshot brick, it finds that 49155 is already held by some other process (the faulty brick process N3:B2 (127.1.1.2:49155), which has -s 127.1.1.2 and *-posix.glusterd-uuid=[UUID3]), and hence fails to start the snapshot brick process.

1) The error is spurious, as it's all about chance whether N2 and N3 use the same port for their brick processes.
2) This issue is possible only in a regression-test scenario, as all the nodes are on the same machine, differentiated only by different loopback addresses (127.1.1.*).
3) The logic that "127" means a local loopback address is also not wrong as such, since glusterds are supposed to run on different machines in real usage.

Please do share your thoughts on this, and what would be a possible fix.

Regards,
Joe

- Original Message -
From: "Joseph Fernandes"
To: "Avra Sengupta" , "Gluster Devel"
Sent: Tuesday, July 22, 2014 6:42:02 PM
Subject: Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]

Hi All,

With further investigation I found the following:

1) I was able to reproduce the issue without running the complete regression, just by running bug-1112559.t alone on slave30 (which had been rebooted and given a clean gluster setup). This rules out any involvement of previous failures from other spurious errors like mgmt_v3-locks.t.

2) I added some messages and a script (netstat and ps -ef | grep gluster) run when binding to a port fails (in rpc/rpc-transport/socket/src/socket.c) and found the following. It is always the snapshot brick on the second node (127.1.1.2) that fails to acquire the port (e.g. 127.1.1.2:49155).

Netstat output shows:

tcp 0 0 127.1.1.2:49155 0.0.0.0:* LISTEN 3555/glusterfsd

and the process that is holding port 49155 is

root 3555 1 0 12:38 ? 00:00:00 /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p /d/backends/3/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid -S /var/run/ff772f1ff85950660f389b0ed43ba2b7.socket --brick-name /d/backends/2/patchy_snap_mnt -l /usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log --xlator-option *-posix.glusterd-uuid=3af134ec-5552-440f-ad24-1811308ca3a8 --brick-port 49155 --xlator-option patchy-server.listen-port=49155

Please note that even though it says 127.1.1.2, it shows the glusterd-uuid of the third node, which was being probed when the snapshot was created: "3af134ec-5552-440f-ad24-1811308ca3a8".

To clarify: there is already a volume brick on 127.1.1.2:

root 3446 1 0 12:38 ? 00:00:00 /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p /d/backends/2/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid -S /var/run/e667c69aa7a1481c7bd567b917cd1b05.socket --brick-name /d/backends/2/patchy_snap_mnt -l /usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log --xlator-option *-posix.glusterd-uuid=a7f4
Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]
tin Clift"
Sent: Thursday, July 17, 2014 10:58:14 AM
Subject: Re: [Gluster-devel] spurious regression failures again!

Hi Avra,

Just clarifying things here:

1) When testing with the setup provided by Justin, the only place bug-1112559.t failed was after the failure of mgmt_v3-locks.t in the previous regression run. The mail attached with the previous mail was just an OBSERVATION and NOT an INFERENCE that the failure of mgmt_v3-locks.t was the root cause of the bug-1112559.t failure. I am NOT jumping the gun and making any statement/conclusion here. It's just an OBSERVATION. And thanks for the clarification on why mgmt_v3-locks.t is failing.

2) I agree with you that the cleanup script needs to kill all gluster* processes. And it's also true that the port range used by gluster for bricks is unique. But bug-1112559.t fails only because of the unavailability of a port to start the snap brick. This suggests that there is some process (gluster or non-gluster) still using the port.

3) And finally, that bug-1112559.t fails individually all the time is not true: looking into the links you provided, there are cases with previous other test-case failures on the same testing machine (slave26). By this I am not pointing out that those failures are the root cause of the bug-1112559.t failure. As stated earlier, it's a notable OBSERVATION (keeping in mind point 2 about ports and cleanup).

I have run nearly 30 runs on slave30 and bug-1112559.t failed only once (as stated in point 1). I am continuing to run more runs. The only problem is that the occurrence of the bug-1112559.t failure is spurious and there is no deterministic way of reproducing it.

Will keep all posted about the results.

Regards,
Joe

- Original Message -
From: "Avra Sengupta"
To: "Joseph Fernandes" , "Pranith Kumar Karampuri"
Cc: "Gluster Devel" , "Varun Shastry" , "Justin Clift"
Sent: Wednesday, July 16, 2014 1:03:21 PM
Subject: Re: [Gluster-devel] spurious regression failures again!

Joseph,

I am not sure I understand how this is affecting the spurious failure of bug-1112559.t. As per the mail you have attached, and according to your analysis, bug-1112559.t fails because a cleanup hasn't happened properly after a previous test case failed, and in your case there was a crash as well.

Now, out of all the times bug-1112559.t has failed, most of the time it's the only test case failing and there isn't any crash. Below are the regression runs that Pranith had sent for the same:

http://build.gluster.org/job/rackspace-regression-2GB/541/consoleFull
http://build.gluster.org/job/rackspace-regression-2GB-triggered/173/consoleFull
http://build.gluster.org/job/rackspace-regression-2GB-triggered/172/consoleFull
http://build.gluster.org/job/rackspace-regression-2GB/543/console

In all of the above, bug-1112559.t is the only test case that fails, and there is no crash. So what I fail to understand here is: if this particular test case fails independently as well as with other test cases, how can we conclude that some other test case failing is somehow not doing a cleanup properly, and that that is the reason for bug-1112559.t failing?

mgmt_v3-locks.t fails because glusterd takes more time to register a node going down, and hence peer status doesn't return what the test case expects. It's a race. The test case ends with a cleanup routine like every other test case, which kills all gluster and glusterfsd processes that might be using any brick ports. So could you please explain how, or which, process still uses the brick ports that the snap bricks are trying to use, leading to the failure of bug-1112559.t.

Regards,
Avra

On 07/15/2014 09:57 PM, Joseph Fernandes wrote:

Just pointing out,

2) tests/basic/mgmt_v3-locks.t - Author: Avra
http://build.gluster.org/job/rackspace-regression-2GB-triggered/375/consoleFull

This is the similar kind of error I saw in my testing of the spurious failure of tests/bugs/bug-1112559.t.

Please refer the attached mail.

Regards,
Joe

- Original Message -
From: "Pranith Kumar Karampuri"
To: "Joseph Fernandes"
Cc: "Gluster Devel" , "Varun Shastry"
Sent: Tuesday, July 15, 2014 9:34:26 PM
Subject: Re: [Gluster-devel] spurious regression failures again!

On 07/15/2014 09:24 PM, Joseph Fernandes wrote:
> Hi Pranith,
>
> Could you please share the link of the console output of the failures.

Added them inline. Thanks for reminding :-)

Pranith

> Regards,
> Joe
>
> - Original Message -
> From: "Pranith Kumar Karampuri"
> To: "Gluster Devel" , "Varun Shastry"
> Sent: Tuesday, July 15, 2014 8:52:44 PM
> Subject: [Gluster-devel] spurious regression failures again!
>
> hi,
> We hav
Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]
On 2014-07-22 16:44, Justin Clift wrote:
> On 22/07/2014, at 3:28 PM, Joe Julian wrote:
>> On 07/22/2014 07:19 AM, Anders Blomdell wrote:
>>> Could this be a time to propose that gluster understands port
>>> reservation a'la systemd (LISTEN_FDS), and make the test harness make
>>> sure that random ports do not collide with the set of expected ports,
>>> which will be beneficial when starting from systemd as well.
>> Wouldn't that only work for Fedora and RHEL7?
>
> Probably depends how it's done. Maybe make it a conditional
> thing that's compiled in or not, depending on the platform?

Don't think so; LISTEN_FDS is dead simple. If LISTEN_FDS is set in the environment, fd#3 to fd#(3+LISTEN_FDS-1) are sockets opened by the calling process, their function has to be deduced via getsockname(), and sockets should not be opened by the process itself. If LISTEN_FDS is not set, proceed to open sockets just like before.

The good thing about this is that systemd can reserve the ports used very early during boot, and no other process can steal them away. For testing purposes, this could be used to ensure that all ports are available before starting tests (if random port stealing is the true problem here, that is still an unverified shot in the dark).

> Unless there's a better, cross platform approach of course. :)
>
> Regards and best wishes,
>
> Justin Clift

/Anders
Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]
On 22/07/2014, at 3:28 PM, Joe Julian wrote:
> On 07/22/2014 07:19 AM, Anders Blomdell wrote:
>> Could this be a time to propose that gluster understands port
>> reservation a'la systemd (LISTEN_FDS), and make the test harness make
>> sure that random ports do not collide with the set of expected ports,
>> which will be beneficial when starting from systemd as well.
> Wouldn't that only work for Fedora and RHEL7?

Probably depends how it's done. Maybe make it a conditional thing that's compiled in or not, depending on the platform?

Unless there's a better, cross platform approach of course. :)

Regards and best wishes,

Justin Clift

--
GlusterFS - http://www.gluster.org
An open source, distributed file system scaling to several petabytes,
and handling thousands of clients.
My personal twitter: twitter.com/realjustinclift
Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]
On 07/22/2014 07:19 AM, Anders Blomdell wrote:
> Could this be a time to propose that gluster understands port
> reservation a'la systemd (LISTEN_FDS), and make the test harness make
> sure that random ports do not collide with the set of expected ports,
> which will be beneficial when starting from systemd as well.

Wouldn't that only work for Fedora and RHEL7?
Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]
0130be84b5c9695df8252a56a6d-server.listen-port=49155
>
> This looks like the second node tries to start the snap brick
> 1) with wrong brickinfo and peerinfo (process 3555)
> 2) Multiple times with the correct brickinfo (process 3582, 3583)

3583 is a subprocess of 3582, so it's only one invocation.

> 3) This issue is not seen when snapshots are created and peer probe is
> NOT done simultaneously.
>
> Will continue on the investigation and will keep you posted.
>
> Regards,
> Joe
>
> - Original Message -
> From: "Joseph Fernandes"
> To: "Avra Sengupta"
> Cc: "Pranith Kumar Karampuri" , "Gluster Devel" ,
> "Varun Shastry" , "Justin Clift"
> Sent: Thursday, July 17, 2014 10:58:14 AM
> Subject: Re: [Gluster-devel] spurious regression failures again!
>
> Hi Avra,
>
> Just clarifying things here,
>
> 1) When testing with the setup provided by Justin, I found the only
> place where bug-1112559.t failed was after the failure of
> mgmt_v3-locks.t in the previous regression run. The mail attached with
> the previous mail was just an OBSERVATION and NOT an INFERENCE that
> failure of mgmt_v3-locks.t was the root cause of bug-1112559.t. I am
> NOT jumping the gun and making any statement/conclusion here. It's just
> an OBSERVATION. And thanks for the clarification on why mgmt_v3-locks.t
> is failing.
>
> 2) I agree with you that the cleanup script needs to kill all gluster*
> processes. And it's also true that the port range used by gluster for
> bricks is unique. But bug-1112559.t fails only because of the
> unavailability of a port to start the snap brick. Therefore this
> suggests that there is some process (gluster or non-gluster) still
> using the port.
>
> 3) And finally, that bug-1112559.t fails individually all the time is
> not true, as when looking into the links which you have provided there
> are cases where there are previous other test-case failures on the same
> testing machine (slave26).
By this I am not pointing out that those failure are the > root cause for bug-1112559.t to fail > As stated earlier its a notable OBSERVATION(Keeping in mind point 2 about > ports and cleanup) > > I have run nearly 30 runs on slave30 and only one time bug-1112559.t failed > (As stated in point 1). I am continuing to run more runs. The only problem is > the occurrence of bug-1112559.t failure is spurious and there is no > deterministic way of reproducing it. > > Will keep all posted about the results. > > Regards, > Joe > > > > - Original Message - > From: "Avra Sengupta" > To: "Joseph Fernandes" , "Pranith Kumar Karampuri" > > Cc: "Gluster Devel" , "Varun Shastry" > , "Justin Clift" > Sent: Wednesday, July 16, 2014 1:03:21 PM > Subject: Re: [Gluster-devel] spurious regression failures again! > > Joseph, > > I am not sure I understand how this is affecting the spurious failure of > bug-1112559.t. As per the mail you have attached, and according to your > analysis, bug-1112559.t fails because a cleanup hasn't happened > properly after a previous test-case failed and in your case there was a > crash as well. > > Now out of all the times bug-1112559.t has failed, most of the time it's > the only test case failing and there isn't any crash. Below are the > regression runs that pranith had sent for the same. > > http://build.gluster.org/job/rackspace-regression-2GB/541/consoleFull > > http://build.gluster.org/job/rackspace-regression-2GB-triggered/173/consoleFull > > http://build.gluster.org/job/rackspace-regression-2GB-triggered/172/consoleFull > > http://build.gluster.org/job/rackspace-regression-2GB/543/console > > In all of the above bug-1112559.t is the only test case that fails and > there is no crash. 
> > So what I fail to understand here is, if this particular testcase fails > independently as well as with other testcases, then how can we conclude > that any other testcase failing is somehow not doing a cleanup properly > and that is the reason for bug-1112559.t failing. > > mgmt_v3-locks.t fails because glusterd takes more time to register a > node going down, and hence the peer status doesn't return what the > testcase expects it to. It's a race. The testcase ends with a cleanup > routine like every other testcase, that kills all gluster and glusterfsd > processes, which might be using any brick ports. So could you please > explain how or which process still uses the brick ports that the snap > bricks are trying to use leading to the failure of bug-1112
Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]
Hi All,

With further investigation I found the following:

1) I was able to reproduce the issue without running the complete regression, just by running bug-1112559.t alone on slave30 (which had been rebooted and given a clean gluster setup). This rules out any involvement of previous failures from other spurious errors like mgmt_v3-locks.t.

2) I added some messages and a script (netstat and ps -ef | grep gluster) run when binding to a port fails (in rpc/rpc-transport/socket/src/socket.c) and found the following. It is always the snapshot brick on the second node (127.1.1.2) that fails to acquire the port (e.g. 127.1.1.2:49155).

Netstat output shows:

tcp 0 0 127.1.1.2:49155 0.0.0.0:* LISTEN 3555/glusterfsd

and the process that is holding port 49155 is

root 3555 1 0 12:38 ? 00:00:00 /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p /d/backends/3/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid -S /var/run/ff772f1ff85950660f389b0ed43ba2b7.socket --brick-name /d/backends/2/patchy_snap_mnt -l /usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log --xlator-option *-posix.glusterd-uuid=3af134ec-5552-440f-ad24-1811308ca3a8 --brick-port 49155 --xlator-option patchy-server.listen-port=49155

Please note that even though it says 127.1.1.2, it shows the glusterd-uuid of the third node, which was being probed when the snapshot was created: "3af134ec-5552-440f-ad24-1811308ca3a8".

To clarify: there is already a volume brick on 127.1.1.2:

root 3446 1 0 12:38 ? 00:00:00 /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p /d/backends/2/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid -S /var/run/e667c69aa7a1481c7bd567b917cd1b05.socket --brick-name /d/backends/2/patchy_snap_mnt -l /usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f --brick-port 49153 --xlator-option patchy-server.listen-port=49153

The above brick process (3555) is not visible before the snap creation or after the failure of the snap-brick start on 127.1.1.2. This means the process was spawned and died during the creation of the snapshot and the probe of the third node (which happen simultaneously).

In addition to these processes, we can see multiple snap-brick processes for the second brick on the second node, which are not seen after the failure to start the snap brick on 127.1.1.2:

root 3582 1 0 12:38 ? 00:00:00 /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id /snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d.127.1.1.2.var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2 -p /d/backends/2/glusterd/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d/run/127.1.1.2-var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.pid -S /var/run/668f3d4b1c55477fd5ad1ae381de0447.socket --brick-name /var/run/gluster/snaps/66ac70130be84b5c9695df8252a56a6d/brick2 -l /usr/local/var/log/glusterfs/bricks/var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.log --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f --brick-port 49155 --xlator-option 66ac70130be84b5c9695df8252a56a6d-server.listen-port=49155

root 3583 3582 0 12:38 ? 00:00:00 /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id /snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d.127.1.1.2.var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2 -p /d/backends/2/glusterd/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d/run/127.1.1.2-var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.pid -S /var/run/668f3d4b1c55477fd5ad1ae381de0447.socket --brick-name /var/run/gluster/snaps/66ac70130be84b5c9695df8252a56a6d/brick2 -l /usr/local/var/log/glusterfs/bricks/var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.log --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f --brick-port 49155 --xlator-option 66ac70130be84b5c9695df8252a56a6d-server.listen-port=49155

This looks like the second node tries to start the snap brick
1) with wrong brickinfo and peerinfo (process 3555)
2) multiple times with the correct brickinfo (processes 3582, 3583)
3) This issue is not seen when snapshots are created and peer probe is NOT done simultaneously.

Will continue on the investigation and will keep you posted.

Regards,
Joe

- Original Message -
From: "Joseph Fernandes"
To: "Avra Sengupta"
Cc: "Pranith Kumar Karampuri" , "Gluster Devel" , "Varun Shastry" , "Justin Clift"
Sent: Thursday, July 17, 2014 10:58:14 AM
Subject: Re: [Gluster-devel] spuriou
Re: [Gluster-devel] spurious regression failures again!
Hi,

Created a bug against the same. Please use it to submit if required.

https://bugzilla.redhat.com/show_bug.cgi?id=1121014

Thanks
Varun Shastry

On Tuesday 15 July 2014 09:34 PM, Pranith Kumar Karampuri wrote:

On 07/15/2014 09:24 PM, Joseph Fernandes wrote:
> Hi Pranith,
>
> Could you please share the link of the console output of the failures.

Added them inline. Thanks for reminding :-)

Pranith

> Regards,
> Joe
>
> - Original Message -
> From: "Pranith Kumar Karampuri"
> To: "Gluster Devel" , "Varun Shastry"
> Sent: Tuesday, July 15, 2014 8:52:44 PM
> Subject: [Gluster-devel] spurious regression failures again!
>
> hi,
> We have 4 tests failing once in a while causing problems:
> 1) tests/bugs/bug-1087198.t - Author: Varun
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/379/consoleFull
> 2) tests/basic/mgmt_v3-locks.t - Author: Avra
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/375/consoleFull
> 3) tests/basic/fops-sanity.t - Author: Pranith
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/383/consoleFull
> Please take a look at them and post updates.
>
> Pranith
Re: [Gluster-devel] spurious regression failures again!
Hi Avra,

Just clarifying things here:

1) When testing with the setup provided by Justin, the only place bug-1112559.t failed was after the failure of mgmt_v3-locks.t in the previous regression run. The mail attached with the previous mail was just an OBSERVATION and NOT an INFERENCE that the failure of mgmt_v3-locks.t was the root cause of the bug-1112559.t failure. I am NOT jumping the gun and making any statement/conclusion here. It's just an OBSERVATION. And thanks for the clarification on why mgmt_v3-locks.t is failing.

2) I agree with you that the cleanup script needs to kill all gluster* processes. And it's also true that the port range used by gluster for bricks is unique. But bug-1112559.t fails only because of the unavailability of a port to start the snap brick. This suggests that there is some process (gluster or non-gluster) still using the port.

3) And finally, that bug-1112559.t fails individually all the time is not true: looking into the links you provided, there are cases with previous other test-case failures on the same testing machine (slave26). By this I am not pointing out that those failures are the root cause of the bug-1112559.t failure. As stated earlier, it's a notable OBSERVATION (keeping in mind point 2 about ports and cleanup).

I have run nearly 30 runs on slave30 and bug-1112559.t failed only once (as stated in point 1). I am continuing to run more runs. The only problem is that the occurrence of the bug-1112559.t failure is spurious and there is no deterministic way of reproducing it.

Will keep all posted about the results.

Regards,
Joe

- Original Message -
From: "Avra Sengupta"
To: "Joseph Fernandes" , "Pranith Kumar Karampuri"
Cc: "Gluster Devel" , "Varun Shastry" , "Justin Clift"
Sent: Wednesday, July 16, 2014 1:03:21 PM
Subject: Re: [Gluster-devel] spurious regression failures again!

Joseph,

I am not sure I understand how this is affecting the spurious failure of bug-1112559.t. As per the mail you have attached, and according to your analysis, bug-1112559.t fails because a cleanup hasn't happened properly after a previous test case failed, and in your case there was a crash as well.

Now, out of all the times bug-1112559.t has failed, most of the time it's the only test case failing and there isn't any crash. Below are the regression runs that Pranith had sent for the same:

http://build.gluster.org/job/rackspace-regression-2GB/541/consoleFull
http://build.gluster.org/job/rackspace-regression-2GB-triggered/173/consoleFull
http://build.gluster.org/job/rackspace-regression-2GB-triggered/172/consoleFull
http://build.gluster.org/job/rackspace-regression-2GB/543/console

In all of the above, bug-1112559.t is the only test case that fails, and there is no crash. So what I fail to understand here is: if this particular test case fails independently as well as with other test cases, how can we conclude that some other test case failing is somehow not doing a cleanup properly, and that that is the reason for bug-1112559.t failing?

mgmt_v3-locks.t fails because glusterd takes more time to register a node going down, and hence peer status doesn't return what the test case expects. It's a race. The test case ends with a cleanup routine like every other test case, which kills all gluster and glusterfsd processes that might be using any brick ports. So could you please explain how, or which, process still uses the brick ports that the snap bricks are trying to use, leading to the failure of bug-1112559.t.

Regards,
Avra

On 07/15/2014 09:57 PM, Joseph Fernandes wrote:

Just pointing out,

2) tests/basic/mgmt_v3-locks.t - Author: Avra
http://build.gluster.org/job/rackspace-regression-2GB-triggered/375/consoleFull

This is the similar kind of error I saw in my testing of the spurious failure of tests/bugs/bug-1112559.t.

Please refer the attached mail.
>
> Regards,
> Joe
>
> - Original Message -
> From: "Pranith Kumar Karampuri"
> To: "Joseph Fernandes"
> Cc: "Gluster Devel" , "Varun Shastry"
> Sent: Tuesday, July 15, 2014 9:34:26 PM
> Subject: Re: [Gluster-devel] spurious regression failures again!
>
> On 07/15/2014 09:24 PM, Joseph Fernandes wrote:
>> Hi Pranith,
>>
>> Could you please share the link of the console output of the failures.
> Added them inline. Thanks for reminding :-)
>
> Pranith
>> Regards,
>> Joe
>>
>> - Original Message -
>> From: "Pranith Kumar Karampuri"
>> To: "Gluster Devel" , "Varun Shastry"
>> Sent: Tuesday, July 15, 2014 8:52:44 PM
>> Subject: [Gluster-devel] spurious regression failures again!
>>
>> hi,
>>    We hav
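[Editor's note: the "which process still uses the brick ports" question in the exchange above can be sketched as a small diagnostic. This is an illustration, not a script from the thread: `parse_holder` filters `ss -ltnp`-style output for a listener on a given port, and the sample line (process name, pid 20215, port 49155) is fabricated from numbers mentioned in the thread.]

```shell
#!/bin/sh
# Sketch: report which process, if any, still holds a given TCP port.
# parse_holder reads `ss -ltnp`-style output on stdin; column 4 is the
# local address:port, and the last column names the owning process.
parse_holder() {
    awk -v p=":$1" '$4 ~ p"$" {print $NF}'
}

# Fabricated sample line for illustration; in real use this would be:
#   ss -ltnp | parse_holder 49155
sample='LISTEN 0 128 *:49155 *:* users:(("glusterfsd",pid=20215,fd=10))'
holder=$(printf '%s\n' "$sample" | parse_holder 49155)
echo "port 49155 held by: $holder"
```

On a test slave, running this against live `ss -ltnp` output right after a failure would show whether a leftover glusterfsd (or a non-gluster process) is pinning the snap brick's port.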
Re: [Gluster-devel] spurious regression failures again!
Joseph,

I am not sure I understand how this is affecting the spurious failure of bug-1112559.t. As per the mail you have attached, and according to your analysis, bug-1112559.t fails because cleanup did not happen properly after a previous test case failed, and in your case there was a crash as well. But out of all the times bug-1112559.t has failed, most of the time it is the only test case failing and there is no crash. Below are the regression runs that Pranith had sent for the same:

http://build.gluster.org/job/rackspace-regression-2GB/541/consoleFull
http://build.gluster.org/job/rackspace-regression-2GB-triggered/173/consoleFull
http://build.gluster.org/job/rackspace-regression-2GB-triggered/172/consoleFull
http://build.gluster.org/job/rackspace-regression-2GB/543/console

In all of the above, bug-1112559.t is the only test case that fails, and there is no crash. So what I fail to understand is this: if this particular test case fails on its own as well as alongside other test cases, how can we conclude that some other test case is not cleaning up properly and that this is the reason bug-1112559.t fails?

mgmt_v3-locks.t fails because glusterd takes more time to register a node going down, so peer status does not return what the test case expects. It's a race. The test case ends with the same cleanup routine as every other test case, which kills all gluster and glusterfsd processes that might be using any brick ports. So could you please explain how, or which, process still uses the brick ports that the snap bricks are trying to use, leading to the failure of bug-1112559.t?

Regards,
Avra

On 07/15/2014 09:57 PM, Joseph Fernandes wrote:

Just pointing out,

2) tests/basic/mgmt_v3-locks.t - Author: Avra
http://build.gluster.org/job/rackspace-regression-2GB-triggered/375/consoleFull

This is a similar kind of error to what I saw in my testing of the spurious failure of tests/bugs/bug-1112559.t

Please refer to the attached mail.
Regards,
Joe

- Original Message -
From: "Pranith Kumar Karampuri"
To: "Joseph Fernandes"
Cc: "Gluster Devel" , "Varun Shastry"
Sent: Tuesday, July 15, 2014 9:34:26 PM
Subject: Re: [Gluster-devel] spurious regression failures again!

On 07/15/2014 09:24 PM, Joseph Fernandes wrote:

Hi Pranith,

Could you please share the link of the console output of the failures.

Added them inline. Thanks for reminding :-)

Pranith

Regards,
Joe

- Original Message -
From: "Pranith Kumar Karampuri"
To: "Gluster Devel" , "Varun Shastry"
Sent: Tuesday, July 15, 2014 8:52:44 PM
Subject: [Gluster-devel] spurious regression failures again!

hi,
We have 4 tests failing once in a while causing problems:
1) tests/bugs/bug-1087198.t - Author: Varun
http://build.gluster.org/job/rackspace-regression-2GB-triggered/379/consoleFull
2) tests/basic/mgmt_v3-locks.t - Author: Avra
http://build.gluster.org/job/rackspace-regression-2GB-triggered/375/consoleFull
3) tests/basic/fops-sanity.t - Author: Pranith
http://build.gluster.org/job/rackspace-regression-2GB-triggered/383/consoleFull

Please take a look at them and post updates.

Pranith

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
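[Editor's note: the race Avra describes, where peer status is checked before glusterd has registered a node going down, is usually handled by polling until a deadline rather than asserting immediately. The sketch below illustrates that pattern with a hypothetical `expect_within` helper; it is loosely modelled on the Gluster test framework's retry idiom, but the name, timings, and usage here are illustrative, not the real harness.]

```shell
#!/bin/sh
# Sketch: retry a check until it returns the expected value or a deadline
# passes, instead of asserting the very first result. This absorbs small
# timing races such as glusterd being slow to register a downed peer.
expect_within() {
    _timeout=$1; _expected=$2; shift 2
    while [ "$_timeout" -gt 0 ]; do
        _actual=$("$@")                       # run the probe command
        [ "$_actual" = "$_expected" ] && return 0
        sleep 1
        _timeout=$((_timeout - 1))
    done
    echo "expected '$_expected', got '$_actual'" >&2
    return 1
}

# A testcase would use something like (peer_count is hypothetical):
#   expect_within 20 "1" peer_count    # wait up to 20s for the peer to drop
expect_within 5 "ok" echo ok && echo "assertion passed"
```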
Re: [Gluster-devel] spurious regression failures again!
Just pointing out,

2) tests/basic/mgmt_v3-locks.t - Author: Avra
http://build.gluster.org/job/rackspace-regression-2GB-triggered/375/consoleFull

This is a similar kind of error to what I saw in my testing of the spurious failure of tests/bugs/bug-1112559.t

Please refer to the attached mail.

Regards,
Joe

- Original Message -
From: "Pranith Kumar Karampuri"
To: "Joseph Fernandes"
Cc: "Gluster Devel" , "Varun Shastry"
Sent: Tuesday, July 15, 2014 9:34:26 PM
Subject: Re: [Gluster-devel] spurious regression failures again!

On 07/15/2014 09:24 PM, Joseph Fernandes wrote:
> Hi Pranith,
>
> Could you please share the link of the console output of the failures.
Added them inline. Thanks for reminding :-)

Pranith
>
> Regards,
> Joe
>
> - Original Message -
> From: "Pranith Kumar Karampuri"
> To: "Gluster Devel" , "Varun Shastry"
> Sent: Tuesday, July 15, 2014 8:52:44 PM
> Subject: [Gluster-devel] spurious regression failures again!
>
> hi,
> We have 4 tests failing once in a while causing problems:
> 1) tests/bugs/bug-1087198.t - Author: Varun http://build.gluster.org/job/rackspace-regression-2GB-triggered/379/consoleFull
> 2) tests/basic/mgmt_v3-locks.t - Author: Avra http://build.gluster.org/job/rackspace-regression-2GB-triggered/375/consoleFull
> 3) tests/basic/fops-sanity.t - Author: Pranith http://build.gluster.org/job/rackspace-regression-2GB-triggered/383/consoleFull
>
> Please take a look at them and post updates.
>
> Pranith

--- Begin Message ---
Hi All,

Thanks Justin for the setup (slave30). Executed the whole regression suite on slave30 multiple times.
Once there was a failure of ./tests/basic/mgmt_v3-locks.t with a core:
http://build.gluster.org/job/rackspace-regression-2GB-joe/12/console

Test Summary Report
---
./tests/basic/mgmt_v3-locks.t (Wstat: 0 Tests: 14 Failed: 3)
  Failed tests: 11-13
Files=250, Tests=4897, 3968 wallclock secs ( 1.91 usr 1.41 sys + 330.06 cusr 457.27 csys = 790.65 CPU)
Result: FAIL
+ RET=1
++ ls -l /core.20215
++ wc -l

There is a glusterd crash. Log files and core files are available @ http://build.gluster.org/job/rackspace-regression-2GB-joe/12/console

And the very next regression test, bug-1112559.t, failed with the same port unavailability:
http://build.gluster.org/job/rackspace-regression-2GB-joe/13/console

After this I restarted slave30, executed the whole regression suite again, and never hit this issue. It looks like the issue does not originate in bug-1112559.t: the failure of test 10 in bug-1112559.t is the result of a previous failure or crash.

Regards,
Joe

- Original Message -
From: "Justin Clift"
To: "Vijay Bellur"
Cc: "Avra Sengupta" , "Joseph Fernandes" , "Pranith Kumar Karampuri" , "Gluster Devel"
Sent: Thursday, July 10, 2014 8:26:49 PM
Subject: Re: [Gluster-devel] regarding spurious failure tests/bugs/bug-1112559.t

On 10/07/2014, at 12:44 PM, Vijay Bellur wrote:
> A lot of regression runs are failing because of this test unit. Given feature freeze is around the corner, shall we provide a +1 verified manually for those patchsets that fail this test?

Went through and did this manually, as "Gluster Build System". Also got Joe set up so he can debug things on a Rackspace VM to find out what's wrong.

+ Justin
--
GlusterFS - http://www.gluster.org
An open source, distributed file system scaling to several petabytes, and handling thousands of clients.
My personal twitter: twitter.com/realjustinclift
--- End Message ---
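[Editor's note: Joseph's conclusion, that bug-1112559.t fails only when an earlier failure or crash leaves a brick port occupied, suggests a cleanup-verification step between runs. The sketch below illustrates the idea and is not the actual cleanup script: `ports_in_brick_range` filters `ss -ltn`-style output for listeners on 49xxx ports (the range the thread's brick ports 49146-49155 fall in), and the sample lines are fabricated.]

```shell
#!/bin/sh
# Sketch: after killing stray gluster* processes, verify that nothing is
# still listening in the brick-port range before the next testcase starts.
# In real use this would run after the cleanup's kill step and consume
# live output:  ss -ltn | ports_in_brick_range
ports_in_brick_range() {
    # column 4 of `ss -ltn` output is local address:port; match 49000-49999
    # as a rough stand-in for the brick-port range
    awk '$4 ~ /:49[0-9][0-9][0-9]$/ {print $4}'
}

# Fabricated sample: one leftover listener on a brick port, plus sshd
sample='LISTEN 0 128 127.1.1.2:49155 *:*
LISTEN 0 128 0.0.0.0:22 *:*'
busy=$(printf '%s\n' "$sample" | ports_in_brick_range)
echo "still in brick-port range: $busy"
```

If this check reported a busy port right before bug-1112559.t started, it would confirm that the snap brick's bind failure comes from a leftover process rather than from the test itself.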
Re: [Gluster-devel] spurious regression failures again!
On 07/15/2014 09:24 PM, Joseph Fernandes wrote:

Hi Pranith,

Could you please share the link of the console output of the failures.

Added them inline. Thanks for reminding :-)

Pranith

Regards,
Joe

- Original Message -
From: "Pranith Kumar Karampuri"
To: "Gluster Devel" , "Varun Shastry"
Sent: Tuesday, July 15, 2014 8:52:44 PM
Subject: [Gluster-devel] spurious regression failures again!

hi,
We have 4 tests failing once in a while causing problems:
1) tests/bugs/bug-1087198.t - Author: Varun
http://build.gluster.org/job/rackspace-regression-2GB-triggered/379/consoleFull
2) tests/basic/mgmt_v3-locks.t - Author: Avra
http://build.gluster.org/job/rackspace-regression-2GB-triggered/375/consoleFull
3) tests/basic/fops-sanity.t - Author: Pranith
http://build.gluster.org/job/rackspace-regression-2GB-triggered/383/consoleFull

Please take a look at them and post updates.

Pranith
Re: [Gluster-devel] spurious regression failures again!
Hi Pranith,

Could you please share the link of the console output of the failures.

Regards,
Joe

- Original Message -
From: "Pranith Kumar Karampuri"
To: "Gluster Devel" , "Varun Shastry"
Sent: Tuesday, July 15, 2014 8:52:44 PM
Subject: [Gluster-devel] spurious regression failures again!

hi,
We have 4 tests failing once in a while causing problems:
1) tests/bugs/bug-1087198.t - Author: Varun
2) tests/basic/mgmt_v3-locks.t - Author: Avra
3) tests/basic/fops-sanity.t - Author: Pranith

Please take a look at them and post updates.

Pranith
[Gluster-devel] spurious regression failures again!
hi,
We have 4 tests failing once in a while causing problems:
1) tests/bugs/bug-1087198.t - Author: Varun
2) tests/basic/mgmt_v3-locks.t - Author: Avra
3) tests/basic/fops-sanity.t - Author: Pranith

Please take a look at them and post updates.

Pranith