Re: [Gluster-devel] Spurious failures because of nfs and snapshots
On 05/21/2014 08:50 PM, Vijaikumar M wrote: KP, Atin and myself did some debugging and found that there was a deadlock in glusterd. When creating a volume snapshot, the back-end operation 'taking a lvm_snapshot and starting brick' for the each brick are executed in parallel using synctask framework. brick_start was releasing a big_lock with brick_connect and does a lock again. This caused a deadlock in some race condition where main-thread waiting for one of the synctask thread to finish and synctask-thread waiting for the big_lock. We are working on fixing this issue. If this fix is going to take more time, can we please log a bug to track this problem and remove the test cases that need to be addressed from the test unit? This way other valid patches will not be blocked by the failure of the snapshot test unit. We can introduce these tests again as part of the fix for the problem. -Vijay ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
- Original Message - From: Atin Mukherjee amukh...@redhat.com To: gluster-devel@gluster.org, Pranith Kumar Karampuri pkara...@redhat.com Sent: Wednesday, May 21, 2014 3:39:21 PM Subject: Re: Fwd: Re: [Gluster-devel] Spurious failures because of nfs and snapshots On 05/21/2014 11:42 AM, Atin Mukherjee wrote: On 05/21/2014 10:54 AM, SATHEESARAN wrote: Guys, This is the issue pointed out by Pranith with regard to Barrier. I was reading through it. But I wanted to bring it to concern -- S Original Message Subject: Re: [Gluster-devel] Spurious failures because of nfs and snapshots Date: Tue, 20 May 2014 21:16:57 -0400 (EDT) From: Pranith Kumar Karampuri pkara...@redhat.com To:Vijaikumar M vmall...@redhat.com, Joseph Fernandes josfe...@redhat.com CC:Gluster Devel gluster-devel@gluster.org Hey, Seems like even after this fix is merged, the regression tests are failing for the same script. You can check the logs at http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz Pranith, Is this the correct link? I don't see any log having this sequence there. Also looking at the log from this mail, this is expected as per the barrier functionality, an enable request followed by another enable should always fail and the same happens for disable. Can you please confirm the link and which particular regression test is causing this issue, is it bug-1090042.t? --Atin Relevant logs: [2014-05-20 20:17:07.026045] : volume create patchy build.gluster.org:/d/backends/patchy1 build.gluster.org:/d/backends/patchy2 : SUCCESS [2014-05-20 20:17:08.030673] : volume start patchy : SUCCESS [2014-05-20 20:17:08.279148] : volume barrier patchy enable : SUCCESS [2014-05-20 20:17:08.476785] : volume barrier patchy enable : FAILED : Failed to reconfigure barrier. [2014-05-20 20:17:08.727429] : volume barrier patchy disable : SUCCESS [2014-05-20 20:17:08.926995] : volume barrier patchy disable : FAILED : Failed to reconfigure barrier. This log is for bug-1092841.t and its expected. Damn :-(. I think I screwed up the timestamps while checking Sorry about that :-(. But there are failures. Check http://build.gluster.org/job/regression/4501/consoleFull Pranith --Atin Pranith - Original Message - From: Pranith Kumar Karampuri pkara...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Joseph Fernandes josfe...@redhat.com, Vijaikumar M vmall...@redhat.com Sent: Tuesday, May 20, 2014 3:41:11 PM Subject: Re: Spurious failures because of nfs and snapshots hi, Please resubmit the patches on top of http://review.gluster.com/#/c/7753 to prevent frequent regression failures. Pranith - Original Message - From: Vijaikumar M vmall...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Joseph Fernandes josfe...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Monday, May 19, 2014 2:40:47 PM Subject: Re: Spurious failures because of nfs and snapshots Brick disconnected with ping-time out: Here is the log message [2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbi n/glusterfsd version 3.5qa2 (args: /build/install/sbin/glusterfsd -s build.gluster.org --volfile-id /snaps/patchy_snap1/3f2ae3fbb4a74587b1a9 1013f07d327f.build.gluster.org.var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3 -p /var/lib/glusterd/snaps/patchy_snap1/3f2ae3f bb4a74587b1a91013f07d327f/run/build.gluster.org-var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.pid -S /var/run/51fe50a6faf0aae006c815da946caf3a.socket --brick-name /var/run/gluster/snaps/3f2ae3fbb4a74587b1a91013f07d327f/brick3 -l /build/install/var/log/glusterfs/br icks/var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.log --xlator-option *-posix.glusterd-uuid=494ef3cd-15fc-4c8c-8751-2d441ba 7b4b0 --brick-port 49164 --xlator-option 3f2ae3fbb4a74587b1a91013f07d327f-server.listen-port=49164) 2 [2014-05-19 04:29:38.141118] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-glusterfs: defaulting ping-timeout to 30secs 3 [2014-05-19 04:30:09.139521] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-glusterfs: server 10.3.129.13:24007 has not responded in the last 30 seconds, disconnecting. Patch 'http://review.gluster.org/#/c/7753/' will fix the problem, where ping-timer will be disabled by default for all the rpc connection except for glusterd-glusterd (set to 30sec) and client-glusterd (set to 42sec). Thanks, Vijay On Monday 19 May 2014 11:56 AM, Pranith Kumar Karampuri wrote: The latest build failure also has the same issue: Download it from here: http://build.gluster.org:443/logs/glusterfs-logs
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
KP, Atin and myself did some debugging and found that there was a deadlock in glusterd. When creating a volume snapshot, the back-end operation 'taking a lvm_snapshot and starting brick' for the each brick are executed in parallel using synctask framework. brick_start was releasing a big_lock with brick_connect and does a lock again. This caused a deadlock in some race condition where main-thread waiting for one of the synctask thread to finish and synctask-thread waiting for the big_lock. We are working on fixing this issue. Thanks, Vijay On Wednesday 21 May 2014 12:23 PM, Vijaikumar M wrote: From the log: http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a17%3a10%3a51.tgzit looks like glusterd was hung: *Glusterd log:** * 5305 [2014-05-20 20:08:55.040665] E [glusterd-snapshot.c:3805:glusterd_add_brick_to_snap_volume] 0-management: Unable to fetch snap device (vol1.brick_snapdevice0). Leaving empty 5306 [2014-05-20 20:08:55.649146] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 5307 [2014-05-20 20:08:55.663181] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 5308 [2014-05-20 20:16:55.541197] W [glusterfsd.c:1182:cleanup_and_exit] (-- 0-: received signum (15), shutting down Glusterd was hung when executing the testcase ./tests/bugs/bug-1090042.t. *Cli log:** *72649 [2014-05-20 20:12:51.960765] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect 72650 [2014-05-20 20:12:51.960850] T [socket.c:2689:socket_connect] (--/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) [0x7ff8b6609994] (--/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) [0x7ff8b5d3305b] (- -/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) [0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport already connected 72651 [2014-05-20 20:12:52.960943] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect 72652 [2014-05-20 20:12:52.960999] T [socket.c:2697:socket_connect] 0-glusterfs: connecting 0x1e0fcc0, state=0 gen=0 sock=-1 72653 [2014-05-20 20:12:52.961038] W [dict.c:1059:data_to_str] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) [0x7ff8ad9e95f3] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_clien t_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0xf1) [0x7ff8ad9ec7d0]))) 0-dict: data is NULL 72654 [2014-05-20 20:12:52.961070] W [dict.c:1059:data_to_str] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) [0x7ff8ad9e95f3] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_clien t_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0x100) [0x7ff8ad9ec7df]))) 0-dict: data is NULL 72655 [2014-05-20 20:12:52.961079] E [name.c:140:client_fill_address_family] 0-glusterfs: transport.address-family not specified. Could not guess default value from (remote-host:(null) or transport.unix.connect-path:(null)) optio ns 72656 [2014-05-20 20:12:54.961273] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect 72657 [2014-05-20 20:12:54.961404] T [socket.c:2689:socket_connect] (--/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) [0x7ff8b6609994] (--/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) [0x7ff8b5d3305b] (- -/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) [0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport already connected 72658 [2014-05-20 20:12:55.120645] D [cli-cmd.c:384:cli_cmd_submit] 0-cli: Returning 110 72659 [2014-05-20 20:12:55.120723] D [cli-rpc-ops.c:8716:gf_cli_snapshot] 0-cli: Returning 110 Now we need to find why glusterd was hung. Thanks, Vijay On Wednesday 21 May 2014 06:46 AM, Pranith Kumar Karampuri wrote: Hey, Seems like even after this fix is merged, the regression tests are failing for the same script. You can check the logs athttp://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz Relevant logs: [2014-05-20 20:17:07.026045] : volume create patchy build.gluster.org:/d/backends/patchy1 build.gluster.org:/d/backends/patchy2 : SUCCESS [2014-05-20 20:17:08.030673] : volume start patchy : SUCCESS [2014-05-20 20:17:08.279148] : volume barrier patchy enable : SUCCESS [2014-05-20 20:17:08.476785] : volume barrier patchy enable : FAILED : Failed to reconfigure barrier. [2014-05-20 20:17:08.727429] : volume barrier patchy disable : SUCCESS [2014-05-20 20:17:08.926995] : volume barrier patchy disable : FAILED : Failed to reconfigure barrier. Pranith - Original Message - From: Pranith Kumar Karampuripkara...@redhat.com To: Gluster Develgluster-devel@gluster.org Cc: Joseph Fernandesjosfe...@redhat.com, Vijaikumar
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
hi, Please resubmit the patches on top of http://review.gluster.com/#/c/7753 to prevent frequent regression failures. Pranith - Original Message - From: Vijaikumar M vmall...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Joseph Fernandes josfe...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Monday, May 19, 2014 2:40:47 PM Subject: Re: Spurious failures because of nfs and snapshots Brick disconnected with ping-time out: Here is the log message [2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbi n/glusterfsd version 3.5qa2 (args: /build/install/sbin/glusterfsd -s build.gluster.org --volfile-id /snaps/patchy_snap1/3f2ae3fbb4a74587b1a9 1013f07d327f.build.gluster.org.var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3 -p /var/lib/glusterd/snaps/patchy_snap1/3f2ae3f bb4a74587b1a91013f07d327f/run/build.gluster.org-var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.pid -S /var/run/51fe50a6faf0aae006c815da946caf3a.socket --brick-name /var/run/gluster/snaps/3f2ae3fbb4a74587b1a91013f07d327f/brick3 -l /build/install/var/log/glusterfs/br icks/var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.log --xlator-option *-posix.glusterd-uuid=494ef3cd-15fc-4c8c-8751-2d441ba 7b4b0 --brick-port 49164 --xlator-option 3f2ae3fbb4a74587b1a91013f07d327f-server.listen-port=49164) 2 [2014-05-19 04:29:38.141118] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-glusterfs: defaulting ping-timeout to 30secs 3 [2014-05-19 04:30:09.139521] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-glusterfs: server 10.3.129.13:24007 has not responded in the last 30 seconds, disconnecting. Patch 'http://review.gluster.org/#/c/7753/' will fix the problem, where ping-timer will be disabled by default for all the rpc connection except for glusterd-glusterd (set to 30sec) and client-glusterd (set to 42sec). Thanks, Vijay On Monday 19 May 2014 11:56 AM, Pranith Kumar Karampuri wrote: The latest build failure also has the same issue: Download it from here: http://build.gluster.org:443/logs/glusterfs-logs-20140518%3a22%3a27%3a31.tgz Pranith - Original Message - From: Vijaikumar M vmall...@redhat.com To: Joseph Fernandes josfe...@redhat.com Cc: Pranith Kumar Karampuri pkara...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Monday, 19 May, 2014 11:41:28 AM Subject: Re: Spurious failures because of nfs and snapshots Hi Joseph, In the log mentioned below, it say ping-time is set to default value 30sec.I think issue is different. Can you please point me to the logs where you where able to re-create the problem. Thanks, Vijay On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote: hi Vijai, Joseph, In 2 of the last 3 build failures, http://build.gluster.org/job/regression/4479/console, http://build.gluster.org/job/regression/4478/console this test(tests/bugs/bug-1090042.t) failed. Do you guys think it is better to revert this test until the fix is available? Please send a patch to revert the test case if you guys feel so. You can re-submit it along with the fix to the bug mentioned by Joseph. Pranith. - Original Message - From: Joseph Fernandes josfe...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Friday, 16 May, 2014 5:13:57 PM Subject: Re: Spurious failures because of nfs and snapshots Hi All, tests/bugs/bug-1090042.t : I was able to reproduce the issue i.e when this test is done in a loop for i in {1..135} ; do ./bugs/bug-1090042.t When checked the logs [2014-05-16 10:49:49.003978] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2014-05-16 10:49:49.004035] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-management: defaulting ping-timeout to 30secs [2014-05-16 10:49:49.004303] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2014-05-16 10:49:49.004340] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-management: defaulting ping-timeout to 30secs The issue is with ping-timeout and is tracked under the bug https://bugzilla.redhat.com/show_bug.cgi?id=1096729 The workaround is mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1096729#c8 Regards, Joe - Original Message - From: Pranith Kumar Karampuri pkara...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Joseph Fernandes josfe...@redhat.com Sent: Friday, May 16, 2014 6:19:54 AM Subject: Spurious failures because of nfs and snapshots hi, In the latest build I fired for review.gluster.com/7766 (http://build.gluster.org/job/regression/4443/console) failed
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
Hey, Seems like even after this fix is merged, the regression tests are failing for the same script. You can check the logs at http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz Relevant logs: [2014-05-20 20:17:07.026045] : volume create patchy build.gluster.org:/d/backends/patchy1 build.gluster.org:/d/backends/patchy2 : SUCCESS [2014-05-20 20:17:08.030673] : volume start patchy : SUCCESS [2014-05-20 20:17:08.279148] : volume barrier patchy enable : SUCCESS [2014-05-20 20:17:08.476785] : volume barrier patchy enable : FAILED : Failed to reconfigure barrier. [2014-05-20 20:17:08.727429] : volume barrier patchy disable : SUCCESS [2014-05-20 20:17:08.926995] : volume barrier patchy disable : FAILED : Failed to reconfigure barrier. Pranith - Original Message - From: Pranith Kumar Karampuri pkara...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Joseph Fernandes josfe...@redhat.com, Vijaikumar M vmall...@redhat.com Sent: Tuesday, May 20, 2014 3:41:11 PM Subject: Re: Spurious failures because of nfs and snapshots hi, Please resubmit the patches on top of http://review.gluster.com/#/c/7753 to prevent frequent regression failures. Pranith - Original Message - From: Vijaikumar M vmall...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Joseph Fernandes josfe...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Monday, May 19, 2014 2:40:47 PM Subject: Re: Spurious failures because of nfs and snapshots Brick disconnected with ping-time out: Here is the log message [2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbi n/glusterfsd version 3.5qa2 (args: /build/install/sbin/glusterfsd -s build.gluster.org --volfile-id /snaps/patchy_snap1/3f2ae3fbb4a74587b1a9 1013f07d327f.build.gluster.org.var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3 -p /var/lib/glusterd/snaps/patchy_snap1/3f2ae3f bb4a74587b1a91013f07d327f/run/build.gluster.org-var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.pid -S /var/run/51fe50a6faf0aae006c815da946caf3a.socket --brick-name /var/run/gluster/snaps/3f2ae3fbb4a74587b1a91013f07d327f/brick3 -l /build/install/var/log/glusterfs/br icks/var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.log --xlator-option *-posix.glusterd-uuid=494ef3cd-15fc-4c8c-8751-2d441ba 7b4b0 --brick-port 49164 --xlator-option 3f2ae3fbb4a74587b1a91013f07d327f-server.listen-port=49164) 2 [2014-05-19 04:29:38.141118] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-glusterfs: defaulting ping-timeout to 30secs 3 [2014-05-19 04:30:09.139521] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-glusterfs: server 10.3.129.13:24007 has not responded in the last 30 seconds, disconnecting. Patch 'http://review.gluster.org/#/c/7753/' will fix the problem, where ping-timer will be disabled by default for all the rpc connection except for glusterd-glusterd (set to 30sec) and client-glusterd (set to 42sec). Thanks, Vijay On Monday 19 May 2014 11:56 AM, Pranith Kumar Karampuri wrote: The latest build failure also has the same issue: Download it from here: http://build.gluster.org:443/logs/glusterfs-logs-20140518%3a22%3a27%3a31.tgz Pranith - Original Message - From: Vijaikumar M vmall...@redhat.com To: Joseph Fernandes josfe...@redhat.com Cc: Pranith Kumar Karampuri pkara...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Monday, 19 May, 2014 11:41:28 AM Subject: Re: Spurious failures because of nfs and snapshots Hi Joseph, In the log mentioned below, it say ping-time is set to default value 30sec.I think issue is different. Can you please point me to the logs where you where able to re-create the problem. Thanks, Vijay On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote: hi Vijai, Joseph, In 2 of the last 3 build failures, http://build.gluster.org/job/regression/4479/console, http://build.gluster.org/job/regression/4478/console this test(tests/bugs/bug-1090042.t) failed. Do you guys think it is better to revert this test until the fix is available? Please send a patch to revert the test case if you guys feel so. You can re-submit it along with the fix to the bug mentioned by Joseph. Pranith. - Original Message - From: Joseph Fernandes josfe...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Friday, 16 May, 2014 5:13:57 PM Subject: Re: Spurious failures because of nfs and snapshots Hi All, tests/bugs/bug-1090042.t : I was able to reproduce the issue i.e when this test is done in a loop for i in
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
Hi Joseph, In the log mentioned below, it say ping-time is set to default value 30sec.I think issue is different. Can you please point me to the logs where you where able to re-create the problem. Thanks, Vijay On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote: hi Vijai, Joseph, In 2 of the last 3 build failures, http://build.gluster.org/job/regression/4479/console, http://build.gluster.org/job/regression/4478/console this test(tests/bugs/bug-1090042.t) failed. Do you guys think it is better to revert this test until the fix is available? Please send a patch to revert the test case if you guys feel so. You can re-submit it along with the fix to the bug mentioned by Joseph. Pranith. - Original Message - From: Joseph Fernandes josfe...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Friday, 16 May, 2014 5:13:57 PM Subject: Re: Spurious failures because of nfs and snapshots Hi All, tests/bugs/bug-1090042.t : I was able to reproduce the issue i.e when this test is done in a loop for i in {1..135} ; do ./bugs/bug-1090042.t When checked the logs [2014-05-16 10:49:49.003978] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2014-05-16 10:49:49.004035] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-management: defaulting ping-timeout to 30secs [2014-05-16 10:49:49.004303] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2014-05-16 10:49:49.004340] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-management: defaulting ping-timeout to 30secs The issue is with ping-timeout and is tracked under the bug https://bugzilla.redhat.com/show_bug.cgi?id=1096729 The workaround is mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1096729#c8 Regards, Joe - Original Message - From: Pranith Kumar Karampuri pkara...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Joseph Fernandes josfe...@redhat.com Sent: Friday, May 16, 2014 6:19:54 AM Subject: Spurious failures because of nfs and snapshots hi, In the latest build I fired for review.gluster.com/7766 (http://build.gluster.org/job/regression/4443/console) failed because of spurious failure. The script doesn't wait for nfs export to be available. I fixed that, but interestingly I found quite a few scripts with same problem. Some of the scripts are relying on 'sleep 5' which also could lead to spurious failures if the export is not available in 5 seconds. We found that waiting for 20 seconds is better, but 'sleep 20' would unnecessarily delay the build execution. So if you guys are going to write any scripts which has to do nfs mounts, please do it the following way: EXPECT_WITHIN 20 1 is_nfs_export_available; TEST mount -t nfs -o vers=3 $H0:/$V0 $N0; Please review http://review.gluster.com/7773 :-) I saw one more spurious failure in a snapshot related script tests/bugs/bug-1090042.t on the next build fired by Niels. Joesph (CCed) is debugging it. He agreed to reply what he finds and share it with us so that we won't introduce similar bugs in future. I encourage you guys to share what you fix to prevent spurious failures in future. Thanks Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
- Original Message - From: Justin Clift jus...@gluster.org To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Monday, 19 May, 2014 10:26:04 AM Subject: Re: [Gluster-devel] Spurious failures because of nfs and snapshots On 16/05/2014, at 1:49 AM, Pranith Kumar Karampuri wrote: hi, In the latest build I fired for review.gluster.com/7766 (http://build.gluster.org/job/regression/4443/console) failed because of spurious failure. The script doesn't wait for nfs export to be available. I fixed that, but interestingly I found quite a few scripts with same problem. Some of the scripts are relying on 'sleep 5' which also could lead to spurious failures if the export is not available in 5 seconds. Cool. Fixing this NFS problem across all of the tests would be really welcome. That specific failed test (bug-1087198.t) is the most common one I've seen over the last few weeks, causing about half of all failures in master. Eliminating this class of regression failure would be really helpful. :) This particular class is eliminated :-). Patch was merged on Friday. Pranith + Justin -- Open Source and Standards @ Red Hat twitter.com/realjustinclift ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
On Thu, May 15, 2014 at 5:49 PM, Pranith Kumar Karampuri pkara...@redhat.com wrote: hi, In the latest build I fired for review.gluster.com/7766 ( http://build.gluster.org/job/regression/4443/console) failed because of spurious failure. The script doesn't wait for nfs export to be available. I fixed that, but interestingly I found quite a few scripts with same problem. Some of the scripts are relying on 'sleep 5' which also could lead to spurious failures if the export is not available in 5 seconds. We found that waiting for 20 seconds is better, but 'sleep 20' would unnecessarily delay the build execution. So if you guys are going to write any scripts which has to do nfs mounts, please do it the following way: EXPECT_WITHIN 20 1 is_nfs_export_available; TEST mount -t nfs -o vers=3 $H0:/$V0 $N0; Always please also add mount -o soft,intr in the regression scripts for mounting nfs. Becomes so much easier to cleanup any hung mess. We probably need an NFS mounting helper function which can be called like: TEST mount_nfs $H0:/$V0 $N0; Thanks Avati ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel