Re: [Gluster-devel] Spurious failures because of nfs and snapshots
From the log http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a17%3a10%3a51.tgz it looks like glusterd was hung.

Glusterd log:

 5305 [2014-05-20 20:08:55.040665] E [glusterd-snapshot.c:3805:glusterd_add_brick_to_snap_volume] 0-management: Unable to fetch snap device (vol1.brick_snapdevice0). Leaving empty
 5306 [2014-05-20 20:08:55.649146] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
 5307 [2014-05-20 20:08:55.663181] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
 5308 [2014-05-20 20:16:55.541197] W [glusterfsd.c:1182:cleanup_and_exit] (--> 0-: received signum (15), shutting down

Glusterd was hung when executing the testcase ./tests/bugs/bug-1090042.t.

Cli log:

72649 [2014-05-20 20:12:51.960765] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
72650 [2014-05-20 20:12:51.960850] T [socket.c:2689:socket_connect] (-->/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) [0x7ff8b6609994] (-->/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) [0x7ff8b5d3305b] (-->/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) [0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport already connected
72651 [2014-05-20 20:12:52.960943] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
72652 [2014-05-20 20:12:52.960999] T [socket.c:2697:socket_connect] 0-glusterfs: connecting 0x1e0fcc0, state=0 gen=0 sock=-1
72653 [2014-05-20 20:12:52.961038] W [dict.c:1059:data_to_str] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) [0x7ff8ad9e95f3] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_client_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0xf1) [0x7ff8ad9ec7d0]))) 0-dict: data is NULL
72654 [2014-05-20 20:12:52.961070] W [dict.c:1059:data_to_str] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) [0x7ff8ad9e95f3] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_client_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0x100) [0x7ff8ad9ec7df]))) 0-dict: data is NULL
72655 [2014-05-20 20:12:52.961079] E [name.c:140:client_fill_address_family] 0-glusterfs: transport.address-family not specified. Could not guess default value from (remote-host:(null) or transport.unix.connect-path:(null)) options
72656 [2014-05-20 20:12:54.961273] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
72657 [2014-05-20 20:12:54.961404] T [socket.c:2689:socket_connect] (-->/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) [0x7ff8b6609994] (-->/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) [0x7ff8b5d3305b] (-->/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) [0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport already connected
72658 [2014-05-20 20:12:55.120645] D [cli-cmd.c:384:cli_cmd_submit] 0-cli: Returning 110
72659 [2014-05-20 20:12:55.120723] D [cli-rpc-ops.c:8716:gf_cli_snapshot] 0-cli: Returning 110

Now we need to find why glusterd was hung.

Thanks,
Vijay

On Wednesday 21 May 2014 06:46 AM, Pranith Kumar Karampuri wrote:
> Hey,
> Seems like even after this fix is merged, the regression tests are failing
> for the same script. You can check the logs at
> http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz
>
> Relevant logs:
> [2014-05-20 20:17:07.026045] : volume create patchy build.gluster.org:/d/backends/patchy1 build.gluster.org:/d/backends/patchy2 : SUCCESS
> [2014-05-20 20:17:08.030673] : volume start patchy : SUCCESS
> [2014-05-20 20:17:08.279148] : volume barrier patchy enable : SUCCESS
> [2014-05-20 20:17:08.476785] : volume barrier patchy enable : FAILED : Failed to reconfigure barrier.
> [2014-05-20 20:17:08.727429] : volume barrier patchy disable : SUCCESS
> [2014-05-20 20:17:08.926995] : volume barrier patchy disable : FAILED : Failed to reconfigure barrier.
>
> Pranith
>
> - Original Message -
> > From: "Pranith Kumar Karampuri"
> > To: "Gluster Devel"
> > Cc: "Joseph Fernandes", "Vijaikumar M"
> > Sent: Tuesday, May 20, 2014 3:41:11 PM
> > Subject: Re: Spurious failures because of nfs and snapshots
> >
> > hi,
> > Please resubmit the patches on top of http://review.gluster.com/#/c/7753
> > to prevent frequent regression failures.
> >
> > Pranith
> > - Original Message -
> > > From: "Vijaikumar M"
> > > To: "Pranith Kumar Karampuri"
> > > Cc: "Joseph Fernandes", "Gluster Devel"
> > > Sent: Monday, May 19, 2014 2:40:47 PM
> > > Subject: Re: Spurious failures because of nfs and snapshots
> > >
> > > Brick disconnected with ping-timeout:
> > >
> > > Here is the log message
> > > [2014-05-19 04:29:38.13
Re: [Gluster-devel] Fwd: Re: Spurious failures because of nfs and snapshots
On 05/21/2014 10:54 AM, SATHEESARAN wrote:
> Guys,
>
> This is the issue pointed out by Pranith with regard to Barrier.
> I was reading through it.
>
> But I wanted to bring it to your attention.
>
> -- S
>
> Original Message
> Subject: Re: [Gluster-devel] Spurious failures because of nfs and snapshots
> Date: Tue, 20 May 2014 21:16:57 -0400 (EDT)
> From: Pranith Kumar Karampuri
> To: Vijaikumar M, Joseph Fernandes
> CC: Gluster Devel
>
> Hey,
> Seems like even after this fix is merged, the regression tests are
> failing for the same script. You can check the logs at
> http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz

Pranith,

Is this the correct link? I don't see any log having this sequence there.

Also, looking at the log from this mail, this is expected as per the barrier functionality: an enable request followed by another enable should always fail, and the same happens for disable. Can you please confirm the link and which particular regression test is causing this issue. Is it bug-1090042.t?

--Atin

> Relevant logs:
> [2014-05-20 20:17:07.026045] : volume create patchy build.gluster.org:/d/backends/patchy1 build.gluster.org:/d/backends/patchy2 : SUCCESS
> [2014-05-20 20:17:08.030673] : volume start patchy : SUCCESS
> [2014-05-20 20:17:08.279148] : volume barrier patchy enable : SUCCESS
> [2014-05-20 20:17:08.476785] : volume barrier patchy enable : FAILED : Failed to reconfigure barrier.
> [2014-05-20 20:17:08.727429] : volume barrier patchy disable : SUCCESS
> [2014-05-20 20:17:08.926995] : volume barrier patchy disable : FAILED : Failed to reconfigure barrier.
>
> Pranith
>
> - Original Message -
>> From: "Pranith Kumar Karampuri"
>> To: "Gluster Devel"
>> Cc: "Joseph Fernandes", "Vijaikumar M"
>> Sent: Tuesday, May 20, 2014 3:41:11 PM
>> Subject: Re: Spurious failures because of nfs and snapshots
>>
>> hi,
>> Please resubmit the patches on top of http://review.gluster.com/#/c/7753
>> to prevent frequent regression failures.
>>
>> Pranith
>> - Original Message -
>> > From: "Vijaikumar M"
>> > To: "Pranith Kumar Karampuri"
>> > Cc: "Joseph Fernandes", "Gluster Devel"
>> > Sent: Monday, May 19, 2014 2:40:47 PM
>> > Subject: Re: Spurious failures because of nfs and snapshots
>> >
>> > Brick disconnected with ping-timeout:
>> >
>> > Here is the log message
>> > [2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbin/glusterfsd version 3.5qa2 (args: /build/install/sbin/glusterfsd -s build.gluster.org --volfile-id /snaps/patchy_snap1/3f2ae3fbb4a74587b1a91013f07d327f.build.gluster.org.var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3 -p /var/lib/glusterd/snaps/patchy_snap1/3f2ae3fbb4a74587b1a91013f07d327f/run/build.gluster.org-var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.pid -S /var/run/51fe50a6faf0aae006c815da946caf3a.socket --brick-name /var/run/gluster/snaps/3f2ae3fbb4a74587b1a91013f07d327f/brick3 -l /build/install/var/log/glusterfs/bricks/var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.log --xlator-option *-posix.glusterd-uuid=494ef3cd-15fc-4c8c-8751-2d441ba7b4b0 --brick-port 49164 --xlator-option 3f2ae3fbb4a74587b1a91013f07d327f-server.listen-port=49164)
>> > 2 [2014-05-19 04:29:38.141118] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-glusterfs: defaulting ping-timeout to 30secs
>> > 3 [2014-05-19 04:30:09.139521] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-glusterfs: server 10.3.129.13:24007 has not responded in the last 30 seconds, disconnecting.
>> >
>> > Patch 'http://review.gluster.org/#/c/7753/' will fix the problem, where
>> > the ping-timer will be disabled by default for all rpc connections except
>> > glusterd-glusterd (set to 30sec) and client-glusterd (set to 42sec).
>> >
>> > Thanks,
>> > Vijay
>> >
>> > On Monday 19 May 2014 11:56 AM, Pranith Kumar Karampuri wrote:
>> > > The latest build failure also has the same issue:
>> > > Download it from here:
>> > >
Re: [Gluster-devel] spurios failures in tests/encryption/crypt.t
- Original Message -
> From: "Anand Avati"
> To: "Pranith Kumar Karampuri"
> Cc: "Edward Shishkin", "Gluster Devel"
> Sent: Wednesday, May 21, 2014 10:53:54 AM
> Subject: Re: [Gluster-devel] spurios failures in tests/encryption/crypt.t
>
> There are a few suspicious things going on here..
>
> On Tue, May 20, 2014 at 10:07 PM, Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
>
> > > > hi,
> > > > crypt.t is failing regression builds once in a while and most of
> > > > the times it is because of the failures just after the remount in the
> > > > script.
> > > >
> > > > TEST rm -f $M0/testfile-symlink
> > > > TEST rm -f $M0/testfile-link
> > > >
> > > > Both of these are failing with ENOTCONN. I got a chance to look at
> > > > the logs. According to the brick logs, this is what I see:
> > > > [2014-05-17 05:43:43.363979] E [posix.c:2272:posix_open]
> > > > 0-patchy-posix: open on /d/backends/patchy1/testfile-symlink:
> > > > Transport endpoint is not connected
>
> posix_open() happening on a symlink? This should NEVER happen. glusterfs
> itself should NEVER EVER be triggering symlink resolution on the server. In
> this case, for whatever reason an open() is attempted on a symlink, and it
> is getting followed back onto gluster's own mount point (the test case is
> creating an absolute link).
>
> So first find out: who is triggering fop->open() on a symlink. Fix the
> caller.
>
> Next: add a check in posix_open() to fail with ELOOP or EINVAL if the inode
> is a symlink.

I think I understood what you are saying. An open call for a symlink on the fuse mount led to another open call for the target on the same fuse mount, which led to a deadlock :). Is that why we disallow opens on symlinks in gluster?

Pranith

> > > > This is the very first time I saw posix failing with ENOTCONN. Do we
> > > > have these bricks on some other network mounts? I wonder why it fails
> > > > with ENOTCONN.
> > > >
> > > > I also see that it happens right after a call_bail on the mount.
> > > >
> > > > Pranith
> > >
> > > Hello.
> > > OK, I'll try to reproduce it.
> >
> > I tried re-creating the issue on my fedora VM and it happened just now.
> > When this issue happens I am not able to attach the process to gdb. From
> > /proc/ the threads are in the following state for a while now:
> > root@pranith-vm1 - /proc/4053/task
> > 10:20:50 :) ⚡ for i in `ls`; do cat $i/stack; echo "-"; done
> > [] ep_poll+0x21e/0x330
> > [] SyS_epoll_wait+0xd5/0x100
> > [] system_call_fastpath+0x16/0x1b
> > [] 0x
> > -
> > [] hrtimer_nanosleep+0xad/0x170
> > [] SyS_nanosleep+0x66/0x80
> > [] system_call_fastpath+0x16/0x1b
> > [] 0x
> > -
> > [] do_sigtimedwait+0x161/0x200
> > [] SYSC_rt_sigtimedwait+0x76/0xd0
> > [] SyS_rt_sigtimedwait+0xe/0x10
> > [] system_call_fastpath+0x16/0x1b
> > [] 0x
> > -
> > [] futex_wait_queue_me+0xda/0x140
> > [] futex_wait+0x17e/0x290
> > [] do_futex+0xe6/0xc30
> > [] SyS_futex+0x71/0x150
> > [] system_call_fastpath+0x16/0x1b
> > [] 0x
> > -
> > [] futex_wait_queue_me+0xda/0x140
> > [] futex_wait+0x17e/0x290
> > [] do_futex+0xe6/0xc30
> > [] SyS_futex+0x71/0x150
> > [] system_call_fastpath+0x16/0x1b
> > [] 0x
> > -
> > [] futex_wait_queue_me+0xda/0x140
> > [] futex_wait+0x17e/0x290
> > [] do_futex+0xe6/0xc30
> > [] SyS_futex+0x71/0x150
> > [] system_call_fastpath+0x16/0x1b
> > [] 0x
> > -
> > [] wait_answer_interruptible+0x89/0xd0 [fuse] <<--- This is the important thing I think
> > [] __fuse_request_send+0x232/0x290 [fuse]
> > [] fuse_request_send+0x12/0x20 [fuse]
> > [] fuse_do_open+0xca/0x170 [fuse]
> > [] fuse_open_common+0x56/0x80 [fuse]
> > [] fuse_open+0x10/0x20 [fuse]
> > [] do_dentry_open+0x1eb/0x280
> > [] finish_open+0x31/0x40
> > [] do_last+0x4ca/0xe00
> > [] path_openat+0x420/0x690
> > [] do_filp_open+0x3a/0x90
> > [] do_sys_open+0x12e/0x210
> > [] SyS_open+0x1e/0x20
> > [] system_call_fastpath+0x16/0x1b
> > [] 0x
> > -
> > [] futex_wait_queue_me+0xda/0x140
> > [] futex_wait+0x17e/0x290
> > [] do_futex+0xe6/0xc30
> > [] SyS_futex+0x71/0x150
> > [] system_call_fastpath+0x16/0x1b
> > [] 0x
> > -
> > [] futex_wait_queue_me+0xda/0x140
> > [] futex_wait+0x17e/0x290
> > [] do_futex+0xe6/0xc30
> > [] SyS_futex+0x71/0x150
> > [] system_call_fastpath+0x16/0x1b
> > [] 0x
> > -
> > [] hrtimer_nanosleep+0xad/0x170
> > [] SyS_nanosleep+0x66/0x80
> > [] system_call_fastpath+0x16/0x1b
> > [] 0x
> > -
> >
> > I don't know how to debug further but it seems like the s
Re: [Gluster-devel] spurios failures in tests/encryption/crypt.t
There are a few suspicious things going on here..

On Tue, May 20, 2014 at 10:07 PM, Pranith Kumar Karampuri <pkara...@redhat.com> wrote:

> > > hi,
> > > crypt.t is failing regression builds once in a while and most of
> > > the times it is because of the failures just after the remount in the
> > > script.
> > >
> > > TEST rm -f $M0/testfile-symlink
> > > TEST rm -f $M0/testfile-link
> > >
> > > Both of these are failing with ENOTCONN. I got a chance to look at
> > > the logs. According to the brick logs, this is what I see:
> > > [2014-05-17 05:43:43.363979] E [posix.c:2272:posix_open]
> > > 0-patchy-posix: open on /d/backends/patchy1/testfile-symlink:
> > > Transport endpoint is not connected

posix_open() happening on a symlink? This should NEVER happen. glusterfs itself should NEVER EVER be triggering symlink resolution on the server. In this case, for whatever reason an open() is attempted on a symlink, and it is getting followed back onto gluster's own mount point (the test case is creating an absolute link).

So first find out: who is triggering fop->open() on a symlink. Fix the caller.

Next: add a check in posix_open() to fail with ELOOP or EINVAL if the inode is a symlink.

> > > This is the very first time I saw posix failing with ENOTCONN. Do we
> > > have these bricks on some other network mounts? I wonder why it fails
> > > with ENOTCONN.
> > >
> > > I also see that it happens right after a call_bail on the mount.
> > >
> > > Pranith
> >
> > Hello.
> > OK, I'll try to reproduce it.
>
> I tried re-creating the issue on my fedora VM and it happened just now.
> When this issue happens I am not able to attach the process to gdb. From
> /proc/ the threads are in the following state for a while now:
> root@pranith-vm1 - /proc/4053/task
> 10:20:50 :) ⚡ for i in `ls`; do cat $i/stack; echo "-"; done
> [] ep_poll+0x21e/0x330
> [] SyS_epoll_wait+0xd5/0x100
> [] system_call_fastpath+0x16/0x1b
> [] 0x
> -
> [] hrtimer_nanosleep+0xad/0x170
> [] SyS_nanosleep+0x66/0x80
> [] system_call_fastpath+0x16/0x1b
> [] 0x
> -
> [] do_sigtimedwait+0x161/0x200
> [] SYSC_rt_sigtimedwait+0x76/0xd0
> [] SyS_rt_sigtimedwait+0xe/0x10
> [] system_call_fastpath+0x16/0x1b
> [] 0x
> -
> [] futex_wait_queue_me+0xda/0x140
> [] futex_wait+0x17e/0x290
> [] do_futex+0xe6/0xc30
> [] SyS_futex+0x71/0x150
> [] system_call_fastpath+0x16/0x1b
> [] 0x
> -
> [] futex_wait_queue_me+0xda/0x140
> [] futex_wait+0x17e/0x290
> [] do_futex+0xe6/0xc30
> [] SyS_futex+0x71/0x150
> [] system_call_fastpath+0x16/0x1b
> [] 0x
> -
> [] futex_wait_queue_me+0xda/0x140
> [] futex_wait+0x17e/0x290
> [] do_futex+0xe6/0xc30
> [] SyS_futex+0x71/0x150
> [] system_call_fastpath+0x16/0x1b
> [] 0x
> -
> [] wait_answer_interruptible+0x89/0xd0 [fuse] <<--- This is the important thing I think
> [] __fuse_request_send+0x232/0x290 [fuse]
> [] fuse_request_send+0x12/0x20 [fuse]
> [] fuse_do_open+0xca/0x170 [fuse]
> [] fuse_open_common+0x56/0x80 [fuse]
> [] fuse_open+0x10/0x20 [fuse]
> [] do_dentry_open+0x1eb/0x280
> [] finish_open+0x31/0x40
> [] do_last+0x4ca/0xe00
> [] path_openat+0x420/0x690
> [] do_filp_open+0x3a/0x90
> [] do_sys_open+0x12e/0x210
> [] SyS_open+0x1e/0x20
> [] system_call_fastpath+0x16/0x1b
> [] 0x
> -
> [] futex_wait_queue_me+0xda/0x140
> [] futex_wait+0x17e/0x290
> [] do_futex+0xe6/0xc30
> [] SyS_futex+0x71/0x150
> [] system_call_fastpath+0x16/0x1b
> [] 0x
> -
> [] futex_wait_queue_me+0xda/0x140
> [] futex_wait+0x17e/0x290
> [] do_futex+0xe6/0xc30
> [] SyS_futex+0x71/0x150
> [] system_call_fastpath+0x16/0x1b
> [] 0x
> -
> [] hrtimer_nanosleep+0xad/0x170
> [] SyS_nanosleep+0x66/0x80
> [] system_call_fastpath+0x16/0x1b
> [] 0x
> -
>
> I don't know how to debug further but it seems like the system call hung

The threads in the above process are of glusterfsd, and glusterfsd is ending up making an open() attempt on a FUSE (its own) mount. Pretty obvious that it is deadlocking. Find the open()er on the symlink and you have your fix.

Avati

_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurios failures in tests/encryption/crypt.t
- Original Message -
> From: "Edward Shishkin"
> To: "Pranith Kumar Karampuri"
> Cc: "Vijay Bellur", "Anand Avati", "Gluster Devel"
> Sent: Monday, May 19, 2014 6:05:02 PM
> Subject: Re: spurios failures in tests/encryption/crypt.t
>
> On Sat, 17 May 2014 04:28:45 -0400 (EDT)
> Pranith Kumar Karampuri wrote:
>
> > hi,
> > crypt.t is failing regression builds once in a while and most of
> > the times it is because of the failures just after the remount in the
> > script.
> >
> > TEST rm -f $M0/testfile-symlink
> > TEST rm -f $M0/testfile-link
> >
> > Both of these are failing with ENOTCONN. I got a chance to look at
> > the logs. According to the brick logs, this is what I see:
> > [2014-05-17 05:43:43.363979] E [posix.c:2272:posix_open]
> > 0-patchy-posix: open on /d/backends/patchy1/testfile-symlink:
> > Transport endpoint is not connected
> >
> > This is the very first time I saw posix failing with ENOTCONN. Do we
> > have these bricks on some other network mounts? I wonder why it fails
> > with ENOTCONN.
> >
> > I also see that it happens right after a call_bail on the mount.
> >
> > Pranith
>
> Hello.
> OK, I'll try to reproduce it.

I tried re-creating the issue on my fedora VM and it happened just now. When this issue happens I am not able to attach the process to gdb. From /proc/ the threads are in the following state for a while now:

root@pranith-vm1 - /proc/4053/task
10:20:50 :) ⚡ for i in `ls`; do cat $i/stack; echo "-"; done
[] ep_poll+0x21e/0x330
[] SyS_epoll_wait+0xd5/0x100
[] system_call_fastpath+0x16/0x1b
[] 0x
-
[] hrtimer_nanosleep+0xad/0x170
[] SyS_nanosleep+0x66/0x80
[] system_call_fastpath+0x16/0x1b
[] 0x
-
[] do_sigtimedwait+0x161/0x200
[] SYSC_rt_sigtimedwait+0x76/0xd0
[] SyS_rt_sigtimedwait+0xe/0x10
[] system_call_fastpath+0x16/0x1b
[] 0x
-
[] futex_wait_queue_me+0xda/0x140
[] futex_wait+0x17e/0x290
[] do_futex+0xe6/0xc30
[] SyS_futex+0x71/0x150
[] system_call_fastpath+0x16/0x1b
[] 0x
-
[] futex_wait_queue_me+0xda/0x140
[] futex_wait+0x17e/0x290
[] do_futex+0xe6/0xc30
[] SyS_futex+0x71/0x150
[] system_call_fastpath+0x16/0x1b
[] 0x
-
[] futex_wait_queue_me+0xda/0x140
[] futex_wait+0x17e/0x290
[] do_futex+0xe6/0xc30
[] SyS_futex+0x71/0x150
[] system_call_fastpath+0x16/0x1b
[] 0x
-
[] wait_answer_interruptible+0x89/0xd0 [fuse] <<--- This is the important thing I think
[] __fuse_request_send+0x232/0x290 [fuse]
[] fuse_request_send+0x12/0x20 [fuse]
[] fuse_do_open+0xca/0x170 [fuse]
[] fuse_open_common+0x56/0x80 [fuse]
[] fuse_open+0x10/0x20 [fuse]
[] do_dentry_open+0x1eb/0x280
[] finish_open+0x31/0x40
[] do_last+0x4ca/0xe00
[] path_openat+0x420/0x690
[] do_filp_open+0x3a/0x90
[] do_sys_open+0x12e/0x210
[] SyS_open+0x1e/0x20
[] system_call_fastpath+0x16/0x1b
[] 0x
-
[] futex_wait_queue_me+0xda/0x140
[] futex_wait+0x17e/0x290
[] do_futex+0xe6/0xc30
[] SyS_futex+0x71/0x150
[] system_call_fastpath+0x16/0x1b
[] 0x
-
[] futex_wait_queue_me+0xda/0x140
[] futex_wait+0x17e/0x290
[] do_futex+0xe6/0xc30
[] SyS_futex+0x71/0x150
[] system_call_fastpath+0x16/0x1b
[] 0x
-
[] hrtimer_nanosleep+0xad/0x170
[] SyS_nanosleep+0x66/0x80
[] system_call_fastpath+0x16/0x1b
[] 0x
-

I don't know how to debug further but it seems like the system call hung.

CC Brian Foster.

Pranith

> Thanks for the report!
> Edward.
Re: [Gluster-devel] Need sensible default value for detecting unclean client disconnects
Niels,

This is a good addition. While gluster clients do a reasonably good job at detecting dead/hung servers with ping-timeout, the server-side detection has been rather weak. TCP_KEEPALIVE has helped to some extent, for cases where an idling client (which holds a lock) goes dead. However, if an active client with pending data in the server's socket buffer dies, we have been subject to waiting for the long TCP retransmission sequence to finish and give up.

The way I see it, this option is complementary to TCP_KEEPALIVE (keepalive works for idle and only idle connections, user_timeout works only when there are pending acknowledgements, thus covering the full spectrum). To that end, it might make sense to present the admin a single timeout configuration value rather than two. It would be very frustrating for the admin to configure one of them to, say, 30 seconds, and then find that the server does not clean up after 30 seconds of a hung client only because the connection was idle (or not idle). Configuring a second timeout for the other case can be very unintuitive.

In fact, I would suggest having a single network timeout configuration which gets applied to all three: ping-timeout on the client, user_timeout on the server, and keepalive on both. I think that is what a user would be expecting anyway. Each is for a slightly different technical situation, but all are just internal details as far as a user is concerned.

Thoughts?

On Tue, May 20, 2014 at 4:30 AM, Niels de Vos wrote:
> Hi all,
>
> the last few days I've been looking at a problem [1] where a client
> locks a file over a FUSE-mount, and a 2nd client tries to grab that lock
> too. It is expected that the 2nd client gets blocked until the 1st
> client releases the lock. This all works as long as the 1st client
> cleanly releases the lock.
>
> Whenever the 1st client crashes (like a kernel panic) or the network is
> split and the 1st client is unreachable, the 2nd client may not get the
> lock until the bricks detect that the connection to the 1st client is
> dead. If there are pending replies, the bricks may need 15-20 minutes
> until the re-transmissions of the replies have timed out.
>
> The current default of 15-20 minutes is quite long for a fail-over
> scenario. Relatively recently [2], the Linux kernel got
> a TCP_USER_TIMEOUT socket option (similar to TCP_KEEPALIVE). This option
> can be used to configure a per-socket timeout, instead of a system-wide
> configuration through the net.ipv4.tcp_retries2 sysctl.
>
> The default network.ping-timeout is set to 42 seconds. I'd like to
> propose a network.tcp-timeout option that can be set per volume. This
> option should then set TCP_USER_TIMEOUT for the socket, which causes
> re-transmission failures to be fatal after the timeout has passed.
>
> Now the remaining question: what shall be the default timeout in seconds
> for this new network.tcp-timeout option? I'm currently thinking of
> making it high enough (like 5 minutes) to prevent false positives.
>
> Thoughts and comments welcome,
> Niels
>
> 1 https://bugzilla.redhat.com/show_bug.cgi?id=1099460
> 2 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=dca43c7
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
Hey,

Seems like even after this fix is merged, the regression tests are failing for the same script. You can check the logs at http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz

Relevant logs:
[2014-05-20 20:17:07.026045] : volume create patchy build.gluster.org:/d/backends/patchy1 build.gluster.org:/d/backends/patchy2 : SUCCESS
[2014-05-20 20:17:08.030673] : volume start patchy : SUCCESS
[2014-05-20 20:17:08.279148] : volume barrier patchy enable : SUCCESS
[2014-05-20 20:17:08.476785] : volume barrier patchy enable : FAILED : Failed to reconfigure barrier.
[2014-05-20 20:17:08.727429] : volume barrier patchy disable : SUCCESS
[2014-05-20 20:17:08.926995] : volume barrier patchy disable : FAILED : Failed to reconfigure barrier.

Pranith

- Original Message -
> From: "Pranith Kumar Karampuri"
> To: "Gluster Devel"
> Cc: "Joseph Fernandes", "Vijaikumar M"
> Sent: Tuesday, May 20, 2014 3:41:11 PM
> Subject: Re: Spurious failures because of nfs and snapshots
>
> hi,
> Please resubmit the patches on top of http://review.gluster.com/#/c/7753
> to prevent frequent regression failures.
>
> Pranith
> - Original Message -
> > From: "Vijaikumar M"
> > To: "Pranith Kumar Karampuri"
> > Cc: "Joseph Fernandes", "Gluster Devel"
> > Sent: Monday, May 19, 2014 2:40:47 PM
> > Subject: Re: Spurious failures because of nfs and snapshots
> >
> > Brick disconnected with ping-timeout:
> >
> > Here is the log message
> > [2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbin/glusterfsd version 3.5qa2 (args: /build/install/sbin/glusterfsd -s build.gluster.org --volfile-id /snaps/patchy_snap1/3f2ae3fbb4a74587b1a91013f07d327f.build.gluster.org.var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3 -p /var/lib/glusterd/snaps/patchy_snap1/3f2ae3fbb4a74587b1a91013f07d327f/run/build.gluster.org-var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.pid -S /var/run/51fe50a6faf0aae006c815da946caf3a.socket --brick-name /var/run/gluster/snaps/3f2ae3fbb4a74587b1a91013f07d327f/brick3 -l /build/install/var/log/glusterfs/bricks/var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.log --xlator-option *-posix.glusterd-uuid=494ef3cd-15fc-4c8c-8751-2d441ba7b4b0 --brick-port 49164 --xlator-option 3f2ae3fbb4a74587b1a91013f07d327f-server.listen-port=49164)
> > 2 [2014-05-19 04:29:38.141118] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-glusterfs: defaulting ping-timeout to 30secs
> > 3 [2014-05-19 04:30:09.139521] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-glusterfs: server 10.3.129.13:24007 has not responded in the last 30 seconds, disconnecting.
> >
> > Patch 'http://review.gluster.org/#/c/7753/' will fix the problem, where
> > the ping-timer will be disabled by default for all rpc connections except
> > glusterd-glusterd (set to 30sec) and client-glusterd (set to 42sec).
> >
> > Thanks,
> > Vijay
> >
> > On Monday 19 May 2014 11:56 AM, Pranith Kumar Karampuri wrote:
> > > The latest build failure also has the same issue:
> > > Download it from here:
> > > http://build.gluster.org:443/logs/glusterfs-logs-20140518%3a22%3a27%3a31.tgz
> > >
> > > Pranith
> > >
> > > - Original Message -
> > >> From: "Vijaikumar M"
> > >> To: "Joseph Fernandes"
> > >> Cc: "Pranith Kumar Karampuri", "Gluster Devel"
> > >> Sent: Monday, 19 May, 2014 11:41:28 AM
> > >> Subject: Re: Spurious failures because of nfs and snapshots
> > >>
> > >> Hi Joseph,
> > >>
> > >> In the log mentioned below, it says ping-time is set to the default
> > >> value of 30sec. I think the issue is different.
> > >> Can you please point me to the logs where you were able to re-create
> > >> the problem.
> > >>
> > >> Thanks,
> > >> Vijay
> > >>
> > >> On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote:
> > >>> hi Vijai, Joseph,
> > >>> In 2 of the last 3 build failures,
> > >>> http://build.gluster.org/job/regression/4479/console,
> > >>> http://build.gluste
Re: [Gluster-devel] Split-brain present and future in afr
> 1. Better protection for split-brain over time.
> 2. Policy based split-brain resolution.
> 3. Provide better availability with client quorum and replica 2.

I would add the following:

(4) Quorum enforcement - any kind - on by default.

(5) Fix the problem of volumes losing quorum because unrelated nodes went down (i.e. implement volume-level quorum).

(6) Better tools for users to resolve split brain themselves.

> For 3, we are planning to introduce arbiter bricks that can be used to
> determine quorum. The arbiter bricks will be dummy bricks that host only
> files that will be updated from multiple clients. This will be achieved by
> bringing about a variable replication count for a configurable class of
> files within a volume.
>
> In the case of a replicated volume with one arbiter brick per replica
> group, certain files that are prone to split-brain will be in 3 bricks
> (2 data bricks + 1 arbiter brick). All other files will be present in the
> regular data bricks. For example, when oVirt VM disks are hosted on a
> replica 2 volume, sanlock is used by oVirt for arbitration. sanlock lease
> files will be written by all clients, and VM disks are written by only
> a single client at any given point of time. In this scenario, we can place
> sanlock lease files on 2 data + 1 arbiter bricks. The VM disk files will
> only be present on the 2 data bricks. Client quorum is now determined by
> looking at 3 bricks instead of 2, and we have better protection when
> network split-brains happen.

Constantly filtering requests to use either N or N+1 bricks is going to be complicated and hard to debug. Every data-structure allocation or loop based on replica count will have to be examined, and many will have to be modified. That's a *lot* of places.

This also overlaps significantly with functionality that can be achieved with data classification (i.e. supporting multiple replica levels within the same volume). What use case requires that it be implemented within AFR instead of more generally and flexibly?
Re: [Gluster-devel] Need sensible default value for detecting unclean client disconnects
On Tue, May 20, 2014 at 01:30:24PM +0200, Niels de Vos wrote:
> Hi all,
>
> the last few days I've been looking at a problem [1] where a client
> locks a file over a FUSE-mount, and a 2nd client tries to grab that lock
> too. It is expected that the 2nd client gets blocked until the 1st
> client releases the lock. This all works as long as the 1st client
> cleanly releases the lock.
>
> Whenever the 1st client crashes (like a kernel panic) or the network is
> split and the 1st client is unreachable, the 2nd client may not get the
> lock until the bricks detect that the connection to the 1st client is
> dead. If there are pending replies, the bricks may need 15-20 minutes
> until the re-transmissions of the replies have timed out.
>
> The current default of 15-20 minutes is quite long for a fail-over
> scenario. Relatively recently [2], the Linux kernel got
> a TCP_USER_TIMEOUT socket option (similar to TCP_KEEPALIVE). This option
> can be used to configure a per-socket timeout, instead of a system-wide
> configuration through the net.ipv4.tcp_retries2 sysctl.
>
> The default network.ping-timeout is set to 42 seconds. I'd like to
> propose a network.tcp-timeout option that can be set per volume. This
> option should then set TCP_USER_TIMEOUT for the socket, which causes
> re-transmission failures to be fatal after the timeout has passed.
>
> Now the remaining question: what shall be the default timeout in seconds
> for this new network.tcp-timeout option? I'm currently thinking of
> making it high enough (like 5 minutes) to prevent false positives.
>
> Thoughts and comments welcome,
> Niels
>
> 1 https://bugzilla.redhat.com/show_bug.cgi?id=1099460
> 2 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=dca43c7

Posted a patch for review: http://review.gluster.org/7814
[Gluster-devel] Split-brain present and future in afr
hi,

Thanks to Vijay Bellur for helping with the re-write of the draft I sent him :-).

Present:

Split-brains of files happen in afr today due to 2 primary reasons:
1. Split-brains due to network partition, or network split-brains.
2. Split-brains due to servers in a replicated group being offline at different points in time, without self-heal happening in the common period of time when the servers were online. For further discussion, this is referred to as split-brain over time.

To prevent the occurrence of split-brains, we have the following quorum implementations in place:
a> Client quorum - driven by afr (client); writes are allowed when a majority of bricks in a replica group are online. Majority is by default N/2 + 1, where N is the replication factor for files in a volume.
b> Server quorum - driven by glusterd (server); writes are allowed when a majority of peers are online. Majority by default is N/2 + 1, where N is the number of peers in a trusted storage pool.

Both a> and b> primarily safeguard against network split-brains. The protection these quorum implementations offer against split-brain over time scenarios is not very high. Let us consider how replica 3 and replica 2 can be protected against split-brains.

Replica 3: Client quorum is quite effective in this case, as writes are only allowed when at least 2 of the 3 bricks that form a replica group are seen by afr/client. A recent fix for a corner-case race in client quorum (http://review.gluster.org/7600) makes it very robust. This patch is now part of master and release-3.5. We plan to backport it to release-3.4 too.

Replica 2: Majority for client quorum in a deployment with 2 bricks per replica group is 2. Hence availability becomes a problem with replica 2 when either of the bricks is offline. To provide better availability for replica 2, the first brick in a replica set is given higher weight, and quorum is met as long as the first brick is online. If the first brick is offline, then quorum is lost.
Let us consider the following cases with B1 and B2 forming a replicated set:

B1        B2        Quorum
Online    Online    Met
Online    Offline   Met
Offline   Online    Not Met
Offline   Offline   Not Met

Though better availability is provided by client quorum in replica 2 scenarios, it is not optimal, and hence an improvement in behavior seems desirable.

Future:

Our focus in afr going forward will be to solve three problems to provide better protection against split-brains and for resolving them:
1. Better protection for split-brain over time.
2. Policy-based split-brain resolution.
3. Better availability with client quorum and replica 2.

For 1, implementation of outcasting logic will address the problem:
- An outcast is a copy of a file on which writes have been performed only when quorum is met.
- When a brick goes down and comes back up, the self-heal daemon will mark the affected files on the brick that just came back up as outcasts. The outcast marking can be done even before the brick is declared available to regular clients. Once a copy of a file is marked as needing self-heal (or as an outcast), writes from clients will not land on that copy till self-heal is completed and the outcast tag is removed.

For 2, we plan to provide commands that can heal based on user-configurable policies. Examples of policies would be:
- Pick the largest file as the winner for resolving a self-heal.
- Choose brick foo as the winner for resolving split-brains.
- Pick the file with the latest version as the winner (when versioning for files is available).

For 3, we are planning to introduce arbiter bricks that can be used to determine quorum. The arbiter bricks will be dummy bricks that host only files that will be updated from multiple clients. This will be achieved by bringing about a variable replication count for a configurable class of files within a volume.
In the case of a replicated volume with one arbiter brick per replica group, certain files that are prone to split-brain will be on 3 bricks (2 data bricks + 1 arbiter brick). All other files will be present on the regular data bricks. For example, when oVirt VM disks are hosted on a replica 2 volume, sanlock is used by oVirt for arbitration. sanlock lease files will be written by all clients, and VM disks are written by only a single client at any given point of time. In this scenario, we can place the sanlock lease files on 2 data + 1 arbiter bricks. The VM disk files will only be present on the 2 data bricks. Client quorum is now determined by looking at 3 bricks instead of 2, and we have better protection when network split-brains happen. A combination of 1. and 3. does s
[Gluster-devel] Test, pls ignore
Ignore this, just testing mailing list archiving... + Justin
Re: [Gluster-devel] Changes to Regression script
- Original Message -
> From: "Kaushal M"
> To: "Pranith Kumar Karampuri"
> Cc: "Vijay Bellur" , "Gluster Devel" , "gluster-infra"
> Sent: Tuesday, May 20, 2014 4:42:25 PM
> Subject: Re: [Gluster-devel] Changes to Regression script
>
> The build.gluster.org machine had the PDT timezone set. So the timestamps
> should be UTC-7 or UTC-8 depending on daylight savings. It's currently
> UTC-7.
>
> Would having the archive timestamps also in UTC help?

Hey, interesting. I guess we can just do export TZ=UTC in run-tests.sh? prove will print the start time in UTC against each test. Let me send out that patch. That should help narrow the search space when trying to figure out the relevant logs. The archive timestamp is only for making sure the files have unique names, isn't it? I am not sure that would help much.

Pranith.

> > ~kaushal > > > On Mon, May 19, 2014 at 10:32 AM, Pranith Kumar Karampuri < > pkara...@redhat.com> wrote: > > > > > > > - Original Message - > > > From: "Vijay Bellur" > > > To: "Pranith Kumar Karampuri" > > > Cc: "gluster-infra" , > > gluster-devel@gluster.org > > > Sent: Monday, 19 May, 2014 10:03:41 AM > > > Subject: Re: [Gluster-devel] Changes to Regression script > > > > > > On 05/19/2014 09:41 AM, Pranith Kumar Karampuri wrote: > > > > > > > > > > > > - Original Message - > > > >> From: "Vijay Bellur" > > > >> To: "Pranith Kumar Karampuri" > > > >> Cc: "gluster-infra" , > > gluster-devel@gluster.org > > > >> Sent: Saturday, 17 May, 2014 2:52:03 PM > > > >> Subject: Re: [Gluster-devel] Changes to Regression script > > > >> > > > >> On 05/17/2014 02:10 PM, Pranith Kumar Karampuri wrote: > > > >>> > > > >>> > > > >>> - Original Message - > > > From: "Vijay Bellur" > > > To: "gluster-infra" > > > Cc: gluster-devel@gluster.org > > > Sent: Tuesday, May 13, 2014 4:13:02 PM > > > Subject: [Gluster-devel] Changes to Regression script > > > > > > Hi All, > > > > > > Me and Kaushal have effected the following changes on regression.sh > > in > > > 
build.gluster.org: > > > > > > 1. If a regression run results in a core and all tests pass, that > > > particular run will be flagged as a failure. Previously a core that > > > would cause test failures only would get marked as a failure. > > > > > > 2. Cores from a particular test run are now archived and are > > available > > > at /d/archived_builds/. This will also prevent manual intervention > > for > > > managing cores. > > > > > > 3. Logs from failed regression runs are now archived and are > > available > > > at /d/logs/glusterfs-.tgz > > > > > > Do let us know if you have any comments on these changes. > > > >>> > > > >>> This is already proving to be useful :-). I was able to debug one of > > the > > > >>> spurious failures for crypt.t. But the only problem is I was not able > > > >>> copy > > > >>> out the logs. Had to take avati's help to get the log files. Will it > > be > > > >>> possible to give access to these files so that anyone can download > > them? > > > >>> > > > >> > > > >> Good to know! > > > >> > > > >> You can access the .tgz files from: > > > >> > > > >> http://build.gluster.org:443/logs/ > > > > > > > > I was able to access these yesterday. But now it gives 404. > > > > Its working now. But how do we convert the timestamp to logs' timestamp. I > > want to know the time difference. > > > > Pranith. > > > > > > > > > > > > Fixed. > > > > > > -Vijay > > > > > > > > ___ > > Gluster-devel mailing list > > Gluster-devel@gluster.org > > http://supercolony.gluster.org/mailman/listinfo/gluster-devel > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Need sensible default value for detecting unclean client disconnects
Hi all,

the last few days I've been looking at a problem [1] where a client locks a file over a FUSE-mount, and a 2nd client tries to grab that lock too. It is expected that the 2nd client gets blocked until the 1st client releases the lock. This all works as long as the 1st client cleanly releases the lock.

Whenever the 1st client crashes (like a kernel panic) or the network is split and the 1st client is unreachable, the 2nd client may not get the lock until the bricks detect that the connection to the 1st client is dead. If there are pending replies, the bricks may need 15-20 minutes until the re-transmissions of the replies have timed out.

The current default of 15-20 minutes is quite long for a fail-over scenario. Relatively recently [2], the Linux kernel got a TCP_USER_TIMEOUT socket option (similar to TCP_KEEPALIVE). This option can be used to configure a per-socket timeout, instead of a system-wide configuration through the net.ipv4.tcp_retries2 sysctl.

The default network.ping-timeout is set to 42 seconds. I'd like to propose a network.tcp-timeout option that can be set per volume. This option should then set TCP_USER_TIMEOUT for the socket, which causes re-transmission failures to be fatal after the timeout has passed.

Now the remaining question: what shall be the default timeout in seconds for this new network.tcp-timeout option? I'm currently thinking of making it high enough (like 5 minutes) to prevent false positives.

Thoughts and comments welcome,
Niels

1 https://bugzilla.redhat.com/show_bug.cgi?id=1099460
2 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=dca43c7
Re: [Gluster-devel] Changes to Regression script
The build.gluster.org machine had the PDT timezone set. So the timestamps should be UTC-7 or UTC-8 depending on daylight savings. It's currently UTC-7. Would having the archive timestamps also in UTC help? ~kaushal On Mon, May 19, 2014 at 10:32 AM, Pranith Kumar Karampuri < pkara...@redhat.com> wrote: > > > - Original Message - > > From: "Vijay Bellur" > > To: "Pranith Kumar Karampuri" > > Cc: "gluster-infra" , > gluster-devel@gluster.org > > Sent: Monday, 19 May, 2014 10:03:41 AM > > Subject: Re: [Gluster-devel] Changes to Regression script > > > > On 05/19/2014 09:41 AM, Pranith Kumar Karampuri wrote: > > > > > > > > > - Original Message - > > >> From: "Vijay Bellur" > > >> To: "Pranith Kumar Karampuri" > > >> Cc: "gluster-infra" , > gluster-devel@gluster.org > > >> Sent: Saturday, 17 May, 2014 2:52:03 PM > > >> Subject: Re: [Gluster-devel] Changes to Regression script > > >> > > >> On 05/17/2014 02:10 PM, Pranith Kumar Karampuri wrote: > > >>> > > >>> > > >>> - Original Message - > > From: "Vijay Bellur" > > To: "gluster-infra" > > Cc: gluster-devel@gluster.org > > Sent: Tuesday, May 13, 2014 4:13:02 PM > > Subject: [Gluster-devel] Changes to Regression script > > > > Hi All, > > > > Me and Kaushal have effected the following changes on regression.sh > in > > build.gluster.org: > > > > 1. If a regression run results in a core and all tests pass, that > > particular run will be flagged as a failure. Previously a core that > > would cause test failures only would get marked as a failure. > > > > 2. Cores from a particular test run are now archived and are > available > > at /d/archived_builds/. This will also prevent manual intervention > for > > managing cores. > > > > 3. Logs from failed regression runs are now archived and are > available > > at /d/logs/glusterfs-.tgz > > > > Do let us know if you have any comments on these changes. > > >>> > > >>> This is already proving to be useful :-). I was able to debug one of > the > > >>> spurious failures for crypt.t. 
But the only problem is I was not able > > >>> copy > > >>> out the logs. Had to take avati's help to get the log files. Will it > be > > >>> possible to give access to these files so that anyone can download > them? > > >>> > > >> > > >> Good to know! > > >> > > >> You can access the .tgz files from: > > >> > > >> http://build.gluster.org:443/logs/ > > > > > > I was able to access these yesterday. But now it gives 404. > > Its working now. But how do we convert the timestamp to logs' timestamp. I > want to know the time difference. > > Pranith. > > > > > > > > Fixed. > > > > -Vijay > > > > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://supercolony.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
hi,

Please resubmit the patches on top of http://review.gluster.com/#/c/7753 to prevent frequent regression failures.

Pranith

- Original Message -
> From: "Vijaikumar M"
> To: "Pranith Kumar Karampuri"
> Cc: "Joseph Fernandes" , "Gluster Devel"
> Sent: Monday, May 19, 2014 2:40:47 PM
> Subject: Re: Spurious failures because of nfs and snapshots
>
> Brick disconnected with ping-timeout:
>
> Here is the log message:
>
> [2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbin/glusterfsd version 3.5qa2 (args: /build/install/sbin/glusterfsd -s build.gluster.org --volfile-id /snaps/patchy_snap1/3f2ae3fbb4a74587b1a91013f07d327f.build.gluster.org.var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3 -p /var/lib/glusterd/snaps/patchy_snap1/3f2ae3fbb4a74587b1a91013f07d327f/run/build.gluster.org-var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.pid -S /var/run/51fe50a6faf0aae006c815da946caf3a.socket --brick-name /var/run/gluster/snaps/3f2ae3fbb4a74587b1a91013f07d327f/brick3 -l /build/install/var/log/glusterfs/bricks/var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.log --xlator-option *-posix.glusterd-uuid=494ef3cd-15fc-4c8c-8751-2d441ba7b4b0 --brick-port 49164 --xlator-option 3f2ae3fbb4a74587b1a91013f07d327f-server.listen-port=49164)
> [2014-05-19 04:29:38.141118] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-glusterfs: defaulting ping-timeout to 30secs
> [2014-05-19 04:30:09.139521] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-glusterfs: server 10.3.129.13:24007 has not responded in the last 30 seconds, disconnecting.
>
> Patch 'http://review.gluster.org/#/c/7753/' will fix the problem: the ping-timer will be disabled by default for all rpc connections except glusterd-glusterd (set to 30sec) and client-glusterd (set to 42sec).
> > > Thanks, > Vijay > > > On Monday 19 May 2014 11:56 AM, Pranith Kumar Karampuri wrote: > > The latest build failure also has the same issue: > > Download it from here: > > http://build.gluster.org:443/logs/glusterfs-logs-20140518%3a22%3a27%3a31.tgz > > > > Pranith > > > > - Original Message - > >> From: "Vijaikumar M" > >> To: "Joseph Fernandes" > >> Cc: "Pranith Kumar Karampuri" , "Gluster Devel" > >> > >> Sent: Monday, 19 May, 2014 11:41:28 AM > >> Subject: Re: Spurious failures because of nfs and snapshots > >> > >> Hi Joseph, > >> > >> In the log mentioned below, it say ping-time is set to default value > >> 30sec.I think issue is different. > >> Can you please point me to the logs where you where able to re-create > >> the problem. > >> > >> Thanks, > >> Vijay > >> > >> > >> > >> On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote: > >>> hi Vijai, Joseph, > >>> In 2 of the last 3 build failures, > >>> http://build.gluster.org/job/regression/4479/console, > >>> http://build.gluster.org/job/regression/4478/console this > >>> test(tests/bugs/bug-1090042.t) failed. Do you guys think it is > >>> better > >>> to revert this test until the fix is available? Please send a patch > >>> to revert the test case if you guys feel so. You can re-submit it > >>> along with the fix to the bug mentioned by Joseph. > >>> > >>> Pranith. 
> >>> > >>> - Original Message - > From: "Joseph Fernandes" > To: "Pranith Kumar Karampuri" > Cc: "Gluster Devel" > Sent: Friday, 16 May, 2014 5:13:57 PM > Subject: Re: Spurious failures because of nfs and snapshots > > > Hi All, > > tests/bugs/bug-1090042.t : > > I was able to reproduce the issue i.e when this test is done in a loop > > for i in {1..135} ; do ./bugs/bug-1090042.t > > When checked the logs > [2014-05-16 10:49:49.003978] I [rpc-clnt.c:973:rpc_clnt_connection_init] > 0-management: setting frame-timeout to 600 > [2014-05-16 10:49:49.004035] I [rpc-clnt.c:988:rpc_clnt_connection_init] > 0-management: defaulting ping-timeout to 30secs > [2014-05-16 10:49:49.004303] I [rpc-clnt.c:973:rpc_clnt_connection_init] > 0-management: setting frame-timeout to 600 > [2014-05-16 10:49:49.004340] I [rpc-clnt.c:988:rpc_clnt_connection_init] > 0-management: defaulting ping-timeout to 30secs > > The issue is with ping-timeout and is tracked under the bug > > https://bugzilla.redhat.com/show_bug.cgi?id=1096729 > > > The workaround is mentioned in > https://bugzilla.redhat.com/show_bug.cgi?id=1096729#c8 > > > Regards, > Joe > > - Original Message - > From: "Pranith Kumar Karampuri" > To: "Gluster Devel" > Cc: "Joseph Fernandes" > Sent: Friday, May 16, 2014 6:19:54 AM > Subject: Spurious failures because of nfs and snapshots > > hi, > In
Re: [Gluster-devel] Regression tests: Should we test non-XFS too?
On 05/19/2014 06:56 AM, Dan Mons wrote:
> On 15 May 2014 14:35, Ric Wheeler wrote:
>> it is up to those developers and users to test their preferred combination.
>
> Not sure if this was quoting me or someone else. BtrFS is in-tree for most distros these days, and RHEL is putting it in as a "technology preview" in 7, which likely means it'll be supported in a point release down the road somewhere.
>
> My question was merely if that's going to be a bigger emphasis for Gluster.org folks to test into the future, or if XFS is going to remain the default/recommended for a lot longer yet.
>
> If the answer is "it depends on our customers' needs", then put me down as one who needs something better than XFS. I'll happily put in the hard yards to test BtrFS with GlusterFS, but at the same time I'm keen to know if that's a wise use of my time or a complete waste of my time if I'm deviating too far from what Red Hat/Gluster.org is planning on blessing in the future.

From a gluster.org perspective, btrfs is certainly very interesting. Integrating with btrfs and exposing its capabilities like bitrot, snapshots etc. through glusterfs is on the cards. There have been a few reports of using glusterfs over btrfs in the community. I would definitely be interested in hearing more feedback and addressing issues in this combination by collaborating with the btrfs community.

Regards,
Vijay