Re: [Gluster-devel] Regarding doing away with refkeeper in locks xlator
On 06/04/2014 11:37 AM, Krutika Dhananjay wrote:
> Hi,
>
> Recently there was a crash in the locks translator (BZ 1103347, BZ 1097102) with the following backtrace:
>
> (gdb) bt
> #0  uuid_unpack (in=0x8 <Address 0x8 out of bounds>, uu=0x7fffea6c6a60) at ../../contrib/uuid/unpack.c:44
> #1  0x7feeba9e19d6 in uuid_unparse_x (uu=<value optimized out>, out=0x2350fc0 "081bbc7a-7551-44ac-85c7-aad5e2633db9", fmt=0x7feebaa08e00 "%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x") at ../../contrib/uuid/unparse.c:55
> #2  0x7feeba9be837 in uuid_utoa (uuid=0x8 <Address 0x8 out of bounds>) at common-utils.c:2138
> #3  0x7feeb06e8a58 in pl_inodelk_log_cleanup (this=0x230d910, ctx=0x7fee700f0c60) at inodelk.c:396
> #4  pl_inodelk_client_cleanup (this=0x230d910, ctx=0x7fee700f0c60) at inodelk.c:428
> #5  0x7feeb06ddf3a in pl_client_disconnect_cbk (this=0x230d910, client=<value optimized out>) at posix.c:2550
> #6  0x7feeba9fa2dd in gf_client_disconnect (client=0x27724a0) at client_t.c:368
> #7  0x7feeab77ed48 in server_connection_cleanup (this=0x2316390, client=0x27724a0, flags=<value optimized out>) at server-helpers.c:354
> #8  0x7feeab77ae2c in server_rpc_notify (rpc=<value optimized out>, xl=0x2316390, event=<value optimized out>, data=0x2bf51c0) at server.c:527
> #9  0x7feeba775155 in rpcsvc_handle_disconnect (svc=0x2325980, trans=0x2bf51c0) at rpcsvc.c:720
> #10 0x7feeba776c30 in rpcsvc_notify (trans=0x2bf51c0, mydata=<value optimized out>, event=<value optimized out>, data=0x2bf51c0) at rpcsvc.c:758
> #11 0x7feeba778638 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:512
> #12 0x7feeb115e971 in socket_event_poll_err (fd=<value optimized out>, idx=<value optimized out>, data=0x2bf51c0, poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:1071
> #13 socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x2bf51c0, poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:2240
> #14 0x7feeba9fc6a7 in event_dispatch_epoll_handler (event_pool=0x22e2d00) at event-epoll.c:384
> #15 event_dispatch_epoll (event_pool=0x22e2d00) at event-epoll.c:445
> #16 0x00407e93 in main (argc=19, argv=0x7fffea6c7f88) at glusterfsd.c:2023
>
> (gdb) f 4
> #4  pl_inodelk_client_cleanup (this=0x230d910, ctx=0x7fee700f0c60) at inodelk.c:428
> 428        pl_inodelk_log_cleanup (l);
> (gdb) p l->pl_inode->refkeeper
> $1 = (inode_t *) 0x0
> (gdb)
>
> pl_inode->refkeeper was found to be NULL even when there were some blocked inodelks in a certain domain of the inode, which, when dereferenced by the epoll thread in the cleanup codepath, led to a crash.
>
> On inspecting the code (for want of a consistent reproducer), three things were found:
>
> 1. The function where the crash happens (pl_inodelk_log_cleanup()) makes an attempt to resolve the inode to a path, as can be seen below. But the way inode_path() itself works is to first construct the path based on the given inode's ancestry and place it in the buffer provided; and if all else fails, the gfid of the inode is placed there in a certain format ("<gfid:%s>"). This eliminates the need for the statements on lines 4 through 7 below, thereby preventing the dereference of pl_inode->refkeeper.
> Now, although this change prevents the crash altogether, it still does not fix the race that led to pl_inode->refkeeper becoming NULL, and it comes at the cost of printing "(null)" in the log message on line 9 every time pl_inode->refkeeper is found to be NULL, rendering the logged messages somewhat useless.
>
>  0 pl_inode = lock->pl_inode;
>  1
>  2 inode_path (pl_inode->refkeeper, NULL, &path);
>  3
>  4 if (path)
>  5         file = path;
>  6 else
>  7         file = uuid_utoa (pl_inode->refkeeper->gfid);
>  8
>  9 gf_log (THIS->name, GF_LOG_WARNING,
> 10         "releasing lock on %s held by "
> 11         "{client=%p, pid=%"PRId64" lk-owner=%s}",
> 12         file, lock->client, (uint64_t) lock->client_pid,
> 13         lkowner_utoa (lock->owner));

I think this logging code is from the days when the gfid handle concept was not there, so inode_path() wasn't returning "<gfid:...>" in cases where the path is not present in the dentries.
I believe the else block can be deleted safely now.

Pranith

> 2. There is at least one codepath found that can lead to this crash:
> Imagine an inode on which an inodelk operation is attempted by a client and is successfully granted too. Now, between the time the lock was granted and pl_update_refkeeper() was called by this thread, the client could send a DISCONNECT event, causing the cleanup codepath to be executed, where the epoll thread crashes on dereferencing pl_inode->refkeeper, which is STILL NULL at this point.
>
> Besides, there are still places in the locks xlator where the refkeeper is NOT updated whenever the lists are modified - for instance in the cleanup codepath from a
Re: [Gluster-devel] All builds are failing with BUILD ERROR
On Wed, Jun 04, 2014 at 11:45:17AM +0530, Kaleb KEITHLEY wrote:
> And since doing this, regression runs seem to be proceeding without issues.

I remember that this used to be an issue when someone cancelled/stopped a running regression test through Jenkins. Maybe that was done? I don't know if anyone ever looked into solving that particular issue.

Niels

> On 06/04/2014 09:59 AM, Kaleb KEITHLEY wrote:
>> On 06/03/2014 04:42 PM, Pranith Kumar Karampuri wrote:
>>> Guys, it's failing again with the same error:
>>>
>>>   Please proceed with configuring, compiling, and installing.
>>>   rm: cannot remove `/build/install/var/run/gluster/patchy': Device or resource busy
>>>   + RET=1
>>>   + '[' 1 '!=' 0 ']'
>>>   + VERDICT='BUILD FAILURE'
>>>
>>> Has someone changed the way builds are cleaned up?
>>
>> Recent regressions are now failing with ...
>>
>>   Running automake...
>>   Running autogen.sh in argp-standalone ...
>>   Please proceed with configuring, compiling, and installing.
>>   configure: error: source directory already configured; run "make distclean" there first
>>   + RET=1
>>   + '[' 1 '!=' 0 ']'
>>   + VERDICT='BUILD FAILURE'
>>
>> I looked and found that /var/lib/jenkins/jobs/regression/workspace contained what looked like a _previous_ successful regression build. I manually cleaned the directory (actually, moved it and created a new workspace dir) and now a regression is running.
>>
>> What's going on? !!!
>>
>> --
>> Kaleb

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
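The "Device or resource busy" failure typically means something (for example, a leftover gluster mount from a cancelled regression run) is still mounted beneath the workspace, so rm cannot remove the directory. A defensive cleanup step along these lines might avoid it; this is only a sketch, and the workspace path is hypothetical:

```shell
#!/bin/sh
# Sketch of a defensive Jenkins cleanup step (paths hypothetical).
# Unmount anything still mounted at or below the workspace before
# deleting it, so "rm -rf" cannot hit "Device or resource busy".

cleanup_workspace() {
    dir="$1"
    [ -d "$dir" ] || return 0
    # List mounts at or below the workspace, deepest first, and
    # lazy-unmount each before removing the tree.
    awk -v d="$dir" '$2 ~ "^"d {print $2}' /proc/mounts | sort -r |
    while read -r m; do
        umount -l "$m" 2>/dev/null || true
    done
    rm -rf "$dir"
}

cleanup_workspace "${1:-/tmp/demo-workspace}"
```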
Re: [Gluster-devel] autodelete in snapshots
On Wednesday 04 June 2014 11:23 AM, Rajesh Joseph wrote:

----- Original Message -----
From: M S Vishwanath Bhat msvb...@gmail.com
To: Rajesh Joseph rjos...@redhat.com
Cc: Vijay Bellur vbel...@redhat.com, Seema Naik sen...@redhat.com, Gluster Devel gluster-devel@gluster.org
Sent: Tuesday, June 3, 2014 5:55:27 PM
Subject: Re: [Gluster-devel] autodelete in snapshots

On 3 June 2014 15:21, Rajesh Joseph rjos...@redhat.com wrote:

----- Original Message -----
From: M S Vishwanath Bhat msvb...@gmail.com
To: Vijay Bellur vbel...@redhat.com
Cc: Seema Naik sen...@redhat.com, Gluster Devel gluster-devel@gluster.org
Sent: Tuesday, June 3, 2014 1:02:08 AM
Subject: Re: [Gluster-devel] autodelete in snapshots

On 2 June 2014 20:22, Vijay Bellur vbel...@redhat.com wrote:

On 04/23/2014 05:50 AM, Vijay Bellur wrote:

On 04/20/2014 11:42 PM, Lalatendu Mohanty wrote:

On 04/16/2014 11:39 AM, Avra Sengupta wrote:

The whole purpose of introducing the soft-limit is that at any point in time the number of snaps should not exceed the hard-limit. If we trigger auto-delete on hitting the hard-limit, then the purpose itself is lost, because at that point we would be taking a snap, making the count hard-limit + 1, and then triggering auto-delete, which violates the sanctity of the hard-limit. Also, what happens when we are at hard-limit + 1 and another snap is issued while auto-delete is yet to process the first delete? At that point we end up at hard-limit + 2. And what happens if auto-delete fails for a particular snap?

We should see the hard-limit as something set by the admin keeping resource consumption in mind, and at no point should we cross this limit, come what may. If we hit this limit, the create command should fail, asking the user to delete snaps using the snapshot delete command.

The two options Raghavendra mentioned apply to the soft-limit only; on hitting the soft-limit we can either 1. trigger auto-delete, or 2. log a warning message telling the user that the number of snaps is exceeding the snap-limit, and display the number of available snaps. Which of these should happen also depends on the user, because the auto-delete option is configurable. So if auto-delete is set to true, auto-delete should be triggered and the above message should also be logged. But if the option is set to false, only the message should be logged. This is the behaviour as designed. Adding Rahul and Seema to the mail, to reflect upon the behaviour as well.

Regards,
Avra

This sounds correct. However, we need to make sure that the usage and documentation around this are good enough, so that users understand each of the limits correctly.

It might be better to avoid the term soft-limit. soft-limit as used in quota and other places generally has an alerting connotation. Something like auto-deletion-limit might be better.

I still see references to soft-limit, and auto deletion seems to get triggered upon reaching the soft-limit. Why is the ability to auto delete not configurable? It does seem pretty nasty to go about deleting snapshots without obtaining explicit consent from the user.

I agree with Vijay here. It's not good to delete a snap (even though it is the oldest) without explicit consent from the user. FYI, it took me more than 2 weeks to figure out that my snaps were getting auto-deleted after reaching the soft-limit. For all I knew, I had not done anything, and my snap restores were failing.

I propose to remove the terms "soft" and "hard" limit. I believe there should be a limit (just "limit") after which all snapshot creates should fail with proper error messages, and there can be a water-mark after which the user should get warning messages. So below is my proposal.

auto-delete + snap-limit: If the snap-limit is set to n, the next snap create (the (n+1)th) will succeed only if auto-delete is set to on/true/1, in which case the oldest snap will get deleted automatically. If auto-delete is set to off/false/0, the (n+1)th snap create will fail with a proper error message from the gluster CLI. But again, by default auto-delete should be off.

snap-water-mark: This should come into the picture only if auto-delete is turned off. It should not have any meaning if auto-delete is turned ON. Basically, its usage is to warn the user that the limit is almost reached and it is time for the admin to decide which snaps should be deleted (or which should be kept).

*my two cents*

-MS

The reason for having a hard-limit is to stop snapshot creation once we have reached this limit. This helps to have control over resource consumption. Therefore, if we only have this one limit (as snap-limit), then there is no question of auto-delete: auto-delete can only be triggered once the count crosses the limit. That is why we introduced the concepts of a soft-limit and a hard-limit. As the name suggests, once the hard-limit is reached, no more snaps will be created.

Perhaps I could have been clearer. The auto-delete value does come into
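MS's proposal above can be sketched as a small decision function. This is purely an illustration of the proposed semantics; the names snap_limit, water_mark and auto_delete come from the proposal in this thread, not from actual gluster option names:

```python
def on_snap_create(count, snap_limit, water_mark, auto_delete):
    """Decide what a snapshot-create request should do.

    count       -- snapshots existing before this create
    snap_limit  -- the single "limit" n; the (n+1)th create needs auto_delete
    water_mark  -- warn threshold; meaningful only when auto_delete is off
    auto_delete -- when on, delete the oldest snap instead of failing
    """
    actions = []
    if count >= snap_limit:
        if auto_delete:
            # Make room by deleting the oldest snap, then create.
            actions.append("delete-oldest")
            actions.append("create")
        else:
            # Hard stop: creation fails with a proper error message.
            actions.append("fail: snap-limit reached, delete snaps manually")
            return actions
    else:
        actions.append("create")
    # The water-mark only warns, and only when auto-delete is off.
    if not auto_delete and count + 1 >= water_mark:
        actions.append("warn: approaching snap-limit")
    return actions
```

For example, with snap_limit=5 and auto_delete off, the 6th create fails outright; with auto_delete on, the oldest snap is removed and the create succeeds, matching the behaviour MS describes.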
Re: [Gluster-devel] [Gluster-users] Need testers for GlusterFS 3.4.4
On 04/06/2014, at 6:33 AM, Pranith Kumar Karampuri wrote:
> On 06/04/2014 01:35 AM, Ben Turner wrote:
>> Sent: Thursday, May 29, 2014 6:12:40 PM
>>
>> <snip>
>>
>> FSSANITY_TEST_LIST: arequal bonnie glusterfs_build compile_kernel dbench dd ffsb fileop fsx fs_mark iozone locks ltp multiple_files posix_compliance postmark read_large rpc syscallbench tiobench
>>
>> I am starting on NFS now, I'll have results tonight or tomorrow morning. I'll look at updating the component scripts to work and run them as well.
>
> Thanks a lot for this, Ben.
>
> Justin, Ben,
> Do you think we can automate running of these scripts without a lot of human intervention? If yes, how can I help? We can use that just before making any release in future :-).

It's a decent idea. :) Do you have time to get this up and running?

+ Justin
--
Open Source and Standards @ Red Hat
twitter.com/realjustinclift
Re: [Gluster-devel] [Gluster-users] Need testers for GlusterFS 3.4.4
On 03/06/2014, at 9:05 PM, Ben Turner wrote:
> <snip>
>
> So far so good on 3.4.4, sorry for the delay here. I had to fix my downstream test suites to run outside of RHS / downstream gluster. I did basic sanity testing on glusterfs mounts, including:
>
> FSSANITY_TEST_LIST: arequal bonnie glusterfs_build compile_kernel dbench dd ffsb fileop fsx fs_mark iozone locks ltp multiple_files posix_compliance postmark read_large rpc syscallbench tiobench
>
> I am starting on NFS now, I'll have results tonight or tomorrow morning. I'll look at updating the component scripts to work and run them as well.

Out of curiosity, do you have the time/inclination to test 3.5.1beta1 as well? :)

+ Justin
--
Open Source and Standards @ Red Hat
twitter.com/realjustinclift
Re: [Gluster-devel] rackspace-regression job history disappeared?
Good news. After reloading the Jenkins configuration from disk the other day, the complete job history isn't disappearing any more.

+ Justin

On 30/05/2014, at 8:44 PM, Justin Clift wrote:
> As a FYI, there weren't any jobs running on build.gluster.org, so I hit the "reload configuration from disk" button. All of the historical jobs for rackspace-regression are now visible through the UI. No idea how long for, though. ;)
> + Justin

--
Open Source and Standards @ Red Hat
twitter.com/realjustinclift
Re: [Gluster-devel] [Gluster-users] Need testers for GlusterFS 3.4.4
On 04/06/2014, at 3:14 PM, Ben Turner wrote:
> ----- Original Message -----
> From: Justin Clift jus...@gluster.org
> To: Pranith Kumar Karampuri pkara...@redhat.com
> Cc: Ben Turner btur...@redhat.com, gluster-us...@gluster.org, Gluster Devel gluster-devel@gluster.org
> Sent: Wednesday, June 4, 2014 9:35:47 AM
> Subject: Re: [Gluster-users] [Gluster-devel] Need testers for GlusterFS 3.4.4
>
> <snip>
>
>> It's a decent idea. :) Do you have time to get this up and running?
>
> Yep, can do. I'll see what else I can get going as well; I'll start with the sanity tests I mentioned above and go from there. How often do we want these run? Daily? Weekly? On GIT checkin? Only on RC?

As often as practical, given the hardware resources available atm. On git checkout would be great, but may not be practical (unsure). ;)

+ Justin
--
Open Source and Standards @ Red Hat
twitter.com/realjustinclift
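The automation being discussed could be wired up as a cron-driven runner over the sanity list. A rough sketch only: the run_sanity_test stub, log paths and crontab entry below are hypothetical stand-ins for the real per-component scripts, and the test list is the FSSANITY_TEST_LIST quoted in this thread:

```shell
#!/bin/sh
# Hypothetical nightly sanity runner; run_sanity_test is a stand-in
# for invoking the real per-component test script.

TESTS="arequal bonnie glusterfs_build compile_kernel dbench dd ffsb \
fileop fsx fs_mark iozone locks ltp multiple_files posix_compliance \
postmark read_large rpc syscallbench tiobench"

run_sanity_test() {
    # Stand-in: call the real component script for test "$1" here.
    echo "running $1"
}

failed=0
for t in $TESTS; do
    if ! run_sanity_test "$t" >"/tmp/sanity-$t.log" 2>&1; then
        failed=$((failed + 1))
        echo "FAIL: $t (see /tmp/sanity-$t.log)"
    fi
done
echo "$failed test(s) failed"

# A crontab entry such as "0 2 * * * /opt/sanity/nightly.sh" (path
# hypothetical) would give the nightly cadence discussed above; the
# same script could equally be triggered per-RC from Jenkins.
```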
Re: [Gluster-devel] [Gluster-users] Need testers for GlusterFS 3.4.4
On 06/04/2014 07:44 PM, Ben Turner wrote:
> ----- Original Message -----
> From: Justin Clift jus...@gluster.org
> To: Pranith Kumar Karampuri pkara...@redhat.com
> Cc: Ben Turner btur...@redhat.com, gluster-us...@gluster.org, Gluster Devel gluster-devel@gluster.org
> Sent: Wednesday, June 4, 2014 9:35:47 AM
> Subject: Re: [Gluster-users] [Gluster-devel] Need testers for GlusterFS 3.4.4
>
> <snip>
>
>> It's a decent idea. :) Do you have time to get this up and running?
>
> Yep, can do. I'll see what else I can get going as well; I'll start with the sanity tests I mentioned above and go from there. How often do we want these run? Daily? Weekly? On GIT checkin? Only on RC?
>
> -b

How long does it take to run them?

Pranith
[Gluster-devel] Reminder: Weekly Gluster Community meeting is in 27 mins
Reminder!!!

The weekly Gluster Community meeting is in 30 mins, in #gluster-meeting on IRC. This is a completely public meeting; everyone is encouraged to attend and be a part of it. :)

To add agenda items, just add them to the main text of the Etherpad, and be at the meeting. :)

  https://public.pad.fsfe.org/p/gluster-community-meetings

Regards and best wishes,

Justin Clift
--
Open Source and Standards @ Red Hat
twitter.com/realjustinclift
Re: [Gluster-devel] [Gluster-users] Need testers for GlusterFS 3.4.4
----- Original Message -----
From: Justin Clift jus...@gluster.org
To: Pranith Kumar Karampuri pkara...@redhat.com
Cc: Ben Turner btur...@redhat.com, gluster-us...@gluster.org, Gluster Devel gluster-devel@gluster.org
Sent: Wednesday, June 4, 2014 9:35:47 AM
Subject: Re: [Gluster-users] [Gluster-devel] Need testers for GlusterFS 3.4.4

<snip>

> It's a decent idea. :) Do you have time to get this up and running?

Yep, can do. I'll see what else I can get going as well; I'll start with the sanity tests I mentioned above and go from there. How often do we want these run? Daily? Weekly? On GIT checkin? Only on RC?

-b
[Gluster-devel] Need reviewers for these two 3.5.1 beta 2 patches
Hi all,

We need some people to review:

  http://review.gluster.org/#/c/7963/

and:

  http://review.gluster.org/#/c/7978/

If that gets done, we can release 3.5.1 beta 2 this week. So, if anyone has the time, that would be directly helpful. :)

+ Justin
--
Open Source and Standards @ Red Hat
twitter.com/realjustinclift