Re: [Gluster-devel] regarding special treatment of ENOTSUP for setxattr
I think with the repetitive log message suppression patch being merged, we don't really need gf_log_occasionally (except for messages logged at DEBUG or TRACE levels).

----- Original Message -----
From: Pranith Kumar Karampuri pkara...@redhat.com
To: Vijay Bellur vbel...@redhat.com
Cc: gluster-devel@gluster.org, Anand Avati aav...@redhat.com
Sent: Wednesday, 7 May, 2014 3:12:10 PM
Subject: Re: [Gluster-devel] regarding special treatment of ENOTSUP for setxattr

> ----- Original Message -----
> From: Vijay Bellur vbel...@redhat.com
> To: Pranith Kumar Karampuri pkara...@redhat.com, Anand Avati aav...@redhat.com
> Cc: gluster-devel@gluster.org
> Sent: Tuesday, May 6, 2014 7:16:12 PM
> Subject: Re: [Gluster-devel] regarding special treatment of ENOTSUP for setxattr
>
> On 05/06/2014 01:07 PM, Pranith Kumar Karampuri wrote:
> > hi,
> >     Why is there occasional logging for the ENOTSUP errno when setxattr fails?
>
> In the absence of occasional logging, the log files would be flooded with
> this message every time there is a setxattr() call.
>
> -Vijay

How do we know which keys are failing setxattr with ENOTSUP if it is not logged, when the key keeps changing?

Pranith

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
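The idea behind gf_log_occasionally can be sketched as a simple counter-based rate limiter: emit the first occurrence of a repeating message and then only every Nth repeat. This is a minimal standalone sketch, not the GlusterFS implementation; the function names and the per-site counter are invented for illustration.

```c
#include <stdio.h>

/* Returns 1 for occurrences 0, N, 2N, ... of a repeating event,
 * 0 otherwise. A real implementation would key one counter per
 * message site; here the caller keeps it explicit. */
static int should_log(unsigned *counter, unsigned every_nth)
{
    return ((*counter)++ % every_nth) == 0;
}

/* usage sketch: guard the noisy message site with the counter */
static unsigned enotsup_hits;

static void log_enotsup(const char *key)
{
    if (should_log(&enotsup_hits, 100))
        fprintf(stderr, "setxattr on key %s failed: ENOTSUP "
                        "(seen %u times so far)\n", key, enotsup_hits);
}
```

The trade-off discussed in the thread is visible here: suppression keeps the log readable, but when the failing key changes between calls, the suppressed occurrences carry keys that never reach the log.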
Re: [Gluster-devel] corrupted hash table
Hi Emmanuel,

Is it possible to get valgrind reports (or a test case which caused this crash)? The inode table is corrupted in this case.

regards,
Raghavendra

----- Original Message -----
From: Emmanuel Dreyfus m...@netbsd.org
To: gluster-devel@gluster.org
Sent: Wednesday, May 21, 2014 12:43:50 PM
Subject: Re: [Gluster-devel] corrupted hash table

> Nobody has an idea on this one? This is the master branch, client side:
>
> Program terminated with signal 11, Segmentation fault.
> #0  uuid_unpack (in=0xffc0 <Address 0xffc0 out of bounds>, uu=0xbf7fd7b0)
>     at ../../contrib/uuid/unpack.c:43
> warning: Source file is more recent than executable.
> 43          tmp = *ptr++;
> (gdb) print tmp
> Cannot access memory at address 0xffc0
> (gdb) bt
> #0  uuid_unpack (in=0xffc0 <Address 0xffc0 out of bounds>, uu=0xbf7fd7b0)
>     at ../../contrib/uuid/unpack.c:43
> #1  0xbb788f63 in uuid_compare (uu1=0xffc0 <Address 0xffc0 out of bounds>,
>     uu2=0xb811f938 "k\350_6") at ../../contrib/uuid/compare.c:46
> #2  0xbb769993 in __inode_find (table=0xbb213368, gfid=0xb811f938 "k\350_6")
>     at inode.c:763
> #3  0xbb769cdc in __inode_link (inode=0x5a70b768, parent=<optimized out>,
>     name=0x5a7e3148 "conf24746.file", iatt=0xb811f930) at inode.c:831
> #4  0xbb769f3f in inode_link (inode=0x5a70b768, parent=0x5af47728,
>     name=0x5a7e3148 "conf24746.file", iatt=0xb811f930) at inode.c:892
> #5  0xbb36bdaa in fuse_create_cbk (frame=0xba417c44, cookie=0xbb28cb98,
>     this=0xb9cbe018, op_ret=0, op_errno=0, fd=0xb799e808, inode=0x5a70b768,
>     buf=0xb811f930, preparent=0xb811f998, postparent=0xb811fa00, xdata=0x0)
>     at fuse-bridge.c:1888
> #6  0xb92a30a0 in io_stats_create_cbk (frame=0xbb28cb98, cookie=0xbb287418,
>     this=0xb9df2018, op_ret=0, op_errno=0, fd=0xb799e808, inode=0x5a70b768,
>     buf=0xb811f930, preparent=0xb811f998, postparent=0xb811fa00, xdata=0x0)
>     at io-stats.c:1260
> #7  0xb92afd80 in mdc_create_cbk (frame=0xbb287418, cookie=0xbb28d7d8,
>     this=0xb9df1018, op_ret=0, op_errno=0, fd=0xb799e808, inode=0x5a70b768,
>     buf=0xb811f930, preparent=0xb811f998, postparent=0xb811fa00, xdata=0x0)
>     at md-cache.c:1404
> #8  0xb92c790f in ioc_create_cbk (frame=0xbb28d7d8, cookie=0xbb28b008,
>     this=0xb9dee018, op_ret=0, op_errno=0, fd=0xb799e808, inode=0x5a70b768,
>     buf=0xb811f930, preparent=0xb811f998, postparent=0xb811fa00, xdata=0x0)
>     at io-cache.c:701
> #9  0xbb3079ba in ra_create_cbk (frame=0xbb28b008, cookie=0xbb287f08,
>     this=0xb9dec018, op_ret=0, op_errno=0, fd=0xb799e808, inode=0x5a70b768,
>     buf=0xb811f930, preparent=0xb811f998, postparent=0xb811fa00, xdata=0x0)
>     at read-ahead.c:173
> #10 0xb92f66b9 in dht_create_cbk (frame=0xbb287f08, cookie=0xbb28f988,
>     this=0xb9cff018, op_ret=0, op_errno=0, fd=0xb799e808, inode=0x5a70b768,
>     stbuf=0xb811f930, preparent=0xb811f998, postparent=0xb811fa00,
>     xdata=0x5c491028) at dht-common.c:3942
> #11 0xb932fa22 in afr_create_unwind (frame=0xba40439c, this=0xb9cfd018)
>     at afr-dir-write.c:397
> #12 0xb9330a02 in __afr_dir_write_cbk (frame=0xba40439c, cookie=0x2,
>     this=0xb9cfd018, op_ret=0, op_errno=0, buf=0xbf7fdfe4,
>     preparent=0xbf7fdf7c, postparent=0xbf7fdf14, preparent2=0x0,
>     postparent2=0x0, xdata=0x5cb61ea8) at afr-dir-write.c:244
> #13 0xb939a401 in client3_3_create_cbk (req=0xb805f028, iov=0xb805f048,
>     count=1, myframe=0xbb28e4f8) at client-rpc-fops.c:2211
> #14 0xbb7daecf in rpc_clnt_handle_reply (clnt=0xb9cd93b8, pollin=0x5a7dbe38)
>     at rpc-clnt.c:767
> #15 0xbb7db7a4 in rpc_clnt_notify (trans=0xb80a7018, mydata=0xb9cd93d8,
>     event=RPC_TRANSPORT_MSG_RECEIVED, data=0x5a7dbe38) at rpc-clnt.c:895
> #16 0xbb7d7d9c in rpc_transport_notify (this=0xb80a7018,
>     event=RPC_TRANSPORT_MSG_RECEIVED, data=0x5a7dbe38) at rpc-transport.c:512
> #17 0xbb3214ab in socket_event_poll_in (this=0xb80a7018) at socket.c:2120
> #18 0xbb3246fc in socket_event_handler (fd=16, idx=4, data=0xb80a7018,
>     poll_in=1, poll_out=0, poll_err=0) at socket.c:2233
> #19 0xbb7a4c9a in event_dispatch_poll_handler (i=4, ufds=0xbb285118,
>     event_pool=0xbb242098) at event-poll.c:357
> #20 event_dispatch_poll (event_pool=0xbb242098) at event-poll.c:436
> #21 0xbb77a160 in event_dispatch (event_pool=0xbb242098) at event.c:113
> #22 0x08050567 in main (argc=4, argv=0xbf7fe880) at glusterfsd.c:2023
> (gdb) frame 2
> #2  0xbb769993 in __inode_find (table=0xbb213368, gfid=0xb811f938 "k\350_6")
>     at inode.c:763
> 763         if (uuid_compare (tmp->gfid, gfid) == 0) {
> (gdb) list
> 758                 return table->root;
> 759
> 760         hash = hash_gfid (gfid, 65536);
> 761
> 762         list_for_each_entry (tmp, &table->inode_hash[hash], hash) {
> 763                 if (uuid_compare (tmp->gfid, gfid) == 0) {
> 764                         inode = tmp;
> 765
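For readers unfamiliar with the code in frame 2: __inode_find hashes the gfid into a bucket and walks the chain comparing full 16-byte gfids. The sketch below mimics that lookup pattern with toy types (the struct names, the simplified hash, and the plain `hash_next` pointer are all stand-ins, not the GlusterFS structs); a corrupted chain pointer, as in this crash, would fault exactly at the gfid comparison.

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define HASH_BUCKETS 65536

struct toy_inode {
    unsigned char gfid[16];
    struct toy_inode *hash_next;   /* stand-in for the list_head chaining */
};

static unsigned hash_gfid(const unsigned char *gfid)
{
    /* fold the last 4 bytes of the uuid into the bucket range;
     * illustrative only, not GlusterFS's hash_gfid() */
    uint32_t v;
    memcpy(&v, gfid + 12, sizeof(v));
    return v % HASH_BUCKETS;
}

static struct toy_inode *inode_find(struct toy_inode **buckets,
                                    const unsigned char *gfid)
{
    struct toy_inode *tmp;

    for (tmp = buckets[hash_gfid(gfid)]; tmp; tmp = tmp->hash_next)
        if (memcmp(tmp->gfid, gfid, 16) == 0)  /* uuid_compare analogue */
            return tmp;
    return NULL;   /* a bogus hash_next (e.g. 0xffc0) crashes the memcmp,
                      which is what the backtrace above shows */
}
```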
Re: [Gluster-devel] Plea for reviews
Jeff,

Comments inlined.

----- Original Message -----
From: Jeff Darcy jda...@redhat.com
To: Gluster Devel gluster-devel@gluster.org
Sent: Monday, June 23, 2014 6:53:53 PM
Subject: [Gluster-devel] Plea for reviews

> I have several patches queued up for 3.6, which have all passed regression
> tests. Unfortunately, they're all in areas where our resources are pretty
> thin, so getting the required +1 reviews is proving to be a challenge. The
> patches are as follows:
>
> * For heterogeneous bricks [1]
>   http://review.gluster.org/8093

Caught up with release schedules. I'll try to take this up on high priority.

> * For better SSL [2]
>   http://review.gluster.org/3695
>   http://review.gluster.org/8040
>   http://review.gluster.org/8094
>
> * Not in the feature list, but turning out to be important
>   http://review.gluster.org/7702
>
> I know these are all in tricky areas. I'd be glad to do walkthroughs to
> explain what each patch is doing in more detail. Thanks in advance to
> anyone who can help!
>
> [1] http://www.gluster.org/community/documentation/index.php/Features/heterogeneous-bricks
> [2] http://www.gluster.org/community/documentation/index.php/Features/better-ssl

regards,
Raghavendra.
Re: [Gluster-devel] Feature review: Improved rebalance performance
----- Original Message -----
From: Shyamsundar Ranganathan srang...@redhat.com
To: Xavier Hernandez xhernan...@datalab.es
Cc: gluster-devel@gluster.org
Sent: Tuesday, July 1, 2014 1:48:09 AM
Subject: Re: [Gluster-devel] Feature review: Improved rebalance performance

> From: Xavier Hernandez xhernan...@datalab.es
>
> Hi Shyam,
>
> On Thursday 26 June 2014 14:41:13 Shyamsundar Ranganathan wrote:
> > It also touches upon a rebalance-on-access-like mechanism where we could
> > potentially move data out of existing bricks to a newer brick faster, in
> > the case of brick addition (and vice versa for brick removal), and heal
> > the rest of the data on access.
>
> Will this rebalance-on-access feature be enabled always, or only during a
> brick addition/removal to move files that do not go to the affected brick
> while the main rebalance is populating or removing files from the brick?

The rebalance on access, in my head, stands as follows (a little more detailed than what is in the feature page):

Step 1: Initiation of the process
- Admin chooses to rebalance _changed_ bricks. This could mean added/removed/changed-size bricks.
- [3] Rebalance on access is triggered, so as to move files when they are accessed, but asynchronously.
- [1] Background rebalance acts only to (re)move data (from)to these bricks.
- [2] This would also change the layout for all directories, to include the new configuration of the cluster, so that newer data is placed in the correct bricks.

Step 2: Completion of background rebalance
- Once background rebalance is complete, the rebalance status is noted as success/failure based on what the background rebalance process did.
- This will not stop the on-access rebalance, as data is still all over the place, and enhancements like lookup-unhashed=auto will have trouble.

Step 3: Admin can initiate a full rebalance
- When this is complete, the on-access rebalance would be turned off, as the cluster is rebalanced!

Step 2.5/4: Choosing to stop the on-access rebalance
- This can be initiated by the admin, post step 3 (which is more logical) or between steps 2 and 3, in which case lookup-everywhere for files etc. cannot be avoided due to [2] above.

Issues and possible solutions:

[4] One other thought is to create link files, as a part of [1], for files that do not belong to the right bricks but are _not_ going to be rebalanced, as their source/destination is not a changed brick. This _should_ be faster than moving data around and rebalancing these files. It should also avoid the problem that, post a rebalance _changed_ command, the cluster may have files in the wrong place based on the layout, as the link files would be present to correct the situation. In this situation the rebalance on access can be left on indefinitely, and turning it off does not serve much purpose.

Enabling rebalance on access always is fine, but I am not sure it buys us gluster states that mean the cluster is in a balanced situation, for other actions like the lookup-unhashed mentioned, which may need more than just the link files in place. Examples could be mismatched or overly space-committed bricks with old, not-accessed data, etc., but I do not have a clear example yet.

Just stating: the core intention of rebalance _changed_ is to create space in existing bricks faster when the cluster grows, or to be able to remove bricks from the cluster faster.

Redoing a rebalance _changed_ again due to a gluster configuration change (i.e. expanding the cluster again, say) needs some thought. It does not matter whether rebalance on access is running or not; the only thing it may impact is the choice of files that were already put into the on-access queue based on the older layout, due to the older cluster configuration. Just noting this here.

In short, if we do [4] then we can leave rebalance on access turned on always, unless there are counter-examples or use cases we have not thought of. Doing [4] seems logical, so I would state that we should, but from the performance angle of improving rebalance, we need to weigh it against the IO-path access cost of not having [4] (again, considering the improvement that lookup-unhashed brings, it is maybe obvious that [4] should be done).

A note on [3]: the intention is to start an asynchronous sync task that rebalances the file on access, and not impact the IO path. So if a file is identified by the IO path as needing a rebalance, a sync task with the required xattr to trigger a file move is set up, and setxattr is called; that should take care of the file migration while letting the IO path progress as is.

Reading through your mail, a better way of doing this, by sharing the load, would be to use an index, so that each node in the cluster has a list of accessed files that need a rebalance. The above method for [3] would be client-heavy and would incur a network read and write, whereas the index manner of doing
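The "is this file on the wrong brick?" decision that drives both background and on-access rebalance can be sketched as: hash the name, find the subvolume owning that hash range under the old and the new layout, and migrate only if they differ. Everything below is a simplification for illustration — DHT's real layouts are per-directory hash ranges, and the hash function and names here are invented.

```c
#include <stdint.h>

static uint32_t toy_hash(const char *name)   /* stand-in for dht_hash_compute */
{
    uint32_t h = 2166136261u;                /* FNV-1a, for illustration */
    while (*name)
        h = (h ^ (uint32_t)(unsigned char)*name++) * 16777619u;
    return h;
}

/* equal contiguous ranges: subvol i owns [i*2^32/n, (i+1)*2^32/n) */
static int subvol_for(uint32_t hash, int nsubvols)
{
    return (int)(((uint64_t)hash * (uint64_t)nsubvols) >> 32);
}

/* migrate only files whose owning subvolume changed when the brick
 * count changed; everything else at most needs a link file ([4]) */
static int needs_migration(const char *name, int old_subvols, int new_subvols)
{
    uint32_t h = toy_hash(name);
    return subvol_for(h, old_subvols) != subvol_for(h, new_subvols);
}
```

With this predicate, the on-access path only queues a file for the async sync task when `needs_migration` is true; files that stay put are exactly the candidates for the link-file shortcut in [4].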
Re: [Gluster-devel] Feature review: Improved rebalance performance
----- Original Message -----
From: Xavier Hernandez xhernan...@datalab.es
To: Raghavendra Gowdappa rgowd...@redhat.com
Cc: Shyamsundar Ranganathan srang...@redhat.com, gluster-devel@gluster.org
Sent: Tuesday, July 1, 2014 3:10:29 PM
Subject: Re: [Gluster-devel] Feature review: Improved rebalance performance

> On Tuesday 01 July 2014 02:37:34 Raghavendra Gowdappa wrote:
> > > Another thing to consider for future versions is to modify the current
> > > DHT to use consistent hashing, and even change the hash value (using
> > > the gfid instead of a hash of the name would solve the rename
> > > problem). The consistent hashing would drastically reduce the number
> > > of files that need to be moved and already solves some of the current
> > > problems. This change needs a lot of thinking though.
> >
> > The problem with using the gfid for hashing instead of the name is that
> > we run into a chicken-and-egg problem: before lookup, we cannot know the
> > gfid of the file, and to look up the file, we need the gfid to find out
> > the node on which the file resides. Of course, this problem would go
> > away if we looked up (maybe just during fresh lookups) on all the nodes,
> > but that slows down fresh lookups and may not be acceptable.
>
> I think it's not so problematic, and the benefits would be considerable.
> The gfid of the root directory is always known. This means that we could
> always do a lookup on root by gfid. I haven't tested it, but as I
> understand it, when you want to do a getxattr on a file inside a
> subdirectory, for example, the kernel will issue lookups on all
> intermediate directories to check,

Yes, but how does dht handle these lookups? Are you suggesting that we wind the lookup call to all subvolumes (since we don't know which subvolume the file is present on, for lack of a gfid)?

> at least, the access rights before finally reading the xattr of the file.
> This means that we can get and cache the gfids of all intermediate
> directories in the process. Even if there's some operation that does not
> issue a previous lookup, we could do that lookup if it's not cached.
> Of course, if there were many more operations not issuing a previous
> lookup, this solution wouldn't be good, but I think this is not the case.
> I'll try to do some tests to see if this is correct.
>
> Xavi
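Xavi's point that consistent hashing "would drastically reduce the number of files that need to be moved" comes from how keys are placed on a ring: each key belongs to the first node clockwise from it, so adding a node only reassigns the keys in one arc. The sketch below is a minimal illustration of that placement rule, not DHT's actual layout algorithm; node positions and names are invented.

```c
#include <stdint.h>

/* Pick the node owning `key` on a consistent-hashing ring: the node
 * with the smallest position >= key, wrapping to the lowest position
 * when no node lies ahead. Returns an index into `nodes`. */
static int ring_owner(uint32_t key, const uint32_t *nodes, int n)
{
    int best = -1;
    uint32_t best_pos = 0;

    for (int i = 0; i < n; i++)
        if (nodes[i] >= key && (best == -1 || nodes[i] < best_pos)) {
            best = i;
            best_pos = nodes[i];
        }

    if (best == -1) {                /* wrapped around the ring */
        best = 0;
        for (int i = 1; i < n; i++)
            if (nodes[i] < nodes[best])
                best = i;
    }
    return best;
}
```

Adding a node at position P only changes ownership for keys in the arc between P and its predecessor; every other key keeps its owner, which is exactly the property that shrinks rebalance traffic compared to re-dividing the whole hash space.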
[Gluster-devel] syncops and thread specific memory regions
Hi all,

The bug fixed by [1] is one instance of a class of problems where:

1. We access a variable which is stored in a thread-specific area and hence can live in different memory regions across different threads.
2. A single (code) control flow is executed in more than one thread.
3. Optimization prevents recalculating the address of the variable mentioned in 1 every time it is accessed, instead using an address calculated earlier.

The bug fixed by [1] involved errno as the variable. However, there are other pointers which are stored in TLS, like:

1. The xlator object in whose context the current code is executing (aka THIS, set/read using __glusterfs_this_location()).
2. A buffer used to parse binary uuids into strings (used by uuid_utoa()).

I think we can hit the corruption uncovered by [1] in the above two scenarios too. Comments?

[1] http://review.gluster.org/6475

regards,
Raghavendra.
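The hazard is easy to demonstrate in isolation: a pointer into thread-local storage is only meaningful on the thread that computed it, so if a syncop-style control flow resumes on a different thread, a cached address of errno, THIS, or the uuid_utoa() buffer points into the wrong thread's copy. The sketch below just shows that two threads see different addresses for the same `__thread` variable; it is an illustration, not GlusterFS code.

```c
#include <pthread.h>
#include <stddef.h>

static __thread int tls_slot;            /* one copy per thread */

static void *addr_of_slot(void *out)
{
    *(int **)out = &tls_slot;            /* record this thread's address */
    return NULL;
}

/* compare the calling thread's &tls_slot with a worker thread's;
 * returns 1 when they differ (i.e. caching the address across a
 * thread switch would dereference the wrong copy) */
static int tls_addresses_differ(void)
{
    pthread_t t;
    int *mine = &tls_slot, *theirs = NULL;

    pthread_create(&t, NULL, addr_of_slot, &theirs);
    pthread_join(t, NULL);
    return mine != theirs;
}
```

If the compiler hoists the address computation out of a loop (point 3 above) and the flow migrates between threads in between, the code keeps using `mine` on a thread where only `theirs` is valid — which is exactly the corruption pattern [1] fixed for errno.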
Re: [Gluster-devel] regarding inode_link/unlink
----- Original Message -----
From: Pranith Kumar Karampuri pkara...@redhat.com
To: Raghavendra Gowdappa rgowd...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org, Anand Avati av...@gluster.org, Brian Foster bfos...@redhat.com, Raghavendra Bhat rab...@redhat.com
Sent: Friday, July 4, 2014 5:39:03 PM
Subject: Re: regarding inode_link/unlink

> On 07/04/2014 04:28 PM, Raghavendra Gowdappa wrote:
> > ----- Original Message -----
> > From: Pranith Kumar Karampuri pkara...@redhat.com
> > To: Gluster Devel gluster-devel@gluster.org, Anand Avati av...@gluster.org, Brian Foster bfos...@redhat.com, Raghavendra Gowdappa rgowd...@redhat.com, Raghavendra Bhat rab...@redhat.com
> > Sent: Friday, July 4, 2014 3:44:29 PM
> > Subject: regarding inode_link/unlink
> >
> > > hi,
> > >     I have a doubt about when a particular dentry_unset, and thus
> > > inode_unref on the parent dir, happens in fuse-bridge in gluster.
> > > When a file is looked up for the first time, fuse_entry_cbk does
> > > 'inode_link' with parent-gfid/bname. Whenever an
> > > unlink/rmdir/(lookup gives ENOENT) happens, the corresponding inode
> > > unlink happens. The question is, will the present set of operations
> > > lead to leaks:
> > > 1) Mount 'M0' creates a file 'a'
> > > 2) Mount 'M1' of the same volume deletes file 'a'
> > > M0 never touches 'a' anymore. When will inode_unlink happen in such
> > > cases? Will it lead to memory leaks?
> >
> > The kernel will eventually send forget(a) on M0 and that will clean up
> > the dentries and inode. It's equivalent to a file being looked up and
> > never used again (deleting doesn't matter in this case).
>
> Do you know the trigger points for that? When I do 'touch a' on the mount
> point and leave the system like that, forget is not coming. If I do
> unlink on the file then forget is coming.

I am not very familiar with how the kernel manages its inodes. However, as Avati has mentioned in another mail, you can force the kernel to send forgets by invalidating the inode. I think he has given enough details in that mail.

> Pranith
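The bookkeeping being discussed can be sketched as a per-inode lookup count: every lookup reply to the kernel bumps it, and the kernel's FORGET says how many lookups it is dropping; only at zero can fuse-bridge release the inode and its dentries. The struct and function names below are illustrative stand-ins, not the fuse-bridge ones.

```c
#include <stdint.h>

struct nl_inode {
    uint64_t nlookup;    /* lookups the kernel still holds */
};

static void lookup_reply(struct nl_inode *in)
{
    in->nlookup++;       /* kernel now references this inode once more */
}

/* kernel FORGET: drop `nlookup` references; returns 1 when the inode
 * may finally be purged from the inode table */
static int forget(struct nl_inode *in, uint64_t nlookup)
{
    in->nlookup -= nlookup;
    return in->nlookup == 0;
}
```

This also shows why a plain `touch a` produces no forget: the kernel keeps the cached lookup until memory pressure, unmount, or an explicit invalidation drives the count back to zero.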
Re: [Gluster-devel] When inode table is populated?
----- Original Message -----
From: Jiffin Thottan jthot...@redhat.com
To: gluster-devel@gluster.org
Sent: Wednesday, July 30, 2014 12:22:30 PM
Subject: [Gluster-devel] When inode table is populated?

> Hi,
>
> When we were trying to call rename from a translator (in reconfigure)
> using STACK_WIND, the inode table (this->itable) value seems to be NULL.
> Since an inode is required for performing rename, when does the inode
> table get populated, and why is it not populated in reconfigure or init?

Not every translator has an inode table (nor is it required to). Only the translators which do inode management (like fuse-bridge, protocol/server, libgfapi, possibly the nfsv3 server?) will have an inode table associated with them. If you need to access the itable, you can do that using inode->table.

> Or should we create a private inode table and generate inodes using it?
>
> -Jiffin
Re: [Gluster-devel] Monotonically increasing memory
Anders,

Mostly it's a case of a memory leak. It would be helpful if you can file a bug on this. The following information would be useful to fix the issue:

1. Valgrind reports (if possible).

   a. To start the brick and nfs processes with valgrind, you can use the following cmdline when starting glusterd:

      # glusterd --xlator-option *.run-with-valgrind=yes

      In this case all the valgrind logs can be found in the standard glusterfs log directory.

   b. For the client, you can start glusterfs under valgrind just like any other process. Since glusterfs daemonizes itself, we need to prevent that when running under valgrind by keeping it in the foreground, using the -N option:

      # valgrind --leak-check=full --log-file=<path-to-valgrind-log> glusterfs --volfile-id=xyz --volfile-server=abc -N /mnt/glfs

2. Once you observe a considerable leak in memory, please get a statedump of glusterfs:

   # gluster volume statedump <volname>

   and attach the reports to the bug.

regards,
Raghavendra.

----- Original Message -----
From: Anders Blomdell anders.blomd...@control.lth.se
To: Gluster Devel gluster-devel@gluster.org
Sent: Friday, August 1, 2014 12:01:15 AM
Subject: [Gluster-devel] Monotonically increasing memory

> During rsync of 35 files, memory consumption of glusterfs rose to 12 GB
> (after approx 14 hours). I take it that this is a bug I should try to
> track down? Version is 3.7dev as of Tuesday...
>
> /Anders
>
> --
> Anders Blomdell                  Email: anders.blomd...@control.lth.se
> Department of Automatic Control
> Lund University                  Phone: +46 46 222 4625
> P.O. Box 118                     Fax:   +46 46 138118
> SE-221 00 Lund, Sweden
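Once a statedump is in hand, the leak suspects are the allocation types whose `num_allocs` keeps growing between dumps. A quick way to rank them is sketched below; the here-doc stands in for a real dump file (normally written under the glusterfs run directory, e.g. /var/run/gluster/, after `gluster volume statedump`), and the section names are illustrative samples of the mem-accounting format.

```shell
# Build a small sample statedump, then rank usage-types by outstanding
# allocations (num_allocs), largest first.
cat > /tmp/sample.dump <<'EOF'
[mallinfo]
arena=1234
[global.glusterfs - usage-type gf_common_mt_dict_t memusage]
size=1048576
num_allocs=4096
[global.glusterfs - usage-type gf_common_mt_char memusage]
size=2048
num_allocs=16
EOF

awk -F= '
/usage-type/  { type = $0 }
/^num_allocs/ { print $2, type }
' /tmp/sample.dump | sort -rn | head -3
```

Comparing two such rankings taken an hour apart points straight at the allocation type that is monotonically increasing, which narrows the search well before valgrind output is available.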
Re: [Gluster-devel] Rackspace regression slaves hung?
----- Original Message -----
From: Krutika Dhananjay kdhan...@redhat.com
To: Justin Clift jcl...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Thursday, August 28, 2014 12:25:35 PM
Subject: [Gluster-devel] Rackspace regression slaves hung?

> Hi Justin,
>
> It looks like slaves 22-25 have been hung for over 23 hours now?

There are a couple of patches [1] submitted by me that are resulting in a hang. I think these slaves were spawned to test the patch [1] and its dependencies. If yes, they can be killed.

[1] http://review.gluster.com/#/c/8523/

> -Krutika
Re: [Gluster-devel] Rackspace regression slaves hung?
I've killed the jobs in question.

----- Original Message -----
From: Raghavendra Gowdappa rgowd...@redhat.com
To: Krutika Dhananjay kdhan...@redhat.com
Cc: Justin Clift jcl...@redhat.com, Gluster Devel gluster-devel@gluster.org
Sent: Thursday, August 28, 2014 12:37:07 PM
Subject: Re: [Gluster-devel] Rackspace regression slaves hung?

> ----- Original Message -----
> From: Krutika Dhananjay kdhan...@redhat.com
> To: Justin Clift jcl...@redhat.com
> Cc: Gluster Devel gluster-devel@gluster.org
> Sent: Thursday, August 28, 2014 12:25:35 PM
> Subject: [Gluster-devel] Rackspace regression slaves hung?
>
> > Hi Justin,
> >
> > It looks like slaves 22-25 have been hung for over 23 hours now?
>
> There are a couple of patches [1] submitted by me that are resulting in a
> hang. I think these slaves were spawned to test the patch [1] and its
> dependencies. If yes, they can be killed.
>
> [1] http://review.gluster.com/#/c/8523/
>
> > -Krutika
Re: [Gluster-devel] how do you debug ref leaks?
----- Original Message -----
From: Raghavendra Gowdappa rgowd...@redhat.com
To: Pranith Kumar Karampuri pkara...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Thursday, September 18, 2014 10:08:15 AM
Subject: Re: [Gluster-devel] how do you debug ref leaks?

For e.g., if a dictionary is not freed because of a non-zero refcount, information on who holds those references would help to narrow down the code path or component. This solution might be rudimentary; however, someone who has worked on things like garbage collection can probably give better answers. This discussion also reminds me of Greenspun's tenth rule [1].

[1] http://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule

----- Original Message -----
From: Pranith Kumar Karampuri pkara...@redhat.com
To: Raghavendra Gowdappa rgowd...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Thursday, September 18, 2014 10:05:18 AM
Subject: Re: [Gluster-devel] how do you debug ref leaks?

> On 09/18/2014 09:59 AM, Raghavendra Gowdappa wrote:
> > One thing that would be helpful is allocator info for generic objects
> > like dict, inode, fd etc. That way we wouldn't have to sift through a
> > large amount of code.
>
> Could you elaborate on the idea, please?
>
> Pranith

----- Original Message -----
From: Pranith Kumar Karampuri pkara...@redhat.com
To: Gluster Devel gluster-devel@gluster.org
Sent: Thursday, September 18, 2014 7:43:00 AM
Subject: [Gluster-devel] how do you debug ref leaks?

> hi,
>     Till now the only method I have used to find ref leaks effectively is
> to identify which operation is causing the ref leak and read the code to
> find where the leak is. Valgrind doesn't solve this problem because the
> memory is still reachable from the inode table etc. I am just wondering
> if there is an effective way anyone else knows of. Do you guys think we
> need a better mechanism for finding ref leaks? At least one which
> decreases the search space significantly, i.e. xlator y, fop f, etc.?
> It would be better if we can come up with ways to integrate statedump and
> this infra, just like we did for mem-accounting. One way I thought of was
> to introduce new APIs called xl_fop_dict/inode/fd_ref/unref(). Each xl
> keeps an array of num_fops per inode/dict/fd and increments/decrements
> accordingly. Dump this info on statedump. I myself am not completely sure
> about this idea. It requires all xlators to change. Any ideas?
>
> Pranith
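The "who holds the references?" idea can be sketched as ref/unref wrappers that record their call site, so a statedump hook could list the sites still holding references to a leaked dict/inode/fd rather than just a total refcount. GlusterFS does not do this today; all names below are invented, and the fixed-size table has no overflow handling — it is a sketch of the bookkeeping, not a proposal-quality implementation.

```c
#include <string.h>

#define MAX_SITES 64

struct ref_site    { const char *where; int refs; };
struct ref_tracker { struct ref_site sites[MAX_SITES]; int nsites; };

/* find (or register) the counter for one call site, e.g. "dict.c:100" */
static struct ref_site *site_for(struct ref_tracker *t, const char *where)
{
    for (int i = 0; i < t->nsites; i++)
        if (strcmp(t->sites[i].where, where) == 0)
            return &t->sites[i];
    t->sites[t->nsites].where = where;     /* new call site */
    t->sites[t->nsites].refs = 0;
    return &t->sites[t->nsites++];
}

static void track_ref(struct ref_tracker *t, const char *w)
{
    site_for(t, w)->refs++;
}

static void track_unref(struct ref_tracker *t, const char *w)
{
    site_for(t, w)->refs--;
}

/* a statedump hook would print every site whose refs != 0 — the leak
 * suspects — narrowing the search to one xlator/fop immediately */
```

In practice the `where` string would come from a `__FILE__`/`__LINE__` macro at each ref site, which is what makes the dump actionable without reading every code path.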
Re: [Gluster-devel] question on quota
Hi Emmanuel,

If quota cannot allow the entire payload of a write call without exceeding the limit, it allows the fraction of the payload that fits within the quota boundaries. This leads to the short writes which the comment in frame 12 mentions. It is the behaviour of write-behind to return EIO for short writes. Since it's unlikely that writes arriving at quota when the limit is about to be reached can be allowed in their entirety, we can expect an EIO before EDQUOT. However, without write-behind in the xlator graph, there would be no EIO.

regards,
Raghavendra.

----- Original Message -----
From: Emmanuel Dreyfus m...@netbsd.org
To: Gluster Devel gluster-devel@gluster.org
Sent: Sunday, September 21, 2014 11:09:11 PM
Subject: [Gluster-devel] question on quota

> Hi
>
> I am trying to get tests/basic/quota.t working on NetBSD and I notice an
> oddity: before getting EDQUOT, I get EIO. The backtrace leading there is
> below. Note the comments at frame 12. Shall I assume it is the expected
> behavior to get EIO for an over-quota write?
> #0  0xbb491dc7 in _lwp_kill () from /lib/libc.so.12
> #1  0xbb491d68 in raise () from /lib/libc.so.12
> #2  0xbb491982 in abort () from /lib/libc.so.12
> #3  0xba5a68bb in fuse_writev_cbk (frame=0xb99b7670, cookie=0xb98b5e28,
>     this=0xbb287018, op_ret=-1, op_errno=5, stbuf=0xbf7fe1e0,
>     postbuf=0xbf7fe1e0, xdata=0x0) at fuse-bridge.c:2271
> #4  0xb9ba0d48 in io_stats_writev_cbk (frame=0xb98b5e28, cookie=0xb98b5eb8,
>     this=0xbb2d9018, op_ret=-1, op_errno=5, prebuf=0xbf7fe1e0,
>     postbuf=0xbf7fe1e0, xdata=0x0) at io-stats.c:1402
> #5  0xb9bbb504 in mdc_writev_cbk (frame=0xb98b5eb8, cookie=0xb98b5fd8,
>     this=0xbb2d7018, op_ret=-1, op_errno=5, prebuf=0xbf7fe1e0,
>     postbuf=0xbf7fe1e0, xdata=0x0) at md-cache.c:1509
> #6  0xbb768d7c in default_writev_cbk (frame=0xb98b5fd8, cookie=0xb98b6068,
>     this=0xbb2d6018, op_ret=-1, op_errno=5, prebuf=0xbf7fe1e0,
>     postbuf=0xbf7fe1e0, xdata=0x0) at defaults.c:1019
> #7  0xbb768d7c in default_writev_cbk (frame=0xb98b6068, cookie=0xb98b6338,
>     this=0xbb2d5018, op_ret=-1, op_errno=5, prebuf=0xbf7fe1e0,
>     postbuf=0xbf7fe1e0, xdata=0x0) at defaults.c:1019
> #8  0xb9be372d in ioc_writev_cbk (frame=0xb98b6338, cookie=0xb98b6458,
>     this=0xbb2d4018, op_ret=-1, op_errno=5, prebuf=0xbf7fe1e0,
>     postbuf=0xbf7fe1e0, xdata=0x0) at io-cache.c:1225
> #9  0xb9bf421f in ra_writev_cbk (frame=0xb98b6458, cookie=0xb98b64e8,
>     this=0xbb2d3018, op_ret=-1, op_errno=5, prebuf=0xbf7fe1e0,
>     postbuf=0xbf7fe1e0, xdata=0x0) at read-ahead.c:654
> #10 0xbb3068c4 in wb_do_unwinds (wb_inode=0xbb2403a8, lies=0xbf7fe288)
>     at write-behind.c:921
> #11 0xbb30724c in wb_process_queue (wb_inode=0xbb2403a8)
>     at write-behind.c:1209
> #12 0xbb305ec0 in wb_fulfill_cbk (frame=0xb98e9e70, cookie=0xb98b5be8,
>     this=0xbb2d2018, op_ret=81920, op_errno=0, prebuf=0xb98de3fc,
>     postbuf=0xb98de464, xdata=0xb98bbaa8) at write-behind.c:758
>
> Here we have this code:
>
> 742         if (op_ret == -1) {
> 743                 wb_fulfill_err (head, op_errno);
> 744         } else if (op_ret < head->total_size) {
> 745                 /*
> 746                  * We've encountered a short write, for whatever reason.
> 747                  * Set an EIO error for the next fop. This should be
> 748                  * valid for writev or flush (close).
> 749                  *
> 750                  * TODO: Retry the write so we can potentially capture
> 751                  * a real error condition (i.e., ENOSPC).
> 752                  */
> 753                 wb_fulfill_err (head, EIO);
> 754         }
>
> #13 0xb9c3fed8 in dht_writev_cbk (frame=0xb98b5be8, cookie=0xb98b5c78,
>     this=0xbb2d1018, op_ret=81920, op_errno=0, prebuf=0xb98de3fc,
>     postbuf=0xb98de464, xdata=0xb98bbaa8) at dht-inode-write.c:84
> #14 0xb9c7e97d in afr_writev_unwind (frame=0xb98b5c78, this=0xbb2cf018)
>     at afr-inode-write.c:188
> #15 0xb9c7ed43 in afr_writev_wind_cbk (frame=0xb99b6e70, cookie=0x1,
>     this=0xbb2cf018, op_ret=81920, op_errno=0, prebuf=0xbf7fe438,
>     postbuf=0xbf7fe3d0, xdata=0xb98bbaa8) at afr-inode-write.c:313
> #16 0xb9cdd30d in client3_3_writev_cbk (req=0xb9ad4428, iov=0xb9ad4448,
>     count=1, myframe=0xb98b5918) at client-rpc-fops.c:855
> #17 0xbb7330c2 in rpc_clnt_handle_reply (clnt=0xbb2af508, pollin=0xb98a6fc8)
>     at rpc-clnt.c:766
> #18 0xbb7333ba in rpc_clnt_notify (trans=0xb99a2018, mydata=0xbb2af528,
>     event=RPC_TRANSPORT_MSG_RECEIVED, data=0xb98a6fc8) at rpc-clnt.c:894
> #19 0xbb72fab5 in rpc_transport_notify (this=0xb99a2018,
>     event=RPC_TRANSPORT_MSG_RECEIVED, data=0xb98a6fc8) at rpc-transport.c:516
> #20 0xb9d6d832 in socket_event_poll_in (this=0xb99a2018) at socket.c:2153
> #21 0xb9d6dce7 in socket_event_handler (fd=15, idx=5, data=0xb99a2018,
>     poll_in=1, poll_out=0, poll_err=0) at socket.c:2266
> #22 0xbb7be78f in event_dispatch_poll_handler (event_pool=0xbb242098,
>     ufds=0xbb2856b8, i=5) at
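The partial-allowance behaviour described at the top of this thread reduces to a small computation: when a write would cross the limit, allow only the bytes that still fit, which produces the short write that write-behind then converts to EIO. The sketch below is a simplification for illustration — real quota enforcement works on accounted directory usage, not a single counter.

```c
#include <stdint.h>

/* Returns the number of bytes quota would let through for a write of
 * `write_size`, or -1 when nothing fits (the EDQUOT case). A return
 * smaller than `write_size` is the short write seen in frame 12. */
static int64_t quota_allowed_bytes(uint64_t used, uint64_t limit,
                                   uint64_t write_size)
{
    if (used >= limit)
        return -1;                            /* over limit: EDQUOT */

    uint64_t room = limit - used;
    return write_size <= room ? (int64_t)write_size
                              : (int64_t)room; /* partial: short write */
}
```

So near the limit, almost every write falls into the partial branch, which is why EIO (from write-behind's short-write handling) shows up before the first outright EDQUOT.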
Re: [Gluster-devel] io-threads problem? (was: opendir gets Stale NFS file handle)
----- Original Message -----
From: Niels de Vos nde...@redhat.com
To: Emmanuel Dreyfus m...@netbsd.org
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Tuesday, September 30, 2014 2:08:06 PM
Subject: Re: [Gluster-devel] io-threads problem? (was: opendir gets Stale NFS file handle)

> On Tue, Sep 30, 2014 at 06:03:44AM +0200, Emmanuel Dreyfus wrote:
> > Hello
> >
> > I observe this kind of error in brick logs:
> >
> > [2014-09-30 03:56:10.172889] E [server-rpc-fops.c:681:server_opendir_cbk]
> > 0-patchy-server: 11: OPENDIR (null) (63a151ad-a8b7-496b-92a8-5c3c7897e6fa)
> > ==> (Stale NFS file handle)
>
> ESTALE gets returned when a directory is opened by handle (in this case
> the GFID). The posix xlator should do the OPENDIR on the brick, through
> the .glusterfs/...GFID... structure.

The gfid handle 63a151ad-a8b7-496b-92a8-5c3c7897e6fa is missing from the .glusterfs directory (a nameless lookup on this gfid failed with ENOENT in storage/posix), and when this happens the server resolver returns an ESTALE. It seems an earlier lookup was successful (since the client got the gfid) and the handle was deleted before the opendir came in. Is there a possibility that the directory was deleted from some other client? In that case, this is not really an error. Otherwise, there might be some issue.

> > Here is the backtrace leading to it. Is that a real error?
> > #3  0xb9c45934 in server_opendir_cbk (frame=0xbb235e70, cookie=0x0,
> >     this=0xbb2ca018, op_ret=-1, op_errno=70, fd=0x0, xdata=0x0)
> >     at server-rpc-fops.c:682
> > #4  0xb9c4d402 in server_opendir_resume (frame=0xbb235e70,
> >     bound_xl=0xbb2c8018) at server-rpc-fops.c:2507
> > #5  0xb9c3ef51 in server_resolve_done (frame=0xbb235e70)
> >     at server-resolve.c:557
> > #6  0xb9c3f02d in server_resolve_all (frame=0xbb235e70)
> >     at server-resolve.c:592
> > #7  0xb9c3eefa in server_resolve (frame=0xbb235e70) at server-resolve.c:541
> > #8  0xb9c3f00a in server_resolve_all (frame=0xbb235e70)
> >     at server-resolve.c:588
> > #9  0xb9c3e662 in resolve_continue (frame=0xbb235e70)
> >     at server-resolve.c:233
> > #10 0xb9c3e242 in resolve_gfid_cbk (frame=0xbb235e70, cookie=0xbb287528,
> >     this=0xbb2ca018, op_ret=-1, op_errno=2, inode=0xbb287498,
> >     buf=0xb9b2cd14, xdata=0x0, postparent=0xb9b2ccac)
> >     at server-resolve.c:171
> > #11 0xb9c710a7 in io_stats_lookup_cbk (frame=0xbb287528, cookie=0xbb2875b8,
> >     this=0xbb2c8018, op_ret=-1, op_errno=2, inode=0xbb287498,
> >     buf=0xb9b2cd14, xdata=0x0, postparent=0xb9b2ccac) at io-stats.c:1510
> > #12 0xb9cbc09f in marker_lookup_cbk (frame=0xbb2875b8, cookie=0xbb287648,
> >     this=0xbb2c5018, op_ret=-1, op_errno=2, inode=0xbb287498,
> >     buf=0xb9b2cd14, dict=0x0, postparent=0xb9b2ccac) at marker.c:2614
> > #13 0xbb7667d8 in default_lookup_cbk (frame=0xbb287648, cookie=0xbb2876d8,
> >     this=0xbb2c4018, op_ret=-1, op_errno=2, inode=0xbb287498,
> >     buf=0xb9b2cd14, xdata=0x0, postparent=0xb9b2ccac) at defaults.c:841
> > #14 0xbb7667d8 in default_lookup_cbk (frame=0xbb2876d8, cookie=0xbb287768,
> >     this=0xbb2c2018, op_ret=-1, op_errno=2, inode=0xbb287498,
> >     buf=0xb9b2cd14, xdata=0x0, postparent=0xb9b2ccac) at defaults.c:841
> > #15 0xb9cf12ab in pl_lookup_cbk (frame=0xbb287768, cookie=0xbb287888,
> >     this=0xbb2c1018, op_ret=-1, op_errno=2, inode=0xbb287498,
> >     buf=0xb9b2cd14, xdata=0x0, postparent=0xb9b2ccac) at posix.c:2036
> > #16 0xb9d03fb0 in posix_acl_lookup_cbk (frame=0xbb287888, cookie=0xbb287918,
> >     this=0xbb2c0018, op_ret=-1, op_errno=2, inode=0xbb287498,
> >     buf=0xb9b2cd14, xattr=0x0, postparent=0xb9b2ccac) at posix-acl.c:806
> > #17 0xb9d30601 in posix_lookup (frame=0xbb287918, this=0xbb2be018,
> >     loc=0xb9910048, xdata=0xbb2432a8) at posix.c:189
> > #18 0xbb771646 in default_lookup (frame=0xbb287918, this=0xbb2bf018,
> >     loc=0xb9910048, xdata=0xbb2432a8) at defaults.c:2117
> > #19 0xb9d04384 in posix_acl_lookup (frame=0xbb287888, this=0xbb2c0018,
> >     loc=0xb9910048, xattr=0x0) at posix-acl.c:858
> > #20 0xb9cf1713 in pl_lookup (frame=0xbb287768, this=0xbb2c1018,
> >     loc=0xb9910048, xdata=0x0) at posix.c:2080
> > #21 0xbb76f4da in default_lookup_resume (frame=0xbb2876d8, this=0xbb2c2018,
> >     loc=0xb9910048, xdata=0x0) at defaults.c:1683
> > #22 0xbb786667 in call_resume_wind (stub=0xb9910028) at call-stub.c:2478
> > #23 0xbb78d4f5 in call_resume (stub=0xb9910028) at call-stub.c:2841
> > #24 0xbb30402f in iot_worker (
> >     data=<error reading variable: Cannot access memory at address 0xb9b2cfd8>,
> >     data@entry=<error reading variable: Cannot access memory at address 0xb9b2cfd4>)
> >     at io-threads.c:214
>
> This error suggests that 'data' cannot be accessed. I have no idea why
> io-threads would fail here though...
>
> Niels
Re: [Gluster-devel] io-threads problem? (was: opendir gets Stale NFS file handle)
Pranith had RCAed one of the race conditions where a stale dentry was left in the server inode table. The race can be outlined as below (T1 and T2 are two threads):

1. T1: readdirp in storage/posix reads a dentry (say pgfid1, bname1) along with metadata information and gfid.
2. T2: unlink (pgfid1, bname1) is done in storage/posix and the dentry (pgfid1, bname1) is purged from the server inode table (inode table management is done by protocol/server).
3. T1: links (pgfid1, bname1) with the corresponding gfid read in step 1. Since the unlink in step 2 already removed the entry, the re-linked dentry now remains only in the server inode table (the entry was deleted on the exported brick), resulting in ESTALE errors.

This situation can also be hit when T1 does a lookup on the same dentry instead of readdirp. However, I am not sure this is a serious problem, since the entry is deleted from the backend (we are not giving ESTALE errors for a file/directory which is actually present on the backend). In this case just restarting the volume would make the problem go away, since after restarting the servers start with a fresh inode cache. I am not sure whether this is the same problem you are facing, but it seems related. regards, Raghavendra. - Original Message - From: Emmanuel Dreyfus m...@netbsd.org To: Raghavendra Gowdappa rgowd...@redhat.com, Niels de Vos nde...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, September 30, 2014 5:19:31 PM Subject: Re: [Gluster-devel] io-threads problem? (was: opendir gets Stale NFS file handle) Raghavendra Gowdappa rgowd...@redhat.com wrote: Is there a possibility that the directory was deleted from some other client? In that case, this is not really an error. Otherwise, there might be some issue. I deleted the volume and started over: the problem vanished. I wonder how to cope with that on a production machine where data should not be deleted like that.
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
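The interleaving in the race described above can be replayed deterministically with a toy model: a plain dict stands in for the server inode table, and the two threads' steps are executed by hand in the racy order. This is an illustrative sketch only, not gluster code; all names are hypothetical.

```python
# Deterministic replay of the readdirp/unlink race (T1 and T2 interleaved).
inode_table = {}                              # (pgfid, bname) -> gfid cache
brick = {("pgfid1", "bname1"): "gfid1"}       # entry on the exported brick

# T1 (step 1): readdirp reads the dentry and its gfid from the brick
dentry = ("pgfid1", "bname1")
gfid = brick[dentry]

# T2 (step 2): unlink removes the entry from the brick and purges the cache
del brick[dentry]
inode_table.pop(dentry, None)

# T1 (step 3): links the dentry read in step 1, re-inserting a stale entry
inode_table[dentry] = gfid

# The cache and the brick now disagree: the dentry still resolves in the
# server inode table although the file is gone on disk, yielding ESTALE.
stale = dentry in inode_table and dentry not in brick
print("stale dentry cached:", stale)   # True
```

Restarting the volume clears `inode_table`, which is why the problem vanishes after a restart.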
Re: [Gluster-devel] Changing position of md-cache in xlator graph
Adding correct gluster-devel mail id. - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: gluster-devel gluster-de...@nongnu.org Sent: Tuesday, 21 October, 2014 3:26:21 PM Subject: Changing position of md-cache in xlator graph Hi all, The context is bz 1138970 [1]. As discussed in the bug, it would make more sense to load md-cache closer to the bricks (as a descendant of write-behind, to be specific) from the point of correctness, since stats are affected by writes. Do any of you see any issue in doing this? [1] https://bugzilla.redhat.com/show_bug.cgi?id=1138970 regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Quota problems with dispersed volumes
- Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Gluster Devel gluster-devel@gluster.org, Krishnan Parthasarathi kpart...@redhat.com, Raghavendra Gowdappa rgowd...@redhat.com Cc: Dan Lambright dlamb...@redhat.com Sent: Monday, October 27, 2014 11:07:40 PM Subject: Quota problems with dispersed volumes

Hi, testing quota on a dispersed volume I've found a problem in how the total used space is calculated:

# gluster volume create test disperse server{0..2}:/bricks/disperse
# gluster volume start test
# gluster volume quota test enable
# gluster volume quota test limit-usage / 1GB
# gluster volume quota test list
Path   H-L    S-L   Used     Available  S-L exceeded?  H-L exceeded?
/      1.0GB  80%   0Bytes   1.0GB      No             No
# mount -t glusterfs server0:/test /gluster/test
# dd if=/dev/zero of=/gluster/test/file bs=1024k count=512
# ls -lh /gluster/test
total 512M
-rw-r--r-- 1 root root 512M Oct 27 18:29 file
# gluster volume quota test list
Path   H-L    S-L   Used     Available  S-L exceeded?  H-L exceeded?
/      1.0GB  80%   256.0MB  768.0MB    No             No

As you can see, quota seems to count only the space used by the file on one of the bricks (each file uses 256MB on each brick). What would be the best way to solve this problem? I don't know quota internals, so I'm a bit lost about where to adjust real file sizes... We use the extended attribute with key trusted.glusterfs.quota.size to get the size of a directory/file. The value for this key is probed in lookup and getxattr calls. You can implement logic to handle this key appropriately in the disperse xlator to give a proper size to higher layers. In your existing implementation you have probably been passing xattrs from one of the bricks and hence seeing the size from only one brick. Thanks, Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Quota problems with dispersed volumes
- Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, October 29, 2014 6:24:55 PM Subject: Re: [Gluster-devel] Quota problems with dispersed volumes On 10/28/2014 02:05 PM, Xavier Hernandez wrote: On 10/28/2014 04:30 AM, Raghavendra Gowdappa wrote: We use extended attribute with key trusted.glusterfs.quota.size to get the size of a directory/file. The value for this key is probed in lookup and getxattr calls. You can implement logic to handle this key appropriately in disperse xlator to give a proper size to higher layers. In your existing implementation you might have been probably passing xattrs from one of the bricks and hence seeing size from only one brick. I think this patch fixes the problem: http://review.gluster.org/8990 It seems that there are some other xattrs visible from client side. I've identified 'trusted.glusterfs.quota.*.contri'. Are there any other xattrs that I should handle on the client side ? this is an internal xattr which only marker (disk usage accounting xlator) uses. The applications running on glusterfs shouldn't be seeing this. If you are seeing this xattr from mount, we should filter this xattr from being listed (at fuse-bridge and gfapi). It seems that there's also a 'trusted.glusterfs.quota.dirty' This is again an internal xattr. You should not worry about handling this. This also needs to be filtered from being displayed to application. and 'trusted.glusterfs.quota.limit-set'. This should be visible from mount point, as this xattr holds the value of quota limit set on that inode. You can handle this in disperse xlator by picking the value from any of its children. How I should handle visible xattrs in ec xlator if they have different values in each brick ? trusted.glusterfs.quota.size is handled by choosing the maximum value. This depends on how ec is handling the files/directories and the meaning of xattr. 
For example, trusted.glusterfs.quota.size represents the size of the file/directory. When read from a brick, the value will be the size of the directory on that brick. When read from a cluster translator like dht, it will be the size of that directory across the whole cluster. So, in dht we add up the values from all bricks and set the sum as the value. However, in the case of replicate/afr, we just pick the value from any one of the subvolumes. Thanks, Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
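The aggregation policies described here (dht sums the per-brick values, afr can pick the value from any one full copy, and a disperse xlator would have to scale a fragment size back up to the logical file size) can be sketched as follows. This is an illustrative model, not gluster's code; the ec scaling factor is an assumption based on the earlier example in this thread, where a 512MB file occupies 256MB on each brick of a 3-brick (2+1) disperse volume.

```python
# Combining per-brick trusted.glusterfs.quota.size values, per xlator type.
# Illustrative sketch only; not the actual gluster aggregation code.
def aggregate_quota_size(xlator, brick_values, data_bricks=None):
    if xlator == "dht":
        # distribute: the directory is spread over bricks, so sizes add up
        return sum(brick_values)
    if xlator == "afr":
        # replicate: every brick holds a full copy; any one value will do
        # (the thread suggests picking one; max is the safe choice)
        return max(brick_values)
    if xlator == "ec":
        # disperse: each brick holds a 1/data_bricks fragment of the file,
        # so a fragment size must be scaled back up to the logical size
        return brick_values[0] * data_bricks
    raise ValueError(xlator)

mb = 1024 * 1024
print(aggregate_quota_size("dht", [100 * mb, 50 * mb]))    # sum of bricks
print(aggregate_quota_size("afr", [512 * mb, 512 * mb]))   # one full copy
# 2+1 disperse, 256MB fragment per brick -> 512MB logical size
print(aggregate_quota_size("ec", [256 * mb] * 3, data_bricks=2))
```

This mirrors the symptom in the original report: passing the value from a single brick unmodified (instead of scaling it) makes a 512MB file account as 256MB.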
Re: [Gluster-devel] Quota problems with dispersed volumes
From quota perspective I don't see any other xattrs related to quota. You've listed them all :). - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Xavier Hernandez xhernan...@datalab.es Cc: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, October 29, 2014 8:30:33 PM Subject: Re: [Gluster-devel] Quota problems with dispersed volumes - Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, October 29, 2014 6:24:55 PM Subject: Re: [Gluster-devel] Quota problems with dispersed volumes On 10/28/2014 02:05 PM, Xavier Hernandez wrote: On 10/28/2014 04:30 AM, Raghavendra Gowdappa wrote: We use extended attribute with key trusted.glusterfs.quota.size to get the size of a directory/file. The value for this key is probed in lookup and getxattr calls. You can implement logic to handle this key appropriately in disperse xlator to give a proper size to higher layers. In your existing implementation you might have been probably passing xattrs from one of the bricks and hence seeing size from only one brick. I think this patch fixes the problem: http://review.gluster.org/8990 It seems that there are some other xattrs visible from client side. I've identified 'trusted.glusterfs.quota.*.contri'. Are there any other xattrs that I should handle on the client side ? this is an internal xattr which only marker (disk usage accounting xlator) uses. The applications running on glusterfs shouldn't be seeing this. If you are seeing this xattr from mount, we should filter this xattr from being listed (at fuse-bridge and gfapi). It seems that there's also a 'trusted.glusterfs.quota.dirty' This is again an internal xattr. You should not worry about handling this. This also needs to be filtered from being displayed to application. and 'trusted.glusterfs.quota.limit-set'. 
This should be visible from mount point, as this xattr holds the value of quota limit set on that inode. You can handle this in disperse xlator by picking the value from any of its children. How I should handle visible xattrs in ec xlator if they have different values in each brick ? trusted.glusterfs.quota.size is handled by choosing the maximum value. This depends on how ec is handling the files/directories and the meaning of xattr. For eg., trusted.glusterfs.quota.size represents the size of the file/directory. When read from brick, the value will be the size of directory on that brick. When read from a cluster translator like dht, it will be the size of that directory across the whole cluster. So, in dht we add up the values from all bricks and set the sum as the value. However, in case of replicate/afr, we just pick the value from any of the subvolume. Thanks, Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Wrong behavior on fsync of md-cache ?
- Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org, Emmanuel Dreyfus m...@netbsd.org Sent: Tuesday, November 25, 2014 2:05:25 PM Subject: Re: Wrong behavior on fsync of md-cache ? On 11/25/2014 07:38 AM, Raghavendra Gowdappa wrote: - Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org, Emmanuel Dreyfus m...@netbsd.org Sent: Tuesday, November 25, 2014 12:49:03 AM Subject: Re: Wrong behavior on fsync of md-cache ? I think the problem is here: the first thing wb_fsync() checks is if there's an error in the fd (wd_fd_err()). If that's the case, the call is immediately unwinded with that error. The error seems to be set in wb_fulfill_cbk(). I don't know the internals of write-back xlator, but this seems to be the problem. Yes, your analysis is correct. Once the error is hit, fsync is not queued behind unfulfilled writes. Whether it can be considered as a bug is debatable. Since there is already an error in one of the writes which was written-behind fsync should return the error. I am not sure whether it should wait till we try to flush _all_ the writes that were written behind. Any suggestions on what is the expected behaviour here? I think that it should wait for all pending writes. In the test case I used, all pending writes will fail the same way that the first one, but in other situations it's possible to have a write failing (for example due to a damaged block in disk) and following writes succeeding. From the man page of fsync: fsync() transfers (flushes) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. 
This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)). As I understand it, when fsync is received all queued writes must be sent to the device (regardless if a previous write has failed or not). It also says that the call blocks until the device has finished all the operations. However it's not clear to me how to control file consistency because this allows some writes to succeed after a failed one. Though fsync doesn't wait on queued writes after a failure, the queued writes are flushed to disk even in the existing codebase. Can you file a bug to make fsync to wait for completion of queued writes irrespective of whether flushing any of them failed or not? I'll send a patch to fix the issue. Just to prioritise this, how important is the fix? I assume that controlling this is the responsibility of the calling application that should issue fsyncs on critical points to guarantee consistency. Anyway it seems that there's a difference between linux and NetBSD because this test only fails on NetBSD. Is it possible that linux's fuse implementation delays the fsync request until all pending writes have been answered ? this would explain why this problem has not manifested till now. NetBSD seems to send fsync (probably as the first step of a close() call) when the first write fails. Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
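The behaviour being argued for in this thread (on fsync, flush every queued write even after one has failed, and report the first error) can be modelled outside gluster. This is a toy sketch of a write-behind queue, not the actual write-behind xlator; all names are illustrative.

```python
# Toy model of a write-behind queue: writes are acknowledged immediately
# and flushed later; fsync() flushes *all* queued writes, even after one
# fails, and reports the first error encountered.
class WriteBehind:
    def __init__(self, device):
        self.device = device      # callable (offset, data); raises OSError on failure
        self.queue = []
        self.first_error = None

    def write(self, offset, data):
        # acknowledged immediately; this is why write-behind improves performance
        self.queue.append((offset, data))
        return len(data)

    def fsync(self):
        # flush every queued write; do NOT stop at the first failure
        for offset, data in self.queue:
            try:
                self.device(offset, data)
            except OSError as e:
                if self.first_error is None:
                    self.first_error = e
        self.queue.clear()
        if self.first_error is not None:
            raise self.first_error

flushed = []
def device(offset, data):
    if offset == 4096:            # simulate one damaged block
        raise OSError(5, "EIO")
    flushed.append((offset, data))

wb = WriteBehind(device)
wb.write(0, b"a" * 4096)
wb.write(4096, b"b" * 4096)       # this write will fail on flush
wb.write(8192, b"c" * 4096)       # ...but this one must still reach the device
try:
    wb.fsync()
except OSError as e:
    print("fsync error:", e.errno)
print("flushed offsets:", [o for o, _ in flushed])
```

The key property is that the write at offset 8192 is flushed even though the write at 4096 failed, matching the "damaged block" scenario: one failed write must not prevent later queued writes from reaching stable storage.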
Re: [Gluster-devel] Wrong behavior on fsync of md-cache ?
- Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Emmanuel Dreyfus m...@netbsd.org Cc: Raghavendra Gowdappa rgowd...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Wednesday, November 26, 2014 2:05:58 PM Subject: Re: Wrong behavior on fsync of md-cache ? On 11/25/2014 06:45 PM, Xavier Hernandez wrote: On 11/25/2014 02:25 PM, Emmanuel Dreyfus wrote: On Tue, Nov 25, 2014 at 01:42:21PM +0100, Xavier Hernandez wrote: It seems to fail only in NetBSD. I'm not sure what priority it has. Emmanuel is trying to create a regression test for new patches that checks all tests in tests/basic, and tests/basic/ec/quota.t hits this issue. FWIW, I just tried to change NetBSD FUSE to queue fsync after write, but that does not help, I still crash in dht_writev_cbk() Not sure what could be the problem. I added a sleep between 'dd' and 'rm' to let all pending writes finish before removing the file and it seemed to pass the test reliably. On second thought, I think your change in fuse hasn't solved the problem because fuse really sees all answers to the writes it has sent. If I understand correctly how write-behind works, when it receives a write, it queues it to be processed later, but immediately returns an answer to the upper layers. This is why write-behind improves performance. This means that it won't be possible to solve the problem in the fuse layer because it doesn't have enough information about the real state of all caches. Raghavendra, if that's true, fsync will need to be propagated always, even in case of error, so that other xlators (even the remote brick filesystem) will have a chance to flush their caches, if any. Xavi, yes you are right. I'll take that into consideration. Not sure if this could be important/interesting, but I think (not really sure) that posix says that only answered writes must be flushed on fsync. If a recent write has been received but not yet answered when fsync is received, it's not mandatory to flush it.
Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] NetBSD regression tests: reviews required
- Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Emmanuel Dreyfus m...@netbsd.org, Vijay Bellur vbel...@redhat.com, Justin Clift jus...@gluster.org, Pranith Kumar Karampuri pkara...@redhat.com, Krishnan Parthasarathi kpart...@redhat.com, Raghavendra Gowdappa rgowd...@redhat.com Cc: gluster-devel@gluster.org Sent: Monday, December 1, 2014 2:32:32 PM Subject: Re: [Gluster-devel] NetBSD regression tests: reviews required On 12/01/2014 05:49 AM, Emmanuel Dreyfus wrote: Vijay Bellur vbel...@redhat.com wrote: And as the fixes crop up, I have a few others to share :-) More the merrier :-). Here is the latest list of NetBSD fixes for regression tests:

http://review.gluster.com/8982
http://review.gluster.com/9071
http://review.gluster.com/9075
http://review.gluster.com/9074
http://review.gluster.com/9212 [1]
http://review.gluster.com/9216 [2]
http://review.gluster.com/9217
http://review.gluster.com/9219
http://review.gluster.com/9220

[1] Krishnan Parthasarathi will probably want to improve the commit message before merging. [2] Here I fix the symptom rather than the cause. Hints are welcome to help fix the cause, but perhaps the symptom fix could be merged as an interim solution so that glustershd stops crashing during the test. The regression.sh script on nbslave71 and nbslave72 still disables two tests that always fail:

./tests/basic/afr/entry-self-heal.t - I am working on it
./tests/basic/ec/quota.t - Xavier Hernandez and Raghavendra Gowdappa may have a word about it.

A temporary solution, if you need to implement this very soon, is to add a sleep of a few seconds between the 'dd' and 'rm' commands in the quota.t script. This prevents the crash in DHT and allows the test to pass. I can do that if needed. Go ahead :). Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] explicit lookup of inodes linked via readdirp
- Original Message - From: Raghavendra Bhat rab...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Anand Avati aav...@redhat.com Sent: Thursday, December 18, 2014 12:31:41 PM Subject: [Gluster-devel] explicit lookup of inodes linked via readdirp Hi, In fuse I saw that, as part of resolving an inode, an explicit lookup is done on it if the inode is found to be linked via readdirp (at the time of linking in readdirp, fuse sets a flag in the inode context). It is done because many xlators, such as afr, depend upon the lookup call for many things, such as healing. Yes. But the lookup is a nameless lookup and hence is not sufficient. Some of the functionalities that get affected AFAIK are:

1. dht cannot create/heal directories and their layouts.
2. afr cannot identify a gfid mismatch of a file across its subvolumes, since to identify a gfid mismatch we need a name.

From what I heard, afr relies on crawls done by the self-heal daemon for named lookups. But dht is worst hit in terms of maintaining directory structure on newly added bricks (this problem is slightly different, since we don't hit it because of a nameless lookup after readdirp; instead it is because of a lack of a named lookup on the file after a graph switch. Nevertheless, I am clubbing both because a named lookup would've solved the issue). I've a feeling that different components have built their own ways of handling what is essentially the same issue. It's better we devise a single comprehensive solution. But that logic is not there in gfapi. I am thinking of introducing that mechanism in gfapi as well, where as part of resolve it checks if the inode is linked from readdirp. And if so, it will do an explicit lookup on that inode. As you've mentioned, a lookup gives afr a chance to heal the file. So, it's needed in gfapi too. However, you've to speak to the afr folks to discuss whether a nameless lookup is sufficient. NOTE: It can be done in the NFS server as well.
Dht in an NFS setup is also hit because of the lack of named lookups, resulting in non-healing of directories on newly added bricks. Please provide feedback. Regards, Raghavendra Bhat ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
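The flag-and-lookup mechanism under discussion (fuse marks inodes that were linked via readdirp, and forces an explicit lookup at resolve time; the proposal is to do the same in gfapi) can be sketched as below. All names here are hypothetical, not the real fuse/gfapi symbols.

```python
# Sketch of the proposed resolve path: inodes linked via readdirp carry a
# flag in their context, and resolve issues an explicit lookup first.
class Inode:
    def __init__(self, gfid):
        self.gfid = gfid
        self.linked_via_readdirp = False
        self.looked_up = False

def readdirp_link(inode):
    # readdirp links the inode without a lookup; flag it, as fuse does
    inode.linked_via_readdirp = True

def lookup(inode):
    # stands in for a real lookup fop; gives xlators like afr a chance
    # to inspect/heal the file before it is used
    inode.looked_up = True
    inode.linked_via_readdirp = False

def resolve(inode):
    if inode.linked_via_readdirp:
        lookup(inode)             # explicit lookup forced before resolving
    return inode

i = Inode("gfid-1")
readdirp_link(i)
resolve(i)
print(i.looked_up)   # True: the lookup was forced
```

Note that, as the reply points out, this sketch sidesteps the harder question in the thread: the forced lookup is nameless, which is not enough for dht directory healing or afr gfid-mismatch detection.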
Re: [Gluster-devel] explicit lookup of inodes linked via readdirp
+Pranith - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Raghavendra Bhat rab...@redhat.com Cc: Anand Avati aav...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Thursday, December 18, 2014 12:58:27 PM Subject: Re: [Gluster-devel] explicit lookup of inods linked via readdirp - Original Message - From: Raghavendra Bhat rab...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Anand Avati aav...@redhat.com Sent: Thursday, December 18, 2014 12:31:41 PM Subject: [Gluster-devel] explicit lookup of inods linked via readdirp Hi, In fuse I saw, that as part of resolving a inode, an explicit lookup is done on it if the inode is found to be linked via readdirp (At the time of linking in readdirp, fuse sets a flag in the inode context). It is done because, many xlators such as afr depend upon lookup call for many things such as healing. Yes. But the lookup is a nameless lookup and hence is not sufficient enough. Some of the functionalities that get affected AFAIK are: 1. dht cannot create/heal directories and their layouts. 2. afr cannot identify gfid mismatch of a file across its subvolumes, since to identify a gfid mismatch we need a name. From what I heard, afr relies on crawls done by self-heal daemon for named-lookups. But dht is worst hit in terms of maintaining directory structure on newly added bricks (this problem is slightly different, since we don't hit this because of nameless lookup after readdirp. Instead it is because of a lack of named-lookup on the file after a graph switch. Neverthless I am clubbing both because a named lookup would've solved the issue). I've a feeling that different components have built their own way of handling what is essentially same issue. Its better we devise a single comprehensive solution. But that logic is not there in gfapi. I am thinking of introducing that mechanism in gfapi as well, where as part of resolve it checks if the inode is linked from readdirp. 
And if so, it will do an explicit lookup on that inode. As you've mentioned, a lookup gives afr a chance to heal the file. So, it's needed in gfapi too. However, you've to speak to the afr folks to discuss whether a nameless lookup is sufficient. NOTE: It can be done in the NFS server as well. Dht in an NFS setup is also hit because of the lack of named lookups, resulting in non-healing of directories on newly added bricks. Please provide feedback. Regards, Raghavendra Bhat ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Problems with graph switch in disperse
Do you know the origins of EIO? fuse-bridge only fails a lookup fop with EIO (when NULL gfid is received in a successful lookup reply). So, there might be other xlator which is sending EIO. - Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, December 24, 2014 6:25:17 PM Subject: [Gluster-devel] Problems with graph switch in disperse Hi, I'm experiencing a problem when gluster graph is changed as a result of a replace-brick operation (probably with any other operation that changes the graph) while the client is also doing other tasks, like writing a file. When operation starts, I see that the replaced brick is disconnected, but writes continue working normally with one brick less. At some point, another graph is created and comes online. Remaining bricks on the old graph are disconnected and the old graph is destroyed. I see how new write requests are sent to the new graph. This seems correct. However there's a point where I see this: [2014-12-24 11:29:58.541130] T [fuse-bridge.c:2305:fuse_write_resume] 0-glusterfs-fuse: 2234: WRITE (0x16dcf3c, size=131072, offset=255721472) [2014-12-24 11:29:58.541156] T [ec-helpers.c:101:ec_trace] 2-ec: WIND(INODELK) 0x7f8921b7a9a4(0x7f8921b78e14) [refs=5, winds=3, jobs=1] frame=0x7f8932e92c38/0x7f8932e9e6b0, min/exp=3/3, err=0 state=1 {111:000:000} idx=0 [2014-12-24 11:29:58.541292] T [rpc-clnt.c:1384:rpc_clnt_record] 2-patchy-client-0: Auth Info: pid: 0, uid: 0, gid: 0, owner: d025e932897f [2014-12-24 11:29:58.541296] T [io-cache.c:133:ioc_inode_flush] 2-patchy-io-cache: locked inode(0x16d2810) [2014-12-24 11:29:58.541354] T [rpc-clnt.c:1241:rpc_clnt_record_build_header] 2-rpc-clnt: Request fraglen 152, payload: 84, rpc hdr: 68 [2014-12-24 11:29:58.541408] T [io-cache.c:137:ioc_inode_flush] 2-patchy-io-cache: unlocked inode(0x16d2810) [2014-12-24 11:29:58.541493] T [io-cache.c:133:ioc_inode_flush] 2-patchy-io-cache: locked inode(0x16d2810) [2014-12-24 
11:29:58.541536] T [io-cache.c:137:ioc_inode_flush] 2-patchy-io-cache: unlocked inode(0x16d2810)
[2014-12-24 11:29:58.541537] T [rpc-clnt.c:1577:rpc_clnt_submit] 2-rpc-clnt: submitted request (XID: 0x17 Program: GlusterFS 3.3, ProgVers: 330, Proc: 29) to rpc-transport (patchy-client-0)
[2014-12-24 11:29:58.541646] W [fuse-bridge.c:2271:fuse_writev_cbk] 0-glusterfs-fuse: 2234: WRITE => -1 (Input/output error)

It seems that fuse still has a write request pending for graph 0. It is resumed, but it returns EIO without calling the xlator stack (operations seen between the two log messages are from other operations and they are sent to graph 2). I'm not sure why this happens or how I should avoid it. I tried the same scenario with replicate and it seems to work, so there must be something wrong in disperse, but I don't see where the problem could be. Any ideas? Thanks, Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
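One way to picture the failure mode in the log (a request created against graph 0 being resumed after that graph has been destroyed) is the following toy model. This is a hypothesis about the mechanism, not gluster's actual fuse-bridge code; all names are illustrative.

```python
# Toy model of the graph-switch hazard: a pending request keeps a reference
# to the graph it was created on; if that graph is torn down before the
# request is resumed, the resume cannot reach any xlator and fails with EIO.
import errno

class Graph:
    def __init__(self, gen):
        self.gen = gen
        self.alive = True

def resume(request):
    if not request["graph"].alive:
        return -errno.EIO          # the WRITE => -1 (Input/output error) case
    return request["size"]         # would normally wind down the xlator stack

old_graph, new_graph = Graph(0), Graph(2)
req = {"graph": old_graph, "size": 131072}   # WRITE queued before the switch
old_graph.alive = False                      # old graph destroyed on switch
print(resume(req))                           # fails: -5 (EIO)

# Had the pending request been migrated to the new graph before being
# resumed, it would have succeeded:
req["graph"] = new_graph
print(resume(req))                           # 131072
```

If this model matches what is happening, the fix direction would be to migrate pending requests (or their fds) to the new graph during the switch, rather than resuming them against the dead one.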
Re: [Gluster-devel] Quota problems without a way of fixing them
- Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Joe Julian j...@julianfamily.org Cc: Gluster-devel@gluster.org gluster-devel@gluster.org, pk...@grid.auth.gr Sent: Thursday, January 22, 2015 12:58:47 PM Subject: Re: [Gluster-devel] Quota problems without a way of fixing them - Original Message - From: Joe Julian j...@julianfamily.org To: Raghavendra Gowdappa rgowd...@redhat.com Cc: pk...@grid.auth.gr, Gluster-devel@gluster.org gluster-devel@gluster.org Sent: Thursday, January 22, 2015 11:16:39 AM Subject: Re: [Gluster-devel] Quota problems without a way of fixing them On 01/21/2015 09:32 PM, Raghavendra Gowdappa wrote: - Original Message - From: Joe Julian j...@julianfamily.org To: Gluster Devel gluster-devel@gluster.org Cc: Paschalis Korosoglou pk...@grid.auth.gr Sent: Thursday, January 22, 2015 12:54:44 AM Subject: [Gluster-devel] Quota problems without a way of fixing them Paschalis (PeterA in #gluster) has reported these bugs, and we've tried to find the source of the problem to no avail. Worse yet, there's no way to just reset the quotas to match what's actually there, as far as I can tell. What should we look for to isolate the source of this problem, since this is a production system with enough activity to make isolating a repro difficult at best, and the debug logs have enough noise to make isolation nearly impossible? Finally, isn't there some simple way to trigger quota to rescan a path and reset trusted.glusterfs.quota.size?

1. Delete the following xattrs from all the files/directories on all the bricks:
   a) trusted.glusterfs.quota.size
   b) trusted.glusterfs.quota.*.contri
   c) trusted.glusterfs.quota.dirty
2. Turn off md-cache:
   # gluster volume set volname performance.stat-prefetch off
3. Mount glusterfs asking it to use readdir instead of readdirp:
   # mount -t glusterfs -o use-readdirp=no volfile-server:volfile-id /mnt/glusterfs
4.
Do a crawl on the mountpoint:

# find /mnt/glusterfs -exec stat \{} \; > /dev/null

This should correct the accounting on the bricks. Once done, you should see correct values in the quota list output. Please let us know if it doesn't work for you. But that could be a months-long process with the size of many of our users' volumes. There should be a way to do this with a single directory tree. If you can isolate a sub-directory tree where size accounting has gone bad, this can be done by setting the xattr trusted.glusterfs.quota.dirty of a directory to 1 and sending a lookup on that directory. (But the problem with this approach is: how do we know whether the parents of this sub-directory have a correct size? If a subdirectory has a wrong size, then most likely the accounting of all the ancestors of that sub-directory up to the root has gone bad. Hence I am skeptical about just healing part of a directory tree.) Basically what this does is add up the sizes of all immediate children and set that sum as the value of trusted.glusterfs.quota.size on the directory. But the catch here is that the sizes of the immediate children need not be accounted correctly. Hence this healing should be done bottom-up, starting with the bottom-most directory and healing towards the top-level subdirectory which is isolated. We can have an algorithm like this:

void
heal (char *path)
{
        char        value = 1;
        struct stat stbuf = {0, };

        setxattr (path, "trusted.glusterfs.quota.dirty",
                  (const void *) &value, sizeof (value), 0);

        /* now that the dirty xattr has been set, trigger a lookup so that
         * the directory is healed */
        stat (path, &stbuf);
}

void
crawl (DIR *dirfd, char *path)
{
        struct dirent *result = NULL;

        while ((result = readdir (dirfd)) != NULL) {
                if (IA_ISDIR (result->d_type)) {
                        char *childpath = construct_path (path, result->d_name);
                        DIR  *childfd   = opendir (childpath);

                        crawl (childfd, childpath);
                        closedir (childfd);
                }
        }

        heal (path);
}

Now call crawl on the isolated sub-directory (on the mountpoint).
Note that the above is pseudo-code; a tool should be written using this algorithm. We'll try to add a program to extras/utils which does this. His production system has been unmanageable for months now. Is it possible for someone to spare some cycles to get this looked at? 2013-03-04 - https://bugzilla.redhat.com/show_bug.cgi?id=917901 2013-10-24 - https://bugzilla.redhat.com/show_bug.cgi?id=1023134 We are working on these bugs. We'll update the bugzilla once we find anything substantial. ___ Gluster-devel mailing list Gluster-devel@gluster.org
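The bottom-up ordering the algorithm relies on is a plain post-order traversal: every subdirectory is healed before its parent, so that a parent's size is recomputed only from already-correct children. A minimal runnable sketch of just that ordering, using an in-memory tree instead of real directories (the real tool would call heal() per the pseudo-code above):

```python
# Post-order (bottom-up) crawl: children are appended to heal_order before
# their parent, matching the healing order the algorithm requires.
# Illustrative only; paths and tree shape are made up.
def crawl(tree, path, heal_order):
    for name, subtree in tree.items():
        crawl(subtree, path + "/" + name, heal_order)
    heal_order.append(path)   # heal(path): set dirty=1, then stat()

tree = {"a": {"b": {}, "c": {"d": {}}}}
order = []
crawl(tree, "/mnt/glusterfs", order)
print(order)
```

Running this yields the leaves first and the starting directory last, which is exactly the invariant needed: by the time a directory is marked dirty and looked up, all of its immediate children already carry corrected sizes.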
Re: [Gluster-devel] cannot delete non-empty directory
- Original Message - From: David F. Robinson david.robin...@corvidtec.com To: Shyam srang...@redhat.com, Gluster Devel gluster-devel@gluster.org, gluster-us...@gluster.org, Susant Palai spa...@redhat.com Sent: Monday, February 9, 2015 10:55:44 PM Subject: Re: [Gluster-devel] cannot delete non-empty directory

So, just to be sure before I do this, it is okay to do the following if I want to get rid of everything in the /old_shelf4/Aegis directory and below? rm -rf /data/brick*/homegfs_bkp/backup.0/old_shelf4/Aegis

Yes. This will solve the issue you are facing now. After this, Aegis can be removed from the mount point.

What happens to all of the files in the .glusterfs directory? Does this get rebuilt, or do the links stay there for files that now no longer exist?

The links stay there for files that now no longer exist. This is not an issue, except that we'll be losing an inode (no data blocks, as the file size was 0).

And, is this the same issue that causes all of the broken links in .glusterfs? See attached image for an example. There appear to be a lot of broken links in the .glusterfs directories. Is this normal, or does it indicate another problem?

There can be other issues which can result in links not getting deleted from the .glusterfs directory. The current issue is not related to that.

Finally, if I search through the /data/brick* directories, should I find no entries of ---------T permission files with zero length? Do I need to clean all of these up somehow? A quick look at /data/brick01bkp/homegfs_bkp/.glusterfs/2f/54 shows many of these files. They look like:

---------T 3 rbhinge pme_ics 0 Jan 9 16:45 2f54d7d6-968b-442f-8cfe-eff01d6cefe7
---------T 2 rbhinge pme_ics 0 Jan 9 21:40 2f54d7e7-b198-4fd4-aec7-f5d0ff020f72

How do I find out what file these entries were pointing to?

As Shyam had mentioned in an earlier mail, these files represent dht linkto files. They are a sort of metadata containing the name of the subvolume where the actual file is stored (hence the name linkto).
The destination to which this linkto file points is stored in xattrs. A dump of all the xattrs matching the regex trusted.glusterfs.* should list them; the value of the trusted.glusterfs.dht.linkto xattr gives the destination subvolume. If the file is not present on the destination, then it's a stale linkto file pointing to a non-existent file (on the destination subvol) and it can be removed. Otherwise it is valid and shouldn't be removed. Again, as Shyam mentioned in a previous mail, [1] should've fixed the issue (it is present in v3.6.0 and above). Not sure why we are seeing this issue again. [1] http://review.gluster.org/8602 David

-- Original Message -- From: Shyam srang...@redhat.com To: David F. Robinson david.robin...@corvidtec.com; Gluster Devel gluster-devel@gluster.org; gluster-us...@gluster.org; Susant Palai spa...@redhat.com Sent: 2/9/2015 11:11:20 AM Subject: Re: [Gluster-devel] cannot delete non-empty directory

On 02/08/2015 12:19 PM, David F. Robinson wrote: I am seeing these messages after I delete large amounts of data using gluster 3.6.2: cannot delete non-empty directory: old_shelf4/Aegis/!!!Programs/RavenCFD/Storage/Jimmy_Old/src_vj1.5_final

From the FUSE mount (as root), the directory shows up as empty:

# pwd
/backup/homegfs/backup.0/old_shelf4/Aegis/!!!Programs/RavenCFD/Storage/Jimmy_Old/src_vj1.5_final
# ls -al
total 5
d--------- 2 root root 4106 Feb 6 13:55 .
drwxrws--- 3 601 dmiller 72 Feb 6 13:55 ..

However, when you look at the bricks, the files are still there (none on brick01bkp, all files are on brick02bkp). All of the files are 0-length and have ---------T permissions.

These are linkto files that are created by DHT, which basically means the files were either renamed or the brick layout changed (I suspect the former to be the cause). These files should have been deleted when the files that they point to were deleted; it looks like this did not happen.
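The stale-versus-valid decision described here can be sketched as follows (illustrative Python, not gluster code; the xattr read and the existence check are passed in as callables so the logic can be shown without a real brick, and all path and subvolume names below are made up):

```python
# Hypothetical sketch of the stale-linkto check: read the destination
# subvolume from the trusted.glusterfs.dht.linkto xattr, and call the linkto
# file stale when that destination no longer holds the real file.
def classify_linkto(path, getxattr, exists_on_dest):
    # On a brick, getxattr would be os.getxattr; the xattr value is a
    # NUL-terminated subvolume name such as "homegfs_bkp-client-0".
    dest = getxattr(path, "trusted.glusterfs.dht.linkto").rstrip(b"\0").decode()
    # Stale: destination subvol no longer has the real file -> safe to remove.
    return dest, ("stale" if not exists_on_dest(dest, path) else "valid")

# Demo with stubbed callables (names invented for illustration):
read_stub = lambda path, name: b"homegfs_bkp-client-0\0"
dest, verdict = classify_linkto("/data/brick02bkp/homegfs_bkp/somefile",
                                read_stub,
                                lambda d, p: False)  # file gone on destination
print(dest, verdict)
```

On an actual brick the xattr can be inspected with getfattr, as described in the next mail; the sketch above only captures the decision rule.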
Can I get the following information for some of the files here?
- getfattr -d -m . -e text <path to file on brick>
- The output of the trusted.glusterfs.dht.linkto xattr should state where the real file belongs; in this case, as there are only 2 bricks, it should be the brick01bkp subvol.
- As the second brick is empty, we should be able to safely delete these files from the brick and proceed to do an rmdir on the mount point of the volume, as the directory is now empty.
- Please check the one sub-directory that is showing up in this case as well, save1.

Any suggestions on how to fix this and how to prevent it from happening? I believe there are renames happening here, possibly by the archive creator. One way to prevent the rename from creating a linkto file is to use the DHT set parameter to set a pattern so that the file name hash considers only the static part of
Re: [Gluster-devel] mandatory lock
- Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Harmeet Kalsi kharm...@hotmail.com Cc: Gluster-devel@gluster.org gluster-devel@gluster.org Sent: Thursday, January 8, 2015 4:12:44 PM Subject: Re: [Gluster-devel] mandatory lock

- Original Message - From: Harmeet Kalsi kharm...@hotmail.com To: Gluster-devel@gluster.org gluster-devel@gluster.org Sent: Wednesday, January 7, 2015 5:55:43 PM Subject: [Gluster-devel] mandatory lock

Dear All. Would it be possible for someone to guide me in the right direction to enable the mandatory lock on a volume please? At the moment two clients can edit the same file at the same time, which is causing issues.

I see code related to mandatory locking in the posix-locks xlator (pl_writev, pl_truncate, etc.). To enable it you have to set "option mandatory-locks yes" in the posix-locks xlator loaded on the bricks (/var/lib/glusterd/vols/<volname>/*.vol). We have no way to set this option through the gluster CLI. Also, I am not sure to what extent this feature is tested/used till now. You can try it out and please let us know whether it worked for you :). If mandatory locking doesn't work for you, can you modify your application to use advisory locking, since advisory locking is well tested and has been in use for a long time?

Many thanks in advance Kind Regards ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
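For illustration, the relevant stanza in a brick volfile would look something like the fragment below. The volume and subvolume names here are invented and depend on your volume; also note that glusterd regenerates volfiles on volume changes, so a hand edit like this can be overwritten.

```
volume myvol-locks
    type features/locks
    option mandatory-locks yes
    subvolumes myvol-access-control
endvolume
```

After editing, the bricks would need a restart for the option to take effect.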
Re: [Gluster-devel] spurious failure in quota-nfs.t
- Original Message - From: Sachin Pandit span...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, May 5, 2015 3:24:59 PM Subject: Re: [Gluster-devel] spurious failure in quota-nfs.t - Original Message - From: Pranith Kumar Karampuri pkara...@redhat.com To: Vijaikumar Mallikarjuna vmall...@redhat.com, Sachin Pandit span...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, May 5, 2015 8:43:18 AM Subject: spurious failure in quota-nfs.t hi Vijai/Sachin, http://build.gluster.org/job/rackspace-regression-2GB-triggered/8268/console Doesn't seem like an obvious failure. Know anything about it? Hi Pranith, I checked the logs and could not find any significant information. It seems like the marker has failed to update the extended attributes till the root. Any idea why it couldn't update till root? I have started the execution of this test case in a loop, it has completed more than 20 successful runs till now. I will update in this thread if I make any progress on root causing the issue. ~ Sachin. Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Moratorium on new patch acceptance
- Original Message - From: Shyam srang...@redhat.com To: gluster-devel@gluster.org Sent: Tuesday, May 19, 2015 6:13:06 AM Subject: Re: [Gluster-devel] Moratorium on new patch acceptance

On 05/18/2015 07:05 PM, Shyam wrote: On 05/18/2015 03:49 PM, Shyam wrote: On 05/18/2015 10:33 AM, Vijay Bellur wrote:

The etherpad did not call out ./tests/bugs/distribute/bug-1161156.t, which did not have an owner, so I took a stab at it; below are the results. I also think the failure in ./tests/bugs/quota/bug-1038598.t is the same as the observation below. NOTE: Anyone with better knowledge of Quota can possibly chip in as to what we should expect in this case and how to correct the expectation of these test cases.

(Details of ./tests/bugs/distribute/bug-1161156.t)
1) Failure is in TEST #20 Failed line: TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=1k count=10240 conv=fdatasync
2) The above line is expected to fail (i.e. dd is expected to fail), as the set quota is 20MB and we are attempting to exceed it by another 5MB at this point in the test case.
3) The failure is easily reproducible on my laptop, 2/10 times.
4) On debugging, I see that when the above dd succeeds (or the test fails, which means dd succeeded in writing more than the set quota), there are no write errors from the bricks or any errors on the final COMMIT RPC call to NFS. As a result the expectation of this test fails. NOTE: Sometimes there is a write failure from one of the bricks (the above test uses AFR as well), but AFR self-healing kicks in and fixes the problem, as expected, as the write succeeded on one of the replicas. I add this observation as the failed regression run logs have some EDQUOT errors reported in the client xlator, but only from one of the client bricks, and there are further AFR self-heal logs noted in the logs.
5) When the test case succeeds, the writes fail with EDQUOT as expected. There are times when the quota is exceeded by, say, 1MB - 4.8MB, but the test case still passes.
Which means that, if we were to try to exceed the quota by 1MB (instead of the 5MB as in the test case), this test case may always fail. Here is why I think this slips past quota sometimes and not others, making this and the other test case mentioned below spurious. - Each write is 256K from the client (that is what is sent over the wire) - If more IO was queued by io-threads after passing quota checks, which in this 5MB case requires 20 IOs to be queued (16 IOs could be active in io-threads itself), we could end up writing more than the quota amount. So, if quota checks to see whether a write is violating the quota, lets it through, and updates the space used on the UNWIND for future checks, we could have more IO outstanding than what the quota allows, and as a result allow such a larger write to pass through, considering the io-threads queue and active IOs as well. Would this be a fair assumption of how quota works? Yes, this is a possible scenario. There is a finite time window between: 1. Querying the size of a directory, in other words checking whether the current write can be allowed 2. The effect of this write getting reflected in the size of all the parent directories of a file till root. If 1 and 2 were atomic, another parallel write which could've exceeded the quota-limit could not have slipped through. Unfortunately, in the current scheme of things they are not atomic. Now there can be parallel writes in this test case because of nfs-client and/or glusterfs write-back (though we have one single-threaded application - dd - running). One way of testing this hypothesis is to disable nfs and glusterfs write-back and run the same (unmodified) test, and the test should always succeed (dd should fail). To disable write-back in nfs you can use the noac option while mounting. The situation becomes worse in real-life scenarios because of the parallelism involved at many layers: 1.
multiple applications, each possibly being multithreaded, writing to possibly many (or a single) file(s) in a quota subtree 2. write-back in the NFS client and glusterfs 3. Multiple bricks holding files of a quota subtree, each brick simultaneously processing many write requests through io-threads. I've tried in the past to fix the issue, though unsuccessfully. It seems to me that one effective strategy is to make enforcement and the update of the parents' sizes atomic. But if we do that, we end up adding the latency of accounting to the latency of the fop. Other options can be explored. But, our Quota functionality requirements allow a buffer of 10% while enforcing limits. So, this issue has not been high on our priority list till now. So, our tests should also expect failures allowing for this 10% buffer. I believe this is what is happening in this case. Checking a fix on my machine, and will post the same if it proves to help the situation. Posted a patch to
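The enforcement-versus-accounting window described in this thread can be reduced to a toy model (illustrative Python; the numbers mirror the test case: 20MB limit, 256KB wire writes, 16 in-flight IOs in io-threads). Every write issued while accounting lags passes the quota check against a stale size, so the subtree can overshoot by up to in-flight × write-size:

```python
# Toy model of non-atomic quota enforcement: a write is checked against the
# accounted size when it is issued, but the accounted size is updated only
# when a write completes.  With `inflight` writes outstanding, up to that
# many writes slip past the limit.
QUOTA = 20 * 1024 * 1024      # quota limit (20MB, as in the test case)
WRITE = 256 * 1024            # wire write size

def run(total, inflight):
    used = written = 0        # used: accounted size; written: issued bytes
    queue = []                # writes issued but not yet accounted
    while written < total:
        if used >= QUOTA:     # enforcement sees only the accounted size
            break
        queue.append(WRITE)
        written += WRITE
        if len(queue) >= inflight:
            used += queue.pop(0)   # accounting catches up one IO at a time
    return used + sum(queue)  # drain: remaining completions get accounted

overshoot = run(25 * 1024 * 1024, inflight=16) - QUOTA
print(overshoot // 1024, "KB written past the limit")
```

The model prints an overshoot just under inflight × 256KB (3840 KB here), comparable to the 1MB - 4.8MB excess observed in the test. With inflight=1, i.e. accounting made synchronous with enforcement, the model never exceeds the limit, which is why disabling the write-back caches is expected to make dd fail reliably.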
Re: [Gluster-devel] Moratorium on new patch acceptance
- Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Shyam srang...@redhat.com Cc: gluster-devel@gluster.org Sent: Tuesday, May 19, 2015 11:46:19 AM Subject: Re: [Gluster-devel] Moratorium on new patch acceptance - Original Message - From: Shyam srang...@redhat.com To: gluster-devel@gluster.org Sent: Tuesday, May 19, 2015 6:13:06 AM Subject: Re: [Gluster-devel] Moratorium on new patch acceptance On 05/18/2015 07:05 PM, Shyam wrote: On 05/18/2015 03:49 PM, Shyam wrote: On 05/18/2015 10:33 AM, Vijay Bellur wrote: The etherpad did not call out, ./tests/bugs/distribute/bug-1161156.t which did not have an owner, and so I took a stab at it and below are the results. I also think failure in ./tests/bugs/quota/bug-1038598.t is the same as the observation below. NOTE: Anyone with better knowledge of Quota can possibly chip in as to what should we expect in this case and how to correct the expectation from these test cases. (Details of ./tests/bugs/distribute/bug-1161156.t) 1) Failure is in TEST #20 Failed line: TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=1k count=10240 conv=fdatasync 2) The above line is expected to fail (i.e dd is expected to fail) as, the set quota is 20MB and we are attempting to exceed it by another 5MB at this point in the test case. 3) The failure is easily reproducible in my laptop, 2/10 times 4) On debugging, I see that when the above dd succeeds (or the test fails, which means dd succeeded in writing more than the set quota), there are no write errors from the bricks or any errors on the final COMMIT RPC call to NFS. As a result the expectation of this test fails. NOTE: Sometimes there is a write failure from one of the bricks (the above test uses AFR as well), but AFR self healing kicks in and fixes the problem, as expected, as the write succeeded on one of the replicas. 
I add this observation as the failed regression run logs have some EDQUOT errors reported in the client xlator, but only from one of the client bricks, and there are further AFR self-heal logs noted in the logs. 5) When the test case succeeds, the writes fail with EDQUOT as expected. There are times when the quota is exceeded by, say, 1MB - 4.8MB, but the test case still passes. Which means that, if we were to try to exceed the quota by 1MB (instead of the 5MB as in the test case), this test case may always fail. Here is why I think this slips past quota sometimes and not others, making this and the other test case mentioned below spurious. - Each write is 256K from the client (that is what is sent over the wire) - If more IO was queued by io-threads after passing quota checks, which in this 5MB case requires 20 IOs to be queued (16 IOs could be active in io-threads itself), we could end up writing more than the quota amount. So, if quota checks to see whether a write is violating the quota, lets it through, and updates the space used on the UNWIND for future checks, we could have more IO outstanding than what the quota allows, and as a result allow such a larger write to pass through, considering the io-threads queue and active IOs as well. Would this be a fair assumption of how quota works? Yes, this is a possible scenario. There is a finite time window between: 1. Querying the size of a directory, in other words checking whether the current write can be allowed 2. The effect of this write getting reflected in the size of all the parent directories of a file till root. If 1 and 2 were atomic, another parallel write which could've exceeded the quota-limit could not have slipped through. Unfortunately, in the current scheme of things they are not atomic. Now there can be parallel writes in this test case because of nfs-client and/or glusterfs write-back (though we have one single-threaded application - dd - running).
One way of testing this hypothesis is to disable nfs and glusterfs write-back and run the same (unmodified) test, and the test should always succeed (dd should fail). To disable write-back in nfs you can use the noac option while mounting. The situation becomes worse in real-life scenarios because of the parallelism involved at many layers: 1. multiple applications, each possibly being multithreaded, writing to possibly many (or a single) file(s) in a quota subtree 2. write-back in the NFS client and glusterfs 3. Multiple bricks holding files of a quota subtree, each brick simultaneously processing many write requests through io-threads 4. Background accounting of directory sizes _after_ a write is complete. I've tried in the past to fix the issue, though unsuccessfully. It seems to me that one effective strategy is to make enforcement and the update of the parents' sizes atomic. But if we do that, we end up adding the latency of accounting to the latency of the fop. Other options can
Re: [Gluster-devel] Moratorium on new patch acceptance
- Original Message - From: Vijay Bellur vbel...@redhat.com To: Raghavendra Gowdappa rgowd...@redhat.com, Shyam srang...@redhat.com Cc: gluster-devel@gluster.org Sent: Tuesday, May 19, 2015 1:29:57 PM Subject: Re: [Gluster-devel] Moratorium on new patch acceptance On 05/19/2015 12:21 PM, Raghavendra Gowdappa wrote: Yes, this is a possible scenario. There is a finite time window between, 1. Querying the size of a directory. In other words checking whether current write can be allowed 2. The effect of this write getting reflected in size of all the parent directories of a file till root If 1 and 2 were atomic, another parallel write which could've exceed the quota-limit could not have slipped through. Unfortunately, in the current scheme of things they are not atomic. Now there can be parallel writes in this test case because of nfs-client and/or glusterfs write-back (though we've one single threaded application - dd - running). One way of testing this hypothesis is to disable nfs and glusterfs write-back and run the same (unmodified) test and the test should succeed always (dd should fail). To disable write-back in nfs you can use noac option while mounting. The situation becomes worse in real-life scenarios because of parallelism involved at many layers: 1. multiple applications, each possibly being multithreaded writing to possibly many/or single file(s) in a quota subtree 2. write-back in NFS-client and glusterfs 3. Multiple bricks holding files of a quota-subtree. Each brick processing simultaneously many write requests through io-threads. 4. Background accounting of directory sizes _after_ a write is complete. I've tried in past to fix the issue, though unsuccessfully. It seems to me that one effective strategy is to make enforcement and updation of size of parents atomic. But if we do that we end up adding latency of accounting to latency of fop. Other options can be explored. 
But, our Quota functionality requirements allow a buffer of 10% while enforcing limits. So, this issue has not been high on our priority list till now. So, our tests should also expect failures allowing for this 10% buffer. Since most of our tests are a single instance of single threaded dd running on a single mount, if the hypothesis turns out true, we can turn off nfs-client and glusterfs write-back in all tests related to Quota. Comments? Even with write-behind enabled, dd should get a failure upon close() if quota were to return EDQUOT for any of the writes. I suspect that flush-behind being enabled by default in write-behind can mask a failure for close(). Disabling flush-behind in the tests might take care of fixing the tests. No, my suggestion was aimed at not having parallel writes. In this case quota won't even fail the writes with EDQUOT because of reasons explained above. Yes, we need to disable flush-behind along with this so that errors are delivered to application. It would be good to have nfs + quota coverage in the tests. So let us not disable nfs tests for quota. The suggestion was to continue using nfs, but preventing nfs-clients from using a write-back cache. Thanks, Vijay ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
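Concretely, the suggestions in this exchange amount to something like the following (volume name invented; noac is the NFS mount option mentioned above, and the performance.* settings are existing gluster volume options — treat this as a sketch rather than tested commands):

```
# NFS client: mount with noac, per the suggestion above, so the client
# does not cache writes and issue them in parallel
mount -t nfs -o vers=3,noac server:/myvol /mnt/myvol

# glusterfs: disable write-behind and flush-behind so that EDQUOT reaches
# the application on write()/close() instead of being absorbed
gluster volume set myvol performance.write-behind off
gluster volume set myvol performance.flush-behind off
```

This keeps NFS coverage for the quota tests while removing the client-side parallelism that lets writes slip past enforcement.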
Re: [Gluster-devel] Regression failures
Few more:
1. ./tests/basic/afr/sparse-file-self-heal.t From http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/5887/consoleFull
2. ./tests/bugs/protocol/bug-808400-repl.t From http://build.gluster.org/job/rackspace-regression-2GB-triggered/9920/consoleFull
3. ./tests/bugs/quota/inode-quota.t From http://build.gluster.org/job/rackspace-regression-2GB-triggered/10035/consoleFull

The last two failures are on the same patch, http://review.gluster.org/#/c/10967/. Though I've not had time to look into each failure, these seem like spurious/intermittent failures, since the patch passed all the regressions on master and different tests are failing on different runs.

- Original Message - From: Sachin Pandit span...@redhat.com To: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, June 3, 2015 4:36:50 PM Subject: [Gluster-devel] Regression failures

Hi, http://review.gluster.org/#/c/11024/ failed in the tests/basic/volume-snapshot-clone.t testcase. http://build.gluster.org/job/rackspace-regression-2GB-triggered/10057/consoleFull http://review.gluster.org/#/c/11000/ failed in the tests/bugs/replicate/bug-979365.t testcase. http://build.gluster.org/job/rackspace-regression-2GB-triggered/9985/consoleFull Seems like a spurious failure. Can anyone please have a look at this? Regards, Sachin Pandit. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing
- Original Message - From: Atin Mukherjee amukh...@redhat.com To: Niels de Vos nde...@redhat.com, Vijaikumar M vmall...@redhat.com Cc: Raghavendra G raghaven...@gluster.com, Gluster Devel gluster-devel@gluster.org Sent: Wednesday, June 24, 2015 10:15:12 AM Subject: Re: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing

When is this patch getting merged? It is blocking other patches from getting in.

The revert of http://review.gluster.org/11311 is waiting for regression runs to pass. There are three patches (duplicates of each other); if any one of them passes both regression runs, I'll merge it. As far as the refcounting mechanism goes, it'll take some time to review and merge the patch. ~Atin

On 06/23/2015 06:26 PM, Niels de Vos wrote: On Tue, Jun 23, 2015 at 05:30:39PM +0530, Vijaikumar M wrote: On Tuesday 23 June 2015 04:28 PM, Niels de Vos wrote: On Tue, Jun 23, 2015 at 03:45:43PM +0530, Vijaikumar M wrote: I have submitted the patch below which fixes this issue. I am handling memory clean-up with a reference count mechanism. http://review.gluster.org/#/c/11361

Is there a reason you can not use the (new) refcounting functions that were introduced with http://review.gluster.org/11022 ?

I was not aware that the ref-counting patch was merged. Sure, we will use these functions and re-submit my patch. Ok, thanks! Niels Thanks, Vijay

It would be nicer to standardize all refcounting mechanisms on one implementation. I hope we can replace existing refcounting with this one too. Introducing more refcounting ways is not going to be helpful. Thanks, Niels Thanks, Vijay

On Tuesday 23 June 2015 12:58 PM, Raghavendra G wrote: Multiple replies to same query. Pick one ;). On Tue, Jun 23, 2015 at 12:55 PM, Venky Shankar yknev.shan...@gmail.com wrote: OK. Two reverts of the same patch ;) Pick one.
On Tue, Jun 23, 2015 at 12:51 PM, Raghavendra Gowdappa rgowd...@redhat.com wrote: Seems like it's a memory corruption caused by: http://review.gluster.org/11311 I've reverted the patch at: http://review.gluster.org/11360

- Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, June 23, 2015 12:44:47 PM Subject: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing

Hi, the quota test bug-1153964.t is failing consistently for a totally unrelated patch. Is this a known issue? http://build.gluster.org/job/rackspace-regression-2GB-triggered/11142/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11165/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11172/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11191/consoleFull Xavi

-- Raghavendra G -- ~Atin ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing
Seems like it's a memory corruption caused by: http://review.gluster.org/11311 I've reverted the patch at: http://review.gluster.org/11360

- Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, June 23, 2015 12:44:47 PM Subject: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing

Hi, the quota test bug-1153964.t is failing consistently for a totally unrelated patch. Is this a known issue? http://build.gluster.org/job/rackspace-regression-2GB-triggered/11142/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11165/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11172/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11191/consoleFull Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Unable to send patches to gerrit
Now I am not able to open/review/merge any patches on review.gluster.org either.

- Original Message - From: Avra Sengupta aseng...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com, Anoop C S achir...@redhat.com, gluster-devel@gluster.org, Kaushal M kshlms...@gmail.com, Vijay Bellur vbel...@redhat.com, gluster-infra gluster-in...@gluster.org Sent: Thursday, 11 June, 2015 11:16:15 AM Subject: Re: [Gluster-devel] Unable to send patches to gerrit

+Adding gluster-infra

On 06/11/2015 10:39 AM, Pranith Kumar Karampuri wrote: Last time this happened, Kaushal/Vijay fixed it if I remember correctly. +kaushal +Vijay Pranith

On 06/11/2015 10:38 AM, Anoop C S wrote: On 06/11/2015 10:33 AM, Ravishankar N wrote: I'm unable to push a patch on release-3.6, getting different errors every time: This happens for master too. I continuously get the following error: error: unpack failed: error No space left on device

[ravi@tuxpad glusterfs]$ ./rfc.sh
[detached HEAD a59646a] afr: honour selfheal enable/disable volume set options
Date: Sat May 30 10:23:33 2015 +0530
3 files changed, 108 insertions(+), 4 deletions(-)
create mode 100644 tests/basic/afr/client-side-heal.t
Successfully rebased and updated refs/heads/3.6_honour_heal_options.
Counting objects: 11, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (11/11), done.
Writing objects: 100% (11/11), 1.77 KiB | 0 bytes/s, done.
Total 11 (delta 9), reused 0 (delta 0)
error: unpack failed: error No space left on device
fatal: Unpack error, check server log
To ssh://itisr...@git.gluster.org/glusterfs.git
!
[remote rejected] HEAD - refs/for/release-3.6/bug-1230259 (n/a (unpacker error))
error: failed to push some refs to 'ssh://itisr...@git.gluster.org/glusterfs.git'

[ravi@tuxpad glusterfs]$ ./rfc.sh
[detached HEAD 8b28efd] afr: honour selfheal enable/disable volume set options
Date: Sat May 30 10:23:33 2015 +0530
3 files changed, 108 insertions(+), 4 deletions(-)
create mode 100644 tests/basic/afr/client-side-heal.t
Successfully rebased and updated refs/heads/3.6_honour_heal_options.
fatal: internal server error
fatal: Could not read from remote repository.
Please make sure you have the correct access rights and the repository exists.

Anybody else facing problems? -Ravi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Only netbsd regressions seem to be triggered
All, It seems only NetBSD regressions are being triggered; Linux-based regressions seem not to be triggered. I've observed this with two patches [1][2]. Pranith also feels the same. Have any of you seen a similar issue? [1] http://review.gluster.org/#/c/10943/ [2] http://review.gluster.org/#/c/10834/ regards, Raghavendra ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Unable to send patches to release 3.7 branch.
- Original Message - From: Avra Sengupta aseng...@redhat.com To: Gluster Devel gluster-devel@gluster.org Sent: Friday, May 29, 2015 9:11:22 AM Subject: [Gluster-devel] Unable to send patches to release 3.7 branch.

Hi, Usually when a patch is backported to the release-3.7 branch it contains the following from the patch already merged in master:

Change-Id: Ib878f39814af566b9250cf6b8ed47da0ca5b1128
BUG: 1226120
Signed-off-by: Avra Sengupta aseng...@redhat.com
Reviewed-on: http://review.gluster.org/10641
Reviewed-by: Rajesh Joseph rjos...@redhat.com
Tested-by: NetBSD Build System

Remove this line. NetBSD Build System is not a valid email id.

Reviewed-by: Kaushal M kaus...@redhat.com

While trying to send this patch from the release-3.7 branch I am getting the following error from checkpatch.pl, which is not letting me send the patch; this is new, because it didn't happen earlier.

# ./extras/checkpatch.pl 0001-glusterd-snapshot-Return-correct-errno-in-events-of-.patch
Use of uninitialized value $gerrit_url in regexp compilation at ./extras/checkpatch.pl line 1958.
Use options --gerrit-url: ./extras/checkpatch.pl --gerrit-url review.gluster.org
ERROR: Unrecognized url address: 'http://review.gluster.org/10313' #16: Reviewed-on: http://review.gluster.org/10313
ERROR: Unrecognized email address: 'NetBSD Build System' #19: Tested-by: NetBSD Build System
Patch not according to coding guidelines! please fix. total: 2 errors, 0 warnings, 326 lines checked

Why is checkpatch unable to reference the gerrit url? Regards, Avra ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Crash in dht_fsync()
During a graph change we migrate all the fds opened on the old subvol. In this case we were trying to migrate fds opened on the virtual .meta subtree. This resulted in a crash, as these inodes and fds have to be handled differently from the ones in real volumes. - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Vijay Bellur vbel...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Tuesday, May 26, 2015 12:02:46 PM Subject: Re: [Gluster-devel] Crash in dht_fsync() Is there a way to get a coredump? - Raghavendra Gowdappa rgowd...@redhat.com wrote: Will take a look. - Original Message - From: Vijay Bellur vbel...@redhat.com To: Gluster Devel gluster-devel@gluster.org Sent: Monday, May 25, 2015 11:08:46 PM Subject: [Gluster-devel] Crash in dht_fsync() While running tests/performance/open-behind.t in a loop on mainline, I observe the crash at [1]. The backtrace seems to point to dht_fsync(). Can one of the dht developers please take a look in? Thanks, Vijay [1] http://fpaste.org/225375/ ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Huge memory consumption with quota-marker
- Original Message - From: Krishnan Parthasarathi kpart...@redhat.com To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Pranith Kumar Karampuri pkara...@redhat.com, Vijay Bellur vbel...@redhat.com, Vijaikumar M vmall...@redhat.com, Gluster Devel gluster-devel@gluster.org, Nagaprasad Sathyanarayana nsath...@redhat.com Sent: Thursday, July 2, 2015 11:27:34 AM Subject: Re: Huge memory consumption with quota-marker
Yes. PROC_MAX is the maximum no. of 'worker' threads that would be spawned for a given syncenv. So, if we create a new syncenv with a smaller stack size, threads spawned in that syncenv will add to the number of threads in the process. However, if you create synctasks with a stacksize different from the default env->stacksize, the tasks will have a smaller stack while still utilizing the same threads of the default syncenv.
- Original Message - - Original Message - From: Krishnan Parthasarathi kpart...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Vijay Bellur vbel...@redhat.com, Vijaikumar M vmall...@redhat.com, Gluster Devel gluster-devel@gluster.org, Raghavendra Gowdappa rgowd...@redhat.com, Nagaprasad Sathyanarayana nsath...@redhat.com Sent: Thursday, July 2, 2015 10:54:44 AM Subject: Re: Huge memory consumption with quota-marker
Yes, we could take the synctask stack size as an argument for synctask_create. The increase in synctask threads is not really a problem; it can't grow beyond 16 (SYNCENV_PROC_MAX). That is, it cannot grow beyond PROC_MAX in a _single_ syncenv, I suppose.
- Original Message - On 07/02/2015 10:40 AM, Krishnan Parthasarathi wrote: - Original Message - On Wednesday 01 July 2015 08:41 AM, Vijaikumar M wrote: Hi, The new marker xlator uses the syncop framework to update quota-size in the background; it uses one synctask per write FOP. If there are 100 parallel writes with all different inodes but in the same directory '/dir', there will be ~100 txns waiting in a queue to acquire a lock on its parent, i.e. '/dir'.
Each of these txns uses a synctask, and each synctask allocates a stack of 2M (the default size), for a total of ~200M of usage. This usage can increase depending on the load. I am thinking of reducing the synctask stack size to 256k; will this much memory be sufficient, given that we perform very limited operations within a synctask during marker updates? Seems like a good idea to me. Do we need a 256k stack size, or can we live with something even smaller? It was 16K when synctask was introduced. This is a property of the syncenv. We could create a separate syncenv for marker transactions which has smaller stacks. env->stacksize (and SYNCTASK_DEFAULT_STACKSIZE) was increased to 2MB to support pump-xlator-based data migration for replace-brick. For the no. of stack frames a marker transaction could use at any given time, we could use much less, say 16K. Does that make sense? Creating one more syncenv will lead to extra sync-threads; maybe we can take the stacksize as an argument. Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
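A rough back-of-the-envelope check of the numbers in this thread (the figures 2M, 256k and 100 parallel writes come from the emails above; the helper below is purely illustrative arithmetic, not GlusterFS code):

```python
# Illustrative only: model the stack memory cost of N concurrent synctasks,
# each of which allocates its own stack of stack_size bytes.
# 2 MB is the SYNCTASK_DEFAULT_STACKSIZE mentioned in the thread; 256 KB is
# the smaller stack proposed for marker transactions.

def synctask_stack_bytes(n_tasks, stack_size):
    """Total stack memory used by n_tasks concurrent synctasks."""
    return n_tasks * stack_size

MB = 1024 * 1024
KB = 1024

default_cost = synctask_stack_bytes(100, 2 * MB)    # 100 parallel writes today
proposed_cost = synctask_stack_bytes(100, 256 * KB)  # with a 256 KB stack

print(default_cost // MB, "MB")   # 200 MB, matching the thread's estimate
print(proposed_cost // MB, "MB")  # 25 MB
```

This is why shrinking the per-task stack matters far more than the (bounded) number of worker threads: the stacks scale with in-flight transactions, the threads do not.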
Re: [Gluster-devel] Build and Regression failure in master branch!
Thanks Kotresh. - Original Message - From: Kotresh Hiremath Ravishankar khire...@redhat.com To: Atin Mukherjee atin.mukherje...@gmail.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Sunday, June 28, 2015 1:57:24 PM Subject: Re: [Gluster-devel] Build and Regression failure in master branch!
Yes, Atin. You are right, header files are missing in the Makefiles. The build broke because of commit 3741804bec65a33d400af38dcc80700c8a668b81. I have sent a patch for the same: http://review.gluster.org/#/c/11451/ Please could someone review and merge it. Thanks and Regards, Kotresh H R
- Original Message - From: Atin Mukherjee atin.mukherje...@gmail.com To: Kotresh Hiremath Ravishankar khire...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Sunday, June 28, 2015 12:56:21 PM Subject: Re: [Gluster-devel] Build and Regression failure in master branch! -Atin Sent from one plus one On Jun 28, 2015 12:01 PM, Kotresh Hiremath Ravishankar khire...@redhat.com wrote:
Hi, The rpm build is consistently failing for the patch (http://review.gluster.org/#/c/11443/) with the following error, whereas it passes in a local setup. ...
Making all in performance
Making all in write-behind
Making all in src
CC write-behind.lo
write-behind.c:24:35: fatal error: write-behind-messages.h: No such file or directory
#include "write-behind-messages.h"
^
compilation terminated.
make[5]: *** [write-behind.lo] Error 1
make[4]: *** [all-recursive] Error 1
make[3]: *** [all-recursive] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2
RPM build errors: error: Bad exit status from /var/tmp/rpm-tmp.8QmLg0 (%build) Bad exit status from /var/tmp/rpm-tmp.8QmLg0 (%build)
This means the entry of this file is missing in the respective Makefile. Regression failures: ./tests/basic/afr/client-side-heal.t The above test case is consistently failing for the patch.
http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/7596/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11641/consoleFull Are there known issues? Thanks and Regards, Kotresh H R ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
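For reference, the shape of the fix Kotresh describes (adding the missing header to the translator's Makefile.am so it ships in the dist tarball used by the rpm build) is roughly the following. The file path and variable contents here are a sketch, not the actual GlusterFS Makefile.am:

```makefile
# Hypothetical sketch of xlators/performance/write-behind/src/Makefile.am:
# listing the message header in noinst_HEADERS makes automake include it
# in "make dist", so the rpm build no longer fails with
# "write-behind-messages.h: No such file or directory".
noinst_HEADERS = write-behind-mem-types.h write-behind-messages.h
```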
Re: [Gluster-devel] Failure in tests/basic/tier/bug-1214222-directories_miising_after_attach_tier.t
I've reverted [1] which brought the change allow-insecure to be on by default. The patch seems to have issues which will be addressed and merged later. The revert can be found at [2]. [1] http://review.gluster.org/11274 [2] http://review.gluster.org/11507 Please let me know if the regressions are still failing. regards, Raghavendra. - Original Message - From: Joseph Fernandes josfe...@redhat.com To: Atin Mukherjee atin.mukherje...@gmail.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Thursday, July 2, 2015 9:49:16 PM Subject: Re: [Gluster-devel] Failure in tests/basic/tier/bug-1214222-directories_miising_after_attach_tier.t Yep.. Thanks Guys. - Original Message - From: Atin Mukherjee atin.mukherje...@gmail.com To: Joseph Fernandes josfe...@redhat.com Cc: kpart...@redhat.com, Atin Mukherjee amukh...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Thursday, July 2, 2015 9:45:01 PM Subject: Re: [Gluster-devel] Failure in tests/basic/tier/bug-1214222-directories_miising_after_attach_tier.t Joe, Please refer to Prasanna's mail. He has uploaded a patch to solve it. -Atin Sent from one plus one On Jul 2, 2015 9:42 PM, Joseph Fernandes josfe...@redhat.com wrote: Hi All, This is the same issue as the previous tiering regression failure. 
Volume brick not able to start brick because port is busy [2015-07-02 10:20:20.601372] [run.c:190:runner_log] (-- /build/install/lib/libglusterfs.so.0(_gf_log_callingfn+0x240)[0x7f05e080bc32] (-- /build/install/lib/libglusterfs.so.0(runner_log+0x192)[0x7f05e08754ce] (-- /build/install/lib/glusterfs/3.8dev/xlator/mgmt/glusterd.so(glusterd_volume_start_glusterfs+0xae7)[0x7f05d5c935d7] (-- /build/install/lib/glusterfs/3.8dev/xlator/mgmt/glusterd.so(glusterd_brick_start+0x151)[0x7f05d5c9d4e3] (-- /build/install/lib/glusterfs/3.8dev/xlator/mgmt/glusterd.so(glusterd_op_perform_add_bricks+0x8fe)[0x7f05d5d10661] ) 0-: Starting GlusterFS: /build/install/sbin/glusterfsd -s slave33.cloud.gluster.org --volfile-id patchy.slave33.cloud.gluster.org.d-backends-patchy5 -p /var/lib/glusterd/vols/patchy/run/slave33.cloud.gluster.org-d-backends-patchy5.pid -S /var/run/gluster/ca5f5a89aa3a24f0a54852590ab82ad5.socket --brick-name /d/backends/patchy5 -l /var/log/glusterfs/bricks/d-backends-patchy5.log --xlator-option *-posix.glusterd-uuid=da011de8-9103-4cf2-9f4b-03707d0019d0 --brick-port 49167 --xlator-option patchy-server.listen-port=49167 [2015-07-02 10:20:20.624297] I [MSGID: 106144] [glusterd-pmap.c:269:pmap_registry_remove] 0-pmap: removing brick (null) on port 49167 [2015-07-02 10:20:20.625315] E [MSGID: 106005] [glusterd-utils.c:4448:glusterd_brick_start] 0-management: Unable to start brick slave33.cloud.gluster.org:/d/backends/patchy5 [2015-07-02 10:20:20.625354] E [MSGID: 106074] [glusterd-brick-ops.c:2096:glusterd_op_add_brick] 0-glusterd: Unable to add bricks [2015-07-02 10:20:20.625368] E [MSGID: 106123] [glusterd-syncop.c:1416:gd_commit_op_phase] 0-management: Commit of operation 'Volume Add brick' failed on localhost Brick Log: [2015-07-02 10:20:20.608547] I [MSGID: 100030] [glusterfsd.c:2296:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbin/glusterfsd version 3.8dev (args: /build/install/sbin/glusterfsd -s slave33.cloud.gluster.org --volfile-id 
patchy.slave33.cloud.gluster.org.d-backends-patchy5 -p /var/lib/glusterd/vols/patchy/run/slave33.cloud.gluster.org-d-backends-patchy5.pid -S /var/run/gluster/ca5f5a89aa3a24f0a54852590ab82ad5.socket --brick-name /d/backends/patchy5 -l /var/log/glusterfs/bricks/d-backends-patchy5.log --xlator-option *-posix.glusterd-uuid=da011de8-9103-4cf2-9f4b-03707d0019d0 --brick-port 49167 --xlator-option patchy-server.listen-port=49167) [2015-07-02 10:20:20.617113] I [MSGID: 101190] [event-epoll.c:627:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2015-07-02 10:20:20.623097] I [MSGID: 101173] [graph.c:268:gf_add_cmdline_options] 0-patchy-server: adding option 'listen-port' for volume 'patchy-server' with value '49167' [2015-07-02 10:20:20.623135] I [MSGID: 101173] [graph.c:268:gf_add_cmdline_options] 0-patchy-posix: adding option 'glusterd-uuid' for volume 'patchy-posix' with value 'da011de8-9103-4cf2-9f4b-03707d0019d0' [2015-07-02 10:20:20.623358] I [MSGID: 115034] [server.c:392:_check_for_auth_option] 0-/d/backends/patchy5: skip format check for non-addr auth option auth.login./d/backends/patchy5.allow [2015-07-02 10:20:20.623374] I [MSGID: 115034] [server.c:392:_check_for_auth_option] 0-/d/backends/patchy5: skip format check for non-addr auth option auth.login.96bcb872-559b-4f19-84ad-a735dc6068f6.password [2015-07-02 10:20:20.623568] I [rpcsvc.c:2210:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64 [2015-07-02 10:20:20.623633] W [MSGID: 101002]
Re: [Gluster-devel] Crash in dht_fsync()
Will take a look. - Original Message - From: Vijay Bellur vbel...@redhat.com To: Gluster Devel gluster-devel@gluster.org Sent: Monday, May 25, 2015 11:08:46 PM Subject: [Gluster-devel] Crash in dht_fsync() While running tests/performance/open-behind.t in a loop on mainline, I observe the crash at [1]. The backtrace seems to point to dht_fsync(). Can one of the dht developers please take a look in? Thanks, Vijay [1] http://fpaste.org/225375/ ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Crash in dht_fsync()
Is there a way to get coredump? - Raghavendra Gowdappa rgowd...@redhat.com wrote: Will take a look. - Original Message - From: Vijay Bellur vbel...@redhat.com To: Gluster Devel gluster-devel@gluster.org Sent: Monday, May 25, 2015 11:08:46 PM Subject: [Gluster-devel] Crash in dht_fsync() While running tests/performance/open-behind.t in a loop on mainline, I observe the crash at [1]. The backtrace seems to point to dht_fsync(). Can one of the dht developers please take a look in? Thanks, Vijay [1] http://fpaste.org/225375/ ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Moratorium on new patch acceptance
Status on the crash in dht_fsync reported by Vijay Bellur: I am not able to reproduce the issue. However, I am consistently hitting a failure in the last test of ./tests/performance/open-behind.t (test 18). The test reads as:
gluster volume top $V0 open | grep -w $F0 > /dev/null 2>&1
TEST [ $? -eq 0 ];
gluster volume top is not giving the file name in the output, causing grep to fail and the test to fail. regards, Raghavendra.
- Original Message - From: Vijaikumar M vmall...@redhat.com To: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, May 26, 2015 4:43:23 PM Subject: Re: [Gluster-devel] Moratorium on new patch acceptance
Here is the status on the quota test-case spurious failures. There were 3 issues:
1) Quota exceeding the limit because of parallel writes - merged upstream, patch submitted to release-3.7 #10910 ./tests/bugs/quota/bug-1038598.t ./tests/bugs/distribute/bug-1161156.t
2) Quota accounting going wrong - patch submitted #10918 ./tests/basic/ec/quota.t ./tests/basic/quota-nfs.t
3) Quota with anonymous FDs on NetBSD: this is an NFS client caching issue on NetBSD. Sachin and I are working on this issue. ./tests/basic/quota-anon-fd-nfs.t
Thanks, Vijay
On Friday 22 May 2015 11:45 PM, Vijay Bellur wrote: On 05/21/2015 12:07 AM, Vijay Bellur wrote: On 05/19/2015 11:56 PM, Vijay Bellur wrote: On 05/18/2015 08:03 PM, Vijay Bellur wrote: On 05/16/2015 03:34 PM, Vijay Bellur wrote: I will send daily status updates from Monday (05/18) about this so that we are clear about where we are and what needs to be done to remove this moratorium. Appreciate your help in having a clean set of regression tests going forward! We have made some progress since Saturday. The problem with glupy.t has been fixed - thanks to Niels!
All but following tests have developers looking into them: ./tests/basic/afr/entry-self-heal.t ./tests/bugs/replicate/bug-976800.t ./tests/bugs/replicate/bug-1015990.t ./tests/bugs/quota/bug-1038598.t ./tests/basic/ec/quota.t ./tests/basic/quota-nfs.t ./tests/bugs/glusterd/bug-974007.t Can submitters of these test cases or current feature owners pick these up and start looking into the failures please? Do update the spurious failures etherpad [1] once you pick up a particular test. [1] https://public.pad.fsfe.org/p/gluster-spurious-failures Update for today - all tests that are known to fail have owners. Thanks everyone for chipping in! I think we should be able to lift this moratorium and resume normal patch acceptance shortly. Today's update - Pranith fixed a bunch of failures in erasure coding and Avra removed a test that was not relevant anymore - thanks for that! Quota, afr, snapshot tiering tests are being looked into. Will provide an update on where we are with these tomorrow. A few tests have not been readily reproducible. Of the remaining tests, all but the following have either been root caused or we have patches in review: ./tests/basic/mount-nfs-auth.t ./tests/performance/open-behind.t ./tests/basic/ec/ec-5-2.t ./tests/basic/quota-nfs.t With some reviews and investigations of failing tests happening over the weekend, I am optimistic about being able to accept patches as usual from early next week. Thanks, Vijay ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Enhancing Quota enforcement during parallel writes
- Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Vijaikumar Mallikarjuna vmall...@redhat.com, Sachin Pandit span...@redhat.com Sent: Friday, 22 May, 2015 10:50:14 AM Subject: Enhancing Quota enforcement during parallel writes
All, As pointed out by [1], parallel writes can result in incorrect quota enforcement. [2] was an (unsuccessful) attempt to solve the issue. Some points about [2]: in_progress_writes is updated _after_ we fetch the size. Because of this, two writes can see the same size, so the issue is not solved. What we should be doing is to update in_progress_writes even before we fetch the size. If we do this, it is guaranteed that at least one write sees the other's size accounted in in_progress_writes. This approach has two issues:
1. Since we have added the current write's size to in_progress_writes, the current write would already be accounted in the size of the directory. This is a minor issue and can be solved by subtracting the size of the current write from the resultant cluster-wide in-progress size of the directory.
2. We might prematurely fail writes even though there is some space available. Assume there is 5MB of free space. If two 5MB writes are issued in parallel, both might fail, as both might see the other's size already accounted, though neither has succeeded. Of course, we could live with this limitation, since we are erring on the conservative side, if the following logic seems too complicated.
To solve this issue, I am proposing the following algo:
* We assign each write an identity that is unique across the cluster - say a uuid.
* Among all the in-progress writes we pick one write. The policy used can be an arbitrary but deterministic criterion, like the smallest of all the uuids. So, each brick selects a candidate among its own in-progress writes _AND_ the incoming candidate (see the pseudocode of get_dir_size below for more clarity). It sends back this candidate along with the size of the directory.
The brick also remembers the last candidate it approved. Clustering translators like dht pick one write among these replies, using the same logic the bricks used. Now, along with the size, we also get a candidate to choose from the in-progress writes. However, there might be a new write on the brick in the time-window where we try to fetch the size, which could be the candidate. We should therefore compare the resultant cluster-wide candidate with the per-brick candidate. So, the enforcement logic will be as below:

/* Both enforcer and get_dir_size are executed in the brick process. I've left
   out the logic of get_dir_size in cluster translators like dht */

enforcer ()
{
        /* Note that this logic is executed independently for each directory on
           which a quota limit is set. All the in-progress writes, sizes and
           candidates are valid in the context of that directory */

        my_delta = iov_length (input_iovec, input_count);
        my_id = getuuid ();

        add_my_delta_to_in_progress_size ();

        get_dir_size (my_id, size, in_progress_size, cluster_candidate);

        in_progress_size -= my_delta;

        if (((size + my_delta) <= quota_limit)
            && ((size + in_progress_size + my_delta) > quota_limit)) {
                /* we've to choose among in-progress writes */

                brick_candidate = least_of_uuids (directory->in_progress_write_list,
                                                  directory->last_winning_candidate);

                if ((my_id == cluster_candidate) && (my_id == brick_candidate)) {
                        /* 1. subtract my_delta from per-brick in-progress writes
                           2. add my_delta to per-brick sizes of all parents
                           3. allow-write
                           getting brick_candidate above, 1 and 2 should be done
                           atomically */
                } else {
                        /* 1. subtract my_delta from per-brick in-progress writes
                           2. fail_write */
                }
        } else if ((size + my_delta) <= quota_limit) {
                /* 1. subtract my_delta from per-brick in-progress writes
                   2. add my_delta to per-brick sizes of all parents
                   3. allow-write
                   1 and 2 should be done atomically */
        } else {
                fail_write ();
        }
}

get_dir_size (IN incoming_candidate_id, IN directory, OUT *winning_candidate, ...)
{
        directory->last_winning_candidate = winning_candidate
                = least_uuid (directory->in_progress_write_list,
                              incoming_candidate_id);
}

Comments?
[1] http://www.gluster.org/pipermail/gluster-devel/2015-May/045194.html
[2] http://review.gluster.org/#/c/6220/
regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
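A tiny model of the candidate-selection rule proposed above (every node independently picks the smallest uuid and therefore converges on the same winner). This is illustrative Python, not the brick-side C:

```python
# Model of the "smallest uuid wins" policy from the proposal: each brick
# picks a candidate among its own in-progress writes plus the incoming
# candidate, and cluster translators like dht reduce the brick replies
# using the same rule.

def pick_candidate(in_progress_ids, incoming_id):
    """Deterministically pick one write: the least uuid."""
    return min(list(in_progress_ids) + [incoming_id])

# Two bricks, each with their own in-progress writes; "w1" is the
# incoming write's id.
brick_a = pick_candidate(["w3", "w5"], "w1")   # -> "w1"
brick_b = pick_candidate(["w2"], "w1")         # -> "w1"

# The dht-style reduction over brick replies applies the same rule, so
# every node converges on a single winner regardless of reply order.
cluster_winner = min([brick_a, brick_b])
print(cluster_winner)  # w1
```

Because the rule is deterministic and total (uuids are unique), at most one in-progress write can be the winner everywhere, which is what lets exactly one write proceed when the limit would otherwise be breached.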
Re: [Gluster-devel] rm -rf issues in Geo-replication
- Original Message - From: Aravinda avish...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Raghavendra Gowdappa rgowd...@redhat.com Sent: Friday, 22 May, 2015 12:42:11 PM Subject: rm -rf issues in Geo-replication
Problem: Each geo-rep worker processes the Changelogs available in its own brick; if a worker sees an RMDIR, it tries to remove that directory recursively. Since the rmdir is recorded in all the bricks, rm -rf is executed in parallel. Due to DHT's open issues with parallel rm -rf, some of the directories will not get deleted in the Slave Volume (stale directory layout). If a directory with the same name is then created on the Master, Geo-rep will end up in an inconsistent state, since the GFID is different for the new directory while the old directory still exists on the Slave.
Solution - Fix in DHT: Hold a lock during rmdir, so that parallel rmdirs get blocked and there are no stale layouts.
Solution - Fix in Geo-rep: Temporarily we can fix this in Geo-rep until DHT fixes the issue. Since a Meta Volume is available with each cluster, Geo-rep can hold a lock for the GFID of the directory to be deleted.
If it fixes a currently pressing problem in geo-rep, we can use this. However, please note that the underlying problem is directory self-heal done during lookup racing with rmdir. So, theoretically, any path-based operation (like stat, opendir, chmod, chown, etc., and not just rmdir) can result in this bug. So, even with this solution you can see the issue (like doing find dir while doing rmdir dir). There is an age-old patch supposed to resolve this issue at [1], but it was not merged for various reasons (one being the synchronization required to prevent stale layouts being stored in the inode-ctx).
[1] http://review.gluster.org/#/c/4846/
For example, when handling rmdir:

    while True:
        try:
            # fcntl lock in Meta volume: $METAVOL/.rmdirlocks/GFID
            get_lock(GFID)
            recursive_delete()
            release_and_del_lock_file()
            break
        except (EACCES, EAGAIN):
            continue

One worker will succeed and all other workers will get ENOENT/ESTALE, which can be safely ignored. Let us know your thoughts.
-- regards Aravinda ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
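A concrete (hypothetical) shape for the get_lock()/release step in the pseudocode above, using an exclusive fcntl-style lock on a per-GFID file. The lock directory path and helper names are illustrative, not geo-rep code; a real implementation would place the lock files on the mounted meta volume ($METAVOL/.rmdirlocks):

```python
import errno
import fcntl
import os

LOCK_DIR = "/tmp/rmdirlocks"  # stands in for $METAVOL/.rmdirlocks


def try_rmdir_with_lock(gfid, recursive_delete):
    """Take an exclusive, non-blocking lock on a per-GFID file; only the
    winning worker runs recursive_delete(). Returns True if we won."""
    os.makedirs(LOCK_DIR, exist_ok=True)
    path = os.path.join(LOCK_DIR, gfid)
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        if e.errno in (errno.EACCES, errno.EAGAIN):
            os.close(fd)
            return False  # another worker holds the lock; skip
        raise
    try:
        # Losing workers later see ENOENT/ESTALE from the deleted tree,
        # which the thread says can be safely ignored.
        recursive_delete()
        return True
    finally:
        os.unlink(path)  # release_and_del_lock_file()
        os.close(fd)
```

Note this only serializes geo-rep's own workers; as Raghavendra points out above, it does not fix the underlying DHT race with other path-based operations.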
Re: [Gluster-devel] on patch #11553
+gluster-devel - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Krishnan Parthasarathi kpart...@redhat.com Cc: Nithya Balachandran nbala...@redhat.com, Anoop C S achir...@redhat.com Sent: Tuesday, 7 July, 2015 11:32:01 AM Subject: on patch #11553
KP, Though the crash caused by the lack of init while fops are in progress is solved, the concerns addressed by [1] are still valid. Basically, what we need to guarantee is knowing when it is safe to wind fops through a particular subvol of protocol/server. So, if some xlators do things in events like CHILD_UP (like trash), server_setvolume should wait for CHILD_UP on a particular subvol before accepting a client. So, [1] is necessary, but the following changes need to be made:
1. protocol/server _can_ have multiple subvols as children. In that case we should track whether the exported subvol has received CHILD_UP, and only after a successful CHILD_UP on that subvol can connections to it be accepted.
2. It is valid (though not common for a brick process) that some subvols are up while others are down. So, child readiness should be localised to each subvol instead of being tracked at the protocol/server level.
So, please revive [1] and send it with corrections, and I'll merge it.
[1] http://review.gluster.org/11553
regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
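The per-subvol readiness tracking suggested in points 1 and 2 above could look roughly like this. This is a model to make the design concrete, not the actual protocol/server code (class and method names are invented for illustration):

```python
# Model of points 1 and 2: track CHILD_UP per exported subvolume and
# refuse setvolume for a subvol that is not yet up, instead of keeping a
# single server-wide "ready" flag.

class Server:
    def __init__(self, subvols):
        # One readiness flag per child subvol (point 2: localised state).
        self.child_up = {name: False for name in subvols}

    def notify_child_up(self, subvol):
        self.child_up[subvol] = True

    def notify_child_down(self, subvol):
        self.child_up[subvol] = False

    def setvolume(self, subvol):
        # Point 1: accept the client only after CHILD_UP on *this* subvol.
        # Point 2: other subvols being down must not block this one.
        return self.child_up.get(subvol, False)


srv = Server(["brick-a", "brick-b"])
srv.notify_child_up("brick-a")
print(srv.setvolume("brick-a"))  # True: brick-a has seen CHILD_UP
print(srv.setvolume("brick-b"))  # False: brick-b is not up yet
```

The key design point is that readiness is a per-child property, so a brick process exporting several subvols can serve the healthy ones while a lagging child finishes its CHILD_UP work (e.g. trash initialization).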
Re: [Gluster-devel] Inconsistent behavior due to lack of lookup on entry followed by readdirp
- Original Message - From: Krutika Dhananjay kdhan...@redhat.com To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Mohammed Rafi K C rkavu...@redhat.com, Gluster Devel gluster-devel@gluster.org, Dan Lambright dlamb...@redhat.com, Nithya Balachandran nbala...@redhat.com, Ben Turner btur...@redhat.com, Ben England bengl...@redhat.com, Manoj Pillai mpil...@redhat.com, Pranith Kumar Karampuri pkara...@redhat.com, Ravishankar Narayanankutty ranar...@redhat.com, xhernan...@datalab.es Sent: Thursday, August 13, 2015 9:53:41 AM Subject: Re: Inconsistent behavior due to lack of lookup on entry followed by readdirp - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Krutika Dhananjay kdhan...@redhat.com Cc: Mohammed Rafi K C rkavu...@redhat.com, Gluster Devel gluster-devel@gluster.org, Dan Lambright dlamb...@redhat.com, Nithya Balachandran nbala...@redhat.com, Ben Turner btur...@redhat.com, Ben England bengl...@redhat.com, Manoj Pillai mpil...@redhat.com, Pranith Kumar Karampuri pkara...@redhat.com, Ravishankar Narayanankutty ranar...@redhat.com, xhernan...@datalab.es Sent: Thursday, August 13, 2015 9:06:37 AM Subject: Re: Inconsistent behavior due to lack of lookup on entry followed by readdirp - Original Message - From: Krutika Dhananjay kdhan...@redhat.com To: Mohammed Rafi K C rkavu...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org, Dan Lambright dlamb...@redhat.com, Nithya Balachandran nbala...@redhat.com, Raghavendra Gowdappa rgowd...@redhat.com, Ben Turner btur...@redhat.com, Ben England bengl...@redhat.com, Manoj Pillai mpil...@redhat.com, Pranith Kumar Karampuri pkara...@redhat.com, Ravishankar Narayanankutty ranar...@redhat.com, xhernan...@datalab.es Sent: Wednesday, August 12, 2015 9:02:44 PM Subject: Re: Inconsistent behavior due to lack of lookup on entry followed by readdirp I faced the same issue with the sharding translator. 
I fixed it by making its readdirp callback initialize individual entries' inode ctx, some of these being xattr values which are filled in entry->dict by the posix translator. Here is the patch that got merged recently: http://review.gluster.org/11854 Would that be as easy to do in DHT as well? The problem is not just filling out state in the inode. The bigger problem is healing, which is supposed to maintain a directory/file in a state consistent with our design before a successful reply to lookup. The operations can involve creating directories on missing subvols, setting the appropriate layout, etc. Effectively, for readdirp to replace lookup, it should be calling dht_lookup on each of the dentries it is passing back to the application. OK. As far as AFR is concerned, it indirectly forces LOOKUP on entries which are being retrieved for the first time through a READDIRP (and as a result do not have their inode ctx etc. initialised yet) by setting entry->inode to NULL. See afr_readdir_transform_entries(). Hmm. Then we have already disabled readdirp through code :). Without an inode corresponding to an entry, readdirp is effectively readdir, stripping any performance benefit of having readdirp act as a batched lookup (of all the dentries). No. Not every single READDIRP will be transformed into a READDIR by AFR. AFR resets the inode corresponding to an entry, before responding to its parent, _only_ under the following two conditions: 1) if the entry in question is being retrieved by this client for the first time through a READDIRP. In other words, this client has not _yet_ performed a LOOKUP on it. 2) if the sub-volume of AFR on which the parent directory is being READDIRP'd (remember AFR only needs to serve inode and directory reads from one of the replicas) does _not_ contain a good copy of the entry. In other words, this entry needs to be healed on the parent's read child.
This is because we do not want the caching translators or the application itself to get incorrect entry attributes. Thanks Krutika. We'll be borrowing the idea of setting entry->inode to NULL, when dht determines that inode needs to be healed. Since afr is already doing that for all dentries during first READDIRP (barring any lookups on that inode before), I don't think doing this will have any further performance degradation (As most of the setups will be distributed-replicated). This means that more often than not, AFR _would_ be leaving the inode corresponding to the entry as it is, and not setting it to NULL. This is the default behavior which is being made optional as part of http://review.gluster.org/#/c/11846/ which is still under review (see BZ 1250803, a performance bug :) ). If it is made optional, when we enable setting entry->inode we still see consistency issues. Also, it seems to me that there is no point in having each individual xlator
Re: [Gluster-devel] Inconsistent behavior due to lack of lookup on entry followed by readdirp
- Original Message - From: Krutika Dhananjay kdhan...@redhat.com To: Mohammed Rafi K C rkavu...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org, Dan Lambright dlamb...@redhat.com, Nithya Balachandran nbala...@redhat.com, Raghavendra Gowdappa rgowd...@redhat.com, Ben Turner btur...@redhat.com, Ben England bengl...@redhat.com, Manoj Pillai mpil...@redhat.com, Pranith Kumar Karampuri pkara...@redhat.com, Ravishankar Narayanankutty ranar...@redhat.com, xhernan...@datalab.es Sent: Wednesday, August 12, 2015 9:02:44 PM Subject: Re: Inconsistent behavior due to lack of lookup on entry followed by readdirp I faced the same issue with the sharding translator. I fixed it by making its readdirp callback initialize individual entries' inode ctx, some of these being xattr values, which are filled in entry->dict by the posix translator. Here is the patch that got merged recently: http://review.gluster.org/11854 Would that be as easy to do in DHT as well? The problem is not just filling out state in the inode. The bigger problem is healing, which is supposed to maintain a directory/file to be in state consistent with our design before a successful reply to lookup. The operations can involve creating directories on missing subvols, setting appropriate layout, etc. Effectively for readdirp to replace lookup, it should be calling dht_lookup on each of the dentries it is passing back to the application. As far as AFR is concerned, it indirectly forces LOOKUP on entries which are being retrieved for the first time through a READDIRP (and as a result do not have their inode ctx etc initialised yet) by setting entry->inode to NULL. See afr_readdir_transform_entries(). Hmm. Then we already have disabled readdirp through code :). Without an inode corresponding to entry, readdirp will be effectively readdir stripping any performance benefits by having readdirp as a batched lookup (of all the dentries).
This is the default behavior, which is being made optional as part of http://review.gluster.org/#/c/11846/, still under review (see BZ 1250803, a performance bug :) ). If it is made optional, when we enable setting entry->inode we still see consistency issues. Also, it seems to me that there is no point in having an individual xlator option controlling this behaviour in each xlator. Instead we can make each xlator behave in compliance with the global mount option --use-readdirp=yes/no. Is there any specific reason to have an option controlling this behaviour in afr? -Krutika - Original Message - From: Mohammed Rafi K C rkavu...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Dan Lambright dlamb...@redhat.com, Nithya Balachandran nbala...@redhat.com, Raghavendra Gowdappa rgowd...@redhat.com, Ben Turner btur...@redhat.com, Ben England bengl...@redhat.com, Manoj Pillai mpil...@redhat.com, Pranith Kumar Karampuri pkara...@redhat.com, Ravishankar Narayanankutty ranar...@redhat.com, kdhan...@redhat.com, xhernan...@datalab.es Sent: Wednesday, August 12, 2015 7:29:48 PM Subject: Inconsistent behavior due to lack of lookup on entry followed by readdirp Hi All, We are facing some inconsistent behavior for fops like rename, unlink etc. due to the lack of a lookup following a readdirp. More specifically, if inodes/gfids are populated via a readdirp call and this nodeid is shared with the kernel, md-cache will cache this based on the base-name. A subsequent named lookup will then be served from md-cache and wind back immediately. So there is a chance of an FOP being triggered without a lookup ever having happened on the entry. DHT does a lot of things during lookup, like creating link files and populating the inode_ctx. In such a scenario it is a must to have at least one lookup happen on an entry. Since readdirp prevents the lookup, it has been very hard for fops to proceed without a first lookup on the entry. We also suspect similar problems with afr/ec self-healing.
So if we remove readdirp from md-cache ([1], [2]), it causes an additional hop for the first lookup of every entry. I'm mostly concerned with this one extra network call, and the performance degradation it causes. Now with this, the only advantage of readdirp is that it removes one context switch between kernel and userspace. Is it really worth sacrificing this for consistency? What do you think about removing the readdirp functionality? Please provide your input/suggestions/ideas. [1] : http://review.gluster.org/#/c/11892/ [2] : http://review.gluster.org/#/c/11894/ Thanks in Advance Rafi KC ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
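The race described in this thread can be sketched as a toy model (NOT GlusterFS code - the class and attribute names below are illustrative only): readdirp fills the client-side cache with inodes that lower xlators like dht never initialized via a named lookup, while AFR's workaround of stripping entry->inode forces the lookup to happen.

```python
# Toy model of the consistency gap: readdirp populates md-cache with
# inodes whose per-xlator ctx was never initialized by a named lookup.

class Inode:
    def __init__(self, gfid):
        self.gfid = gfid
        self.ctx_initialized = False  # stands in for dht's inode_ctx setup

class Client:
    def __init__(self):
        self.md_cache = {}            # basename -> Inode (or None)

    def readdirp(self, entries, strip_inodes=False):
        for name, inode in entries:
            # AFR-style workaround: drop entry->inode so the kernel is
            # forced to issue a real named lookup later.
            self.md_cache[name] = None if strip_inodes else inode

    def lookup(self, name, backend):
        cached = self.md_cache.get(name)
        if cached is not None:
            return cached             # served from md-cache; dht never sees it
        inode = backend[name]
        inode.ctx_initialized = True  # init happens only on a real lookup
        self.md_cache[name] = inode
        return inode

backend1 = {"f": Inode("gfid-1")}
c1 = Client()
c1.readdirp([("f", backend1["f"])])
# a later fop finds the cached inode, but its ctx was never initialized
assert not c1.lookup("f", backend1).ctx_initialized

backend2 = {"f": Inode("gfid-1")}
c2 = Client()
c2.readdirp([("f", backend2["f"])], strip_inodes=True)
# with entry->inode stripped, the named lookup reaches the backend
assert c2.lookup("f", backend2).ctx_initialized
```

The second half of the sketch is why, as Raghavendra notes, stripping the inode turns readdirp into plain readdir: every entry pays for a full lookup anyway.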
Re: [Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file
- Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Sakshi Bansal saban...@redhat.com Sent: Thursday, August 20, 2015 10:24:46 AM Subject: [Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file Hi all, Most of the code currently treats the inode table (and the dentry structure associated with it) as the correct representation of the underlying backend file-system. While this is correct for most cases, the representation might be out of sync for small time-windows (like a file deleted on disk whose dentry and inode are not yet removed from our inode table, etc.). While working on locking directories in dht for better consistency, we ran into one such issue. The issue is basically to make rmdir and directory creation during dht-selfheal mutually exclusive. The idea is to take a blocking inodelk on the inode before proceeding with rmdir or directory self-heal. However, consider the following scenario: 1. (dht_)rmdir acquires a lock. 2. lookup-selfheal tries to acquire a lock, but is blocked on the lock acquired by rmdir. 3. rmdir deletes the directory and releases the lock. It's possible for the inode to remain in the inode table and be searchable through its gfid as long as there is a positive reference count on it. In this case the lock request (by lookup) and the granted lock (to rmdir) keep the inode in the inode table even after rmdir, as both of them hold a refcount on the inode. 4. The lock request issued by lookup is granted. Note that at step 4 it's still possible that rmdir is in progress from dht's perspective (it has just completed on one node). However, this is precisely the situation we wanted to avoid, i.e., we wanted to block and fail dht-selfheal instead of allowing it to proceed. In this scenario, at step 4 the directory is removed on the backend file-system, but its representation is still present in the inode table. We tried to solve this by doing a lookup on the gfid before granting a lock [1]. However, because of [1]: 1. we no longer treat the inode table as the source of truth, as opposed to other non-lookup code 2. there is a performance hit in terms of a lookup on the backend file-system for _every_ granted lock. This may not be as big considering that there is no network call involved. There are other ways dht could've avoided the above scenario altogether, with different trade-offs we didn't want to make. A few alternatives would've been: 1. Use entrylk during lookup-selfheal and rmdir. This fits naturally as both are entry operations. However, dht-selfheal also sets layouts, which should be synchronized with other operations where we don't have name information. tl;dr: we wanted to avoid using entrylk for reasons that are out of scope for this problem. 2. Use a non-blocking inodelk in dht during lookup-selfheal. This solves the problem for most practical cases, but theoretically the race can still exist. To summarize, the problem of granted locks vs. unlink/rmdir still remains and I am not sure what exactly the behavior of posix-locks should be in that scenario. Inputs by way of review on [1] are greatly appreciated. [1] http://review.gluster.org/#/c/11916/ regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
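The four-step scenario can be reproduced in a toy refcounting model (not GlusterFS code; names are illustrative): the pending lock request itself keeps the inode alive past the rmdir, so a grant that trusts only the inode table succeeds on a directory that is already gone from disk.

```python
# Toy model of steps 1-4: a pending lock request holds a ref on the
# inode, so the inode outlives rmdir and a naive grant still succeeds.

class Inode:
    def __init__(self, gfid, table):
        self.gfid, self.table, self.refs = gfid, table, 0

def ref(i):
    i.refs += 1

def unref(i):
    i.refs -= 1
    if i.refs == 0:
        del i.table[i.gfid]     # only now does it leave the inode table

backend = {"gfid-dir"}          # set of directories that exist on disk
table = {}
inode = table.setdefault("gfid-dir", Inode("gfid-dir", table))

ref(inode)                      # step 1: granted lock (rmdir) holds a ref
ref(inode)                      # step 2: blocked lock request (lookup-selfheal)

backend.discard("gfid-dir")     # step 3: rmdir deletes on the backend...
unref(inode)                    # ...and releases its lock
assert "gfid-dir" in table      # inode is still searchable by gfid

# step 4: a grant that consults only the inode table succeeds anyway
granted_naive = "gfid-dir" in table
# the fix proposed in [1]: also verify on the backend before granting
granted_checked = "gfid-dir" in table and "gfid-dir" in backend
assert granted_naive and not granted_checked
```

This is exactly trade-off 2 from the mail: the checked grant restores correctness at the cost of one backend lookup per grant.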
Re: [Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file
To put the problem in simple words, A lock is granted by posix-locks xlator even after a directory is deleted on backend. - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Sakshi Bansal saban...@redhat.com Sent: Thursday, August 20, 2015 10:31:55 AM Subject: Re: [Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Sakshi Bansal saban...@redhat.com Sent: Thursday, August 20, 2015 10:24:46 AM Subject: [Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file Hi all, Most of the code currently treats inode table (and dentry structure associated with that) as the correct representative of underlying backend file-system. While this is correct for most of the cases, the representation might be out of sync for small time-windows (like file deleted on disk, but dentry and inode is not removed in our inode table etc). While working on locking directories in dht for better consistency we ran into one such issue. The issue is basically to make rmdir and directory creation during dht-selfheal mutually exclusive. The idea is to have a blocking inodelk on inode before proceeding with rmdir or directory self-heal. However, consider following scenario: 1. (dht_)rmdir acquires a lock. 2. lookup-selfheal tries to acquire a lock, but is blocked on lock acquired by rmdir. 3. rmdir deletes directory and unlocks the lock. Its possible for inode to remain in inode table and searchable through gfid till there is a positive reference count on it. In this case lock-request (by lookup) and granted-lock (to rmdir) makes the inode to remain in inode table even after rmdir. as both of them have a refcount each on inode. 4. lock request issued by lookup is granted. Note that at step 4, its still possible rmdir might be in progress from dht perspective (it just completed on one node). 
However, this is precisely the situation we wanted to avoid i.e., we wanted to block and fail dht-selfheal instead of allowing it to proceed. In this scenario at step 4, the directory is removed on backend file-system, but its representation is still present in inode table. We tried to solve this by doing a lookup on gfid before granting a lock [1]. However, because of [1] 1. we no longer treat inode table as source of truth as opposed to other non-lookup code 2. performance hit in terms of a lookup on backend-filesystem for _every_ granted lock. This may not be as big considering that there is no network call involved. There are other ways where dht could've avoided above scenario altogether with different trade-offs we didn't want to make. Few alternatives would've been, 1. use entrylk during lookup-selfheal and rmdir. This fits naturally as both are entry operations. However, dht-selfheal also sets layouts which should be synchronized other operations where we don't have name information. tl;dr we wanted to avoid using entrylk for reasons that are out of scope for this problem. 2. Use non-blocking inodelk by dht during lookup-selfheal. This solves the problem for most of the practical cases, but theoretically race can still exist. To summarize, the problem of granted-locks and unlink/rmdir still remains and I am not sure what exactly should be the behavior of posix-locks in that scenario. Inputs in way of review on [1] are greatly appreciated. [1] http://review.gluster.org/#/c/11916/ regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Serialization of fops acting on same dentry on server
All, Pranith and I were discussing the implementation of compound operations like create + lock, mkdir + lock, open + lock etc. These operations are useful in situations like: 1. To prevent locking on all subvols during directory creation as part of self-heal in dht. Currently we follow the approach of locking _all_ subvols in both rmdir and lookup-heal [1]. 2. To lock a file in advance so that there is less of a performance hit during transactions in afr. While thinking about implementing such compound operations, it occurred to me that one of the problems would be how we handle a racing mkdir/create and a (named lookup - simply referred to as lookup from now on - followed by lock). This is because 1. creation of the directory/file on the backend and 2. linking of the inode with the gfid corresponding to that file/directory are not atomic. It is not guaranteed that the inode passed down during the mkdir/create call is the one that survives in the inode table. Since the posix-locks xlator maintains all the lock-state in the inode, it would be a problem if a different inode is linked into the inode table than the one passed during mkdir/create. One way to solve this problem is to serialize fops (like mkdir/create, lookup, rename, rmdir, unlink) that are happening on a particular dentry. This serialization would also solve other bugs like: 1. issues solved by [2][3], and possibly many similar issues. 2. Stale dentries left behind in bricks' inode tables because of racing lookup and dentry-modification ops (like rmdir, unlink, rename etc). The initial idea I have now is to track fops in-progress on a dentry in the parent inode (maybe in the resolver code in protocol/server). Based on this we can serialize the operations. Since we need to serialize _only_ operations on a dentry (we don't serialize nameless lookups), it is guaranteed that we always have a parent inode.
[1] http://review.gluster.org/11725 [2] http://review.gluster.org/9913 [3] http://review.gluster.org/5240 regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
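The proposal above - track in-progress fops per dentry in the parent inode and serialize them - could look roughly like this (an assumed design sketched with Python threading primitives, not the actual protocol/server resolver code):

```python
# Sketch: serialize fops on the same (parent, basename) dentry by
# recording in-progress ops in the parent inode. Nameless lookups are
# not routed through this, matching the proposal in the mail.
import threading

class ParentInode:
    def __init__(self):
        self._lock = threading.Lock()
        self._in_progress = {}      # basename -> Condition of the running fop

    def dentry_op(self, name, fop):
        """Run fop() exclusively w.r.t. other fops on the same dentry."""
        with self._lock:
            while name in self._in_progress:
                self._in_progress[name].wait()   # another fop owns this dentry
            self._in_progress[name] = threading.Condition(self._lock)
        try:
            return fop()            # mkdir/create/lookup/rename/rmdir/unlink body
        finally:
            with self._lock:
                self._in_progress.pop(name).notify_all()

p = ParentInode()
assert p.dentry_op("conf.file", lambda: "ok") == "ok"
```

Operations on different basenames under the same parent proceed in parallel; only same-dentry ops queue up, which is what closes the mkdir/create vs. lookup inode-linking race.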
Re: [Gluster-devel] Serialization of fops acting on same dentry on server
- Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Sakshi Bansal saban...@redhat.com Sent: Monday, 17 August, 2015 10:39:38 AM Subject: [Gluster-devel] Serialization of fops acting on same dentry on server All, Pranith and I were discussing the implementation of compound operations like create + lock, mkdir + lock, open + lock etc. These operations are useful in situations like: 1. To prevent locking on all subvols during directory creation as part of self-heal in dht. Currently we follow the approach of locking _all_ subvols in both rmdir and lookup-heal [1]. Correction. It should've been: to prevent locking on all subvols during rmdir. The lookup self-heal should lock on all subvols (with a compound mkdir + lock if the directory is not present on a subvol). With this, rmdir/rename can lock on just any one subvol, and since lookup-heal locks all subvols, the two remain mutually exclusive. 2. To lock a file in advance so that there is less of a performance hit during transactions in afr. While thinking about implementing such compound operations, it occurred to me that one of the problems would be how we handle a racing mkdir/create and a (named lookup - simply referred to as lookup from now on - followed by lock). This is because 1. creation of the directory/file on the backend and 2. linking of the inode with the gfid corresponding to that file/directory are not atomic. It is not guaranteed that the inode passed down during the mkdir/create call is the one that survives in the inode table. Since the posix-locks xlator maintains all the lock-state in the inode, it would be a problem if a different inode is linked into the inode table than the one passed during mkdir/create. One way to solve this problem is to serialize fops (like mkdir/create, lookup, rename, rmdir, unlink) that are happening on a particular dentry. This serialization would also solve other bugs like: 1. issues solved by [2][3], and possibly many similar issues. 2. Stale dentries left behind in bricks' inode tables because of racing lookup and dentry-modification ops (like rmdir, unlink, rename etc). The initial idea I have now is to track fops in-progress on a dentry in the parent inode (maybe in the resolver code in protocol/server). Based on this we can serialize the operations. Since we need to serialize _only_ operations on a dentry (we don't serialize nameless lookups), it is guaranteed that we always have a parent inode. Any comments/discussion on this would be appreciated. [1] http://review.gluster.org/11725 [2] http://review.gluster.org/9913 [3] http://review.gluster.org/5240 regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
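The mutual-exclusion claim in the correction is a simple lock-set invariant and can be checked mechanically (toy sketch, not GlusterFS code): if lookup-heal locks every subvol and rmdir/rename locks any one, their lock sets always intersect.

```python
# Sketch of the invariant behind the correction: lookup-heal locks ALL
# subvols; rmdir/rename locks ANY ONE; so the two lock sets always
# intersect and the operations can never run concurrently.
subvols = ["subvol-0", "subvol-1", "subvol-2"]

heal_locks = set(subvols)            # compound mkdir+lock on every subvol
for chosen in subvols:               # rmdir may pick any single subvol
    rmdir_locks = {chosen}
    # non-empty intersection => lock manager serializes the two ops
    assert heal_locks & rmdir_locks
```

This is why rmdir no longer needs to lock all subvols: one lock is enough as long as the heal side keeps locking everything.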
Re: [Gluster-devel] Serialization of fops acting on same dentry on server
- Original Message - From: Niels de Vos nde...@redhat.com To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org, Sakshi Bansal saban...@redhat.com Sent: Monday, 17 August, 2015 11:14:18 AM Subject: Re: [Gluster-devel] Serialization of fops acting on same dentry on server On Mon, Aug 17, 2015 at 01:09:38AM -0400, Raghavendra Gowdappa wrote: All, Pranith and me were discussing about implementation of compound operations like create + lock, mkdir + lock, open + lock etc. These operations are useful in situations like: 1. To prevent locking on all subvols during directory creation as part of self heal in dht. Currently we are following approach of locking _all_ subvols by both rmdir and lookup-heal [1]. 2. To lock a file in advance so that there is less performance hit during transactions in afr. I have an interest in compound/composite procedures too. My use-case is a little different, and I (was and still) am planning to send more details about it soon. Basically, there are certain cases where libgfapi will not be able to automatically pass the uid/gid in the RPC-header. A design for supporting Kerberos will mainly use the standardized RPCSEC_GSS. If there is no option to use the Kerberos credentials of the user doing I/O (remote client, not using Kerberos to talk to samba/ganesha), the username (or uid/gid) needs to be passed to the storage servers. A compound/composite procedure would then look like this: [RPC header] [AUTH_GSS + Kerberos principal for libgfapi/samba/ganesha/...] [GlusterFS COMPOUND] [SETFSUID] [SETLOCKOWNER] [${FOP}] [.. more FOPs?] This idea has not been reviewed/commented on with some of the Kerberos experts that I want to involve. A more complete description about the plans to support Kerberos will follow. Do you think that this matches your ideas on compound operations? The thing we had in mind was more of compounding more than one Gluster fops. 
We really didn't think at the granularity of setfsuid, setlkowner etc. But, yes its not something fundamentally different from what we had in mind. Thanks, Niels While thinking about implementing such compound operations, it occurred to me that one of the problems would be how do we handle a racing mkdir/create and a (named lookup - simply referred as lookup from now on - followed by lock). This is because, 1. creation of directory/file on backend 2. linking of the inode with the gfid corresponding to that file/directory are not atomic. It is not guaranteed that inode passed down during mkdir/create call need not be the one that survives in inode table. Since posix-locks xlator maintains all the lock-state in inode, it would be a problem if a different inode is linked in inode table than the one passed during mkdir/create. One way to solve this problem is to serialize fops (like mkdir/create, lookup, rename, rmdir, unlink) that are happening on a particular dentry. This serialization would also solve other bugs like: 1. issues solved by [2][3] and possibly many such issues. 2. Stale dentries left out in bricks' inode table because of a racing lookup and dentry modification ops (like rmdir, unlink, rename etc). Initial idea I've now is to maintain fops in-progress on a dentry in parent inode (may be resolver code in protocol/server). Based on this we can serialize the operations. Since we need to serialize _only_ operations on a dentry (we don't serialize nameless lookups), it is guaranteed that we do have a parent inode always. Any comments/discussion on this would be appreciated. [1] http://review.gluster.org/11725 [2] http://review.gluster.org/9913 [3] http://review.gluster.org/5240 regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
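Niels' framing - one RPC carrying [SETFSUID] [SETLOCKOWNER] [FOP] ... - can be sketched as a server-side dispatcher that applies sub-operations in order, so credentials and the fop travel together (op names and the dict-based context are illustrative assumptions, not the real GlusterFS wire format):

```python
# Hypothetical sketch of a compound procedure handler: credential-setting
# sub-ops mutate the request context; FOP sub-ops execute under whatever
# context has been accumulated so far.

def server_handle_compound(ops, ctx):
    results = []
    for op, arg in ops:
        if op == "SETFSUID":
            ctx["fsuid"] = arg            # applies to the following fops
        elif op == "SETLOCKOWNER":
            ctx["lk_owner"] = arg
        elif op == "FOP":
            results.append(f"{arg} as uid={ctx['fsuid']} owner={ctx['lk_owner']}")
        else:
            raise ValueError(f"unknown sub-op {op}")
    return results

out = server_handle_compound(
    [("SETFSUID", 1000), ("SETLOCKOWNER", "0xabc"), ("FOP", "MKDIR /d")],
    {"fsuid": 0, "lk_owner": None})
assert out == ["MKDIR /d as uid=1000 owner=0xabc"]
```

The same dispatcher shape covers both use cases in the thread: Kerberos-style credential forwarding (SETFSUID before the fop) and fop compounding (mkdir + lock as two FOP entries in one request).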
[Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file
Hi all, Most of the code currently treats inode table (and dentry structure associated with that) as the correct representative of underlying backend file-system. While this is correct for most of the cases, the representation might be out of sync for small time-windows (like file deleted on disk, but dentry and inode is not removed in our inode table etc). While working on locking directories in dht for better consistency we ran into one such issue. The issue is basically to make rmdir and directory creation during dht-selfheal mutually exclusive. The idea is to have a blocking inodelk on inode before proceeding with rmdir or directory self-heal. However, consider following scenario: 1. (dht_)rmdir acquires a lock. 2. lookup-selfheal tries to acquire a lock, but is blocked on lock acquired by rmdir. 3. rmdir deletes directory and unlocks the lock. Its possible for inode to remain in inode table and searchable through gfid till there is a positive reference count on it. In this case lock-request (by lookup) and granted-lock (to rmdir) makes the inode to remain in inode table even after rmdir. 4. lock request issued by lookup is granted. Note that at step 4, its still possible rmdir might be in progress from dht perspective (it just completed on one node). However, this is precisely the situation we wanted to avoid i.e., we wanted to block and fail dht-selfheal instead of allowing it to proceed. In this scenario at step 4, the directory is removed on backend file-system, but its representation is still present in inode table. We tried to solve this by doing a lookup on gfid before granting a lock [1]. However, because of [1] 1. we no longer treat inode table as source of truth as opposed to other non-lookup code 2. performance hit in terms of a lookup on backend-filesystem for _every_ granted lock. This may not be as big considering that there is no network call involved. 
There are other ways where dht could've avoided above scenario altogether with different trade-offs we didn't want to make. Few alternatives would've been, 1. use entrylk during lookup-selfheal and rmdir. This fits naturally as both are entry operations. However, dht-selfheal also sets layouts which should be synchronized other operations where we don't have name information. tl;dr we wanted to avoid using entrylk for reasons that are out of scope for this problem. 2. Use non-blocking inodelk by dht during lookup-selfheal. This solves the problem for most of the practical cases, but theoretically race can still exist. To summarize, the problem of granted-locks and unlink/rmdir still remains and I am not sure what exactly should be the behavior of posix-locks in that scenario. Inputs in way of review on [1] are greatly appreciated. [1] http://review.gluster.org/#/c/11916/ regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] SSL enabled glusterd crash
There is a race between gf_timer_call_cancel and the firing of the timer, addressed by [1]. Can this be the cause? Also note that [1] is not sufficient on its own: callers of gf_timer_call_cancel should check the return value, and should not free the opaque pointer they passed to gf_timer_call_after during timer registration when gf_timer_call_cancel returns -1. Note that [1] is not in 3.7.3. [1] http://review.gluster.org/6459 - Original Message - From: Emmanuel Dreyfus m...@netbsd.org To: gluster-devel@gluster.org Sent: Thursday, August 6, 2015 3:29:36 PM Subject: [Gluster-devel] SSL enabled glusterd crash On 3.7.3 with SSL enabled, restarting glusterd is quite unreliable, with peers and bricks showing up or not in gluster status outputs. And results can be different on different peers, and even not symmetrical: a peer sees the bricks of another but not the other way around. After playing a bit, I managed to get a real crash on restarting glusterd on all peers. 3 of them crash here:

Program terminated with signal 11, Segmentation fault.
#0  0xbbbda1f4 in rpc_clnt_reconnect (conn_ptr=0xb9ce5150) at rpc-clnt.c:409
409             gf_timer_call_cancel (clnt->ctx,
#0  0xbbbda1f4 in rpc_clnt_reconnect (conn_ptr=0xb9ce5150) at rpc-clnt.c:409
#1  0xbbb33d0c in gf_timer_proc (ctx=Cannot access memory at address 0xba9fffd8) at timer.c:194
(gdb) list
404             if (!trans) {
405                     pthread_mutex_unlock (&conn->lock);
406                     return;
407             }
408             if (conn->reconnect)
409                     gf_timer_call_cancel (clnt->ctx,
410                                           conn->reconnect);
411             conn->reconnect = 0;
412
413             if ((conn->connected == 0) && !clnt->disabled) {
(gdb) print clnt
$1 = (struct rpc_clnt *) 0x39bb
(gdb) print conn
$2 = (rpc_clnt_connection_t *) 0xb9ce5150
(gdb) print conn->lock
$3 = {ptm_magic = 51200, ptm_errorcheck = 0 '\000', ptm_pad1 = 0Q\316, ptm_interlock = 185 '\271', ptm_pad2 = \336\300\255, ptm_owner = 0x6af000de, ptm_waiters = 0x39bb, ptm_recursed = 51200, ptm_spare2 = 0xce513000}

ptm_magic is wrong.
NetBSD libpthread sets it as 0x0003 when created and as 0xDEAD0003 when destroyed. This means we either have memory corruption, or the mutex was never initialized. The last one crashes somewhere else:

Program terminated with signal 11, Segmentation fault
#0  0xbbb33e60 in gf_timer_registry_init (ctx=0x80) at timer.c:241
241             if (!ctx->timer) {
(gdb) bt
#0  0xbbb33e60 in gf_timer_registry_init (ctx=0x80) at timer.c:241
#1  0xbbb339ce in gf_timer_call_cancel (ctx=0x80, event=0xb9dffb24) at timer.c:121
#2  0xbbbda206 in rpc_clnt_reconnect (conn_ptr=0xb9ce9150) at rpc-clnt.c:409
#3  0xbbb33d0c in gf_timer_proc (ctx=Cannot access memory at address 0xba9fffd8) at timer.c:194
(gdb) print ctx
$1 = (glusterfs_ctx_t *) 0x80
(gdb) frame 2
#2  0xbbbda206 in rpc_clnt_reconnect (conn_ptr=0xb9ce9150) at rpc-clnt.c:409
409             gf_timer_call_cancel (clnt->ctx,
(gdb) print clnt
$2 = (struct rpc_clnt *) 0xb9dffd94
(gdb) print clnt->lock.ptm_magic
$3 = 1

Here again, corrupted or not initialized. I kept the cores for further investigation if needed. -- Emmanuel Dreyfus m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
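The rule Raghavendra states - only free the callback's opaque data when the cancel actually succeeded - can be illustrated outside of GlusterFS with Python's threading.Timer (a sketch of the ownership pattern, not the gf_timer API; the fired-flag check is an illustrative stand-in for gf_timer_call_cancel's return value):

```python
# Sketch of the cancel-then-free rule: if cancel loses the race (the
# timer already fired or is firing), ownership of the opaque data stays
# with the callback; freeing it in the canceller would be the C-level
# use-after-free behind the crashes above.
import threading

def safe_cancel(timer, fired, free):
    timer.cancel()
    if fired.is_set():
        return -1        # too late: the callback owns the data
    free()               # cancel won the race: the caller releases the data
    return 0

data = {"alive": True}
fired = threading.Event()
t = threading.Timer(60.0, fired.set)   # far in the future: cancel wins
t.start()
rc = safe_cancel(t, fired, lambda: data.update(alive=False))
assert rc == 0 and data["alive"] is False
```

In the gf_timer case the equivalent discipline is: check gf_timer_call_cancel's return value, and on -1 let the timer callback remain responsible for the opaque pointer.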
Re: [Gluster-devel] Patch merge request-3.7 branch: http://review.gluster.org/#/c/11858/
- Original Message - From: Ravishankar N ravishan...@redhat.com To: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, August 12, 2015 6:01:16 PM Subject: [Gluster-devel] Patch merge request-3.7 branch: http://review.gluster.org/#/c/11858/ Could some one with merge rights take http://review.gluster.org/#/c/11858/ in for the 3.7 branch? This backport has +2 from the maintainer and has passed regressions. Done. Thanks in advance :-) Ravi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Lack of named lookups during resolution of inodes after graph switch (was Discuss: http://review.gluster.org/#/c/11368/)
+gluster-devel - Original Message - From: Dan Lambright dlamb...@redhat.com To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Shyam srang...@redhat.com, Nithya Balachandran nbala...@redhat.com, Sakshi Bansal saban...@redhat.com Sent: Monday, July 20, 2015 8:23:16 AM Subject: Re: Discuss: http://review.gluster.org/#/c/11368/ I am posting another version of the patch to discuss. Here is a summary in its simplest form: The fix tries to address problems we have with tiered volumes and fix-layout. If we try to use both the hot and cold tier before fix-layout has completed, we get many stale file errors; the new hot tier does not have layouts for the inodes. To avoid such problems, we only use the cold tier until fix-layout is done (subvolume count = 1). When we detect that fix-layout is done, we do a graph switch which creates new layouts on demand. We would like to switch to using both tiers (subvolume_cnt = 2) only once the graph switch is done. There is a hole in that solution. If we make a directory after fix-layout has passed its parent, fix-layout will not copy the new directory to the new tier. If we try to access such directories, the code fails (dht_access does not have a cached subvolume). So, we detect such directories when we do a lookup/revalidate/discover, and store their peculiar state in the layout if they are only accessible on the cold tier. Eventually a self-heal will happen, and this state will age out. I have a unit test and a system test for this. Basically my question is: what is the cleanest way to invoke the graph switch from using just the cold tier to using both the cold and hot tier? This is a long-standing problem which needs a fix very badly. I think the client/mount cannot rely on the rebalance/tier process for directory creation, since I/O on the client is independent and there is no way to synchronize it with rebalance directory heal.
The culprit here is the lack of hierarchical named lookups from the root down to that directory after a graph switch in the mount process. If named lookups are sent, dht is quite capable of creating directories on newly added subvols. So, I am proposing some solutions below. Interface layers (fuse-bridge, gfapi, nfs etc.) should make sure that the entire directory hierarchy up to the root is looked up at least once before sending fops on an inode after a graph switch. For dht, it is sufficient if only the inodes associated with directories are looked up in this fashion. However, non-directory inodes might also benefit from this, since VFS essentially would've done a hierarchical lookup before doing fops. It's only glusterfs which has introduced nameless lookups, but much of the logic is designed around named hierarchical lookups. Now, to address the question of whether it's possible for interface layers to figure out the ancestry of an inode: * With fuse-bridge, the entire dentry structure is preserved (at least in the first graph which witnessed named lookups from the kernel, and we can migrate this structure to newer graphs too). We can use the dentry structure from the older graph to send these named lookups and build a similar dentry structure in the newer graph. This resolution is still on-demand when a fop is sent on an inode (like the existing code, the change being that instead of one nameless lookup on the inode, we do named lookups of the parents and the inode in the newer graph). So, named lookups can be sent for all inodes, irrespective of whether the inode corresponds to a directory or a non-directory. * I am assuming gfapi is similar to fuse-bridge. I would need verification from the people maintaining gfapi whether my assumption is correct. * The NFS-v3 server allows the client to just pass a file-handle and can construct the relevant state to access the files (one of the reasons why nameless lookups were introduced in the first place). Since it relies heavily on nameless lookups, the dentry structure need not always be present in the NFS server process.
However, we can borrow some ideas from [1]. If it seems that maintaining the list of parents of a file in xattrs is overkill (basically we are constructing a reverse dentry tree), at least for the problems faced by dht/tier it is good enough if we get this hierarchy for directory inodes. With a gfid-based backend, we can always get the path/hierarchy for a directory from the gfid of its inode using the .glusterfs directory (within .glusterfs there is a symbolic link named after the gfid whose contents can get us the ancestry up to the root). This solution works for _all_ interface layers. I suspect it is not just dht, but also other cluster xlators like EC and afr, and non-cluster entities like quota and geo-rep, which face this issue. I am aware of at least one problem in afr - difficulty in identifying gfid mismatch of an entry across subvols after a graph switch. Geo-replication too is using some form of gfid-to-path conversion. So, comments from other maintainers/developers are highly appreciated. [1] http
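The ".glusterfs symlink" idea for directories can be sketched as a walk: on a brick, the gfid-named entry for a directory is a symlink whose target encodes the parent's gfid and the directory's basename, so following it repeatedly reconstructs the ancestry up to the root. The layout below (.glusterfs/<g0g1>/<g2g3>/<gfid> pointing at ../../<p0p1>/<p2p3>/<pargfid>/<basename>) is an assumption about the gfid-based backend and should be verified against the gluster version in use.

```python
# Sketch: recover a directory's ancestry from its gfid using the
# .glusterfs symlink structure on a brick. Not production code.
import os

ROOT_GFID = "00000000-0000-0000-0000-000000000001"

def gfid_path(brick, gfid):
    return os.path.join(brick, ".glusterfs", gfid[0:2], gfid[2:4], gfid)

def ancestry(brick, gfid):
    """Return [(gfid, basename), ...] from the directory up to the root."""
    chain = []
    while gfid != ROOT_GFID:
        # assumed target shape: ../../aa/bb/<parent-gfid>/<basename>
        target = os.readlink(gfid_path(brick, gfid))
        parts = target.split("/")
        pargfid, name = parts[-2], parts[-1]
        chain.append((gfid, name))
        gfid = pargfid
    return chain
```

With such a walk available on the server, any interface layer could issue the hierarchical named lookups the mail asks for, without maintaining a reverse dentry tree in xattrs.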
Re: [Gluster-devel] Gerrit review, submit type and Jenkins testing
- Original Message - > From: "Raghavendra Talur"> To: "Gluster Devel" > Sent: Tuesday, November 10, 2015 3:10:34 AM > Subject: [Gluster-devel] Gerrit review, submit type and Jenkins testing > > Hi, > > While trying to understand how our gerrit+jenkins setup works, I realized there is a possibility of allowing bugs to get in. > > Currently, our gerrit is set up to use cherry-pick as the submit type. Now consider a case where: > > Dev1 sends a commit B with parent commit A (A is already merged). > Dev2 sends a commit C with parent commit A (A is already merged). > > Both patches get +2 from Jenkins. > > A maintainer merges commit B from Dev1. > Another maintainer merges commit C from Dev2. > > If the two commits B and C changed code which had no merge conflicts but which conflicts in logic, then we have a master which has bugs. > > If Dev3 now sends a commit D with the rebased master as parent, we have the following cases: > > 1. If the bug introduced above is not racy, tests always fail for Dev3 on commit D. The tests that fail would be from components that commits B and C changed. Dev3 has no idea how to fix them and has to enlist help from Dev1 and Dev2. > > 2. If the bug introduced above is racy, then there is a probability that Dev3 escapes this trouble and someone else will bear it later. Even if the racy code is hit and a test fails, Dev3 will probably re-trigger the tests, given that they failed for a component which is not related to his/her code, and the bug stays in the code longer. > > The most obvious but not practical solution to the above problem is to change the submit type in gerrit to "fast-forward only". It would then ensure that once commit B is merged, Dev2 has to rebase and re-run the tests on commit C with commit B as parent before it can be merged. It is not practical because it will cause all patches in review to get rebased and re-triggered whenever a patch is merged.
> > A little modification to the above solution would be to > > > * change submit type to fast-forward only > * don't run any jenkins job on patches till they get +2 from reviewers > * once a +2 is given, run jenkins job on patch and automatically submit > it if test passes. > * automatically rebase all patches on review with new master and mark > conflict if merge conflict arises. Seems like a good suggestion. How about a slight variation to the above process? Can we run one initial set of regression immediately after submission, but before any reviews? That way reviewers can prioritize those patches that have passed regression over the ones that have failed? Flip side is that minimum two sets of regressions are needed to merge any patch. I am making this suggestion with the assumption that dev/reviewer time is more precious than machine time. Of course, this will have issues with patches that need to get in urgently (user/customer hot fix etc) where time is a constraint. But that can be worked around on a case-by-case basis. > > As a side effect of this, Dev would now be forced to run a complete > regression on dev machine before sending a patch for review. > > Any thoughts on the above solutions or other suggestions? > > Thanks, > Raghavendra Talur > > > > > > > > > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
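The core difference between the two submit types can be reduced to one predicate (a toy sketch, not Gerrit's implementation): under fast-forward-only, a patch is mergeable only if it was built - and therefore tested - against the current branch head, which is exactly what forces the rebase + retest that closes the semantic-conflict hole.

```python
# Sketch of fast-forward-only submit: merging is allowed only when the
# patch's parent is the current branch head, so every merged patch was
# tested against exactly the tree it lands on.

def can_submit_fast_forward(patch_parent, branch_head):
    return patch_parent == branch_head

head = "A"
# B and C were both written (and regression-tested) on top of A
assert can_submit_fast_forward("A", head)       # B merges cleanly
head = "B"                                      # head moves to B
# C's tests ran against A, not B: fast-forward-only rejects it...
assert not can_submit_fast_forward("A", head)
# ...until C is rebased onto B and retested
assert can_submit_fast_forward("B", head)
```

Cherry-pick submit, by contrast, accepts C in the second step, which is the scenario where B and C conflict in logic without any textual merge conflict.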
Re: [Gluster-devel] not able to open gerrit (review.glustster.org)
It's loading now. - Original Message - > From: "Gaurav Garg"> To: "Gluster Devel" > Sent: Monday, October 19, 2015 11:11:37 AM > Subject: [Gluster-devel] not able to open gerrit (review.glustster.org) > > Anybody facing the same issue ? > > Thanx, > > ~Gaurav > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Need advice re some major issues with glusterfind
Hi John, - Original Message - > From: "John Sincock [FLCPTY]"> To: gluster-devel@gluster.org > Sent: Wednesday, October 21, 2015 5:53:23 AM > Subject: [Gluster-devel] Need advice re some major issues with glusterfind > > Hi Everybody, > > We have recently upgraded our 220 TB gluster to 3.7.4, and we've been trying > to use the new glusterfind feature but have been having some serious > problems with it. Overall the glusterfind looks very promising, so I don't > want to offend anyone by raising these issues. > > If these issues can be resolved or worked around, glusterfind will be a great > feature. So I would really appreciate any information or advice: > > 1) What can be done about the vast number of tiny changelogs? We are seeing > often 5+ small 89 byte changelog files per minute on EACH brick. Larger > files if busier. We've been generating these changelogs for a few weeks and > have in excess of 10,000 or 12,000 on most bricks. This makes glusterfinds > very, very slow, especially on a node which has a lot of bricks, and looks > unsustainable in the long run. Why are these files so small, and why are > there so many of them, and how are they supposed to be managed in the long > run? The sheer number of these files looks sure to impact performance in the > long run. > > 2) Pgfid xattribute is wreaking havoc with our backup scheme - when gluster > adds this extended attribute to files it changes the ctime, which we were > using to determine which files need to be archived. There should be a > warning added to release notes & upgrade notes, so people can make a plan to > manage this if required. > > Also, we ran a rebalance immediately after the 3.7.4 upgrade, and the > rebalance took 5 days or so to complete, which looks like a major speed > improvement over the more serial rebalance algorithm, so that's good. 
But I > was hoping that the rebalance would also have had the side-effect of > triggering all files to be labelled with the pgfid attribute by the time the > rebalance completed, or failing that, after creation of an mlocate database > across our entire gluster (which would have accessed every file, unless it > is getting the info it needs only from directory inodes). Now it looks like > ctimes are still being modified, and I think this can only be caused by > files still being labelled with pgfids. > > How can we force gluster to get this pgfid labelling over and done with, for > all files that are already on the volume? We can't have gluster continuing > to add pgfids in bursts here and there, eg when files are read for the first > time since the upgrade. We need to get it over and done with. We have just > had to turn off pgfid creation on the volume until we can force gluster to > get it over and done with in one go. We are looking into pgfid xattr issue. Its a long weekend here in India. So, kindly expect a delay on update on this issue. > > 3) Files modified just before a glusterfind pre are often not included in the > changed files list, unless pre command is run again a bit later - I think > changelogs are missing very recent changes and need to be flushed or > something before the pre command uses them? > > 4) BUG: Glusterfind follows symlinks off bricks and onto NFS mounted > directories (and will cause these shares to be mounted if you have autofs > enabled). Glusterfind should definitely not follow symlinks, but it does. > For now, we are getting around this by turning off autofs when re run > glusterfinds, but this should not be necessary. Glusterfind must be fixed so > it never follows symlinks and never leaves the brick it is currently > searching. 
> > 5) We have one of our nodes with 16 bricks, and on this machine, glusterfind > pre command seems to get stuck pegging all 8 cores to 100%, an strace of an > offending processes gives an endless stream of these lseeks and reads and > very little else. What is going on here? It doesn't look right... : > > lseek(13, 17188864, SEEK_SET) = 17188864 > read(13, > "\r\0\0\0\4\0J\0\3\25\2\"\0013\0J\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) > = 1024 > lseek(13, 17189888, SEEK_SET) = 17189888 > read(13, > "\r\0\0\0\4\0\"\0\3\31\0020\1#\0\"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 1024) = 1024 > lseek(13, 17190912, SEEK_SET) = 17190912 > read(13, > "\r\0\0\0\3\0\365\0\3\1\1\372\0\365\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 1024) = 1024 > lseek(13, 17191936, SEEK_SET) = 17191936 > read(13, > "\r\0\0\0\4\0F\0\3\17\2\"\0017\0F\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) > = 1024 > lseek(13, 17192960, SEEK_SET) = 17192960 > read(13, > "\r\0\0\0\4\0006\0\2\371\2\4\1\31\0006\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 1024) = 1024 > lseek(13, 17193984, SEEK_SET) = 17193984 > read(13, > "\r\0\0\0\4\0L\0\3\31\2\36\1/\0L\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) > = 1024 > > I saved one of these straces for 20 or 30 secs or so, and then doing a quick
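Regarding point 2 above (forcing the pgfid labelling in one pass): assuming the xattr is added lazily during a named lookup on each file, a one-shot walk that stats every entry on the mount would trigger all the labelling (and the resulting ctime changes) in a single controlled window instead of in bursts. This is a hedged workaround sketch, not a documented Gluster procedure:

```python
import os

def force_lookup(root):
    # Walk a (hypothetical) Gluster mount point and stat every file,
    # forcing a named lookup on each entry.  Assumption: lookup is what
    # triggers the lazy pgfid labelling, so one pass gets the ctime
    # changes over and done with.
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                os.stat(os.path.join(dirpath, name))
                count += 1
            except OSError:
                pass  # file vanished mid-walk; skip it
    return count
```

Run once, off-hours, before re-enabling any ctime-based archiving so the backup scheme sees a single burst of ctime changes rather than an open-ended trickle.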
Re: [Gluster-devel] Testcase '/tests/bugs/snapshot/bug-1109889.t' failing
- Original Message - From: Nithya Balachandran nbala...@redhat.com To: Vijaikumar M vmall...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, July 8, 2015 12:17:29 PM Subject: Re: [Gluster-devel] Testcase '/tests/bugs/snapshot/bug-1109889.t' failing Also failing on: http://build.gluster.org/job/rackspace-regression-2GB-triggered/12069/consoleFull Regards, Nithya - Original Message - From: Vijaikumar M vmall...@redhat.com To: Gluster Devel gluster-devel@gluster.org, Avra Sengupta aseng...@redhat.com, Rajesh Joseph rjos...@redhat.com Sent: Wednesday, 8 July, 2015 11:09:14 AM Subject: [Gluster-devel] Testcase '/tests/bugs/snapshot/bug-1109889.t' failing Hi, Testcase '/tests/bugs/snapshot/bug-1109889.t' is failing consistently. Earlier, this test caused a brick crash because the brick accepted fops even before its xlator graph was initialised. A recent fix makes the server reject any client connections till the xlator graph is initialised. I think stat is failing with ENOTCONN because of that. Now the question is why/how the client is trying to connect before the server is initialised. Hope that helps. http://build.gluster.org/job/rackspace-regression-2GB-triggered/12048/consoleFull I think the test below at line# 72 needs to be changed: TEST stat $M0/.snaps; To EXPECT_WITHIN $PROCESS_UP_TIMEOUT 0 STAT $M0/.snaps Thanks, Vijay ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
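The fix suggested above swaps a one-shot stat for a retrying check. The retry-until-timeout pattern behind EXPECT_WITHIN can be sketched as follows (function names and the poll interval here are illustrative, not the test framework's actual internals):

```python
import time

def expect_within(timeout, expected, check, interval=0.1):
    # Poll check() until it returns the expected value or the timeout
    # elapses -- the pattern behind the test framework's EXPECT_WITHIN.
    deadline = time.monotonic() + timeout
    while True:
        if check() == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With a retrying check, a mount whose server briefly rejects connections while the xlator graph initialises makes the stat succeed on a later attempt instead of failing the test outright.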
Re: [Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file
- Original Message - From: Vijay Bellur vbel...@redhat.com To: Raghavendra Gowdappa rgowd...@redhat.com, Gluster Devel gluster-devel@gluster.org Cc: Sakshi Bansal saban...@redhat.com Sent: Monday, August 24, 2015 3:52:09 PM Subject: Re: [Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file On Thursday 20 August 2015 10:24 AM, Raghavendra Gowdappa wrote: Hi all, Most of the code currently treats inode table (and dentry structure associated with that) as the correct representative of underlying backend file-system. While this is correct for most of the cases, the representation might be out of sync for small time-windows (like file deleted on disk, but dentry and inode is not removed in our inode table etc). While working on locking directories in dht for better consistency we ran into one such issue. The issue is basically to make rmdir and directory creation during dht-selfheal mutually exclusive. The idea is to have a blocking inodelk on inode before proceeding with rmdir or directory self-heal. However, consider following scenario: 1. (dht_)rmdir acquires a lock. 2. lookup-selfheal tries to acquire a lock, but is blocked on lock acquired by rmdir. 3. rmdir deletes directory and unlocks the lock. Its possible for inode to remain in inode table and searchable through gfid till there is a positive reference count on it. In this case lock-request (by lookup) and granted-lock (to rmdir) makes the inode to remain in inode table even after rmdir. 4. lock request issued by lookup is granted. Note that at step 4, its still possible rmdir might be in progress from dht perspective (it just completed on one node). However, this is precisely the situation we wanted to avoid i.e., we wanted to block and fail dht-selfheal instead of allowing it to proceed. In this scenario at step 4, the directory is removed on backend file-system, but its representation is still present in inode table. We tried to solve this by doing a lookup on gfid before granting a lock [1]. 
However, because of [1] 1. we no longer treat inode table as source of truth as opposed to other non-lookup code 2. performance hit in terms of a lookup on backend-filesystem for _every_ granted lock. This may not be as big considering that there is no network call involved. Can we not mark the in memory inode as having been unlinked in posix_rmdir() and use this information to determine whether a lock request can be processed? Yes. Nithya suggested the same. But this seemed like a hacky fix. Reason is: 1. Currently we don't really differentiate in inode management based on inode type. The code (dentry management, inode management) is agnostic to type. With this fix we are bringing such explicit differentiation. Note that only for directories we can have such a flag indicating that inode is removed from backend. Since, files have hardlinks it would be difficult (if not impossible, as unlink_cbk doesn't carry iatt to figure out whether current unlink is on last link). This makes the solution only applicable for directories. For lock requests on files we still need to lookup on the backend (for our use-case this is fine, since we are not locking on files). Not a show-stopper, but something in terms of aesthetics. The whole thing about locks during/after file/directory is removed seems to be not well defined as of now IMHO. a. We can acquire lock because inode exists in inode-table. b. inode-exists in inode-table because there are some locks on inode holding reference. Of course, this situation will be fixed with maintaining a flag indicating file/directory removal (or by doing a lookup). But some clarifications needed - If we are going to have some information indicating file/directory is removed, what should be the behaviour of future lock/unlock calls? should we fail them? For lock calls, we can fail them. But for unlock there are two choices: a. Let the consumer send an unlock even after remove b. Or clear out the locks during unlink/rmdir. I prefer approach a. 
here. Comments? stat() calls can be significantly expensive if the disk seek times happen to be high. It would be better if we can avoid an additional stat() for every granted lock. Regards, Vijay ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
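The flag-based proposal discussed above, with option a for unlocks (lock requests fail after removal, unlocks still succeed), can be modelled minimally as follows. All names here are illustrative, not GlusterFS's actual inode or lock structures:

```python
class Inode:
    def __init__(self, gfid):
        self.gfid = gfid
        self.removed = False   # set by rmdir; inode may outlive the dentry
        self.locks = set()

def posix_rmdir(inode):
    # Mark the in-memory inode as gone from the backend.  Held references
    # (e.g. granted locks) keep it in the inode table regardless.
    inode.removed = True

def inodelk(inode, owner):
    # New lock requests on a removed directory are refused...
    if inode.removed:
        raise OSError("ESTALE: directory already removed")
    inode.locks.add(owner)

def inodeunlk(inode, owner):
    # ...but unlocks are still honoured, so existing holders can clean up.
    inode.locks.discard(owner)
```

In the rmdir-vs-selfheal race above, the lookup's lock request at step 4 would then fail with ESTALE instead of being granted on a directory that no longer exists on disk, at the cost of a stat() per granted lock being avoided.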
Re: [Gluster-devel] requesting for review
It's been reviewed and merged. - Original Message - > From: "Hari Gowtham"> To: "Gluster Devel" > Sent: Monday, August 31, 2015 6:13:44 PM > Subject: [Gluster-devel] requesting for review > > Hi, > > Could anyone review this patch? it has passed the regression and netbsd. > > http://review.gluster.org/#/c/11906/ > > > -- > Regards, > Hari. > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] GlusterFS cache architecture
- Original Message - > From: "Oleksandr Natalenko"> To: gluster-devel@gluster.org > Sent: Monday, August 31, 2015 7:37:51 PM > Subject: [Gluster-devel] GlusterFS cache architecture > > Hello. > > I'm trying to investigate how GlusterFS manages cache on both server and > client side, but unfortunately cannot find any exhaustive, appropriate > and up > to date information. > > The disposition is that we have, saying, 2 GlusterFS nodes (server_a and > server_b) with replicated volume some_volume. Also we have several > clients > (saying client_1 and client_2) that mount some_volume and do some > manipulation > with files on it (lets assume some_volume contains web-related assets, > and > client_1/client_2 are web-servers). Also there is client_3 that does > web- > related deploying on some_volume (lets assume that client_3 is > web-developer). > > We would like to use multilayered cache scheme that involves filesystem > cache > (on both client/server sides) as well as web server cache. > > So, my questions are: > > 1) does caching-related items (performance.cache-size, > performance.cache-min- > file-size, performance.cache-max-file-size etc.) affect server side > only? Actually, caching is on the client side (this caching aims to beat network and disk latency to add up into our fop - file operation - latency). There is no server side caching in glusterfs as of now (except for what ever caching underlying OS/drivers provide in backend). > 2) are there any tunables that affect client side caching? Yes. Basic tunables one need to be aware of are the ones affecting cache-sizes. There are some tunables which define glusterfs behaviour for better/lesser consistency (with a possible trade-off of performance). These consistency related tunables are mostly (but not limited to) in write-behind (like strict-ordering, flush-behind etc). There are various timeouts in each xlator that can be configured to tune cache-coherency. 
"gluster volume set help" should give you a starting point. > 3) how client-side caching (we are talking about read cache only, write > cache > is not interesting to us) is performed (if it is at all)? client side read-caching is done across multiple xlators: 1. read-ahead: to boost performance during sequential reads. We read "ahead" of the application, so that data can be in our read-cache by the time application requests it. 2. io-cache: to boost performance if application "re-reads" same region of file. We cache after application has requested some data, so that subsequent accesses are served from io-cache. 3. quick-read (in conjunction with open-behind): to boost reads on small files. Quick read caches the entire file during lookup. Any further opens are "faked" by open-behind, assuming that the application is doing open solely to read the file (which is anyways cached already). If the application does a different fop, then an fd is opened and fop is performed after successful open. Quick read aims to save time spent in open, multiple reads and a release over network. 4. md-cache (or stat-prefetch): Caches metadata (like iatt - gluster equivalent of stat, user xattrs etc). 5. readdir-ahead: similar to read-ahead, but for directory entries during readdir. This helps to boost performance of readdir. > 4) how and in what cases client cache is discarded (and how that relates > to > upcall framework)? As of now read-cache is discarded based on the availability of free space in cache and timeouts (age of data in cache). Currently upcall is not used to address cache-coherency issues, but can be used in future. > > Ideally, there should be some documentation that covers general > GlusterFS > cache workflow. > > Any info would be appreciated. Thanks. 
> > -- > Oleksandr post-factum Natalenko, MSc > pf-kernel community > https://natalenko.name/ > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] GlusterFS cache architecture
- Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Oleksandr Natalenko" <oleksa...@natalenko.name> > Cc: gluster-devel@gluster.org > Sent: Tuesday, September 1, 2015 9:20:04 AM > Subject: Re: [Gluster-devel] GlusterFS cache architecture > > > > - Original Message - > > From: "Oleksandr Natalenko" <oleksa...@natalenko.name> > > To: gluster-devel@gluster.org > > Sent: Monday, August 31, 2015 7:37:51 PM > > Subject: [Gluster-devel] GlusterFS cache architecture > > > > Hello. > > > > I'm trying to investigate how GlusterFS manages cache on both server and > > client side, but unfortunately cannot find any exhaustive, appropriate > > and up > > to date information. > > > > The disposition is that we have, saying, 2 GlusterFS nodes (server_a and > > server_b) with replicated volume some_volume. Also we have several > > clients > > (saying client_1 and client_2) that mount some_volume and do some > > manipulation > > with files on it (lets assume some_volume contains web-related assets, > > and > > client_1/client_2 are web-servers). Also there is client_3 that does > > web- > > related deploying on some_volume (lets assume that client_3 is > > web-developer). > > > > We would like to use multilayered cache scheme that involves filesystem > > cache > > (on both client/server sides) as well as web server cache. > > > > So, my questions are: > > > > 1) does caching-related items (performance.cache-size, > > performance.cache-min- > > file-size, performance.cache-max-file-size etc.) affect server side > > only? > > Actually, caching is on the client side (this caching aims to beat network > and disk latency to add up into our fop - file operation - latency). There > is no server side caching in glusterfs as of now (except for what ever > caching underlying OS/drivers provide in backend). > > > 2) are there any tunables that affect client side caching? > > Yes. Basic tunables one need to be aware of are the ones affecting > cache-sizes. 
There are some tunables which define glusterfs behaviour for > better/lesser consistency (with a possible trade-off of performance). These > consistency related tunables are mostly (but not limited to) in write-behind > (like strict-ordering, flush-behind etc). There are various timeouts in each > xlator that can be configured to tune cache-coherency. "gluster volume set > help" should give you a starting point. If you don't find documentation anywhere, you can look into source code of each of the xlators for a global definition of array "options" which is of type "struct volume_options" :). They also carry basic few line description of what the option is supposed to do. > > > 3) how client-side caching (we are talking about read cache only, write > > cache > > is not interesting to us) is performed (if it is at all)? > > client side read-caching is done across multiple xlators: > > 1. read-ahead: to boost performance during sequential reads. We read "ahead" > of the application, so that data can be in our read-cache by the time > application requests it. > > 2. io-cache: to boost performance if application "re-reads" same region of > file. We cache after application has requested some data, so that subsequent > accesses are served from io-cache. > > 3. quick-read (in conjunction with open-behind): to boost reads on small > files. Quick read caches the entire file during lookup. Any further opens > are "faked" by open-behind, assuming that the application is doing open > solely to read the file (which is anyways cached already). If the > application does a different fop, then an fd is opened and fop is performed > after successful open. Quick read aims to save time spent in open, multiple > reads and a release over network. > > 4. md-cache (or stat-prefetch): Caches metadata (like iatt - gluster > equivalent of stat, user xattrs etc). > > 5. readdir-ahead: similar to read-ahead, but for directory entries during > readdir. This helps to boost performance of readdir. 
> > > > 4) how and in what cases client cache is discarded (and how that relates > > to > > upcall framework)? > > As of now read-cache is discarded based on the availability of free space in > cache and timeouts (age of data in cache). Currently upcall is not used to > address cache-coherency issues, but can be used in future. > > > > > Ideally, there should be some documentation that covers general > > GlusterFS > > cache workflow. > > > > Any info would be appreciated. Thanks. > > > > -- > > Oleksandr post-factum Natalenko, MSc > > pf-kernel community > > https://natalenko.name/ > > ___ > > Gluster-devel mailing list > > Gluster-devel@gluster.org > > http://www.gluster.org/mailman/listinfo/gluster-devel > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] FOP ratelimit?
- Original Message - > From: "Pranith Kumar Karampuri"> To: "Emmanuel Dreyfus" , gluster-devel@gluster.org > Sent: Wednesday, September 2, 2015 2:04:32 PM > Subject: Re: [Gluster-devel] FOP ratelimit? > > > > On 09/02/2015 01:59 PM, Emmanuel Dreyfus wrote: > > Hi > > > > Yesterday I experienced the problem of a single user bringing down > > a glusterfs cluster to its knees because of a high amount of rename > > operations. > > > > I understand rename on DHT can be very costly because data really have > > to be moved from a brick to another one just for a file name change. > > Is there a workaround for this behavior? > This is not true. Data is not moved across bricks during rename. So, maybe something else is causing the issue. Were you running rebalance while these renames were being done? > > > > And more generally, do we have a way to ratelimit FOPs per client, so > > that one client cannot make the cluster unusable for the others? > Do you have profile data? > > Raghavendra G is working on some QoS-related enhancements in gluster. > Please let us know if you have any inputs here. Thanks Pranith. @Manu and others, It's helpful if you can give some pointers on what parameters (like latency, throughput etc) you want us to consider for QoS. Also, any ideas (like an interface for QoS) in this area are welcome. From my very basic search, it seems there are not many filesystems with QoS functionality. regards, Raghavendra. > > Pranith > > > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] FOP ratelimit?
- Original Message - > From: "Emmanuel Dreyfus" <m...@netbsd.org> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Pranith Kumar Karampuri" > <pkara...@redhat.com> > Cc: gluster-devel@gluster.org > Sent: Wednesday, September 2, 2015 8:12:37 PM > Subject: Re: [Gluster-devel] FOP ratelimit? > > Raghavendra Gowdappa <rgowd...@redhat.com> wrote: > > > Its helpful if you can give some pointers on what parameters (like > > latency, throughput etc) you want us to consider for QoS. > > Full blown QoS would be nice, but a first line of defense against > resource hogs seems just badly required. > > A bare minimum could be to process client's FOP in a round robin > fashion. That way even if one client sends a lot of FOPs, there is > always some window for others to slip in. > > Any opinion? As of now we depend on epoll/poll events informing servers about incoming messages. All sockets are put in the same event-pool represented by a single poll-control fd. So, the order of our processing of msgs from various clients really depends on how epoll/poll picks events across multiple sockets. Do poll/epoll have any sort of scheduling? or is it random? Any pointers on this are appreciated. > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > m...@netbsd.org > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
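Independent of how epoll orders events across sockets, the "first line of defense" Emmanuel asks for is usually implemented as a per-client token bucket in front of the fop-processing path. A minimal sketch of that classic technique follows; this is not an existing GlusterFS feature, and the names are illustrative:

```python
class TokenBucket:
    # Each client gets `rate` fops/second with bursts up to `burst`;
    # a client that exhausts its bucket is throttled, leaving a window
    # for other clients' fops to be processed.
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # throttle: queue or delay this client's fop
```

One bucket per connected client, consulted before a request is handed to the fop path, would cap a rename-heavy client without needing epoll itself to provide any fairness guarantees.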
[Gluster-devel] [posix-compliance] unlink and access to file through open fd
All, POSIX allows access to a file through open fds even if the name associated with the file is deleted. While this works for glusterfs in most cases, there are some corner cases where we fail. 1. Reboot of brick: === With the reboot of a brick, the fd is lost. unlink would've deleted both the gfid and path links to the file and we would lose the file. As a solution, perhaps we should create a hardlink to the file (say in .glusterfs) which gets deleted only when the last fd is closed? 2. Graph switch: = The issue is captured in bz 1259995 [1]. Pasting the content from bz verbatim: Consider the following sequence of operations: 1. fd = open ("/mnt/glusterfs/file"); 2. unlink ("/mnt/glusterfs/file"); 3. Do a graph-switch, let's say by adding a new brick to the volume. 4. Migration of the fd to the new graph fails. This is because as part of migration we do a lookup and open. But the lookup fails as the file is already deleted, and hence migration fails and the fd is marked bad. In fact this test case is already present in our regression tests, though the test only checks whether the fd is marked as bad. But the expectation of filing this bug is that migration should succeed. This is possible since there is an fd opened on the brick through the old graph, which can be duped using the dup syscall. Of course, the solution outlined here doesn't cover the case where the file is not present on a brick at all. For example, a new brick was added to a replica set and that new brick doesn't contain the file. Now, since the file is deleted, how does replica heal that file to another brick, etc.? But at least this can be solved for those cases where the file was present on a brick and an fd was already opened. 3. Open-behind and unlink from a different client: == While open-behind handles unlink from the same client (through which the open was performed), if unlink and open are done from two different clients, the file is lost. I cannot think of any good solution for this.
I wanted to know whether these problems are real enough to channel our efforts to fix these issues. Comments are welcome in terms of solutions or other possible scenarios which can lead to this issue. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1259995 regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
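For reference, the POSIX behaviour the whole thread builds on can be demonstrated against a local filesystem; the corner cases above (brick reboot, graph switch, open-behind) are exactly where glusterfs can fail to preserve this guarantee:

```python
import os
import tempfile

def read_after_unlink(data):
    # POSIX: an open fd keeps the file's data reachable even after the
    # last name is unlinked; the inode dies only when the last fd closes.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, data)
        os.unlink(path)                  # name gone, inode still alive
        assert not os.path.exists(path)
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(data))    # still readable through the fd
    finally:
        os.close(fd)
```

The .glusterfs hardlink idea in case 1 is essentially a way to give the brick an on-disk analogue of this in-kernel reference count.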
Re: [Gluster-devel] Handling Failed flushes in write-behind
+ gluster-devel > > On Tuesday 29 September 2015 04:45 PM, Raghavendra Gowdappa wrote: > > Hi All, > > > > Currently, on failure to flush the writeback cache, we mark the fd bad. > > The rationale behind this is that since the application doesn't know which > > of the cached writes failed, the fd is in a bad state and cannot > > possibly do a meaningful/correct read. However, this approach (though > > posix-compliant) is not acceptable for long-standing applications like > > QEMU [1]. So, a two-part solution was decided: > > > > 1. No longer mark the fd bad during failures while flushing data to the backend > > from the write-behind cache. > > 2. Retry the writes. > > > > As far as 2 goes, the application can checkpoint by doing fsync and, on write > > failures, roll back to the last checkpoint and replay writes from that > > checkpoint. Or, glusterfs can retry the writes on behalf of the > > application. However, glusterfs retrying writes cannot be a complete > > solution, as the error condition we've run into might never get resolved > > (e.g., running out of space). So, glusterfs has to give up after some > > time. > > > > It would be helpful if you give your inputs on how other writeback systems > > (e.g., kernel page-cache, nfs, samba, ceph, lustre etc.) behave in this > > scenario and what would be a sane policy for glusterfs. > > > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1200862 > > > > regards, > > Raghavendra > > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
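The application-side half of option 2 above, checkpoint with fsync and replay on failure, could look roughly like this. It is a sketch of the contract only, not QEMU's or any real application's code:

```python
import os

def write_with_replay(fd, chunks, max_attempts=3):
    # fsync() is the checkpoint: POSIX only guarantees durability up to a
    # successful fsync.  On an fsync failure, seek back to the checkpoint
    # offset and replay every write issued since it, giving up after a
    # few attempts (the error, e.g. ENOSPC, may never clear).
    checkpoint = os.lseek(fd, 0, os.SEEK_CUR)
    for chunk in chunks:
        os.write(fd, chunk)
    for _ in range(max_attempts):
        try:
            os.fsync(fd)
            return True                      # new checkpoint established
        except OSError:
            os.lseek(fd, checkpoint, os.SEEK_SET)
            for chunk in chunks:             # replay since last checkpoint
                os.write(fd, chunk)
    return False
```

Note this only works if the filesystem stops marking the fd bad on flush failure, which is exactly part 1 of the proposed solution.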
Re: [Gluster-devel] Handling Failed flushes in write-behind
- Original Message - > From: "Prashanth Pai" <p...@redhat.com> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Thiago da Silva" > <thi...@redhat.com> > Sent: Wednesday, September 30, 2015 11:38:38 AM > Subject: Re: [Gluster-devel] Handling Failed flushes in write-behind > > > > > As for as 2, goes, application can checkpoint by doing fsync and on > > > > write > > > > failures, roll-back to last checkpoint and replay writes from that > > > > checkpoint. Or, glusterfs can retry the writes on behalf of the > > > > application. However, glusterfs retrying writes cannot be a complete > > > > solution as the error-condition we've run into might never get resolved > > > > (For eg., running out of space). So, glusterfs has to give up after > > > > some > > > > time. > > The application should not be expected to replay writes. glusterfs must be > retrying the failed write. Well, failed writes can fail due to two categories of errors: 1. The error condition can be transient or file-system can do something to alleviate the error. 2. The error condition can be permanent or file-system has no control over how to recover from the failure condition. For eg., Network failure. The best a file-system can do in scenario 1 is: 1. try to do things to alleviate the error. 2. retry the writes For eg., ext4 on seeing a writeback failure with ENOSPC, tries to free some space by freeing some extents (again extents are managed by filesystem) and retries. Again this retry is only once after failure. After that page is marked with error. As far as failure scenarios 2, there is no point in retrying and it is difficult to have a well defined policy on how long we can keep retrying. The purpose of this mail is to identify errors that fall into scenario 1 above and have a recovery policy. I am afraid, glusterfs cannot do much in scenario 2. If you've ideas that can help for scenario 2, I am open to incorporate them. 
I took a quick look at how various filesystems handle writeback failures (this is not extensive research and hence there might be some incorrectness): 1. FUSE: == FUSE implemented write-back from kernel version 3.15. In its current version, it doesn't replay the writes at all on writeback failure. 2. xfs: xfs seems to have an intelligent failure-handling mechanism on writeback failure. It marks the pages as dirty again after a writeback failure for some errors. For other errors, it doesn't retry. I couldn't look into the details of which errors are retried and which are not. 3. ext4: = Only ENOSPC errors are retried. That too, only once. Also, please note that to the best of my knowledge, POSIX only guarantees writes that are checkpointed by fsync to have been persisted. Given the above constraints, I am curious to know how applications handle similar issues on other filesystems. > In gluster-swift, we had hit into a case where the application would get EIO > but the write had actually failed because of ENOSPC. From the Linux kernel source tree:

static inline void mapping_set_error(struct address_space *mapping, int error)
{
    if (unlikely(error)) {
        if (error == -ENOSPC)
            set_bit(AS_ENOSPC, &mapping->flags);
        else
            set_bit(AS_EIO, &mapping->flags);
    }
}

Seems like only ENOSPC is stored. The rest of the errors are transformed into EIO. Again, we are ready to comply with whatever is the standard practice.
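The ext4 policy summarised above (a single retry on ENOSPC, immediate failure for everything else) translates to a small wrapper; a sketch, with `write_fn` standing in for the actual page-writeback path:

```python
import errno

def writeback(write_fn, page):
    # Mirror the ext4 behaviour described above: one retry for ENOSPC
    # (space may have been freed in the meantime), no retry otherwise.
    try:
        return write_fn(page)
    except OSError as e:
        if e.errno == errno.ENOSPC:
            return write_fn(page)    # single retry, then the error propagates
        raise
```

A comparable write-behind policy in glusterfs would bound retries the same way, so a permanent error condition cannot keep the cache replaying forever.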
> https://bugzilla.redhat.com/show_bug.cgi?id=986812 > > Regards, > -Prashanth Pai > > - Original Message - > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > To: "Vijay Bellur" <vbel...@redhat.com> > > Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Ben Turner" > > <btur...@redhat.com>, "Ira Cooper" <icoo...@redhat.com> > > Sent: Tuesday, September 29, 2015 4:56:33 PM > > Subject: Re: [Gluster-devel] Handling Failed flushes in write-behind > > > > + gluster-devel > > > > > > > > On Tuesday 29 September 2015 04:45 PM, Raghavendra Gowdappa wrote: > > > > Hi All, > > > > > > > > Currently on failure of flushing of writeback cache, we mark the fd > > > > bad. > > > > The rationale behind this is that since the application doesn't know > > > > which > > > > of the writes that are cached failed, fd is in a bad state and cannot > > > > possibly do a meaningful/correct read. However, this approach (though > > > > posix-complaint) is
Re: [Gluster-devel] Handling Failed flushes in write-behind
+kevin - Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Prashanth Pai" <p...@redhat.com> > Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Thiago da Silva" > <thi...@redhat.com> > Sent: Monday, October 5, 2015 11:37:00 AM > Subject: Re: [Gluster-devel] Handling Failed flushes in write-behind > > > > - Original Message - > > From: "Prashanth Pai" <p...@redhat.com> > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Thiago da Silva" > > <thi...@redhat.com> > > Sent: Wednesday, September 30, 2015 11:38:38 AM > > Subject: Re: [Gluster-devel] Handling Failed flushes in write-behind > > > > > > > As far as 2 goes, the application can checkpoint by doing fsync and on > > > > > write > > > > > failures, roll back to the last checkpoint and replay writes from that > > > > > checkpoint. Or, glusterfs can retry the writes on behalf of the > > > > > application. However, glusterfs retrying writes cannot be a complete > > > > > solution as the error-condition we've run into might never get > > > > > resolved > > > > > (For eg., running out of space). So, glusterfs has to give up after > > > > > some > > > > > time. > > > > The application should not be expected to replay writes. glusterfs must be > > retrying the failed write. > > Well, writes can fail due to two categories of errors: > > 1. The error condition can be transient or the file-system can do something to > alleviate the error. > 2. The error condition can be permanent or the file-system has no control over > how to recover from the failure condition. For eg., Network failure. > > The best a file-system can do in scenario 1 is: > 1. try to do things to alleviate the error. > 2. retry the writes > > For eg., ext4 on seeing a writeback failure with ENOSPC, tries to free some > space by freeing some extents (again extents are managed by the filesystem) and > retries. Again this retry is only once after failure. 
After that page is > marked with error. > > As far as failure scenario 2 goes, there is no point in retrying and it is > difficult to have a well defined policy on how long we can keep retrying. > The purpose of this mail is to identify errors that fall into scenario 1 > above and have a recovery policy. I am afraid, glusterfs cannot do much in > scenario 2. If you've ideas that can help for scenario 2, I am open to > incorporate them. > > I did a quick look at how various filesystems handle writeback failures (this > is not extensive research and hence there might be some incorrectness): > > 1. FUSE: >== > FUSE implemented write-back from kernel version 3.15. In its current > version, it doesn't replay the writes at all on writeback failure. > > 2. xfs: > > xfs seems to have an intelligent failure-handling mechanism on writeback > failure. It marks the pages as dirty again after writeback failure for some > errors. For other errors, it doesn't retry. I couldn't look into details of > which errors are retried and which are not. > > 3. ext4: >= > Only ENOSPC errors are retried. That too, only once. > > Also, please note that to the best of my knowledge, POSIX only guarantees > writes that are checkpointed by fsync to have been persisted. Given the > above constraints I am curious to know how the applications handle similar > issues on other filesystems. > > > In gluster-swift, we had hit into a case where the application would get > > EIO > > but the write had actually failed because of ENOSPC. > > From the linux kernel source tree, > > static inline void mapping_set_error(struct address_space *mapping, int > error) > { > if (unlikely(error)) { > if (error == -ENOSPC) > set_bit(AS_ENOSPC, &mapping->flags); > else > set_bit(AS_EIO, &mapping->flags); > } > } > > Seems like only ENOSPC is stored. The rest of the errors are transformed into > EIO. Again, we are ready to comply with whatever is the standard practice. 
> > > https://bugzilla.redhat.com/show_bug.cgi?id=986812 > > > > Regards, > > -Prashanth Pai > > > > - Original Message - > > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > > To: "Vijay Bellur" <vbel...@redhat.com> > > > Cc: "Gluster Devel&qu
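The checkpoint-and-replay approach suggested in this thread (fsync as the checkpoint, with the application rolling back and replaying its writes on a flush failure) can be sketched in C. This is illustrative only, not glusterfs code; the names (`ckpt_writer`, `ckpt_write`, `ckpt_commit`) and the fixed batch/chunk sizes are invented for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BATCH_MAX 16
#define CHUNK_MAX 512

/* Replay log of writes issued since the last successful fsync(). */
struct ckpt_writer {
    int fd;
    int nbatch;
    struct { off_t off; size_t len; char buf[CHUNK_MAX]; } batch[BATCH_MAX];
};

/* Record the write for possible replay, then issue it. */
static int ckpt_write(struct ckpt_writer *w, off_t off, const void *buf,
                      size_t len)
{
    if (w->nbatch == BATCH_MAX || len > CHUNK_MAX)
        return -1;
    w->batch[w->nbatch].off = off;
    w->batch[w->nbatch].len = len;
    memcpy(w->batch[w->nbatch].buf, buf, len);
    w->nbatch++;
    return pwrite(w->fd, buf, len, off) == (ssize_t)len ? 0 : -1;
}

/* Checkpoint. If the flush fails, roll back to the last checkpoint by
 * replaying the batch, giving up after max_retries attempts (much like
 * ext4 gives up after one retry on ENOSPC). */
static int ckpt_commit(struct ckpt_writer *w, int max_retries)
{
    for (int tries = 0; tries <= max_retries; tries++) {
        if (fsync(w->fd) == 0) {
            w->nbatch = 0;   /* checkpoint reached, drop the replay log */
            return 0;
        }
        for (int i = 0; i < w->nbatch; i++)
            if (pwrite(w->fd, w->batch[i].buf, w->batch[i].len,
                       w->batch[i].off) != (ssize_t)w->batch[i].len)
                return -1;
    }
    return -1;
}
```

The point of the sketch is that the replay log lives in the application, so it survives a failed flush; glusterfs marking the fd bad instead forces exactly this bookkeeping onto the caller.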
Re: [Gluster-devel] compound fop design first cut
> > On 12/08/2015 09:02 AM, Pranith Kumar Karampuri wrote: > > > > > > On 12/08/2015 02:53 AM, Shyam wrote: > >> Hi, > >> > >> Why not think along the lines of new FOPs like fop_compound(_cbk) > >> where, the inargs to this FOP is a list of FOPs to execute (either in > >> order or any order)? > > That is the intent. The question is how do we specify the fops that we > > want to do and the arguments to the fop. In this approach, for example > > xl_fxattrop_writev() is a new FOP. List of fops that need to be done > > are fxattrop, writev in that order and the arguments are a union of > > the arguments needed to perform the fops fxattrop, writev. The reason > > why this fop is not implemented throughout the graph is to not change > > most of the stack on the brick side in the first cut of the > > implementation. i.e. quota/barrier/geo-rep/io-threads > > priorities/bit-rot may have to implement these new compound fops. We > > still get the benefit of avoiding the network round trips. > >> > >> With a scheme like the above we could, > >> - compound any set of FOPs (of course, we need to take care here, > >> but still the feasibility exists) > > It still exists but the fop space will be blown up for each of the > > combinations. > >> - Each xlator can inspect the compound relation and choose to > >> uncompound them. So if an xlator cannot perform FOPA+B as a single > >> compound FOP, it can choose to send FOPA and then FOPB and chain up > >> the responses back to the compound request sent to it. Also, the > >> intention here would be to leverage existing FOP code in any xlator, > >> to appropriately modify the inargs > >> - The RPC payload is constructed based on existing FOP RPC > >> definitions, but compounded based on the compound FOP RPC definition > > This will be done in phase-3 after learning a bit more about how best > > to implement it to prevent stuffing arguments in xdata in future as > > much as possible. 
After which we can choose to retire > > compound-fop-sender and receiver xlators. > >> > >> Possibly on the brick graph as well, pass these down as compounded > >> FOPs, till someone decides to break it open and do it in phases > >> (ultimately POSIX xlator). > > This will be done in phase-2. At the moment we are not giving any > > choice for the xlators on the brick side. > >> > >> The intention would be to break a compound FOP in case an xlator in > >> between cannot support it or, even expand a compound FOP request, say > >> the fxattropAndWrite is an AFR compounding decision, but a compound > >> request to AFR maybe WriteandClose, hence AFR needs to extend this > >> compound request. > > Yes. There was a discussion with krutika where if shard wants to do > > write then xattrop in a single fop, then we need dht to implement > > dht_writev_fxattrop which should look somewhat similar to > > dht_writev(), and afr will need to implement afr_writev_fxattrop() as > > a full-blown transaction where it needs to take data+metadata domain > > locks then do data+metadata pre-op then wind to > > compound_fop_sender_writev_fxattrop() and then data+metadata post-op > > then unlocks. > > > > If we were to do writev, fxattrop separately, fops will be (In > > unoptimized case): > > 1) finodelk for write > > 2) fxattrop for preop of write. > > 3) write > > 4) fxattrop for post op of write > > 5) unlock for write > > 6) finodelk for fxattrop > > 7) fxattrop for preop of shard-fxattrop > > 8) shard-fxattrop > > 9) fxattrop for post op of shard fxattrop > > 10) unlock for fxattrop > > > > If AFR chooses to implement writev_fxattrop: means data+metadata > > transaction. > > 1) finodelk in data, metadata domain simultaneously (just like we take > > multiple locks in rename) > > 2) preop for data, metadata parts as part of the compound fop > > 3) writev+fxattrop > > 4) postop for data, metadata parts as part of the compound fop > > 5) unlocks simultaneously. 
> > > > So it is still 2x reduction of the number of network fops except for > > may be locking. > >> > >> The above is just an off-the-cuff thought on the same. > > We need to arrive at a consensus about how to specify the list of fops > > and their arguments. The reason why I went against list_of_fops is to > > make discovery of possible optimizations we can do easier per > > compound fop (Inspired by ec's implementation of multiplications by > > all possible elements in the Galois field, where multiplication with > > a different number has a different optimization). Could you elaborate > > more about the idea you have about list_of_fops and its arguments? Maybe > > we can come up with combinations of fops where we can employ this > > technique of just list_of_fops and wind. I think the rest of the solutions > > you mentioned are where it will converge over time. The intention > > is to avoid network round trips without waiting for the whole stack to > > change as much as possible. > Maybe I am overthinking it. Not a lot of combinations could be > transactions. In any
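The two encodings being compared, named compound fops such as `xl_fxattrop_writev()` versus a generic list of fops, can be contrasted with a sketch of the latter. None of these types exist in glusterfs; they are invented here to illustrate the "list_of_fops" idea, where a compound request carries an ordered array of tagged argument unions and any xlator can walk the list and choose to uncompound it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Tags for the individual operations a compound request can carry. */
enum cfop_type { CFOP_FXATTROP, CFOP_WRITEV };

struct cfop {
    enum cfop_type type;
    union {
        struct { int flags; /* xattr dict elided */ } fxattrop;
        struct { off_t off; size_t len; const void *buf; } writev;
    } args;
};

/* An ordered list of fops; an xlator that cannot handle the compound can
 * iterate and wind each element as a plain fop ("uncompounding"). */
struct cfop_compound {
    int nops;
    struct cfop *ops;
};

/* Example of an xlator inspecting the compound relation. */
static int cfop_count(const struct cfop_compound *c, enum cfop_type t)
{
    int n = 0;
    for (int i = 0; i < c->nops; i++)
        if (c->ops[i].type == t)
            n++;
    return n;
}
```

The trade-off discussed above shows up directly: a named fop fixes the combination at compile time (easy to optimize per combination), while the list encoding keeps the fop space small but pushes per-combination optimization discovery to runtime inspection.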
Re: [Gluster-devel] intermittent test failure: tests/bugs/tier/bug-1279376-rename-demoted-file.t
- Original Message - > From: "Michael Adam"> To: gluster-devel@gluster.org > Sent: Wednesday, December 9, 2015 1:46:32 PM > Subject: [Gluster-devel] intermittent test failure: > tests/bugs/tier/bug-1279376-rename-demoted-file.t > > Hi, > > found another one. See > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/16603/consoleFull > > Run by http://review.gluster.org/#/c/12830/ > which should not change any test result. A bug has been filed at: https://bugzilla.redhat.com/show_bug.cgi?id=1289845 > > Michael > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] libgfapi compound operations - multiple writes
Forking off since it muddles the original conversation. I have some questions: 1. Why do multiple writes need to be compounded together? 2. If the reason is aggregation, can't we tune write-behind to do the same? regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
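For context on question 2: aggregation in write-behind amounts to coalescing adjacent cached writes into a single request before winding it over the network. A minimal sketch of that merge step (the names and the fixed buffer size are invented for illustration; this is not the actual write-behind code):

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

#define WB_BUF_MAX 128

/* One write sitting in the write-behind cache. */
struct cached_write {
    off_t off;
    size_t len;
    char buf[WB_BUF_MAX];
};

/* Merge b into a when b starts exactly where a ends, so that one larger
 * request is wound instead of two. Returns 0 on merge, -1 otherwise. */
static int wb_coalesce(struct cached_write *a, const struct cached_write *b)
{
    if (a->off + (off_t)a->len != b->off || a->len + b->len > WB_BUF_MAX)
        return -1;
    memcpy(a->buf + a->len, b->buf, b->len);
    a->len += b->len;
    return 0;
}
```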
Re: [Gluster-devel] intermittent test failure: tests/basic/tier/record-metadata-heat.t ?
> Looks like the run failed due to: > > /tests/bugs/fuse/bug-924726.t (Wstat: 0 Tests: 20 Failed: 1) >Failed test: 20 > > Raghavendra - this test has been reported previously too as affecting > other regression runs. Can you please take a look in as you are the > original author of this test unit? I tried reproducing the problem in my > local setup a few times but that does not seem to happen easily. A fix has been sent to: http://review.gluster.org/12906 > > Thanks, > Vijay > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] libgfapi compound operations - multiple writes
- Original Message - > From: "Jeff Darcy" <jda...@redhat.com> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Poornima Gurusiddaiah" > <pguru...@redhat.com> > Cc: "Gluster Devel" <gluster-devel@gluster.org> > Sent: Wednesday, December 9, 2015 10:36:43 PM > Subject: Re: [Gluster-devel] libgfapi compound operations - multiple writes > > > > > On December 9, 2015 at 10:31:03 AM, Raghavendra Gowdappa > (rgowd...@redhat.com) wrote: > > forking off since it muddles the original conversation. I've some > > questions: > > > > 1. Why do multiple writes need to be compounded together? > > 2. If the reason is aggregation, cant we tune write-behind to do the same? > > I think compounding (as we’ve been discussing it) is only necessary when > there’s a dependency between operations. For example, if the first > creates a value (e.g. file descriptor) used by the second, or if the > second should not proceed unless the first (e.g. a lock) succeeded. If > multiple operations are completely independent of one another, as is the > case for writes without fsync, then I think we should rely on > write-behind or something similar instead. Compounding is likely to be > the wrong solution here for two reasons: > > * Correctness: if the writes are independent, there’s no reason why > failure of the first should cause the second not to be issued (as > would be the case with compounding). > > * Performance: compounding would keep the writes separate, whereas > write-behind can reduce overhead even more by coalescing them into a > single request. Yes. I had similar thoughts while asking the question. Thanks for elaborating. > > There is, however, one case where compounding would be the right answer: > when there really is a dependency between the writes. There’s no way to > specify this through the POSIX/VFS interface (more’s the pity), but it’s > easy to imagine GFAPI or internal use cases where a second write should > not overtake or continue without the first - e.g. 
a key/value store > that writes new data followed by an index update pointing to that data. > The strictly-sequential behavior of a compound operation might be just > the right match for such cases. We have one such use-case already, i.e., O_APPEND writes. In fact write-behind has enough logic to address dependencies like conflicting writes, read, stat etc. on just-written regions (Of course, we would lose performance gains as write-behind still winds calls across the network for dependent ops. But again, if the write-behind cache is sufficient, this latency is not witnessed by the application). So, I am wondering whether we can pass these dependency requirements down the stack and let write-behind handle them. @Poornima and others, Did you have any such use-cases in mind when you proposed compounding? regards, Raghavendra ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
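The dependency tracking mentioned above, where write-behind holds back a request that touches a just-written region, reduces to a byte-range overlap test. A sketch (illustrative, not the actual write-behind code): a new request must wait for a cached write to be flushed if their ranges overlap.

```c
#include <assert.h>
#include <sys/types.h>

/* A request depends on a cached write if their byte ranges overlap;
 * write-behind must flush the cached write before winding the request.
 * Half-open ranges: [off, off + len). */
static int wb_ranges_overlap(off_t off1, size_t len1, off_t off2, size_t len2)
{
    return off1 < off2 + (off_t)len2 && off2 < off1 + (off_t)len1;
}
```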
Re: [Gluster-devel] Help needed in understanding GlusterFS logs and debugging elasticsearch failures
- Original Message - > From: "Sachidananda URS"> To: "Gluster Devel" > Sent: Friday, December 11, 2015 8:56:04 PM > Subject: [Gluster-devel] Help needed in understanding GlusterFS logs and > debugging elasticsearch failures > > Hi, > > I was trying to use GlusterFS as a backend filesystem for storing the > elasticsearch indices on GlusterFS mount. > > The filesystem operations as far as I can understand is, lucene engine > does a lot of renames on the index files. And multiple threads read > from the same file concurrently. > > While writing index, elasticsearch/lucene complains of index corruption and > the > health of the cluster goes to red, and all the operations on the index fail > hereafter. > > === > > [2015-12-10 02:43:45,614][WARN ][index.engine ] [client-2] > [logstash-2015.12.09][3] failed engine [merge failed] > org.apache.lucene.index.MergePolicy$MergeException: > org.apache.lucene.index.CorruptIndexException: checksum failed (hardware > problem?) : expected=0 actual=6d811d06 > (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mnt/gluster2/rhs/nodes/0/indices/logstash-2015.12.09/3/index/_a7.cfs") > [slice=_a7_Lucene50_0.doc])) > at > > org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$1.doRun(InternalEngine.java:1233) > at > > org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed Maybe read returned different data than expected? The logs don't indicate anything suspicious. > (hardware problem?) 
: expected=0 actual=6d811d06 > (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mnt/gluster2/rhs/nodes/0/indices/logstash-2015.12.09/3/index/_a7.cfs") > [slice=_a7_Lucene50_0.doc])) > > = > > > Server logs does not have anything. The client logs is full of messages like: > > > > [2015-12-03 18:44:17.882032] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-61881676454442626.tlog > (hash=esearch-replicate-0/cache=esearch-replicate-0) => > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-311.ckp > (hash=esearch-replicate-1/cache=) > [2015-12-03 18:45:31.276316] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-2384654015514619399.tlog > (hash=esearch-replicate-0/cache=esearch-replicate-0) => > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-312.ckp > (hash=esearch-replicate-0/cache=) > [2015-12-03 18:45:31.587660] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-4957943728738197940.tlog > (hash=esearch-replicate-0/cache=esearch-replicate-0) => > /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-312.ckp > (hash=esearch-replicate-0/cache=) > [2015-12-03 18:46:48.424605] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-1731620600607498012.tlog > (hash=esearch-replicate-1/cache=esearch-replicate-1) => > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-313.ckp > (hash=esearch-replicate-1/cache=) > [2015-12-03 18:46:48.466558] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-5214949393126318982.tlog > (hash=esearch-replicate-1/cache=esearch-replicate-1) => > 
/rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-313.ckp > (hash=esearch-replicate-1/cache=) > [2015-12-03 18:48:06.314138] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-9110755229226773921.tlog > (hash=esearch-replicate-0/cache=esearch-replicate-0) => > /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-314.ckp > (hash=esearch-replicate-1/cache=) > [2015-12-03 18:48:06.332919] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-5193443717817038271.tlog > (hash=esearch-replicate-1/cache=esearch-replicate-1) => > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-314.ckp > (hash=esearch-replicate-1/cache=) > [2015-12-03 18:49:24.694263] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-2750483795035758522.tlog >
Re: [Gluster-devel] Is there any advantage or disadvantage to multiple cpu cores?
- Original Message - > From: "Joe Julian"> To: "Gluster Devel" > Sent: Monday, December 14, 2015 2:40:14 PM > Subject: [Gluster-devel] Is there any advantage or disadvantage to multiple > cpu cores? > > Does the code take advantage of multiple cpu cores? On the client: * We have multiple threads to receive replies from bricks in parallel (multithreaded epoll). * The thread that reads from /dev/fuse doesn't generally process replies, so request and reply processing can happen in parallel. On bricks: * io-threads enables parallelism, so all requests/replies are processed in parallel. So, we have multiple threads that can execute on multiple cores simultaneously. However, we don't really assign threads to cores. > If I assigned a single core to gluster, would it have an effect on > performance? A long time back there was a proposal to make sure a request gets assigned to a thread executing on the same core on which the application issued the syscall. The idea was to minimize process context switches on a core and thereby preserve the relevance of the cpu cache. But nothing concrete happened towards that goal. > If yes, explain so I can determine a sane number of cores to allocate > per server. > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
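For reference, the old proposal of tying request-processing threads to cores would have relied on CPU affinity. A minimal Linux sketch of pinning the calling thread to one core (again, glusterfs does not actually do this; `pin_self_to_core` is an invented name):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core. Returns 0 on success,
 * or an errno value from pthread_setaffinity_np(). */
static int pin_self_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Pinning trades scheduler flexibility for cache locality, which is exactly the balance the proposal above was weighing.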
Re: [Gluster-devel] Design for lookup-optimize made default
> > > > > > Sakshi, > > > > > > In the doc. there is reference to the fact that when a client fixes a > > > layout it assigns the same dircommit hash to the layout which is > > > equivalent to the vol commit hash. I think this assumption is incorrect, > > > when a client heals the layout, the commit hash is set to 1 > > > (DHT_LAYOUT_HASH_INVALID) [1]. > > > > Yes. You are correct. Thats an oversight on my part. Sorry about it :). > > > > > > > > What the above basically means is that when anyone other than rebalance > > > changes the layout of an existing directory, it's commit-hash will start > > > disagreeing with the volume commit hash. So that part is already handled > > > (unless I am missing something, which case it is a bug and we need it > > > fixed). > > > > > > The other part of the self-heal, I would consider *not* needed. If a > > > client heals a layout, it is because a previous layout creator > > > (rebalance or mkdir) was incomplete, and hence the client needs to set > > > the layout. If this was by rebalance, the rebalance process would have > > > failed and hence would need to be rerun. For abnormal failures on > > > directory creations, I think the proposed solution is heavy weight, as > > > lookup-optimize is an *optimization* and so it can devolve into > > > non-optimized modes in such cases. IOW, I am stating we do not need to > > > do this self healing. > > > > If this happens to a large number of directories, then performance hit can > > be > > large (and its not an optimization in the sense that hashing should've > > helped us to conclusively say when a file is absent and its basic design of > > dht, which we had strayed away because of bugs). However, the question as > > you pointed out is, can it happen often enough? As of now, healing can be > > triggered because of following reasons: > > > > 1. As of now, no synchronization between rename (src, dst) and healing. > > There > > are two cases here: > >a. 
healing of src by a racing lookup on src. This falls in the class of > >bugs similar to lookup-heal creating directories deleted by a racing > >rmdir and hence will be fixed when we fix that class of bugs (the > >solution to which is ready, implementation is pending). > >b. Healing of destination (as layout of src breaks the continuum of dst > >layout). But again this is not a problem rename overwrites dst only if > >its an empty directory and no children need to be healed for empty > >directory. > > > > 2. Race b/w fix layout from a rebalance process and lookup-heal from a > > client. > >We don't have synchronization b/w these two as of now and *might* end up > >with too many directories with DHT_LAYOUT_HASH_INVALID set resulting in > >poor performance. > > > > 3. Any failures in layout setting (because of node going down after we > > choose > > to heal layout, setxattr failures etc). > > > > Given the above considerations, I conservatively chose to heal children of > > directories. I am not sure whether these considerations are just > > theoretical > > or something realistic that can be hit in field. With the above details, do > > you still think healing from selfheal daemon is not worth the effort? > > And the other thing to note that, once a directory ends up with > DHT_LAYOUT_HASH_INVALID (in non add/remove-brick) scenario, its stays in > that state till there is a fix layout is run or for the entire lifetime of > the directory. Another case where we might end up with DHT_LAYOUT_HASH_INVALID for a directory is a race b/w lookup on a directory and mkdir of the same name. In this race if lookup wins the race and sets the layout, we'll have invalid_hash set on the layout. I was worried about these unknowns and I tried to solve this by having a fall-back option in terms of heal by self-heal daemon in case if we end up with invalid-hash. It seemed easier 1. to identify scenarios where we might heal and add directory to index. 2. 
poll the index, heal the children of the entries found, and remove each entry from the index. This fall-back option, I think, helps us recover instead of assuming that not many use cases lead to an invalid hash on a directory. > > > > > > > > > I think we still need to handle stale layouts and the lookup (and other > > > problems). > > > > Yes, the more we avoid spurious heals, the less we need healing from > > self-heal daemon. In fact we need healing from self-heal daemon only for > > those directories self-heal was triggered spuriously. > > > > > > > > [1] > > > https://github.com/gluster/glusterfs/blob/master/xlators/cluster/dht/src/dht-selfheal.c#L1685 > > > > > > On 12/11/2015 06:08 AM, Sakshi Bansal wrote: > > > > The above link may not be accessible to all. In that case please refer > > > > to > > > > this: > > > > https://public.pad.fsfe.org/p/dht_lookup_optimize > > > > ___ > > > > Gluster-devel mailing list > > > > Gluster-devel@gluster.org
Re: [Gluster-devel] Design for lookup-optimize made default
- Original Message - > From: "Shyam"> To: "Sakshi Bansal" , "Gluster Devel" > > Sent: Monday, December 14, 2015 10:40:09 PM > Subject: Re: [Gluster-devel] Design for lookup-optimize made default > > Sakshi, > > In the doc. there is reference to the fact that when a client fixes a > layout it assigns the same dircommit hash to the layout which is > equivalent to the vol commit hash. I think this assumption is incorrect, > when a client heals the layout, the commit hash is set to 1 > (DHT_LAYOUT_HASH_INVALID) [1]. Yes. You are correct. That's an oversight on my part. Sorry about it :). > > What the above basically means is that when anyone other than rebalance > changes the layout of an existing directory, its commit-hash will start > disagreeing with the volume commit hash. So that part is already handled > (unless I am missing something, in which case it is a bug and we need it > fixed). > > The other part of the self-heal, I would consider *not* needed. If a > client heals a layout, it is because a previous layout creator > (rebalance or mkdir) was incomplete, and hence the client needs to set > the layout. If this was by rebalance, the rebalance process would have > failed and hence would need to be rerun. For abnormal failures on > directory creations, I think the proposed solution is heavy weight, as > lookup-optimize is an *optimization* and so it can devolve into > non-optimized modes in such cases. IOW, I am stating we do not need to > do this self healing. If this happens to a large number of directories, then the performance hit can be large (and it's not just an optimization, in the sense that hashing should've helped us conclusively say when a file is absent; it's the basic design of dht, from which we had strayed because of bugs). However, the question as you pointed out is, can it happen often enough? As of now, healing can be triggered because of the following reasons: 1. As of now, no synchronization between rename (src, dst) and healing. There are two cases here: a. 
healing of src by a racing lookup on src. This falls in the class of bugs similar to lookup-heal creating directories deleted by a racing rmdir and hence will be fixed when we fix that class of bugs (the solution to which is ready, implementation is pending). b. Healing of destination (as layout of src breaks the continuum of dst layout). But again this is not a problem rename overwrites dst only if its an empty directory and no children need to be healed for empty directory. 2. Race b/w fix layout from a rebalance process and lookup-heal from a client. We don't have synchronization b/w these two as of now and *might* end up with too many directories with DHT_LAYOUT_HASH_INVALID set resulting in poor performance. 3. Any failures in layout setting (because of node going down after we choose to heal layout, setxattr failures etc). Given the above considerations, I conservatively chose to heal children of directories. I am not sure whether these considerations are just theoretical or something realistic that can be hit in field. With the above details, do you still think healing from selfheal daemon is not worth the effort? > > I think we still need to handle stale layouts and the lookup (and other > problems). Yes, the more we avoid spurious heals, the less we need healing from self-heal daemon. In fact we need healing from self-heal daemon only for those directories self-heal was triggered spuriously. > > [1] > https://github.com/gluster/glusterfs/blob/master/xlators/cluster/dht/src/dht-selfheal.c#L1685 > > On 12/11/2015 06:08 AM, Sakshi Bansal wrote: > > The above link may not be accessible to all. 
In that case please refer to > > this: > > https://public.pad.fsfe.org/p/dht_lookup_optimize > > ___ > > Gluster-devel mailing list > > Gluster-devel@gluster.org > > http://www.gluster.org/mailman/listinfo/gluster-devel > > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
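The behavior under discussion reduces to one predicate: an ENOENT from the hashed subvolume is conclusive only while the directory's commit hash still matches the volume's, and a healed layout carries DHT_LAYOUT_HASH_INVALID (1), which forces a lookup on all subvolumes. A sketch of that check (illustrative; the real logic lives in dht's lookup path and this function name is invented):

```c
#include <assert.h>
#include <stdint.h>

#define DHT_LAYOUT_HASH_INVALID 1

/* With lookup-optimize on, dht may skip the "lookup everywhere" fallback
 * only when the directory layout was committed by a full rebalance. */
static int dht_needs_lookup_everywhere(uint32_t dir_commit_hash,
                                       uint32_t vol_commit_hash)
{
    if (dir_commit_hash == DHT_LAYOUT_HASH_INVALID)
        return 1;                      /* layout was healed, not rebalanced */
    return dir_commit_hash != vol_commit_hash;  /* add/remove-brick since */
}
```

This is why the thread worries about directories ending up with the invalid hash: once set, every negative lookup on that directory pays the lookup-everywhere cost until a fix-layout runs.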
Re: [Gluster-devel] Design for lookup-optimize made default
- Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Shyam" <srang...@redhat.com> > Cc: "Gluster Devel" <gluster-devel@gluster.org> > Sent: Tuesday, December 15, 2015 11:38:03 AM > Subject: Re: [Gluster-devel] Design for lookup-optimize made default > > > > - Original Message - > > From: "Shyam" <srang...@redhat.com> > > To: "Sakshi Bansal" <saban...@redhat.com>, "Gluster Devel" > > <gluster-devel@gluster.org> > > Sent: Monday, December 14, 2015 10:40:09 PM > > Subject: Re: [Gluster-devel] Design for lookup-optimize made default > > > > Sakshi, > > > > In the doc. there is reference to the fact that when a client fixes a > > layout it assigns the same dircommit hash to the layout which is > > equivalent to the vol commit hash. I think this assumption is incorrect, > > when a client heals the layout, the commit hash is set to 1 > > (DHT_LAYOUT_HASH_INVALID) [1]. > > Yes. You are correct. Thats an oversight on my part. Sorry about it :). > > > > > What the above basically means is that when anyone other than rebalance > > changes the layout of an existing directory, it's commit-hash will start > > disagreeing with the volume commit hash. So that part is already handled > > (unless I am missing something, which case it is a bug and we need it > > fixed). > > > > The other part of the self-heal, I would consider *not* needed. If a > > client heals a layout, it is because a previous layout creator > > (rebalance or mkdir) was incomplete, and hence the client needs to set > > the layout. If this was by rebalance, the rebalance process would have > > failed and hence would need to be rerun. For abnormal failures on > > directory creations, I think the proposed solution is heavy weight, as > > lookup-optimize is an *optimization* and so it can devolve into > > non-optimized modes in such cases. IOW, I am stating we do not need to > > do this self healing. 
> > If this happens to a large number of directories, then performance hit can be > large (and its not an optimization in the sense that hashing should've > helped us to conclusively say when a file is absent and its basic design of > dht, which we had strayed away because of bugs). However, the question as > you pointed out is, can it happen often enough? As of now, healing can be > triggered because of following reasons: > > 1. As of now, no synchronization between rename (src, dst) and healing. There > are two cases here: >a. healing of src by a racing lookup on src. This falls in the class of >bugs similar to lookup-heal creating directories deleted by a racing >rmdir and hence will be fixed when we fix that class of bugs (the >solution to which is ready, implementation is pending). >b. Healing of destination (as layout of src breaks the continuum of dst >layout). But again this is not a problem rename overwrites dst only if >its an empty directory and no children need to be healed for empty >directory. > > 2. Race b/w fix layout from a rebalance process and lookup-heal from a > client. >We don't have synchronization b/w these two as of now and *might* end up >with too many directories with DHT_LAYOUT_HASH_INVALID set resulting in >poor performance. > > 3. Any failures in layout setting (because of node going down after we choose > to heal layout, setxattr failures etc). > > Given the above considerations, I conservatively chose to heal children of > directories. I am not sure whether these considerations are just theoretical > or something realistic that can be hit in field. With the above details, do > you still think healing from selfheal daemon is not worth the effort? And the other thing to note is that once a directory ends up with DHT_LAYOUT_HASH_INVALID (in a non add/remove-brick scenario), it stays in that state till a fix-layout is run, or for the entire lifetime of the directory. 
> > > > > I think we still need to handle stale layouts and the lookup (and other > > problems). > > Yes, the more we avoid spurious heals, the less we need healing from > self-heal daemon. In fact we need healing from self-heal daemon only for > those directories self-heal was triggered spuriously. > > > > > [1] > > https://github.com/gluster/glusterfs/blob/master/xlators/cluster/dht/src/dht-selfheal.c#L1685 > > > > On 12/11/2015 06:08 AM, Sakshi Bansal wrote: > > > The above link may not be accessible to all. In that case please refer to > > > this: > > > https://pu
Re: [Gluster-devel] quota.t hangs on NetBSD machines
- Original Message - > From: "Emmanuel Dreyfus" <m...@netbsd.org> > To: "Gluster Devel" <gluster-devel@gluster.org> > Cc: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Raghavendra Talur" > <rta...@redhat.com> > Sent: Monday, January 4, 2016 2:35:23 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > On Mon, Jan 04, 2016 at 09:00:43AM +, Emmanuel Dreyfus wrote: > > gluster volume info/status seem to hang too. > > No, sorry, I was not running the right binary. > I now have a statedump, but how can I use it to conclude anything > about who dropped the request? Can you send the statedump? Please look for frames with "complete=0". This indicates that the frame is not unwound. > > -- > Emmanuel Dreyfus > m...@netbsd.org > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
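The "complete=0" check above can be scripted. Assuming a statedump with one `[global.callpool.stack.N.frame.M]` section per frame, as quoted later in this thread (the fragment below is a hand-made mock, not verbatim dump output):

```shell
# Build a small mock statedump fragment in the frame-section format quoted
# later in this thread (section header, frame pointer, translator, complete flag).
cat > /tmp/mock-statedump <<'EOF'
[global.callpool.stack.1.frame.1]
frame=0xb80775f0
translator=patchy-write-behind
complete=0
[global.callpool.stack.1.frame.8]
frame=0xb8077070
translator=fuse
complete=1
EOF

# Count frames that were never unwound (complete=0): each one is a fop that
# some translator is still holding on to.
grep -c '^complete=0' /tmp/mock-statedump
```

For the mock fragment above this prints `1`; on a real statedump, pairing it with `grep -B3 '^complete=0'` shows which translator each stuck frame belongs to.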
Re: [Gluster-devel] quota.t hangs on NetBSD machines
- Original Message - > From: "Emmanuel Dreyfus" <m...@netbsd.org> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > Cc: "Raghavendra Talur" <rta...@redhat.com>, "Gluster Devel" > <gluster-devel@gluster.org> > Sent: Thursday, December 31, 2015 6:32:22 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > Raghavendra Gowdappa <rgowd...@redhat.com> wrote: > > > We saw a similar bt on the test process. At that time we took a statedump of > > the client process. While we were going through the statedump, surprisingly the > > test program resumed and completed. > > That suggests the problem would be in the glusterfs client, doesn't it? Not conclusively. If there is a frame-loss, most likely attaching gdb has no effect. But the client can be one of the potential areas where the problem is. > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > m...@netbsd.org
Re: [Gluster-devel] quota.t hangs on NetBSD machines
- Original Message - > From: "Emmanuel Dreyfus" <m...@netbsd.org> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > Cc: "Emmanuel Dreyfus" <m...@netbsd.org>, "Gluster Devel" > <gluster-devel@gluster.org>, "Raghavendra Talur" > <rta...@redhat.com> > Sent: Monday, January 4, 2016 4:03:22 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > On Mon, Jan 04, 2016 at 05:16:16AM -0500, Raghavendra Gowdappa wrote: > > Can you send the statedump? Please look for frames with "complete=0". This > > indicates that the frame is not unwound. > > Here it is. No unwound frame? Is this a statedump? Statedumps carry lots of other information too. This seems more like profile output from io-stats. Steps to obtain a statedump from my previous mail: 1. Add "all=yes" to /var/run/gluster/glusterdump.options 2. kill -SIGUSR1 <glusterfs-pid> 3. The statedump can be found in /var/run/gluster/*dump* Also, please take the statedump only after you find the application process is in "uninterruptible sleep" ('D' on Linux) state. 
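A sketch of those three steps as a script. The paths are as given above; the demo writes the options file under /tmp so it can run anywhere, and `<glusterfs-pid>` is a placeholder for the client process id:

```shell
# Step 1: ask for a full statedump ("all=yes" enables every section,
# including the inode tables). On a real node this file must be
# /var/run/gluster/glusterdump.options; /tmp is used here for the demo.
OPTS=/tmp/glusterdump.options
echo "all=yes" > "$OPTS"

# Steps 2 and 3 on a real node (commented out: they need a live client):
#   kill -SIGUSR1 <glusterfs-pid>    # trigger the dump
#   ls /var/run/gluster/*dump*       # collect the result

grep -q '^all=yes' "$OPTS" && echo "options file ready"
```

This prints `options file ready` once step 1 is in place; steps 2 and 3 only make sense against a running glusterfs client that is stuck in 'D' state.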
> > { > "gluster.patchy.aggr.read_1b": "0", > "gluster.patchy.aggr.write_1b": "0", > "gluster.patchy.aggr.read_2b": "0", > "gluster.patchy.aggr.write_2b": "0", > "gluster.patchy.aggr.read_4b": "0", > "gluster.patchy.aggr.write_4b": "0", > "gluster.patchy.aggr.read_8b": "0", > "gluster.patchy.aggr.write_8b": "0", > "gluster.patchy.aggr.read_16b": "0", > "gluster.patchy.aggr.write_16b": "0", > "gluster.patchy.aggr.read_32b": "0", > "gluster.patchy.aggr.write_32b": "0", > "gluster.patchy.aggr.read_64b": "0", > "gluster.patchy.aggr.write_64b": "0", > "gluster.patchy.aggr.read_128b": "0", > "gluster.patchy.aggr.write_128b": "0", > "gluster.patchy.aggr.read_256b": "0", > "gluster.patchy.aggr.write_256b": "0", > "gluster.patchy.aggr.read_512b": "0", > "gluster.patchy.aggr.write_512b": "0", > "gluster.patchy.aggr.read_1kb": "0", > "gluster.patchy.aggr.write_1kb": "0", > "gluster.patchy.aggr.read_2kb": "0", > "gluster.patchy.aggr.write_2kb": "0", > "gluster.patchy.aggr.read_4kb": "0", > "gluster.patchy.aggr.write_4kb": "0", > "gluster.patchy.aggr.read_8kb": "0", > "gluster.patchy.aggr.write_8kb": "0", > "gluster.patchy.aggr.read_16kb": "0", > "gluster.patchy.aggr.write_16kb": "0", > "gluster.patchy.aggr.read_32kb": "0", > "gluster.patchy.aggr.write_32kb": "357", > "gluster.patchy.aggr.read_64kb": "0", > "gluster.patchy.aggr.write_64kb": "0", > "gluster.patchy.aggr.read_128kb": "0", > "gluster.patchy.aggr.write_128kb": "0", > "gluster.patchy.aggr.read_256kb": "0", > "gluster.patchy.aggr.write_256kb": "0", > "gluster.patchy.aggr.read_512kb": "0", > "gluster.patchy.aggr.write_512kb": "0", > "gluster.patchy.aggr.read_1mb": "0", > "gluster.patchy.aggr.write_1mb": "0", > "gluster.patchy.aggr.read_2mb": "0", > "gluster.patchy.aggr.write_2mb": "0", > "gluster.patchy.aggr.read_4mb": "0", > "gluster.patchy.aggr.write_4mb": "0", > "gluster.patchy.aggr.read_8mb": "0", > "gluster.patchy.aggr.write_8mb": "0", > "gluster.patchy.aggr.read_16mb": "0", > "gluster.patchy.aggr.write_16mb": 
"0", > "gluster.patchy.aggr.read_32mb": "0", > "gluster.patchy.aggr.write_32mb": "0", > "gluster.patchy.aggr.read_64mb": "0", > "gluster.patchy.aggr.write_64mb": "0", > "gluster.patchy.aggr.read_128mb": "0", > "gluster.patchy.aggr.write_128mb": "0", > "gluster.patchy.aggr.read_256mb": "0", > "gluster.patchy.aggr.write_256mb": "0", > "gluster.patchy.aggr.read_512mb": "0", > "gluster.patchy.aggr.write_512mb": "
Re: [Gluster-devel] quota.t hangs on NetBSD machines
Thanks. There is a write call which is not unwound by write-behind as can be seen below: [.WRITE] request-ptr=0xb80a4830 refcount=2 wound=no generation-number=90 req->op_ret=32768 req->op_errno=0 sync-attempts=0 sync-in-progress=no size=32768 offset=11665408 lied=0 append=0 fulfilled=0 go=0 note, the request is not wound (wound=no and sync-in-progress=no), not unwound (lied=0). I am yet to figure out the RCA. will be sending a patch soon. - Original Message - > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > Cc: "Vijaikumar Mallikarjuna" <vmall...@redhat.com>, "Emmanuel Dreyfus" > <m...@netbsd.org>, "Raghavendra Talur" > <rta...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org> > Sent: Tuesday, January 5, 2016 11:43:55 AM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > Hi Raghavendra, > > Yeah, we have taken the statedump when the test program was in 'D' state. I > have enabled statedump of inodes too. > > Attaching the entire statedump file. > > Thank you :-) > > -- > Regards, > Manikandan Selvaganesh. > > - Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Manikandan Selvaganesh" <mselv...@redhat.com> > Cc: "Emmanuel Dreyfus" <m...@netbsd.org>, "Gluster Devel" > <gluster-devel@gluster.org>, "Vijaikumar Mallikarjuna" <vmall...@redhat.com> > Sent: Monday, January 4, 2016 11:56:03 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > > > - Original Message - > > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > Cc: "Emmanuel Dreyfus" <m...@netbsd.org>, "Gluster Devel" > > <gluster-devel@gluster.org>, "Vijaikumar Mallikarjuna" > > <vmall...@redhat.com> > > Sent: Monday, January 4, 2016 7:00:16 PM > > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > > > Hi, > > > > We have taken statedump of fuse client process, quotad and bricks. 
> > Apparently, we could not find any stack information in brick's statedump. > > Below is the client statedump state information: > > Thanks Manikandan :). You took this statedump once the test program was in > 'D' state, right? Otherwise these can be just in-transit fops. > > > > > [global.callpool.stack.1.frame.1] > > frame=0xb80775f0 > > ref_count=0 > > translator=patchy-write-behind > > complete=0 > > parent=patchy-read-ahead > > wind_from=ra_writev > > wind_to=FIRST_CHILD(this)->fops->writev > > unwind_to=ra_writev_cbk > > As I suspected, write-behind seems to be the culprit. Can you upload the > entire statedump file? Also, make sure you've enabled statedump of inodes > (by setting "all=yes" in glusterdump.options as explained in my previous > mail). > > > > > [global.callpool.stack.1.frame.2] > > frame=0xb8077540 > > ref_count=1 > > translator=patchy-read-ahead > > complete=0 > > parent=patchy-io-cache > > wind_from=ioc_writev > > wind_to=FIRST_CHILD(this)->fops->writev > > unwind_to=ioc_writev_cbk > > > > [global.callpool.stack.1.frame.3] > > frame=0xb8077490 > > ref_count=1 > > translator=patchy-io-cache > > complete=0 > > parent=patchy-quick-read > > wind_from=qr_writev > > wind_to=FIRST_CHILD (this)->fops->writev > > unwind_to=default_writev_cbk > > > > [global.callpool.stack.1.frame.4] > > frame=0xb80773e0 > > ref_count=1 > > translator=patchy-quick-read > > complete=0 > > parent=patchy-open-behind > > wind_from=default_writev_resume > > wind_to=FIRST_CHILD(this)->fops->writev > > unwind_to=default_writev_cbk > > > > [global.callpool.stack.1.frame.5] > > frame=0xb8077330 > > ref_count=1 > > translator=patchy-open-behind > > complete=0 > > parent=patchy-md-cache > > wind_from=mdc_writev > > wind_to=FIRST_CHILD(this)->fops->writev > > unwind_to=mdc_writev_cbk > > > > [global.callpool.stack.1.frame.6] > > frame=0xb80771d0 > > ref_count=1 > > translator=patchy-md-cache > > complete=0 > > parent=patchy > > wind_from=io_stats_writev > > 
wind_to=FIRST_CHILD(this)->fops->writev > > unwind_to=io_stats_writev_cbk > > > > [global.callpool.stack.1.f
Re: [Gluster-devel] quota.t hangs on NetBSD machines
- Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Manikandan Selvaganesh" <mselv...@redhat.com> > Cc: "Vijaikumar Mallikarjuna" <vmall...@redhat.com>, "Emmanuel Dreyfus" > <m...@netbsd.org>, "Raghavendra Talur" > <rta...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org> > Sent: Tuesday, January 5, 2016 12:16:27 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > Thanks. There is a write call which is not unwound by write-behind as can be > seen below: > > [.WRITE] > request-ptr=0xb80a4830 > refcount=2 > wound=no > generation-number=90 > req->op_ret=32768 > req->op_errno=0 > sync-attempts=0 > sync-in-progress=no > size=32768 > offset=11665408 > lied=0 > append=0 > fulfilled=0 > go=0 > > note, the request is not wound (wound=no and sync-in-progress=no), not > unwound (lied=0). I am yet to figure out the RCA. will be sending a patch > soon. I figured out this issue occurs when "trickling-writes" is on in write-behind (by default its on). Unfortunately this option cannot be turned off using cli as of now (one can edit volfiles though). > > - Original Message - > > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > Cc: "Vijaikumar Mallikarjuna" <vmall...@redhat.com>, "Emmanuel Dreyfus" > > <m...@netbsd.org>, "Raghavendra Talur" > > <rta...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org> > > Sent: Tuesday, January 5, 2016 11:43:55 AM > > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > > > Hi Raghavendra, > > > > Yeah, we have taken the statedump when the test program was in 'D' state. I > > have enabled statedump of inodes too. > > > > Attaching the entire statedump file. > > > > Thank you :-) > > > > -- > > Regards, > > Manikandan Selvaganesh. 
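Since the option cannot be toggled from the cli, editing the volfile means changing the write-behind stanza by hand. A sketch of what that stanza might look like with the option disabled — the volume/translator names follow the test volume ("patchy") in this thread, the option name is taken from the message above, and the `subvolumes` line is an assumption about the client graph:

```
volume patchy-write-behind
    type performance/write-behind
    option trickling-writes off
    subvolumes patchy-dht
end-volume
```

After such an edit, the client would need to pick up the new graph (e.g. by remounting) for the change to take effect.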
> > > > - Original Message - > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > To: "Manikandan Selvaganesh" <mselv...@redhat.com> > > Cc: "Emmanuel Dreyfus" <m...@netbsd.org>, "Gluster Devel" > > <gluster-devel@gluster.org>, "Vijaikumar Mallikarjuna" > > <vmall...@redhat.com> > > Sent: Monday, January 4, 2016 11:56:03 PM > > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > > > > > > > - Original Message - > > > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > > Cc: "Emmanuel Dreyfus" <m...@netbsd.org>, "Gluster Devel" > > > <gluster-devel@gluster.org>, "Vijaikumar Mallikarjuna" > > > <vmall...@redhat.com> > > > Sent: Monday, January 4, 2016 7:00:16 PM > > > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > > > > > Hi, > > > > > > We have taken statedump of fuse client process, quotad and bricks. > > > Apparently, we could not find any stack information in brick's statedump. > > > Below is the client statedump state information: > > > > Thanks Manikandan :). You took this statedump once the test program was in > > 'D' state, right? Otherwise these can be just in-transit fops. > > > > > > > > [global.callpool.stack.1.frame.1] > > > frame=0xb80775f0 > > > ref_count=0 > > > translator=patchy-write-behind > > > complete=0 > > > parent=patchy-read-ahead > > > wind_from=ra_writev > > > wind_to=FIRST_CHILD(this)->fops->writev > > > unwind_to=ra_writev_cbk > > > > As I suspected, write-behind seems to be the culprit. Can you upload the > > entire statedump file? Also, make sure you've enabled statedump of inodes > > (by setting "all=yes" in glusterdump.options as explained in my previous > > mail). 
> > > > > > > > [global.callpool.stack.1.frame.2] > > > frame=0xb8077540 > > > ref_count=1 > > > translator=patchy-read-ahead > > > complete=0 > > > parent=patchy-io-cache > > > wind_from=ioc_writev > > > wind_to=FIRST_CHILD(this)->fops->writev > > > unwind_to=ioc_writev_cbk > > > > > > [global.callpool.stack.1.frame.3] > &g
Re: [Gluster-devel] quota.t hangs on NetBSD machines
- Original Message - > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > To: "Raghavendra G" <raghaven...@gluster.com> > Cc: "Gluster Devel" <gluster-devel@gluster.org> > Sent: Wednesday, January 6, 2016 7:54:32 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > Hi, > We are debugging the issue. With the patch[1], the quota.t doesn't seem to > hang and all the test passes successfully but it throws an error(while > running test #24 in quota.t) "perfused: perfuse_node_inactive: > perfuse_node_fsync failed error = 69: Resource temporarily unavailable". I > have attached a tar file which contains the logs while the test(quota.t) is > being run. > Thanks to Raghavendra Talur and Vijay for helping :) > > [1] http://review.gluster.org/#/c/13177/ I've merged this patch. If there are any tests hung, please kill them and retrigger. > > Thank you :-) > > -- > Regards, > Manikandan Selvaganesh. > > - Original Message - > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > To: "Raghavendra G" <raghaven...@gluster.com> > Cc: "Gluster Devel" <gluster-devel@gluster.org> > Sent: Wednesday, January 6, 2016 11:04:23 AM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > Hi Raghavendra, > I will check out this with the fix and update you soon on this :) > > -- > Regards, > Manikandan Selvaganesh. > > - Original Message - > From: "Raghavendra G" <raghaven...@gluster.com> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > Cc: "Manikandan Selvaganesh" <mselv...@redhat.com>, "Gluster Devel" > <gluster-devel@gluster.org> > Sent: Tuesday, January 5, 2016 11:15:57 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > Manikandan, > > Can you test this fix since you've the setup ready? 
> > regards, > Raghavendra > > On Tue, Jan 5, 2016 at 11:11 PM, Raghavendra G <raghaven...@gluster.com> > wrote: > > > A fix has been sent to: > > http://review.gluster.org/13177 > > > > On Tue, Jan 5, 2016 at 2:24 PM, Raghavendra Gowdappa <rgowd...@redhat.com> > > wrote: > > > >> > >> > >> - Original Message - > >> > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > >> > To: "Manikandan Selvaganesh" <mselv...@redhat.com> > >> > Cc: "Vijaikumar Mallikarjuna" <vmall...@redhat.com>, "Emmanuel > >> Dreyfus" <m...@netbsd.org>, "Raghavendra Talur" > >> > <rta...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org> > >> > Sent: Tuesday, January 5, 2016 12:16:27 PM > >> > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > >> > > >> > Thanks. There is a write call which is not unwound by write-behind as > >> can be > >> > seen below: > >> > > >> > [.WRITE] > >> > request-ptr=0xb80a4830 > >> > refcount=2 > >> > wound=no > >> > generation-number=90 > >> > req->op_ret=32768 > >> > req->op_errno=0 > >> > sync-attempts=0 > >> > sync-in-progress=no > >> > size=32768 > >> > offset=11665408 > >> > lied=0 > >> > append=0 > >> > fulfilled=0 > >> > go=0 > >> > > >> > note, the request is not wound (wound=no and sync-in-progress=no), not > >> > unwound (lied=0). I am yet to figure out the RCA. will be sending a > >> patch > >> > soon. > >> > >> I figured out this issue occurs when "trickling-writes" is on in > >> write-behind (by default its on). Unfortunately this option cannot be > >> turned off using cli as of now (one can edit volfiles though). 
> >> > >> > > >> > - Original Message - > >> > > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > >> > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > >> > > Cc: "Vijaikumar Mallikarjuna" <vmall...@redhat.com>, "Emmanuel > >> Dreyfus" > >> > > <m...@netbsd.org>, "Raghavendra Talur" > >> > > <rta...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org> > >> > > Sent: Tuesday, January 5, 2016 11:43:55 AM > >> > >
Re: [Gluster-devel] NetBSD tests not running to completion.
> On 01/07/2016 02:39 PM, Emmanuel Dreyfus wrote: > > On Wed, Jan 06, 2016 at 05:49:04PM +0530, Ravishankar N wrote: > >> I re-triggered NetBSD regressions for > >> http://review.gluster.org/#/c/13041/3 > >> but they are being run in silent mode and are not completing. Can someone > >> from the infra-team take a look? The last 22 tests in > >> https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/ have > >> failed. Highly unlikely that something is wrong with all those patches. > > I note your latest test completed with an error in mount-nfs-auth.t: > > https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/13260/consoleFull > > > > Would you have the jenkins build that did not complete so that I can have a > > look at it? > > > > Generally speaking, I have to point out that NetBSD regression does shed light > > on generic bugs; we had a recent example with quota-nfs.t. For now there > > are no other well-supported platforms, but if you want glusterfs to > > be really portable, removing mandatory NetBSD regression is not a good > > idea: > > portability bugs will crop up. > > > > Even a daily or weekly regression run seems a bad idea to me. If you do not > > prevent integration of patches that break NetBSD regression, they will get > > in, and tests will break one by one over time. I have first-hand > > experience of this situation, from when I was actually trying to catch up with > > NetBSD regression. Many times I reached something reliable enough to become > > mandatory, and got broken by a new patch before it became actually > > mandatory. > > > > IMO, relaxing the NetBSD regression requirement means the project drops the > > goal > > of being portable. > > > hi Emmanuel, > This Sunday I have some time I can spend helping in making > tests better for NetBSD. I have seen bugs that are caught only by NetBSD > regression just recently, so I see value in making NetBSD more reliable. +1. 
As Manu and Ravi's conversation pointed out, it's better to take a call based on data (how many tests are failing, how many are spurious). As my recent work on quota-nfs.t shows, I was actively trying to seek a reproducer for the write-behind issue, but the reproducer seemed elusive. We were able to hit the bug very inconsistently. Couple that with the pressure to take things to closure, and a tendency to push things under the carpet creeps in. Having said that, you can find some of my commits where netbsd results are skipped (or not waited for completion of netbsd runs). Knowing that the infra is stable and there are fewer false positives (of bugs) will shift responsibility onto developers to own the issue and fix it. > Please let me know what are the things we can work on. It would help if > you give me something specific to glusterfs to make it more valuable in > the short term. Over time I would like to learn enough to share the load > with you however little it may be (Please bear with me, I sometimes go > quiet). Here are the initial things I would like to know to begin with: I can try to help out here too. But mostly on a best-effort basis, as there are other responsibilities where I am evaluated directly. 
> 4) How can we make debugging better in NetBSD? In the worst case we can > make all tests execute in trace/debug mode on NetBSD. > > I really want to appreciate the fine job you have done so far in making > sure glusterfs is stable on NetBSD. ++1. I appreciate Emmanuel's effort/support from such a long time and will try to chip in to whatever extent I can. > > Infra team, > I think we need to make some improvements to our infra. We need > to get information about health of linux, NetBSD regression builds. > 1) Something like, in the last 100 builds how many builds succeeded on > Linux, how many succeeded on NetBSD. > 2) What are the tests that failed in the last 100 builds and how many > times on both Linux and NetBSD. (I actually wrote this part in some > parts, but the whole command output has changed making my scripts stale) > Any other ideas you guys have? > 3) Which components have highest number of spurious failures. > 4) How many builds did not complete/manually aborted etc. > > Once we start measuring these things, next
Re: [Gluster-devel] quota.t hangs on NetBSD machines
- Original Message - > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > Cc: "Emmanuel Dreyfus" <m...@netbsd.org>, "Gluster Devel" > <gluster-devel@gluster.org>, "Vijaikumar Mallikarjuna" > <vmall...@redhat.com> > Sent: Monday, January 4, 2016 7:00:16 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > Hi, > > We have taken statedump of fuse client process, quotad and bricks. > Apparently, we could not find any stack information in brick's statedump. > Below is the client statedump state information: Thanks Manikandan :). You took this statedump once the test program was in 'D' state, right? Otherwise these can be just in-transit fops. > > [global.callpool.stack.1.frame.1] > frame=0xb80775f0 > ref_count=0 > translator=patchy-write-behind > complete=0 > parent=patchy-read-ahead > wind_from=ra_writev > wind_to=FIRST_CHILD(this)->fops->writev > unwind_to=ra_writev_cbk As I suspected, write-behind seems to be the culprit. Can you upload the entire statedump file? Also, make sure you've enabled statedump of inodes (by setting "all=yes" in glusterdump.options as explained in my previous mail). 
> > [global.callpool.stack.1.frame.2] > frame=0xb8077540 > ref_count=1 > translator=patchy-read-ahead > complete=0 > parent=patchy-io-cache > wind_from=ioc_writev > wind_to=FIRST_CHILD(this)->fops->writev > unwind_to=ioc_writev_cbk > > [global.callpool.stack.1.frame.3] > frame=0xb8077490 > ref_count=1 > translator=patchy-io-cache > complete=0 > parent=patchy-quick-read > wind_from=qr_writev > wind_to=FIRST_CHILD (this)->fops->writev > unwind_to=default_writev_cbk > > [global.callpool.stack.1.frame.4] > frame=0xb80773e0 > ref_count=1 > translator=patchy-quick-read > complete=0 > parent=patchy-open-behind > wind_from=default_writev_resume > wind_to=FIRST_CHILD(this)->fops->writev > unwind_to=default_writev_cbk > > [global.callpool.stack.1.frame.5] > frame=0xb8077330 > ref_count=1 > translator=patchy-open-behind > complete=0 > parent=patchy-md-cache > wind_from=mdc_writev > wind_to=FIRST_CHILD(this)->fops->writev > unwind_to=mdc_writev_cbk > > [global.callpool.stack.1.frame.6] > frame=0xb80771d0 > ref_count=1 > translator=patchy-md-cache > complete=0 > parent=patchy > wind_from=io_stats_writev > wind_to=FIRST_CHILD(this)->fops->writev > unwind_to=io_stats_writev_cbk > > [global.callpool.stack.1.frame.7] > frame=0xb8077120 > ref_count=1 > translator=patchy > complete=0 > parent=fuse > wind_from=fuse_write_resume > wind_to=FIRST_CHILD(this)->fops->writev > unwind_to=fuse_writev_cbk > > [global.callpool.stack.1.frame.8] > frame=0xb8077070 > ref_count=1 > translator=fuse > complete=0 > > [global.callpool.stack.2] > stack=0xba420040 > uid=0 > gid=0 > pid=0 > unique=0 > lk-owner= > op=stack > type=0 > cnt=1 > > [global.callpool.stack.2.frame.1] > frame=0xba43c6d0 > ref_count=0 > translator=glusterfs > complete=0 > > [fuse] > > Below is the statedump information for quotad > > [global.callpool.stack.1.frame.1] > frame=0xbb1d0620 > ref_count=0 > translator=glusterfs > complete=0 > > Thank you :-) > > -- > Regards, > Manikandan Selvaganesh. 
> > - Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Emmanuel Dreyfus" <m...@netbsd.org> > Cc: "Gluster Devel" <gluster-devel@gluster.org> > Sent: Monday, January 4, 2016 6:11:24 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > > > - Original Message - > > From: "Emmanuel Dreyfus" <m...@netbsd.org> > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > Cc: "Emmanuel Dreyfus" <m...@netbsd.org>, "Gluster Devel" > > <gluster-devel@gluster.org>, "Raghavendra Talur" > > <rta...@redhat.com> > > Sent: Monday, January 4, 2016 4:03:22 PM > > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > > > On Mon, Jan 04, 2016 at 05:16:16AM -0500, Raghavendra Gowdappa wrote: > > > Can you send the statedump? Please look for frames with "complete=0". > > > This > > > indicates that the frame is not unwound. > > > > Here it is. No unwound frame? > > Is this statedump? Statedumps carry lots of other information too. This seems > more li
Re: [Gluster-devel] FreeBSD port of GlusterFS racks up a lot of CPU usage
- Original Message - > From: "Rick Macklem"> To: "Jeff Darcy" > Cc: "Raghavendra G" , "freebsd-fs" > , "Hubbard Jordan" > , "Xavier Hernandez" , "Gluster > Devel" > Sent: Saturday, January 9, 2016 7:29:59 AM > Subject: Re: [Gluster-devel] FreeBSD port of GlusterFS racks up a lot of CPU > usage > > Jeff Darcy wrote: > > > > I don't know anything about gluster's poll implementation so I may > > > > be totally wrong, but would it be possible to use an eventfd (or a > > > > pipe if eventfd is not supported) to signal the need to add more > > > > file descriptors to the poll call ? > > > > > > > > > > > > The poll call should listen on this new fd. When we need to change > > > > the fd list, we should simply write to the eventfd or pipe from > > > > another thread. This will cause the poll call to return and we will > > > > be able to change the fd list without having a short timeout nor > > > > having to decide on any trade-off. > > > > > > > > > Thats a nice idea. Based on my understanding of why timeouts are being > > > used, this approach can work. > > > > The own-thread code which preceded the current poll implementation did > > something similar, using a pipe fd to be woken up for new *outgoing* > > messages. That code still exists, and might provide some insight into > > how to do this for the current poll code. > I took a look at event-poll.c and found something interesting... > - A pipe called "breaker" is already set up by event_pool_new_poll() and > closed by event_pool_destroy_poll(), however it never gets used for > anything. I did a check on history, but couldn't find any information on why it was removed. Can you send this patch to http://review.gluster.org ? We can review and merge the patch over there. 
If you are not aware, development work flow can be found at: http://www.gluster.org/community/documentation/index.php/Developers > > So, I added a few lines of code that writes a byte to it whenever the list of > file descriptors is changed and read when poll() returns, if its revents is > set. > I also changed the timeout to -1 (infinity) and it seems to work for a > trivial > test. > --> Btw, I also noticed the "changed" variable gets set to 1 on a change, but > never reset to 0. I didn't change this, since it looks "racey". (ie. I > think you could easily get a race between a thread that clears it and one > that adds a new fd.) > > A slightly safer version of the patch would set a long (100msec ??) timeout > instead > of -1. > > Anyhow, I've attached the patch in case anyone would like to try it and will > create a bug report for this after I've had more time to test it. > (I only use a couple of laptops, so my testing will be minimal.) > > Thanks for all the help, rick > > > ___ > > freebsd...@freebsd.org mailing list > > https://lists.freebsd.org/mailman/listinfo/freebsd-fs > > To unsubscribe, send any mail to "freebsd-fs-unsubscr...@freebsd.org" > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] volume-snapshot.t failures
Seems like a snapshot failure on build machines. Found another failure: https://build.gluster.org/job/rackspace-regression-2GB-triggered/17066/console Test failed: ./tests/basic/tier/tier-snapshot.t Debug-msg: ++ gluster --mode=script --wignore snapshot create snap2 patchy no-timestamp snapshot create: failed: Pre-validation failed on localhost. Please check log file for details + test_footer + RET=1 + local err= + '[' 1 -eq 0 ']' + echo 'not ok 11 ' not ok 11 + '[' x0 = x0 ']' + echo 'FAILED COMMAND: gluster --mode=script --wignore snapshot create snap2 patchy no-timestamp' FAILED COMMAND: gluster --mode=script --wignore snapshot create snap2 patchy no-timestamp - Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Rajesh Joseph" <rjos...@redhat.com> > Sent: Tuesday, December 22, 2015 10:05:22 AM > Subject: volume-snapshot.t failures > > Hi Rajesh > > There is a failure of volume-snapshot.t on build machine: > https://build.gluster.org/job/rackspace-regression-2GB-triggered/17048/consoleFull > > > However, on my local machine test succeeds always. Is it a known case of > spurious failure? > > regards, > Raghavendra.
Re: [Gluster-devel] volume-snapshot.t failures
Both these tests succeed on my local machine. - Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Rajesh Joseph" <rjos...@redhat.com> > Cc: "Gluster Devel" <gluster-devel@gluster.org> > Sent: Tuesday, December 22, 2015 12:05:24 PM > Subject: Re: [Gluster-devel] volume-snapshot.t failures > > Seems like a snapshot failure on build machines. Found another failure: > https://build.gluster.org/job/rackspace-regression-2GB-triggered/17066/console > > Test failed: > ./tests/basic/tier/tier-snapshot.t > > Debug-msg: > ++ gluster --mode=script --wignore snapshot create snap2 patchy no-timestamp > snapshot create: failed: Pre-validation failed on localhost. Please check log > file for details > + test_footer > + RET=1 > + local err= > + '[' 1 -eq 0 ']' > + echo 'not ok 11 ' > not ok 11 > + '[' x0 = x0 ']' > + echo 'FAILED COMMAND: gluster --mode=script --wignore snapshot create snap2 > patchy no-timestamp' > FAILED COMMAND: gluster --mode=script --wignore snapshot create snap2 patchy > no-timestamp > > - Original Message - > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > To: "Rajesh Joseph" <rjos...@redhat.com> > > Sent: Tuesday, December 22, 2015 10:05:22 AM > > Subject: volume-snapshot.t failures > > > > Hi Rajesh > > > > There is a failure of volume-snapshot.t on build machine: > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/17048/consoleFull > > > > > > However, on my local machine test succeeds always. Is it a known case of > > spurious failure? > > > > regards, > > Raghavendra. > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] volume-snapshot.t failures
Thanks Dan!!

- Original Message -
> From: "Dan Lambright" <dlamb...@redhat.com>
> To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> Cc: "Rajesh Joseph" <rjos...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org>
> Sent: Tuesday, December 22, 2015 12:16:05 PM
> Subject: Re: [Gluster-devel] volume-snapshot.t failures
>
> They fail on RHEL6 machines due to an issue with sqlite. The test has
> already been moved to the ignore list (13056).
> I'd like to look into having tests that do not run on RHEL6, only RHEL7+.
> I've been running it on RHEL7 in a loop for the last few hours, successfully.
>
> - Original Message -
> > From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > To: "Rajesh Joseph" <rjos...@redhat.com>
> > Cc: "Gluster Devel" <gluster-devel@gluster.org>
> > Sent: Tuesday, December 22, 2015 1:42:13 AM
> > Subject: Re: [Gluster-devel] volume-snapshot.t failures
> >
> > Both these tests succeed on my local machine.
> >
> > - Original Message -
> > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > > To: "Rajesh Joseph" <rjos...@redhat.com>
> > > Cc: "Gluster Devel" <gluster-devel@gluster.org>
> > > Sent: Tuesday, December 22, 2015 12:05:24 PM
> > > Subject: Re: [Gluster-devel] volume-snapshot.t failures
> > >
> > > Seems like a snapshot failure on build machines. Found another failure:
> > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/17066/console
> > >
> > > Test failed:
> > > ./tests/basic/tier/tier-snapshot.t
> > >
> > > Debug-msg:
> > > ++ gluster --mode=script --wignore snapshot create snap2 patchy no-timestamp
> > > snapshot create: failed: Pre-validation failed on localhost. Please check
> > > log file for details
> > > + test_footer
> > > + RET=1
> > > + local err=
> > > + '[' 1 -eq 0 ']'
> > > + echo 'not ok 11 '
> > > not ok 11
> > > + '[' x0 = x0 ']'
> > > + echo 'FAILED COMMAND: gluster --mode=script --wignore snapshot create
> > > snap2 patchy no-timestamp'
> > > FAILED COMMAND: gluster --mode=script --wignore snapshot create snap2
> > > patchy no-timestamp
> > >
> > > - Original Message -
> > > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > > > To: "Rajesh Joseph" <rjos...@redhat.com>
> > > > Sent: Tuesday, December 22, 2015 10:05:22 AM
> > > > Subject: volume-snapshot.t failures
> > > >
> > > > Hi Rajesh,
> > > >
> > > > There is a failure of volume-snapshot.t on a build machine:
> > > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/17048/consoleFull
> > > >
> > > > However, the test always succeeds on my local machine. Is this a known
> > > > case of spurious failure?
> > > >
> > > > regards,
> > > > Raghavendra.
Re: [Gluster-devel] Lot of Netbsd regressions 'Waiting for the next available executor'
- Original Message -
> From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> To: "Ravishankar N" <ravishan...@redhat.com>
> Cc: "Gluster Devel" <gluster-devel@gluster.org>, "gluster-infra" <gluster-in...@gluster.org>
> Sent: Thursday, December 24, 2015 12:11:46 PM
> Subject: Re: [Gluster-devel] Lot of Netbsd regressions 'Waiting for the next
> available executor'
>
> https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/12961/consoleFull
>
> Seems to be hung. Maybe a hung syscall? I've tried to kill it, but it seems
> it's not dead. Maybe patch #12594 is causing some issues on NetBSD. It has
> passed gluster regression.

s/gluster/Linux/

> - Original Message -
> > From: "Ravishankar N" <ravishan...@redhat.com>
> > To: "Gluster Devel" <gluster-devel@gluster.org>, "gluster-infra" <gluster-in...@gluster.org>
> > Sent: Thursday, December 24, 2015 9:27:53 AM
> > Subject: [Gluster-devel] Lot of Netbsd regressions 'Waiting for the next
> > available executor'
> >
> > $subject.
> > Since yesterday.
> > The build queue is growing. Something's wrong.
> >
> > "If you see a little black clock icon in the build queue as shown below,
> > it is an indication that your job is sitting in the queue unnecessarily."
> > is what it says.
Re: [Gluster-devel] Lot of Netbsd regressions 'Waiting for the next available executor'
https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/12961/consoleFull

Seems to be hung. Maybe a hung syscall? I've tried to kill it, but it seems
it's not dead. Maybe patch #12594 is causing some issues on NetBSD. It has
passed gluster regression.

- Original Message -
> From: "Ravishankar N"
> To: "Gluster Devel", "gluster-infra"
> Sent: Thursday, December 24, 2015 9:27:53 AM
> Subject: [Gluster-devel] Lot of Netbsd regressions 'Waiting for the next
> available executor'
>
> $subject.
> Since yesterday.
> The build queue is growing. Something's wrong.
>
> "If you see a little black clock icon in the build queue as shown below, it
> is an indication that your job is sitting in the queue unnecessarily." is
> what it says.
Re: [Gluster-devel] Native RDMA in libgfapi
- Original Message -
> From: "Piotr Rybicki"
> To: "Gluster Devel"
> Sent: Friday, November 20, 2015 9:00:28 PM
> Subject: [Gluster-devel] Native RDMA in libgfapi
>
> Hi All.
>
> Are there any plans for this feature?
>
> Just tested the latest glusterfs (3.7.6), and it still doesn't work (as
> expected, since there was no info in the changelog about it).

What errors did you get? Is it possible to send across the log files (bricks,
clients and glusterd logs)? It works for fuse mounts, so theoretically it
should work for gfapi too.

> Native RDMA transport should give a significant boost in performance,
> based on my observations with a fuse mount.
>
> If that is of any help, I'm more than happy to test patches ;-)
>
> I'm using Mellanox QDR cards and OFED 3.12.
>
> Best regards,
> Piotr Rybicki
[Gluster-devel] [Review request] write-behind to retry failed syncs
Hi all,

[1] adds retry logic for failed syncs (to the backend). It would be helpful if
you could comment on:

1. Interface
2. Design
3. Implementation

[1] review.gluster.org/#/c/12594/7

regards,
Raghavendra.
Re: [Gluster-devel] [Review request] write-behind to retry failed syncs
For ease of access, I am posting the summary from the commit message below:

1. When a sync fails, the cached write is still preserved, unless there is a
   flush/fsync waiting on it.
2. When a sync fails and there is a flush/fsync waiting on the cached write,
   the cache is thrown away and no further retries are made. In other words,
   flush/fsync act as barriers for all previous writes: every earlier write is
   either successfully synced to the backend or forgotten in case of an error.
   Without such a barrier fop (especially flush, which is issued prior to a
   close), we would end up retrying forever, even after the fd is closed.
3. If a fop is waiting on a cached write and syncing it to the backend fails,
   the waiting fop is failed.
4. Sync failures when no fop is waiting are ignored and are not propagated to
   the application.
5. The effect of repeated sync failures is that there will be no cache for
   future writes, so they cannot be written behind.

The above algorithm handles transient errors (EDQUOT, ENOSPC, ENOTCONN).
Handling of non-transient errors is slightly different:

1. Throw away the write buffer, so that the cache is freed. This means no
   retries are made for non-transient errors. Also, since the cache is freed,
   future writes can be written behind.
2. Retain the request till an fsync or flush. This means all future operations
   on the failed regions will fail till an fsync/flush. This is conservative
   error handling that forces the application to learn that a written-behind
   write has failed, and to take remedial action such as rolling back to the
   last fsync and retrying all writes from that point.
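The two rule sets above can be modelled as a small state machine. The
following is a minimal, hypothetical Python sketch of that behaviour, written
only to illustrate the commit message; the class, method names and the
`barrier` flag are invented here and do not correspond to the actual
write-behind translator code under review:

```python
import errno

# Errors the algorithm treats as transient (retryable), per the summary above.
TRANSIENT = {errno.EDQUOT, errno.ENOSPC, errno.ENOTCONN}

class WriteBehind:
    def __init__(self, sync_fn):
        self.sync_fn = sync_fn   # backend writer: (offset, data) -> None, may raise OSError
        self.cache = []          # cached writes not yet synced: (offset, data)
        self.failed = []         # regions hit by a non-transient error: (offset, length)

    def write(self, offset, data):
        # Non-transient rule 2: operations on a failed region keep failing
        # until the application issues a flush/fsync.
        for off, length in self.failed:
            if offset < off + length and off < offset + len(data):
                raise OSError(errno.EIO, "earlier write-behind failure; flush/fsync required")
        self.cache.append((offset, data))

    def sync(self, barrier=False):
        # barrier=True models a flush/fsync waiting on the cached writes.
        err, retry = None, []
        for offset, data in self.cache:
            try:
                self.sync_fn(offset, data)
            except OSError as e:
                if e.errno in TRANSIENT and not barrier:
                    retry.append((offset, data))   # rule 1: keep the cache, retry later
                elif e.errno in TRANSIENT:
                    err = e                        # rules 2/3: drop the write, fail the fop
                else:
                    # non-transient rule 1: free the cache, poison the region
                    self.failed.append((offset, len(data)))
                    err = e
        self.cache = retry
        if barrier:
            self.failed = []     # the barrier fop informs the app; region cleared
            if err:
                raise err
        # rule 4: with no fop waiting, the failure is not propagated

    def flush(self):
        self.sync(barrier=True)  # barrier for all previous writes
```

With a backend that fails once with ENOTCONN, a plain `sync()` leaves the
write cached for a later retry, while the same failure under `flush()` drops
the cache and propagates the error to the caller.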