Re: [Gluster-users] Would there be a use for cluster-specific filesystem tools?
This is great stuff! I think there is a huge need for this. It's amazing how much faster basic coreutils operations can be, even when just using one client as you show. I've had a lot of success using dftw [1,2] to speed up these recursive operations on distributed filesystems (conditional finds, Lustre OST retirement, etc.). It's MPI-based (hybrid, actually), and many such tasks scale really well, and keep scaling when you spread the work over multiple client nodes (especially with gluster, it seems). It's really fantastic. I've done things like combine it with mrmpi [3] to create a general mapreduce [4] for these situations.

Take du, for example: standard /usr/bin/du (or just a find that prints sizes) on our 10-node distributed gluster filesystem takes well over a couple of hours for trees with tens of thousands of files. We've cut that down by over an order of magnitude, to about 10 minutes, with a simple parallel du calculation using the above [5] (running with 16 procs across 4 nodes, ~85% strong scaling efficiency).
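In outline it's just a size-summing callback plus an MPI reduction, something like the sketch below. This is simplified from the mapreduce version in [5]; I'm assuming dftw() keeps ftw(3)'s callback shape and that the header is "dftw.h", so treat the exact names as approximate and check the libdftw source for the real API:

    /* Parallel du sketch on top of libdftw (MPI).  Assumptions flagged
     * inline: dftw() is assumed to mirror ftw(3)'s callback, and
     * "dftw.h" is an assumed header name. */
    #include <mpi.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include "dftw.h"                     /* assumed header name */

    static long long local_bytes;         /* bytes seen by this MPI rank */

    /* Called once for each item this rank is handed during the walk. */
    static int sum_sizes(const char *fpath, const struct stat *sb, int typeflag)
    {
        local_bytes += sb->st_size;
        return 0;
    }

    int main(int argc, char **argv)
    {
        long long total = 0;
        int rank;

        MPI_Init(&argc, &argv);           /* assuming the caller owns MPI setup */
        if (argc < 2) {
            fprintf(stderr, "usage: %s directory\n", argv[0]);
            MPI_Finalize();
            return 1;
        }
        dftw(argv[1], sum_sizes);         /* the distributed tree walk */
        MPI_Reduce(&local_bytes, &total, 1, MPI_LONG_LONG, MPI_SUM,
                   0, MPI_COMM_WORLD);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("%lld bytes\n", total);
        MPI_Finalize();
        return 0;
    }

Launch it like any MPI job, e.g. mpirun -np 16 spread across a few client nodes.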
Like you, I have hopes of making a package of such utilities. Your threaded model will probably make this much more approachable, though, and elastic, too. I'll happily try out your tools when you're ready to post them, and I bet a lot of others will, too.

Best,
John

[1] https://github.com/hpc/libdftw
[2] http://dl.acm.org/citation.cfm?id=2389114
[3] http://mapreduce.sandia.gov/
[4] https://github.com/jabrcx/fsmr
[5] https://github.com/jabrcx/fsmr/blob/master/examples/fsmr.du_by_owner/example.c

On Wed, Apr 16, 2014 at 10:31 AM, Joe Julian j...@julianfamily.org wrote:

Excellent! I've been toying with the same concept in the back of my mind for a long while now. I'm sure there is an unrealized desire for such tools. When you're ready, please put such a toolset on http://forge.gluster.org.

On April 16, 2014 6:50:48 AM PDT, Michael Peek p...@nimbios.org wrote:

Hi guys,

(I'm new to this, so pardon me if my shenanigans turn out to be a waste of your time.)

I have been experimenting with Gluster by copying and deleting large numbers of files of all sizes. What I found was that when deleting a large number of small files, the deletion process takes a good chunk of the total time -- in some cases a significant percentage of the time it took to copy the files to the cluster in the first place. I'm guessing the reason is a combination of find and rm -fr processing files serially and waiting on packets to travel back and forth over the network. With a clustered filesystem, that means serializing on network round-trips when you don't have to.

So I decided to try an experiment. Instead of using /bin/rm to delete files serially, I wrote my own quick-and-dirty recursive rm (and recursive ls) that uses pthreads (listed as cluster-rm and cluster-ls in the results below; a sketch of the threaded approach follows the results).

Methods:

1) This was done on a Linux system. I suspect that Linux (or any modern OS) caches filesystem information. For example, after setting up a directory, the time for rm -fr to complete is lessened if I first run find on the same directory. So to avoid this caching effect, each command was run on its own test directory (i.e. find was never run on the same directory as rm -fr or cluster-rm). This avoided caching artifacts and gave much more consistent run times.

2) Each test directory contained the exact same data for each of the four commands tested (find, cluster-ls, rm, cluster-rm) for each test run.

3) All commands were run on a client machine, not on one of the cluster nodes.

Results (49GB data set; real/user/sys from time(1), four runs per command, each on a fresh directory):

find -print
  #1: real 6m45.066s   user 0m0.172s  sys 0m0.748s
  #2: real 6m18.524s   user 0m0.140s  sys 0m0.508s
  #3: real 5m45.301s   user 0m0.156s  sys 0m0.484s
  #4: real 5m58.577s   user 0m0.132s  sys 0m0.480s

cluster-ls
  #1: real 2m32.770s   user 0m0.208s  sys 0m1.876s
  #2: real 2m21.376s   user 0m0.164s  sys 0m1.568s
  #3: real 2m40.511s   user 0m0.184s  sys 0m1.488s
  #4: real 2m36.202s   user 0m0.172s  sys 0m1.412s

rm -fr
  #1: real 16m36.264s  user 0m0.232s  sys 0m1.724s
  #2: real 16m16.795s  user 0m0.248s  sys 0m1.528s
  #3: real 15m54.503s  user 0m0.204s  sys 0m1.396s
  #4: real 16m10.037s  user 0m0.168s  sys 0m1.448s

cluster-rm
  #1: real 1m50.717s   user 0m0.236s  sys 0m1.820s
  #2: real 1m44.803s   user 0m0.192s  sys 0m2.100s
  #3: real 2m6.250s    user 0m0.224s  sys 0m2.200s
  #4: real 2m6.367s    user
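The gist of the threaded approach: a pool of workers pulls directories off a shared queue; files are unlinked in parallel as each directory is scanned, subdirectories go back on the queue, and the emptied directories are removed deepest-first at the end. A minimal sketch (illustrative names and a fixed-size queue, not the actual cluster-rm code):

    /* cluster-rm sketch: threaded recursive delete.  Illustrative only:
     * fixed-size work list, no overflow or error handling, and it
     * assumes readdir's d_type is usable (a real tool would fall back
     * to lstat when d_type == DT_UNKNOWN). */
    #define _GNU_SOURCE
    #include <dirent.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NTHREADS 16
    #define MAXDIRS (1 << 20)

    static char *dirs[MAXDIRS];   /* every directory seen; doubles as the work queue */
    static size_t ndirs, next;    /* total queued / next one to hand out */
    static size_t active;         /* workers currently scanning a directory */
    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

    static void enqueue(char *path)
    {
        pthread_mutex_lock(&mu);
        dirs[ndirs++] = path;
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&mu);
    }

    /* Hand out the next unscanned directory; NULL means all work is done. */
    static char *dequeue(void)
    {
        char *d = NULL;
        pthread_mutex_lock(&mu);
        while (next == ndirs && active > 0)
            pthread_cond_wait(&cv, &mu);  /* empty, but more may still arrive */
        if (next < ndirs) {
            d = dirs[next++];
            active++;
        }
        pthread_mutex_unlock(&mu);
        return d;
    }

    static void *worker(void *arg)
    {
        char *path, *child;
        struct dirent *de;
        DIR *dp;
        (void)arg;
        while ((path = dequeue()) != NULL) {
            if ((dp = opendir(path)) != NULL) {
                while ((de = readdir(dp)) != NULL) {
                    if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
                        continue;
                    if (asprintf(&child, "%s/%s", path, de->d_name) < 0)
                        continue;
                    if (de->d_type == DT_DIR)
                        enqueue(child);   /* scan it later, in parallel */
                    else {
                        unlink(child);    /* files go right away */
                        free(child);
                    }
                }
                closedir(dp);
            }
            pthread_mutex_lock(&mu);
            active--;
            pthread_cond_broadcast(&cv);  /* may be the last one out */
            pthread_mutex_unlock(&mu);
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        pthread_t tid[NTHREADS];
        int i;
        if (argc != 2) {
            fprintf(stderr, "usage: %s directory\n", argv[0]);
            return 1;
        }
        enqueue(strdup(argv[1]));
        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        /* Parents were always queued before their children, so walking
         * the list backwards removes each directory after its contents. */
        for (i = (int)ndirs - 1; i >= 0; i--) {
            rmdir(dirs[i]);
            free(dirs[i]);
        }
        return 0;
    }

Build with gcc -O2 -pthread.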
Re: [Gluster-users] Horrendously slow directory access
Sure thing, #1086303: https://bugzilla.redhat.com/show_bug.cgi?id=1086303

On Thu, Apr 10, 2014 at 7:04 AM, John Mark Walker jowal...@redhat.com wrote:

Hi James,

This definitely looks worthy of investigation. Could you file a bug? We need to get our guys on this. Thanks for doing your homework. Send us the BZ #, and we'll start poking around.

-JM

----- Original Message -----

Hey Joe!

Yeah, we are all XFS all the time round here - none of that nasty ext4 combo that we know causes raised levels of mercury :-)

As for brick errors: we have not seen any, and we have been busy grepping and alerting on anything suspect in our logs. Mind you, there are hundreds of brick logs to search through, so I'm not going to say we may not have missed one, but after asking the boys in chat just now, they are pretty convinced that was not the smoking gun. I'm sure they will chip in on this thread if there is anything.

j.

--
dr. james cuff, assistant dean for research computing, harvard university | division of science | thirty eight oxford street, cambridge. ma. 02138 | +1 617 384 7647 | http://rc.fas.harvard.edu

On Wed, Apr 9, 2014 at 10:36 AM, Joe Julian j...@julianfamily.org wrote:

What's the backend filesystem? Were there any brick errors, probably around 2014-03-31 22:44:04 (half an hour before the frame timeout)?

On April 9, 2014 7:10:58 AM PDT, James Cuff james_c...@harvard.edu wrote:

Hi team,

I hate "me too" emails, sometimes they're not at all constructive, but I feel I really ought to chip in from real-world systems that we use in anger and at massive scale here.

So: we also use NFS to mask this and other performance issues. The cluster.readdir-optimize option gave us similar results, unfortunately. We reported our other challenge back last summer but stalled on this: http://www.gluster.org/pipermail/gluster-users/2013-June/036252.html

We also now see a new NFS phenotype, pasted below, which again is causing real heartburn. Small files are always difficult for any FS; it might be worth doing some regression testing with small-file directory scenarios - it's an easy reproducer on even moderately sized gluster clusters.

I hope some good progress can be made, and I understand it's a tough one to track down performance hangs and issues. I just wanted to say that we really do see them, and have tried many things to avoid them.

Here's the note from my team:

We were hitting 30-minute timeouts on getxattr/system.posix_acl_access calls on directories in an NFS v3 mount (with the acl option) of a 10-node, 40-brick gluster 3.4.0 volume. Strace shows where the client hangs:

$ strace -tt -T getfacl d6h_take1
...
18:43:57.929225 lstat("d6h_take1", {st_mode=S_IFDIR|0755, st_size=7024, ...}) = 0 <0.257107>
18:43:58.186461 getxattr("d6h_take1", "system.posix_acl_access", 0x7fffdf2b9f50, 132) = -1 ENODATA (No data available) <1806.296893>
19:14:04.483556 stat("d6h_take1", {st_mode=S_IFDIR|0755, st_size=7024, ...}) = 0 <0.642362>
19:14:05.126025 getxattr("d6h_take1", "system.posix_acl_default", 0x7fffdf2b9f50, 132) = -1 ENODATA (No data available) <0.24>
19:14:05.126114 stat("d6h_take1", {st_mode=S_IFDIR|0755, st_size=7024, ...}) = 0 <0.10>
...

Note the 1806-second getxattr. Load on the servers was moderate, and while the above was hanging, getfacl worked nearly instantaneously on that same directory on all bricks.
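If it helps anyone reproduce this, the measurement doesn't need strace; a tiny standalone probe that times the one call works, too. This is a sketch, not what we actually ran; the 132-byte buffer just mirrors the size in the trace above:

    /* Time one getxattr() on system.posix_acl_access -- the call an
     * ACL-enabled NFS client keeps issuing for directories. */
    #include <stdio.h>
    #include <sys/xattr.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        char buf[132];
        struct timespec t0, t1;
        ssize_t n;

        if (argc != 2) {
            fprintf(stderr, "usage: %s directory\n", argv[0]);
            return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t0);
        n = getxattr(argv[1], "system.posix_acl_access", buf, sizeof buf);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("getxattr returned %zd after %.6f s\n", n,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
    }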
When it finally hit the 30-minute timeout, gluster logged it in nfs.log:

[2014-03-31 23:14:04.481154] E [rpc-clnt.c:207:call_bail] 0-holyscratch-client-36: bailing out frame type(GlusterFS 3.3) op(GETXATTR(18)) xid = 0x8168809x sent = 2014-03-31 22:43:58.442411. timeout = 1800
[2014-03-31 23:14:04.481233] W [client-rpc-fops.c:1112:client3_3_getxattr_cbk] 0-holyscratch-client-36: remote operation failed: Transport endpoint is not connected. Path: <gfid:b116fb01-b13d-448a-90d0-a8693a98698b> (b116fb01-b13d-448a-90d0-a8693a98698b). Key: (null)

Other than that, we didn't see anything directly related in the nfs or brick logs, or anything out of sorts with the gluster services. A couple of other errors raise eyebrows, but these are for different directories (neighbors of the example above) and at different times:

holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:30:47.794454] I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:31:47.794447] I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:33:47.802135] I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1 overlaps=0
holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:34:47.802182]
Re: [Gluster-users] incomplete listing of a directory, sometimes getdents loops until out of memory
Thanks for the reply, Vijay. I set that parameter On, but it hasn't helped, and in fact things seem a bit worse. After making the change on the volume and dropping caches on some test clients, some now see zero subdirectories at all. In my tests before, clients went back to seeing all the subdirectories after dropping caches, and only after a while did they start disappearing (and the count had never gone to zero before).

Any other ideas?

Thanks,
John

On Fri, Jun 14, 2013 at 10:35 AM, Vijay Bellur vbel...@redhat.com wrote:

On 06/13/2013 03:38 PM, John Brunelle wrote:

Hello,

We're having an issue with our distributed gluster filesystem:

* gluster 3.3.1 servers and clients
* distributed volume -- 69 bricks (4.6T each) split evenly across 3 nodes
* xfs backend
* nfs clients
* nfs.enable-ino32: On
* servers: CentOS 6.3, 2.6.32-279.14.1.el6.centos.plus.x86_64
* clients: CentOS 5.7, 2.6.18-274.12.1.el5

We have a directory containing 3,343 subdirectories. On some clients, ls lists only a subset of the directories (a different number on different clients). On others, ls gets stuck in a getdents loop and consumes more and more memory until it hits ENOMEM. On yet others, it works fine. Having the bad clients remount or drop caches makes the problem go away temporarily, but eventually it comes back.

The issue sounds a lot like bug #838784, but we are using xfs on the backend, and this seems like more of a client issue.

Turning on cluster.readdir-optimize can help readdir performance when a directory contains many subdirectories and the volume has many bricks. Do you observe any change with this option enabled?

-Vijay
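For anyone following along, the option is toggled per volume with the usual volume-set syntax (and turned back off the same way), substituting your volume name for <VOLNAME>:

    gluster volume set <VOLNAME> cluster.readdir-optimize on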
Re: [Gluster-users] incomplete listing of a directory, sometimes getdents loops until out of memory
Ah, I did not know that about 0x7fffffff. Is it of note that the clients do *not* get this? This is on an NFS mount, and the volume has nfs.enable-ino32 On. (I should've pointed that out again when Jeff mentioned FUSE.)

Side note -- we do have a couple of FUSE mounts, too, and I had not seen this issue on any of them before, but when I checked now, zero subdirectories were listed on some. Since I had only ever seen this on NFS clients until setting cluster.readdir-optimize On, I have now set that back Off, and the FUSE mounts are behaving fine again.

Thanks,
John

On Fri, Jun 14, 2013 at 2:17 PM, Anand Avati anand.av...@gmail.com wrote:

Are the ls commands (which list partially, or loop and die of ENOMEM eventually) executed on an NFS mount or a FUSE mount? Or does it happen on both?

Avati

On Fri, Jun 14, 2013 at 11:14 AM, Anand Avati anand.av...@gmail.com wrote:

On Fri, Jun 14, 2013 at 10:04 AM, John Brunelle john_brune...@harvard.edu wrote:

Thanks, Jeff! I ran readdir.c on all 23 bricks on the gluster NFS server to which my test clients are connected (one client that's working, and one that's not; and I ran it on those clients, too). The results are attached. The values it prints are all well within 32 bits, *except* for one that's suspiciously the max 32-bit signed int:

$ cat readdir.out.* | awk '{print $1}' | sort | uniq | tail
0xfd59
0xfd6b
0xfd7d
0xfd8f
0xfda1
0xfdb3
0xfdc5
0xfdd7
0xfde8
0x7fffffff

That outlier is the same subdirectory on all 23 bricks. Could this be the issue?

Thanks,
John

0x7fffffff is the EOF marker. You should find it as the last entry in _every_ directory.

Avati
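For anyone curious, a readdir.c along these lines is essentially a d_off dumper. Here is a minimal reconstruction from the description in this thread (Jeff's actual program may differ):

    /* Print the d_off cookie of every entry readdir() returns for one
     * directory, first column in hex so it pipes into the awk/sort/uniq
     * command shown above. */
    #define _GNU_SOURCE
    #include <dirent.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        DIR *dp = opendir(argc > 1 ? argv[1] : ".");
        struct dirent *de;

        if (dp == NULL) {
            perror("opendir");
            return 1;
        }
        while ((de = readdir(dp)) != NULL)
            printf("0x%llx %s\n", (unsigned long long)de->d_off, de->d_name);
        closedir(dp);
        return 0;
    }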
[Gluster-users] incomplete listing of a directory, sometimes getdents loops until out of memory
Hello,

We're having an issue with our distributed gluster filesystem:

* gluster 3.3.1 servers and clients
* distributed volume -- 69 bricks (4.6T each) split evenly across 3 nodes
* xfs backend
* nfs clients
* nfs.enable-ino32: On
* servers: CentOS 6.3, 2.6.32-279.14.1.el6.centos.plus.x86_64
* clients: CentOS 5.7, 2.6.18-274.12.1.el5

We have a directory containing 3,343 subdirectories. On some clients, ls lists only a subset of the directories (a different number on different clients). On others, ls gets stuck in a getdents loop and consumes more and more memory until it hits ENOMEM. On yet others, it works fine. Having the bad clients remount or drop caches makes the problem go away temporarily, but eventually it comes back.

The issue sounds a lot like bug #838784, but we are using xfs on the backend, and this seems like more of a client issue.

But we are also getting some page allocation failures on the server side, e.g. the stack trace below. These are nearly identical to bug #842206 and bug #767127, and I'm trying to sort out whether they are related to the above issue or are just recoverable NIC-driver GFP_ATOMIC kmalloc failures, as suggested in the comments. Slab allocations for dentry, xfs_inode, fuse_inode, fuse_request, etc. are all at ~100% active, and the total number appears to be monotonically growing. Overall memory looks healthy (2/3 is buffers/cache, almost no swap is used). I'd need some help to determine whether memory is overly fragmented, but looking at pagetypeinfo and zoneinfo it doesn't appear so to me, and the failures are order:1 anyway.

Any suggestions for what might be the problem here?

Thanks,
John

Jun 13 09:41:18 myhost kernel: glusterfsd: page allocation failure. order:1, mode:0x20
Jun 13 09:41:18 myhost kernel: Pid: 20498, comm: glusterfsd Not tainted 2.6.32-279.14.1.el6.centos.plus.x86_64 #1
Jun 13 09:41:18 myhost kernel: Call Trace:
Jun 13 09:41:18 myhost kernel: <IRQ> [8112790f] ? __alloc_pages_nodemask+0x77f/0x940
Jun 13 09:41:18 myhost kernel: [81162382] ? kmem_getpages+0x62/0x170
Jun 13 09:41:18 myhost kernel: [81162f9a] ? fallback_alloc+0x1ba/0x270
Jun 13 09:41:18 myhost kernel: [811629ef] ? cache_grow+0x2cf/0x320
Jun 13 09:41:18 myhost kernel: [81162d19] ? cache_alloc_node+0x99/0x160
Jun 13 09:41:18 myhost kernel: [81163afb] ? kmem_cache_alloc+0x11b/0x190
Jun 13 09:41:18 myhost kernel: [81435298] ? sk_prot_alloc+0x48/0x1c0
Jun 13 09:41:18 myhost kernel: [81435562] ? sk_clone+0x22/0x2e0
Jun 13 09:41:18 myhost kernel: [814833a6] ? inet_csk_clone+0x16/0xd0
Jun 13 09:41:18 myhost kernel: [8149c383] ? tcp_create_openreq_child+0x23/0x450
Jun 13 09:41:18 myhost kernel: [81499bed] ? tcp_v4_syn_recv_sock+0x4d/0x310
Jun 13 09:41:18 myhost kernel: [8149c126] ? tcp_check_req+0x226/0x460
Jun 13 09:41:18 myhost kernel: [81437087] ? __kfree_skb+0x47/0xa0
Jun 13 09:41:18 myhost kernel: [8149960b] ? tcp_v4_do_rcv+0x35b/0x430
Jun 13 09:41:18 myhost kernel: [8149ae4e] ? tcp_v4_rcv+0x4fe/0x8d0
Jun 13 09:41:18 myhost kernel: [81432f6c] ? sk_reset_timer+0x1c/0x30
Jun 13 09:41:18 myhost kernel: [81478add] ? ip_local_deliver_finish+0xdd/0x2d0
Jun 13 09:41:18 myhost kernel: [81478d68] ? ip_local_deliver+0x98/0xa0
Jun 13 09:41:18 myhost kernel: [8147822d] ? ip_rcv_finish+0x12d/0x440
Jun 13 09:41:18 myhost kernel: [814787b5] ? ip_rcv+0x275/0x350
Jun 13 09:41:18 myhost kernel: [81441deb] ? __netif_receive_skb+0x49b/0x6f0
Jun 13 09:41:18 myhost kernel: [8149813a] ? tcp4_gro_receive+0x5a/0xd0
Jun 13 09:41:18 myhost kernel: [81444068] ? netif_receive_skb+0x58/0x60
Jun 13 09:41:18 myhost kernel: [81444170] ? napi_skb_finish+0x50/0x70
Jun 13 09:41:18 myhost kernel: [814466a9] ? napi_gro_receive+0x39/0x50
Jun 13 09:41:18 myhost kernel: [a01303b4] ? igb_poll+0x864/0xb00 [igb]
Jun 13 09:41:18 myhost kernel: [810606ec] ? rebalance_domains+0x3cc/0x5a0
Jun 13 09:41:18 myhost kernel: [814467c3] ? net_rx_action+0x103/0x2f0
Jun 13 09:41:18 myhost kernel: [81096523] ? hrtimer_get_next_event+0xc3/0x100
Jun 13 09:41:18 myhost kernel: [81073f61] ? __do_softirq+0xc1/0x1e0
Jun 13 09:41:18 myhost kernel: [810dbb70] ? handle_IRQ_event+0x60/0x170
Jun 13 09:41:18 myhost kernel: [8100c24c] ? call_softirq+0x1c/0x30
Jun 13 09:41:18 myhost kernel: [8100de85] ? do_softirq+0x65/0xa0
Jun 13 09:41:18 myhost kernel: [81073d45] ? irq_exit+0x85/0x90
Jun 13 09:41:18 myhost kernel: [8150d505] ? do_IRQ+0x75/0xf0
Jun 13 09:41:18 myhost kernel: [8100ba53] ? ret_from_intr+0x0/0x11
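For anyone who wants to watch the getdents behavior described above in isolation (without ls's own memory use in the way), a raw getdents64 loop is easy to write. A minimal sketch:

    /* Read one directory with raw getdents64 calls and count them.  On
     * a healthy mount the loop terminates; a looping client sees the
     * call count grow without bound. */
    #define _GNU_SOURCE
    #include <dirent.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char buf[32768];
        long ncalls = 0, nents = 0, n, off;
        int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        while ((n = syscall(SYS_getdents64, fd, buf, sizeof buf)) > 0) {
            ncalls++;
            for (off = 0; off < n; ) {          /* walk the packed records */
                struct dirent64 *de = (struct dirent64 *)(buf + off);
                nents++;
                off += de->d_reclen;
            }
            fprintf(stderr, "call %ld: %ld entries so far\n", ncalls, nents);
        }
        close(fd);
        printf("%ld getdents64 calls, %ld entries\n", ncalls, nents);
        return n < 0;
    }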