Re: [Gluster-users] Would there be a use for cluster-specific filesystem tools?

2014-04-16 Thread John Brunelle
This is great stuff!  I think there is a huge need for this.  It's amazing
how much faster basic coreutils operations can be, even when just using one
client as you show.

I've had a lot of success using dftw[1,2] to speed up these recursive
operations on distributed filesystems (conditional finds, lustre OST
retirement, etc).  It's MPI-based (hybrid, actually) and many such tasks
really scale well, and keep scaling when you spread the work over multiple
client nodes (especially with gluster it seems).  It's really fantastic.
 I've done things like combine it with mrmpi[3] to create a general
mapreduce[4] for these situations.

Take for example du: standard /usr/bin/du (or just a find that prints size)
on our 10-node distributed gluster filesystem takes well over a couple of
hours for trees with tens of thousands of files.  We've cut that down over
an order of magnitude, to about 10 minutes, with a simple parallel du
calculation using the above[5] (running with 16 procs across 4 nodes, ~85%
strong scaling efficiency).
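
To make that concrete, here is a minimal sketch of the parallel-du idea (this
is not the fsmr example linked at [5], and it omits libdftw's dynamic load
balancing entirely): each MPI rank walks a static share of the top-level
entries with nftw(3) and the per-rank byte counts are summed onto rank 0.

/* parallel_du.c -- illustrative sketch only; the static partitioning here is
 * an assumption for brevity, not how libdftw/fsmr actually distribute work */
#define _XOPEN_SOURCE 500
#include <dirent.h>
#include <ftw.h>
#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static long long local_bytes = 0;

static int add_size(const char *path, const struct stat *sb,
                    int type, struct FTW *ftwbuf)
{
    (void)path; (void)ftwbuf;
    if (type == FTW_F)                  /* count regular files only */
        local_bytes += sb->st_size;
    return 0;                           /* keep walking */
}

int main(int argc, char **argv)
{
    int rank, size;
    long long total = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const char *root = (argc > 1) ? argv[1] : ".";

    /* naive static partition: rank r takes every size-th top-level entry;
     * libdftw instead balances the walk dynamically, which matters a lot
     * when subtrees are badly skewed */
    DIR *d = opendir(root);
    if (d) {
        struct dirent *de;
        int i = 0;
        while ((de = readdir(d)) != NULL) {
            if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
                continue;
            if (i++ % size != rank)
                continue;
            char sub[4096];
            snprintf(sub, sizeof(sub), "%s/%s", root, de->d_name);
            nftw(sub, add_size, 64, FTW_PHYS);   /* don't follow symlinks */
        }
        closedir(d);
    }

    MPI_Reduce(&local_bytes, &total, 1, MPI_LONG_LONG, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("%lld bytes under %s\n", total, root);
    MPI_Finalize();
    return 0;
}

Run it like any MPI program (e.g. mpirun -np 16 spread over the client
nodes); the whole point is simply that the tree walk and the stat() calls
happen concurrently instead of one file at a time.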

Like you, I have hopes of making a package of such utilities.  Probably
your threaded model will make this much more approachable, though, and
elastic, too.  I'll happily try out your tools when you're ready to post
them, and I bet a lot of others will, too.

Best,

John

[1] https://github.com/hpc/libdftw
[2] http://dl.acm.org/citation.cfm?id=2389114
[3] http://mapreduce.sandia.gov/
[4] https://github.com/jabrcx/fsmr
[5]
https://github.com/jabrcx/fsmr/blob/master/examples/fsmr.du_by_owner/example.c


On Wed, Apr 16, 2014 at 10:31 AM, Joe Julian j...@julianfamily.org wrote:

 Excellent! I've been toying with the same concept in the back of my mind
 for a long while now. I'm sure there is an unrealized desire for such tools.

 When you're ready, please put such a toolset on forge.gluster.org.


 On April 16, 2014 6:50:48 AM PDT, Michael Peek p...@nimbios.org wrote:

 Hi guys,

 (I'm new to this, so pardon me if my shenanigans turn out to be a waste
 of your time.)

 I have been experimenting with Gluster by copying and deleting large
 numbers of files of all sizes.  What I found was that when deleting a large
 number of small files, the deletion process takes a good chunk of my time --
 in some cases a significant percentage of the time it took to copy the files
 to the cluster in the first place.  I'm guessing the reason is that find and
 rm -fr process files serially and have to wait on packets traveling back and
 forth over the network for every file.  But with a clustered filesystem, the
 bottleneck is that you are processing files serially and waiting on network
 packets when you don't have to.

 So I decided to try an experiment.  Instead of using /bin/rm to delete
 files serially, I wrote my own quick-and-dirty recursive rm (and recursive
 ls) that uses pthreads (listed as cluster-rm and cluster-ls in the
 table below):
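
 (The cluster-rm/cluster-ls tools themselves aren't posted yet; purely to
 illustrate the approach described above, a stripped-down threaded recursive
 delete might look like the sketch below: one pthread per top-level entry,
 each doing a post-order nftw() walk that remove()s everything underneath.
 A real tool would use a bounded work queue instead of a thread per entry,
 but even this naive version keeps many unlink()s in flight at once, which
 is where the win over a serial rm comes from on a networked filesystem.)

 /* threaded_rm.c -- illustrative sketch only, not the cluster-rm tool */
 #define _XOPEN_SOURCE 500
 #include <dirent.h>
 #include <ftw.h>
 #include <pthread.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>

 static int rm_one(const char *path, const struct stat *sb,
                   int type, struct FTW *ftwbuf)
 {
     (void)sb; (void)type; (void)ftwbuf;
     if (remove(path) != 0)            /* handles files and empty dirs */
         perror(path);
     return 0;                         /* keep going regardless */
 }

 static void *rm_tree(void *arg)
 {
     char *path = arg;
     /* FTW_DEPTH: children are visited before their parent, so the
      * final rmdir of each directory succeeds */
     nftw(path, rm_one, 64, FTW_DEPTH | FTW_PHYS);
     free(path);
     return NULL;
 }

 int main(int argc, char **argv)
 {
     if (argc != 2) {
         fprintf(stderr, "usage: %s DIR\n", argv[0]);
         return 1;
     }
     DIR *d = opendir(argv[1]);
     if (!d) { perror(argv[1]); return 1; }

     pthread_t tids[256];              /* crude cap on concurrency */
     int n = 0;
     struct dirent *de;
     while ((de = readdir(d)) != NULL && n < 256) {
         if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
             continue;
         char *sub = malloc(strlen(argv[1]) + strlen(de->d_name) + 2);
         sprintf(sub, "%s/%s", argv[1], de->d_name);
         pthread_create(&tids[n++], NULL, rm_tree, sub);
     }
     closedir(d);

     for (int i = 0; i < n; i++)
         pthread_join(tids[i], NULL);

     rmdir(argv[1]);                   /* remove the now-empty top directory */
     return 0;
 }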

 Methods:

 1) This was done on a Linux system.  I suspect that Linux (or any modern
 OS) caches filesystem information.  For example, after setting up a
 directory, rm -fr on that directory completes faster if I first run find on
 the same directory.  To avoid this caching effect, each command was run on
 its own test directory (i.e. find was never run on the same directory as
 rm -fr or cluster-rm), which kept the run times consistent between runs.

 2) Each test directory contained the exact same data for each of the four
 commands tested (find, cluster-ls, rm, cluster-rm) for each test run.

 3) All commands were run on a client machine and not one of the cluster
 nodes.

 Results:

 Data Size: 49GB (identical tree for every test)

 Command       Test #1           Test #2           Test #3           Test #4
 -----------   ---------------   ---------------   ---------------   ---------------
 find -print   real 6m45.066s    real 6m18.524s    real 5m45.301s    real 5m58.577s
               user 0m0.172s     user 0m0.140s     user 0m0.156s     user 0m0.132s
               sys  0m0.748s     sys  0m0.508s     sys  0m0.484s     sys  0m0.480s

 cluster-ls    real 2m32.770s    real 2m21.376s    real 2m40.511s    real 2m36.202s
               user 0m0.208s     user 0m0.164s     user 0m0.184s     user 0m0.172s
               sys  0m1.876s     sys  0m1.568s     sys  0m1.488s     sys  0m1.412s

 rm -fr        real 16m36.264s   real 16m16.795s   real 15m54.503s   real 16m10.037s
               user 0m0.232s     user 0m0.248s     user 0m0.204s     user 0m0.168s
               sys  0m1.724s     sys  0m1.528s     sys  0m1.396s     sys  0m1.448s

 cluster-rm    real 1m50.717s    real 1m44.803s    real 2m6.250s     real 2m6.367s
               user 0m0.236s     user 0m0.192s     user 0m0.224s     user [truncated]
               sys  0m1.820s     sys  0m2.100s     sys  0m2.200s

Re: [Gluster-users] Horrendously slow directory access

2014-04-10 Thread John Brunelle
Sure thing, #1086303:

https://bugzilla.redhat.com/show_bug.cgi?id=1086303

On Thu, Apr 10, 2014 at 7:04 AM, John Mark Walker jowal...@redhat.com wrote:
 Hi James,

 This definitely looks worthy of investigation. Could you file a bug? We need 
 to get our guys on this.

 Thanks for doing your homework. Send us the BZ #, and we'll start poking 
 around.

 -JM


 - Original Message -
 Hey Joe!

 Yeah we are all XFS all the time round here - none of that nasty ext4
 combo that we know causes raised levels of mercury :-)

 As for the brick errors: we have not seen any, and we have been busy
 grepping and alerting on anything suspect in our logs.  Mind you, there are
 hundreds of brick logs to search through, so I won't claim we haven't missed
 one, but after asking the boys in chat just now they are pretty convinced
 that was not the smoking gun.  I'm sure they will chip in on this thread if
 there is anything.


 j.

 --
 dr. james cuff, assistant dean for research computing, harvard
 university | division of science | thirty eight oxford street,
 cambridge. ma. 02138 | +1 617 384 7647 | http://rc.fas.harvard.edu


 On Wed, Apr 9, 2014 at 10:36 AM, Joe Julian j...@julianfamily.org wrote:
  What's the backend filesystem?
  Were there any brick errors, probably around 2014-03-31 22:44:04 (half an
  hour before the frame timeout)?
 
 
  On April 9, 2014 7:10:58 AM PDT, James Cuff james_c...@harvard.edu wrote:
 
  Hi team,
 
  I hate "me too" emails, which are sometimes not at all constructive, but I
  feel I really ought to chip in about real-world systems that we use in
  anger and at massive scale here.
 
  So we also use NFS to mask this and other performance issues.  The
  cluster.readdir-optimize gave us similar results unfortunately.
 
  We reported our other challenge back last summer but we stalled on this:
 
  http://www.gluster.org/pipermail/gluster-users/2013-June/036252.html
 
  We also unfortunately now see a new NFS phenotype, pasted below, which is
  again causing real heartburn.
 
  Small files are always difficult for any FS; it might be worth doing some
  regression testing with small-file directory scenarios, since they are an
  easy reproducer on even moderately sized gluster clusters.  I hope some
  good progress can be made, and I understand that performance hangs and
  issues like this are tough to track down.  I just wanted to say that we
  really do see them, and have tried many things to avoid them.
 
  Here's the note from my team:
 
  We were hitting 30 minute timeouts on getxattr/system.posix_acl_access
  calls on directories in a NFS v3 mount (w/ acl option) of a 10-node
  40-brick gluster 3.4.0 volume.  Strace shows where the client hangs:
 
  $ strace -tt -T getfacl d6h_take1
  ...
  18:43:57.929225 lstat("d6h_take1", {st_mode=S_IFDIR|0755,
  st_size=7024, ...}) = 0 0.257107
  18:43:58.186461 getxattr("d6h_take1", "system.posix_acl_access",
  0x7fffdf2b9f50, 132) = -1 ENODATA (No data available) 1806.296893
  19:14:04.483556 stat("d6h_take1", {st_mode=S_IFDIR|0755, st_size=7024,
  ...}) = 0 0.642362
  19:14:05.126025 getxattr("d6h_take1", "system.posix_acl_default",
  0x7fffdf2b9f50, 132) = -1 ENODATA (No data
  available) 0.24
  19:14:05.126114 stat("d6h_take1", {st_mode=S_IFDIR|0755, st_size=7024,
  ...}) = 0 0.10
  ...
 
  Load on the servers was moderate.  While the above was hanging,
  getfacl worked nearly instantaneously on that directory on all bricks.
   When it finally hit the 30 minute timeout, gluster logged it in
  nfs.log:
 
  [2014-03-31 23:14:04.481154] E [rpc-clnt.c:207:call_bail]
  0-holyscratch-client-36: bailing out frame type(GlusterFS 3.3)
  op(GETXATTR(18)) xid = 0x8168809x sent = 2014-03-31 22:43:58.442411.
  timeout = 1800
  [2014-03-31 23:14:04.481233] W
  [client-rpc-fops.c:1112:client3_3_getxattr_cbk]
  0-holyscratch-client-36: remote operation failed: Transport endpoint
  is not connected. Path: gfid:b116fb01-b13d-448a-90d0-a8693a98698b
  (b116fb01-b13d-448a-90d0-a8693a98698b). Key: (null)
 
  Other than that, we didn't see anything directly related in the nfs or
  brick logs or anything out of sorts with the gluster services.  A
  couple other errors raise eyebrows, but these are different
  directories (neighbors of the example above) and at different times:
 
  holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:30:47.794454]
  I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found
  anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1
  overlaps=0
  holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:31:47.794447]
  I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found
  anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1
  overlaps=0
  holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:33:47.802135]
  I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found
  anomalies in /ramanathan_lab/dhuh/d9_take2_BGI/Diffreg. holes=1
  overlaps=0
  holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:34:47.802182]

Re: [Gluster-users] incomplete listing of a directory, sometimes getdents loops until out of memory

2013-06-14 Thread John Brunelle
Thanks for the reply, Vijay.  I set that parameter to On, but it hasn't
helped; in fact it seems a bit worse.  After making the change on the
volume and dropping caches on some test clients, some clients now see no
subdirectories at all.  In my tests before, clients went back to seeing all
the subdirectories after dropping caches, and they only started disappearing
again after a while (and the listing had never gone to zero before).

Any other ideas?

Thanks,

John

On Fri, Jun 14, 2013 at 10:35 AM, Vijay Bellur vbel...@redhat.com wrote:
 On 06/13/2013 03:38 PM, John Brunelle wrote:

 Hello,

 We're having an issue with our distributed gluster filesystem:

 * gluster 3.3.1 servers and clients
 * distributed volume -- 69 bricks (4.6T each) split evenly across 3 nodes
 * xfs backend
 * nfs clients
 * nfs.enable-ino32: On

 * servers: CentOS 6.3, 2.6.32-279.14.1.el6.centos.plus.x86_64
 * clients: CentOS 5.7, 2.6.18-274.12.1.el5

 We have a directory containing 3,343 subdirectories.  On some clients,
 ls lists only a subset of the directories (a different amount on
 different clients).  On others, ls gets stuck in a getdents loop and
 consumes more and more memory until it hits ENOMEM.  On yet others, it
 works fine.  Having the bad clients remount or drop caches makes the
 problem temporarily go away, but eventually it comes back.  The issue
 sounds a lot like bug #838784, but we are using xfs on the backend,
 and this seems like more of a client issue.


 Turning on cluster.readdir-optimize can help readdir when a directory
 contains a large number of sub-directories and there are many bricks in
 the volume. Do you observe any change with this option enabled?

 -Vijay


___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] incomplete listing of a directory, sometimes getdents loops until out of memory

2013-06-14 Thread John Brunelle
Ah, I did not know that about 0x7fffffff.  Is it of note that the
clients do *not* get this?

This is on an NFS mount, and the volume has nfs.enable-ino32 On.  (I
should've pointed that out again when Jeff mentioned FUSE.)

Side note -- we do have a couple of FUSE mounts, too, and I had not seen
this issue on any of them before, but when I checked just now, some of them
listed zero subdirectories.  Since that only appeared after I set
cluster.readdir-optimize to On (and I had only ever seen the original issue
on NFS clients), I have now set it back to Off.  The FUSE mounts are
behaving fine again.

Thanks,

John

On Fri, Jun 14, 2013 at 2:17 PM, Anand Avati anand.av...@gmail.com wrote:
 Are the ls commands (which list partially, or loop and die of ENOMEM
 eventually) executed on an NFS mount or FUSE mount? Or does it happen on
 both?

 Avati


 On Fri, Jun 14, 2013 at 11:14 AM, Anand Avati anand.av...@gmail.com wrote:




 On Fri, Jun 14, 2013 at 10:04 AM, John Brunelle
 john_brune...@harvard.edu wrote:

 Thanks, Jeff!  I ran readdir.c on all 23 bricks on the gluster nfs
 server to which my test clients are connected (one client that's
 working, and one that's not; and I ran on those, too).  The results
 are attached.

 The values it prints are all well within 32 bits, *except* for one
 that's suspiciously the max 32-bit signed int:

 $ cat readdir.out.* | awk '{print $1}' | sort | uniq | tail
 0xfd59
 0xfd6b
 0xfd7d
 0xfd8f
 0xfda1
 0xfdb3
 0xfdc5
 0xfdd7
 0xfde8
 0x7fffffff

 That outlier is the same subdirectory on all 23 bricks.  Could this be
 the issue?
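
 (Jeff's readdir.c isn't included in this archive; as a rough stand-in, a
 minimal program that prints the same kind of per-entry offsets would just
 dump each dirent's d_off in hex next to the name, something like:)

 /* readdir_offsets.c -- illustrative sketch, not the readdir.c used above */
 #include <dirent.h>
 #include <stdio.h>

 int main(int argc, char **argv)
 {
     DIR *d = opendir(argc > 1 ? argv[1] : ".");
     if (!d) { perror("opendir"); return 1; }

     struct dirent *de;
     while ((de = readdir(d)) != NULL)
         /* d_off is the directory offset cookie that the next
          * readdir/getdents resumes from; these are the values being
          * checked against 32 bits above */
         printf("0x%08llx  %s\n", (unsigned long long)de->d_off, de->d_name);

     closedir(d);
     return 0;
 }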

 Thanks,

 John



  0x7fffffff is the EOF marker. You should find that as the last entry in
  _every_ directory.

 Avati


___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


[Gluster-users] incomplete listing of a directory, sometimes getdents loops until out of memory

2013-06-13 Thread John Brunelle
Hello,

We're having an issue with our distributed gluster filesystem:

* gluster 3.3.1 servers and clients
* distributed volume -- 69 bricks (4.6T each) split evenly across 3 nodes
* xfs backend
* nfs clients
* nfs.enable-ino32: On

* servers: CentOS 6.3, 2.6.32-279.14.1.el6.centos.plus.x86_64
* clients: CentOS 5.7, 2.6.18-274.12.1.el5

We have a directory containing 3,343 subdirectories.  On some clients,
ls lists only a subset of the directories (a different amount on
different clients).  On others, ls gets stuck in a getdents loop and
consumes more and more memory until it hits ENOMEM.  On yet others, it
works fine.  Having the bad clients remount or drop caches makes the
problem temporarily go away, but eventually it comes back.  The issue
sounds a lot like bug #838784, but we are using xfs on the backend,
and this seems like more of a client issue.

But we are also getting some page allocation failures on the server
side, e.g. the stack trace below.  These are nearly identical to bug
#842206 and bug #767127.  I'm trying to sort out whether these are related
to the above issue or are just recoverable nic driver GFP_ATOMIC kmalloc
failures as suggested in the comments.  Slab allocations for dentry,
xfs_inode, fuse_inode, fuse_request, etc. are all at ~100% active, and
the total number appears to be monotonically growing.  Overall memory
looks healthy (2/3 is buffers/cache, almost no swap is used).  I'd
need some help to determine whether the memory is overly fragmented,
but looking at pagetypeinfo and zoneinfo it doesn't appear so to me,
and the failures are order:1 anyway.

Any suggestions for what might be the problem here?

Thanks,

John

Jun 13 09:41:18 myhost kernel: glusterfsd: page allocation failure.
order:1, mode:0x20
Jun 13 09:41:18 myhost kernel: Pid: 20498, comm: glusterfsd Not
tainted 2.6.32-279.14.1.el6.centos.plus.x86_64 #1
Jun 13 09:41:18 myhost kernel: Call Trace:
Jun 13 09:41:18 myhost kernel: IRQ  [8112790f] ?
__alloc_pages_nodemask+0x77f/0x940
Jun 13 09:41:18 myhost kernel: [81162382] ? kmem_getpages+0x62/0x170
Jun 13 09:41:18 myhost kernel: [81162f9a] ? fallback_alloc+0x1ba/0x270
Jun 13 09:41:18 myhost kernel: [811629ef] ? cache_grow+0x2cf/0x320
Jun 13 09:41:18 myhost kernel: [81162d19] ?
cache_alloc_node+0x99/0x160
Jun 13 09:41:18 myhost kernel: [81163afb] ?
kmem_cache_alloc+0x11b/0x190
Jun 13 09:41:18 myhost kernel: [81435298] ? sk_prot_alloc+0x48/0x1c0
Jun 13 09:41:18 myhost kernel: [81435562] ? sk_clone+0x22/0x2e0
Jun 13 09:41:18 myhost kernel: [814833a6] ? inet_csk_clone+0x16/0xd0
Jun 13 09:41:18 myhost kernel: [8149c383] ?
tcp_create_openreq_child+0x23/0x450
Jun 13 09:41:18 myhost kernel: [81499bed] ?
tcp_v4_syn_recv_sock+0x4d/0x310
Jun 13 09:41:18 myhost kernel: [8149c126] ? tcp_check_req+0x226/0x460
Jun 13 09:41:18 myhost kernel: [81437087] ? __kfree_skb+0x47/0xa0
Jun 13 09:41:18 myhost kernel: [8149960b] ? tcp_v4_do_rcv+0x35b/0x430
Jun 13 09:41:18 myhost kernel: [8149ae4e] ? tcp_v4_rcv+0x4fe/0x8d0
Jun 13 09:41:18 myhost kernel: [81432f6c] ? sk_reset_timer+0x1c/0x30
Jun 13 09:41:18 myhost kernel: [81478add] ?
ip_local_deliver_finish+0xdd/0x2d0
Jun 13 09:41:18 myhost kernel: [81478d68] ? ip_local_deliver+0x98/0xa0
Jun 13 09:41:18 myhost kernel: [8147822d] ? ip_rcv_finish+0x12d/0x440
Jun 13 09:41:18 myhost kernel: [814787b5] ? ip_rcv+0x275/0x350
Jun 13 09:41:18 myhost kernel: [81441deb] ?
__netif_receive_skb+0x49b/0x6f0
Jun 13 09:41:18 myhost kernel: [8149813a] ? tcp4_gro_receive+0x5a/0xd0
Jun 13 09:41:18 myhost kernel: [81444068] ?
netif_receive_skb+0x58/0x60
Jun 13 09:41:18 myhost kernel: [81444170] ? napi_skb_finish+0x50/0x70
Jun 13 09:41:18 myhost kernel: [814466a9] ? napi_gro_receive+0x39/0x50
Jun 13 09:41:18 myhost kernel: [a01303b4] ? igb_poll+0x864/0xb00 [igb]
Jun 13 09:41:18 myhost kernel: [810606ec] ?
rebalance_domains+0x3cc/0x5a0
Jun 13 09:41:18 myhost kernel: [814467c3] ? net_rx_action+0x103/0x2f0
Jun 13 09:41:18 myhost kernel: [81096523] ?
hrtimer_get_next_event+0xc3/0x100
Jun 13 09:41:18 myhost kernel: [81073f61] ? __do_softirq+0xc1/0x1e0
Jun 13 09:41:18 myhost kernel: [810dbb70] ?
handle_IRQ_event+0x60/0x170
Jun 13 09:41:18 myhost kernel: [8100c24c] ? call_softirq+0x1c/0x30
Jun 13 09:41:18 myhost kernel: [8100de85] ? do_softirq+0x65/0xa0
Jun 13 09:41:18 myhost kernel: [81073d45] ? irq_exit+0x85/0x90
Jun 13 09:41:18 myhost kernel: [8150d505] ? do_IRQ+0x75/0xf0
Jun 13 09:41:18 myhost kernel: [8100ba53] ? ret_from_intr+0x0/0x11
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users