Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
Hi Brian,

I'm just wondering if you had any luck with figuring out the performance limitations of your setup. I'm testing a similar configuration, so any tips or recommendations would be much appreciated.

Thanks,
--Alex
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
Final point. I tried remounting the volume using an undocumented setting I saw in another posting:

    mount -o direct-io-mode=enable -t glusterfs dev-storage1:/single1 /gluster/single1

But with that, and KVM also using cache=none, the VM simply hung on startup. This looks like a bug to me.

With this same mount I was able to restart the VM without cache=none, but then performance was terrible:

    ubuntu@lucidtest:~$ dd if=/dev/zero of=/var/tmp/test.zeros6 bs=1024k count=500
    500+0 records in
    500+0 records out
    524288000 bytes (524 MB) copied, 122.493 s, 4.3 MB/s

Regards,

Brian.
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
On Sat, Jun 09, 2012 at 09:53:05AM +0100, Brian Candler wrote:
> So clearly cache='none' (O_DIRECT) makes a big difference when using a
> local filesystem, so I'd very much like to be able to test it with gluster.

Aha, O_DIRECT support for FUSE is in kernels 3.4+:

http://comments.gmane.org/gmane.comp.file-systems.gluster.user/8916
http://lwn.net/Articles/476978/

So I upgraded this box to a mainline 3.4.0 kernel from
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4-precise/

After this:

* KVM *does* boot with the cache='none' option :-)
* However, performance is pretty much unchanged :-(

    ubuntu@lucidtest:~$ dd if=/dev/zero of=/var/tmp/test.zeros5 bs=1024k count=500
    500+0 records in
    500+0 records out
    524288000 bytes (524 MB) copied, 10.184 s, 51.5 MB/s

(As a reminder: that's KVM talking to a single-brick gluster volume, FUSE-mounted on the same node. Other tests showed 248MB/s with the KVM guest talking to the RAID10 array directly, and 350MB/s with the host talking to the RAID10 array directly.)

Regards,

Brian.
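For anyone else checking the same thing: whether the FUSE mount now accepts O_DIRECT opens can be tested without involving KVM at all, since dd can request O_DIRECT itself. A minimal sketch, reusing the /gluster/single1 mount point from the post above:

    # dd opens its output file with O_DIRECT when oflag=direct is given;
    # on a pre-3.4 kernel this fails on a FUSE mount with "Invalid argument".
    dd if=/dev/zero of=/gluster/single1/odirect-test bs=1M count=100 oflag=direct

    # Same check on the read side.
    dd if=/gluster/single1/odirect-test of=/dev/null bs=1M iflag=direct

    rm /gluster/single1/odirect-test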
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
On Fri, Jun 08, 2012 at 09:30:19PM +0100, Brian Candler wrote:
> ubuntu@lucidtest:~$ dd if=/dev/zero of=/var/tmp/test.zeros2 bs=1024k count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 14.5182 s, 7.2 MB/s
>
> And this is after live-migrating the VM to dev-storage2:
>
> ubuntu@lucidtest:~$ dd if=/dev/zero of=/var/tmp/test.zeros3 bs=1024k count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 4.17285 s, 25.1 MB/s

I did some more timings after converting the qcow2 image to a raw file. Note that you have to be careful: qemu-img convert -O raw will give you a sparse file, not actually allocating space on disk. So I had to flatten it with dd (which incidentally showed a reasonable write throughput of ~350MB/sec to the 12-disk RAID10 array, and was the same writing locally or writing to a single-brick gluster volume).

Tests:

1. VM using a single-brick gluster volume as backend. The brick is on the same node as KVM is running. (Actually the second cluster node was powered off for all these tests.)

    ubuntu@lucidtest:~$ dd if=/dev/zero of=/var/tmp/test.zeros4 bs=1024k count=500
    500+0 records in
    500+0 records out
    524288000 bytes (524 MB) copied, 55.9581 s, 9.4 MB/s

(Strangely, this is lower than the 25MB/s I got before.)

2. VM image stored directly on the RAID10 array - no gluster.

    ubuntu@lucidtest:~$ dd if=/dev/zero of=/var/tmp/test.zeros4 bs=1024k count=500
    500+0 records in
    500+0 records out
    524288000 bytes (524 MB) copied, 10.6027 s, 49.4 MB/s

3. Same VM instance as test 2, but this time with option cache='none' (which doesn't work with glusterfs):

    ubuntu@lucidtest:~$ dd if=/dev/zero of=/var/tmp/test.zeros5 bs=1024k count=500
    500+0 records in
    500+0 records out
    524288000 bytes (524 MB) copied, 2.29959 s, 228 MB/s

That's more like it :-)

So clearly cache='none' (O_DIRECT) makes a big difference when using a local filesystem, and I'd very much like to be able to test it with gluster. I'd also very much look forward to having libglusterfs integrated directly into KVM, which I believe is on the cards at some point:
http://www.mail-archive.com/users@ovirt.org/msg01812.html

Regards,

Brian.

P.S. For those who haven't seen it yet, there's a very nice Red Hat presentation on KVM performance tuning here:
http://www.linux-kvm.org/wiki/images/5/59/Kvm-forum-2011-performance-improvements-optimizations-D.pdf
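For reference, the convert-and-flatten step described above goes roughly like this. A sketch; the file names are illustrative, not taken from the thread:

    # Convert qcow2 to raw; the raw output is sparse, so zero blocks
    # occupy no disk space until they are written.
    qemu-img convert -O raw lucidtest.qcow2 lucidtest.raw

    # Compare logical size with allocated size to see the sparseness.
    du -h --apparent-size lucidtest.raw
    du -h lucidtest.raw

    # A plain dd copy writes every block, forcing full allocation
    # (this is the step that showed the ~350MB/s sequential write above).
    dd if=lucidtest.raw of=lucidtest-flat.raw bs=1M
    mv lucidtest-flat.raw lucidtest.raw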
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
On Fri, Jun 08, 2012 at 05:46:42PM +0100, Brian Candler wrote:
> The VM boots with io='native' and bus='virtio', but performance is still
> very poor:
>
> ubuntu@lucidtest:~$ dd if=/dev/zero of=/var/tmp/test.zeros bs=1024k count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 17.4095 s, 6.0 MB/s
>
> This will need some further work.

And for comparison: it's not the replication which is causing the delay, because I get very similar performance if I copy the image to a distributed volume instead. This is where the VM is running on dev-storage1 but the distributed image happens to reside on dev-storage2:

    ubuntu@lucidtest:~$ dd if=/dev/zero of=/var/tmp/test.zeros2 bs=1024k count=100
    100+0 records in
    100+0 records out
    104857600 bytes (105 MB) copied, 14.5182 s, 7.2 MB/s

And this is after live-migrating the VM to dev-storage2:

    ubuntu@lucidtest:~$ dd if=/dev/zero of=/var/tmp/test.zeros3 bs=1024k count=100
    100+0 records in
    100+0 records out
    104857600 bytes (105 MB) copied, 4.17285 s, 25.1 MB/s

Clearly network latency has a part to play - this is 10GE on CAT6 (yes, I know that's a poor choice for latency, but they're the NICs I happened to have spare). Given that dd is writing large blocks, I'd hope that large ranges of blocks get flushed to disk together.

Of course, 25.1 MB/s is not exactly stellar either. Maybe using a qcow2 (growable) image is part of the problem - I'll need to convert to raw and retest.

Regards,

Brian.
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
On Fri, Jun 08, 2012 at 02:23:57PM -0400, olav johansen wrote:
> This is a single thread trying to process a sequential task where the
> latency really becomes a problem with ls -aR I get similar speed:

That's interesting.

> [@web1 files]# time ls -aR|wc -l
> 1968316
>
> real    27m23.432s
> user    0m5.523s
> sys     0m35.369s
>
> [@web1 files]# time ls -aR|wc -l
> 1968316
>
> real    26m2.728s
> user    0m5.529s
> sys     0m33.779s

That's an average of 0.8ms per file, which isn't too bad if you're also getting similar times with ls -laR. If you're getting much better figures with NFS, then it may be down to something like client-side caching, as you suggested. You may need to look more directly at what's going on, e.g. with strace, to be sure.

> Don't get me wrong, Gluster rocks but in our current case latency is
> killing us, and I'm looking for help on solving this.
>
> One idea I haven't had a chance to try in terms of latency is to split
> the 6x1TB raid 10 on each brick to 3x (2x1TB RAID 1) not sure if
> gluster can even do this. (A1->B1, A2->B2, A3->B3 as one volume)

Sure it can do that - it's called a distributed-replicated volume (see the sketch after this message). It doesn't care whether the paired bricks are on the same node or not. I very much doubt it will make any difference to latency, but feel free to test.

If the latency is in the network then you could try 10GE (but use SFP+ with fibre or direct-attach cables; don't use 10GE over CAT6, because that has even longer latency than 1GE), or Infiniband.

Regards,

Brian.
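For reference, the brick layout Olav describes would be created along these lines - a sketch with hypothetical hostnames (A, B) and brick paths; with "replica 2", consecutive bricks on the command line form the replica pairs:

    # Three RAID1 mounts per node, paired across nodes into one
    # distributed-replicated volume: A:/brick1+B:/brick1, and so on.
    gluster volume create data-storage replica 2 \
        A:/brick1 B:/brick1 \
        A:/brick2 B:/brick2 \
        A:/brick3 B:/brick3
    gluster volume start data-storage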
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
Hi Brian,

This is a single thread trying to process a sequential task, where the latency really becomes a problem. With ls -aR I get similar speed:

    [@web1 files]# time ls -aR|wc -l
    1968316

    real    27m23.432s
    user    0m5.523s
    sys     0m35.369s

    [@web1 files]# time ls -aR|wc -l
    1968316

    real    26m2.728s
    user    0m5.529s
    sys     0m33.779s

I understand "ls -alR" isn't truly our use-case, but we use similar functions: the application we're supporting uses opendir() / file_exists() a lot in PHP. Ideally we wouldn't use either, but that's not the situation I have. We have been pushing NFS to its limits; we're looking for better / scalable performance, and for feedback / suggestions on this.

Also, to rsync the folders to backup servers we hit the same issue as with ls -alR in terms of speed. (I understand in this case I could use the raw /data/ folder.)

Going from a single server to a replicated gluster cluster, what slowdown do others see compared to NFS?

Don't get me wrong, Gluster rocks, but in our current case latency is killing us, and I'm looking for help on solving this.

One idea I haven't had a chance to try in terms of latency is to split the 6x1TB RAID10 on each brick into 3x (2x1TB RAID1) - not sure if gluster can even do this. (A1->B1, A2->B2, A3->B3 as one volume)

Any ideas / suggestions are very appreciated.

Thanks again,

On Fri, Jun 8, 2012 at 4:20 AM, Brian Candler wrote:
> On Fri, Jun 08, 2012 at 12:19:58AM -0400, olav johansen wrote:
> > # mount -t glusterfs fs1:/data-storage /storage
> > I've copied over my data to it again, and doing an ls several times
> > takes ~0.5 seconds:
> > [@web1 files]# time ls -all|wc -l
>
> Like I said before, please also try without the "-l" flag and compare the
> results.
>
> My guess is that ls -al or ls -alR is not representative of the *real*
> workload you are going to ask of your system (i.e. "scan all the files in
> this directory, sequentially, and perform a stat() call on each one in
> turn") - but please contradict me if I'm wrong. However, you need to
> measure how much cost that "-l" is adding.
>
> > Doing the same thing on the raw os files on one node takes 0.021s
> > [@fs2 files]# time ls -all|wc -l
> > 1989
> >
> > real    0m0.021s
> > user    0m0.007s
> > sys     0m0.015s
>
> In that case it's probably all coming from cache. If you wanted to test
> actual disk performance then you would do
>
>     echo "3" >/proc/sys/vm/drop_caches
>
> before each test (on both client and server, if they are different
> machines). But from what you say, it sounds like you are actually more
> interested in the cached answers anyway.
>
> > Just as a crazy reference, on another single server with SSDs (RAID10)
> > I get:
> > files# time ls -alR|wc -l
> > 2260484
> >
> > real    0m15.761s
> > user    0m5.170s
> > sys     0m7.670s
> >
> > for the same operation. (This server even has more files...)
>
> You are not comparing like-for-like. A replicated volume behaves very
> differently from a single brick or distributed volume, as explained before.
>
> If you compared a two-brick (HD) setup with an identical two-brick (SSD)
> setup then that would be meaningful. I would expect that if everything is
> cacheable then you'd get the same results for both. In that case, what
> you'd show is that the latency for open/stat and heal is the cause of the
> delay.
>
> Like I said before, I expect that adding the "-l" flag to ls is giving you
> lots of cumulative latency. This means that the server is actually idle a
> lot of the time, while it's waiting for the next request. So the server
> has spare capacity for handling other clients.
>
> In other words: if your real workload is actually lots of clients accessing
> the system concurrently, you'll get a much better total throughput than in
> the simple tests you are doing, which are a single client performing single
> operations one after the other.
>
> > If I added two more bricks to the cluster / replicated, would this
> > double read speed?
>
> Definitely not. The latency would be the same; it's just that some requests
> would go to bricks A and B, and other requests would go to bricks C and D.
> The other two bricks would be idle, and would not speed things up.
>
> However, if you had concurrent accesses from multiple clients, the extra
> bricks would give extra capacity, so the total *throughput* would be
> higher when there are multiple clients active.
>
> So I repeat my advice from before. If you really want to understand where
> the performance issues are coming from, these two tests may highlight them:
>
> * Compare the same 2-brick replicated volume,
>   using "ls -aR" versus "ls -laR"
>
> * Compare a 2-brick replicated volume to a 2-brick distributed volume,
>   using "ls -laR" on both
>
> Regards,
>
> Brian.
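On the rsync point above: backing up from the raw brick path does avoid the per-file FUSE round trips. A sketch, with a hypothetical backup destination - note that a 3.3 brick carries gluster's own metadata in a .glusterfs directory, which you would want to exclude:

    # Read from the brick filesystem rather than the gluster mount, skipping
    # the .glusterfs metadata tree that GlusterFS 3.3 keeps on each brick.
    rsync -a --exclude='/.glusterfs' /data/storage/ backuphost:/backup/files/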
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
Thanks for sharing that, Brian. I wonder if the problem when trying to power up VMware ESXi VMs has the same cause.

Fernando

-----Original Message-----
From: Brian Candler [mailto:b.cand...@pobox.com]
Sent: 08 June 2012 17:47
To: Pranith Kumar Karampuri
Cc: olav johansen; gluster-users@gluster.org; Fernando Frediani (Qube)
Subject: Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)

On Thu, Jun 07, 2012 at 02:36:26PM +0100, Brian Candler wrote:
> I'm interested in understanding this, especially the split-brain
> scenarios (better to understand them *before* you're stuck in a
> problem :-)
>
> BTW I'm in the process of building a 2-node 3.3 test cluster right now.

FYI, I have got KVM working with a glusterfs 3.3.0 replicated volume as the image store. There are two nodes, both running as glusterfs storage and as KVM hosts.

I built a 10.04 ubuntu image using vmbuilder, stored on the replicated glusterfs volume:

    vmbuilder kvm ubuntu --hostname lucidtest --mem 512 --debug --rootsize 20480 --dest /gluster/safe/images/lucidtest

I was able to fire it up (virsh start lucidtest), ssh into it, and then live-migrate it to another host:

    brian@dev-storage1:~$ virsh migrate --live lucidtest qemu+ssh://dev-storage2/system
    brian@dev-storage2's password:
    brian@dev-storage1:~$ virsh list
     Id Name                 State
    ----------------------------------

    brian@dev-storage1:~$

And I live-migrated it back again, all without the ssh session being interrupted.

I then rebooted the second storage server. While it was rebooting I did some work in the VM which grew its image. When the second storage server came back, it resynchronised the image immediately and automatically. Here are the relevant entries from /var/log/glusterfs/glustershd.log on the first (non-rebooted) machine:

    [2012-06-08 17:08:40.817893] E [socket.c:1715:socket_connect_finish] 0-safe-client-1: connection to 10.0.1.2:24009 failed (Connection timed out)
    [2012-06-08 17:09:10.698272] I [client-handshake.c:1636:select_server_supported_programs] 0-safe-client-1: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)
    [2012-06-08 17:09:10.700197] I [client-handshake.c:1433:client_setvolume_cbk] 0-safe-client-1: Connected to 10.0.1.2:24009, attached to remote volume '/disk/storage2/safe'.
    [2012-06-08 17:09:10.700234] I [client-handshake.c:1445:client_setvolume_cbk] 0-safe-client-1: Server and Client lk-version numbers are not same, reopening the fds
    [2012-06-08 17:09:10.701901] I [client-handshake.c:453:client_set_lk_version_cbk] 0-safe-client-1: Server lk version = 1
    [2012-06-08 17:09:14.699571] I [afr-common.c:1189:afr_detect_self_heal_by_iatt] 0-safe-replicate-0: size differs for
    [2012-06-08 17:09:14.699616] I [afr-common.c:1340:afr_launch_self_heal] 0-safe-replicate-0: background data self-heal triggered. path: , reason: lookup detected pending operations
    [2012-06-08 17:09:18.230855] I [afr-self-heal-algorithm.c:122:sh_loop_driver_done] 0-safe-replicate-0: diff self-heal on : completed. (19 blocks of 3299 were different (0.58%))
    [2012-06-08 17:09:18.232520] I [afr-self-heal-common.c:2159:afr_self_heal_completion_cbk] 0-safe-replicate-0: background data self-heal completed on

So at first glance this is extremely impressive. It's also very new and shiny, and I wonder how many edge cases remain to be debugged in live use, but I can't argue that it's very neat indeed!

Performance-wise:

(1) On the storage/VM host, which has the replicated volume mounted via FUSE:

    root@dev-storage1:~# dd if=/dev/zero of=/gluster/safe/test.zeros bs=1024k count=500
    500+0 records in
    500+0 records out
    524288000 bytes (524 MB) copied, 2.7086 s, 194 MB/s

(The bricks have a 12-disk md RAID10 array, far-2 layout, and there's probably scope for some performance tweaking here.)

(2) However, from within the VM guest, performance was very poor (2.2MB/s). I tried my usual tuning options:

    ...

but glusterfs objected to the cache='none' option (possibly this opens the file with O_DIRECT?):

    # virsh start lucidtest
    error: Failed to start domain lucidtest
    error: internal error process exited while connecting to monitor: char device redirected to /dev/pts/0
    kvm: -drive file=/gluster/safe/images/lucidtest/tmpaJqTD9.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,aio=native: could not open disk image /gluster/safe/images/lucidtest/tmpaJqTD9.qcow2: Invalid argument

The VM boots with io='native' and bus='virtio', but performance is still very poor:

    ubuntu@lucidtest:~$ dd if=/dev/zero of=/var/tmp/test.zeros bs=1024k count=100
    100+0 records in
    100+0 records out
    104857600 bytes (105 MB) copied, 17.4095 s, 6.0 MB/s

This will need some further work.

The guest is lucid (10.04) only because for some reason I cannot get a 12.04 image built with vmbuilder to work (it spins at 100% CPU). This is not related to glusterfs and is something I need to debug separately. Maybe a 12.04 guest will also run better.

Anyway, just thought it was worth a mention. Keep up the good work guys!

Regards,

Brian.
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
On Fri, Jun 08, 2012 at 05:46:42PM +0100, Brian Candler wrote:
> but glusterfs objected to the cache='none' option (possibly this opens the
> file with O_DIRECT?)

Yes, that's definitely the problem, as I can see if I strace the kvm process:

    stat("/gluster/safe/images/lucidtest/tmpaJqTD9.qcow2", {st_mode=S_IFREG|0644, st_size=774307840, ...}) = 0
    open("/gluster/safe/images/lucidtest/tmpaJqTD9.qcow2", O_RDWR|O_DIRECT|O_CLOEXEC) = -1 EINVAL (Invalid argument)

I found http://gluster.org/pipermail/gluster-users/2012-March/009936.html and tried remounting with '-o direct-io-mode=enable', but that didn't make a difference. Also, 'mount' output doesn't show this option anyway.

That page also talked about adding 'option o-direct enable' to the posix translator, but I'd rather not mess with that directly, as I have not yet found any documentation on how to modify translator options while still using the CLI/glusterd to manage the configuration.

Regards,

Brian.
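For what it's worth, the volfiles that glusterd generates can at least be inspected, which shows where an 'option o-direct enable' line would sit; hand-editing them is risky precisely because glusterd regenerates them. A sketch, assuming a default 3.3 install and the volume name 'safe' from the post above:

    # Brick volfiles live under /var/lib/glusterd/vols/<volume>/.
    # Show the posix translator section of each brick volfile.
    grep -B2 -A6 'storage/posix' /var/lib/glusterd/vols/safe/safe.*.vol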
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
On Thu, Jun 07, 2012 at 02:36:26PM +0100, Brian Candler wrote:
> I'm interested in understanding this, especially the split-brain
> scenarios (better to understand them *before* you're stuck in a
> problem :-)
>
> BTW I'm in the process of building a 2-node 3.3 test cluster right now.

FYI, I have got KVM working with a glusterfs 3.3.0 replicated volume as the image store. There are two nodes, both running as glusterfs storage and as KVM hosts.

I built a 10.04 ubuntu image using vmbuilder, stored on the replicated glusterfs volume:

    vmbuilder kvm ubuntu --hostname lucidtest --mem 512 --debug --rootsize 20480 --dest /gluster/safe/images/lucidtest

I was able to fire it up (virsh start lucidtest), ssh into it, and then live-migrate it to another host:

    brian@dev-storage1:~$ virsh migrate --live lucidtest qemu+ssh://dev-storage2/system
    brian@dev-storage2's password:
    brian@dev-storage1:~$ virsh list
     Id Name                 State
    ----------------------------------

    brian@dev-storage1:~$

And I live-migrated it back again, all without the ssh session being interrupted.

I then rebooted the second storage server. While it was rebooting I did some work in the VM which grew its image. When the second storage server came back, it resynchronised the image immediately and automatically. Here are the relevant entries from /var/log/glusterfs/glustershd.log on the first (non-rebooted) machine:

    [2012-06-08 17:08:40.817893] E [socket.c:1715:socket_connect_finish] 0-safe-client-1: connection to 10.0.1.2:24009 failed (Connection timed out)
    [2012-06-08 17:09:10.698272] I [client-handshake.c:1636:select_server_supported_programs] 0-safe-client-1: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)
    [2012-06-08 17:09:10.700197] I [client-handshake.c:1433:client_setvolume_cbk] 0-safe-client-1: Connected to 10.0.1.2:24009, attached to remote volume '/disk/storage2/safe'.
    [2012-06-08 17:09:10.700234] I [client-handshake.c:1445:client_setvolume_cbk] 0-safe-client-1: Server and Client lk-version numbers are not same, reopening the fds
    [2012-06-08 17:09:10.701901] I [client-handshake.c:453:client_set_lk_version_cbk] 0-safe-client-1: Server lk version = 1
    [2012-06-08 17:09:14.699571] I [afr-common.c:1189:afr_detect_self_heal_by_iatt] 0-safe-replicate-0: size differs for
    [2012-06-08 17:09:14.699616] I [afr-common.c:1340:afr_launch_self_heal] 0-safe-replicate-0: background data self-heal triggered. path: , reason: lookup detected pending operations
    [2012-06-08 17:09:18.230855] I [afr-self-heal-algorithm.c:122:sh_loop_driver_done] 0-safe-replicate-0: diff self-heal on : completed. (19 blocks of 3299 were different (0.58%))
    [2012-06-08 17:09:18.232520] I [afr-self-heal-common.c:2159:afr_self_heal_completion_cbk] 0-safe-replicate-0: background data self-heal completed on

So at first glance this is extremely impressive. It's also very new and shiny, and I wonder how many edge cases remain to be debugged in live use, but I can't argue that it's very neat indeed!

Performance-wise:

(1) On the storage/VM host, which has the replicated volume mounted via FUSE:

    root@dev-storage1:~# dd if=/dev/zero of=/gluster/safe/test.zeros bs=1024k count=500
    500+0 records in
    500+0 records out
    524288000 bytes (524 MB) copied, 2.7086 s, 194 MB/s

(The bricks have a 12-disk md RAID10 array, far-2 layout, and there's probably scope for some performance tweaking here.)

(2) However, from within the VM guest, performance was very poor (2.2MB/s). I tried my usual tuning options:

    ...

but glusterfs objected to the cache='none' option (possibly this opens the file with O_DIRECT?):

    # virsh start lucidtest
    error: Failed to start domain lucidtest
    error: internal error process exited while connecting to monitor: char device redirected to /dev/pts/0
    kvm: -drive file=/gluster/safe/images/lucidtest/tmpaJqTD9.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,aio=native: could not open disk image /gluster/safe/images/lucidtest/tmpaJqTD9.qcow2: Invalid argument

The VM boots with io='native' and bus='virtio', but performance is still very poor:

    ubuntu@lucidtest:~$ dd if=/dev/zero of=/var/tmp/test.zeros bs=1024k count=100
    100+0 records in
    100+0 records out
    104857600 bytes (105 MB) copied, 17.4095 s, 6.0 MB/s

This will need some further work.

The guest is lucid (10.04) only because for some reason I cannot get a 12.04 image built with vmbuilder to work (it spins at 100% CPU). This is not related to glusterfs and is something I need to debug separately. Maybe a 12.04 guest will also run better.

Anyway, just thought it was worth a mention. Keep up the good work guys!

Regards,

Brian.
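As an aside, the 3.3 self-heal daemon mentioned here can be queried from the CLI while a resync like the one above is in progress. A sketch, using the volume name from the post:

    # Files the self-heal daemon still has pending for this volume.
    gluster volume heal safe info

    # Files it has recently healed.
    gluster volume heal safe info healed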
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
On Fri, Jun 08, 2012 at 12:19:58AM -0400, olav johansen wrote:
> # mount -t glusterfs fs1:/data-storage /storage
> I've copied over my data to it again, and doing an ls several times
> takes ~0.5 seconds:
> [@web1 files]# time ls -all|wc -l

Like I said before, please also try without the "-l" flag and compare the results.

My guess is that ls -al or ls -alR is not representative of the *real* workload you are going to ask of your system (i.e. "scan all the files in this directory, sequentially, and perform a stat() call on each one in turn") - but please contradict me if I'm wrong. However, you need to measure how much cost that "-l" is adding.

> Doing the same thing on the raw os files on one node takes 0.021s
> [@fs2 files]# time ls -all|wc -l
> 1989
>
> real    0m0.021s
> user    0m0.007s
> sys     0m0.015s

In that case it's probably all coming from cache. If you wanted to test actual disk performance then you would do

    echo "3" >/proc/sys/vm/drop_caches

before each test, on both client and server if they are different machines (see the sketch after this message). But from what you say, it sounds like you are actually more interested in the cached answers anyway.

> Just as a crazy reference, on another single server with SSDs (RAID10)
> I get:
> files# time ls -alR|wc -l
> 2260484
>
> real    0m15.761s
> user    0m5.170s
> sys     0m7.670s
>
> for the same operation. (This server even has more files...)

You are not comparing like-for-like. A replicated volume behaves very differently from a single brick or distributed volume, as explained before.

If you compared a two-brick (HD) setup with an identical two-brick (SSD) setup then that would be meaningful. I would expect that if everything is cacheable then you'd get the same results for both. In that case, what you'd show is that the latency for open/stat and heal is the cause of the delay.

Like I said before, I expect that adding the "-l" flag to ls is giving you lots of cumulative latency. This means that the server is actually idle a lot of the time, while it's waiting for the next request. So the server has spare capacity for handling other clients.

In other words: if your real workload is actually lots of clients accessing the system concurrently, you'll get a much better total throughput than in the simple tests you are doing, which are a single client performing single operations one after the other.

> If I added two more bricks to the cluster / replicated, would this
> double read speed?

Definitely not. The latency would be the same; it's just that some requests would go to bricks A and B, and other requests would go to bricks C and D. The other two bricks would be idle, and would not speed things up.

However, if you had concurrent accesses from multiple clients, the extra bricks would give extra capacity, so the total *throughput* would be higher when there are multiple clients active.

So I repeat my advice from before. If you really want to understand where the performance issues are coming from, these two tests may highlight them:

* Compare the same 2-brick replicated volume,
  using "ls -aR" versus "ls -laR"

* Compare a 2-brick replicated volume to a 2-brick distributed volume,
  using "ls -laR" on both

Regards,

Brian.
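The drop_caches test mentioned above, written out as a repeatable script - a sketch; run as root on the client and on each server, and the /storage/files path is illustrative:

    #!/bin/bash
    # Flush dirty pages, then drop the page cache, dentries and inodes, so the
    # following listing measures disk and network latency rather than cache hits.
    sync
    echo 3 > /proc/sys/vm/drop_caches
    time ls -laR /storage/files > /dev/null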
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
[2012-06-07 20:47:49.592729] I [client-handshake.c:1636:select_server_supported_programs] 0-data-storage-client-0: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)
[2012-06-07 20:47:49.595099] I [client-handshake.c:1636:select_server_supported_programs] 0-data-storage-client-1: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)
[2012-06-07 20:47:49.608455] I [client-handshake.c:1433:client_setvolume_cbk] 0-data-storage-client-0: Connected to 10.1.80.81:24009, attached to remote volume '/data/storage'.
[2012-06-07 20:47:49.608489] I [client-handshake.c:1445:client_setvolume_cbk] 0-data-storage-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-06-07 20:47:49.608572] I [afr-common.c:3627:afr_notify] 0-data-storage-replicate-0: Subvolume 'data-storage-client-0' came back up; going online.
[2012-06-07 20:47:49.608837] I [client-handshake.c:453:client_set_lk_version_cbk] 0-data-storage-client-0: Server lk version = 1
[2012-06-07 20:47:49.616381] I [client-handshake.c:1433:client_setvolume_cbk] 0-data-storage-client-1: Connected to 10.1.80.82:24009, attached to remote volume '/data/storage'.
[2012-06-07 20:47:49.616434] I [client-handshake.c:1445:client_setvolume_cbk] 0-data-storage-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2012-06-07 20:47:49.621808] I [fuse-bridge.c:4193:fuse_graph_setup] 0-fuse: switched to graph 0
[2012-06-07 20:47:49.622793] I [client-handshake.c:453:client_set_lk_version_cbk] 0-data-storage-client-1: Server lk version = 1
[2012-06-07 20:47:49.622873] I [fuse-bridge.c:3376:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.13
[2012-06-07 20:47:49.623440] I [afr-common.c:1964:afr_set_root_inode_on_first_lookup] 0-data-storage-replicate-0: added root inode

----- End storage.log -----

On Thu, Jun 7, 2012 at 9:46 AM, Pranith Kumar Karampuri wrote:
> hi Brian,
>    The 'stat' command comes as the fop (file operation) 'lookup' to the
> gluster mount, which triggers self-heal. So the behaviour is still the
> same. I was referring to the fop 'stat', which will be performed only on
> one of the bricks. Unfortunately most of the commands and fops have the
> same name. Following are some examples of read-fops:
>    .access
>    .stat
>    .fstat
>    .readlink
>    .getxattr
>    .fgetxattr
>    .readv
>
> Pranith.
>
> ----- Original Message -----
> From: "Brian Candler"
> To: "Pranith Kumar Karampuri"
> Cc: "olav johansen" , gluster-users@gluster.org,
> "Fernando Frediani (Qube)"
> Sent: Thursday, June 7, 2012 7:06:26 PM
> Subject: Re: [Gluster-users] Performance optimization tips Gluster 3.3?
> (small files / directory listings)
>
> On Thu, Jun 07, 2012 at 08:34:56AM -0400, Pranith Kumar Karampuri wrote:
> > Brian,
> > Small correction: 'sending queries to *both* servers to check they are
> > in sync - even read accesses.' Read fops like stat/getxattr etc are
> > sent to only one brick.
>
> Is that new behaviour for 3.3? My understanding was that stat() was a
> healing operation.
>
> http://gluster.org/community/documentation/index.php/Gluster_3.2:_Triggering_Self-Heal_on_Replicate
>
> If this is no longer true, then I'd like to understand what happens after
> a node has been down and comes up again. I understand there's a
> self-healing daemon in 3.3, but what if you try to access a file which
> has not yet been healed?
>
> I'm interested in understanding this, especially the split-brain
> scenarios (better to understand them *before* you're stuck in a
> problem :-)
>
> BTW I'm in the process of building a 2-node 3.3 test cluster right now.
>
> Cheers,
>
> Brian.
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
hi Brian,
   The 'stat' command comes as the fop (file operation) 'lookup' to the gluster mount, which triggers self-heal. So the behaviour is still the same. I was referring to the fop 'stat', which will be performed only on one of the bricks. Unfortunately most of the commands and fops have the same name. Following are some examples of read-fops:

    .access
    .stat
    .fstat
    .readlink
    .getxattr
    .fgetxattr
    .readv

Pranith.

----- Original Message -----
From: "Brian Candler"
To: "Pranith Kumar Karampuri"
Cc: "olav johansen" , gluster-users@gluster.org, "Fernando Frediani (Qube)"
Sent: Thursday, June 7, 2012 7:06:26 PM
Subject: Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)

On Thu, Jun 07, 2012 at 08:34:56AM -0400, Pranith Kumar Karampuri wrote:
> Brian,
> Small correction: 'sending queries to *both* servers to check they are in
> sync - even read accesses.' Read fops like stat/getxattr etc are sent to
> only one brick.

Is that new behaviour for 3.3? My understanding was that stat() was a healing operation.

http://gluster.org/community/documentation/index.php/Gluster_3.2:_Triggering_Self-Heal_on_Replicate

If this is no longer true, then I'd like to understand what happens after a node has been down and comes up again. I understand there's a self-healing daemon in 3.3, but what if you try to access a file which has not yet been healed?

I'm interested in understanding this, especially the split-brain scenarios (better to understand them *before* you're stuck in a problem :-)

BTW I'm in the process of building a 2-node 3.3 test cluster right now.

Cheers,

Brian.
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
On Thu, Jun 07, 2012 at 08:34:56AM -0400, Pranith Kumar Karampuri wrote:
> Brian,
> Small correction: 'sending queries to *both* servers to check they are in
> sync - even read accesses.' Read fops like stat/getxattr etc are sent to
> only one brick.

Is that new behaviour for 3.3? My understanding was that stat() was a healing operation.

http://gluster.org/community/documentation/index.php/Gluster_3.2:_Triggering_Self-Heal_on_Replicate

If this is no longer true, then I'd like to understand what happens after a node has been down and comes up again. I understand there's a self-healing daemon in 3.3, but what if you try to access a file which has not yet been healed?

I'm interested in understanding this, especially the split-brain scenarios (better to understand them *before* you're stuck in a problem :-)

BTW I'm in the process of building a 2-node 3.3 test cluster right now.

Cheers,

Brian.
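On the split-brain question: 3.3's CLI does expose some of the self-heal daemon's state, which helps when reasoning about files that have not yet been healed. A sketch, with an illustrative volume name:

    # Entries the self-heal daemon failed to heal, and entries it
    # considers split-brain, respectively.
    gluster volume heal testvol info heal-failed
    gluster volume heal testvol info split-brain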
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
Brian,
Small correction: 'sending queries to *both* servers to check they are in sync - even read accesses.' Read fops like stat/getxattr etc are sent to only one brick.

Pranith.

----- Original Message -----
From: "Brian Candler"
To: "Fernando Frediani (Qube)"
Cc: "olav johansen" , "gluster-users@gluster.org"
Sent: Thursday, June 7, 2012 4:24:37 PM
Subject: Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)

On Thu, Jun 07, 2012 at 10:10:03AM +0000, Fernando Frediani (Qube) wrote:
> Sorry this reply won't be of any help to your problem, but I am too
> curious to understand how it can be even slower when mounting using the
> Gluster client, which I would expect always to be quicker than NFS or
> anything else.

(1) Try it with "ls -aR" or "find ." instead of "ls -alR".

(2) Try it on a gluster non-replicated volume (for a fair comparison with direct NFS access).

With a replicated volume, many accesses involve sending queries to *both* servers to check they are in sync - even read accesses. This in turn can cause disk seeks on both machines, so the latency you'll get is the larger of the two. If you are doing lots of accesses sequentially then the latencies all add up.

A stat() is one of those accesses which touches both machines, and "ls -l" forces a stat() of each file found. In fact, a quick test suggests ls -l does stat, lstat, getxattr and lgetxattr:

    $ strace ls -laR . >/dev/null 2>ert; cut -f1 -d'(' ert | sort | uniq -c
         13 access
          1 arch_prctl
          5 brk
        395 close
          4 connect
          1 execve
          1 exit_group
          2 fcntl
        391 fstat
          3 futex
        702 getdents
          1 getrlimit
       1719 getxattr
          3 ioctl
       1721 lgetxattr
          9 lseek
       1721 lstat
         58 mmap
         24 mprotect
         12 munmap
        424 open
         19 read
          2 readlink
          2 rt_sigaction
          1 rt_sigprocmask
          1 set_robust_list
          1 set_tid_address
          4 socket
       1719 stat
          1 statfs
         29 write

Looking at the detail in the strace output, I see these are actually

    lstat(, ...)
    lgetxattr(, "security.selinux", ...)
    getxattr(, "system.posix_acl_access", ...)
    stat("/etc/localtime", ...)

Compare without -l:

    $ strace ls -aR . >/dev/null 2>ert; cut -f1 -d'(' ert | sort | uniq -c
          9 access
          1 arch_prctl
          4 brk
        377 close
          1 execve
          1 exit_group
          1 fcntl
        376 fstat
          3 futex
        702 getdents
          1 getrlimit
          3 ioctl
         39 mmap
         16 mprotect
          4 munmap
        388 open
         11 read
          2 rt_sigaction
          1 rt_sigprocmask
          1 set_robust_list
          1 set_tid_address
          1 stat
          1 statfs
          9 write

Regards,

Brian.
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
Here's the link:
http://community.gluster.org/a/nfs-performance-with-fuse-client-redundancy/

Sent again with a reply to all.

Gerald

----- Original Message -----
> From: "Christian Meisinger"
> To: "olav johansen"
> Cc: gluster-users@gluster.org
> Sent: Thursday, June 7, 2012 7:00:14 AM
> Subject: Re: [Gluster-users] Performance optimization tips Gluster 3.3?
> (small files / directory listings)
>
> Hello there.
>
> That's really interesting, because we are thinking about using GlusterFS
> too, with a similar setup/scenario.
>
> I read about a really strange setup with a GlusterFS native client mount
> on the web servers and an NFS mount on top of that, so you get GlusterFS
> failover + NFS caching. Can't find the link right now.
>
> ----- Original Message -----
> From: "olav johansen"
> To: gluster-users@gluster.org
> Sent: Thursday, June 7, 2012 8:02:14 AM
> Subject: [Gluster-users] Performance optimization tips Gluster 3.3?
> (small files / directory listings)
>
> Hi,
>
> I'm using Gluster 3.3.0-1.el6.x86_64 on two storage nodes in replicated
> mode (fs1, fs2). Node specs: CentOS 6.2, Intel Quad Core 2.8GHz, 4GB RAM,
> 3ware RAID, 2x500GB SATA 7200rpm (RAID1 for the OS), 6x1TB SATA 7200rpm
> (RAID10 for /data), 1Gbit network.
>
> I've mounted the data partition on web1, a Dual Quad 2.8GHz with 8GB RAM,
> using glusterfs. (I also tried an NFS -> Gluster mount.)
>
> We have 50GB of files: ~800,000 files in 3 levels of directories (max
> 2000 directories in one folder).
>
> My main problem is the speed of directory listings: "ls -alR" on the
> gluster mount takes 23 minutes, every time. There doesn't seem to be any
> caching of directory listing information; with regular NFS (not gluster)
> between web1<->fs1 this takes 6m13s the first time and 5m13s thereafter.
>
> The gluster mount is 4+ times slower than pure NFS to a single server for
> directory indexing - is this as expected? I understand there are many
> more calls involved in checking both nodes, but I'm just looking for a
> reality check here.
>
> Any suggestions for how I can speed this up?
>
> Thanks,
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
Hello there.

That's really interesting, because we are thinking about using GlusterFS too, with a similar setup/scenario.

I read about a really strange setup with a GlusterFS native client mount on the web servers and an NFS mount on top of that, so you get GlusterFS failover + NFS caching. Can't find the link right now.

----- Original Message -----
From: "olav johansen"
To: gluster-users@gluster.org
Sent: Thursday, June 7, 2012 8:02:14 AM
Subject: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)

Hi,

I'm using Gluster 3.3.0-1.el6.x86_64 on two storage nodes in replicated mode (fs1, fs2). Node specs: CentOS 6.2, Intel Quad Core 2.8GHz, 4GB RAM, 3ware RAID, 2x500GB SATA 7200rpm (RAID1 for the OS), 6x1TB SATA 7200rpm (RAID10 for /data), 1Gbit network.

I've mounted the data partition on web1, a Dual Quad 2.8GHz with 8GB RAM, using glusterfs. (I also tried an NFS -> Gluster mount.)

We have 50GB of files: ~800,000 files in 3 levels of directories (max 2000 directories in one folder).

My main problem is the speed of directory listings: "ls -alR" on the gluster mount takes 23 minutes, every time. There doesn't seem to be any caching of directory listing information; with regular NFS (not gluster) between web1<->fs1 this takes 6m13s the first time and 5m13s thereafter.

The gluster mount is 4+ times slower than pure NFS to a single server for directory indexing - is this as expected? I understand there are many more calls involved in checking both nodes, but I'm just looking for a reality check here.

Any suggestions for how I can speed this up?

Thanks,
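For the NFS-caching comparison being discussed in this thread: Gluster 3.3 ships a built-in NFS server, so the same volume can be mounted through the kernel NFS client, which does client-side attribute caching. A sketch; Gluster's NFS server speaks NFSv3 over TCP only, hence the options:

    # Mount the volume via Gluster's built-in NFS server instead of FUSE.
    mount -t nfs -o vers=3,proto=tcp,mountproto=tcp fs1:/data-storage /storage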
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
On Thu, Jun 07, 2012 at 10:10:03AM +0000, Fernando Frediani (Qube) wrote:
> Sorry this reply won't be of any help to your problem, but I am too
> curious to understand how it can be even slower when mounting using the
> Gluster client, which I would expect always to be quicker than NFS or
> anything else.

(1) Try it with "ls -aR" or "find ." instead of "ls -alR".

(2) Try it on a gluster non-replicated volume (for a fair comparison with direct NFS access).

With a replicated volume, many accesses involve sending queries to *both* servers to check they are in sync - even read accesses. This in turn can cause disk seeks on both machines, so the latency you'll get is the larger of the two. If you are doing lots of accesses sequentially then the latencies all add up.

A stat() is one of those accesses which touches both machines, and "ls -l" forces a stat() of each file found. In fact, a quick test suggests ls -l does stat, lstat, getxattr and lgetxattr:

    $ strace ls -laR . >/dev/null 2>ert; cut -f1 -d'(' ert | sort | uniq -c
         13 access
          1 arch_prctl
          5 brk
        395 close
          4 connect
          1 execve
          1 exit_group
          2 fcntl
        391 fstat
          3 futex
        702 getdents
          1 getrlimit
       1719 getxattr
          3 ioctl
       1721 lgetxattr
          9 lseek
       1721 lstat
         58 mmap
         24 mprotect
         12 munmap
        424 open
         19 read
          2 readlink
          2 rt_sigaction
          1 rt_sigprocmask
          1 set_robust_list
          1 set_tid_address
          4 socket
       1719 stat
          1 statfs
         29 write

Looking at the detail in the strace output, I see these are actually

    lstat(, ...)
    lgetxattr(, "security.selinux", ...)
    getxattr(, "system.posix_acl_access", ...)
    stat("/etc/localtime", ...)

Compare without -l:

    $ strace ls -aR . >/dev/null 2>ert; cut -f1 -d'(' ert | sort | uniq -c
          9 access
          1 arch_prctl
          4 brk
        377 close
          1 execve
          1 exit_group
          1 fcntl
        376 fstat
          3 futex
        702 getdents
          1 getrlimit
          3 ioctl
         39 mmap
         16 mprotect
          4 munmap
        388 open
         11 read
          2 rt_sigaction
          1 rt_sigprocmask
          1 set_robust_list
          1 set_tid_address
          1 stat
          1 statfs
          9 write

Regards,

Brian.
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
Hi,

Sorry this reply won't be of any help to your problem, but I am too curious to understand how it can be even slower when mounting using the Gluster client, which I would expect always to be quicker than NFS or anything else. If you find the reason, please report it back to the list and share it with us. I think this directory index issue has been reported already for systems with many files.

Regards,

Fernando

From: gluster-users-boun...@gluster.org [mailto:gluster-users-boun...@gluster.org] On Behalf Of olav johansen
Sent: 07 June 2012 03:32
To: gluster-users@gluster.org
Subject: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)

Hi,

I'm using Gluster 3.3.0-1.el6.x86_64 on two storage nodes in replicated mode (fs1, fs2). Node specs: CentOS 6.2, Intel Quad Core 2.8GHz, 4GB RAM, 3ware RAID, 2x500GB SATA 7200rpm (RAID1 for the OS), 6x1TB SATA 7200rpm (RAID10 for /data), 1Gbit network.

I've mounted the data partition on web1, a Dual Quad 2.8GHz with 8GB RAM, using glusterfs. (I also tried an NFS -> Gluster mount.)

We have 50GB of files: ~800,000 files in 3 levels of directories (max 2000 directories in one folder).

My main problem is the speed of directory listings: "ls -alR" on the gluster mount takes 23 minutes, every time. There doesn't seem to be any caching of directory listing information; with regular NFS (not gluster) between web1<->fs1 this takes 6m13s the first time and 5m13s thereafter.

The gluster mount is 4+ times slower than pure NFS to a single server for directory indexing - is this as expected? I understand there are many more calls involved in checking both nodes, but I'm just looking for a reality check here.

Any suggestions for how I can speed this up?

Thanks,
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
Could you post the logs of the mount process so that we can analyse what is going on? Did you have data on the bricks before you created the volume? Did you upgrade from 3.2?

Pranith

----- Original Message -----
From: "olav johansen"
To: gluster-users@gluster.org
Sent: Thursday, June 7, 2012 8:02:14 AM
Subject: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)

Hi,

I'm using Gluster 3.3.0-1.el6.x86_64 on two storage nodes in replicated mode (fs1, fs2). Node specs: CentOS 6.2, Intel Quad Core 2.8GHz, 4GB RAM, 3ware RAID, 2x500GB SATA 7200rpm (RAID1 for the OS), 6x1TB SATA 7200rpm (RAID10 for /data), 1Gbit network.

I've mounted the data partition on web1, a Dual Quad 2.8GHz with 8GB RAM, using glusterfs. (I also tried an NFS -> Gluster mount.)

We have 50GB of files: ~800,000 files in 3 levels of directories (max 2000 directories in one folder).

My main problem is the speed of directory listings: "ls -alR" on the gluster mount takes 23 minutes, every time. There doesn't seem to be any caching of directory listing information; with regular NFS (not gluster) between web1<->fs1 this takes 6m13s the first time and 5m13s thereafter.

The gluster mount is 4+ times slower than pure NFS to a single server for directory indexing - is this as expected? I understand there are many more calls involved in checking both nodes, but I'm just looking for a reality check here.

Any suggestions for how I can speed this up?

Thanks,
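The mount log Pranith is asking for is written on the client side and named after the mount point. A sketch, assuming the /storage mount point used earlier in the thread:

    # FUSE client log for a volume mounted at /storage; a nested mount point
    # such as /mnt/storage would log to mnt-storage.log instead.
    less /var/log/glusterfs/storage.log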