Re: [Gluster-users] cluster.min-free-disk working?
Maybe my question was a bit involved, so I'll try again: while searching the web I have found various issues connected to cluster.min-free-disk (e.g., that one should use an absolute size rather than a percentage). Would it be possible to get an update on the status of this option? Thanks, /jon

On Jun 11, 2013 16:47 Jon Tegner teg...@renget.se wrote:

Hi, I have a system consisting of four bricks, using 3.3.2qa3. I used the command

gluster volume set glusterKumiko cluster.min-free-disk 20%

Two of the bricks were empty, and two were filled to just under 80% when the volume was built. Now, when syncing data (from a primary system) with min-free-disk set to 20%, I expected new data to go to the two empty bricks, but gluster does not seem to honor the 20% limit. Have I missed something here? Thanks! /jon

*** gluster volume info
Volume Name: glusterKumiko
Type: Distribute
Volume ID: 8f639d0f-9099-46b4-b597-244d89def5bd
Status: Started
Number of Bricks: 4
Transport-type: tcp,rdma
Bricks:
Brick1: kumiko01:/mnt/raid6
Brick2: kumiko02:/mnt/raid6
Brick3: kumiko03:/mnt/raid6
Brick4: kumiko04:/mnt/raid6
Options Reconfigured:
cluster.min-free-disk: 20%

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
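If the percentage form turns out to be the culprit, a minimal sketch of switching to an absolute free-space threshold instead (the 200GB figure is purely illustrative, and the accepted units should be double-checked against your gluster release):

    # check what the option is currently set to
    gluster volume info glusterKumiko | grep min-free-disk

    # switch from a percentage to an absolute amount of free space per brick
    # (200GB is an illustrative value, not a recommendation)
    gluster volume set glusterKumiko cluster.min-free-disk 200GB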
Re: [Gluster-users] Ubuntu 12.04 and fallocate()
So, in other words, stay away from fallocate for now :). Thanks for the info Brian!

On Thu, Jun 13, 2013 at 4:05 PM, Brian Foster bfos...@redhat.com wrote:

On 06/13/2013 01:38 PM, Jacob Godin wrote: Hey all, I'm trying to use fallocate with qcow2 images to increase performance. When doing so (with OpenStack), my Gluster mountpoint goes into "Transport endpoint is not connected". I am running the Ubuntu 12.04 version of glusterfs-client/server (3.2.5-1ubuntu1) and fuse (2.8.6-2ubuntu2). Any ideas?

This sounds like the following bug: https://bugzilla.redhat.com/show_bug.cgi?id=856704 Note that the fix there is to not crash and to return -ENOSYS instead. ;) fallocate still isn't supported in gluster, though patches for said functionality are currently under review. Brian

Thanks, Jacob

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
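For anyone else hitting this, one way to probe whether a given mountpoint currently supports fallocate() before pointing qemu/OpenStack at it is the util-linux fallocate tool; a small sketch (the mountpoint path is a placeholder):

    cd /mnt/glustervol            # placeholder mountpoint
    fallocate -l 1M probe.img && echo "fallocate works" || echo "fallocate not supported"
    rm -f probe.img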
[Gluster-users] 40 gig ethernet
I have been playing around with Gluster on and off for the last 6 years or so. Most of the things that have kept me from using it have been related to latency. In the past I have been using 10 gig infiniband or 10 gig ethernet; recently the price of 40 gig ethernet has fallen quite a bit with vendors like Arista. My question is: is this worth it at all for something like Gluster? The port-to-port latency looks impressive at under 4 microseconds, but I don't yet know what total system-to-system latency would look like assuming QSFP+ copper cables and the linux stack. -- Nathan Stratton Founder, CTO Exario Networks, Inc. nathan at robotics.net nathan at exarionetworks.com http://www.robotics.net http://www.exarionetworks.com/ Building the WebRTC solutions today that your customers will demand tomorrow. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] incomplete listing of a directory, sometimes getdents loops until out of memory
On 06/13/2013 03:38 PM, John Brunelle wrote:

Hello, We're having an issue with our distributed gluster filesystem:

* gluster 3.3.1 servers and clients
* distributed volume -- 69 bricks (4.6T each) split evenly across 3 nodes
* xfs backend
* nfs clients
* nfs.enable-ino32: On
* servers: CentOS 6.3, 2.6.32-279.14.1.el6.centos.plus.x86_64
* clients: CentOS 5.7, 2.6.18-274.12.1.el5

We have a directory containing 3,343 subdirectories. On some clients, ls lists only a subset of the directories (a different amount on different clients). On others, ls gets stuck in a getdents loop and consumes more and more memory until it hits ENOMEM. On yet others, it works fine. Having the bad clients remount or drop caches makes the problem temporarily go away, but eventually it comes back. The issue sounds a lot like bug #838784, but we are using xfs on the backend, and this seems like more of a client issue.

Turning on cluster.readdir-optimize can help readdir when a directory contains a large number of sub-directories and the volume has a large number of bricks. Do you observe any change with this option enabled? -Vijay

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
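For reference, the option Vijay mentions is toggled per volume; it would look something like this (the volume name is a placeholder):

    gluster volume set <volname> cluster.readdir-optimize on
    # and to revert it later:
    gluster volume set <volname> cluster.readdir-optimize off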
Re: [Gluster-users] incomplete listing of a directory, sometimes getdents loops until out of memory
Thanks for the reply, Vijay. I set that parameter to On, but it hasn't helped, and in fact it seems a bit worse. After making the change on the volume and dropping caches on some test clients, some are now seeing no subdirectories at all. In my tests before, after dropping caches clients would go back to seeing all the subdirectories, and only after a while would they start disappearing (and the listing had never gone to zero before). Any other ideas? Thanks, John

On Fri, Jun 14, 2013 at 10:35 AM, Vijay Bellur vbel...@redhat.com wrote:

On 06/13/2013 03:38 PM, John Brunelle wrote:

Hello, We're having an issue with our distributed gluster filesystem:

* gluster 3.3.1 servers and clients
* distributed volume -- 69 bricks (4.6T each) split evenly across 3 nodes
* xfs backend
* nfs clients
* nfs.enable-ino32: On
* servers: CentOS 6.3, 2.6.32-279.14.1.el6.centos.plus.x86_64
* clients: CentOS 5.7, 2.6.18-274.12.1.el5

We have a directory containing 3,343 subdirectories. On some clients, ls lists only a subset of the directories (a different amount on different clients). On others, ls gets stuck in a getdents loop and consumes more and more memory until it hits ENOMEM. On yet others, it works fine. Having the bad clients remount or drop caches makes the problem temporarily go away, but eventually it comes back. The issue sounds a lot like bug #838784, but we are using xfs on the backend, and this seems like more of a client issue.

Turning on cluster.readdir-optimize can help readdir when a directory contains a large number of sub-directories and the volume has a large number of bricks. Do you observe any change with this option enabled? -Vijay

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] incomplete listing of a directory, sometimes getdents loops until out of memory
On 06/13/2013 03:38 PM, John Brunelle wrote:

We have a directory containing 3,343 subdirectories. On some clients, ls lists only a subset of the directories (a different amount on different clients). On others, ls gets stuck in a getdents loop and consumes more and more memory until it hits ENOMEM. On yet others, it works fine. Having the bad clients remount or drop caches makes the problem temporarily go away, but eventually it comes back. The issue sounds a lot like bug #838784, but we are using xfs on the backend, and this seems like more of a client issue.

The fact that drop_caches makes it go away temporarily suggests to me that something's going on in FUSE. The reference to #838784 might also be significant even though you're not using ext4. Even the fix for that still makes some assumptions about how certain directory-entry fields are used and might still be sensitive to changes in that usage by the local FS or by FUSE. That might explain both skipping and looping, as you say you've seen. Would it be possible for you to compile and run the attached program on one of the affected directories so we can see what d_off values are involved?

But we are also getting some page allocation failures on the server side, e.g. the stack trace below. These are nearly identical to bug #842206 and bug #767127. I'm trying to sort out if these are related to the above issue or just recoverable nic driver GFP_ATOMIC kmalloc failures as suggested in the comments. Slab allocations for dentry, xfs_inode, fuse_inode, fuse_request, etc. are all at ~100% active, and the total number appears to be monotonically growing. Overall memory looks healthy (2/3 is buffers/cache, almost no swap is used). I'd need some help to determine whether the memory is overly fragmented or not, but looking at pagetypeinfo and zoneinfo it doesn't appear so to me, and the failures are order:1 anyway.

This one definitely seems like one of those innocent-victim kind of things where the real problem is in the network code and we just happen to be the app that's running.

#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>

int main (int argc, char **argv)
{
        DIR             *fd;
        struct dirent   *ent;
        int              counter = 0;

        fd = opendir(argv[1]);
        if (!fd) {
                perror("opendir");
                return !0;
        }

        for (;;) {
                ent = readdir(fd);
                if (!ent) {
                        break;
                }
                /* print d_off, d_type and the entry name */
                printf("0x%016llx %d %.*s\n",
                       (unsigned long long)ent->d_off, ent->d_type,
                       (int)sizeof(ent->d_name), ent->d_name);
                if (++counter > 10) {
                        break;
                }
        }

        return 0;
}

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
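For anyone wanting to repeat the experiment, compiling and running the attached program against an affected directory on each brick might look like the following (the paths and output file names are placeholders, not from the original thread):

    gcc -Wall -o readdir readdir.c
    ./readdir /path/to/brick/affected-directory > readdir.out.brick01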
Re: [Gluster-users] Quickest way to delete many small files
Be careful, I had made a big mistake here. 'mktemp -d' creates a directory in /tmp, which is usually another file system, so the 'mv' will effectively become a 'cp'. ${tempdirname} should therefore be in the same file system as dir. Regards, Pablo.

On 12/06/2013 08:16 p.m., Liam Slusser wrote:

So combining the two approaches I think that this may be a better solution?

tempdirname=`mktemp -d`
mv dir $tempdirname
mkdir dir
# rm -rf $tempdirname
mkdir empty
rsync -a --delete empty/ $tempdirname
rmdir empty $tempdirname

Regards, Pablo.

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
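A minimal sketch of Pablo's correction: keep the temporary directory on the same file system as the directory being deleted by using mktemp's -p option (paths are placeholders; this assumes a GNU mktemp):

    # create the temp dir next to "dir", i.e. on the same file system
    tempdirname=`mktemp -d -p .`
    mv dir "$tempdirname"
    mkdir dir

    # then empty and remove it with rsync, as in the original suggestion
    mkdir empty
    rsync -a --delete empty/ "$tempdirname"/
    rmdir empty "$tempdirname"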
Re: [Gluster-users] incomplete listing of a directory, sometimes getdents loops until out of memory
On Fri, Jun 14, 2013 at 10:04 AM, John Brunelle john_brune...@harvard.edu wrote:

Thanks, Jeff! I ran readdir.c on all 23 bricks on the gluster nfs server to which my test clients are connected (one client that's working, and one that's not; and I ran it on those, too). The results are attached. The values it prints are all well within 32 bits, *except* for one that's suspiciously the max 32-bit signed int:

$ cat readdir.out.* | awk '{print $1}' | sort | uniq | tail
0xfd59
0xfd6b
0xfd7d
0xfd8f
0xfda1
0xfdb3
0xfdc5
0xfdd7
0xfde8
0x7fffffff

That outlier is the same subdirectory on all 23 bricks. Could this be the issue? Thanks, John

0x7fffffff is the EOF marker. You should find it as the last entry in _every_ directory. Avati

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] incomplete listing of a directory, sometimes getdents loops until out of memory
Ah, I did not know that about 0x7fffffff. Is it of note that the clients do *not* get this? This is on an NFS mount, and the volume has nfs.enable-ino32 On. (I should've pointed that out again when Jeff mentioned FUSE.) Side note -- we do have a couple FUSE mounts, too, and I had not seen this issue on any of them before, but when I checked just now, no subdirectories were listed on some of them. Since I had only seen this on NFS clients before, and it appeared on FUSE only after setting cluster.readdir-optimize to On, I have now set that back to Off. The FUSE mounts are now behaving fine again. Thanks, John

On Fri, Jun 14, 2013 at 2:17 PM, Anand Avati anand.av...@gmail.com wrote:

Are the ls commands (which list partially, or loop and die of ENOMEM eventually) executed on an NFS mount or a FUSE mount? Or does it happen on both? Avati

On Fri, Jun 14, 2013 at 11:14 AM, Anand Avati anand.av...@gmail.com wrote:

On Fri, Jun 14, 2013 at 10:04 AM, John Brunelle john_brune...@harvard.edu wrote:

Thanks, Jeff! I ran readdir.c on all 23 bricks on the gluster nfs server to which my test clients are connected (one client that's working, and one that's not; and I ran it on those, too). The results are attached. The values it prints are all well within 32 bits, *except* for one that's suspiciously the max 32-bit signed int:

$ cat readdir.out.* | awk '{print $1}' | sort | uniq | tail
0xfd59
0xfd6b
0xfd7d
0xfd8f
0xfda1
0xfdb3
0xfdc5
0xfdd7
0xfde8
0x7fffffff

That outlier is the same subdirectory on all 23 bricks. Could this be the issue? Thanks, John

0x7fffffff is the EOF marker. You should find it as the last entry in _every_ directory. Avati

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] 40 gig ethernet
I'm using 40G Infiniband with IPoIB for gluster. Here are some ping times (from host 172.16.1.10):

[root@node0.cloud ~]# ping -c 10 172.16.1.11
PING 172.16.1.11 (172.16.1.11) 56(84) bytes of data.
64 bytes from 172.16.1.11: icmp_seq=1 ttl=64 time=0.093 ms
64 bytes from 172.16.1.11: icmp_seq=2 ttl=64 time=0.113 ms
64 bytes from 172.16.1.11: icmp_seq=3 ttl=64 time=0.163 ms
64 bytes from 172.16.1.11: icmp_seq=4 ttl=64 time=0.125 ms
64 bytes from 172.16.1.11: icmp_seq=5 ttl=64 time=0.125 ms
64 bytes from 172.16.1.11: icmp_seq=6 ttl=64 time=0.125 ms
64 bytes from 172.16.1.11: icmp_seq=7 ttl=64 time=0.198 ms
64 bytes from 172.16.1.11: icmp_seq=8 ttl=64 time=0.171 ms
64 bytes from 172.16.1.11: icmp_seq=9 ttl=64 time=0.194 ms
64 bytes from 172.16.1.11: icmp_seq=10 ttl=64 time=0.115 ms

--- 172.16.1.11 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 8999ms
rtt min/avg/max/mdev = 0.093/0.142/0.198/0.035 ms

On Fri, Jun 14, 2013 at 7:03 AM, Nathan Stratton nat...@robotics.net wrote:

I have been playing around with Gluster on and off for the last 6 years or so. Most of the things that have kept me from using it have been related to latency. In the past I have been using 10 gig infiniband or 10 gig ethernet; recently the price of 40 gig ethernet has fallen quite a bit with vendors like Arista. My question is: is this worth it at all for something like Gluster? The port-to-port latency looks impressive at under 4 microseconds, but I don't yet know what total system-to-system latency would look like assuming QSFP+ copper cables and the linux stack.

-- Nathan Stratton Founder, CTO Exario Networks, Inc. nathan at robotics.net nathan at exarionetworks.com http://www.robotics.net http://www.exarionetworks.com/ Building the WebRTC solutions today that your customers will demand tomorrow.

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] 40 gig ethernet
On Fri, 14 Jun 2013 12:13:53 -0700 Bryan Whitehead dri...@megahappy.net wrote:

I'm using 40G Infiniband with IPoIB for gluster. Here are some ping times (from host 172.16.1.10):

[root@node0.cloud ~]# ping -c 10 172.16.1.11
PING 172.16.1.11 (172.16.1.11) 56(84) bytes of data.
64 bytes from 172.16.1.11: icmp_seq=1 ttl=64 time=0.093 ms
64 bytes from 172.16.1.11: icmp_seq=2 ttl=64 time=0.113 ms
64 bytes from 172.16.1.11: icmp_seq=3 ttl=64 time=0.163 ms
64 bytes from 172.16.1.11: icmp_seq=4 ttl=64 time=0.125 ms
64 bytes from 172.16.1.11: icmp_seq=5 ttl=64 time=0.125 ms
64 bytes from 172.16.1.11: icmp_seq=6 ttl=64 time=0.125 ms
64 bytes from 172.16.1.11: icmp_seq=7 ttl=64 time=0.198 ms
64 bytes from 172.16.1.11: icmp_seq=8 ttl=64 time=0.171 ms
64 bytes from 172.16.1.11: icmp_seq=9 ttl=64 time=0.194 ms
64 bytes from 172.16.1.11: icmp_seq=10 ttl=64 time=0.115 ms

--- 172.16.1.11 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 8999ms
rtt min/avg/max/mdev = 0.093/0.142/0.198/0.035 ms

What you're saying, then, is that there is no significant difference compared to GigE, right? Anyone got a ping between two kvm-qemu virtio-net cards at hand?

-- Regards, Stephan

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] incomplete listing of a directory, sometimes getdents loops until out of memory
On 06/14/2013 01:04 PM, John Brunelle wrote:

Thanks, Jeff! I ran readdir.c on all 23 bricks on the gluster nfs server to which my test clients are connected (one client that's working, and one that's not; and I ran it on those, too). The results are attached. The values it prints are all well within 32 bits, *except* for one that's suspiciously the max 32-bit signed int:

$ cat readdir.out.* | awk '{print $1}' | sort | uniq | tail
0xfd59
0xfd6b
0xfd7d
0xfd8f
0xfda1
0xfdb3
0xfdc5
0xfdd7
0xfde8
0x7fffffff

That outlier is the same subdirectory on all 23 bricks. Could this be the issue?

As Avati points out, that's the EOF marker, so it's fine. It might be interesting to run this on the client as well, to see how those values relate to the ones on the bricks - especially at the point where you see skipping or looping. You might also want to modify it to print out d_name as well, so that you can see the file names and correlate things more easily.

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
[Gluster-users] Distributed-replicate across four bricks with two hosts
Hi, I created a distributed-replicated volume across two servers with two bricks each. Here is the configuration:

Volume Name: slice1
Type: Distributed-Replicate
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: elkpinfkvm05-bus:/srv/brick1
Brick2: elkpinfkvm05-bus:/srv/brick2
Brick3: elkpinfkvm06-bus:/srv/brick1
Brick4: elkpinfkvm06-bus:/srv/brick2

So naturally the question comes up: how do I know the replication is happening across the hosts, and not between two bricks on the same host? If one of the hosts goes down, I want to make sure the storage is still available. Thanks, Ziemowit

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] 40 gig ethernet
GigE is slower. Here is a ping from the same boxes but using the 1GigE cards:

[root@node0.cloud ~]# ping -c 10 10.100.0.11
PING 10.100.0.11 (10.100.0.11) 56(84) bytes of data.
64 bytes from 10.100.0.11: icmp_seq=1 ttl=64 time=0.628 ms
64 bytes from 10.100.0.11: icmp_seq=2 ttl=64 time=0.283 ms
64 bytes from 10.100.0.11: icmp_seq=3 ttl=64 time=0.307 ms
64 bytes from 10.100.0.11: icmp_seq=4 ttl=64 time=0.275 ms
64 bytes from 10.100.0.11: icmp_seq=5 ttl=64 time=0.313 ms
64 bytes from 10.100.0.11: icmp_seq=6 ttl=64 time=0.278 ms
64 bytes from 10.100.0.11: icmp_seq=7 ttl=64 time=0.309 ms
64 bytes from 10.100.0.11: icmp_seq=8 ttl=64 time=0.197 ms
64 bytes from 10.100.0.11: icmp_seq=9 ttl=64 time=0.267 ms
64 bytes from 10.100.0.11: icmp_seq=10 ttl=64 time=0.187 ms

--- 10.100.0.11 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9000ms
rtt min/avg/max/mdev = 0.187/0.304/0.628/0.116 ms

Note: the Infiniband interfaces carry a constant load of traffic from glusterfs; the NICs comparatively have very little traffic.

On Fri, Jun 14, 2013 at 12:40 PM, Stephan von Krawczynski sk...@ithnet.com wrote:

On Fri, 14 Jun 2013 12:13:53 -0700 Bryan Whitehead dri...@megahappy.net wrote:

I'm using 40G Infiniband with IPoIB for gluster. Here are some ping times (from host 172.16.1.10):

[root@node0.cloud ~]# ping -c 10 172.16.1.11
PING 172.16.1.11 (172.16.1.11) 56(84) bytes of data.
64 bytes from 172.16.1.11: icmp_seq=1 ttl=64 time=0.093 ms
64 bytes from 172.16.1.11: icmp_seq=2 ttl=64 time=0.113 ms
64 bytes from 172.16.1.11: icmp_seq=3 ttl=64 time=0.163 ms
64 bytes from 172.16.1.11: icmp_seq=4 ttl=64 time=0.125 ms
64 bytes from 172.16.1.11: icmp_seq=5 ttl=64 time=0.125 ms
64 bytes from 172.16.1.11: icmp_seq=6 ttl=64 time=0.125 ms
64 bytes from 172.16.1.11: icmp_seq=7 ttl=64 time=0.198 ms
64 bytes from 172.16.1.11: icmp_seq=8 ttl=64 time=0.171 ms
64 bytes from 172.16.1.11: icmp_seq=9 ttl=64 time=0.194 ms
64 bytes from 172.16.1.11: icmp_seq=10 ttl=64 time=0.115 ms

--- 172.16.1.11 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 8999ms
rtt min/avg/max/mdev = 0.093/0.142/0.198/0.035 ms

What you're saying, then, is that there is no significant difference compared to GigE, right? Anyone got a ping between two kvm-qemu virtio-net cards at hand? -- Regards, Stephan

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] Distributed-replicate across four bricks with two hosts
On Fri, Jun 14, 2013 at 4:35 PM, Ziemowit Pierzycki ziemo...@pierzycki.com wrote:

Hi, I created a distributed-replicated volume across two servers with two bricks each. Here is the configuration:

Volume Name: slice1
Type: Distributed-Replicate
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: elkpinfkvm05-bus:/srv/brick1
Brick2: elkpinfkvm05-bus:/srv/brick2
Brick3: elkpinfkvm06-bus:/srv/brick1
Brick4: elkpinfkvm06-bus:/srv/brick2

So naturally the question comes up: how do I know the replication is happening across the hosts, and not between two bricks on the same host? If one of the hosts goes down, I want to make sure the storage is still available. Thanks, Ziemowit

Test it, of course, by taking down different nodes!

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] Distributed-replicate across four bricks with two hosts
On 06/14/13 13:35, Ziemowit Pierzycki wrote:

Hi, I created a distributed-replicated volume across two servers with two bricks each. Here is the configuration:

Volume Name: slice1
Type: Distributed-Replicate
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: elkpinfkvm05-bus:/srv/brick1
Brick2: elkpinfkvm05-bus:/srv/brick2
Brick3: elkpinfkvm06-bus:/srv/brick1
Brick4: elkpinfkvm06-bus:/srv/brick2

So naturally the question comes up: how do I know the replication is happening across the hosts, and not between two bricks on the same host? If one of the hosts goes down, I want to make sure the storage is still available.

Replica sets are formed in the order in which the bricks are added to the volume. You have the replica set size set to 2, which means every two bricks listed, in order, are mirrored:

Brick1 + Brick2 = Replica 1
Brick3 + Brick4 = Replica 2

The distribution layer (DHT) then distributes files over Replica 1 and Replica 2. So you do have an issue here: both bricks of each replica set are on the same host.

-- Mr. Flibble King of the Potato People

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
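To get cross-host mirroring with the same hardware, the bricks would need to be listed so that each consecutive pair spans both servers. A sketch of what creating the volume with that ordering might look like (hostnames and paths are taken from the post above; recreating an existing volume means moving the data off it first, so treat this purely as an illustration):

    gluster volume create slice1 replica 2 transport tcp \
        elkpinfkvm05-bus:/srv/brick1 elkpinfkvm06-bus:/srv/brick1 \
        elkpinfkvm05-bus:/srv/brick2 elkpinfkvm06-bus:/srv/brick2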