Re: [Gluster-users] Throughput over infiniband
Corey,

Make sure to test with direct I/O, otherwise caching can give you unrealistic expectations of your actual throughput. Typically, using the IPoIB driver is not recommended with InfiniBand, since you introduce unnecessary overhead via TCP. Knowing how you have Gluster configured is also essential to understanding whether any metrics you get from testing are within expectations; including the output of `gluster volume info` is an essential piece of information.

Thanks,
Eco

On Fri, Sep 7, 2012 at 1:45 AM, Corey Kovacs wrote:
> Folks,
>
> I finally got my hands on a 4x FDR (56Gb) InfiniBand switch and 4 cards to do
> some testing of GlusterFS over that interface.
>
> So far, I am not getting the throughput I _think_ I should see.
>
> My config is made up of:
>
> 4 dl360-g8's (three bricks and one client)
> 4 4xFDR, dual-port IB cards (one port configured in each card per host)
> 1 4xFDR 36-port Mellanox switch (managed and configured)
> GlusterFS 3.2.6
> RHEL 6.3
>
> I have tested the IB cards and get about 6GB/sec between hosts over raw IB. Using
> IPoIB, I can get about 22Gb/sec. Not too shabby for a first go, but I expected
> more (cards are in connected mode with an MTU of 64k).
>
> My raw speed to the disks (through the buffer cache... I just realized I've
> not tested direct-mode I/O, I'll do that later today) is about 800MB/sec. I
> expect to see on the order of 2GB/sec (a little less than 3x800).
>
> When I write a large stream using dd and watch the bricks' I/O, I see
> ~800MB/sec on each one, but at the end of the test the report from dd
> indicates 800MB/sec.
>
> Am I missing something fundamental?
>
> Any pointers would be appreciated,
>
> Thanks!
>
> Corey
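[Editor's note: a minimal sketch of the direct-I/O test Eco suggests, with the page cache bypassed and the volume layout captured alongside the numbers. The mount point and volume name are placeholders:

    # write test bypassing the client-side page cache (direct I/O)
    dd if=/dev/zero of=/mnt/glustervol/ddtest bs=1M count=10000 oflag=direct
    # read it back, again without the cache
    dd if=/mnt/glustervol/ddtest of=/dev/null bs=1M iflag=direct
    # volume layout and options - worth including with any benchmark numbers
    gluster volume info testvol
]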
Re: [Gluster-users] Throughput over infiniband
Everyone,

This is a response to the issue of NFS vs. GlusterFS performance, as I think some of this information may be useful here and has not been discussed. For the sake of clarity, I do not run InfiniBand, but I am running 10GbE. My normal production speeds sit around 600MB/s to 700MB/s with the native Gluster client. My setup has 12 servers, each with a single 24-disk SATA RAID 5 brick of 10TB, in a 6x2 Gluster configuration (a sketch of this kind of layout appears after this message). Before I settled on this setup I ran extensive tests over about 6 weeks to confirm it; in my case the native GlusterFS client considerably outperformed NFS in aggregate data transfers. I also found that the peak performance of the GlusterFS client in my setup was at about 12 servers: this distributed the write and read loads very well, and beyond 12 servers adding more produces diminishing returns.

I bring this up because no one has been talking about how the brick layout may be affecting performance, or about the number of servers hosting bricks. Putting multiple bricks on a single server does not increase load capacity anywhere near as much as adding another server with the additional brick.

The point I wanted to make is that you need to look at all sides of your setup in order to get the best performance. In my case this involved evaluating the RAID setup (tuning the block size), the file system used for the brick on the RAID (and tuning it, specifically for the size and types of files being manipulated), the memory and CPU in the servers, the network bandwidth, and the client access method. I had to look at all of these (and each of them had an impact on the final performance numbers) before I found my best setup. I do not think you can just unilaterally dismiss the Gluster setup until you have done a COMPLETE analysis of how best to set up your environment.

Just sharing my thoughts: when I first set up Gluster I thought I could just install it, tweak the options, and be good to go, but once I understood everything it depends on and addressed all of those options and tuning as well, I significantly improved my overall performance to well over what I was able to achieve with NFS.

Feel free to shoot me comments or questions.

Bryan Washer

-----Original Message-----
From: gluster-users-boun...@gluster.org [mailto:gluster-users-boun...@gluster.org] On Behalf Of Fernando Frediani (Qube)
Sent: Monday, September 10, 2012 8:14 AM
To: 'Stephan von Krawczynski'; 'Whit Blauvelt'
Cc: 'gluster-users@gluster.org'; 'Brian Candler'
Subject: Re: [Gluster-users] Throughput over infiniband

Well, I would say there is a reason, if the Gluster client performed as expected. Using the Gluster client, it should in theory access the file(s) directly from the nodes where they reside, rather than going through a single node exporting the NFS folder which would then have to gather the file. Yes, NFS has all the caching, but if the Gluster client behaviour were similar it should be able to get similar performance, which doesn't seem to be what has been reported. I did tests myself with the Gluster client and NFS; NFS got better performance as well, and I believe this is due to the caching.
Fernando

-----Original Message-----
From: gluster-users-boun...@gluster.org [mailto:gluster-users-boun...@gluster.org] On Behalf Of Stephan von Krawczynski
Sent: 10 September 2012 13:57
To: Whit Blauvelt
Cc: gluster-users@gluster.org; Brian Candler
Subject: Re: [Gluster-users] Throughput over infiniband

On Mon, 10 Sep 2012 08:06:51 -0400 Whit Blauvelt wrote:
> On Mon, Sep 10, 2012 at 11:13:11AM +0200, Stephan von Krawczynski wrote:
> > [...]
> > If you're lucky you reach something like 1/3 of the NFS performance.
> [Gluster NFS Client]
> Whit

There is a reason why one would switch from NFS to GlusterFS, and mostly it is redundancy. If you start using an NFS-client setup you cut yourself off from the "complete solution". As said elsewhere, you can just as well export GlusterFS via the kernel NFS server, but honestly, that is a patch. It would be far better if things were done right: a native GlusterFS client in kernel space. And remember, generally there should be no big difference between NFS and GlusterFS with bricks spread over several networks - if it is done how it should be, without userspace.

--
MfG, Stephan
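[Editor's note: a minimal sketch of the 12-server, 6x2 distributed-replicated layout Bryan describes above. Hostnames and brick paths are made up for illustration; with `replica 2`, consecutive bricks in the list form the replica pairs, giving 6 distribute subvolumes of 2-way replicas:

    gluster volume create bigvol replica 2 \
        srv01:/bricks/b1 srv02:/bricks/b1 \
        srv03:/bricks/b1 srv04:/bricks/b1 \
        srv05:/bricks/b1 srv06:/bricks/b1 \
        srv07:/bricks/b1 srv08:/bricks/b1 \
        srv09:/bricks/b1 srv10:/bricks/b1 \
        srv11:/bricks/b1 srv12:/bricks/b1
    gluster volume start bigvol
]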
Re: [Gluster-users] Throughput over infiniband
On 09/10/2012 08:56 AM, Stephan von Krawczynski wrote:
> On Mon, 10 Sep 2012 08:06:51 -0400 Whit Blauvelt wrote:
> > On Mon, Sep 10, 2012 at 11:13:11AM +0200, Stephan von Krawczynski wrote:
> > > [...]
> > > If you're lucky you reach something like 1/3 of the NFS performance.
> > [Gluster NFS Client]
> > Whit
>
> There is a reason why one would switch from NFS to GlusterFS, and mostly it is redundancy. If you start using an NFS-client setup you cut yourself off from the "complete solution". As said elsewhere, you can just as well export GlusterFS via the kernel NFS server, but honestly, that is a patch. It would be far better if things were done right: a native GlusterFS client in kernel space. And remember, generally there should be no big difference between NFS and GlusterFS with bricks spread over several networks - if it is done how it should be, without userspace.

Just to be clear, when you export a gluster volume via NFS, the clients are using kernel NFS; the Gluster NFS server is the only thing in user space. The redundancy you do lose is the automatic fail-over to the other servers if the NFS server the client mounted from fails. If you're using replication, you do not lose that when you choose to use NFS.

--
Kaleb
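[Editor's note: for reference, the two access paths Kaleb contrasts look roughly like the sketch below; the server name, volume name, and mount points are placeholders, and Gluster's built-in NFS server speaks NFSv3 over TCP:

    # native FUSE client mount
    mount -t glusterfs server1:/testvol /mnt/gluster
    # NFS mount of the same volume via the Gluster NFS server
    # (user space on the server, kernel NFS client on this host)
    mount -t nfs -o vers=3,tcp server1:/testvol /mnt/gluster-nfs
]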
Re: [Gluster-users] Throughput over infiniband
Well, I would say there is a reason, if the Gluster client performed as expected. Using the Gluster client, it should in theory access the file(s) directly from the nodes where they reside, rather than going through a single node exporting the NFS folder which would then have to gather the file. Yes, NFS has all the caching, but if the Gluster client behaviour were similar it should be able to get similar performance, which doesn't seem to be what has been reported. I did tests myself with the Gluster client and NFS; NFS got better performance as well, and I believe this is due to the caching.

Fernando

-----Original Message-----
From: gluster-users-boun...@gluster.org [mailto:gluster-users-boun...@gluster.org] On Behalf Of Stephan von Krawczynski
Sent: 10 September 2012 13:57
To: Whit Blauvelt
Cc: gluster-users@gluster.org; Brian Candler
Subject: Re: [Gluster-users] Throughput over infiniband

On Mon, 10 Sep 2012 08:06:51 -0400 Whit Blauvelt wrote:
> On Mon, Sep 10, 2012 at 11:13:11AM +0200, Stephan von Krawczynski wrote:
> > [...]
> > If you're lucky you reach something like 1/3 of the NFS performance.
> [Gluster NFS Client]
> Whit

There is a reason why one would switch from NFS to GlusterFS, and mostly it is redundancy. If you start using an NFS-client setup you cut yourself off from the "complete solution". As said elsewhere, you can just as well export GlusterFS via the kernel NFS server, but honestly, that is a patch. It would be far better if things were done right: a native GlusterFS client in kernel space. And remember, generally there should be no big difference between NFS and GlusterFS with bricks spread over several networks - if it is done how it should be, without userspace.

--
MfG, Stephan
Re: [Gluster-users] Throughput over infiniband
On Mon, 10 Sep 2012 08:06:51 -0400 Whit Blauvelt wrote:
> On Mon, Sep 10, 2012 at 11:13:11AM +0200, Stephan von Krawczynski wrote:
> > [...]
> > If you're lucky you reach something like 1/3 of the NFS performance.
> [Gluster NFS Client]
> Whit

There is a reason why one would switch from NFS to GlusterFS, and mostly it is redundancy. If you start using an NFS-client setup you cut yourself off from the "complete solution". As said elsewhere, you can just as well export GlusterFS via the kernel NFS server, but honestly, that is a patch. It would be far better if things were done right: a native GlusterFS client in kernel space. And remember, generally there should be no big difference between NFS and GlusterFS with bricks spread over several networks - if it is done how it should be, without userspace.

--
MfG, Stephan
Re: [Gluster-users] Throughput over infiniband
On Mon, Sep 10, 2012 at 11:13:11AM +0200, Stephan von Krawczynski wrote:
> If you have small files you are busted, if you have workload on the clients
> you are busted, and if you have lots of concurrent FS action on the client you
> are busted. Which leaves you with test cases nowhere near real life.
> I replaced NFS servers with glusterfs and I know what's going on in these
> setups afterwards. If you're lucky you reach something like 1/3 of the NFS
> performance.

This is an informative debate. But I wonder, Stephan, how much of the bustedness you report is avoided by simply using NFS clients with Gluster.

For my own purposes, Gluster (3.1.4) has performed well in production, and it's hardly an optimal arrangement. For instance, there's a public-facing FTP/SFTP server NFS-mounting mirrored Gluster servers over a gigabit LAN. The FTP/SFTP server also re-exports the NFS Gluster shares via Samba to a medium-sized office. Numerous files of a wide range of sizes continuously go in and out of this system from every direction. We haven't always run this way - it used to be the same setup, but with local storage on the FTP/SFTP/Samba server. Yet nobody since the switch to Gluster has so much as commented on the speed, and half of our staff are programmers who are not shy about critiquing system performance when they see it lagging.

Now, when I tested Gluster for KVM - yeah, that doesn't work. And there are going to be environments in which file servers are far more stressed than ours. But since part of the slowness of Gluster for you is about the Gluster clients rather than the servers - well, I'd like to see the NFS-client-to-Gluster test as a comparison. Some here have reported it's faster. It's certainly fast enough in my own use case.

Whit
Re: [Gluster-users] Throughput over infiniband
On Mon, 10 Sep 2012 09:44:26 +0100 Brian Candler wrote:
> On Mon, Sep 10, 2012 at 10:03:14AM +0200, Stephan von Krawczynski wrote:
> > > Yes - so in workloads where you have many concurrent clients, this isn't a
> > > problem. It's only a problem if you have a single client doing a lot of
> > > sequential operations.
> >
> > That is not correct for most cases. GlusterFS always has a problem on clients
> > with high workloads. This obviously derives from the fact that the FS is
> > userspace-based. If other userspace applications eat lots of CPU, your FS comes
> > to a crawl.
>
> It's only "obvious" if your application is CPU-bound, rather than I/O-bound.

I think one can set aside the 5% of the market that uses storage only for storing _big_ files from client boxes with zero load. That is about the only case where GlusterFS works OK, if you don't mind the throughput problem of FUSE at high rates. If you have small files you are busted, if you have workload on the clients you are busted, and if you have lots of concurrent FS action on the client you are busted. Which leaves you with test cases nowhere near real life. I replaced NFS servers with glusterfs and I know what's going on in these setups afterwards. If you're lucky you reach something like 1/3 of the NFS performance.

--
Regards, Stephan
Re: [Gluster-users] Throughput over infiniband
On Mon, Sep 10, 2012 at 10:03:14AM +0200, Stephan von Krawczynski wrote:
> > Yes - so in workloads where you have many concurrent clients, this isn't a
> > problem. It's only a problem if you have a single client doing a lot of
> > sequential operations.
>
> That is not correct for most cases. GlusterFS always has a problem on clients
> with high workloads. This obviously derives from the fact that the FS is
> userspace-based. If other userspace applications eat lots of CPU, your FS comes
> to a crawl.

It's only "obvious" if your application is CPU-bound, rather than I/O-bound.

> > [...]
> > Have you tried doing exactly the same test but over NFS? I didn't see that
> > in your posting (you only mentioned NFS in the context of KVM)
>
> And as I said above, NFS (kernel version) has no problem at all in these
> scenarios.

I think the OP needs to test the specific workload - I think it was iozone - using NFS. I saw figures for iozone local access to disk, and iozone over glusterfs, but not iozone over NFS.

Regards,
Brian.
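[Editor's note: the comparison Brian asks for would just be the same iozone run against two mounts of the same volume; the mount points below are placeholders and the flags mirror the iozone invocation Andrei quotes later in the thread:

    # same volume, native FUSE client vs. the NFS export
    cd /mnt/gluster-fuse && iozone -+u -t 2 -F f1 f2 -r 2048 -s 30G
    cd /mnt/gluster-nfs  && iozone -+u -t 2 -F f1 f2 -r 2048 -s 30G
]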
Re: [Gluster-users] Throughput over infiniband
On Mon, 10 Sep 2012 08:48:03 +0100 Brian Candler wrote:
> On Sun, Sep 09, 2012 at 09:28:47PM +0100, Andrei Mikhailovsky wrote:
> > While trying to figure out the cause of the bottleneck I've realised
> > that the bottleneck is coming from the client side, as running a
> > concurrent test from two clients would give me about 650MB/s per
> > client.
>
> Yes - so in workloads where you have many concurrent clients, this isn't a
> problem. It's only a problem if you have a single client doing a lot of
> sequential operations.

That is not correct for most cases. GlusterFS always has a problem on clients with high workloads. This obviously derives from the fact that the FS is userspace-based. If other userspace applications eat lots of CPU, your FS comes to a crawl.

> [...]
> Have you tried doing exactly the same test but over NFS? I didn't see that
> in your posting (you only mentioned NFS in the context of KVM)

And as I said above, NFS (kernel version) has no problem at all in these scenarios. It does not have the GlusterFS problems with multiple concurrent FS actions on the same client either, nor is there a problem with maximum bandwidth.

--
Regards, Stephan
Re: [Gluster-users] Throughput over infiniband
On Sun, Sep 09, 2012 at 09:28:47PM +0100, Andrei Mikhailovsky wrote:
> While trying to figure out the cause of the bottleneck I've realised
> that the bottleneck is coming from the client side, as running a
> concurrent test from two clients would give me about 650MB/s per
> client.

Yes - so in workloads where you have many concurrent clients, this isn't a problem. It's only a problem if you have a single client doing a lot of sequential operations.

My guess would be it's something to do with latency: i.e. the client sends a request and waits for the response before sending the next request. A random-read workload is the worst case, and this is a "laws of physics" thing. Consider:

- Client issues request to read file A
- Request gets transferred over network
- Fileserver issues seek/read
- Response gets transferred back over network
- Client issues request to read file B
- ... etc

If the client is written to issue only one request at a time then there's no way to optimise this - the server cannot guess in advance what the next read will be.

Have you tried doing exactly the same test but over NFS? I didn't see that in your posting (you only mentioned NFS in the context of KVM).

When you are doing lots of writes you should be able to pipeline the data. The difference is, with a local filesystem the response is essentially instant, and writing data is just stuffing dirty blocks into the VFS cache. With a remote filesystem, when you open a file you have to wait for an OK response before you start writing. Again, you should compare this against NFS before writing off Gluster.

> P.S. If you are looking to use glusterfs as the backend storage for KVM
> virtualisation, I would warn you that it's a tricky business. I've
> managed to make things work, but the performance is far worse than any
> of my pessimistic expectations! An example - a mounted glusterfs-rdma
> file system on the server running KVM would give me around 700-850MB/s
> throughput. I was only getting 50MB/s max when doing the test from the
> VM stored on that partition.

Yes, this has been observed by lots of people. KVM block access mapped to FUSE file access mapped to Gluster doesn't perform well. However, some patches have been written for KVM to use the gluster protocol directly, and the performance is way, way better: KVM machines are just userland processes, and the I/O stays entirely in userland. I'm looking forward to these being incorporated into mainline KVM.

Regards,
Brian.
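[Editor's note: one rough way to see the request/response effect Brian describes is to repeat a single-stream direct-I/O write with different request sizes on the FUSE mount; if throughput scales with request size, the stream is round-trip-latency bound rather than bandwidth bound. Paths and sizes are illustrative only:

    # small requests: one round trip per 4 KiB, latency dominates (~400MB total)
    dd if=/dev/zero of=/mnt/glustervol/latency-test bs=4k count=100000 oflag=direct
    # large requests: the same number of bytes in far fewer round trips
    dd if=/dev/zero of=/mnt/glustervol/latency-test bs=1M count=400 oflag=direct
]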
Re: [Gluster-users] Throughput over infiniband
Ok, now you can see why I am talking about dropping the long-gone Unix versions (BSD/Solaris/name one) and concentrating on a Linux kernel module for glusterfs without the FUSE overhead. It is the _only_ way to make this project a really successful one. Everything happening now is just a project pre-test environment. And saying that openly is the reason why quite a few people dislike my comments... Please stop riding dead horses, guys.

--
Regards, Stephan
Re: [Gluster-users] Throughput over infiniband
Hi Corey,

Let me share the results of testing that I've been doing for the past 5 weeks or so. As in your experience, the results are nowhere near what I've been expecting. What a disappointment. Anyway, here we go.

I am using CentOS 6.3 with the latest updates and patches, using the latest QLogic OFED version 1.5.3.x; the QLogic drivers with OFED 1.5.4.x do not compile on CentOS 6.3. I've also tried vanilla OFED 1.5.4.1 and Mellanox OFED 1.5.3.x with pretty much similar results. I've been testing GlusterFS 3.2.7 and 3.3.0; there is no significant performance difference between the 3.2 and 3.3 branches.

My hardware is one storage server with a Mellanox QDR dual-port card, two server nodes with QLogic dual-port mezzanine cards, and a QLogic QDR (HP) blade switch. The storage server uses ZFS made of 4 stripes of 2-disk mirrors + a 240GB SSD for ZIL + a 240GB SSD for L2ARC cache. I also enabled compression and deduplication. Underlying ZFS performance using iozone tests (iozone -+u -t 2 -F f1 f2 -r 2048 -s 30G) is between 4GB/s and 10GB/s depending on the test levels. InfiniBand fabric tests using RDMA were giving between 3 and 4 GB/s. Please note GigaBytes, NOT GigaBits, per second. So I was expecting a throughput of around 2.5 - 3 GB/s over glusterfs rdma, taking overheads into account. Yeah, right - wishful thinking it was!!!

I built my PoC environment and started testing with just one client, and I've been getting around 400-600MB/s tops. Writes were about 20% faster than reads. Following some performance tuning on the glusterfs and ZFS side I've managed to increase throughput to around 700-800MB/s, with writes still being about 20% faster. Note that when adding the "-o" switch to the iozone command to use synchronised writes, write throughput was limited to the ZIL SSD speed.

While trying to figure out the cause of the bottleneck I realised that it is coming from the client side, as running a concurrent test from two clients would give me about 650MB/s per client (a minimal two-client check of this is sketched after this message). Doing a bit more research, it seems that the cause of the problem is FUSE. Googling for this issue I found a number of people complaining about a FUSE throughput limit of around 600-700MB/s. There is a kernel patch to address this issue, but the results of testing from several people showed only a marginal increase in performance: they managed to increase their throughput from around 600MB/s to about 850MB/s or so. Thus, from what I've read, it is currently not possible to achieve speeds over 1GB/s with FUSE. This made me wonder about the reason behind choosing FUSE in the first place for the client side of glusterfs.

P.S. If you are looking to use glusterfs as the backend storage for KVM virtualisation, I would warn you that it's a tricky business. I've managed to make things work, but the performance is far worse than any of my pessimistic expectations! An example - a mounted glusterfs-rdma file system on the server running KVM would give me around 700-850MB/s throughput, but I was only getting 50MB/s max when doing the test from the VM stored on that partition. In comparison, NFS would give me around 350-400MB/s. I never expected glusterfs to perform worse than NFS.

I would be grateful if anyone would share their experience with glusterfs over InfiniBand and their tips on improving performance.
Cheers,
Andrei

----- Original Message -----
From: "Corey Kovacs"
To: gluster-users@gluster.org
Sent: Friday, 7 September, 2012 2:45:48 PM
Subject: [Gluster-users] Throughput over infiniband

Folks,

I finally got my hands on a 4x FDR (56Gb) InfiniBand switch and 4 cards to do some testing of GlusterFS over that interface. So far, I am not getting the throughput I _think_ I should see.

My config is made up of:

4 dl360-g8's (three bricks and one client)
4 4xFDR, dual-port IB cards (one port configured in each card per host)
1 4xFDR 36-port Mellanox switch (managed and configured)
GlusterFS 3.2.6
RHEL 6.3

I have tested the IB cards and get about 6GB/sec between hosts over raw IB. Using IPoIB, I can get about 22Gb/sec. Not too shabby for a first go, but I expected more (cards are in connected mode with an MTU of 64k).

My raw speed to the disks (through the buffer cache... I just realized I've not tested direct-mode I/O, I'll do that later today) is about 800MB/sec. I expect to see on the order of 2GB/sec (a little less than 3x800).

When I write a large stream using dd and watch the bricks' I/O, I see ~800MB/sec on each one, but at the end of the test the report from dd indicates 800MB/sec.

Am I missing something fundamental? Any pointers would be appreciated.

Thanks!

Corey
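[Editor's note: a quick way to reproduce Andrei's observation that the limit is per client (FUSE-side) rather than on the servers is to run the same direct-I/O stream from two clients at once and compare each client's number with the single-client result. Hostnames and paths are placeholders:

    # run on both clients simultaneously, e.g. via ssh in the background
    ssh client1 'dd if=/dev/zero of=/mnt/glustervol/c1.bin bs=1M count=20000 oflag=direct' &
    ssh client2 'dd if=/dev/zero of=/mnt/glustervol/c2.bin bs=1M count=20000 oflag=direct' &
    wait
    # if each client still reports ~600-700MB/s, the bottleneck is per-client (FUSE),
    # not the servers or the fabric
]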
[Gluster-users] Throughput over infiniband
Folks,

I finally got my hands on a 4x FDR (56Gb) InfiniBand switch and 4 cards to do some testing of GlusterFS over that interface. So far, I am not getting the throughput I _think_ I should see.

My config is made up of:

4 dl360-g8's (three bricks and one client)
4 4xFDR, dual-port IB cards (one port configured in each card per host)
1 4xFDR 36-port Mellanox switch (managed and configured)
GlusterFS 3.2.6
RHEL 6.3

I have tested the IB cards and get about 6GB/sec between hosts over raw IB. Using IPoIB, I can get about 22Gb/sec. Not too shabby for a first go, but I expected more (cards are in connected mode with an MTU of 64k).

My raw speed to the disks (through the buffer cache... I just realized I've not tested direct-mode I/O, I'll do that later today) is about 800MB/sec. I expect to see on the order of 2GB/sec (a little less than 3x800).

When I write a large stream using dd and watch the bricks' I/O, I see ~800MB/sec on each one, but at the end of the test the report from dd indicates 800MB/sec.

Am I missing something fundamental? Any pointers would be appreciated.

Thanks!

Corey
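[Editor's note: for readers reproducing Corey's setup, the connected-mode and MTU settings he mentions can be checked with the usual IPoIB knobs, and the raw verb-level bandwidth with the perftest tools; ib0 is assumed to be the IPoIB interface and 65520 is the usual connected-mode maximum MTU:

    # confirm the IPoIB port is in connected mode (vs. datagram)
    cat /sys/class/net/ib0/mode
    # switch to connected mode and raise the MTU if needed
    echo connected > /sys/class/net/ib0/mode
    ip link set ib0 mtu 65520
    # raw RDMA bandwidth check between two hosts (perftest package)
    ib_write_bw            # on the first host
    ib_write_bw <server>   # on the second host, pointing at the first
]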