[Gluster-users] Write performance in a replicated/distributed setup with KVM?
This has probably been discussed before, but since I'm new on the list I hope you have patience with me.

I have a four-brick distributed/replicated setup. The computers are multi-core machines with 16 GB of memory and 2 x 2.0 TB SATA disks in RAID 1 locally. The nodes are connected by 1 Gbps Ethernet. All nodes have glusterfs 3.3beta2 installed and are running 64-bit Debian 6. The underlying filesystems are XFS.

I have set up a volume like so:

  gluster volume create virtuals replica 2 transport tcp \
    adraste:/data/brick alcippe:/data/brick aethra:/data/brick helen:/data/brick

Which resulted in a nice volume:

  # gluster volume info virtuals

  Volume Name: virtuals
  Type: Distributed-Replicate
  Status: Started
  Number of Bricks: 2 x 2 = 4
  Transport-type: tcp
  Bricks:
  Brick1: adraste:/data/brick
  Brick2: alcippe:/data/brick
  Brick3: aethra:/data/brick
  Brick4: helen:/data/brick

All seems OK so far, but write performance seems very slow. When writing to localhost:/virtuals I get single-digit MB/s, which isn't what I had expected. I know that each write has to go to at least two (?) nodes at the same time, but still? A single scp of a 1 GB file from one node to another gives something like ~100 MB/s.

A copy of a virtual image took 17 minutes:

  # time cp debtest.raw /gluster/debtest.img

  real    17m36.727s
  user    0m1.832s
  sys     0m14.081s

  # ls -lah /gluster/debtest.img
  -rw--- 1 root root 20G Mar 1 12:35 /gluster/debtest.img

  # du -ah /gluster/debtest.img
  4.5G    /gluster/debtest.img

I noticed that the process list shows direct-io-mode as disabled. Should it be on by default, or not?

Any help is really appreciated!

-- 
Harald Hannelius | harald.hannelius/a\arcada.fi | +358 50 594 1020
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
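A note on the cp timing above: ls reports 20G while du reports 4.5G, which suggests the image is sparse, so cp probably wrote roughly 4.5 GiB rather than 20 GiB. The sketch below shows one way to check that and to recompute the effective rate. It assumes GNU coreutils (du with --apparent-size and -B1) and awk; the function names are hypothetical and not from the thread.

```shell
# Hypothetical helpers; assume GNU du and a POSIX awk.

# Print the apparent size and the allocated size of a file, in bytes.
# A sparse file has allocated < apparent.
sparse_report() {
    apparent=$(du --apparent-size -B1 "$1" | cut -f1)
    allocated=$(du -B1 "$1" | cut -f1)
    echo "apparent=$apparent allocated=$allocated"
}

# Average rate in MiB/s for a byte count and a duration in seconds.
rate_mib() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1048576 }'
}

# 4.5 GiB allocated (4831838208 bytes), copied in 17m36.727s (1056.727 s):
rate_mib 4831838208 1056.727
```

If the allocated size is the right figure to use, the copy ran at only a few MiB/s, which matches the single-digit MB/s described in the message.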
Re: [Gluster-users] Write performance in a replicated/distributed setup with KVM?
On Fri, 2 Mar 2012, Bryan Whitehead wrote:

> I'd try putting all hostnames in /etc/hosts. Also, can you post ping
> times between each host?

They are in /etc/hosts.

  # ping6 -c3 alcippe
  PING alcippe(alcippe) 56 data bytes
  64 bytes from alcippe: icmp_seq=1 ttl=64 time=0.160 ms
  64 bytes from alcippe: icmp_seq=2 ttl=64 time=0.088 ms
  64 bytes from alcippe: icmp_seq=3 ttl=64 time=0.150 ms

  --- alcippe ping statistics ---
  3 packets transmitted, 3 received, 0% packet loss, time 1998ms
  rtt min/avg/max/mdev = 0.088/0.132/0.160/0.034 ms

  # ping6 -c3 aethra
  PING aethra(aethra.arcada.fi) 56 data bytes
  64 bytes from aethra.arcada.fi: icmp_seq=1 ttl=64 time=0.154 ms
  64 bytes from aethra.arcada.fi: icmp_seq=2 ttl=64 time=0.158 ms
  64 bytes from aethra.arcada.fi: icmp_seq=3 ttl=64 time=0.164 ms

  --- aethra ping statistics ---
  3 packets transmitted, 3 received, 0% packet loss, time 1998ms
  rtt min/avg/max/mdev = 0.154/0.158/0.164/0.015 ms

  # ping6 -c3 adraste
  PING adraste(adraste) 56 data bytes
  64 bytes from adraste: icmp_seq=1 ttl=255 time=0.165 ms
  64 bytes from adraste: icmp_seq=2 ttl=255 time=0.155 ms
  64 bytes from adraste: icmp_seq=3 ttl=255 time=0.187 ms

  --- adraste ping statistics ---
  3 packets transmitted, 3 received, 0% packet loss, time 1998ms
  rtt min/avg/max/mdev = 0.155/0.169/0.187/0.013 ms

As said before, I don't think there's a problem with the LAN. Trust me, I would know about it :)

> On Fri, Mar 2, 2012 at 8:55 AM, Harald Hannelius wrote:
>> On Fri, 2 Mar 2012, Brian Candler wrote:
>>> On Fri, Mar 02, 2012 at 05:25:18PM +0200, Harald Hannelius wrote:
>>>> I'll have to test with just a two-way replica, and see if I get
>>>> better performance out of that. I'm gonna lose the capability to
>>>> have one node at the other site then
>>>
>>> Ah... are these nodes separated by a WAN? Synchronous replication is
>>> pretty sensitive to latency. You might want to look at
>>> geo-replication instead (which I've not tested)
>>
>> No, it's a 1 Gbps LAN. The other "site" is within LAN-range.

-- 
Harald Hannelius | harald.hannelius/a\arcada.fi | +358 50 594 1020
Re: [Gluster-users] Write performance in a replicated/distributed setup with KVM?
On Fri, 2 Mar 2012, Brian Candler wrote:

> On Fri, Mar 02, 2012 at 05:25:18PM +0200, Harald Hannelius wrote:
>> I'll have to test with just a two-way replica, and see if I get
>> better performance out of that. I'm gonna lose the capability to
>> have one node at the other site then
>
> Ah... are these nodes separated by a WAN? Synchronous replication is
> pretty sensitive to latency. You might want to look at geo-replication
> instead (which I've not tested)

No, it's a 1 Gbps LAN. The other "site" is within LAN-range.

-- 
Harald Hannelius | harald.hannelius/a\arcada.fi | +358 50 594 1020
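To put a rough number on the latency point: if every write request had to wait for one full network round trip before the next one was sent, the round-trip time alone would cap throughput. The sketch below is a back-of-the-envelope model only, not a description of how Gluster actually pipelines writes. The 128 KiB request size and the 20 ms WAN figure are assumptions chosen for illustration; the 0.158 ms value is one of the average RTTs from the ping6 output posted in this thread (0.13 to 0.17 ms). The function name is hypothetical.

```shell
# Hypothetical model: one synchronous round trip per write request.
#   $1 = round-trip time in milliseconds
#   $2 = KiB carried per request (assumed, not measured)
# Prints the resulting upper bound in MiB/s.
latency_bound_mib() {
    awk -v rtt="$1" -v kib="$2" \
        'BEGIN { printf "%.2f\n", (1000 / rtt) * kib / 1024 }'
}

latency_bound_mib 0.158 128   # LAN RTT as measured with ping6 in this thread
latency_bound_mib 20 128      # a made-up WAN RTT, for comparison
```

Under these assumptions the LAN bound comes out far above the throughput being discussed, while a WAN-like RTT would land in the single-digit MiB/s range. That is consistent with the view that latency on this LAN is not the limiting factor.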
Re: [Gluster-users] Write performance in a replicated/distributed setup with KVM?
On Fri, 2 Mar 2012, Brian Candler wrote:

> On Fri, Mar 02, 2012 at 03:33:19PM +0200, Harald Hannelius wrote:
>> The pattern for me starts to look like this; max-write-speed ~= /nodes.
>
> This is most odd. If you are using a regular replicated+distributed
> (not striped) volume, then each file operation will be directed to one
> pair of servers. The dd should just hit two servers and the other two
> will be idle. So I don't see why your 4-node setup should perform any
> differently to a 2-node one.

I'll have to test with just a two-way replica, and see if I get better performance out of that. I'm gonna lose the capability to have one node at the other site then, but write performance is more important right now.

It could be a good idea to have another, private Ethernet interconnect between the nodes as well, I suppose? Hopefully 10 Gbps will get cheaper soon.

-- 
Harald Hannelius | harald.hannelius/a\arcada.fi | +358 50 594 1020
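For the interconnect question, a rough ceiling can be estimated under the assumption that the client sends every write to each replica over the same NIC, so the usable rate is the link rate divided by the replica count. This is a simplification that is not confirmed anywhere in this thread: it ignores protocol overhead and the case where one brick is local to the client (as with a localhost mount). The function name is hypothetical.

```shell
# Hypothetical estimate of the per-client write ceiling in MB/s.
#   $1 = link speed in Mbit/s
#   $2 = number of replicas each write is sent to
replica_ceiling_mb() {
    awk -v mbit="$1" -v n="$2" 'BEGIN { printf "%.1f\n", mbit / 8 / n }'
}

replica_ceiling_mb 1000 2     # 1 Gbps link, replica 2
replica_ceiling_mb 10000 2    # 10 Gbps link, replica 2
```

By this estimate a 1 Gbps link with replica 2 tops out somewhat above the roughly 50 MB/s seen in the two-node tests, so the two-node results look plausible for the hardware, and the four-node figure is what still needs explaining.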
Re: [Gluster-users] Write performance in a replicated/distributed setup with KVM?
On Fri, 2 Mar 2012, Samuli Heinonen wrote:

> On 2.3.2012 15:33, Harald Hannelius wrote:
>> The pattern for me starts to look like this; max-write-speed ~= /nodes.
>
> Have you tried tuning the performance.io-thread-count setting? More
> information about that can be found at
> http://docs.redhat.com/docs/en-US/Red_Hat_Storage_Software_Appliance/3.2/html/User_Guide/chap-User_Guide-Managing_Volumes.html

Yes, as in a previous post:

  # gluster volume info

  Volume Name: virtuals
  Type: Distributed-Replicate
  Status: Started
  Number of Bricks: 2 x 2 = 4
  Transport-type: tcp
  Bricks:
  Brick1: adraste:/data/brick
  Brick2: alcippe:/data/brick
  Brick3: aethra:/data/brick
  Brick4: helen:/data/brick
  Options Reconfigured:
  cluster.data-self-heal-algorithm: diff
  cluster.self-heal-window-size: 1
  performance.io-thread-count: 64
  performance.cache-size: 536870912
  performance.write-behind-window-size: 16777216
  performance.flush-behind: on

-- 
Harald Hannelius | harald.hannelius/a\arcada.fi | +358 50 594 1020
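For readers who want to reproduce the performance.* entries of that "Options Reconfigured" list, such options are normally applied with `gluster volume set <volume> <option> <value>`. The sketch below only prints the commands so they can be reviewed before being run. The option names and values are copied from the listing above; the helper name is hypothetical, and whether these values suit another workload is not something this thread establishes.

```shell
# Hypothetical helper: prints the commands instead of running them.
# Option names and values are copied from the "Options Reconfigured"
# listing in the message above.
tuning_cmds() {
    vol=$1
    while read -r key value; do
        echo "gluster volume set $vol $key $value"
    done <<'EOF'
performance.io-thread-count 64
performance.cache-size 536870912
performance.write-behind-window-size 16777216
performance.flush-behind on
EOF
}

tuning_cmds virtuals
```

For reference, 536870912 bytes is 512 MiB and 16777216 bytes is 16 MiB.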
Re: [Gluster-users] Write performance in a replicated/distributed setup with KVM?
On Fri, 2 Mar 2012, Brian Candler wrote:

> On Fri, Mar 02, 2012 at 02:41:30PM +0200, Harald Hannelius wrote:
>>> So next is back to the four-node setup you had before. I would
>>> expect that to perform about the same.
>>
>> So would I expect too. But;
>>
>>   # time dd if=/dev/zero bs=1M count=20000 of=/gluster/testfile
>>   20000+0 records in
>>   20000+0 records out
>>   20971520000 bytes (21 GB) copied, 1058.22 s, 19.8 MB/s
>>
>>   real    17m38.357s
>>   user    0m0.040s
>>   sys     0m12.501s
>
> Right, so we know:
>
> - replic of aethra and alcippe is fast
> - distrib/replic across all four nodes is slow
>
> So chopping further, what about:
>
> - replic of adraste and helen?

The pattern for me starts to look like this; max-write-speed ~= /nodes.

  Volume Name: test
  Type: Replicate
  Status: Started
  Number of Bricks: 2
  Transport-type: tcp
  Bricks:
  Brick1: adraste:/data/single
  Brick2: helen:/data/single

  # time dd if=/dev/zero bs=1M count=10000 of=/mnt/testfile
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 195.816 s, 53.5 MB/s

  real    3m15.821s
  user    0m0.016s
  sys     0m8.169s

> This would show whether one of these nodes is at fault.
>
>> At least I got double figure readings this time. Sometimes I get
>> write speeds of 5-6 MB/s.
>
> Well, I'm a bit lost when you start talking about VMs. Is this a
> production environment, and you are doing these dd/cp tests *in
> addition* to the production load of VM traffic? Or are you doing tests
> on an unloaded system?

I have some systems running in the background, yes. They are not really production machines.

> Note: mail servers have a nasty habit of doing fsync() all the time,
> for every single received message.

It looks like openldap's slapadd uses some kind of sync as well. The load average on the KVM host was up at 9.00 while slapadd was running.

> Tools which might be useful to observe the production load:
>
>   iostat 1
>   # shows the count of I/O requests and KB read/written per second

iotop is handy too.

>   btrace /dev/sdb | grep ' [DC] '
>   # shows the actual I/O operations dispatched (D) and completed (C)
>   # to the drive
>
> There are also gluster-layer tools but I've not tried them:
> http://download.gluster.com/pub/gluster/glusterfs/3.2/Documentation/AG/html/chap-Gluster_Administration_Guide-Monitor_Workload.html
>
> Regards,
>
> Brian.

-- 
Harald Hannelius | harald.hannelius/a\arcada.fi | +358 50 594 1020
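The dd figures quoted in this message can be cross-checked by dividing bytes by seconds. The sketch below takes the 21 GB run as 20000 blocks of 1 MiB (20971520000 bytes) and the 10 GB run as 10000 blocks (10485760000 bytes); those byte counts are inferred from the reported sizes and rates. It also assumes dd's MB/s means decimal megabytes, which fits the numbers. The function name is hypothetical.

```shell
# Hypothetical helper: rate in decimal MB/s, the unit dd appears to report.
#   $1 = bytes written, $2 = elapsed seconds
dd_rate_mb() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1000000 }'
}

dd_rate_mb 20971520000 1058.22   # four-node distributed-replicate volume
dd_rate_mb 10485760000 195.816   # two-node replica on adraste and helen
```

Both results agree with the rates dd printed (19.8 and 53.5 MB/s), so the four-node volume really is running at well under half the speed of the two-node replica rather than this being a reporting artifact.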
Re: [Gluster-users] Write performance in a replicated/distributed setup with KVM?
On Fri, 2 Mar 2012, Brian Candler wrote:

> On Fri, Mar 02, 2012 at 01:02:39PM +0200, Harald Hannelius wrote:
>>> If both are fast: then retest using a two-node replicated volume.
>>
>>   gluster volume create test replica 2 transport tcp aethra:/data/single alcippe:/data/single
>>
>>   Volume Name: test
>>   Type: Replicate
>>   Status: Started
>>   Number of Bricks: 2
>>   Transport-type: tcp
>>   Bricks:
>>   Brick1: aethra:/data/single
>>   Brick2: alcippe:/data/single
>>
>>   # time dd if=/dev/zero bs=1M count=20000 of=/mnt/testfile
>>   20000+0 records in
>>   20000+0 records out
>>   20971520000 bytes (21 GB) copied, 426.62 s, 49.2 MB/s
>>
>>   real    7m6.625s
>>   user    0m0.040s
>>   sys     0m12.293s
>>
>> As expected, roughly half of the single node setup. I could live with
>> that too.
>
> So next is back to the four-node setup you had before. I would expect
> that to perform about the same.

So would I expect too. But;

  # time dd if=/dev/zero bs=1M count=20000 of=/gluster/testfile
  20000+0 records in
  20000+0 records out
  20971520000 bytes (21 GB) copied, 1058.22 s, 19.8 MB/s

  real    17m38.357s
  user    0m0.040s
  sys     0m12.501s

  # gluster volume info

  Volume Name: virtuals
  Type: Distributed-Replicate
  Status: Started
  Number of Bricks: 2 x 2 = 4
  Transport-type: tcp
  Bricks:
  Brick1: adraste:/data/brick
  Brick2: alcippe:/data/brick
  Brick3: aethra:/data/brick
  Brick4: helen:/data/brick
  Options Reconfigured:
  cluster.data-self-heal-algorithm: diff
  cluster.self-heal-window-size: 1
  performance.io-thread-count: 64
  performance.cache-size: 536870912
  performance.write-behind-window-size: 16777216
  performance.flush-behind: on

At the same time nagios tries to empty my cell phone battery when virtual hosts don't respond to ping anymore. That virtual host is a mail server and it receives e-mail. I guess that sendmail+procmail+imapd generates some I/O.

>> At least I got double figure readings this time. Sometimes I get
>> write speeds of 5-6 MB/s.
>
> If you have problems with high levels of concurrency, this might be a
> problem with the number of I/O threads which gluster creates. You
> actually only get log(2) of the number of outstanding requests in the
> queue.
>
> I made a (stupid, non-production) patch which got around this problem
> in my benchmarking:
> http://gluster.org/pipermail/gluster-users/2012-February/009590.html
>
> IMO it would be better to be able to configure the *minimum* number of
> I/O threads to spawn. You can configure the maximum but it will almost
> never be reached.
>
> Regards,
>
> Brian.

-- 
Harald Hannelius | harald.hannelius/a\arcada.fi | +358 50 594 1020
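To illustrate why a configured maximum of 64 I/O threads would "almost never be reached" if the thread count grows with log2 of the outstanding requests, here is a toy model of that statement. It is not Gluster's actual io-threads code: the floor of one thread, the rounding down, and the function name are all assumptions made for this sketch.

```shell
# Toy model of the claim quoted above, NOT Gluster's implementation:
# threads = floor(log2(outstanding requests)), at least 1, at most max.
#   $1 = outstanding requests in the queue
#   $2 = configured maximum (e.g. performance.io-thread-count)
threads_for_queue() {
    n=$1
    max=$2
    k=0
    # Integer log2: count how many times n can be halved.
    while [ "$n" -gt 1 ]; do
        n=$((n / 2))
        k=$((k + 1))
    done
    if [ "$k" -lt 1 ]; then k=1; fi
    if [ "$k" -gt "$max" ]; then k=$max; fi
    echo "$k"
}

for q in 1 8 64 1024; do
    echo "queue=$q threads=$(threads_for_queue "$q" 64)"
done
```

In this model even a queue of 1024 outstanding requests yields only 10 threads, far below the configured 64, which is the behaviour the message describes.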