Hello,

On Mon, 2 Jun 2014 16:15:22 +0800 Indra Pramana wrote:
> Dear all,
>
> I have managed to identify some slow OSDs and journals and have since
> replaced them. RADOS benchmark of the whole cluster is now fast, much
> improved from last time, showing the cluster can go up to 700+ MB/s.
>
> =====
> Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_hv-kvm-01_6931
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>     0       0         0         0         0         0         -         0
>     1      16       214       198   791.387       792  0.260687  0.074689
>     2      16       275       259   517.721       244  0.079697  0.0861397
>     3      16       317       301   401.174       168  0.209022  0.115348
>     4      16       317       301   300.902         0         -  0.115348
>     5      16       356       340   271.924        78  0.040032  0.172452
>     6      16       389       373   248.604       132  0.038983  0.221213
>     7      16       411       395   225.662        88  0.048462  0.211686
>     8      16       441       425   212.454       120  0.048722  0.237671
>     9      16       474       458   203.513       132  0.041285  0.226825
>    10      16       504       488   195.161       120  0.041899  0.224044
>    11      16       505       489   177.784         4  0.622238  0.224858
>    12      16       505       489    162.97         0         -  0.224858
> Total time run:         12.142654
> Total writes made:      505
> Write size:             4194304
> Bandwidth (MB/sec):     166.356
>
> Stddev Bandwidth:       208.41
> Max bandwidth (MB/sec): 792
> Min bandwidth (MB/sec): 0
> Average Latency:        0.384178
> Stddev Latency:         1.10504
> Max latency:            9.64224
> Min latency:            0.031679
> =====
>
This might be better than the last result, but it still shows the same
massive variance in latency and a pretty horrible average latency.

Also, you want to run this test for a lot longer; looking at the bandwidth
progression, it seems to drop over time. I'd expect the sustained bandwidth
over a minute or so to be below 100MB/s.

> However, dd test result on guest VM is still slow.
>
> =====
> root@test1# dd bs=1M count=256 if=/dev/zero of=test conv=fdatasync oflag=direct
> 256+0 records in
> 256+0 records out
> 268435456 bytes (268 MB) copied, 17.1829 s, 15.6 MB/s
> =====
>
You're kinda comparing apples to oranges here.

Firstly, the block size isn't the same; running an rbd bench with 1MB blocks
shows about a 25% decrease in bandwidth.

Secondly, how does the VM access the image: user space or kernel space, and
with what FS, etc.? Mounting an RBD image formatted the same way in kernel
space on a host and doing the dd test there would be a better comparison.
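Roughly along these lines; the pool and image names below are placeholders,
and the /dev/rbd0 device name depends on what "rbd map" assigns on your host:

=====
# Longer rados bench run (60s instead of 10s), 16 writers, 1MB objects to
# roughly match the 1MB dd block size:
rados -p <pool> bench 60 write -t 16 -b 1048576

# Kernel-space comparison: create and map an RBD image on a host, put the
# same FS on it as in the VM, and repeat the dd test there:
rbd create <pool>/ddtest --size 10240
rbd map <pool>/ddtest
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt
dd bs=1M count=256 if=/dev/zero of=/mnt/test conv=fdatasync oflag=direct
=====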
Christian

> I thought I have fixed the problem by replacing all those bad OSDs and
> journals, but apparently it doesn't resolve the problem.
>
> Is there any throttling setting which prevents the guest VMs from getting
> the I/O write speed they are entitled to?
>
> Looking forward to your reply, thank you.
>
> Cheers.
>
>
> On Tue, Apr 29, 2014 at 8:54 PM, Christian Balzer <ch...@gol.com> wrote:
>
> > On Thu, 24 Apr 2014 13:51:49 +0800 Indra Pramana wrote:
> >
> > > Hi Christian,
> > >
> > > Good day to you, and thank you for your reply.
> > >
> > > On Wed, Apr 23, 2014 at 11:41 PM, Christian Balzer <ch...@gol.com> wrote:
> > >
> > > > > > > Using 32 concurrent writes, result is below. The speed really
> > > > > > > fluctuates.
> > > > > > >
> > > > > > > Total time run:         64.317049
> > > > > > > Total writes made:      1095
> > > > > > > Write size:             4194304
> > > > > > > Bandwidth (MB/sec):     68.100
> > > > > > >
> > > > > > > Stddev Bandwidth:       44.6773
> > > > > > > Max bandwidth (MB/sec): 184
> > > > > > > Min bandwidth (MB/sec): 0
> > > > > > > Average Latency:        1.87761
> > > > > > > Stddev Latency:         1.90906
> > > > > > > Max latency:            9.99347
> > > > > > > Min latency:            0.075849
> > > > > > >
> > > > > > That is really weird, it should get faster, not slower. ^o^
> > > > > > I assume you've run this a number of times?
> > > > > >
> > > > > > Also my apologies, the default is 16 threads, not 1, but that
> > > > > > still isn't enough to get my cluster to full speed:
> > > > > > ---
> > > > > > Bandwidth (MB/sec):     349.044
> > > > > >
> > > > > > Stddev Bandwidth:       107.582
> > > > > > Max bandwidth (MB/sec): 408
> > > > > > ---
> > > > > > at 64 threads it will ramp up from a slow start to:
> > > > > > ---
> > > > > > Bandwidth (MB/sec):     406.967
> > > > > >
> > > > > > Stddev Bandwidth:       114.015
> > > > > > Max bandwidth (MB/sec): 452
> > > > > > ---
> > > > > >
> > > > > > But what stands out is your latency. I don't have a 10GbE
> > > > > > network to compare, but my Infiniband based cluster (going
> > > > > > through at least one switch) gives me values like this:
> > > > > > ---
> > > > > > Average Latency:        0.335519
> > > > > > Stddev Latency:         0.177663
> > > > > > Max latency:            1.37517
> > > > > > Min latency:            0.1017
> > > > > > ---
> > > > > >
> > > > > > Of course that latency is not just the network.
> > > > > >
> > > > > What else can contribute to this latency? Storage node load, disk
> > > > > speed, anything else?
> > > > >
> > > > That and the network itself are pretty much it; you should know
> > > > once you've run those tests with atop or iostat on the storage
> > > > nodes.
> > > >
> > > > > > I would suggest running atop (gives you more information at one
> > > > > > glance) or "iostat -x 3" on all your storage nodes during these
> > > > > > tests to identify any node or OSD that is overloaded in some
> > > > > > way.
> > > > > >
> > > > > Will try.
> > > > >
> > > > Do that and let us know about the results.
> > > >
> > > I have done some tests using iostat and noted some OSDs on a
> > > particular storage node going up to the 100% limit when I run the
> > > rados bench test.
> > >
> > Dumping lots of text will make people skip over your mails; you need to
> > summarize and preferably understand yourself what these numbers mean.
> >
> > The iostat output is not too conclusive, as the numbers when reaching
> > 100% utilization are not particularly impressive.
> > The fact that it happens, though, should make you look for anything
> > different with these OSDs, from smartctl checks to PG distribution, as
> > in "ceph pg dump" and then tallying up each PG.
> > Also look at "ceph osd tree" and see if those OSDs or that node have a
> > higher weight than others.
> >
> > The atop line indicates that sdb was being read at a rate of 100MB/s,
> > and assuming that your benchmark was more or less the only thing
> > running at that time, this would mean something very odd is going on,
> > as all the other OSDs have no significant reads going on and all
> > were being written at about the same speed.
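For illustration, those checks could look roughly like this; the awk field
number for the acting set is only an example and differs between Ceph
versions, so verify it against the header line of "ceph pg dump" first:

=====
# SMART health check on a suspect disk
smartctl -a /dev/sdb

# tally how many PGs list each OSD in their acting set
ceph pg dump | grep '^[0-9]*\.' | awk '{print $14}' | tr -d '[]' \
    | tr ',' '\n' | sort -n | uniq -c | sort -rn

# compare OSD and host weights
ceph osd tree
=====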
> >
> > Christian
> >
> > > ====
> > > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > >            1.09    0.00    0.92   21.74    0.00   76.25
> > >
> > > Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm   %util
> > > sda        0.00    0.00   4.33   42.00   73.33   6980.00   304.46     0.29    6.22    0.00    6.86   1.50    6.93
> > > sdb        0.00    0.00   0.00   17.67    0.00   6344.00   718.19    59.64  854.26    0.00  854.26  56.60 *100.00*
> > > sdc        0.00    0.00  12.33   59.33   70.67  18882.33   528.92    36.54  509.80   64.76  602.31  10.51   75.33
> > > sdd        0.00    0.00   3.33   54.33   24.00  15249.17   529.71     1.29   22.45    3.20   23.63   1.64    9.47
> > > sde        0.00    0.33   0.00    0.67    0.00      4.00    12.00     0.30  450.00    0.00  450.00 450.00   30.00
> > >
> > > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > >            1.38    0.00    1.13    7.75    0.00   89.74
> > >
> > > Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm   %util
> > > sda        0.00    0.00   5.00   69.00   30.67  19408.50   525.38     4.29   58.02    0.53   62.18   2.00   14.80
> > > sdb        0.00    0.00   7.00   63.33   41.33  20911.50   595.82    13.09  826.96   88.57  908.57   5.48   38.53
> > > sdc        0.00    0.00   2.67   30.00   17.33   6945.33   426.29     0.21    6.53    0.50    7.07   1.59    5.20
> > > sdd        0.00    0.00   2.67   58.67   16.00  20661.33   674.26     4.89   79.54   41.00   81.30   2.70   16.53
> > > sde        0.00    0.00   0.00    1.67    0.00      6.67     8.00     0.01    3.20    0.00    3.20   1.60    0.27
> > >
> > > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > >            0.97    0.00    0.55    6.73    0.00   91.75
> > >
> > > Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm   %util
> > > sda        0.00    0.00   1.67   15.33   21.33    120.00    16.63     0.02    1.18    0.00    1.30   0.63    1.07
> > > sdb        0.00    0.00   4.33   62.33   24.00  13299.17   399.69     2.68   11.18    1.23   11.87   1.94   12.93
> > > sdc        0.00    0.00   0.67   38.33   70.67   7881.33   407.79    37.66  202.15    0.00  205.67  13.61   53.07
> > > sdd        0.00    0.00   3.00   17.33   12.00    166.00    17.51     0.05    2.89    3.11    2.85   0.98    2.00
> > > sde        0.00    0.00   0.00    0.00    0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00    0.00
> > >
> > > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > >            1.29    0.00    0.92   24.10    0.00   73.68
> > >
> > > Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm   %util
> > > sda        0.00    0.00   0.00   45.33    0.00   4392.50   193.79     0.62   13.62    0.00   13.62   1.09    4.93
> > > sdb        0.00    0.00   0.00    8.67    0.00   3600.00   830.77    63.87 1605.54    0.00 1605.54 115.38 *100.00*
> > > sdc        0.00    0.33   8.67   42.67   37.33   5672.33   222.45    16.88  908.78    1.38 1093.09   7.06   36.27
> > > sdd        0.00    0.00   0.33   31.00    1.33    629.83    40.29     0.06    1.91    0.00    1.94   0.94    2.93
> > > sde        0.00    0.00   0.00    0.33    0.00      1.33     8.00     0.12  368.00    0.00  368.00 368.00   12.27
> > >
> > > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > >            1.59    0.00    0.88    4.82    0.00   92.70
> > >
> > > Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm   %util
> > > sda        0.00    0.00   0.00   29.00    0.00    235.00    16.21     0.06    1.98    0.00    1.98   0.97    2.80
> > > sdb        0.00    6.00   4.33  114.67   38.67   6422.33   108.59     9.19  513.19  265.23  522.56   2.08   24.80
> > > sdc        0.00    0.00   0.00   20.67    0.00    124.00    12.00     0.04    2.00    0.00    2.00   1.03    2.13
> > > sdd        0.00    5.00   1.67   81.00   12.00    546.17    13.50     0.10    1.21    0.80    1.22   0.39    3.20
> > > sde        0.00    0.00   0.00    0.00    0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00    0.00
> > > ====
> > >
> > > And the high utilisation is randomly affecting other OSDs as well
> > > within the same node, and not only affecting one particular OSD.
> > >
> > > atop result on the node:
> > >
> > > ====
> > > ATOP - ceph-osd-07        2014/04/24 13:49:12        ------        10s elapsed
> > > PRC | sys 1.77s | user 2.11s | #proc 164 | #trun 2 | #tslpi 2817 | #tslpu 0 | #zombie 0 | clones 4 | #exit 0 |
> > > CPU | sys 14% | user 20% | irq 1% | idle 632% | wait 133% | steal 0% | guest 0% | avgf 1.79GHz | avgscal 54% |
> > > cpu | sys  6% | user  7% | irq 0% | idle  19% | cpu006 w 68% | steal 0% | guest 0% | avgf 2.42GHz | avgscal 73% |
> > > cpu | sys  2% | user  3% | irq 0% | idle  88% | cpu002 w  7% | steal 0% | guest 0% | avgf 1.68GHz | avgscal 50% |
> > > cpu | sys  2% | user  2% | irq 0% | idle  86% | cpu003 w 10% | steal 0% | guest 0% | avgf 1.67GHz | avgscal 50% |
> > > cpu | sys  2% | user  2% | irq 0% | idle  75% | cpu001 w 21% | steal 0% | guest 0% | avgf 1.83GHz | avgscal 55% |
> > > cpu | sys  1% | user  2% | irq 1% | idle  70% | cpu000 w 26% | steal 0% | guest 0% | avgf 1.85GHz | avgscal 56% |
> > > cpu | sys  1% | user  2% | irq 0% | idle  97% | cpu004 w  1% | steal 0% | guest 0% | avgf 1.64GHz | avgscal 49% |
> > > cpu | sys  1% | user  1% | irq 0% | idle  98% | cpu005 w  0% | steal 0% | guest 0% | avgf 1.60GHz | avgscal 48% |
> > > cpu | sys  0% | user  1% | irq 0% | idle  98% | cpu007 w  0% | steal 0% | guest 0% | avgf 1.60GHz | avgscal 48% |
> > > CPL | avg1 1.12 | avg5 0.90 | avg15 0.72 | csw 103682 | intr 34330 | numcpu 8 |
> > > MEM | tot 15.6G | free 158.2M | cache 13.7G | dirty 101.4M | buff 18.2M | slab 574.6M |
> > > SWP | tot 518.0M | free 489.6M | vmcom 5.2G | vmlim 8.3G |
> > > PAG | scan 327450 | stall 0 | swin 0 | swout 0 |
> > > DSK | sdb | busy 90% | read 8115 | write 695 | KiB/r 130 | KiB/w 194 | MBr/s 103.34 | MBw/s 13.22 | avq  4.61 | avio 1.01 ms |
> > > DSK | sdc | busy 32% | read   23 | write 431 | KiB/r   6 | KiB/w 318 | MBr/s   0.02 | MBw/s 13.41 | avq 34.86 | avio 6.95 ms |
> > > DSK | sda | busy 32% | read   25 | write 674 | KiB/r   6 | KiB/w 193 | MBr/s   0.02 | MBw/s 12.76 | avq 41.00 | avio 4.48 ms |
> > > DSK | sdd | busy  7% | read   26 | write 473 | KiB/r   7 | KiB/w 223 | MBr/s   0.02 | MBw/s 10.31 | avq 14.29 | avio 1.45 ms |
> > > DSK | sde | busy  2% | read    0 | write   5 | KiB/r   0 | KiB/w   5 | MBr/s   0.00 | MBw/s  0.00 | avq  1.00 | avio 44.8 ms |
> > > NET | transport | tcpi 21326 | tcpo 27479 | udpi 0 | udpo 0 | tcpao 0 | tcppo 2 | tcprs 3 | tcpie 0 | tcpor 0 | udpnp 0 | udpip 0 |
> > > NET | network   | ipi 21326 | ipo 14340 | ipfrw 0 | deliv 21326 | icmpi 0 | icmpo 0 |
> > > NET | p2p2 ---- | pcki 12659 | pcko 20931 | si 124 Mbps | so  107 Mbps | coll 0 | mlti 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
> > > NET | p2p1 ---- | pcki  8565 | pcko  6443 | si 106 Mbps | so 7911 Kbps | coll 0 | mlti 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
> > > NET | lo   ---- | pcki   108 | pcko   108 | si   8 Kbps | so    8 Kbps | coll 0 | mlti 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
> > >
> > >   PID  RUID  EUID  THR  SYSCPU  USRCPU  VGROW   RGROW   RDDSK   WRDSK  ST EXC S CPUNR  CPU  CMD        1/1
> > >  6881  root  root  538   0.74s   0.94s     0K    256K    1.0G  121.3M  --   - S     3  17%  ceph-osd
> > > 28708  root  root  720   0.30s   0.69s   512K     -8K    160K  157.7M  --   - S     3  10%  ceph-osd
> > > 31569  root  root  678   0.21s   0.30s   512K   -584K    156K  162.7M  --   - S     0   5%  ceph-osd
> > > 32095  root  root  654   0.14s   0.16s     0K      0K     60K  105.9M  --   - S     0   3%  ceph-osd
> > >    61  root  root    1   0.20s   0.00s     0K      0K      0K      0K  --   - S     3   2%  kswapd0
> > > 10584  root  root    1   0.03s   0.02s   112K    112K      0K      0K  --   - R     4   1%  atop
> > > 11618  root  root    1   0.03s   0.00s     0K      0K      0K      0K  --   - S     6   0%  kworker/6:2
> > >    10  root  root    1   0.02s   0.00s     0K      0K      0K      0K  --   - S     0   0%  rcu_sched
> > >    38  root  root    1   0.01s   0.00s     0K      0K      0K      0K  --   - S     6   0%  ksoftirqd/6
> > >  1623  root  root    1   0.01s   0.00s     0K      0K      0K      0K  --   - S     6   0%  kworker/6:1H
> > >  1993  root  root    1   0.01s   0.00s     0K      0K      0K      0K  --   - S     2   0%  flush-8:48
> > >  2031  root  root    1   0.01s   0.00s     0K      0K      0K      0K  --   - S     2   0%  flush-8:0
> > >  2032  root  root    1   0.01s   0.00s     0K      0K      0K      0K  --   - S     0   0%  flush-8:16
> > >  2033  root  root    1   0.01s   0.00s     0K      0K      0K      0K  --   - S     2   0%  flush-8:32
> > >  5787  root  root    1   0.01s   0.00s     0K      0K      4K      0K  --   - S     3   0%  kworker/3:0
> > > 27605  root  root    1   0.01s   0.00s     0K      0K      0K      0K  --   - S     1   0%  kworker/1:2
> > > 27823  root  root    1   0.01s   0.00s     0K      0K      0K      0K  --   - S     0   0%  kworker/0:2
> > > 32511  root  root    1   0.01s   0.00s     0K      0K      0K      0K  --   - S     2   0%  kworker/2:0
> > >  1536  root  root    1   0.00s   0.00s     0K      0K      0K      0K  --   - S     2   0%  irqbalance
> > >   478  root  root    1   0.00s   0.00s     0K      0K      0K      0K  --   - S     3   0%  usb-storage
> > >   494  root  root    1   0.00s   0.00s     0K      0K      0K      0K  --   - S     1   0%  jbd2/sde1-8
> > >  1550  root  root    1   0.00s   0.00s     0K      0K    400K      0K  --   - S     1   0%  xfsaild/sdb1
> > >  1750  root  root    1   0.00s   0.00s     0K      0K    128K      0K  --   - S     2   0%  xfsaild/sdd1
> > >  1994  root  root    1   0.00s   0.00s     0K      0K      0K      0K  --   - S     1   0%  flush-8:64
> > > ====
> > >
> > > I have tried to trim the SSD drives but the problem seems to persist.
> > > Last time, trimming the SSD drives helped to improve the performance.
> > >
> > > Any advice is greatly appreciated.
> > >
> > > Thank you.
> > >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> >


--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com