Hello,

On Mon, 2 Jun 2014 16:15:22 +0800 Indra Pramana wrote:

> Dear all,
> 
> I have managed to identify some slow OSDs and journals and have since
> replaced them. RADOS benchmark of the whole cluster is now fast, much
> improved from last time, showing the cluster can go up to 700+ MB/s.
> 
> =====
>  Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
>  Object prefix: benchmark_data_hv-kvm-01_6931
>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>      0       0         0         0         0         0         -         0
>      1      16       214       198   791.387       792  0.260687  0.074689
>      2      16       275       259   517.721       244  0.079697  0.0861397
>      3      16       317       301   401.174       168  0.209022  0.115348
>      4      16       317       301   300.902         0         -  0.115348
>      5      16       356       340   271.924        78  0.040032  0.172452
>      6      16       389       373   248.604       132  0.038983  0.221213
>      7      16       411       395   225.662        88  0.048462  0.211686
>      8      16       441       425   212.454       120  0.048722  0.237671
>      9      16       474       458   203.513       132  0.041285  0.226825
>     10      16       504       488   195.161       120  0.041899  0.224044
>     11      16       505       489   177.784         4  0.622238  0.224858
>     12      16       505       489    162.97         0         -  0.224858
> Total time run:         12.142654
> Total writes made:      505
> Write size:             4194304
> Bandwidth (MB/sec):     166.356
> 
> Stddev Bandwidth:       208.41
> Max bandwidth (MB/sec): 792
> Min bandwidth (MB/sec): 0
> Average Latency:        0.384178
> Stddev Latency:         1.10504
> Max latency:            9.64224
> Min latency:            0.031679
> =====
> 
This might be better than the last result, but it still shows the same
massive variance in latency and a pretty horrible average latency.

Also, you want to run this test for a lot longer; looking at the bandwidth
progression, it seems to drop over time.
I'd expect the sustained bandwidth over a minute or so to be below 100MB/s.
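For example, something along these lines (just a sketch, assuming your pool
is called "rbd"; adjust the pool name and duration as needed):

  # 60 second write bench, 16 concurrent 4MB writes
  rados -p rbd bench 60 write -t 16

A minute or more should get you past the initial burst the journals can
absorb and show the real sustained rate.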


> However, dd test result on guest VM is still slow.
> 
> =====
> root@test1# dd bs=1M count=256 if=/dev/zero of=test conv=fdatasync oflag=direct
> 256+0 records in
> 256+0 records out
> 268435456 bytes (268 MB) copied, 17.1829 s, 15.6 MB/s
> =====
> 
You're kinda comparing apples to oranges here.
Firstly, the block size isn't the same; running an rbd bench with 1MB blocks
shows about a 25% decrease in bandwidth.
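You can verify that at the rados level as well, e.g. (same assumption about
the pool name; -b is the write size in bytes):

  rados -p rbd bench 30 write -t 16              # 4MB default
  rados -p rbd bench 30 write -t 16 -b 1048576   # 1MB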

Secondly, how does the VM access the image (user space or kernel space), and
what FS is it using, etc.?
Mounting an RBD image formatted the same way in kernel space on a host and
doing the dd test there would be a better comparison.
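Roughly like this (a sketch only; the image name and mount point are
placeholders, and I'm assuming XFS to match your OSDs):

  rbd create rbd/ddtest --size 10240
  rbd map rbd/ddtest                  # shows up as e.g. /dev/rbd0
  mkfs.xfs /dev/rbd0
  mkdir -p /mnt/ddtest && mount /dev/rbd0 /mnt/ddtest
  dd bs=1M count=256 if=/dev/zero of=/mnt/ddtest/test conv=fdatasync oflag=direct
  umount /mnt/ddtest && rbd unmap /dev/rbd0

That takes qemu and any VM level caching or throttling out of the equation.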

Christian
> I thought I had fixed the problem by replacing all those bad OSDs and
> journals, but apparently that didn't resolve it.
> 
> Are there any throttling settings which prevent the guest VMs from getting
> the I/O write speed they're entitled to?
> 
> Looking forward to your reply, thank you.
> 
> Cheers.
> 
> 
> 
> 
> On Tue, Apr 29, 2014 at 8:54 PM, Christian Balzer <ch...@gol.com> wrote:
> 
> > On Thu, 24 Apr 2014 13:51:49 +0800 Indra Pramana wrote:
> >
> > > Hi Christian,
> > >
> > > Good day to you, and thank you for your reply.
> > >
> > > On Wed, Apr 23, 2014 at 11:41 PM, Christian Balzer <ch...@gol.com>
> > wrote:
> > >
> > > > > > > Using 32 concurrent writes, result is below. The speed really
> > > > > > > fluctuates.
> > > > > > >
> > > > > > >  Total time run:         64.317049
> > > > > > > Total writes made:      1095
> > > > > > > Write size:             4194304
> > > > > > > Bandwidth (MB/sec):     68.100
> > > > > > >
> > > > > > > Stddev Bandwidth:       44.6773
> > > > > > > Max bandwidth (MB/sec): 184
> > > > > > > Min bandwidth (MB/sec): 0
> > > > > > > Average Latency:        1.87761
> > > > > > > Stddev Latency:         1.90906
> > > > > > > Max latency:            9.99347
> > > > > > > Min latency:            0.075849
> > > > > > >
> > > > > > That is really weird, it should get faster, not slower. ^o^
> > > > > > I assume you've run this a number of times?
> > > > > >
> > > > > > Also my apologies, the default is 16 threads, not 1, but that
> > > > > > still isn't enough to get my cluster to full speed:
> > > > > > ---
> > > > > > Bandwidth (MB/sec):     349.044
> > > > > >
> > > > > > Stddev Bandwidth:       107.582
> > > > > > Max bandwidth (MB/sec): 408
> > > > > > ---
> > > > > > at 64 threads it will ramp up from a slow start to:
> > > > > > ---
> > > > > > Bandwidth (MB/sec):     406.967
> > > > > >
> > > > > > Stddev Bandwidth:       114.015
> > > > > > Max bandwidth (MB/sec): 452
> > > > > > ---
> > > > > >
> > > > > > But what stands out is your latency. I don't have a 10GbE
> > > > > > network to compare with, but my Infiniband based cluster (going
> > > > > > through at least one switch) gives me values like this:
> > > > > > ---
> > > > > > Average Latency:        0.335519
> > > > > > Stddev Latency:         0.177663
> > > > > > Max latency:            1.37517
> > > > > > Min latency:            0.1017
> > > > > > ---
> > > > > >
> > > > > > Of course that latency is not just the network.
> > > > > >
> > > > >
> > > > > What else can contribute to this latency? Storage node load, disk
> > > > > speed, anything else?
> > > > >
> > > > That and the network itself are pretty much it; you should know
> > > > once you've run those tests with atop or iostat on the storage
> > > > nodes.
> > > >
> > > > >
> > > > > > I would suggest running atop (gives you more information at one
> > > > > > glance) or "iostat -x 3" on all your storage nodes during these
> > > > > > tests to identify any node or OSD that is overloaded in some
> > > > > > way.
> > > > > >
> > > > >
> > > > > Will try.
> > > > >
> > > > Do that and let us know about the results.
> > > >
> > >
> > > I have done some tests using iostat and noted some OSDs on a
> > > particular storage node going up to the 100% limit when I run the
> > > rados bench test.
> > >
> > Dumping lots of text will make people skip over your mails; you need to
> > summarize, and preferably understand yourself what these numbers mean.
> >
> > The iostat output is not too conclusive, as the numbers when reaching
> > 100% utilization are not particularly impressive.
> > The fact that it happens at all, though, should make you look for anything
> > different about these OSDs, from smartctl checks to PG distribution, as
> > in "ceph pg dump" and then tallying up each PG.
> > Also look at "ceph osd tree" and see if those OSDs or their node have a
> > higher weight than the others.
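> > For the tallying, something like this should work (a rough sketch,
> > untested here; the column holding the acting set varies between
> > versions, so check the header with "ceph pg dump | head -2" first,
> > I'm assuming column 14 below):
> >
> >   ceph pg dump | awk '$1 ~ /^[0-9]+\./ {print $14}' | tr -d '[]' | \
> >       tr ',' '\n' | sort -n | uniq -c | sort -rn
> >
> > That prints the number of PGs mapped to each OSD; a clearly lopsided
> > distribution would explain individual OSDs maxing out first.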
> >
> > The atop line indicates that sdb was being read at a rate of 100MB/s,
> > and assuming that your benchmark was more or less the only thing
> > running at that time, this would mean something very odd is going on,
> > as all the other OSDs had no significant reads going on and all
> > were being written to at about the same speed.
> >
> > Christian
> >
> > > ====
> > > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > >            1.09    0.00    0.92   21.74    0.00   76.25
> > >
> > > Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm    %util
> > > sda        0.00    0.00   4.33   42.00   73.33   6980.00   304.46     0.29    6.22    0.00    6.86   1.50     6.93
> > > sdb        0.00    0.00   0.00   17.67    0.00   6344.00   718.19    59.64  854.26    0.00  854.26  56.60  *100.00*
> > > sdc        0.00    0.00  12.33   59.33   70.67  18882.33   528.92    36.54  509.80   64.76  602.31  10.51    75.33
> > > sdd        0.00    0.00   3.33   54.33   24.00  15249.17   529.71     1.29   22.45    3.20   23.63   1.64     9.47
> > > sde        0.00    0.33   0.00    0.67    0.00      4.00    12.00     0.30  450.00    0.00  450.00 450.00    30.00
> > >
> > > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > >            1.38    0.00    1.13    7.75    0.00   89.74
> > >
> > > Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm    %util
> > > sda        0.00    0.00   5.00   69.00   30.67  19408.50   525.38     4.29   58.02    0.53   62.18   2.00    14.80
> > > sdb        0.00    0.00   7.00   63.33   41.33  20911.50   595.82    13.09  826.96   88.57  908.57   5.48    38.53
> > > sdc        0.00    0.00   2.67   30.00   17.33   6945.33   426.29     0.21    6.53    0.50    7.07   1.59     5.20
> > > sdd        0.00    0.00   2.67   58.67   16.00  20661.33   674.26     4.89   79.54   41.00   81.30   2.70    16.53
> > > sde        0.00    0.00   0.00    1.67    0.00      6.67     8.00     0.01    3.20    0.00    3.20   1.60     0.27
> > >
> > > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > >            0.97    0.00    0.55    6.73    0.00   91.75
> > >
> > > Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm    %util
> > > sda        0.00    0.00   1.67   15.33   21.33    120.00    16.63     0.02    1.18    0.00    1.30   0.63     1.07
> > > sdb        0.00    0.00   4.33   62.33   24.00  13299.17   399.69     2.68   11.18    1.23   11.87   1.94    12.93
> > > sdc        0.00    0.00   0.67   38.33   70.67   7881.33   407.79    37.66  202.15    0.00  205.67  13.61    53.07
> > > sdd        0.00    0.00   3.00   17.33   12.00    166.00    17.51     0.05    2.89    3.11    2.85   0.98     2.00
> > > sde        0.00    0.00   0.00    0.00    0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00     0.00
> > >
> > > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > >            1.29    0.00    0.92   24.10    0.00   73.68
> > >
> > > Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm    %util
> > > sda        0.00    0.00   0.00   45.33    0.00   4392.50   193.79     0.62   13.62    0.00   13.62   1.09     4.93
> > > sdb        0.00    0.00   0.00    8.67    0.00   3600.00   830.77    63.87 1605.54    0.00 1605.54 115.38  *100.00*
> > > sdc        0.00    0.33   8.67   42.67   37.33   5672.33   222.45    16.88  908.78    1.38 1093.09   7.06    36.27
> > > sdd        0.00    0.00   0.33   31.00    1.33    629.83    40.29     0.06    1.91    0.00    1.94   0.94     2.93
> > > sde        0.00    0.00   0.00    0.33    0.00      1.33     8.00     0.12  368.00    0.00  368.00 368.00    12.27
> > >
> > > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > >            1.59    0.00    0.88    4.82    0.00   92.70
> > >
> > > Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm    %util
> > > sda        0.00    0.00   0.00   29.00    0.00    235.00    16.21     0.06    1.98    0.00    1.98   0.97     2.80
> > > sdb        0.00    6.00   4.33  114.67   38.67   6422.33   108.59     9.19  513.19  265.23  522.56   2.08    24.80
> > > sdc        0.00    0.00   0.00   20.67    0.00    124.00    12.00     0.04    2.00    0.00    2.00   1.03     2.13
> > > sdd        0.00    5.00   1.67   81.00   12.00    546.17    13.50     0.10    1.21    0.80    1.22   0.39     3.20
> > > sde        0.00    0.00   0.00    0.00    0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00     0.00
> > > ====
> > >
> > > And the high utilisation is randomly affecting other OSDs as well
> > > within the same node, and not only affecting one particular OSD.
> > >
> > > atop result on the node:
> > >
> > > ====
> > > ATOP - ceph-osd-07    2014/04/24 13:49:12    ------    10s elapsed
> > > PRC | sys 1.77s | user 2.11s | #proc 164 | #trun 2 | #tslpi 2817 | #tslpu 0 | #zombie 0 | clones 4 | #exit 0 |
> > > CPU | sys  14% | user  20% | irq 1% | idle 632% | wait 133%    | steal 0% | guest 0% | avgf 1.79GHz | avgscal 54% |
> > > cpu | sys   6% | user   7% | irq 0% | idle  19% | cpu006 w 68% | steal 0% | guest 0% | avgf 2.42GHz | avgscal 73% |
> > > cpu | sys   2% | user   3% | irq 0% | idle  88% | cpu002 w  7% | steal 0% | guest 0% | avgf 1.68GHz | avgscal 50% |
> > > cpu | sys   2% | user   2% | irq 0% | idle  86% | cpu003 w 10% | steal 0% | guest 0% | avgf 1.67GHz | avgscal 50% |
> > > cpu | sys   2% | user   2% | irq 0% | idle  75% | cpu001 w 21% | steal 0% | guest 0% | avgf 1.83GHz | avgscal 55% |
> > > cpu | sys   1% | user   2% | irq 1% | idle  70% | cpu000 w 26% | steal 0% | guest 0% | avgf 1.85GHz | avgscal 56% |
> > > cpu | sys   1% | user   2% | irq 0% | idle  97% | cpu004 w  1% | steal 0% | guest 0% | avgf 1.64GHz | avgscal 49% |
> > > cpu | sys   1% | user   1% | irq 0% | idle  98% | cpu005 w  0% | steal 0% | guest 0% | avgf 1.60GHz | avgscal 48% |
> > > cpu | sys   0% | user   1% | irq 0% | idle  98% | cpu007 w  0% | steal 0% | guest 0% | avgf 1.60GHz | avgscal 48% |
> > > CPL | avg1 1.12 | avg5 0.90 | avg15 0.72 | csw 103682 | intr 34330 | numcpu 8 |
> > > MEM | tot 15.6G | free 158.2M | cache 13.7G | dirty 101.4M | buff 18.2M | slab 574.6M |
> > > SWP | tot 518.0M | free 489.6M | vmcom 5.2G | vmlim 8.3G |
> > > PAG | scan 327450 | stall 0 | swin 0 | swout 0 |
> > > DSK | sdb | busy 90% | read 8115 | write 695 | KiB/r 130 | KiB/w 194 | MBr/s 103.34 | MBw/s 13.22 | avq  4.61 | avio 1.01 ms |
> > > DSK | sdc | busy 32% | read   23 | write 431 | KiB/r   6 | KiB/w 318 | MBr/s   0.02 | MBw/s 13.41 | avq 34.86 | avio 6.95 ms |
> > > DSK | sda | busy 32% | read   25 | write 674 | KiB/r   6 | KiB/w 193 | MBr/s   0.02 | MBw/s 12.76 | avq 41.00 | avio 4.48 ms |
> > > DSK | sdd | busy  7% | read   26 | write 473 | KiB/r   7 | KiB/w 223 | MBr/s   0.02 | MBw/s 10.31 | avq 14.29 | avio 1.45 ms |
> > > DSK | sde | busy  2% | read    0 | write   5 | KiB/r   0 | KiB/w   5 | MBr/s   0.00 | MBw/s  0.00 | avq  1.00 | avio 44.8 ms |
> > > NET | transport | tcpi 21326 | tcpo 27479 | udpi 0 | udpo 0 | tcpao 0 | tcppo 2 | tcprs 3 | tcpie 0 | tcpor 0 | udpnp 0 | udpip 0 |
> > > NET | network   | ipi  21326 | ipo  14340 | ipfrw 0 | deliv 21326 | icmpi 0 | icmpo 0 |
> > > NET | p2p2 ---- | pcki 12659 | pcko 20931 | si 124 Mbps | so  107 Mbps | coll 0 | mlti 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
> > > NET | p2p1 ---- | pcki  8565 | pcko  6443 | si 106 Mbps | so 7911 Kbps | coll 0 | mlti 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
> > > NET | lo   ---- | pcki   108 | pcko   108 | si   8 Kbps | so    8 Kbps | coll 0 | mlti 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
> > >
> > >   PID  RUID  EUID  THR  SYSCPU  USRCPU  VGROW  RGROW  RDDSK   WRDSK  ST  EXC  S  CPUNR  CPU  CMD
> > >  6881  root  root  538   0.74s   0.94s     0K   256K   1.0G  121.3M  --    -  S      3  17%  ceph-osd
> > > 28708  root  root  720   0.30s   0.69s   512K    -8K   160K  157.7M  --    -  S      3  10%  ceph-osd
> > > 31569  root  root  678   0.21s   0.30s   512K  -584K   156K  162.7M  --    -  S      0   5%  ceph-osd
> > > 32095  root  root  654   0.14s   0.16s     0K     0K    60K  105.9M  --    -  S      0   3%  ceph-osd
> > >    61  root  root    1   0.20s   0.00s     0K     0K     0K      0K  --    -  S      3   2%  kswapd0
> > > 10584  root  root    1   0.03s   0.02s   112K   112K     0K      0K  --    -  R      4   1%  atop
> > > 11618  root  root    1   0.03s   0.00s     0K     0K     0K      0K  --    -  S      6   0%  kworker/6:2
> > >    10  root  root    1   0.02s   0.00s     0K     0K     0K      0K  --    -  S      0   0%  rcu_sched
> > >    38  root  root    1   0.01s   0.00s     0K     0K     0K      0K  --    -  S      6   0%  ksoftirqd/6
> > >  1623  root  root    1   0.01s   0.00s     0K     0K     0K      0K  --    -  S      6   0%  kworker/6:1H
> > >  1993  root  root    1   0.01s   0.00s     0K     0K     0K      0K  --    -  S      2   0%  flush-8:48
> > >  2031  root  root    1   0.01s   0.00s     0K     0K     0K      0K  --    -  S      2   0%  flush-8:0
> > >  2032  root  root    1   0.01s   0.00s     0K     0K     0K      0K  --    -  S      0   0%  flush-8:16
> > >  2033  root  root    1   0.01s   0.00s     0K     0K     0K      0K  --    -  S      2   0%  flush-8:32
> > >  5787  root  root    1   0.01s   0.00s     0K     0K     4K      0K  --    -  S      3   0%  kworker/3:0
> > > 27605  root  root    1   0.01s   0.00s     0K     0K     0K      0K  --    -  S      1   0%  kworker/1:2
> > > 27823  root  root    1   0.01s   0.00s     0K     0K     0K      0K  --    -  S      0   0%  kworker/0:2
> > > 32511  root  root    1   0.01s   0.00s     0K     0K     0K      0K  --    -  S      2   0%  kworker/2:0
> > >  1536  root  root    1   0.00s   0.00s     0K     0K     0K      0K  --    -  S      2   0%  irqbalance
> > >   478  root  root    1   0.00s   0.00s     0K     0K     0K      0K  --    -  S      3   0%  usb-storage
> > >   494  root  root    1   0.00s   0.00s     0K     0K     0K      0K  --    -  S      1   0%  jbd2/sde1-8
> > >  1550  root  root    1   0.00s   0.00s     0K     0K   400K      0K  --    -  S      1   0%  xfsaild/sdb1
> > >  1750  root  root    1   0.00s   0.00s     0K     0K   128K      0K  --    -  S      2   0%  xfsaild/sdd1
> > >  1994  root  root    1   0.00s   0.00s     0K     0K     0K      0K  --    -  S      1   0%  flush-8:64
> > > ====
> > >
> > > I have tried to trim the SSD drives, but the problem seems to persist.
> > > Last time, trimming the SSD drives helped to improve the performance.
> > >
> > > Any advice is greatly appreciated.
> > >
> > > Thank you.
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> >


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
