Just run the same 32-threaded rados test as you did before, and this time
run atop while the test is running, looking for %busy of the cpus/disks. That
should give an idea of whether there is a bottleneck in them.
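
For example, a minimal sketch assuming the same scbench pool used in the
earlier test (run the bench from a client node and atop on each OSD node):

# client node: repeat the 32-thread 4k write test
rados bench -p scbench -b 4096 30 write -t 32

# on each OSD node, in a second terminal, sample every 2 seconds
atop 2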

On 2017-10-18 21:35, Russell Glaue wrote:

> I cannot run the write test reviewed at the
> ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device blog. The tests
> write directly to the raw disk device.
> Reading an infile (created with urandom) on one SSD and writing the outfile to
> another OSD drive yields about 17MB/s.
> But isn't this write speed limited by the speed at which the dd infile can be
> read?
> And I assume the best test should be run with no other load.
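> 
> One way to take the infile read speed out of the picture is a synchronous 4k
> write test with fio, which generates its own data; a rough sketch, with the
> device name as a placeholder (note it overwrites the target device):
> 
> fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
>     --numjobs=1 --iodepth=1 --runtime=60 --time_based \
>     --group_reporting --name=journal-test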
> 
> How does one run the rados bench "as stress"? 
> 
> -RG 
> 
> On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
> 
> measuring resource load as outlined earlier will show whether the drives are
> performing well or not. Also, how many OSDs do you have?
> 
> On 2017-10-18 19:26, Russell Glaue wrote: 
> The SSD drives are Crucial M500 
> A Ceph user did some benchmarks and found it had good performance 
> https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/ 
> [1] 
> 
> However, a user comment from 3 years ago on the blog post you linked to says 
> to avoid the Crucial M500 
> 
> Yet, this performance posting indicates that the Crucial M500 is good. 
> https://inside.servers.com/ssd-performance-2017-c4307a92dea [2] 
> 
> On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:
> 
> Check out the following link: some SSDs perform badly in Ceph due to sync 
> writes to the journal 
> 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>  [3] 
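> 
> The core of that test is a single-threaded direct+sync 4k write to the raw
> device, roughly along these lines (the device name is a placeholder and the
> second command destroys any data on it):
> 
> dd if=/dev/urandom of=randfile bs=1M count=1024
> dd if=randfile of=/dev/sdX bs=4k count=100000 oflag=direct,dsync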
> 
> Another thing that can help is to re-run the rados 32-thread test as stress and
> view resource usage using atop (or collectl/sar) to check %busy cpu and
> %busy disks, to give you an idea of what is holding down your cluster. For
> example: if cpu/disk % are all low, then check your network/switches. If disk
> %busy is high (90%) for all disks, then your disks are the bottleneck, which
> either means you have SSDs that are not suitable for Ceph or you have too few
> disks (which I doubt is the case). If only 1 disk's %busy is high, there may be
> something wrong with that disk and it should be removed.
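> 
> If atop is not handy, the standard sysstat tools give the same picture while
> the bench runs, for example:
> 
> sar -u 2        # CPU utilization, sampled every 2 seconds
> iostat -x 2     # per-disk %util, sampled every 2 seconds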
> 
> Maged
> 
> On 2017-10-18 18:13, Russell Glaue wrote: 
> 
> In my previous post, one of my points was wondering whether the request size
> would increase if I enabled jumbo frames; currently they are disabled.
> 
> @jdillama: The qemu settings for both these two guest machines, with RAID/LVM 
> and Ceph/rbd images, are the same. I am not thinking that changing the qemu 
> settings of "min_io_size=<limited to 16bits>,opt_io_size=<RBD image object 
> size>" will directly address the issue. 
> @mmokhtar: Ok. So you suggest the request size is the result of the problem
> and not the cause, meaning I should go after a different issue.
> 
> I have been trying to get write speeds up to what people on this mail list 
> are discussing. 
> It seems that, since our configuration matches others', we should be
> getting about 70MB/s write speed.
> But we are not getting that. 
> Single writes to disk are lucky to get 5MB/s to 6MB/s, but are typically 
> 1MB/s to 2MB/s. 
> Monitoring the entire Ceph cluster (using http://cephdash.crapworks.de/), I 
> have seen very rare momentary spikes up to 30MB/s. 
> 
> My storage network is connected via a 10Gb switch.
> I have 4 storage servers, each with an LSI Logic MegaRAID SAS 2208 controller.
> Each storage server has nine 1TB SSD drives; each drive is one OSD (no RAID).
> Each drive is one LVM volume group with two volumes: one volume for the OSD,
> one volume for the journal.
> 
> Each OSD is formatted with XFS.
> The CRUSH map is simple: default->rack->[host[1..4]->osd], with evenly
> distributed weights.
> The redundancy is triple replication.
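> 
> A quick way to double-check that layout and the replication level with
> standard ceph commands (the pool name is a placeholder):
> 
> ceph osd tree                   # should show 4 hosts x 9 osds with even weights
> ceph osd pool get <pool> size   # should report size 3 (triple replication)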
> 
> While I have read comments that having the OSD and journal on the same disk
> decreases write speed, I have also read that once past 8 OSDs per node this is
> the recommended configuration; it is also the reason why SSD drives are used
> exclusively for OSDs in the storage nodes.
> Nonetheless, I was still expecting write speeds to be above 30MB/s, not
> below 6MB/s.
> Even at 12x slower than the RAID, using my previously posted iostat data set, 
> I should be seeing write speeds that average 10MB/s, not 2MB/s. 
> 
> In regards to the rados benchmark tests you asked me to run, here is the 
> output: 
> 
> [centos7]# rados bench -p scbench -b 4096 30 write -t 1 
> Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up 
> to 30 seconds or 0 objects 
> Object prefix: benchmark_data_hamms.sys.cu.cait.org_85049 
> sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s) 
> 0       0         0         0         0         0           -           0 
> 1       1       201       200   0.78356   0.78125  0.00522307  0.00496574 
> 2       1       469       468  0.915303   1.04688  0.00437497  0.00426141 
> 3       1       741       740  0.964371    1.0625  0.00512853   0.0040434 
> 4       1       888       887  0.866739  0.574219  0.00307699  0.00450177 
> 5       1      1147      1146  0.895725   1.01172  0.00376454   0.0043559 
> 6       1      1325      1324  0.862293  0.695312  0.00459443    0.004525 
> 7       1      1494      1493   0.83339  0.660156  0.00461002  0.00458452 
> 8       1      1736      1735  0.847369  0.945312  0.00253971  0.00460458 
> 9       1      1998      1997  0.866922   1.02344  0.00236573  0.00450172 
> 10       1      2260      2259  0.882563   1.02344  0.00262179  0.00442152 
> 11       1      2526      2525  0.896775   1.03906  0.00336914  0.00435092 
> 12       1      2760      2759  0.898203  0.914062  0.00351827  0.00434491 
> 13       1      3016      3015  0.906025         1  0.00335703  0.00430691 
> 14       1      3257      3256  0.908545  0.941406  0.00332344  0.00429495 
> 15       1      3490      3489  0.908644  0.910156  0.00318815  0.00426387 
> 16       1      3728      3727  0.909952  0.929688   0.0032881  0.00428895 
> 17       1      3986      3985  0.915703   1.00781  0.00274809   0.0042614 
> 18       1      4250      4249  0.922116   1.03125  0.00287411  0.00423214 
> 19       1      4505      4504  0.926003  0.996094  0.00375435  0.00421442 
> 2017-10-18 10:56:31.267173 min lat: 0.00181259 max lat: 0.270553 avg lat: 
> 0.00420118 
> sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s) 
> 20       1      4757      4756  0.928915  0.984375  0.00463972  0.00420118 
> 21       1      5009      5008   0.93155  0.984375  0.00360065  0.00418937 
> 22       1      5235      5234  0.929329  0.882812  0.00626214    0.004199 
> 23       1      5500      5499  0.933925   1.03516  0.00466584  0.00417836 
> 24       1      5708      5707  0.928861    0.8125  0.00285727  0.00420146 
> 25       0      5964      5964  0.931858   1.00391  0.00417383   0.0041881 
> 26       1      6216      6215  0.933722  0.980469   0.0041009  0.00417915 
> 27       1      6481      6480  0.937474   1.03516  0.00307484  0.00416118 
> 28       1      6745      6744  0.940819   1.03125  0.00266329  0.00414777 
> 29       1      7003      7002  0.943124   1.00781  0.00305905  0.00413758 
> 30       1      7271      7270  0.946578   1.04688  0.00391017  0.00412238 
> Total time run:         30.006060 
> Total writes made:      7272 
> Write size:             4096 
> Object size:            4096 
> Bandwidth (MB/sec):     0.946684 
> Stddev Bandwidth:       0.123762 
> Max bandwidth (MB/sec): 1.0625 
> Min bandwidth (MB/sec): 0.574219 
> Average IOPS:           242 
> Stddev IOPS:            31 
> Max IOPS:               272 
> Min IOPS:               147 
> Average Latency(s):     0.00412247 
> Stddev Latency(s):      0.00648437 
> Max latency(s):         0.270553 
> Min latency(s):         0.00175318 
> Cleaning up (deleting benchmark objects) 
> Clean up completed and total clean up time :29.069423 
> 
> [centos7]# rados bench -p scbench -b 4096 30 write -t 32 
> Maintaining 32 concurrent writes of 4096 bytes to objects of size 4096 for up 
> to 30 seconds or 0 objects 
> Object prefix: benchmark_data_hamms.sys.cu.cait.org_86076 
> sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s) 
> 0       0         0         0         0         0           -           0 
> 1      32      3013      2981   11.6438   11.6445  0.00247906  0.00572026 
> 2      32      5349      5317   10.3834     9.125  0.00246662  0.00932016 
> 3      32      5707      5675    7.3883   1.39844  0.00389774   0.0156726 
> 4      32      5895      5863   5.72481  0.734375     1.13137   0.0167946 
> 5      32      6869      6837   5.34068   3.80469   0.0027652   0.0226577 
> 6      32      8901      8869   5.77306    7.9375   0.0053211   0.0216259 
> 7      32     10800     10768   6.00785   7.41797  0.00358187   0.0207418 
> 8      32     11825     11793   5.75728   4.00391  0.00217575   0.0215494 
> 9      32     12941     12909    5.6019   4.35938  0.00278512   0.0220567 
> 10      32     13317     13285   5.18849   1.46875   0.0034973   0.0240665 
> 11      32     16189     16157   5.73653   11.2188  0.00255841   0.0212708 
> 12      32     16749     16717   5.44077    2.1875  0.00330334   0.0215915 
> 13      32     16756     16724   5.02436 0.0273438  0.00338994    0.021849 
> 14      32     17908     17876   4.98686       4.5  0.00402598   0.0244568 
> 15      32     17936     17904   4.66171  0.109375  0.00375799   0.0245545 
> 16      32     18279     18247   4.45409   1.33984  0.00483873   0.0267929 
> 17      32     18372     18340   4.21346  0.363281  0.00505187   0.0275887 
> 18      32     19403     19371   4.20309   4.02734  0.00545154    0.029348 
> 19      31     19845     19814   4.07295   1.73047  0.00254726   0.0306775 
> 2017-10-18 10:57:58.160536 min lat: 0.0015005 max lat: 2.27707 avg lat: 
> 0.0307559 
> sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s) 
> 20      31     20401     20370   3.97788   2.17188  0.00307238   0.0307559 
> 21      32     21338     21306   3.96254   3.65625  0.00464563   0.0312288 
> 22      32     23057     23025    4.0876   6.71484  0.00296295   0.0299267 
> 23      32     23057     23025   3.90988         0           -   0.0299267 
> 24      32     23803     23771   3.86837   1.45703  0.00301471   0.0312804 
> 25      32     24112     24080   3.76191   1.20703  0.00191063   0.0331462 
> 26      31     25303     25272   3.79629   4.65625  0.00794399   0.0329129 
> 27      32     28803     28771   4.16183    13.668   0.0109817   0.0297469 
> 28      32     29592     29560   4.12325   3.08203  0.00188185   0.0301911 
> 29      32     30595     30563   4.11616   3.91797  0.00379099   0.0296794 
> 30      32     31031     30999   4.03572   1.70312  0.00283347   0.0302411 
> Total time run:         30.822350 
> Total writes made:      31032 
> Write size:             4096 
> Object size:            4096 
> Bandwidth (MB/sec):     3.93282 
> Stddev Bandwidth:       3.66265 
> Max bandwidth (MB/sec): 13.668 
> Min bandwidth (MB/sec): 0 
> Average IOPS:           1006 
> Stddev IOPS:            937 
> Max IOPS:               3499 
> Min IOPS:               0 
> Average Latency(s):     0.0317779 
> Stddev Latency(s):      0.164076 
> Max latency(s):         2.27707 
> Min latency(s):         0.0013848 
> Cleaning up (deleting benchmark objects) 
> Clean up completed and total clean up time :20.166559 
> 
> On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:
> 
> First a general comment: local RAID will be faster than Ceph for a single
> threaded (queue depth=1) io operation test. A single-threaded Ceph client will
> see at best the same speed as a single disk for reads, and writes 4-6 times
> slower than a single disk. Not to mention the latency of local disks will be
> much better. Where Ceph shines is when you have many concurrent ios: it scales,
> whereas RAID speed per client decreases as you add more clients.
> 
> Having said that, I would recommend running rados/rbd bench-write and
> measuring 4k iops at 1 and 32 threads to get a better idea of how your cluster
> performs:
> 
> ceph osd pool create testpool 256 256 
> rados bench -p testpool -b 4096 30 write -t 1
> rados bench -p testpool -b 4096 30 write -t 32 
> ceph osd pool delete testpool testpool --yes-i-really-really-mean-it 
> 
> rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern rand 
> --rbd_cache=false
> rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern rand 
> --rbd_cache=false 
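> 
> The rbd tests assume test-image already exists; if not, it can be created
> first, for example with a placeholder 10GB size (and referenced as
> testpool/test-image if it lives in the test pool):
> 
> rbd create testpool/test-image --size 10240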
> 
> I think the request size difference you see is due to the io scheduler: in the
> case of local disks it has more ios to re-group, so it has a better chance of
> generating larger requests. Depending on your kernel, the io scheduler may be
> different for rbd (blk-mq) vs sdx (cfq), but again I would think the request
> size is a result, not a cause.
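> 
> The active scheduler per block device can be checked via sysfs; the device
> names below are placeholders:
> 
> cat /sys/block/sda/queue/scheduler    # e.g. noop deadline [cfq]
> cat /sys/block/rbd0/queue/scheduler   # often none on blk-mq kernels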
> 
> Maged
> 
> On 2017-10-17 23:12, Russell Glaue wrote: 
> 
> I am running ceph jewel on 5 nodes with SSD OSDs. 
> I have an LVM image on a local RAID of spinning disks. 
> I have an RBD image in a pool of SSD disks.
> Both disks are used to run an almost identical CentOS 7 system. 
> Both systems were installed with the same kickstart, though the disk 
> partitioning is different. 
> 
> I want to make writes to the ceph image faster. For example, lots of
> writes to MySQL (via MySQL replication) on a ceph SSD image are about 10x
> slower than on a spindle RAID disk image. The MySQL server on the ceph rbd
> image has a hard time keeping up in replication.
> 
> So I wanted to test writes on these two systems 
> I have a 10GB compressed (gzip) file on both servers. 
> I simply gunzip the file on both systems, while running iostat. 
> 
> The primary difference I see in the results is the average size of the
> requests to the disk.
> CentOS7-lvm-raid-sata writes a lot faster to disk, and the size of the
> requests is about 40x larger, but the number of writes per second is about
> the same.
> This makes me want to conclude that the smaller request size on the
> CentOS7-ceph-rbd-ssd system is the cause of it being slow.
> 
> How can I make the size of the requests larger for ceph rbd images, so I can
> increase the write throughput?
> Would this be related to having jumbo frames enabled in my ceph storage
> network?
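> 
> A quick way to check the current MTU and the image's object size, with the
> interface and image names as placeholders:
> 
> ip link show eth2 | grep mtu    # mtu 1500 = jumbo frames off, 9000 = on
> rbd info <pool>/<image>         # shows object size and striping settings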
> 
> Here is a sample of the results: 
> 
> [CentOS7-lvm-raid-sata] 
> $ gunzip large10gFile.gz & 
> $ iostat -x vg_root-lv_var -d 5 -m -N 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> ...
> vg_root-lv_var     0.00     0.00   30.60  452.20    13.60   222.15  1000.04     8.69   14.05    0.99   14.93   2.07 100.04
> vg_root-lv_var     0.00     0.00   88.20  182.00    39.20    89.43   974.95     4.65    9.82    0.99   14.10   3.70 100.00
> vg_root-lv_var     0.00     0.00   75.45  278.24    33.53   136.70   985.73     4.36   33.26    1.34   41.91   0.59  20.84
> vg_root-lv_var     0.00     0.00  111.60  181.80    49.60    89.34   969.84     2.60    8.87    0.81   13.81   0.13   3.90
> vg_root-lv_var     0.00     0.00   68.40  109.60    30.40    53.63   966.87     1.51    8.46    0.84   13.22   0.80  14.16
> ... 
> 
> [CentOS7-ceph-rbd-ssd] 
> $ gunzip large10gFile.gz & 
> $ iostat -x vg_root-lv_data -d 5 -m -N 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> ...
> vg_root-lv_data    0.00     0.00   46.40  167.80     0.88     1.46    22.36     1.23    5.66    2.47    6.54   4.52  96.82
> vg_root-lv_data    0.00     0.00   16.60   55.20     0.36     0.14    14.44     0.99   13.91    9.12   15.36  13.71  98.46
> vg_root-lv_data    0.00     0.00   69.00  173.80     1.34     1.32    22.48     1.25    5.19    3.77    5.75   3.94  95.68
> vg_root-lv_data    0.00     0.00   74.40  293.40     1.37     1.47    15.83     1.22    3.31    2.06    3.63   2.54  93.26
> vg_root-lv_data    0.00     0.00   90.80  359.00     1.96     3.41    24.45     1.63    3.63    1.94    4.05   2.10  94.38
> ... 
> 
> [iostat key] 
> w/s == The number (after merges) of write requests completed per second for
> the device.
> wMB/s == The number of megabytes written to the device per second.
> avgrq-sz == The average size (in sectors) of the requests that were issued to
> the device.
> avgqu-sz == The average queue length of the requests that were issued to the
> device.
> 

  

Links:
------
[1] https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/
[2] https://inside.servers.com/ssd-performance-2017-c4307a92dea
[3] https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
