In my previous post, one of the things I wondered was whether the request size would increase if I enabled jumbo frames; they are currently disabled on the storage network.
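If I do test jumbo frames, something like the following minimal check is what I have in mind (eth0 and <peer-storage-node> are placeholders for my storage interface and another storage node; the switch ports must also allow MTU 9000, and the setting would need to be made persistent in the ifcfg file):

[centos7]# ip link show eth0 | grep mtu
[centos7]# ip link set dev eth0 mtu 9000
[centos7]# ping -M do -s 8972 <peer-storage-node>

The ping payload is 8972 bytes because 9000 minus the 20-byte IP header and the 8-byte ICMP header is the largest unfragmented packet; -M do forbids fragmentation, so a successful reply means the whole path honors the larger MTU.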
@jdillama: The qemu settings for these two guest machines, with RAID/LVM and Ceph/rbd images, are the same. I do not think that changing the qemu settings of "min_io_size=<limited to 16bits>,opt_io_size=<RBD image object size>" will directly address the issue.

@mmokhtar: Ok. So you are suggesting the request size is a result of the problem and not its cause, meaning I should go after a different issue.

I have been trying to get write speeds up to what people on this mailing list are discussing. It seems that for our configuration, which matches others described here, we should be getting about 70MB/s write speed, but we are not. Single writes to disk are lucky to reach 5MB/s to 6MB/s, and are typically 1MB/s to 2MB/s. Monitoring the entire Ceph cluster (using http://cephdash.crapworks.de/), I have seen only very rare momentary spikes up to 30MB/s.

Our configuration:
The storage network is connected via a 10Gb switch.
There are 4 storage servers, each with an LSI Logic MegaRAID SAS 2208 controller.
Each storage server has 9 1TB SSD drives; each drive is one OSD (no RAID).
Each drive is one LVM volume group with two volumes: one for the OSD data, one for the journal.
Each OSD is formatted with xfs.
The crush map is simple: default->rack->[host[1..4]->osd] with an evenly distributed weight.
Redundancy is triple replication.

While I have read comments that having the OSD and journal on the same disk decreases write speed, I have also read that once past 8 OSDs per node this is the recommended configuration; it is also the reason why SSD drives are used exclusively for OSDs in the storage nodes. Nonetheless, I was still expecting write speeds above 30MB/s, not below 6MB/s. Even at 12x slower than the RAID, based on my previously posted iostat data set, I should be seeing write speeds that average 10MB/s, not 2MB/s.
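To rule out a single slow drive or controller slot, I can also bench each OSD daemon directly; this writes through the OSD and its journal but bypasses the network and client path entirely. A rough sketch, assuming my 36 OSDs are numbered 0 through 35 (ceph tell osd.N bench defaults to writing 1GB in 4MB chunks):

[centos7]# for i in $(seq 0 35); do echo "osd.$i"; ceph tell osd.$i bench; done

Any OSD whose bytes_per_sec comes back far below its siblings would point at that disk or its controller slot rather than at Ceph itself.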
In regards to the rados benchmark tests you asked me to run, here is the output:

[centos7]# rados bench -p scbench -b 4096 30 write -t 1
Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_hamms.sys.cu.cait.org_85049
sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
  0       0         0         0         0         0            -           0
  1       1       201       200   0.78356   0.78125   0.00522307  0.00496574
  2       1       469       468  0.915303   1.04688   0.00437497  0.00426141
  3       1       741       740  0.964371    1.0625   0.00512853   0.0040434
  4       1       888       887  0.866739  0.574219   0.00307699  0.00450177
  5       1      1147      1146  0.895725   1.01172   0.00376454   0.0043559
  6       1      1325      1324  0.862293  0.695312   0.00459443    0.004525
  7       1      1494      1493   0.83339  0.660156   0.00461002  0.00458452
  8       1      1736      1735  0.847369  0.945312   0.00253971  0.00460458
  9       1      1998      1997  0.866922   1.02344   0.00236573  0.00450172
 10       1      2260      2259  0.882563   1.02344   0.00262179  0.00442152
 11       1      2526      2525  0.896775   1.03906   0.00336914  0.00435092
 12       1      2760      2759  0.898203  0.914062   0.00351827  0.00434491
 13       1      3016      3015  0.906025         1   0.00335703  0.00430691
 14       1      3257      3256  0.908545  0.941406   0.00332344  0.00429495
 15       1      3490      3489  0.908644  0.910156   0.00318815  0.00426387
 16       1      3728      3727  0.909952  0.929688    0.0032881  0.00428895
 17       1      3986      3985  0.915703   1.00781   0.00274809   0.0042614
 18       1      4250      4249  0.922116   1.03125   0.00287411  0.00423214
 19       1      4505      4504  0.926003  0.996094   0.00375435  0.00421442
2017-10-18 10:56:31.267173 min lat: 0.00181259 max lat: 0.270553 avg lat: 0.00420118
sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
 20       1      4757      4756  0.928915  0.984375   0.00463972  0.00420118
 21       1      5009      5008   0.93155  0.984375   0.00360065  0.00418937
 22       1      5235      5234  0.929329  0.882812   0.00626214    0.004199
 23       1      5500      5499  0.933925   1.03516   0.00466584  0.00417836
 24       1      5708      5707  0.928861    0.8125   0.00285727  0.00420146
 25       0      5964      5964  0.931858   1.00391   0.00417383   0.0041881
 26       1      6216      6215  0.933722  0.980469    0.0041009  0.00417915
 27       1      6481      6480  0.937474   1.03516   0.00307484  0.00416118
 28       1      6745      6744  0.940819   1.03125   0.00266329  0.00414777
 29       1      7003      7002  0.943124   1.00781   0.00305905  0.00413758
 30       1      7271      7270  0.946578   1.04688   0.00391017  0.00412238
Total time run:         30.006060
Total writes made:      7272
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     0.946684
Stddev Bandwidth:       0.123762
Max bandwidth (MB/sec): 1.0625
Min bandwidth (MB/sec): 0.574219
Average IOPS:           242
Stddev IOPS:            31
Max IOPS:               272
Min IOPS:               147
Average Latency(s):     0.00412247
Stddev Latency(s):      0.00648437
Max latency(s):         0.270553
Min latency(s):         0.00175318
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time :29.069423
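One thing worth noting about this single-threaded run: at queue depth 1, bandwidth is just the inverse of per-write latency. 1 / 0.00412s is about 242 writes/s, and 242 x 4KiB is roughly 0.95MB/s, which matches the reported bandwidth almost exactly. So the ~1MB/s figure is a latency measurement, not a throughput ceiling.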
[centos7]# rados bench -p scbench -b 4096 30 write -t 32
Maintaining 32 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_hamms.sys.cu.cait.org_86076
sec Cur ops   started  finished  avg MB/s   cur MB/s  last lat(s)  avg lat(s)
  0       0         0         0         0          0            -           0
  1      32      3013      2981   11.6438    11.6445   0.00247906  0.00572026
  2      32      5349      5317   10.3834      9.125   0.00246662  0.00932016
  3      32      5707      5675    7.3883    1.39844   0.00389774   0.0156726
  4      32      5895      5863   5.72481   0.734375      1.13137   0.0167946
  5      32      6869      6837   5.34068    3.80469    0.0027652   0.0226577
  6      32      8901      8869   5.77306     7.9375    0.0053211   0.0216259
  7      32     10800     10768   6.00785    7.41797   0.00358187   0.0207418
  8      32     11825     11793   5.75728    4.00391   0.00217575   0.0215494
  9      32     12941     12909    5.6019    4.35938   0.00278512   0.0220567
 10      32     13317     13285   5.18849    1.46875    0.0034973   0.0240665
 11      32     16189     16157   5.73653    11.2188   0.00255841   0.0212708
 12      32     16749     16717   5.44077     2.1875   0.00330334   0.0215915
 13      32     16756     16724   5.02436  0.0273438   0.00338994    0.021849
 14      32     17908     17876   4.98686        4.5   0.00402598   0.0244568
 15      32     17936     17904   4.66171   0.109375   0.00375799   0.0245545
 16      32     18279     18247   4.45409    1.33984   0.00483873   0.0267929
 17      32     18372     18340   4.21346   0.363281   0.00505187   0.0275887
 18      32     19403     19371   4.20309    4.02734   0.00545154    0.029348
 19      31     19845     19814   4.07295    1.73047   0.00254726   0.0306775
2017-10-18 10:57:58.160536 min lat: 0.0015005 max lat: 2.27707 avg lat: 0.0307559
sec Cur ops   started  finished  avg MB/s   cur MB/s  last lat(s)  avg lat(s)
 20      31     20401     20370   3.97788    2.17188   0.00307238   0.0307559
 21      32     21338     21306   3.96254    3.65625   0.00464563   0.0312288
 22      32     23057     23025    4.0876    6.71484   0.00296295   0.0299267
 23      32     23057     23025   3.90988          0            -   0.0299267
 24      32     23803     23771   3.86837    1.45703   0.00301471   0.0312804
 25      32     24112     24080   3.76191    1.20703   0.00191063   0.0331462
 26      31     25303     25272   3.79629    4.65625   0.00794399   0.0329129
 27      32     28803     28771   4.16183     13.668    0.0109817   0.0297469
 28      32     29592     29560   4.12325    3.08203   0.00188185   0.0301911
 29      32     30595     30563   4.11616    3.91797   0.00379099   0.0296794
 30      32     31031     30999   4.03572    1.70312   0.00283347   0.0302411
Total time run:         30.822350
Total writes made:      31032
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     3.93282
Stddev Bandwidth:       3.66265
Max bandwidth (MB/sec): 13.668
Min bandwidth (MB/sec): 0
Average IOPS:           1006
Stddev IOPS:            937
Max IOPS:               3499
Min IOPS:               0
Average Latency(s):     0.0317779
Stddev Latency(s):      0.164076
Max latency(s):         2.27707
Min latency(s):         0.0013848
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time :20.166559

On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:
> First a general comment: local RAID will be faster than Ceph for a single
> threaded (queue depth=1) io operation test. A single thread Ceph client
> will see at best the same disk speed for reads, and for writes 4-6 times
> slower than a single disk. Not to mention the latency of local disks will
> be much better. Where Ceph shines is when you have many concurrent ios: it
> scales, whereas RAID will decrease speed per client as you add more.
>
> Having said that, I would recommend running rados/rbd bench-write and
> measuring 4k iops at 1 and 32 threads to get a better idea of how your
> cluster performs:
>
> ceph osd pool create testpool 256 256
> rados bench -p testpool -b 4096 30 write -t 1
> rados bench -p testpool -b 4096 30 write -t 32
> ceph osd pool delete testpool testpool --yes-i-really-really-mean-it
>
> rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern rand --rbd_cache=false
> rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern rand --rbd_cache=false
>
> I think the request size difference you see is due to the io scheduler in
> the case of local disks having more ios to re-group, so it has a better
> chance of generating larger requests. Depending on your kernel, the io
> scheduler may be different for rbd (blk-mq) vs sdx (cfq), but again I
> would think the request size is a result, not a cause.
>
> Maged
>
> On 2017-10-17 23:12, Russell Glaue wrote:
>
> I am running ceph jewel on 5 nodes with SSD OSDs.
> I have an LVM image on a local RAID of spinning disks.
> I have an RBD image in a pool of SSD disks.
> Both disks are used to run an almost identical CentOS 7 system.
> Both systems were installed with the same kickstart, though the disk
> partitioning is different.
>
> I want to make writes on the ceph image faster.
> For example, lots of writes to MySQL (via MySQL replication) on a ceph
> SSD image are about 10x slower than on a spindle RAID disk image. The
> MySQL server on the ceph rbd image has a hard time keeping up in
> replication.
>
> So I wanted to test writes on these two systems.
> I have a 10GB compressed (gzip) file on both servers.
> I simply gunzip the file on both systems, while running iostat.
>
> The primary difference I see in the results is the average size of the
> requests to the disk. CentOS7-lvm-raid-sata writes a lot faster to disk,
> and the size of its requests is about 40x larger, but the number of
> writes per second is about the same. This makes me want to conclude that
> the smaller request size on the CentOS7-ceph-rbd-ssd system is the cause
> of it being slow.
>
> How can I make the size of the request larger for ceph rbd images, so I
> can increase the write throughput? Would this be related to having jumbo
> packets enabled in my ceph storage network?
>
> Here is a sample of the results:
>
> [CentOS7-lvm-raid-sata]
> $ gunzip large10gFile.gz &
> $ iostat -x vg_root-lv_var -d 5 -m -N
> Device:  rrqm/s  wrqm/s    r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> ...
> vg_root-lv_var  0.00  0.00   30.60  452.20  13.60  222.15  1000.04  8.69  14.05  0.99  14.93  2.07  100.04
> vg_root-lv_var  0.00  0.00   88.20  182.00  39.20   89.43   974.95  4.65   9.82  0.99  14.10  3.70  100.00
> vg_root-lv_var  0.00  0.00   75.45  278.24  33.53  136.70   985.73  4.36  33.26  1.34  41.91  0.59   20.84
> vg_root-lv_var  0.00  0.00  111.60  181.80  49.60   89.34   969.84  2.60   8.87  0.81  13.81  0.13    3.90
> vg_root-lv_var  0.00  0.00   68.40  109.60  30.40   53.63   966.87  1.51   8.46  0.84  13.22  0.80   14.16
> ...
>
> [CentOS7-ceph-rbd-ssd]
> $ gunzip large10gFile.gz &
> $ iostat -x vg_root-lv_data -d 5 -m -N
> Device:  rrqm/s  wrqm/s    r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> ...
> vg_root-lv_data  0.00  0.00  46.40  167.80  0.88  1.46  22.36  1.23   5.66  2.47   6.54   4.52  96.82
> vg_root-lv_data  0.00  0.00  16.60   55.20  0.36  0.14  14.44  0.99  13.91  9.12  15.36  13.71  98.46
> vg_root-lv_data  0.00  0.00  69.00  173.80  1.34  1.32  22.48  1.25   5.19  3.77   5.75   3.94  95.68
> vg_root-lv_data  0.00  0.00  74.40  293.40  1.37  1.47  15.83  1.22   3.31  2.06   3.63   2.54  93.26
> vg_root-lv_data  0.00  0.00  90.80  359.00  1.96  3.41  24.45  1.63   3.63  1.94   4.05   2.10  94.38
> ...
>
> [iostat key]
> w/s == The number (after merges) of write requests completed per second for the device.
> wMB/s == The number of megabytes written to the device per second.
> avgrq-sz == The average size (in sectors) of the requests that were issued to the device.
> avgqu-sz == The average queue length of the requests that were issued to the device.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com