Hello all,

I'm facing poor performance on RBD images.

First, my lab's hardware consists of 3 Intel servers, each with:
- 2x Intel Xeon E5-2660 v4 (all power-saving features are disabled in the BIOS)
- Intel S2600TPR motherboard
- 256 GB RAM
- 4x SATA SSD Intel DC S3520 960 GB for OSDs
- 2x SATA SSD Intel DC S3520 480 GB for the OS
- 1x PCIe NVMe Intel DC P3700 800 GB for the writeback pool
- dual-port ixgbe 10 Gb/s NIC

All of this runs under CentOS 7.6 on kernel 4.14.15-1.el7.elrepo.x86_64. Network interfaces run in teaming.
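In case it matters, this is the kind of sanity check I run on each node to confirm the power-saving and teaming setup; the device names team0 and eth0 are placeholders for my interfaces:

# confirm every core is really running at the performance governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

# confirm both ports of the team are up and which runner is active
teamdctl team0 state

# confirm the ports negotiated 10 Gb/s
ethtool eth0 | grep Speed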
Each of these 3 servers acts as a mon host, OSD host, mgr host and RBD host.

ceph -s:

  cluster:
    id:     6dc5b328-f8be-4c52-96b7-d20a1f78b067
    health: HEALTH_WARN
            Failed to send data to Zabbix
            1548 slow ops, oldest one blocked for 63205 sec, mon.alfa-csn-03 has slow ops

  services:
    mon: 3 daemons, quorum alfa-csn-01,alfa-csn-02,alfa-csn-03
    mgr: alfa-csn-03(active), standbys: alfa-csn-02, alfa-csn-01
    osd: 27 osds: 27 up, 27 in
    rgw: 3 daemons active

  data:
    pools:   8 pools, 2592 pgs
    objects: 219.0 k objects, 810 GiB
    usage:   1.3 TiB used, 9.4 TiB / 11 TiB avail
    pgs:     2592 active+clean

I created 2 OSDs per SSD and use them to store data, plus 1 OSD on each NVMe for the write cache.

I also created an erasure profile:

crush-device-class=
crush-failure-domain=host
crush-root=default
k=2
m=1
plugin=isa
technique=reed_sol_van

and organized the pool `vmstor' under this profile with 1024 pg and pgp.

Here is the CRUSH rule for the `vmstor' pool:

rule vmstor {
        id 1
        type erasure
        min_size 3
        max_size 3
        step set_chooseleaf_tries 50
        step set_choose_tries 100
        step take data
        step chooseleaf indep 0 type host-data
        step emit
}

host-data alfa-csn-01-ssd {
        id -5           # do not change unnecessarily
        id -6 class ssd # do not change unnecessarily
        alg straw2
        hash 0          # rjenkins1
        item osd.0 weight 1.000
        item osd.1 weight 1.000
        item osd.2 weight 1.000
        item osd.3 weight 1.000
        item osd.4 weight 1.000
        item osd.5 weight 1.000
        item osd.6 weight 1.000
        item osd.7 weight 1.000
}
host-data alfa-csn-02-ssd {
        id -7           # do not change unnecessarily
        id -8 class ssd # do not change unnecessarily
        alg straw2
        hash 0          # rjenkins1
        item osd.8 weight 1.000
        item osd.9 weight 1.000
        item osd.10 weight 1.000
        item osd.11 weight 1.000
        item osd.12 weight 1.000
        item osd.13 weight 1.000
        item osd.14 weight 1.000
        item osd.15 weight 1.000
}
host-data alfa-csn-03-ssd {
        id -9            # do not change unnecessarily
        id -10 class ssd # do not change unnecessarily
        alg straw2
        hash 0           # rjenkins1
        item osd.16 weight 1.000
        item osd.17 weight 1.000
        item osd.18 weight 1.000
        item osd.19 weight 1.000
        item osd.20 weight 1.000
        item osd.21 weight 1.000
        item osd.22 weight 1.000
        item osd.23 weight 1.000
}

I also created a pool named `wb-vmstor' with 256 pg and pgp as a hot tier for `vmstor':

rule wb-vmstor {
        id 4
        type replicated
        min_size 2
        max_size 3
        step take wb
        step set_chooseleaf_tries 50
        step set_choose_tries 100
        step chooseleaf firstn 0 type host-wb
        step emit
}

Then the pool `vmstor' was initialized as an RBD pool (the tiering and init commands are sketched below, after the test results), and a few images were created in it. These images were attached as disks to 2 qemu-kvm virtual machines, 4 images per VM, using the native RBD support in QEMU. The QEMU hosts are the same class of server (but separate machines), i.e. Xeon E5-2660 v4, 256 GB RAM and so on.

Then fio tests were performed on these disks. Results:

1) When using these virtual drives as raw block devices I get about 400 IOPS on random write with 4 KB or 8 KB blocks (or any other block size up to 1 MB).
2) After I created filesystems on these drives and mounted them, I get about 20k IOPS. And it doesn't matter whether I run the test on a single VM or on both VMs - I get 20k IOPS in total.
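Coming back to the pool setup mentioned above: the cache tier and RBD initialization were done with roughly the standard sequence. The commands below are a sketch of that sequence rather than a verbatim transcript of what I ran:

ceph osd tier add vmstor wb-vmstor
ceph osd tier cache-mode wb-vmstor writeback
ceph osd tier set-overlay vmstor wb-vmstor
ceph osd pool set wb-vmstor hit_set_type bloom
rbd pool init vmstor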
I mean: I run the fio test on one VM and get 20k IOPS; then I run the fio test on 2 VMs and get 10k IOPS on each VM.

My fio job is (the raw block device variant used for result (1) is sketched at the end of this message):

[global]
numjobs=1
ioengine=libaio
buffered=0
direct=1
bs=8k
rw=randrw
rwmixread=0
iodepth=8
group_reporting=1
time_based=1

[vdb]
size=10G
directory=/mnt
filename=vdb

[vdc]
size=10G
directory=/mnt1
filename=vdc

[vdd]
size=10G
directory=/mnt2
filename=vdd

[vde]
size=10G
directory=/mnt3
filename=vde

[vdf]
size=10G
directory=/mnt4
filename=vdf

To my mind this result is not that good, and I believe this hardware and Ceph can deliver much more.

Please help me find out what I'm doing wrong.
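P.S. For reference, the raw block device runs from result (1) used essentially the same parameters, pointed at the devices instead of files. Roughly the job below; the device name is illustrative for one of my VM disks, and the runtime value is only an example:

[global]
numjobs=1
ioengine=libaio
buffered=0
direct=1
bs=8k
rw=randrw
rwmixread=0
iodepth=8
group_reporting=1
time_based=1
runtime=60      # example value, added here for illustration

[vdb-raw]
filename=/dev/vdb
size=10G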