Hi,

I built a Ceph 16.2.x (Pacific) cluster with relatively fast and modern
hardware, and its performance is kind of disappointing. I would very much
appreciate some advice and/or pointers :-)

The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:

2 x Intel(R) Xeon(R) Gold 5220R CPUs
384 GB RAM
2 x boot drives
2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
9 x Toshiba MG06SCA10TE 10 TB SAS HDDs, write cache off (storage tier; see
the note after this list)
2 x Intel XL710 NICs connected to a pair of 40/100GE switches
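
A note on the HDD write cache: it is disabled on all HDDs on all nodes. For
SAS drives that boils down to something like the following (device name
illustrative, not necessarily the exact tooling I used):

# check and clear the volatile write cache (WCE) bit on a SAS drive
sdparm --get WCE /dev/sdX
sdparm --clear WCE --save /dev/sdX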

All 3 nodes run Ubuntu 20.04 LTS with the latest 5.4 kernel; AppArmor is
disabled and energy-saving features are disabled. The network between the
Ceph nodes is 40G, the Ceph access network is 40G, and average latencies are
< 0.15 ms. I've personally tested the network for throughput, latency and
loss, and it operates as expected, with no issues at idle or under load.
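
For reference, those network checks were along these lines (hostnames
illustrative, not necessarily the exact commands I ran):

# throughput between two Ceph nodes
iperf3 -s                      # on ceph02
iperf3 -c ceph02 -P 4 -t 30    # on ceph01
# latency and packet loss
ping -c 1000 -i 0.01 ceph02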

The Ceph cluster is set up with 2 storage classes, NVMe and HDD. The 2
smaller NVMe drives in each node are used for DB/WAL, and each HDD is
allocated a DB/WAL partition on one of them.
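
In ceph-volume terms the per-OSD layout is equivalent to something like this
(device/partition names are illustrative, the actual deployment tooling may
differ):

# HDD OSD with its DB/WAL on an NVMe partition
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
# NVMe-only OSD for the nvme storage class
ceph-volume lvm create --bluestore --data /dev/nvme2n1
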
ceph osd tree output:

ID   CLASS  WEIGHT     TYPE NAME                STATUS  REWEIGHT  PRI-AFF
 -1         288.37488  root default
-13         288.37488      datacenter ste
-14         288.37488          rack rack01
 -7          96.12495              host ceph01
  0    hdd    9.38680                  osd.0        up   1.00000  1.00000
  1    hdd    9.38680                  osd.1        up   1.00000  1.00000
  2    hdd    9.38680                  osd.2        up   1.00000  1.00000
  3    hdd    9.38680                  osd.3        up   1.00000  1.00000
  4    hdd    9.38680                  osd.4        up   1.00000  1.00000
  5    hdd    9.38680                  osd.5        up   1.00000  1.00000
  6    hdd    9.38680                  osd.6        up   1.00000  1.00000
  7    hdd    9.38680                  osd.7        up   1.00000  1.00000
  8    hdd    9.38680                  osd.8        up   1.00000  1.00000
  9   nvme    5.82190                  osd.9        up   1.00000  1.00000
 10   nvme    5.82190                  osd.10       up   1.00000  1.00000
-10          96.12495              host ceph02
 11    hdd    9.38680                  osd.11       up   1.00000  1.00000
 12    hdd    9.38680                  osd.12       up   1.00000  1.00000
 13    hdd    9.38680                  osd.13       up   1.00000  1.00000
 14    hdd    9.38680                  osd.14       up   1.00000  1.00000
 15    hdd    9.38680                  osd.15       up   1.00000  1.00000
 16    hdd    9.38680                  osd.16       up   1.00000  1.00000
 17    hdd    9.38680                  osd.17       up   1.00000  1.00000
 18    hdd    9.38680                  osd.18       up   1.00000  1.00000
 19    hdd    9.38680                  osd.19       up   1.00000  1.00000
 20   nvme    5.82190                  osd.20       up   1.00000  1.00000
 21   nvme    5.82190                  osd.21       up   1.00000  1.00000
 -3          96.12495              host ceph03
 22    hdd    9.38680                  osd.22       up   1.00000  1.00000
 23    hdd    9.38680                  osd.23       up   1.00000  1.00000
 24    hdd    9.38680                  osd.24       up   1.00000  1.00000
 25    hdd    9.38680                  osd.25       up   1.00000  1.00000
 26    hdd    9.38680                  osd.26       up   1.00000  1.00000
 27    hdd    9.38680                  osd.27       up   1.00000  1.00000
 28    hdd    9.38680                  osd.28       up   1.00000  1.00000
 29    hdd    9.38680                  osd.29       up   1.00000  1.00000
 30    hdd    9.38680                  osd.30       up   1.00000  1.00000
 31   nvme    5.82190                  osd.31       up   1.00000  1.00000
 32   nvme    5.82190                  osd.32       up   1.00000  1.00000

ceph df:

--- RAW STORAGE ---
CLASS     SIZE    AVAIL    USED  RAW USED  %RAW USED
hdd    253 TiB  241 TiB  13 TiB    13 TiB       5.00
nvme    35 TiB   35 TiB  82 GiB    82 GiB       0.23
TOTAL  288 TiB  276 TiB  13 TiB    13 TiB       4.42

--- POOLS ---
POOL                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
images                 12  256   24 GiB    3.15k   73 GiB   0.03     76 TiB
volumes                13  256  839 GiB  232.16k  2.5 TiB   1.07     76 TiB
backups                14  256   31 GiB    8.56k   94 GiB   0.04     76 TiB
vms                    15  256  752 GiB  198.80k  2.2 TiB   0.96     76 TiB
device_health_metrics  16   32   35 MiB       39  106 MiB      0     76 TiB
volumes-nvme           17  256   28 GiB    7.21k   81 GiB   0.24     11 TiB
ec-volumes-meta        18  256   27 KiB        4   92 KiB      0     76 TiB
ec-volumes-data        19  256    8 KiB        1   12 KiB      0    152 TiB

Please disregard the ec-* pools, as they're not currently in use. All other
pools are configured with min_size=2, size=3. All pools are bound to the HDD
storage class except for 'volumes-nvme', which is bound to NVMe. The number
of PGs was increased recently: with the autoscaler I was getting a very
uneven PG distribution across devices, and we expect to add 3 more nodes of
exactly the same configuration in the coming weeks. I should emphasize that
I tested different PG counts and they didn't have a noticeable impact on
cluster performance.
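
For reference, the class binding is done with the usual device-class CRUSH
rules, something along these lines (rule names are illustrative):

# replicated rules restricted to a device class, failure domain = host
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd crush rule create-replicated replicated_nvme default host nvme
# bind the pools to those rules
ceph osd pool set volumes crush_rule replicated_hdd
ceph osd pool set volumes-nvme crush_rule replicated_nvme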

The main issue is that this beautiful cluster isn't very fast. When I test
against the 'volumes' pool, residing on the HDD storage class (HDDs with
DB/WAL on NVMe), I get unexpectedly low throughput numbers:

> rados -p volumes bench 30 write --no-cleanup
...
Total time run:         30.3078
Total writes made:      3731
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     492.415
Stddev Bandwidth:       161.777
Max bandwidth (MB/sec): 820
Min bandwidth (MB/sec): 204
Average IOPS:           123
Stddev IOPS:            40.4442
Max IOPS:               205
Min IOPS:               51
Average Latency(s):     0.129115
Stddev Latency(s):      0.143881
Max latency(s):         1.35669
Min latency(s):         0.0228179

> rados -p volumes bench 30 seq --no-cleanup
...
Total time run:       14.7272
Total reads made:     3731
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1013.36
Average IOPS:         253
Stddev IOPS:          63.8709
Max IOPS:             323
Min IOPS:             91
Average Latency(s):   0.0625202
Max latency(s):       0.551629
Min latency(s):       0.010683

On average, I get around 550 MB/s writes and 800 MB/s reads with 16 threads
and 4 MB blocks. These numbers don't look fantastic for this hardware. For
comparison, I can push over 8 GB/s of read throughput with fio (16 threads,
4 MB blocks) from an RBD client (a KVM Linux VM) connected over a
low-latency 40G network, probably hitting some OSD caches there:

   READ: bw=8525MiB/s (8939MB/s), 58.8MiB/s-1009MiB/s (61.7MB/s-1058MB/s),
io=501GiB (538GB), run=60001-60153msec
Disk stats (read/write):
  vdc: ios=48163/0, merge=6027/0, ticks=1400509/0, in_queue=1305092,
util=99.48%
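
The large-block fio job was of roughly this shape (a sketch; the exact
parameters may have differed slightly):

fio --name=ttt --ioengine=posixaio --rw=read --bs=4M --numjobs=16 \
    --size=4g --iodepth=1 --runtime=60 --time_based --group_reporting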

The issue manifests when the same client does something closer to real-life
usage, like a single-threaded write or read with 4 KB blocks, as it would
with, for example, an ext4 file system:

> fio --name=ttt --ioengine=posixaio --rw=write --bs=4k --numjobs=1 \
  --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
...
Run status group 0 (all jobs):
  WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s),
io=7694MiB (8067MB), run=64079-64079msec
Disk stats (read/write):
  vdc: ios=0/6985, merge=0/406, ticks=0/3062535, in_queue=3048216,
util=77.31%

> fio --name=ttt --ioengine=posixaio --rw=read --bs=4k --numjobs=1 \
  --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
...
Run status group 0 (all jobs):
   READ: bw=54.0MiB/s (56.7MB/s), 54.0MiB/s-54.0MiB/s (56.7MB/s-56.7MB/s),
io=3242MiB (3399MB), run=60001-60001msec
Disk stats (read/write):
  vdc: ios=12952/3, merge=0/1, ticks=81706/1, in_queue=56336, util=99.13%

And this is a total disaster: the IOPS look decent, but the bandwidth is
unexpectedly low. I just don't understand why a single RBD client writes at
only 120 MB/s (sometimes slower), and 50 MB/s reads look like a bad joke
¯\_(ツ)_/¯

When I run these benchmarks, nothing seems to be overloaded: CPU and network
are barely utilized, and OSD latencies don't show anything unusual. Thus I
am puzzled by these results, as in my opinion SAS HDDs with DB/WAL on NVMe
drives should produce better I/O bandwidth for both writes and reads. I
mean, I can easily get much better performance from a single HDD shared over
the network via NFS or iSCSI.
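
For what it's worth, those observations are based on the usual monitoring,
roughly this set of commands (representative, not exhaustive):

ceph -s
ceph osd perf                 # per-OSD commit/apply latencies
ceph osd pool stats volumes   # client I/O rates on the pool under test
iostat -x 5                   # device utilization on the OSD nodes
sar -n DEV 5                  # NIC throughput on the OSD nodes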

I am open to suggestions and would very much appreciate comments and/or
advice on how to improve the cluster's performance.

Best regards,
Zakhar
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
