On 9/23/20 12:18 PM, Mark Nelson wrote:
On 9/23/20 10:58 AM, vita...@yourcmc.ru wrote:
I have no idea how you get 66k write iops with one OSD )

I've just repeated a test by creating a test pool on one NVMe OSD with 8 PGs (all pinned to the same OSD with pg-upmap). Then I ran 4x fio randwrite q128 over 4 RBD images. I got 17k iops.

OK, in fact that's not the worst result for Ceph, but the problem is that I only get 30k write iops when benchmarking 4 RBD images spread over all OSDs _in_the_same_cluster_. And there are 14 of them.
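For reference, that kind of single-OSD test pool can be set up with something roughly like the following (pool/image names and OSD ids are placeholders, not the actual ones used here):

ceph osd pool create onedisk 8 8 replicated
ceph osd pool set onedisk size 1        # single copy; newer releases also want --yes-i-really-mean-it
ceph osd pool set onedisk min_size 1
# pin a PG to a particular OSD by remapping its current OSD (e.g. osd.3) to osd.0:
#   ceph osd pg-upmap-items <pgid> 3 0
rbd create onedisk/img0 --size 131072   # 128 GiB test image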

The key parts are the single OSD, the tuning, the fast HW, the massive aggregate IO depth, and the fact that it's a fresh OSD with only this dataset sitting on it.  Like I said, you can't get that when you go to a multi-node setup with replication.  The same hardware is only achieving about 19K IOPS per OSD before replication in an 8 node, 64 NVMe, 3x replication configuration.  Feel free to look at the data from the run yourself, though (2.5MB tgz file):

https://drive.google.com/file/d/1SuFoJtxAauKTBjgrV9XB00_M3Yr9LwW1/view?usp=sharing


In the results directory you'll see the 4 separate copies of fio doing 4k random writes, each to its own 128GB dataset on its own RBD volume via the librbd engine with iodepth=128:

[nhm@o10 id-f7e4b44f]$ grep "randwrite," output.*.localhost
output.0.localhost:cbt-librbd/o10.front.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.1.localhost:cbt-librbd/o10.front.sepia.ceph.com-1-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.2.localhost:cbt-librbd/o10.front.sepia.ceph.com-2-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.3.localhost:cbt-librbd/o10.front.sepia.ceph.com-3-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128

[nhm@o10 id-f7e4b44f]$ grep rbdname output.*.localhost
output.0.localhost:    "rbdname" : "o10.front.sepia.ceph.com-0",
output.1.localhost:    "rbdname" : "o10.front.sepia.ceph.com-1",
output.2.localhost:    "rbdname" : "o10.front.sepia.ceph.com-2",
output.3.localhost:    "rbdname" : "o10.front.sepia.ceph.com-3",

[nhm@o10 id-f7e4b44f]$ grep size output.*.localhost
output.0.localhost:    "size" : "131072M",
output.1.localhost:    "size" : "131072M",
output.2.localhost:    "size" : "131072M",
output.3.localhost:    "size" : "131072M",

[nhm@o10 id-f7e4b44f]$ grep IOPS output.*.localhost
output.0.localhost:  write: IOPS=15.5k, BW=60.5MiB/s (63.4MB/s)(17.7GiB/300003msec); 0 zone resets
output.1.localhost:  write: IOPS=15.7k, BW=61.2MiB/s (64.2MB/s)(17.9GiB/300005msec); 0 zone resets
output.2.localhost:  write: IOPS=15.6k, BW=60.9MiB/s (63.9MB/s)(17.8GiB/300005msec); 0 zone resets
output.3.localhost:  write: IOPS=19.1k, BW=74.5MiB/s (78.2MB/s)(21.8GiB/300006msec); 0 zone resets
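
A quick back-of-the-envelope check on the aggregate (not part of the run output): 15.5k + 15.7k + 15.6k + 19.1k ≈ 65.9k, i.e. the ~66K write IOPS figure in question.  Something like this sums it straight from the fio output files:

grep -o 'IOPS=[0-9.]*k' output.*.localhost | \
    awk -F= '{sub(/k/,"",$2); s+=$2} END {print s "k aggregate write IOPS"}'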


The timestamps are virtually identical:

[nhm@o10 id-f7e4b44f]$ grep timestamp_ms output.*.localhost
output.0.localhost:  "timestamp_ms" : 1570222502962,
output.1.localhost:  "timestamp_ms" : 1570222502965,
output.2.localhost:  "timestamp_ms" : 1570222502965,
output.3.localhost:  "timestamp_ms" : 1570222502959,

as is the job runtime:

output.0.localhost:      "job_runtime" : 300002,
output.1.localhost:      "job_runtime" : 300004,
output.2.localhost:      "job_runtime" : 300004,
output.3.localhost:      "job_runtime" : 300005,


We can perform some sanity checks by looking at collectl data from the run:


[nhm@o10 id-f7e4b44f]$ collectl -sD -oT -p collectl.localhost/o10-20191004-204957.raw.gz | grep -E "nvme0n1p1|#" | tail -n 21
# DISK STATISTICS (/sec)
# <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
20:54:45 nvme0n1p1        0      0    0    0     0  227780  55001 1944  117     1     117     7     1      0   39
20:54:46 nvme0n1p1        0      0    0    0     0  485056 117319 3945  123     4     122    38     4      0   45
20:54:47 nvme0n1p1        0      0    0    0     0  409144  98947 3339  123     7     122    52     7      0   42
20:54:48 nvme0n1p1        0      0    0    0     0  384940  93056 3179  121     2     121    14     2      0   51
20:54:49 nvme0n1p1        0      0    0    0     0  499152 120724 4064  123     4     122    28     4      0   55
20:54:50 nvme0n1p1        0      0    0    0     0  388560  93931 3209  121     4     121    26     4      0   52
20:54:51 nvme0n1p1        0      0    0    0     0  316304  76415 2661  119     5     118    26     5      0   51
20:54:52 nvme0n1p1        0      0    0    0     0  418876 107305 3443  122     5     121    30     5      0   52
20:54:53 nvme0n1p1        0      0    0    0     0  367420  82728 3098  119     4     118    26     4      0   50
20:54:54 nvme0n1p1        0      0    0    0     0  293444  70912 2449  120     5     119    33     5      0   41
20:54:55 nvme0n1p1        0      0    0    0     0  300680  72687 2483  121     6     121    39     6      0   38
20:54:56 nvme0n1p1        0      0    0    0     0  274404  66311 2290  120     2     119    10     2      0   40
20:54:57 nvme0n1p1        0      0    0    0     0  244220  58996 2059  119     5     118    28     5      0   39
20:54:58 nvme0n1p1        0      0    0    0     0  242116  58491 2038  119     6     118    31     6      0   37
20:54:59 nvme0n1p1        0      0    0    0     0  436208 105512 3540  123     5     123    40     5      0   44
20:55:00 nvme0n1p1        0      0    0    0     0  427420 103459 3474  123     7     123    54     7      0   42
20:55:01 nvme0n1p1        0      0    0    0     0  152428  36677 1352  113     0     112     1     0      0   36
20:55:02 nvme0n1p1        0      0    0    0     0  448740 108542 3643  123     5     123    45     5      0   44


So in this case the kernel block layer was able to play some tricks and merge a bunch of IOs, helping us get that nice juicy benchmark number (though these P4510 NVMe drives may have been able to handle more randomly scattered, non-merged 4K writes relatively well too).  We also have a number of IOs waiting in the device queue, meaning we are pushing the NVMe at least moderately hard.  Next we look at OSD process data over time:


[nhm@o10 id-f7e4b44f]$ collectl -sZ -oT -p collectl.localhost/o10-20191004-204957.raw.gz | awk 'NR==5 || /ceph-osd/' | head -n 10
#Time      PID  User     PR  PPID THRD S   VSZ   RSS CP  SysT UsrT Pct  AccuTime  RKB  WKB MajF MinF Command
20:50:08  7964  root     20     1   71 S    7G    6G 84 10.13 49.34 594  09:03.66    0    0    0  75K /usr/local/bin/ceph-osd
20:50:18  7964  root     20     1   71 S    7G    6G 84 15.26 84.28 995  10:43.21    0    0    0  42K /usr/local/bin/ceph-osd
20:50:28  7964  root     20     1   71 S    7G    6G 84 15.87 89.10 1049  12:28.18    0    0    0  59K /usr/local/bin/ceph-osd
20:50:38  7964  root     20     1   71 S    7G    6G 84 15.46 94.56 1100  14:18.20    0    0    0  66K /usr/local/bin/ceph-osd
20:50:48  7964  root     20     1   71 S    7G    6G 84 15.19 94.82 1100  16:08.21    0    0    0  76K /usr/local/bin/ceph-osd
20:50:58  7964  root     20     1   71 S    7G    6G 84 14.64 92.22 1068  17:55.07    0    0    0  52K /usr/local/bin/ceph-osd
20:51:08  7964  root     20     1   71 S    7G    6G 84 14.91 95.36 1102  19:45.34    0    0    0  55K /usr/local/bin/ceph-osd
20:51:18  7964  root     20     1   71 S    7G    6G 84 15.48 100.92 1164  21:41.74    0    0    0  64K /usr/local/bin/ceph-osd
20:51:28  7964  root     20     1   71 S    7G    6G 84 14.91 90.63 1055  23:27.28    0    0    0  53K /usr/local/bin/ceph-osd
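
As a cross-check on the Pct column, SysT and UsrT are CPU seconds consumed during each 10 second sample, so for example the 20:51:18 sample works out to:

echo "(15.48 + 100.92) / 10" | bc -l   # ~11.6 cores, consistent with Pct=1164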

The OSD process is consuming a little under 8GB of memory, which is reasonably on target.  CPU usage by the OSD stabilizes at roughly 10-12 cores (the Pct column).  The important point is that the OSD/bluestore machinery is perfectly happy consuming a ton of cores and driving ~66K IOPS worth of work at the underlying device, assuming the block layer and the device itself can handle it.  It used to be that the kv sync and onode cache trimming threads were much bigger bottlenecks, but we've improved both over the last couple of years.  Having said that, we're getting pretty close to the current limits of bluestore here.

Also, once you start involving the network and replication we introduce enough extra latency and contention in our stack that the OSD can no longer maintain these kinds of numbers, even if the device itself and bluestore can handle it.  That's where seastore, crimson, and future msgr work come in imho.

Finally, we can use cbt's OSD log parser to look at rocksdb compaction statistics.  This also probably explains at least some of the large IO workload we see in collectl, since we are spending nearly 1/3rd of the overall OSD uptime in compaction (and given that the test only ran for 300s of that uptime, rocksdb may be spending the majority of the actual test portion of the uptime performing universal compaction):


[nhm@o10 nvme_nvme_bs]$ ~/src/ceph/cbt/tools/ceph_rocksdb_log_parser.py ./osd.0.log | head -n 13
Compaction Statistics    ./osd.0.log
Total OSD Log Duration (seconds)    605.512
Number of Compaction Events    112
Avg Compaction Time (seconds)    1.68255873214
Total Compaction Time (seconds)    188.446578
Avg Output Size: (MB)    380.801597553
Total Output Size: (MB)    42649.7789259
Total Input Records    243648735
Total Output Records    174897784
Avg Output Throughput (MB/s)    250.561577229
Avg Input Records/second    1242038.3797
Avg Output Records/second    1059967.66123
Avg Output/Input Ratio    0.862494228971
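
That works out to roughly a third of the logged uptime:

echo "188.446578 / 605.512" | bc -l    # ~0.31, i.e. nearly 1/3 of the OSD log duration spent compacting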


Mark




Minor correction: nvme0n1p1 was a DB/WAL partition, while nvme1n1p2 is the block storage device where the 4k writes are happening and staying unmerged.  That makes way more sense, since those merged writes are all just the WAL and big SST writes.  Sorry, it's been a while since I looked at these numbers.  I.e., we are using a separate NVMe drive for the DB/WAL in this case.  In reality I don't recall it actually helping that much vs just putting everything on a single P4510, but in any event I wanted to update for clarity.
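
For anyone unfamiliar with that layout, a separate DB/WAL device is specified when the OSD is created, roughly like this (device paths here are placeholders, not the ones from this cluster):

ceph-volume lvm create --bluestore --data /dev/nvme1n1 --block.db /dev/nvme0n1p1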


Actual block partition writes:

[nhm@o10 id-f7e4b44f]$ collectl -sD -oT -p collectl.localhost/o10-20191004-204957.raw.gz | grep -E "nvme1n1p2|#" | tail -n 21
# DISK STATISTICS (/sec)
# <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
20:54:45 nvme1n1p2        0      0    0    0     0  261804      0 65K    4     0       3     0     0      0   89
20:54:46 nvme1n1p2        0      0    0    0     0  206916      0 51K    4     0       4     0     0      0   67
20:54:47 nvme1n1p2        0      0    0    0     0  218888      0 54K    4     0       4     0     0      0   75
20:54:48 nvme1n1p2        0      0    0    0     0  261032      0 65K    4     0       4     0     0      0   79
20:54:49 nvme1n1p2        0      0    0    0     0  237132      0 59K    4     0       4     0     0      0   70
20:54:50 nvme1n1p2        0      0    0    0     0  269404      0 67K    4     0       4     0     0      0   80
20:54:51 nvme1n1p2        0      0    0    0     0  280064      0 70K    4     0       4     0     0      0   82
20:54:52 nvme1n1p2        0      0    0    0     0  251956      0 62K    4     0       4     0     0      0   75
20:54:53 nvme1n1p2        0      0    0    0     0  267232      0 66K    4     0       4     0     0      0   82
20:54:54 nvme1n1p2        0      0    0    0     0  263884      0 65K    4     0       4     0     0      0   83
20:54:55 nvme1n1p2        0      0    0    0     0  245224      0 61K    4     0       4     0     0      0   85
20:54:56 nvme1n1p2        0      0    0    0     0  262472      0 65K    4     0       4     0     0      0   86
20:54:57 nvme1n1p2        0      0    0    0     0  270272      0 67K    4     0       4     0     0      0   89
20:54:58 nvme1n1p2        0      0    0    0     0  264256      0 66K    4     0       4     0     0      0   90
20:54:59 nvme1n1p2        0      0    0    0     0  226052      0 56K    4     0       4     0     0      0   75
20:55:00 nvme1n1p2        0      0    0    0     0  224352      0 56K    4     0       4     0     0      0   76
20:55:01 nvme1n1p2        0      0    0    0     0  287760      0 71K    4     0       4     0     0      0   96
20:55:02 nvme1n1p2        0      0    0    0     0  220128      0 55K    4     0       4     0     0      0   73

Mark
