On 9/23/20 12:18 PM, Mark Nelson wrote:
On 9/23/20 10:58 AM, vita...@yourcmc.ru wrote:
I have no idea how you get 66k write iops with one OSD )

I've just repeated a test by creating a test pool on one NVMe OSD with 8 PGs (all pinned to the same OSD with pg-upmap). Then I ran 4x fio randwrite q128 over 4 RBD images. I got 17k iops.

OK, in fact that's not the worst result for Ceph, but the problem is that I only get 30k write iops when benchmarking 4 RBD images spread over all OSDs _in_the_same_cluster_. And there are 14 of them.
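For reference, that kind of single-OSD test pool can be set up with something roughly like the following (pool/image names and OSD ids are placeholders, not the actual ones used here):

ceph osd pool create onedisk 8 8 replicated
ceph osd pool set onedisk size 1        # single copy; newer releases also want --yes-i-really-mean-it
ceph osd pool set onedisk min_size 1
# pin a PG to a particular OSD by remapping its current OSD (e.g. osd.3) to osd.0:
#   ceph osd pg-upmap-items <pgid> 3 0
rbd create onedisk/img0 --size 131072   # 128 GiB test image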

The key parts are the single OSD, the tuning, the fast HW, the massive aggregate IO depth, and the fact that it's a fresh OSD with only this dataset sitting on it.  Like I said, you can't get that when you go to a multi-node setup with replication.  The same hardware is only achieving about 19K IOPS per OSD before replication in an 8 node, 64 NVMe, 3x replication configuration.  Feel free to look at the data from the run yourself, though (2.5MB tgz file):

https://drive.google.com/file/d/1SuFoJtxAauKTBjgrV9XB00_M3Yr9LwW1/view?usp=sharing


In the results directory you'll see the 4 separate copies of fio doing 4k random writes, each to its own 128GB dataset on its own RBD volume via the librbd engine with iodepth=128:

[nhm@o10 id-f7e4b44f]$ grep "randwrite," output.*.localhost
output.0.localhost:cbt-librbd/o10.front.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.1.localhost:cbt-librbd/o10.front.sepia.ceph.com-1-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.2.localhost:cbt-librbd/o10.front.sepia.ceph.com-2-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.3.localhost:cbt-librbd/o10.front.sepia.ceph.com-3-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128

[nhm@o10 id-f7e4b44f]$ grep rbdname output.*.localhost
output.0.localhost:    "rbdname" : "o10.front.sepia.ceph.com-0",
output.1.localhost:    "rbdname" : "o10.front.sepia.ceph.com-1",
output.2.localhost:    "rbdname" : "o10.front.sepia.ceph.com-2",
output.3.localhost:    "rbdname" : "o10.front.sepia.ceph.com-3",

[nhm@o10 id-f7e4b44f]$ grep size output.*.localhost
output.0.localhost:    "size" : "131072M",
output.1.localhost:    "size" : "131072M",
output.2.localhost:    "size" : "131072M",
output.3.localhost:    "size" : "131072M",

[nhm@o10 id-f7e4b44f]$ grep IOPS output.*.localhost
output.0.localhost:  write: IOPS=15.5k, BW=60.5MiB/s (63.4MB/s)(17.7GiB/300003msec); 0 zone resets
output.1.localhost:  write: IOPS=15.7k, BW=61.2MiB/s (64.2MB/s)(17.9GiB/300005msec); 0 zone resets
output.2.localhost:  write: IOPS=15.6k, BW=60.9MiB/s (63.9MB/s)(17.8GiB/300005msec); 0 zone resets
output.3.localhost:  write: IOPS=19.1k, BW=74.5MiB/s (78.2MB/s)(21.8GiB/300006msec); 0 zone resets
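
A quick back-of-the-envelope check on the aggregate (not part of the run output): 15.5k + 15.7k + 15.6k + 19.1k ≈ 65.9k, i.e. the ~66K write IOPS figure in question.  Something like this sums it straight from the fio output files:

grep -o 'IOPS=[0-9.]*k' output.*.localhost | \
    awk -F= '{sub(/k/,"",$2); s+=$2} END {print s "k aggregate write IOPS"}'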


The timestamps are virtually identical:

[nhm@o10 id-f7e4b44f]$ grep timestamp_ms output.*.localhost
output.0.localhost:  "timestamp_ms" : 1570222502962,
output.1.localhost:  "timestamp_ms" : 1570222502965,
output.2.localhost:  "timestamp_ms" : 1570222502965,
output.3.localhost:  "timestamp_ms" : 1570222502959,

as is the job runtime:

output.0.localhost:      "job_runtime" : 300002,
output.1.localhost:      "job_runtime" : 300004,
output.2.localhost:      "job_runtime" : 300004,
output.3.localhost:      "job_runtime" : 300005,


We can perform some sanity checks by looking at collectl data from the run:


[nhm@o10 id-f7e4b44f]$ collectl -sD -oT -p collectl.localhost/o10-20191004-204957.raw.gz | grep -E "nvme0n1p1|#" | tail -n 21
# DISK STATISTICS (/sec)
# <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
20:54:45 nvme0n1p1        0      0    0    0     0  227780  55001 1944  117     1     117     7     1      0   39
20:54:46 nvme0n1p1        0      0    0    0     0  485056 117319 3945  123     4     122    38     4      0   45
20:54:47 nvme0n1p1        0      0    0    0     0  409144  98947 3339  123     7     122    52     7      0   42
20:54:48 nvme0n1p1        0      0    0    0     0  384940  93056 3179  121     2     121    14     2      0   51
20:54:49 nvme0n1p1        0      0    0    0     0  499152 120724 4064  123     4     122    28     4      0   55
20:54:50 nvme0n1p1        0      0    0    0     0  388560  93931 3209  121     4     121    26     4      0   52
20:54:51 nvme0n1p1        0      0    0    0     0  316304  76415 2661  119     5     118    26     5      0   51
20:54:52 nvme0n1p1        0      0    0    0     0  418876 107305 3443  122     5     121    30     5      0   52
20:54:53 nvme0n1p1        0      0    0    0     0  367420  82728 3098  119     4     118    26     4      0   50
20:54:54 nvme0n1p1        0      0    0    0     0  293444  70912 2449  120     5     119    33     5      0   41
20:54:55 nvme0n1p1        0      0    0    0     0  300680  72687 2483  121     6     121    39     6      0   38
20:54:56 nvme0n1p1        0      0    0    0     0  274404  66311 2290  120     2     119    10     2      0   40
20:54:57 nvme0n1p1        0      0    0    0     0  244220  58996 2059  119     5     118    28     5      0   39
20:54:58 nvme0n1p1        0      0    0    0     0  242116  58491 2038  119     6     118    31     6      0   37
20:54:59 nvme0n1p1        0      0    0    0     0  436208 105512 3540  123     5     123    40     5      0   44
20:55:00 nvme0n1p1        0      0    0    0     0  427420 103459 3474  123     7     123    54     7      0   42
20:55:01 nvme0n1p1        0      0    0    0     0  152428  36677 1352  113     0     112     1     0      0   36
20:55:02 nvme0n1p1        0      0    0    0     0  448740 108542 3643  123     5     123    45     5      0   44


So in this case the kernel block layer was able to play some tricks and merge a bunch of IOs, helping us get that nice juicy benchmark number (though these P4510 NVMe drives may have been able to handle more randomly scattered, non-merged 4K writes relatively well too).  We also have a number of IOs waiting in the device queue, meaning we are pushing the NVMe at least moderately hard.  Next we look at OSD process data over time:


[nhm@o10 id-f7e4b44f]$ collectl -sZ -oT -p collectl.localhost/o10-20191004-204957.raw.gz | awk 'NR==5 || /ceph-osd/' | head -n 10
#Time      PID  User     PR  PPID THRD S   VSZ   RSS CP  SysT UsrT Pct  AccuTime  RKB  WKB MajF MinF Command
20:50:08  7964  root     20     1   71 S    7G    6G 84 10.13 49.34 594  09:03.66    0    0    0  75K /usr/local/bin/ceph-osd
20:50:18  7964  root     20     1   71 S    7G    6G 84 15.26 84.28 995  10:43.21    0    0    0  42K /usr/local/bin/ceph-osd
20:50:28  7964  root     20     1   71 S    7G    6G 84 15.87 89.10 1049  12:28.18    0    0    0  59K /usr/local/bin/ceph-osd
20:50:38  7964  root     20     1   71 S    7G    6G 84 15.46 94.56 1100  14:18.20    0    0    0  66K /usr/local/bin/ceph-osd
20:50:48  7964  root     20     1   71 S    7G    6G 84 15.19 94.82 1100  16:08.21    0    0    0  76K /usr/local/bin/ceph-osd
20:50:58  7964  root     20     1   71 S    7G    6G 84 14.64 92.22 1068  17:55.07    0    0    0  52K /usr/local/bin/ceph-osd
20:51:08  7964  root     20     1   71 S    7G    6G 84 14.91 95.36 1102  19:45.34    0    0    0  55K /usr/local/bin/ceph-osd
20:51:18  7964  root     20     1   71 S    7G    6G 84 15.48 100.92 1164  21:41.74    0    0    0  64K /usr/local/bin/ceph-osd
20:51:28  7964  root     20     1   71 S    7G    6G 84 14.91 90.63 1055  23:27.28    0    0    0  53K /usr/local/bin/ceph-osd
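
As a cross-check on the Pct column, SysT and UsrT are CPU seconds consumed during each 10 second sample, so for example the 20:51:18 sample works out to:

echo "(15.48 + 100.92) / 10" | bc -l   # ~11.6 cores, consistent with Pct=1164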

The OSD process is consuming a little under 8GB of memory, which is reasonably on target.  CPU usage by the OSD stabilizes at roughly 10-12 cores (the Pct column).  The important point is that the OSD/bluestore machinery is perfectly happy consuming a ton of cores and driving ~66K IOPS worth of work at the underlying device, assuming the block layer and the device itself can handle it.  It used to be that the kv sync and onode cache trimming threads were much bigger bottlenecks, but we've improved both over the last couple of years.  Having said that, we're getting pretty close to the current limits of bluestore here.

Also, once you start involving the network and replication we introduce enough extra latency and contention in our stack that the OSD can no longer maintain these kinds of numbers, even if the device itself and bluestore can handle it.  That's where seastore, crimson, and future msgr work come in imho.

Finally, we can use cbt's OSD log parser to look at rocksdb compaction statistics.  This also probably explains at least some of the large IO workload we see in collectl, since we are spending nearly 1/3rd of the overall OSD uptime in compaction (and given that the test only ran for 300s of that uptime, rocksdb may be spending the majority of the actual test portion of the uptime performing universal compaction):


[nhm@o10 nvme_nvme_bs]$ ~/src/ceph/cbt/tools/ceph_rocksdb_log_parser.py ./osd.0.log | head -n 13
Compaction Statistics    ./osd.0.log
Total OSD Log Duration (seconds)    605.512
Number of Compaction Events    112
Avg Compaction Time (seconds)    1.68255873214
Total Compaction Time (seconds)    188.446578
Avg Output Size: (MB)    380.801597553
Total Output Size: (MB)    42649.7789259
Total Input Records    243648735
Total Output Records    174897784
Avg Output Throughput (MB/s)    250.561577229
Avg Input Records/second    1242038.3797
Avg Output Records/second    1059967.66123
Avg Output/Input Ratio    0.862494228971
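
That works out to roughly a third of the logged uptime:

echo "188.446578 / 605.512" | bc -l    # ~0.31, i.e. nearly 1/3 of the OSD log duration spent compacting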


Mark




Minor correction: nvme0n1p1 was a DB/WAL partition, while nvme1n1p2 is the block storage device where the 4k writes are happening and staying unmerged.  That makes way more sense, since those merged writes are all just the WAL and big SST writes.  Sorry, it's been a while since I looked at these numbers.  I.e., we are using a separate NVMe drive for the DB/WAL in this case.  In reality I don't recall it actually helping that much vs just putting everything on a single P4510, but in any event I wanted to update for clarity.
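
For anyone unfamiliar with that layout, a separate DB/WAL device is specified when the OSD is created, roughly like this (device paths here are placeholders, not the ones from this cluster):

ceph-volume lvm create --bluestore --data /dev/nvme1n1 --block.db /dev/nvme0n1p1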


Actual block partition writes:

[nhm@o10 id-f7e4b44f]$ collectl -sD -oT -p collectl.localhost/o10-20191004-204957.raw.gz | grep -E "nvme1n1p2|#" | tail -n 21
# DISK STATISTICS (/sec)
# <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
20:54:45 nvme1n1p2        0      0    0    0     0  261804      0 65K    4     0       3     0     0      0   89
20:54:46 nvme1n1p2        0      0    0    0     0  206916      0 51K    4     0       4     0     0      0   67
20:54:47 nvme1n1p2        0      0    0    0     0  218888      0 54K    4     0       4     0     0      0   75
20:54:48 nvme1n1p2        0      0    0    0     0  261032      0 65K    4     0       4     0     0      0   79
20:54:49 nvme1n1p2        0      0    0    0     0  237132      0 59K    4     0       4     0     0      0   70
20:54:50 nvme1n1p2        0      0    0    0     0  269404      0 67K    4     0       4     0     0      0   80
20:54:51 nvme1n1p2        0      0    0    0     0  280064      0 70K    4     0       4     0     0      0   82
20:54:52 nvme1n1p2        0      0    0    0     0  251956      0 62K    4     0       4     0     0      0   75
20:54:53 nvme1n1p2        0      0    0    0     0  267232      0 66K    4     0       4     0     0      0   82
20:54:54 nvme1n1p2        0      0    0    0     0  263884      0 65K    4     0       4     0     0      0   83
20:54:55 nvme1n1p2        0      0    0    0     0  245224      0 61K    4     0       4     0     0      0   85
20:54:56 nvme1n1p2        0      0    0    0     0  262472      0 65K    4     0       4     0     0      0   86
20:54:57 nvme1n1p2        0      0    0    0     0  270272      0 67K    4     0       4     0     0      0   89
20:54:58 nvme1n1p2        0      0    0    0     0  264256      0 66K    4     0       4     0     0      0   90
20:54:59 nvme1n1p2        0      0    0    0     0  226052      0 56K    4     0       4     0     0      0   75
20:55:00 nvme1n1p2        0      0    0    0     0  224352      0 56K    4     0       4     0     0      0   76
20:55:01 nvme1n1p2        0      0    0    0     0  287760      0 71K    4     0       4     0     0      0   96
20:55:02 nvme1n1p2        0      0    0    0     0  220128      0 55K    4     0       4     0     0      0   73

Mark
