[ceph-users] Re: High io wait when osd rocksdb is compacting

Raffael Bachmann Wed, 29 Jul 2020 17:48:24 -0700

Hi Mark

I think its 15 hours not 15 days. But the compaction time seems reallyto be slow. I' destroying and recreating all nvme osds one by one. Andthe recreated ones don't have latency problems and are also much fastercompacting the disk.


This is since two hours:
Compaction Statistics   /var/log/ceph/ceph-osd.0.log
Total OSD Log Duration (seconds)        7909.104
Number of Compaction Events     11
Avg Compaction Time (seconds)   1.26702554545
Total Compaction Time (seconds) 13.937281
Avg Output Size: (MB)   268.282840729
Total Output Size: (MB) 2951.11124802
Total Input Records     7693669
Total Output Records    7670229
Avg Output Throughput (MB/s)    225.29745104
Avg Input Records/second        533134.954087
Avg Output Records/second       531386.725197
Avg Output/Input Ratio  0.996558805862

Not sure if you are interested in the answers to your questions anymorebut:DBs per drive: I think one? (Not yet familiar with all the ceph details.It was "just" a normal osd. the full disk, bluestore including db)Workload: Rather low. Having three node proxmox/ceph cluster is just fornot have a single point of failure. The cpus and disks are mostly bored.

In iostat the aqu-sz value is getting to about 520 when this occours

There is one difference between the old osds and the recreated ones. Theold ones were partitioned and the mount /var/lib/ceph/osd/ceph-1 was thefirst partition as xfs.Now they are lvm and /var/lib/ceph/osd/ceph-1 is tmpfs. Both, old andnew, are bluestore.

I'm still in the middle recreating one by one. Luckily it's not apetabyte cluster with thousand of disks ;-)

Anyway, thanks everyone for answering and helping so fast. Having amailing list this active is really nice.

Cheers,
Raffael

On 29/07/2020 16:53, Mark Nelson wrote:

Wow, that's crazy. You only had 13 compaction events for that OSDover roughly 15 days but the average compaction time was 116 seconds!Notice too though that the average compaction output size is 422MBwith an average output throughput of 3.5MB! That's really slow withRocksDB sitting on an NVMe drive. You are only processing about 16Krecords/second.
Here are some of the results from our internal NVMe (Intel P4510) testcluster looking at Sharded vs Unsharded rocksdb. This was based onmaster from last fall so figure it's about halfway between Nautilusand Octopus. These results are not exactly comparable to yours sincewe're using some experimental settings, but your compaction eventslook like they are orders of magnitude slower.
https://docs.google.com/spreadsheets/d/1FYFBxwvE1i28AKoLyqrksHptE1Z523NU3Fag0MELTQo/edit?usp=sharing
No wonder you are seeing periodic stalls. How many DBs per NVMedrive? What's your cluster workload typically like? Also, can you seeif the NVMe drive aqu-sz is getting big waiting for the requests to beserviced?
Mark


On 7/29/20 8:35 AM, Raffael Bachmann wrote:
Hi Mark
Unfortunately it is the production cluster and I don't have anotherone :-(
This is the output of the log parser. I'have nothing to compare themto. Stupid me has no more logs from before the upgrade.
python ceph_rocksdb_log_parser.py ceph-osd.1.log
Compaction Statistics   ceph-osd.1.log
Total OSD Log Duration (seconds)        55500.457
Number of Compaction Events     13
Avg Compaction Time (seconds)   116.498074615
Total Compaction Time (seconds) 1514.47497
Avg Output Size: (MB)   422.757656391
Total Output Size: (MB) 5495.84953308
Total Input Records     21019590
Total Output Records    18093259
Avg Output Throughput (MB/s)    3.53010211372
Avg Input Records/second        17994.0419635
Avg Output Records/second       16449.9710169
Avg Output/Input Ratio  0.891530624966

ceph-osd.1.log
start_offset compaction_time_seconds output_levelnum_output_files total_output_size num_input_recordsnum_output_records output (MB/s) input (r/s) output (r/s) output/input ratio417.204 70.247058 1 5 261853019 14766891384444 3.55491754393 21021.3643396 19708.2132607 0.937532547476546.271 128.652685 2 7 473883973 16743931098908 3.51279861751 13014.8313655 8541.66393807 0.6563023137345761.795 60.460736 1 4 211033833 10414081013909 3.32873133441 17224.5339521 16769.7098494 0.97359440296214912.985 64.958415 1 4 231336608 13165751249120 3.3963233477 20267.9668215 19229.5332329 0.94876478742215152.316 238.925764 2 14 944635417 24450941902084 3.77052068592 10233.6975262 7960.98322825 0.7779185585524607.857 53.022134 1 4 188414045 1029179988116 3.38887973778 19410.36549 18635.915333 0.96010120688431259.993 55.442826 1 4 210856392 12967251221474 3.62694941814 23388.5083708 22031.2362865 0.94196842044431574.193 313.736584 2 18 1213247010 29287422359960 3.68794259867 9335.03502416 7522.10650703 0.80579306746737708.375 49.78089 1 3 171888381 974097939847 3.29294101107 19567.6895291 18879.6745096 0.9648392305943219.745 51.798215 1 4 193360867 12461011172257 3.5600318014 24056.8328465 22631.2238752 0.94073995607148041.751 56.559014 1 4 208216413 14511051367052 3.5108576209 25656.4762604 24170.3647804 0.94207655545348368.403 325.833185 2 19 1289359869 31961562489088 3.77380036251 9809.17889011 7639.1482347 0.77877550407452693.952 45.057464 1 3 164730093 943326907000 3.48663339848 20936.0651101 20129.8501842 0.961491573433
cheers
Raffael


On 29/07/2020 15:19, Mark Nelson wrote:
Hi Raffael,
Adam made a PR this year that shards rocksdb data across differentcolumn families to help reduce compaction overhead. The goal is toreduce write-amplification during compaction by storing multiplesmall LSM hierarchies rather than 1 big one. We've seen evidencethat this lowers compaction time and overhead, sometimessignificantly. That PR was merged to master on April 26th so Idon't believe it's in any of the releases yet but you can test it ifyou have a non-production cluster available. That PR is here:
https://github.com/ceph/ceph/pull/34006
Normally though you should have about 1GB of WAL to absorb writesduring compaction and rocksdb automatically slows writes down if thebuffers start filling up. You should only see a write stall fromcompaction if you completely fill all of the buffers. Also, youshouldn't see compaction at one level blocking IO to the entiredatabase. Something seems off to me here.
If you have OSD logs, you can see a history of the compaction eventsby running this script:
https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
That can give you an idea of how long your compaction events arelasting and what they are doing.
Mark


On 7/29/20 7:52 AM, Raffael Bachmann wrote:
Hi All,
I'm kind of crossposting this from here:https://forum.proxmox.com/threads/i-o-wait-after-upgrade-5-x-to-6-2-and-ceph-luminous-to-nautilus.73581/But since I'm more and more sure that it's a ceph problem I'll trymy luck here.
Since updating from Luminous to Nautilus I have a big problem.
I have a 3 node cluster. Each cluster has 2 nvme ssd and a10GBASE-T net for ceph.Every few minutes a osd seems to compact the rocksdb. While doingthis it uses alot of I/O and blocks.This basically blocks the whole cluster and no VM/Container canread data for some seconds (minutes).
While it happens "iostat -x" looks like this:
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %utilnvme0n1 0.00 2.00 0.00 24.00 0.00 46.00 0.00 95.83 0.00 0.00 0.00 0.00 12.00 2.00 0.40nvme1n1 0.00 1495.00 0.00 3924.00 0.006099.00 0.00 80.31 0.00 352.39 523.78 0.00 2.62 0.67 100.00
And iotop:

Total DISK READ:         0.00 B/s | Total DISK WRITE: 1573.47 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE: 3.43 M/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN IO> COMMAND
2306 be/4 ceph 0.00 B/s 1533.22 K/s 0.00 % 99.99 %ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph[rocksdb:low1]
In the ceph-osd log I see that rocksdb is compacting.https://gist.github.com/qwasli/3bd0c7d535ee462feff8aaee618f3e08
The pool and one OSD is nearfull. I'd planed to move some data awayto another ceph pool. But now I'm not sure anymore if I should gowith ceph.I'l move some data away anyway today to see if that helps, butbefore the upgrade there was the same amount of data an I haven'thad a problem.
Any hints to solve this are appreciated.

Cheers
Raffael
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: High io wait when osd rocksdb is compacting

Reply via email to