Hi Igor
Thanks for your answer. All the disks had latency warnings. "Had",
because I think the problem is solved.
After moving some data, and almost losing the nearfull NVMe pool because
one disk had so much latency that Ceph decided to mark it out, I could
start destroying and recreating each NVMe OSD.
I did this because the latency problem still existed even with an only
half-full pool. I'm now in the middle of recreating the OSDs one by one.
The old ones still have latency issues when compacting RocksDB, but
the new ones don't. So I hope the problem will be gone by tomorrow.
There is one difference between the old OSDs and the recreated ones. The
old ones were partitioned, and /var/lib/ceph/osd/ceph-1 was mounted from
the first partition, formatted as XFS.
Now they are LVM and /var/lib/ceph/osd/ceph-1 is tmpfs. I'm not yet
familiar enough with all the Ceph details to know why this changed or what
exactly the change is. Both old and new are BlueStore.
Cheers,
Raffael
On 29/07/2020 16:48, Igor Fedotov wrote:
Hi Raffael,
wondering if all OSDs are suffering from slow compaction or just the
one which is "near full"?
Do other OSDs have those "log_latency_fn slow operation observed for"
lines?
Have you tried "osd bench" command for your OSDs? Does it show similar
numbers for every OSD?
You might want to try manual offline DB compaction using
ceph-kvstore-tool. Any improvements after that?
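For reference, the offline-compaction step Igor describes can be sketched like this. It is a minimal sketch, assuming systemd-managed OSDs and the default data path; the function only prints the commands so they can be reviewed before running anything, and the OSD must actually be stopped before ceph-kvstore-tool touches its store. The bench, by contrast, runs against a live OSD (ceph tell osd.<id> bench).

```shell
# Print (not execute) the offline RocksDB compaction sequence for one
# OSD: stop it, compact its BlueStore key-value store, restart it.
# The data path /var/lib/ceph/osd/ceph-<id> is the default; adjust it
# if your cluster uses a different layout.
compact_cmds() {
  osd_id="$1"
  printf 'systemctl stop ceph-osd@%s\n' "$osd_id"
  printf 'ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-%s compact\n' "$osd_id"
  printf 'systemctl start ceph-osd@%s\n' "$osd_id"
}

compact_cmds 1
```

Doing this one OSD at a time keeps the rest of the cluster serving I/O while each store is compacted offline.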
Thanks,
Igor
On 7/29/2020 4:35 PM, Raffael Bachmann wrote:
Hi Mark
Unfortunately it is the production cluster and I don't have another
one :-(
This is the output of the log parser. I have nothing to compare it
to; stupid me has no more logs from before the upgrade.
python ceph_rocksdb_log_parser.py ceph-osd.1.log
Compaction Statistics ceph-osd.1.log
Total OSD Log Duration (seconds) 55500.457
Number of Compaction Events 13
Avg Compaction Time (seconds) 116.498074615
Total Compaction Time (seconds) 1514.47497
Avg Output Size: (MB) 422.757656391
Total Output Size: (MB) 5495.84953308
Total Input Records 21019590
Total Output Records 18093259
Avg Output Throughput (MB/s) 3.53010211372
Avg Input Records/second 17994.0419635
Avg Output Records/second 16449.9710169
Avg Output/Input Ratio 0.891530624966
ceph-osd.1.log
start_offset compaction_time_seconds output_level num_output_files total_output_size num_input_records num_output_records output (MB/s) input (r/s) output (r/s) output/input ratio
417.204 70.247058 1 5 261853019 1476689 1384444 3.55491754393 21021.3643396 19708.2132607 0.937532547476
546.271 128.652685 2 7 473883973 1674393 1098908 3.51279861751 13014.8313655 8541.66393807 0.656302313734
5761.795 60.460736 1 4 211033833 1041408 1013909 3.32873133441 17224.5339521 16769.7098494 0.973594402962
14912.985 64.958415 1 4 231336608 1316575 1249120 3.3963233477 20267.9668215 19229.5332329 0.948764787422
15152.316 238.925764 2 14 944635417 2445094 1902084 3.77052068592 10233.6975262 7960.98322825 0.77791855855
24607.857 53.022134 1 4 188414045 1029179 988116 3.38887973778 19410.36549 18635.915333 0.960101206884
31259.993 55.442826 1 4 210856392 1296725 1221474 3.62694941814 23388.5083708 22031.2362865 0.941968420444
31574.193 313.736584 2 18 1213247010 2928742 2359960 3.68794259867 9335.03502416 7522.10650703 0.805793067467
37708.375 49.78089 1 3 171888381 974097 939847 3.29294101107 19567.6895291 18879.6745096 0.96483923059
43219.745 51.798215 1 4 193360867 1246101 1172257 3.5600318014 24056.8328465 22631.2238752 0.940739956071
48041.751 56.559014 1 4 208216413 1451105 1367052 3.5108576209 25656.4762604 24170.3647804 0.942076555453
48368.403 325.833185 2 19 1289359869 3196156 2489088 3.77380036251 9809.17889011 7639.1482347 0.778775504074
52693.952 45.057464 1 3 164730093 943326 907000 3.48663339848 20936.0651101 20129.8501842 0.961491573433
cheers
Raffael
On 29/07/2020 15:19, Mark Nelson wrote:
Hi Raffael,
Adam made a PR this year that shards rocksdb data across different
column families to help reduce compaction overhead. The goal is to
reduce write-amplification during compaction by storing multiple
small LSM hierarchies rather than 1 big one. We've seen evidence
that this lowers compaction time and overhead, sometimes
significantly. That PR was merged to master on April 26th so I
don't believe it's in any of the releases yet but you can test it if
you have a non-production cluster available. That PR is here:
https://github.com/ceph/ceph/pull/34006
Normally though you should have about 1GB of WAL to absorb writes
during compaction and rocksdb automatically slows writes down if the
buffers start filling up. You should only see a write stall from
compaction if you completely fill all of the buffers. Also, you
shouldn't see compaction at one level blocking IO to the entire
database. Something seems off to me here.
If you have OSD logs, you can see a history of the compaction events
by running this script:
https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
That can give you an idea of how long your compaction events are
lasting and what they are doing.
Mark
On 7/29/20 7:52 AM, Raffael Bachmann wrote:
Hi All,
I'm kind of crossposting this from here:
https://forum.proxmox.com/threads/i-o-wait-after-upgrade-5-x-to-6-2-and-ceph-luminous-to-nautilus.73581/
But since I'm more and more sure that it's a ceph problem I'll try
my luck here.
Since updating from Luminous to Nautilus I have a big problem.
I have a 3-node cluster. Each node has 2 NVMe SSDs and a
10GBASE-T network for Ceph.
Every few minutes an OSD seems to compact its RocksDB. While doing
this it uses a lot of I/O and blocks.
This basically blocks the whole cluster, and no VM/container can
read data for some seconds (or minutes).
While it happens "iostat -x" looks like this:
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme0n1 0.00 2.00 0.00 24.00 0.00 46.00 0.00 95.83 0.00 0.00 0.00 0.00 12.00 2.00 0.40
nvme1n1 0.00 1495.00 0.00 3924.00 0.00 6099.00 0.00 80.31 0.00 352.39 523.78 0.00 2.62 0.67 100.00
And iotop:
Total DISK READ: 0.00 B/s | Total DISK WRITE: 1573.47 K/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 3.43 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
2306 be/4 ceph 0.00 B/s 1533.22 K/s 0.00 % 99.99 % ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph [rocksdb:low1]
In the ceph-osd log I see that rocksdb is compacting.
https://gist.github.com/qwasli/3bd0c7d535ee462feff8aaee618f3e08
The pool and one OSD are nearfull. I'd planned to move some data away
to another Ceph pool, but now I'm not sure anymore if I should stay
with Ceph.
I'll move some data away today anyway to see if that helps, but
before the upgrade there was the same amount of data and I didn't
have a problem.
Any hints to solve this are appreciated.
Cheers
Raffael
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io