Hi Igor
Thanks for your answer. All the disks had latency warnings. "Had",
because I think the problem is solved.
After moving some data, and almost losing the nearfull NVMe pool because
one disk had so much latency that Ceph decided to mark it out, I could
start destroying and recreating each NVMe OSD.
I did this because the latency problem still existed even with an only
half-full pool. I'm now in the middle of recreating the OSDs one by one.
The old ones still have latency issues when compacting RocksDB, but
the new ones don't. So I hope the problem will be gone by tomorrow.
There is one difference between the old OSDs and the recreated ones. The
old ones were partitioned, and /var/lib/ceph/osd/ceph-1 was mounted from
the first partition, formatted as XFS.
Now they are LVM and /var/lib/ceph/osd/ceph-1 is tmpfs. I'm not yet
familiar enough with all the Ceph details to know why this changed or what
exactly the change is. Both old and new are BlueStore.
Cheers,
Raffael
On 29/07/2020 16:48, Igor Fedotov wrote:
Hi Raffael,
wondering if all OSDs are suffering from slow compaction or just the
one which is "near full"?
Do other OSDs have those "log_latency_fn slow operation observed for"
lines?
Have you tried "osd bench" command for your OSDs? Does it show similar
numbers for every OSD?
You might want to try manual offline DB compaction using
ceph-kvstore-tool. Any improvements after that?
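For reference, the offline-compaction step Igor describes can be sketched like this. It is a minimal sketch, assuming systemd-managed OSDs and the default data path; the function only prints the commands so they can be reviewed before running anything, and the OSD must actually be stopped before ceph-kvstore-tool touches its store. The bench, by contrast, runs against a live OSD (ceph tell osd.<id> bench).

```shell
# Print (not execute) the offline RocksDB compaction sequence for one
# OSD: stop it, compact its BlueStore key-value store, restart it.
# The data path /var/lib/ceph/osd/ceph-<id> is the default; adjust it
# if your cluster uses a different layout.
compact_cmds() {
  osd_id="$1"
  printf 'systemctl stop ceph-osd@%s\n' "$osd_id"
  printf 'ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-%s compact\n' "$osd_id"
  printf 'systemctl start ceph-osd@%s\n' "$osd_id"
}

compact_cmds 1
```

Doing this one OSD at a time keeps the rest of the cluster serving I/O while each store is compacted offline.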
Thanks,
Igor
On 7/29/2020 4:35 PM, Raffael Bachmann wrote:
Hi Mark
Unfortunately it is the production cluster and I don't have another
one :-(
This is the output of the log parser. I have nothing to compare it
to; stupid me has no more logs from before the upgrade.
python ceph_rocksdb_log_parser.py ceph-osd.1.log
Compaction Statistics ceph-osd.1.log
Total OSD Log Duration (seconds) 55500.457
Number of Compaction Events 13
Avg Compaction Time (seconds) 116.498074615
Total Compaction Time (seconds) 1514.47497
Avg Output Size: (MB) 422.757656391
Total Output Size: (MB) 5495.84953308
Total Input Records 21019590
Total Output Records 18093259
Avg Output Throughput (MB/s) 3.53010211372
Avg Input Records/second 17994.0419635
Avg Output Records/second 16449.9710169
Avg Output/Input Ratio 0.891530624966
ceph-osd.1.log
start_offset compaction_time_seconds output_level num_output_files total_output_size num_input_records num_output_records output (MB/s) input (r/s) output (r/s) output/input ratio
417.204 70.247058 1 5 261853019 1476689 1384444 3.55491754393 21021.3643396 19708.2132607 0.937532547476
546.271 128.652685 2 7 473883973 1674393 1098908 3.51279861751 13014.8313655 8541.66393807 0.656302313734
5761.795 60.460736 1 4 211033833 1041408 1013909 3.32873133441 17224.5339521 16769.7098494 0.973594402962
14912.985 64.958415 1 4 231336608 1316575 1249120 3.3963233477 20267.9668215 19229.5332329 0.948764787422
15152.316 238.925764 2 14 944635417 2445094 1902084 3.77052068592 10233.6975262 7960.98322825 0.77791855855
24607.857 53.022134 1 4 188414045 1029179 988116 3.38887973778 19410.36549 18635.915333 0.960101206884
31259.993 55.442826 1 4 210856392 1296725 1221474 3.62694941814 23388.5083708 22031.2362865 0.941968420444
31574.193 313.736584 2 18 1213247010 2928742 2359960 3.68794259867 9335.03502416 7522.10650703 0.805793067467
37708.375 49.78089 1 3 171888381 974097 939847 3.29294101107 19567.6895291 18879.6745096 0.96483923059
43219.745 51.798215 1 4 193360867 1246101 1172257 3.5600318014 24056.8328465 22631.2238752 0.940739956071
48041.751 56.559014 1 4 208216413 1451105 1367052 3.5108576209 25656.4762604 24170.3647804 0.942076555453
48368.403 325.833185 2 19 1289359869 3196156 2489088 3.77380036251 9809.17889011 7639.1482347 0.778775504074
52693.952 45.057464 1 3 164730093 943326 907000 3.48663339848 20936.0651101 20129.8501842 0.961491573433
cheers
Raffael
On 29/07/2020 15:19, Mark Nelson wrote:
Hi Raffael,
Adam made a PR this year that shards rocksdb data across different
column families to help reduce compaction overhead. The goal is to
reduce write-amplification during compaction by storing multiple
small LSM hierarchies rather than 1 big one. We've seen evidence
that this lowers compaction time and overhead, sometimes
significantly. That PR was merged to master on April 26th so I
don't believe it's in any of the releases yet but you can test it if
you have a non-production cluster available. That PR is here:
https://github.com/ceph/ceph/pull/34006
Normally though you should have about 1GB of WAL to absorb writes
during compaction and rocksdb automatically slows writes down if the
buffers start filling up. You should only see a write stall from
compaction if you completely fill all of the buffers. Also, you
shouldn't see compaction at one level blocking IO to the entire
database. Something seems off to me here.
If you have OSD logs, you can see a history of the compaction events
by running this script:
https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
That can give you an idea of how long your compaction events are
lasting and what they are doing.
Mark
On 7/29/20 7:52 AM, Raffael Bachmann wrote:
Hi All,
I'm kind of crossposting this from here:
https://forum.proxmox.com/threads/i-o-wait-after-upgrade-5-x-to-6-2-and-ceph-luminous-to-nautilus.73581/
But since I'm more and more sure that it's a ceph problem I'll try
my luck here.
Since updating from Luminous to Nautilus I have a big problem.
I have a 3-node cluster. Each node has 2 NVMe SSDs and a
10GBASE-T network for Ceph.
Every few minutes an OSD seems to compact its RocksDB. While doing
this it uses a lot of I/O and blocks.
This basically blocks the whole cluster, and no VM/container can
read data for some seconds (or minutes).
While it happens "iostat -x" looks like this:
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme0n1 0.00 2.00 0.00 24.00 0.00 46.00 0.00 95.83 0.00 0.00 0.00 0.00 12.00 2.00 0.40
nvme1n1 0.00 1495.00 0.00 3924.00 0.00 6099.00 0.00 80.31 0.00 352.39 523.78 0.00 2.62 0.67 100.00
And iotop:
Total DISK READ: 0.00 B/s | Total DISK WRITE: 1573.47 K/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 3.43 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
2306 be/4 ceph 0.00 B/s 1533.22 K/s 0.00 % 99.99 % ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph [rocksdb:low1]
In the ceph-osd log I see that rocksdb is compacting.
https://gist.github.com/qwasli/3bd0c7d535ee462feff8aaee618f3e08
The pool and one OSD are nearfull. I'd planned to move some data away
to another Ceph pool, but now I'm not sure anymore if I should stay
with Ceph.
I'll move some data away today anyway to see if that helps, but
before the upgrade there was the same amount of data and I didn't
have a problem.
Any hints to solve this are appreciated.
Cheers
Raffael
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io