Thanks Mauricio! I confirmed that this upload is identical to 15.2.17-0ubuntu0.20.04.4 (except for the dep3 headers and the security rebase), so I'm accepting it on the basis of Steve's previous review, and because, based on the comments since, it seems the original patch was actually what was expected.
** Changed in: ceph (Ubuntu Focal)
       Status: In Progress => Fix Committed

** Tags added: verification-needed verification-needed-focal

-- 
You received this bug notification because you are a member of SE
("STS") Sponsors, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1978913

Title:
  [SRU] ceph-osd takes all memory at boot

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive ussuri series:
  Fix Committed
Status in Ubuntu Cloud Archive wallaby series:
  Invalid
Status in Ubuntu Cloud Archive xena series:
  Invalid
Status in Ubuntu Cloud Archive yoga series:
  Invalid
Status in ceph package in Ubuntu:
  Fix Released
Status in ceph source package in Bionic:
  New
Status in ceph source package in Focal:
  Fix Committed
Status in ceph source package in Jammy:
  Invalid
Status in ceph source package in Kinetic:
  Invalid

Bug description:
  [Impact]

  The OSD can fail to trim the pg log dup entries, which can leave a PG
  with millions of dup entries when it is supposed to hold at most 3000
  (controlled by the option osd_pg_log_dups_tracked). This can cause the
  OSD to run out of memory and crash, and it may then be unable to start
  up again because it has to load those millions of dup entries. This
  can happen to multiple OSDs at the same time (as also reported by many
  community users), so hitting this issue can leave the cluster
  completely unusable.

  The currently known trigger for this problem is pg split, because the
  whole set of dup entries is copied during the split. The reason this
  was rarely observed before is that pg autoscaling was not enabled by
  default; it is on by default since Octopus. Note that there is also no
  way to check the number of dups in a PG online.

  [Test Plan]

  To see the problem, follow this approach on a test cluster with e.g.
  3 OSDs:

  # ps -eaf | grep osd
  root 334891 1 0 Sep21 ? 00:42:03 /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 0 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf
  root 335541 1 0 Sep21 ? 00:40:20 /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 2 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf

  Kill all the OSDs, so they are down:

  root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ceph -s
  2022-09-22T08:26:15.120+0000 7fa9694fe700 -1 WARNING: all dangerous and experimental features are enabled.
  2022-09-22T08:26:15.140+0000 7fa963fff700 -1 WARNING: all dangerous and experimental features are enabled.
    cluster:
      id:     9e7c0a82-8072-4c48-b697-1e6399b4fc9e
      health: HEALTH_WARN
              2 osds down
              1 host (3 osds) down
              1 root (3 osds) down
              Reduced data availability: 169 pgs stale
              Degraded data redundancy: 255/765 objects degraded (33.333%), 64 pgs degraded, 169 pgs undersized

    services:
      mon: 3 daemons, quorum a,b,c (age 3s)
      mgr: x(active, since 28h)
      mds: a:1 {0=a=up:active}
      osd: 3 osds: 0 up (since 83m), 2 in (since 91m)
      rgw: 1 daemon active (8000)

    task status:

    data:
      pools:   7 pools, 169 pgs
      objects: 255 objects, 9.5 KiB
      usage:   4.1 GiB used, 198 GiB / 202 GiB avail
      pgs:     255/765 objects degraded (33.333%)
               105 stale+active+undersized
               64  stale+active+undersized+degraded

  Then inject dups into all OSDs using this json:

  root@nikhil-Lenovo-Legion-Y540-15IRH-PG0:/home/nikhil/HDD_MOUNT/Downloads/ceph_build_oct/ceph/build# cat bin/dups.json
  [
    {"reqid": "client.4177.0:0",
     "version": "3'0",
     "user_version": "0",
     "generate": "500000",
     "return_code": "0"}
  ]

  Use ceph-objectstore-tool with the --op pg-log-inject-dups parameter
  to inject the dups into all OSDs.
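To make the dups.json above easier to read: the single entry uses a "generate" count to ask the tool to synthesize many dup entries in one go. The sketch below expands such a spec into explicit entries the way I understand the injection to work; the helper name and the exact version numbering are illustrative assumptions, not taken from the Ceph source.

```python
import json

def expand_dups_spec(spec):
    """Expand a dups.json entry carrying a "generate" count into
    explicit dup entries with consecutive versions.  Illustrative
    only: the real expansion is done inside ceph-objectstore-tool."""
    epoch, _, start = spec["version"].partition("'")  # "3'0" -> ("3", "0")
    count = int(spec["generate"])
    return [
        {
            "reqid": spec["reqid"],
            "version": "%s'%d" % (epoch, int(start) + i),
            "user_version": spec["user_version"],
            "return_code": spec["return_code"],
        }
        for i in range(count)
    ]

spec = {"reqid": "client.4177.0:0", "version": "3'0",
        "user_version": "0", "generate": "5",  # 500000 in the real test plan
        "return_code": "0"}
print(json.dumps(expand_dups_spec(spec), indent=2))
```

With "generate": "500000", as in the test plan, this would yield half a million dup entries per PG, far beyond the 3000 tracked by default.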
  root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd0/ --op pg-log-inject-dups --file bin/dups.json --no-mon-config --pgid 2.1e
  root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd1/ --op pg-log-inject-dups --file bin/dups.json --no-mon-config --pgid 2.1e
  root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd2/ --op pg-log-inject-dups --file bin/dups.json --no-mon-config --pgid 2.1e

  Then set the osd debug level to 20: the log line that shows the actual
  trimming (https://github.com/ceph/ceph/pull/47046/commits/aada08acde7a05ad769bb7a886ebcece628d522c#diff-b293fb673637ea53b5874bbb04f8f0638ca39cab009610e2cbc40a867bca4906L138)
  is only emitted at debug_osd = 20. Set "debug osd=20" in the [global]
  section of ceph.conf:

  root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# cat ceph.conf | grep "debug osd"
  debug osd=20

  Then bring up the OSDs:

  /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 0 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf
  /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 1 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf
  /home/nikhil/Downloads/ceph_build_oct/ceph/build/bin/ceph-osd -i 2 -c /home/nikhil/Downloads/ceph_build_oct/ceph/build/ceph.conf

  Run some IO on the OSDs and wait at least a few hours.
  Then take the OSDs down (so the command below can be run), and run:

  root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build# ./bin/ceph-objectstore-tool --data-path dev/osd1/ --no-mon-config --pgid 2.1e --op log > op.log

  At the end of the output in the file op.log, you will see that the
  number of dups is still as it was when they were injected (no trimming
  has taken place):

            {
                "reqid": "client.4177.0:0",
                "version": "3'499999",
                "user_version": "0",
                "return_code": "0"
            },
            {
                "reqid": "client.4177.0:0",   <-- note the id (4177)
                "version": "3'500000",        <---
                "user_version": "0",
                "return_code": "0"
            }
        ]
    },
    "pg_missing_t": {
        "missing": [],
        "may_include_deletes": true
    }

  To verify the patch: with the patch in place, once the dups are
  injected, the output of

  ./bin/ceph-objectstore-tool --data-path dev/osd1/ --no-mon-config --pgid 2.1f --op log

  will again show the dups (as before, this command should be run with
  the OSDs down). Then bring up the OSDs and start IO using rbd
  bench-write, and leave the IO running for a few hours until these log
  lines (https://github.com/ceph/ceph/pull/47046/commits/aada08acde7a05ad769bb7a886ebcece628d522c#diff-b293fb673637ea53b5874bbb04f8f0638ca39cab009610e2cbc40a867bca4906L138)
  are seen in the osd logs, carrying the same client id (4177 in my
  example) that was used to inject the dups:

  root@focal-new:/home/nikhil/Downloads/ceph_build_oct/ceph/build/out# cat osd.1.log | grep -i "trim dup " | grep 4177 | more
  2022-09-26T10:30:53.125+0000 7fdb72741700  1 trim dup log_dup(reqid=client.4177.0:0 v=3'5 uv=0 rc=0)
  ...
  2022-09-26T10:30:53.125+0000 7fdb72741700  1 trim dup log_dup(reqid=client.4177.0:0 v=3'52 uv=0 rc=0)

  # grep -ri "trim dup " *.log | grep 4177 | wc -l
  390001

  Note that this count is the total of trimmed dup logs across all OSDs
  combined, so with 3 OSDs it should be roughly 3x the per-OSD trim
  count (dups trimmed up to 130001 in this example).
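Since there is no online way to count dups, a quick offline check of the JSON emitted by "--op log" can save eyeballing half a million entries. This is a sketch: the key names ("pg_log_t", "dups") follow the output excerpts above and may differ across Ceph versions.

```python
import json

def count_dups(op_log_text):
    """Count the dup entries in the JSON produced by
    'ceph-objectstore-tool ... --op log > op.log'.
    Assumes the dups live under pg_log_t -> dups, as in the
    excerpts shown in this test plan."""
    log = json.loads(op_log_text)
    return len(log.get("pg_log_t", {}).get("dups", []))

# A miniature op.log with the same shape as the excerpt above:
sample = json.dumps({
    "pg_log_t": {
        "dups": [
            {"reqid": "client.4177.0:0", "version": "3'499999",
             "user_version": "0", "return_code": "0"},
            {"reqid": "client.4177.0:0", "version": "3'500000",
             "user_version": "0", "return_code": "0"},
        ]
    },
    "pg_missing_t": {"missing": [], "may_include_deletes": True},
})
print(count_dups(sample))  # 2
```

On a real cluster this would be run against the op.log file, e.g. count_dups(open("op.log").read()), with the OSD down as described above.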
  And the output of

  ./bin/ceph-objectstore-tool --data-path dev/osd1/ --no-mon-config --pgid 2.1f --op log

  (you need to take the particular OSD down to verify this) will show
  that the first bunch of dups (130k in this example) has already been
  trimmed: see the "version" field, which now starts at 3'130001
  instead of 0:

    "dups": [
        {
            "reqid": "client.4177.0:0",
            "version": "3'130001",   <----
            "user_version": "0",
            "return_code": "0"
        },
        {
            "reqid": "client.4177.0:0",
            "version": "3'130002",
            "user_version": "0",
            "return_code": "0"
        },

  This verifies that the dups are being trimmed by the patch and that it
  is working correctly. And of course, the OSDs should not go OOM at
  boot time!

  [Where problems could occur]

  This is not a clean cherry-pick, due to some differences between the
  octopus and master codebases related to RocksDBStore and ObjectStore
  (see https://github.com/ceph/ceph/pull/47046#issuecomment-1243252126).
  Also, an earlier attempt to fix this issue upstream was reverted, as
  discussed at
  https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1978913/comments/1

  While this fix has been tested and validated after building it into
  the upstream 15.2.17 release (please see the [Test Plan] section), we
  would still need to proceed with extreme caution: allow some time for
  problems (if any) to surface before going ahead with this SRU, and run
  our QA tests on the packages that build this fix into the 15.2.17
  release before releasing it to the customers who await this fix on
  octopus.

  [Other Info]

  The way this is fixed is that PGLog now trims duplicates by the
  number of entries rather than by versions. That way, we prevent
  unbounded duplicate growth.
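The trim-by-count idea described above can be illustrated with a small sketch. This is not the actual C++ PGLog code, just a minimal model of the behaviour: however many dup entries a pg split copies in, trimming the oldest entries down to the tracked bound keeps memory usage bounded.

```python
from collections import deque

# Default for osd_pg_log_dups_tracked, per the [Impact] section above.
OSD_PG_LOG_DUPS_TRACKED = 3000

def trim_dups_by_count(dups, tracked=OSD_PG_LOG_DUPS_TRACKED):
    """Drop the oldest dup entries until at most `tracked` remain.
    Model only: real trimming happens inside PGLog during OSD IO."""
    dups = deque(dups)
    while len(dups) > tracked:
        dups.popleft()  # oldest entries sit at the front of the dup list
    return list(dups)

# 500000 injected dups (as in the test plan) collapse to the bound:
dups = [{"version": "3'%d" % v} for v in range(500000)]
trimmed = trim_dups_by_count(dups)
print(len(trimmed))           # 3000
print(trimmed[0]["version"])  # 3'497000
```

Trimming by version instead, as the pre-fix code did, leaves the count unbounded when the injected entries all share a version range the trimmer never reaches, which is what the 500000-entry op.log output above demonstrates.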
  Reported upstream at https://tracker.ceph.com/issues/53729 and fixed
  on master through https://github.com/ceph/ceph/pull/47046

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1978913/+subscriptions