Hi Everyone,

This email was originally posted to d...@ceph.io, but Marc mentioned that he thought this would be useful to post on the user list so I'm re-posting here as well.

David Orman mentioned in the CLT meeting this morning that there are a number of people on the mailing list asking about performance regressions in Pacific+ vs older releases.  I want to document a couple of the bigger ones that we know about for the community's benefit.  I want to be clear that Pacific does have a number of performance improvements over previous releases, and we do have tests showing improvement relative to nautilus (especially RBD on NVMe drives).  Some of these regressions are going to have a bigger effect for some users than others.  Having said that, let's get into them.


********** Regression #1: RocksDB Log File Recycling **********

Effects: More metadata updates to the underlying FS, higher write amplification (observed by DigitalOcean), and slower performance, especially when the WAL device is saturated.


When BlueStore was created back in 2015, Sage implemented an optimization in RocksDB that allowed WAL log files to be recycled. The idea is that instead of deleting logs when they are flushed, RocksDB can simply reuse them.  The benefit here is that records can be written and fdatasync can be called without touching the inode for every IO.  Sage did a pretty good job of explaining the benefit in the PR available here:

https://github.com/facebook/rocksdb/pull/746


After much discussion, that PR was merged and received a couple of bug fixes over the years:

Locking bug fix from Somnath back in 2016:
https://github.com/facebook/rocksdb/pull/1313

Another bug fix from ajkr in 2020:
https://github.com/facebook/rocksdb/pull/5900


In 2020, the RocksDB folks discovered a fundamental flaw in the way the original PR works.  It turns out that recycling log files is incompatible with RocksDB's kPointInTimeRecovery, kAbsoluteConsistency, and kTolerateCorruptedTailRecords recovery modes.  One of the later PRs included a very good and concise description of the problem:

"The two features are naturally incompatible. WAL recycling expects the recovery to succeed upon encountering a corrupt record at the point where new data ends and recycled data remains at the tail. However, WALRecoveryMode::kTolerateCorruptedTailRecords must fail upon encountering any such corrupt record, as it cannot differentiate between this and a real corruption, which would cause committed updates to be truncated."


More background discussion on the RocksDB side available in these PRs and comments:

https://github.com/facebook/rocksdb/pull/6351

https://github.com/facebook/rocksdb/pull/6351#issuecomment-672838284

https://github.com/facebook/rocksdb/pull/7252

https://github.com/facebook/rocksdb/pull/7271


On the Ceph side, there was a PR to try to re-enable the old behavior which we rejected as unsafe based on the analysis by the RocksDB folks (which we agree with):

https://github.com/ceph/ceph/pull/36579

Sage also commented about a potential way forward:

https://github.com/ceph/ceph/pull/36579#issuecomment-870884583

"tbh I think the best approach would be to create a new WAL file format that (1) is 4k block aligned and (2) has a header for each block that indicates the generation # for that log file (so we can see whether what we read is from a previous pass or corruption). That would be a fair bit of effort, though."


On a side note, Igor also tried to disable WAL file recycling as a backport to Octopus but was thwarted by a BlueFS bug.  That PR was eventually reverted, leaving the old (dangerous!) behavior in place:

https://github.com/ceph/ceph/pull/45040

https://github.com/ceph/ceph/pull/47053


The gist of it is that releases of Ceph older than Pacific benefit from the speed improvement of log file recycling but may be vulnerable to the corruption issue described above.  This is likely one of the more impactful regressions that people upgrading to Pacific or later releases are seeing.


Josh Baergen from Digital Ocean followed up that there is a slew of additional information on this issue in the following tracker as well:

https://tracker.ceph.com/issues/58530



********** Regression #1 Potential Fixes **********

Josh Baergen also mentioned that the write-amplification effect observed due to this issue is mitigated by https://github.com/ceph/ceph/pull/48915, which was merged into 16.2.11 back in December.  That however does not improve write IOPS amplification.

Beyond that, we could follow Sage's idea and try to implement a new WAL file format.  The risks are that it could be a lot of work, and we don't know whether there is any appetite on the RocksDB side to merge something like this upstream.  My personal take is that we're already kind of abusing the RocksDB WAL for short-lived PG log updates, and I'm not thrilled about adding further code into RocksDB to support our use cases (though there is benefit here that goes beyond Ceph).  We already maintain a custom version of RocksDB's LRU cache in our code to tie into our memory autotuning system, and it would be really nice to avoid custom code like that in the future.


One alternative: Igor Fedotov implemented a prototype WAL inside BlueStore itself, and we saw very good initial results from it with the RocksDB WAL disabled.  These can be seen on slide 24 of my performance deck from Cephalocon 2023:

https://www.linkedin.com/in/markhpc/overlay/experience/2113859303/multiple-media-viewer/?profileId=ACoAAAHzuIEB_T2FuVPM2terPw14ffzShLXPbbo&treasuryMediaId=1635524697350

If Igor (or others) want to continue this work, I personally would be in favor of trying to move the WAL into Bluestore itself.  I suspect we can make better decisions about PG log life cycles and have better BlueFS integration than what RocksDB provides us.  Igor probably has a better idea of the pitfalls here though so I think we should hear out his thoughts on whether this is the right path forward.  Igor also mentioned that he is continuing to work on his Bluestore WAL prototype with promising results, but that PG Log will (as expected) likely require a different solution that looks more like a specialized ring buffer.  I think moving the WAL out of RocksDB is a good step toward that eventual goal.




********** Regression #2: (re-)Enabling BlueFS Buffered IO **********

Effects: Works around unexpected readahead behavior in RocksDB by utilizing the underlying kernel page cache.  Hurts write performance on fast devices.


We're stuck between a bit of a rock and a hard place here.  Over the years we have see-sawed back and forth regarding when we should or should not use buffered IO:

https://github.com/ceph/ceph/pull/11012

https://github.com/ceph/ceph/pull/11059

https://github.com/ceph/ceph/pull/18172

https://github.com/ceph/ceph/pull/20542

https://github.com/ceph/ceph/pull/34224

https://github.com/ceph/ceph/pull/38044 <-- lots of discussion here


The gist of it is that there are upsides and downsides to having bluefs_buffered_io=true.  Direct IO is faster in some scenarios, especially in more recent write tests on NVMe drives.  The trade-off is that RocksDB really seems to benefit from the kernel buffer cache, and there are other scenarios where bluefs_buffered_io is a big win.  Two years ago Adam and I did a walkthrough of the RocksDB code to try to understand its readahead behavior, and we couldn't understand why it was re-reading data from the file system so often (or, in the case of buffered IO, the page cache!).  I wrote up our walkthrough of the code here:

https://github.com/ceph/ceph/pull/38044#issuecomment-790157415


********** Regression #2 Potential Fixes **********

In a recent discussion with Mark Callaghan (of MyRocks/RocksDB performance tuning fame), he pointed out that RocksDB has an option to pre-populate the block cache with data from SSTs created by memtable flush, which might help when O_DIRECT is used:

https://github.com/facebook/rocksdb/blob/main/include/rocksdb/table.h#L600
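If we experiment with this, wiring it up would look something like the following (a sketch against the RocksDB C++ API, untested; the option is only available in newer RocksDB releases, and in Ceph we would presumably plumb it through the bluestore_rocksdb_options config string rather than set it directly):

```cpp
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::Options make_options() {
    rocksdb::Options opts;
    rocksdb::BlockBasedTableOptions table_opts;
    // Populate the block cache with blocks from SSTs produced by memtable
    // flush, so iterators don't have to re-read them from disk under O_DIRECT.
    table_opts.prepopulate_block_cache =
        rocksdb::BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly;
    opts.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
    return opts;
}
```

Note this only covers SSTs written by flush; blocks written by compaction would still need to be read back into the cache on first access.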

We may want to experiment to see if this helps keep the block cache populated after compaction and avoid (re)reads from disk during iteration.  We also might want to revisit this topic in general given the compact-on-iteration feature that was recently added and backported to Pacific in 16.2.13.  I'm still a little concerned, however, that we were seeing repeated overlapping reads for the same ranges during iteration that I would have expected RocksDB to have cached from a previous read.  Ultimately I think many of us would prefer to move entirely to direct IO, but there's more work to do to figure this one out.

Josh Baergen provided further advice here: they have had good luck enabling buffered IO for RGW bucket index OSDs and disabling it everywhere else.  This assumes that bucket indexes are on their own dedicated OSDs, though, and personally I am a bit wary of hitting slow cases in RocksDB even on "regular" OSDs, but it might be something to consider since that configuration has served them well for over a year.
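For reference, that kind of split can be expressed with per-daemon config overrides (the OSD IDs below are hypothetical; you'd substitute the IDs of your dedicated bucket-index OSDs, or target them with a host/device-class mask):

```shell
# Default for the cluster: direct IO everywhere
ceph config set osd bluefs_buffered_io false

# Override for the dedicated RGW bucket-index OSDs (example IDs)
ceph config set osd.10 bluefs_buffered_io true
ceph config set osd.11 bluefs_buffered_io true
ceph config set osd.12 bluefs_buffered_io true

# Verify what a given daemon ended up with
ceph config get osd.10 bluefs_buffered_io
```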



********** Regression #3: RadosGW Coroutine and Request Timeout Changes **********

Effects: Higher RadosGW CPU usage, lower performance, especially for small object workloads


Back when Pacific was released, it was observed that RadosGW was showing much higher CPU usage and lower performance vs Nautilus for small (4KB) objects.  It's likely that larger objects are affected as well, though to a lesser degree.  A git bisection was performed, and the results are summarized in the introduction section of the following RGW performance analysis blog post:

https://ceph.io/en/news/blog/2023/reef-freeze-rgw-performance/


The bisection uncovered two primary PRs that were causing performance regression:

https://github.com/ceph/ceph/pull/31580

https://github.com/ceph/ceph/pull/35355


The good news is that once those PRs were identified, the RGW team started working to improve things, especially for #35355:

https://github.com/ceph/ceph/pull/43761 <-- Fixes issues introduced in #35355, backported to Pacific in 2022


********** Regression #3 Potential Fixes **********

Quincy (and, due to the backport, likely Pacific) is showing significantly better behavior in recent tests due to PR #43761.  The effects of #31580 are still present but are considered a necessary trade-off.  Other improvements since then may be helping, but we'll need to continue to make up the difference in other areas and start really investigating where we are spending cycles/time, especially in Reef.


********** Regression #4: Gradually slowing down OSDs **********


Effects: Significant slowdown after 1-2 weeks of OSD runtime


Igor Fedotov pointed this one out in discussion earlier today: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/OXAUOK7CQXWYGQNT7LWHMLPRB4KNIFXT/


This one is pretty new, and there is not much to go on yet other than perhaps low memory (and cache?) usage despite a regular IO workload.  Onode cache misses can absolutely cause performance degradation, but it's not clear yet whether this is a memory-related issue or something else.  More investigation is needed.  Hopefully we'll get perf data from the users who encountered it to help diagnose what's going on here.


********** Conclusion **********

There may be other performance issues that I'm not remembering, but these are the big ones I can think of off the top of my head at the moment.  Hopefully this helps clarify what's going on if people are seeing a regression, what to look for, and, if they are hitting one of these, why.


Thanks,

Mark


--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io