[ceph-users] Discussion thread for Known Pacific Performance Regressions

Mark Nelson Thu, 11 May 2023 07:40:36 -0700

Hi Everyone,

This email was originally posted to d...@ceph.io, but Marc mentioned that he thought this would be useful to post on the user list so I'm re-posting here as well.

David Orman mentioned in the CLT meeting this morning that there are a number of people on the mailing list asking about performance regressions in Pacific+ vs older releases. I want to document a couple of the bigger ones that we know about for the community's benefit. I want to be clear that Pacific does have a number of performance improvements over previous releases, and we do have tests showing improvement relative to Nautilus (especially RBD on NVMe drives). Some of these regressions are going to have a bigger effect for some users than others. Having said that, let's get into them.



********** Regression #1: RocksDB Log File Recycling **********

Effects: More metadata updates to the underlying FS, higher write-amplification (observed by Digital Ocean), slower performance, especially when the WAL device is saturated.

When bluestore was created back in 2015, Sage implemented an optimization in RocksDB that allowed WAL log files to be recycled. The idea is that instead of deleting logs when they are flushed, RocksDB can simply reuse them. The benefit here is that it allows records to be written and fdatasync to be called without touching the inode for every IO. Sage did a pretty good job of explaining the benefit in the PR available here:


https://github.com/facebook/rocksdb/pull/746

After much discussion, that PR was merged and received a couple of bug fixes over the years:


Locking bug fix from Somnath back in 2016:
https://github.com/facebook/rocksdb/pull/1313

Another bug fix from ajkr in 2020:
https://github.com/facebook/rocksdb/pull/5900

In 2020, the RocksDB folks discovered there is a fundamental flaw in the way that the original PR works. It turns out that the feature to recycle log files is incompatible with RocksDB's kPointInTimeRecovery, kAbsoluteConsistency, and kTolerateCorruptedTailRecords recovery modes. One of the later PRs included a very good and concise description of the problem:

"The two features are naturally incompatible. WAL recycling expects the recovery to succeed upon encountering a corrupt record at the point where new data ends and recycled data remains at the tail. However, WALRecoveryMode::kTolerateCorruptedTailRecords must fail upon encountering any such corrupt record, as it cannot differentiate between this and a real corruption, which would cause committed updates to be truncated."

More background discussion on the RocksDB side is available in these PRs and comments:


https://github.com/facebook/rocksdb/pull/6351

https://github.com/facebook/rocksdb/pull/6351#issuecomment-672838284

https://github.com/facebook/rocksdb/pull/7252

https://github.com/facebook/rocksdb/pull/7271

On the Ceph side, there was a PR to try to re-enable the old behavior which we rejected as unsafe based on the analysis by the RocksDB folks (which we agree with):


https://github.com/ceph/ceph/pull/36579

Sage also commented about a potential way forward:

https://github.com/ceph/ceph/pull/36579#issuecomment-870884583

"tbh I think the best approach would be to create a new WAL file format that (1) is 4k block aligned and (2) has a header for each block that indicates the generation # for that log file (so we can see whether what we read is from a previous pass or corruption). That would be a fair bit of effort, though."

On a side note, Igor also tried to disable WAL file recycling as a backport to Octopus but was thwarted by a BlueFS bug. That PR was eventually reverted, leaving the old (dangerous!) behavior in place:


https://github.com/ceph/ceph/pull/45040

https://github.com/ceph/ceph/pull/47053

The gist of it is that releases of Ceph older than Pacific are benefiting from the speed improvement of log file recycling but may be vulnerable to the issue as described above. This is likely one of the more impactful regressions that people upgrading to Pacific or later releases are seeing.

Josh Baergen from Digital Ocean followed up that there is a slew of additional information on this issue in the following tracker as well:


https://tracker.ceph.com/issues/58530



********** Regression #1 Potential Fixes **********

Josh Baergen also mentioned that the write-amplification effect that was observed due to this issue is mitigated by https://github.com/ceph/ceph/pull/48915, which was merged into 16.2.11 back in December. That, however, does not improve write IOPS amplification.

Beyond that, we could follow Sage's idea and try to implement a new WAL file format. The risks here are that it could be a lot of work and we don't know if there is really any appetite on the RocksDB side to merge something like this upstream. My personal take is that we're already kind of abusing the RocksDB WAL for short-lived PG log updates and I'm not thrilled about trying to add further code into RocksDB to try and support our use cases (though there is benefit here that goes beyond Ceph). We already maintain a custom version of RocksDB's LRU cache in our code to tie into our memory autotuning system, but it would be really nice to avoid custom code like that in the future.
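To make the per-block generation idea a bit more concrete, here is a minimal Python sketch. It is not RocksDB or Ceph code: the header layout (CRC32, generation, payload length), the one-record-per-block simplification, and the names encode_block, decode_block, replay and CorruptLogError are all assumptions made up for illustration. It only shows how a generation number in each 4k block lets a reader tell "leftover data from a previous pass over a recycled file" apart from "a block that fails its checksum".

```python
import struct
import zlib
from typing import List, Tuple

BLOCK_SIZE = 4096                      # 4k-aligned blocks, as in the proposal
HEADER = struct.Struct("<IIH")         # hypothetical layout: crc32, generation, length
MAX_PAYLOAD = BLOCK_SIZE - HEADER.size


class CorruptLogError(Exception):
    """Raised when a block fails its checksum (illustrative only)."""


def encode_block(generation: int, payload: bytes) -> bytes:
    """Build one fixed-size block carrying a single record."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too large for one block")
    body = struct.pack("<IH", generation, len(payload)) + payload
    body = body.ljust(BLOCK_SIZE - 4, b"\0")      # pad to the block boundary
    return struct.pack("<I", zlib.crc32(body)) + body


def decode_block(block: bytes, current_generation: int) -> Tuple[str, bytes]:
    """Return (state, payload); state is 'valid', 'stale' or 'corrupt'."""
    if len(block) != BLOCK_SIZE:
        return "corrupt", b""
    if not any(block):
        return "stale", b""                       # never written: end of log
    crc, generation, length = HEADER.unpack_from(block, 0)
    if zlib.crc32(block[4:]) != crc or length > MAX_PAYLOAD:
        return "corrupt", b""
    if generation != current_generation:
        return "stale", b""                       # left over from an earlier pass
    return "valid", block[HEADER.size:HEADER.size + length]


def replay(log: bytes, current_generation: int) -> List[bytes]:
    """Replay records until the first stale block; fail on corruption."""
    records = []
    for offset in range(0, len(log), BLOCK_SIZE):
        state, payload = decode_block(log[offset:offset + BLOCK_SIZE],
                                      current_generation)
        if state == "stale":
            break                                 # clean end of the current log
        if state == "corrupt":
            raise CorruptLogError(f"bad block at offset {offset}")
        records.append(payload)
    return records


if __name__ == "__main__":
    # Generation 1 wrote three records; generation 2 recycled the file
    # and has overwritten only the first two blocks so far.
    old = b"".join(encode_block(1, p) for p in (b"a", b"b", b"c"))
    new = b"".join(encode_block(2, p) for p in (b"x", b"y"))
    recycled = new + old[len(new):]
    print(replay(recycled, 2))
```

In this toy model the stale third block ends replay cleanly, while a checksum failure inside the current generation's blocks is reported as corruption. A real format would also have to deal with records spanning blocks, torn writes at the tail, and generation wraparound, none of which is handled here.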

One alternative: Igor Fedotov implemented a prototype WAL inside bluestore itself and we saw very good initial results from it with the RocksDB WAL disabled. These can be seen on slide 24 of my performance deck from Cephalocon 2023:


https://www.linkedin.com/in/markhpc/overlay/experience/2113859303/multiple-media-viewer/?profileId=ACoAAAHzuIEB_T2FuVPM2terPw14ffzShLXPbbo&treasuryMediaId=1635524697350

If Igor (or others) want to continue this work, I personally would be in favor of trying to move the WAL into Bluestore itself. I suspect we can make better decisions about PG log life cycles and have better BlueFS integration than what RocksDB provides us. Igor probably has a better idea of the pitfalls here though, so I think we should hear out his thoughts on whether this is the right path forward. Igor also mentioned that he is continuing to work on his Bluestore WAL prototype with promising results, but that PG Log will (as expected) likely require a different solution that looks more like a specialized ring buffer. I think moving the WAL out of RocksDB is a good step toward that eventual goal.





********** Regression #2: (re-)Enabling BlueFS Buffered IO **********

Effects: Works around unexpected readahead behavior in RocksDB by utilizing underlying kernel page cache. Hurts write performance on fast devices.

We're stuck between a bit of a rock and a hard place here. Over the years we have see-sawed back and forth regarding when we should or should not use buffered IO:


https://github.com/ceph/ceph/pull/11012

https://github.com/ceph/ceph/pull/11059

https://github.com/ceph/ceph/pull/18172

https://github.com/ceph/ceph/pull/20542

https://github.com/ceph/ceph/pull/34224

https://github.com/ceph/ceph/pull/38044 <-- lots of discussion here

The gist of it is that there are upsides and downsides to having bluefs_buffered_io=true. Direct IO is faster in some scenarios, especially more recent write tests on NVMe drives. The trade-off is that RocksDB really seems to benefit from kernel buffer cache and there are other scenarios where bluefs_buffered_io is a big win. Two years ago Adam and I did a walkthrough of the RocksDB code to try to understand the behavior regarding RocksDB readahead and we couldn't understand why it was re-reading data from the file system so often (or in the case of buffered IO the page cache!). I wrote up our walkthrough of the code here:


https://github.com/ceph/ceph/pull/38044#issuecomment-790157415


********** Regression #2 Potential Fixes **********

In a recent discussion with Mark Callaghan (of MyRocks/RocksDB performance tuning fame), he pointed out that RocksDB has an option to pre-populate the block cache with the data from SSTs created by memtable flush and that might help when O_DIRECT is used:


https://github.com/facebook/rocksdb/blob/main/include/rocksdb/table.h#L600

We may want to experiment to see if this helps keep the block cache pre-populated after compaction and avoid (re)reads from the disk during iteration. We also might want to revisit this topic in general with the compact on iteration feature that was recently added and backported to Pacific in 16.2.13. I'm still a little concerned however that we were seeing repeated overlapping reads for the same ranges during iteration that I would have expected to be cached by RocksDB on a previous read. Ultimately I think many of us would prefer to move entirely to direct IO, but there's more work to do to figure this one out.

Josh Baergen provided further advice here: They have had good luck enabling buffered IO for rgw bucket index OSDs and disabling it everywhere else. This assumes that bucket indexes are on their own dedicated OSDs though, and personally I am a bit wary of hitting slow cases in RocksDB even on "regular" OSDs, but this might be something to consider as they've had good luck with this configuration for over a year.
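For anyone wanting to try the split configuration Josh described, here is a small, hedged sketch in Python that prints the "ceph config set" commands for it: buffered IO off for OSDs in general, with a per-OSD override turning it on for the bucket index OSDs. The OSD IDs (12, 13, 14) and the helper name buffered_io_commands are hypothetical, and the script only prints commands rather than running them. Whether a restart is needed for the setting to take effect on your release is something to verify on a test cluster first.

```python
from typing import Iterable, List


def buffered_io_commands(index_osd_ids: Iterable[int]) -> List[str]:
    """Return ceph CLI commands: buffered IO off by default, on for index OSDs.

    index_osd_ids are the (hypothetical) IDs of OSDs dedicated to RGW
    bucket indexes; every other OSD inherits the 'osd' section default.
    """
    commands = ["ceph config set osd bluefs_buffered_io false"]
    for osd_id in sorted(set(index_osd_ids)):
        # A per-daemon setting (osd.N) overrides the broader 'osd' section.
        commands.append(f"ceph config set osd.{osd_id} bluefs_buffered_io true")
    return commands


if __name__ == "__main__":
    # Example IDs only; substitute the OSDs that actually hold bucket indexes.
    for command in buffered_io_commands([12, 13, 14]):
        print(command)
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere and makes it easy to review the list before applying anything to a cluster.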

********** Regression #3: RadosGW Coroutine and Request Timeout Changes **********

Effects: Higher RadosGW CPU usage, lower performance, especially for small object workloads

Back when Pacific was released it was observed that RadosGW was showing much higher CPU usage and lower performance vs Nautilus for small (4KB) objects. It's likely that larger objects may be affected, though to a lesser degree. A git bisection was performed and the results are summarized in the introduction section of the following RGW performance analysis blog post:


https://ceph.io/en/news/blog/2023/reef-freeze-rgw-performance/

The bisection uncovered two primary PRs that were causing performance regression:


https://github.com/ceph/ceph/pull/31580

https://github.com/ceph/ceph/pull/35355

The good news is that once those PRs were identified, the RGW team started working to improve things, especially for #35355:

https://github.com/ceph/ceph/pull/43761 <-- Fixes issues introduced in #35355, backported to Pacific in 2022



********** Regression #3 Potential Fixes **********

Quincy (and due to the backport likely Pacific) is showing significantly better behavior in recent tests due to PR #43761. The effects of #31580 are still present, but are considered a necessary trade-off. Other improvements since then may be helping, but we'll need to continue to make up the difference in other areas and start really investigating where we are spending cycles/time, especially in Reef.



********** Regression #4: Gradually slowing down OSDs **********


Effects: Significant slowdown after 1-2 weeks of OSD runtime

Igor Fedotov pointed this one out in discussion earlier today: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/OXAUOK7CQXWYGQNT7LWHMLPRB4KNIFXT/

This one is pretty new and there is not much there yet other than perhaps low memory (and cache?) usage despite regular IO workload. Onode misses can absolutely cause performance degradation, but it's not clear yet whether this is a memory-related issue or something else. More investigation needed. Hopefully we'll get perf data from the users who encountered it to help diagnose what's going on here.



********** Conclusion **********

There may be other performance issues that I'm not remembering, but these are the big ones I can think of off the top of my head at the moment. Hopefully this helps clarify what's going on if people are seeing a regression, what to look for, and if they are hitting it, the why behind it.



Thanks,

Mark


--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/