Public bug reported:

Ceph on bcache can suffer serious performance degradation (a 10x drop) when both of the below conditions are met:

1. bluefs_buffered_io is turned on
2. Any OSD bcache's cache_available_percent is less than 60

As many of us already know, bcache forces all writes to go directly to the backing device once cache_available_percent drops below CUTOFF_WRITEBACK_SYNC (30). Less widely known is that bcache starts to bypass *some* writes as soon as cache_available_percent reaches CUTOFF_WRITEBACK (60): the writes that do not carry any synchronization flags, i.e. the kernel IO flags REQ_SYNC, REQ_FUA and REQ_PREFLUSH. The code is here:
https://github.com/torvalds/linux/blob/master/drivers/md/bcache/writeback.h#L123

The problem I found in a recent case (bionic-stein + 4.15 kernel) is that with bluefs_buffered_io turned on, the writes bluefs sends to bcache do not carry any of the sync flags, so once cache_available_percent drops below 60 (which is quite easy to hit), all bluefs IO is forced into non-writeback mode. This is equivalent to putting the bluestore DB on an HDD device: every IO is bounded by HDD speed.

I'm not sure how the sync flags are propagated from Ceph all the way down to the kernel bcache layer, but I've verified the following ceph/kernel/ubuntu combinations with bluefs_buffered_io turned on:

N: no issue, all writes carry a SYNC flag.
P: has the issue; disabling bluefs_buffered_io works around it.

Bionic-ussuri  + kernel 5.4.0  -> N
Bionic-ussuri  + kernel 4.15.0 -> P
Bionic-stein   + kernel 5.4.0  -> N
Bionic-stein   + kernel 4.15.0 -> P
Bionic-train   + kernel 5.4.0  -> N
Bionic-train   + kernel 4.15.0 -> P
Focal(octopus) + kernel 5.4.0  -> N
Focal(octopus) + kernel 5.8.0  -> N
Focal-wallaby  + kernel 5.4.0  -> N
Focal-wallaby  + kernel 5.8.0  -> N

As we can see, the issue hits when bluefs_buffered_io = true and the kernel is 4.15.0. I'm not sure how or why the SYNC flag ends up being added on the 5.4 and 5.8 kernels when bluefs_buffered_io is enabled; currently I only know that 5.4 and 5.8 behave correctly with bluefs_buffered_io turned on.
Note that if all OSDs are deployed with a separate NVMe device as the bluestore DB device, the cluster won't hit the issue; only OSDs that put the bluestore DB on a bcache device are affected.

Ceph releases with bluefs_buffered_io enabled by default:
bluefs_buffered_io was enabled by default in v13.2.0 and v14.2.0.
bluefs_buffered_io was disabled by default in v14.2.10 and v15.2.0.
bluefs_buffered_io was re-enabled in the following point releases: v14.2.22, v15.2.13, v16.2.0.

In summary, a cluster will very likely hit the issue whenever any OSD bcache's cache_available_percent drops below 60 and all three of the below are met:

1. Ceph has bluefs_buffered_io enabled
2. OSDs put the bluestore DB on top of a bcache device
3. The kernel version is bionic-ga (4.15.0)

** Affects: ceph (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1936136

Title:
  ceph on bcache performance regression

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1936136/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs