On Tue, 27 Jun 2023 at 18:20, Josh Baergen <jbaer...@digitalocean.com> wrote:
>
> Hi Matthew,
>
> We've done a limited amount of work on characterizing the pwl and I
> think it suffers the classic problem of some writeback caches in
> that, once the cache is saturated, it's actually worse than just
> being in writethrough. IIRC the pwl does try to preserve write
> ordering (unlike the other writeback/writearound modes) which limits
> it in the concurrency it can issue to the backend, which means that
> even an iodepth=1 test can saturate the pwl, assuming the backend
> latency is higher than the pwl latency.

What do you mean by saturated here? FWIW I was using the default cache
size of 1G and each test run only wrote ~100MB of data, so I don't
think I ever filled the cache, even with multiple runs.

> I _think_ that if you were able to devise a burst test with bursts
> smaller than the pwl capacity and gaps in between large enough for
> the cache to flush, or if you were to ratelimit I/Os to the pwl, that
> you should see closer to the lower latencies that you would expect.
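
If I were to try the ratelimit variant, I assume it would look
something like the following (an untested sketch; the 200 IOPS cap is
a number I made up), using fio's rate_iops option to keep the
submission rate well below the backend's flush rate:

fio --rw=write --ioengine=sync --fdatasync=1 --bs=8000 --size=100m \
    --rate_iops=200 --name=etcd_ratelimit_test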

My goal is to characterise the requirements of etcd. Unfortunately I
don't think changing the test would do that.

Incidentally, note that the total bandwidth of an extremely busy etcd
is usually very low. From memory, the write rate of a system we were
debugging, whose etcd was occasionally falling over due to load, was
only about 5MiB/s. It's all about the write latency of really small
writes, not bandwidth.

Matt

>
> Josh
>
> On Tue, Jun 27, 2023 at 9:04 AM Matthew Booth <mbo...@redhat.com> wrote:
>>
>> ** TL;DR
>>
>> In testing, the write latency performance of a PWL-cache-backed RBD
>> disk was 2 orders of magnitude worse than that of the disk holding
>> the PWL cache.
>>
>> ** Summary
>>
>> I was hoping that the PWL cache might be a good solution to the
>> problem of the write latency requirements of etcd when running a
>> kubernetes control plane on ceph. Etcd is extremely write latency
>> sensitive and becomes unstable if write latency is too high. The
>> etcd workload can be characterised by very small (~4k) writes with a
>> queue depth of 1. Throughput, even on a busy system, is normally
>> very low. As etcd is distributed and can safely handle the loss of
>> un-flushed data from a single node, a local ssd PWL cache for etcd
>> looked like an ideal solution.
>>
>> My expectation was that adding a PWL cache on a local SSD to an
>> RBD-backed VM would improve write latency to something approaching
>> the write latency performance of the local SSD. However, in my
>> testing, adding a PWL cache to an rbd-backed VM increased write
>> latency by approximately 4x over not using a PWL cache. This was
>> over 100x the write latency of the underlying SSD.
>>
>> My expectation was based on the documentation here:
>> https://docs.ceph.com/en/quincy/rbd/rbd-persistent-write-log-cache/
>>
>> “The cache provides two different persistence modes. In
>> persistent-on-write mode, the writes are completed only when they
>> are persisted to the cache device and will be readable after a
>> crash. In persistent-on-flush mode, the writes are completed as soon
>> as it no longer needs the caller’s data buffer to complete the
>> writes, but does not guarantee that writes will be readable after a
>> crash. The data is persisted to the cache device when a flush
>> request is received.”
>>
>> ** Method
>>
>> 2 systems, 1 running single-node Ceph Quincy (17.2.6), the other
>> running libvirt and mounting a VM’s disk with librbd (also 17.2.6)
>> from the first node.
>>
>> All performance testing is from the libvirt system. I tested write
>> latency performance:
>>
>> * Inside the VM without a PWL cache
>> * Of the PWL device directly from the host (direct to filesystem, no VM)
>> * Inside the VM with a PWL cache
>>
>> I am testing with fio. Specifically I am running a containerised
>> test, executed with:
>>
>> podman run --volume .:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf
>>
>> This container runs:
>>
>> fio --rw=write --ioengine=sync --fdatasync=1
>> --directory=/var/lib/etcd --size=100m --bs=8000 --name=etcd_perf
>> --output-format=json --runtime=60 --time_based=1
>>
>> And extracts sync.lat_ns.percentile["99.000000"]
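>>
>> i.e. presumably something like the following jq, assuming the fio
>> JSON is saved to a file (the filename here is made up):
>>
>> jq '.jobs[0].sync.lat_ns.percentile."99.000000"' fio_output.json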
>>
>> ** Results
>>
>> All results were stable across multiple runs within a small margin
>> of error.
>>
>> * rbd no cache: 1417216 ns (~1.4 ms)
>> * pwl cache device: 44288 ns (~44 µs)
>> * rbd with pwl cache: 5210112 ns (~5.2 ms)
>>
>> Note that by adding a PWL cache we increase write latency by
>> approximately 4x, which is more than 100x the latency of the
>> underlying device.
>>
>> ** Hardware
>>
>> 2 x Dell R640s, each with Xeon Silver 4216 CPU @ 2.10GHz and 192G RAM
>> Storage under test: 2 x SAMSUNG MZ7KH480HAHQ0D3 SSDs attached to
>> PERC H730P Mini (Embedded)
>>
>> OS installed on rotational disks
>>
>> N.B. Linux incorrectly detects these disks as rotational, which I
>> assume relates to weird behaviour by the PERC controller. I
>> remembered to manually correct this on the ‘client’ machine for the
>> PWL cache, but at OSD configuration time ceph would have detected
>> them as rotational. They are not rotational.
>>
>> ** Ceph Configuration
>>
>> CentOS Stream 9
>>
>> # ceph version
>> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)
>> quincy (stable)
>>
>> Single node installation with cephadm. 2 OSDs, one on each SSD.
>> 1 pool with size 2
>>
>> ** Client Configuration
>>
>> Fedora 38
>> librbd1-17.2.6-3.fc38.x86_64
>>
>> PWL cache is an XFS filesystem with 4k block size, matching the
>> underlying device. The filesystem uses the whole block device. There
>> is no other load on the system.
>>
>> ** RBD Configuration
>>
>> # rbd config image list libvirt-pool/pwl-test | grep cache
>> rbd_cache                            true                         config
>> rbd_cache_block_writes_upfront       false                        config
>> rbd_cache_max_dirty                  25165824                     config
>> rbd_cache_max_dirty_age              1.000000                     config
>> rbd_cache_max_dirty_object           0                            config
>> rbd_cache_policy                     writeback                    pool
>> rbd_cache_size                       33554432                     config
>> rbd_cache_target_dirty               16777216                     config
>> rbd_cache_writethrough_until_flush   true                         pool
>> rbd_parent_cache_enabled             false                        config
>> rbd_persistent_cache_mode            ssd                          pool
>> rbd_persistent_cache_path            /var/lib/libvirt/images/pwl  pool
>> rbd_persistent_cache_size            1073741824                   config
>> rbd_plugins                          pwl_cache                    pool
>>
>> # rbd status libvirt-pool/pwl-test
>> Watchers:
>>   watcher=10.1.240.27:0/1406459716 client.14475 cookie=140282423200720
>> Persistent cache state:
>>   host: dell-r640-050
>>   path: /var/lib/libvirt/images/pwl/rbd-pwl.libvirt-pool.37e947fd216b.pool
>>   size: 1 GiB
>>   mode: ssd
>>   stats_timestamp: Mon Jun 26 11:29:21 2023
>>   present: true   empty: false   clean: true
>>   allocated: 180 MiB
>>   cached: 135 MiB
>>   dirty: 0 B
>>   free: 844 MiB
>>   hits_full: 1 / 0%
>>   hits_partial: 3 / 0%
>>   misses: 21952
>>   hit_bytes: 6 KiB / 0%
>>   miss_bytes: 349 MiB
>> --
>> Matthew Booth
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
--
Matthew Booth
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io