On Tue, 27 Jun 2023 at 18:20, Josh Baergen <jbaer...@digitalocean.com> wrote:
>
> Hi Matthew,
>
> We've done a limited amount of work on characterizing the pwl, and I think
> it suffers the classic problem of some writeback caches: once the cache is
> saturated, it's actually worse than just being in writethrough. IIRC the
> pwl does try to preserve write ordering (unlike the other
> writeback/writearound modes), which limits the concurrency it can issue to
> the backend. This means that even an iodepth=1 test can saturate the pwl,
> assuming the backend latency is higher than the pwl latency.

What do you mean by saturated here? FWIW I was using the default cache
size of 1G and each test run only wrote ~100MB of data, so I don't
think I ever filled the cache, even with multiple runs.

> I _think_ that if you were able to devise a burst test with bursts smaller
> than the pwl capacity and gaps in between large enough for the cache to
> flush, or if you were to ratelimit I/Os to the pwl, you should see
> something closer to the lower latencies you would expect.

My goal is to characterise the requirements of etcd, and unfortunately I
don't think changing the test would do that. Incidentally, note that the
total bandwidth of an extremely busy etcd is usually very low. From memory,
the write rate on a system we were debugging, whose etcd was occasionally
falling over due to load, was only about 5MiB/s. It's all about the write
latency of really small writes, not bandwidth.
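
For illustration, a rate-limited fio job of the kind Josh describes might
look roughly like this (the parameters here are purely illustrative):

  fio --name=etcd_like --rw=write --ioengine=sync --fdatasync=1 \
    --bs=4k --rate=5m --size=100m --runtime=60 --time_based=1 \
    --directory=/var/lib/etcd

That said, as above, I don't think a throttled test would characterise
what etcd actually needs.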

Matt

>
> Josh
>
> On Tue, Jun 27, 2023 at 9:04 AM Matthew Booth <mbo...@redhat.com> wrote:
>>
>> ** TL;DR
>>
>> In testing, the write latency of a PWL-cache-backed RBD disk was two
>> orders of magnitude worse than that of the disk holding the PWL cache.
>>
>> ** Summary
>>
>> I was hoping that the PWL cache might be a good solution to the problem
>> of meeting the write latency requirements of etcd when running a
>> Kubernetes control plane on Ceph. Etcd is extremely write-latency
>> sensitive and becomes unstable if write latency is too high. The etcd
>> workload is characterised by very small (~4k) writes with a queue depth
>> of 1. Throughput, even on a busy system, is normally very low. As etcd
>> is distributed and can safely handle the loss of un-flushed data from a
>> single node, a local SSD PWL cache for etcd looked like an ideal
>> solution.
>>
>> My expectation was that adding a PWL cache on a local SSD to an
>> RBD-backed disk would improve write latency to something approaching
>> that of the local SSD. However, in my testing, adding a PWL cache to an
>> RBD-backed VM increased write latency by approximately 4x compared to
>> not using a PWL cache. This was over 100x the write latency of the
>> underlying SSD.
>>
>> My expectation was based on the documentation here:
>> https://docs.ceph.com/en/quincy/rbd/rbd-persistent-write-log-cache/
>>
>> “The cache provides two different persistence modes. In
>> persistent-on-write mode, the writes are completed only when they are
>> persisted to the cache device and will be readable after a crash. In
>> persistent-on-flush mode, the writes are completed as soon as it no
>> longer needs the caller’s data buffer to complete the writes, but does
>> not guarantee that writes will be readable after a crash. The data is
>> persisted to the cache device when a flush request is received.”
>>
>> ** Method
>>
>> Two systems: one running single-node Ceph Quincy (17.2.6), the other
>> running libvirt and mounting a VM’s disk with librbd (also 17.2.6)
>> from the first node.
>>
>> All performance testing is from the libvirt system. I tested write
>> latency performance:
>>
>> * Inside the VM without a PWL cache
>> * Of the PWL device directly from the host (direct to filesystem, no VM)
>> * Inside the VM with a PWL cache
>>
>> I am testing with fio. Specifically I am running a containerised test,
>> executed with:
>>   podman run --volume .:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf
>>
>> This container runs:
>>   fio --rw=write --ioengine=sync --fdatasync=1 \
>>     --directory=/var/lib/etcd --size=100m --bs=8000 --name=etcd_perf \
>>     --output-format=json --runtime=60 --time_based=1
>>
>> It then extracts sync.lat_ns.percentile["99.000000"] from the JSON output.
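>>
>> The extraction is roughly equivalent to something like the following
>> (assuming the JSON is saved to fio-output.json; this jq invocation is
>> illustrative rather than the container's exact code):
>>
>>   jq '.jobs[0].sync.lat_ns.percentile."99.000000"' fio-output.json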
>>
>> ** Results
>>
>> All results were stable across multiple runs within a small margin of error.
>>
>> * rbd no cache: 1417216 ns
>> * pwl cache device: 44288 ns
>> * rbd with pwl cache: 5210112 ns
>>
>> Note that adding a PWL cache increases write latency by approximately
>> 4x, to more than 100x that of the underlying cache device.
>>
>> ** Hardware
>>
>> 2 x Dell R640s, each with Xeon Silver 4216 CPU @ 2.10GHz and 192G RAM
>> Storage under test: 2 x SAMSUNG MZ7KH480HAHQ0D3 SSDs attached to PERC
>> H730P Mini (Embedded)
>>
>> OS installed on rotational disks
>>
>> N.B. Linux incorrectly detects these disks as rotational, which I
>> assume relates to weird behaviour by the PERC controller. I remembered
>> to manually correct this on the ‘client’ machine for the PWL cache,
>> but at OSD configuration time ceph would have detected them as
>> rotational. They are not rotational.
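>>
>> For reference, correcting the flag is a sysfs toggle along these lines
>> (sdX is a placeholder for the actual device):
>>
>>   echo 0 > /sys/block/sdX/queue/rotational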
>>
>> ** Ceph Configuration
>>
>> CentOS Stream 9
>>
>>   # ceph version
>>   ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
>>
>> Single node installation with cephadm. 2 OSDs, one on each SSD.
>> 1 pool with size 2
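>>
>> The pool setup would typically have been along these lines (the PG count
>> and exact invocations are illustrative; only the pool name and size 2
>> come from this setup):
>>
>>   ceph osd pool create libvirt-pool 64
>>   ceph osd pool set libvirt-pool size 2
>>   rbd pool init libvirt-pool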
>>
>> ** Client Configuration
>>
>> Fedora 38
>> librbd1-17.2.6-3.fc38.x86_64
>>
>> PWL cache is XFS filesystem with 4k block size, matching the
>> underlying device. The filesystem uses the whole block device. There
>> is no other load on the system.
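>>
>> Creating and mounting the cache filesystem would have looked roughly
>> like this (the device name is a placeholder; only the 4k block size and
>> the cache path shown below come from this configuration):
>>
>>   mkfs.xfs -b size=4096 /dev/sdX
>>   mount /dev/sdX /var/lib/libvirt/images/pwl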
>>
>> ** RBD Configuration
>>
>> # rbd config image list libvirt-pool/pwl-test | grep cache
>> rbd_cache                                    true                         config
>> rbd_cache_block_writes_upfront               false                        config
>> rbd_cache_max_dirty                          25165824                     config
>> rbd_cache_max_dirty_age                      1.000000                     config
>> rbd_cache_max_dirty_object                   0                            config
>> rbd_cache_policy                             writeback                    pool
>> rbd_cache_size                               33554432                     config
>> rbd_cache_target_dirty                       16777216                     config
>> rbd_cache_writethrough_until_flush           true                         pool
>> rbd_parent_cache_enabled                     false                        config
>> rbd_persistent_cache_mode                    ssd                          pool
>> rbd_persistent_cache_path                    /var/lib/libvirt/images/pwl  pool
>> rbd_persistent_cache_size                    1073741824                   config
>> rbd_plugins                                  pwl_cache                    pool
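>>
>> For reference, the settings with source "pool" above would typically
>> have been applied with commands along these lines (illustrative; the
>> exact invocations used are not shown here):
>>
>>   rbd config pool set libvirt-pool rbd_plugins pwl_cache
>>   rbd config pool set libvirt-pool rbd_persistent_cache_mode ssd
>>   rbd config pool set libvirt-pool rbd_persistent_cache_path /var/lib/libvirt/images/pwl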
>>
>> # rbd status libvirt-pool/pwl-test
>> Watchers:
>>         watcher=10.1.240.27:0/1406459716 client.14475 cookie=140282423200720
>> Persistent cache state:
>>         host: dell-r640-050
>>         path: /var/lib/libvirt/images/pwl/rbd-pwl.libvirt-pool.37e947fd216b.pool
>>         size: 1 GiB
>>         mode: ssd
>>         stats_timestamp: Mon Jun 26 11:29:21 2023
>>         present: true   empty: false    clean: true
>>         allocated: 180 MiB
>>         cached: 135 MiB
>>         dirty: 0 B
>>         free: 844 MiB
>>         hits_full: 1 / 0%
>>         hits_partial: 3 / 0%
>>         misses: 21952
>>         hit_bytes: 6 KiB / 0%
>>         miss_bytes: 349 MiB
>> --
>> Matthew Booth



-- 
Matthew Booth
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
