Hi Robert,

We are definitely aware of this issue. It often appears to be related to snap trimming, and we believe it may be caused by excessive thrashing of the rocksdb block cache. I suspect that when bluefs_buffered_io is enabled it hides the issue and people don't notice the problem, but that may also be related to the other issue we see with the kernel (swap) under rgw workloads. I would recommend that if you didn't see issues with bluefs_buffered_io enabled, you re-enable it and periodically check that you aren't hitting kernel swap issues. Unfortunately we are somewhat between a rock and a hard place on this one until we solve the root cause.
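For what it's worth, here is a minimal sketch of how one could re-enable the option and keep an eye on swap afterwards. This assumes the cluster-wide config database and a systemd deployment; adjust to however you normally manage settings:

    # Re-enable buffered BlueFS I/O for all OSDs (picked up when the OSDs restart).
    ceph config set osd bluefs_buffered_io true

    # Restart OSDs, e.g. one host at a time on systemd deployments:
    systemctl restart ceph-osd.target

    # Then periodically check the OSD hosts for swap pressure:
    free -m        # growing swap usage over time is the warning sign
    vmstat 5       # watch the si/so (swap in/out) columns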


Right now we're looking at reducing thrashing in the rocksdb block cache(s) by splitting the onode and omap (and potentially pglog and allocator) data into their own distinct block caches. My hope is that we can finesse the situation so that the system page cache is no longer required to avoid excessive reads, assuming enough memory has been assigned to the osd_memory_target.


Mark


On 1/11/21 9:47 AM, Robert Sander wrote:
Hi,

bluefs_buffered_io was disabled by default in Ceph version 14.2.11.

The cluster started last year with 14.2.5 and was upgraded over the year; it is now
running 14.2.16.

The performance was OK at first but became abysmally bad at the end of 2020.

We checked the components; the HDDs and SSDs seem to be fine. Single-disk
benchmarks showed performance according to the specs.

Today we (re-)enabled bluefs_buffered_io and restarted all OSD processes on 248 
HDDs distributed over 12 nodes.

Now the benchmarks are fine again: 434MB/s write instead of 60MB/s, 960MB/s 
read instead of 123MB/s.

This setting was disabled in 14.2.11 because "in some test cases it appears to cause 
excessive swap utilization by the linux kernel and a large negative performance impact 
after several hours of run time."
We will have to monitor whether this happens in our cluster. Are there any other
known negative side effects?

Here are the rados bench values, first with bluefs_buffered_io=false, then with 
bluefs_buffered_io=true:

Bench        Time(s)  Ops   BW avg   BW std   BW max  BW min  IOPS avg  IOPS std  IOPS max  IOPS min  Lat avg   Lat std   Lat max   Lat min
false write  33.081   490   59.2485  71.3829  264     0       14        17.8702   66        0         1.07362   2.83017   20.71     0.0741089
false seq    15.8226   490   123.874  -        -       -       30        46.8659   174       0         0.51453   -         9.53873   0.00343417
false rand   38.2615  2131  222.782  -        -       -       55        109.374   415       0         0.28191   -         12.1039   0.00327948
true write   30.4612  3308  434.389  26.0323  480     376     108       6.50809   120       94        0.14683   0.07368   0.99791   0.0751249
true seq     13.7628  3308  961.429  -        -       -       240       22.544    280       184       0.06528   -         0.88676   0.00338191
true rand    30.1007  8247  1095.92  -        -       -       273       25.5066   313       213       0.05719   -         0.99140   0.00325295

(Bandwidth in MB/sec, latency in seconds. Write/read size and object size were
4194304 bytes in all runs; "-" marks values rados bench does not report for the
read tests.)
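For reference, these numbers look like output from rados bench invocations along
the following lines; the pool name is a placeholder and the exact durations may
have differed:

    # 4 MiB writes, keeping the objects so they can be read back:
    rados bench -p <testpool> 30 write -b 4194304 --no-cleanup

    # sequential and random reads of the objects written above:
    rados bench -p <testpool> 30 seq
    rados bench -p <testpool> 30 rand

    # remove the benchmark objects afterwards:
    rados -p <testpool> cleanup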

Regards

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
