Hi Igor,

Thank you for the reply and information.
I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd
65537` correctly defers writes in my clusters.
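
In case it's useful to anyone else verifying the same thing, something like the
following should show the applied value and whether small writes are actually
being deferred (osd.0 is just an example, and the exact perf counter names may
vary slightly between releases):

    # value the OSDs are actually using
    ceph config get osd bluestore_prefer_deferred_size_hdd

    # on an OSD host: the deferred write counters should climb while
    # the 4k rados bench probe is running
    ceph daemon osd.0 perf dump | grep deferred_write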

Best regards,

Dan



On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov <igor.fedo...@croit.io> wrote:
>
> Hi Dan,
>
> I can confirm this is a regression introduced by 
> https://github.com/ceph/ceph/pull/42725.
>
> Indeed, the strict comparison is the key point in your specific case, but more
> generally it looks like this piece of code needs some redesign to better handle
> fragmented allocations (and to issue a deferred write for every short-enough
> fragment independently).
>
> So I'm looking for a way to improve that at the moment. I will fall back to the
> trivial comparison fix if I fail to find a better solution.
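>
> To illustrate (a simplified, hypothetical sketch only, not the actual
> _do_alloc_write code), the problem boils down to something like this:
>
>     // hypothetical, simplified illustration of the deferred-write decision;
>     // not the real BlueStore code.
>     #include <cstdint>
>     #include <cstdio>
>
>     int main() {
>       const uint64_t min_alloc_size_hdd       = 0x10000; // 64K on a pre-pacific OSD
>       const uint64_t prefer_deferred_size_hdd = 0x10000; // 64K pacific default
>
>       // a small (e.g. 4K) overwrite still gets a min_alloc_size-sized extent
>       const uint64_t prealloc_length = min_alloc_size_hdd;
>
>       // strict '<' never matches when both sides are 64K, so the write goes
>       // straight to the slow device instead of being deferred via the WAL
>       if (prealloc_length < prefer_deferred_size_hdd) {
>         std::puts("deferred to fast device");
>       } else {
>         std::puts("written directly to HDD");   // always taken in this case
>       }
>
>       // the trivial fix would be '<=' here; a fuller fix would apply the
>       // check to each short-enough fragment of a fragmented allocation
>       return 0;
>     }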
>
> Meanwhile you can indeed adjust bluestore_prefer_deferred_size_hdd, but I'd
> prefer not to raise it as high as 128K, to avoid too many writes being deferred
> (and hence overburdening the DB).
>
> IMO setting the parameter to 64K+1 should be fine.
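>
> I.e., something along the lines of:
>
>     ceph config set osd bluestore_prefer_deferred_size_hdd 65537   # 64K+1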
>
>
> Thanks,
>
> Igor
>
> On 7/7/2022 12:43 AM, Dan van der Ster wrote:
>
> Hi Igor and others,
>
> (apologies for HTML, but I want to share a plot ;) )
>
> We're upgrading clusters to v16.2.9 from v15.2.16, and our simple "rados 
> bench -p test 10 write -b 4096 -t 1" latency probe showed something is very 
> wrong with deferred writes in pacific.
> Here is an example cluster, upgraded today:
>
> [latency probe plot not included in the plain-text version]
>
> The OSDs are 12TB HDDs, formatted in nautilus with the default
> bluestore_min_alloc_size_hdd = 64kB, and each has a large flash block.db.
>
> I found that the performance issue is because, with the default config, 4kB
> writes to those pre-pacific hdds are no longer deferred to flash in pacific!!!
> Here are example bench writes from both releases: 
> https://pastebin.com/raw/m0yL1H9Z
>
> I worked out that the issue is fixed if I set
> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default;
> note the default was 32k in octopus).
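>
> For reference, that 128k override is just something like:
>
>     ceph config set osd bluestore_prefer_deferred_size_hdd 131072   # 128 KiB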
>
> I think this is related to the fixes in https://tracker.ceph.com/issues/52089,
> which landed in 16.2.6 -- _do_alloc_write compares the prealloc size (0x10000)
> with bluestore_prefer_deferred_size_hdd (0x10000), and the "strictly less than"
> condition prevents deferred writes from ever happening.
>
> So I think this would impact anyone upgrading clusters with mixed hdd/ssd
> osds ... surely ours can't be the only clusters impacted by this?!
>
> Should we increase the default bluestore_prefer_deferred_size_hdd to 128kB,
> or is there in fact a bug here?
>
> Best Regards,
>
> Dan
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
