[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
OK I recreated one OSD. It now has 4k min_alloc_size:

  2022-07-14T10:52:58.382+0200 7fe5ec0aa200 1 bluestore(/var/lib/ceph/osd/ceph-0/) _open_super_meta min_alloc_size 0x1000

and I tested all these bluestore_prefer_deferred_size_hdd values:

  4096:  not deferred
  4097:  "_do_alloc_write deferring 0x1000 write via deferred"
  65536: "_do_alloc_write deferring 0x1000 write via deferred"
  65537: "_do_alloc_write deferring 0x1000 write via deferred"

With bluestore_prefer_deferred_size_hdd = 64k, I see that writes up to 0xf000 are deferred, e.g.:

  _do_alloc_write deferring 0xf000 write via deferred

Cheers, Dan

On Thu, Jul 14, 2022 at 9:37 AM Konstantin Shalygin wrote:
>
> Dan, do you tested the redeploy one of your OSD with default pacific
> bluestore_min_alloc_size_hdd (4096) ?
> This will also resolves this issue (just not affected, when all options in
> their defaults)?
>
>
> Thanks,
> k
>
> On 14 Jul 2022, at 08:43, Dan van der Ster wrote:
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
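For anyone who wants to reproduce the check above, a minimal sketch of how the deferral decision can be observed on a single test OSD. The OSD id (osd.0), the default log path, and the use of the local admin socket are assumptions; adjust for your deployment, and note that debug_bluestore 20 is very verbose:

  # On-disk min_alloc_size is fixed at mkfs time and logged at OSD startup
  grep '_open_super_meta min_alloc_size' /var/log/ceph/ceph-osd.0.log

  # Raise bluestore debug logging on the OSD's host so _do_alloc_write decisions are logged
  ceph daemon osd.0 config set debug_bluestore 20/20

  # After issuing some small writes, look for the deferral decision
  grep 'deferring .* write via deferred' /var/log/ceph/ceph-osd.0.log

  # Lower the debug level again afterwards (1/5 is the default)
  ceph daemon osd.0 config set debug_bluestore 1/5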
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Dan, did you test redeploying one of your OSDs with the default pacific bluestore_min_alloc_size_hdd (4096)?
Does that also resolve the issue (i.e. the problem simply doesn't appear when all options are left at their defaults)?

Thanks,
k

> On 14 Jul 2022, at 08:43, Dan van der Ster wrote:
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
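For completeness, a rough sketch of what redeploying a single OSD with the pacific default min_alloc_size could look like. osd.0, /dev/sdX, the separate DB device, and a non-cephadm (ceph-volume + systemd) deployment are all assumptions; this drains and rebuilds the OSD, so do it one OSD at a time and wait for recovery between steps:

  ceph config rm osd bluestore_min_alloc_size_hdd        # ensure the pacific default (4096) applies at mkfs
  ceph osd out 0                                         # drain; wait for backfill to finish and HEALTH_OK
  systemctl stop ceph-osd@0
  ceph osd destroy 0 --yes-i-really-mean-it
  ceph-volume lvm zap --destroy /dev/sdX
  ceph-volume lvm create --osd-id 0 --data /dev/sdX --block.db /dev/nvme0n1p1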
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Many thanks, Dan. Much appreciated! /Z On Thu, 14 Jul 2022 at 08:43, Dan van der Ster wrote: > Yes, that is correct. No need to restart the osds. > > .. Dan > > > On Thu., Jul. 14, 2022, 07:04 Zakhar Kirpichenko, > wrote: > >> Hi! >> >> My apologies for butting in. Please confirm >> that bluestore_prefer_deferred_size_hdd is a runtime option, which doesn't >> require OSDs to be stopped or rebuilt? >> >> Best regards, >> Zakhar >> >> On Tue, 12 Jul 2022 at 14:46, Dan van der Ster >> wrote: >> >>> Hi Igor, >>> >>> Thank you for the reply and information. >>> I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd >>> 65537` correctly defers writes in my clusters. >>> >>> Best regards, >>> >>> Dan >>> >>> >>> >>> On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov >>> wrote: >>> > >>> > Hi Dan, >>> > >>> > I can confirm this is a regression introduced by >>> https://github.com/ceph/ceph/pull/42725. >>> > >>> > Indeed strict comparison is a key point in your specific case but >>> generally it looks like this piece of code needs more redesign to better >>> handle fragmented allocations (and issue deferred write for every short >>> enough fragment independently). >>> > >>> > So I'm looking for a way to improve that at the moment. Will fallback >>> to trivial comparison fix if I fail to do find better solution. >>> > >>> > Meanwhile you can adjust bluestore_min_alloc_size_hdd indeed but I'd >>> prefer not to raise it that high as 128K to avoid too many writes being >>> deferred (and hence DB overburden). >>> > >>> > IMO setting the parameter to 64K+1 should be fine. >>> > >>> > >>> > Thanks, >>> > >>> > Igor >>> > >>> > On 7/7/2022 12:43 AM, Dan van der Ster wrote: >>> > >>> > Hi Igor and others, >>> > >>> > (apologies for html, but i want to share a plot ;) ) >>> > >>> > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple >>> "rados bench -p test 10 write -b 4096 -t 1" latency probe showed something >>> is very wrong with deferred writes in pacific. >>> > Here is an example cluster, upgraded today: >>> > >>> > >>> > >>> > The OSDs are 12TB HDDs, formatted in nautilus with the default >>> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db. >>> > >>> > I found that the performance issue is because 4kB writes are no longer >>> deferred from those pre-pacific hdds to flash in pacific with the default >>> config !!! >>> > Here are example bench writes from both releases: >>> https://pastebin.com/raw/m0yL1H9Z >>> > >>> > I worked out that the issue is fixed if I set >>> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default. >>> Note the default was 32k in octopus). >>> > >>> > I think this is related to the fixes in >>> https://tracker.ceph.com/issues/52089 which landed in 16.2.6 -- >>> _do_alloc_write is comparing the prealloc size 0x1 with >>> bluestore_prefer_deferred_size_hdd (0x1) and the "strictly less than" >>> condition prevents deferred writes from ever happening. >>> > >>> > So I think this would impact anyone upgrading clusters with hdd/ssd >>> mixed osds ... surely we must not be the only clusters impacted by this?! >>> > >>> > Should we increase the default bluestore_prefer_deferred_size_hdd up >>> to 128kB or is there in fact a bug here? >>> > >>> > Best Regards, >>> > >>> > Dan >>> > >>> ___ >>> ceph-users mailing list -- ceph-users@ceph.io >>> To unsubscribe send an email to ceph-users-le...@ceph.io >>> >> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Yes, that is correct. No need to restart the osds. .. Dan On Thu., Jul. 14, 2022, 07:04 Zakhar Kirpichenko, wrote: > Hi! > > My apologies for butting in. Please confirm > that bluestore_prefer_deferred_size_hdd is a runtime option, which doesn't > require OSDs to be stopped or rebuilt? > > Best regards, > Zakhar > > On Tue, 12 Jul 2022 at 14:46, Dan van der Ster wrote: > >> Hi Igor, >> >> Thank you for the reply and information. >> I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd >> 65537` correctly defers writes in my clusters. >> >> Best regards, >> >> Dan >> >> >> >> On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov >> wrote: >> > >> > Hi Dan, >> > >> > I can confirm this is a regression introduced by >> https://github.com/ceph/ceph/pull/42725. >> > >> > Indeed strict comparison is a key point in your specific case but >> generally it looks like this piece of code needs more redesign to better >> handle fragmented allocations (and issue deferred write for every short >> enough fragment independently). >> > >> > So I'm looking for a way to improve that at the moment. Will fallback >> to trivial comparison fix if I fail to do find better solution. >> > >> > Meanwhile you can adjust bluestore_min_alloc_size_hdd indeed but I'd >> prefer not to raise it that high as 128K to avoid too many writes being >> deferred (and hence DB overburden). >> > >> > IMO setting the parameter to 64K+1 should be fine. >> > >> > >> > Thanks, >> > >> > Igor >> > >> > On 7/7/2022 12:43 AM, Dan van der Ster wrote: >> > >> > Hi Igor and others, >> > >> > (apologies for html, but i want to share a plot ;) ) >> > >> > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple >> "rados bench -p test 10 write -b 4096 -t 1" latency probe showed something >> is very wrong with deferred writes in pacific. >> > Here is an example cluster, upgraded today: >> > >> > >> > >> > The OSDs are 12TB HDDs, formatted in nautilus with the default >> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db. >> > >> > I found that the performance issue is because 4kB writes are no longer >> deferred from those pre-pacific hdds to flash in pacific with the default >> config !!! >> > Here are example bench writes from both releases: >> https://pastebin.com/raw/m0yL1H9Z >> > >> > I worked out that the issue is fixed if I set >> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default. >> Note the default was 32k in octopus). >> > >> > I think this is related to the fixes in >> https://tracker.ceph.com/issues/52089 which landed in 16.2.6 -- >> _do_alloc_write is comparing the prealloc size 0x1 with >> bluestore_prefer_deferred_size_hdd (0x1) and the "strictly less than" >> condition prevents deferred writes from ever happening. >> > >> > So I think this would impact anyone upgrading clusters with hdd/ssd >> mixed osds ... surely we must not be the only clusters impacted by this?! >> > >> > Should we increase the default bluestore_prefer_deferred_size_hdd up to >> 128kB or is there in fact a bug here? >> > >> > Best Regards, >> > >> > Dan >> > >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io >> > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
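In other words, the workaround can be applied and verified entirely at runtime, roughly like this (osd.0 is just an example id; the ceph daemon command must be run on that OSD's host):

  # Set the option centrally; running OSDs pick it up without a restart
  ceph config set osd bluestore_prefer_deferred_size_hdd 65537

  # Confirm the value the monitors are distributing
  ceph config get osd bluestore_prefer_deferred_size_hdd

  # Confirm the live value inside a running OSD
  ceph daemon osd.0 config get bluestore_prefer_deferred_size_hdd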
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Hi! My apologies for butting in. Please confirm that bluestore_prefer_deferred_size_hdd is a runtime option, which doesn't require OSDs to be stopped or rebuilt? Best regards, Zakhar On Tue, 12 Jul 2022 at 14:46, Dan van der Ster wrote: > Hi Igor, > > Thank you for the reply and information. > I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd > 65537` correctly defers writes in my clusters. > > Best regards, > > Dan > > > > On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov > wrote: > > > > Hi Dan, > > > > I can confirm this is a regression introduced by > https://github.com/ceph/ceph/pull/42725. > > > > Indeed strict comparison is a key point in your specific case but > generally it looks like this piece of code needs more redesign to better > handle fragmented allocations (and issue deferred write for every short > enough fragment independently). > > > > So I'm looking for a way to improve that at the moment. Will fallback to > trivial comparison fix if I fail to do find better solution. > > > > Meanwhile you can adjust bluestore_min_alloc_size_hdd indeed but I'd > prefer not to raise it that high as 128K to avoid too many writes being > deferred (and hence DB overburden). > > > > IMO setting the parameter to 64K+1 should be fine. > > > > > > Thanks, > > > > Igor > > > > On 7/7/2022 12:43 AM, Dan van der Ster wrote: > > > > Hi Igor and others, > > > > (apologies for html, but i want to share a plot ;) ) > > > > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple "rados > bench -p test 10 write -b 4096 -t 1" latency probe showed something is very > wrong with deferred writes in pacific. > > Here is an example cluster, upgraded today: > > > > > > > > The OSDs are 12TB HDDs, formatted in nautilus with the default > bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db. > > > > I found that the performance issue is because 4kB writes are no longer > deferred from those pre-pacific hdds to flash in pacific with the default > config !!! > > Here are example bench writes from both releases: > https://pastebin.com/raw/m0yL1H9Z > > > > I worked out that the issue is fixed if I set > bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default. > Note the default was 32k in octopus). > > > > I think this is related to the fixes in > https://tracker.ceph.com/issues/52089 which landed in 16.2.6 -- > _do_alloc_write is comparing the prealloc size 0x1 with > bluestore_prefer_deferred_size_hdd (0x1) and the "strictly less than" > condition prevents deferred writes from ever happening. > > > > So I think this would impact anyone upgrading clusters with hdd/ssd > mixed osds ... surely we must not be the only clusters impacted by this?! > > > > Should we increase the default bluestore_prefer_deferred_size_hdd up to > 128kB or is there in fact a bug here? > > > > Best Regards, > > > > Dan > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Maybe. My plan is to attempt a general fix and, if that doesn't work out within a short time frame, publish a 'quick' one.

On 7/13/2022 4:58 PM, David Orman wrote:

Is this something that makes sense to do the 'quick' fix on for the next pacific release to minimize impact to users until the improved iteration can be implemented?

On Tue, Jul 12, 2022 at 6:16 AM Igor Fedotov wrote:

Hi Dan,

I can confirm this is a regression introduced by https://github.com/ceph/ceph/pull/42725.

Indeed strict comparison is a key point in your specific case but generally it looks like this piece of code needs more redesign to better handle fragmented allocations (and issue deferred write for every short enough fragment independently).

So I'm looking for a way to improve that at the moment. Will fallback to trivial comparison fix if I fail to do find better solution.

Meanwhile you can adjust bluestore_min_alloc_size_hdd indeed but I'd prefer not to raise it that high as 128K to avoid too many writes being deferred (and hence DB overburden).

IMO setting the parameter to 64K+1 should be fine.

Thanks,

Igor

On 7/7/2022 12:43 AM, Dan van der Ster wrote:
> Hi Igor and others,
>
> (apologies for html, but i want to share a plot ;) )
>
> We're upgrading clusters to v16.2.9 from v15.2.16, and our simple
> "rados bench -p test 10 write -b 4096 -t 1" latency probe showed
> something is very wrong with deferred writes in pacific.
> Here is an example cluster, upgraded today:
>
> image.png
>
> The OSDs are 12TB HDDs, formatted in nautilus with the default
> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.
>
> I found that the performance issue is because 4kB writes are no longer
> deferred from those pre-pacific hdds to flash in pacific with the
> default config !!!
> Here are example bench writes from both releases:
> https://pastebin.com/raw/m0yL1H9Z
>
> I worked out that the issue is fixed if I set
> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific
> default. Note the default was 32k in octopus).
>
> I think this is related to the fixes in
> https://tracker.ceph.com/issues/52089 which landed in 16.2.6 --
> _do_alloc_write is comparing the prealloc size 0x10000 with
> bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly less
> than" condition prevents deferred writes from ever happening.
>
> So I think this would impact anyone upgrading clusters with hdd/ssd
> mixed osds ... surely we must not be the only clusters impacted by this?!
>
> Should we increase the default bluestore_prefer_deferred_size_hdd up
> to 128kB or is there in fact a bug here?
>
> Best Regards,
>
> Dan
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Is this something that makes sense to do the 'quick' fix on for the next pacific release to minimize impact to users until the improved iteration can be implemented? On Tue, Jul 12, 2022 at 6:16 AM Igor Fedotov wrote: > Hi Dan, > > I can confirm this is a regression introduced by > https://github.com/ceph/ceph/pull/42725. > > Indeed strict comparison is a key point in your specific case but > generally it looks like this piece of code needs more redesign to > better handle fragmented allocations (and issue deferred write for every > short enough fragment independently). > > So I'm looking for a way to improve that at the moment. Will fallback to > trivial comparison fix if I fail to do find better solution. > > Meanwhile you can adjust bluestore_min_alloc_size_hdd indeed but I'd > prefer not to raise it that high as 128K to avoid too many writes being > deferred (and hence DB overburden). > > IMO setting the parameter to 64K+1 should be fine. > > > Thanks, > > Igor > > On 7/7/2022 12:43 AM, Dan van der Ster wrote: > > Hi Igor and others, > > > > (apologies for html, but i want to share a plot ;) ) > > > > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple > > "rados bench -p test 10 write -b 4096 -t 1" latency probe showed > > something is very wrong with deferred writes in pacific. > > Here is an example cluster, upgraded today: > > > > image.png > > > > The OSDs are 12TB HDDs, formatted in nautilus with the default > > bluestore_min_alloc_size_hdd = 64kB, and each have a large flash > block.db. > > > > I found that the performance issue is because 4kB writes are no longer > > deferred from those pre-pacific hdds to flash in pacific with the > > default config !!! > > Here are example bench writes from both releases: > > https://pastebin.com/raw/m0yL1H9Z > > > > I worked out that the issue is fixed if I set > > bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific > > default. Note the default was 32k in octopus). > > > > I think this is related to the fixes in > > https://tracker.ceph.com/issues/52089 which landed in 16.2.6 -- > > _do_alloc_write is comparing the prealloc size 0x1 with > > bluestore_prefer_deferred_size_hdd (0x1) and the "strictly less > > than" condition prevents deferred writes from ever happening. > > > > So I think this would impact anyone upgrading clusters with hdd/ssd > > mixed osds ... surely we must not be the only clusters impacted by this?! > > > > Should we increase the default bluestore_prefer_deferred_size_hdd up > > to 128kB or is there in fact a bug here? > > > > Best Regards, > > > > Dan > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Hi Igor, Thank you for the reply and information. I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd 65537` correctly defers writes in my clusters. Best regards, Dan On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov wrote: > > Hi Dan, > > I can confirm this is a regression introduced by > https://github.com/ceph/ceph/pull/42725. > > Indeed strict comparison is a key point in your specific case but generally > it looks like this piece of code needs more redesign to better handle > fragmented allocations (and issue deferred write for every short enough > fragment independently). > > So I'm looking for a way to improve that at the moment. Will fallback to > trivial comparison fix if I fail to do find better solution. > > Meanwhile you can adjust bluestore_min_alloc_size_hdd indeed but I'd prefer > not to raise it that high as 128K to avoid too many writes being deferred > (and hence DB overburden). > > IMO setting the parameter to 64K+1 should be fine. > > > Thanks, > > Igor > > On 7/7/2022 12:43 AM, Dan van der Ster wrote: > > Hi Igor and others, > > (apologies for html, but i want to share a plot ;) ) > > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple "rados > bench -p test 10 write -b 4096 -t 1" latency probe showed something is very > wrong with deferred writes in pacific. > Here is an example cluster, upgraded today: > > > > The OSDs are 12TB HDDs, formatted in nautilus with the default > bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db. > > I found that the performance issue is because 4kB writes are no longer > deferred from those pre-pacific hdds to flash in pacific with the default > config !!! > Here are example bench writes from both releases: > https://pastebin.com/raw/m0yL1H9Z > > I worked out that the issue is fixed if I set > bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default. > Note the default was 32k in octopus). > > I think this is related to the fixes in https://tracker.ceph.com/issues/52089 > which landed in 16.2.6 -- _do_alloc_write is comparing the prealloc size > 0x1 with bluestore_prefer_deferred_size_hdd (0x1) and the "strictly > less than" condition prevents deferred writes from ever happening. > > So I think this would impact anyone upgrading clusters with hdd/ssd mixed > osds ... surely we must not be the only clusters impacted by this?! > > Should we increase the default bluestore_prefer_deferred_size_hdd up to 128kB > or is there in fact a bug here? > > Best Regards, > > Dan > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
yep! Thanks and sorry for the confusion.

On 7/12/2022 2:23 PM, Konstantin Shalygin wrote:

Hi Igor,

On 12 Jul 2022, at 14:16, Igor Fedotov wrote:

Meanwhile you can adjust bluestore_min_alloc_size_hdd indeed but I'd prefer not to raise it that high as 128K to avoid too many writes being deferred (and hence DB overburden).

For clarification, perhaps you mean bluestore_prefer_deferred_size_hdd ?

k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Hi Igor,

> On 12 Jul 2022, at 14:16, Igor Fedotov wrote:
>
> Meanwhile you can adjust bluestore_min_alloc_size_hdd indeed but I'd prefer
> not to raise it that high as 128K to avoid too many writes being deferred
> (and hence DB overburden).

For clarification, perhaps you mean bluestore_prefer_deferred_size_hdd ?

k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Hi Dan,

I can confirm this is a regression introduced by https://github.com/ceph/ceph/pull/42725.

Indeed, the strict comparison is a key point in your specific case, but generally it looks like this piece of code needs more redesign to better handle fragmented allocations (and issue a deferred write for every short-enough fragment independently).

So I'm looking for a way to improve that at the moment. I will fall back to the trivial comparison fix if I fail to find a better solution.

Meanwhile you can indeed adjust bluestore_min_alloc_size_hdd, but I'd prefer not to raise it as high as 128K, to avoid too many writes being deferred (and hence overburdening the DB).

IMO setting the parameter to 64K+1 should be fine.

Thanks,

Igor

On 7/7/2022 12:43 AM, Dan van der Ster wrote:

Hi Igor and others,

(apologies for html, but i want to share a plot ;) )

We're upgrading clusters to v16.2.9 from v15.2.16, and our simple "rados bench -p test 10 write -b 4096 -t 1" latency probe showed something is very wrong with deferred writes in pacific.
Here is an example cluster, upgraded today:

image.png

The OSDs are 12TB HDDs, formatted in nautilus with the default bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.

I found that the performance issue is because 4kB writes are no longer deferred from those pre-pacific hdds to flash in pacific with the default config !!!
Here are example bench writes from both releases: https://pastebin.com/raw/m0yL1H9Z

I worked out that the issue is fixed if I set bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default. Note the default was 32k in octopus).

I think this is related to the fixes in https://tracker.ceph.com/issues/52089 which landed in 16.2.6 -- _do_alloc_write is comparing the prealloc size 0x10000 with bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly less than" condition prevents deferred writes from ever happening.

So I think this would impact anyone upgrading clusters with hdd/ssd mixed osds ... surely we must not be the only clusters impacted by this?!

Should we increase the default bluestore_prefer_deferred_size_hdd up to 128kB or is there in fact a bug here?

Best Regards,

Dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
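A quick way to tell which OSDs are affected (i.e. were created before pacific with the old 64K min_alloc_size) is to check the value BlueStore reports at startup. A sketch, assuming the default log location on each OSD host:

  # Pre-pacific HDD OSDs report min_alloc_size 0x10000 (64K); OSDs rebuilt on pacific report 0x1000 (4K)
  grep '_open_super_meta min_alloc_size' /var/log/ceph/ceph-osd.*.log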
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
On 7 Jul 2022, at 15:41, Dan van der Ster wrote:
>
> How is one supposed to redeploy OSDs on a multi-PB cluster while the
> performance is degraded?

That is a very strong point! It is good that this case can be worked around by setting bluestore_prefer_deferred_size_hdd to 128k, and I think we need to take Igor's answer into account:

* this is a bug
* bluestore_prefer_deferred_size_hdd should be increased by the operator until the migration to the 4k min_alloc_size is finished

k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Hi,

On Thu, Jul 7, 2022 at 2:37 PM Konstantin Shalygin wrote:
>
> Hi,
>
> On 7 Jul 2022, at 13:04, Dan van der Ster wrote:
>
> I'm not sure the html mail made it to the lists -- resending in plain text.
> I've also opened https://tracker.ceph.com/issues/56488
>
>
> I think with pacific you need to redeploy all OSD's to respect the new
> default bluestore_min_alloc_size_hdd = 4096 [1]
> Or not?
>

Understood, yes, that is another "solution". But it is incredibly impractical, I would say impossible, for loaded production installations. (How is one supposed to redeploy OSDs on a multi-PB cluster while the performance is degraded?)

-- Dan

> [1] https://github.com/ceph/ceph/pull/34588
>
> k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Hi again,

I'm not sure the html mail made it to the lists -- resending in plain text.
I've also opened https://tracker.ceph.com/issues/56488

Cheers, Dan

On Wed, Jul 6, 2022 at 11:43 PM Dan van der Ster wrote:
>
> Hi Igor and others,
>
> (apologies for html, but i want to share a plot ;) )
>
> We're upgrading clusters to v16.2.9 from v15.2.16, and our simple "rados
> bench -p test 10 write -b 4096 -t 1" latency probe showed something is very
> wrong with deferred writes in pacific.
> Here is an example cluster, upgraded today:
>
>
>
> The OSDs are 12TB HDDs, formatted in nautilus with the default
> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.
>
> I found that the performance issue is because 4kB writes are no longer
> deferred from those pre-pacific hdds to flash in pacific with the default
> config !!!
> Here are example bench writes from both releases:
> https://pastebin.com/raw/m0yL1H9Z
>
> I worked out that the issue is fixed if I set
> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default.
> Note the default was 32k in octopus).
>
> I think this is related to the fixes in https://tracker.ceph.com/issues/52089
> which landed in 16.2.6 -- _do_alloc_write is comparing the prealloc size
> 0x10000 with bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly
> less than" condition prevents deferred writes from ever happening.
>
> So I think this would impact anyone upgrading clusters with hdd/ssd mixed
> osds ... surely we must not be the only clusters impacted by this?!
>
> Should we increase the default bluestore_prefer_deferred_size_hdd up to 128kB
> or is there in fact a bug here?
>
> Best Regards,
>
> Dan
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
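For reference, the latency probe described above can be reproduced with something like the following. The pool name "test" and its pg count are illustrative; single-threaded 4 KiB writes make the missing deferral show up directly in the average write latency, so the probe can be run before and after applying the bluestore_prefer_deferred_size_hdd workaround:

  # Create a small throwaway pool for the probe (skip if a suitable test pool already exists)
  ceph osd pool create test 32

  # 10 seconds of 4 KiB writes with a single thread; compare the reported
  # average latency before and after setting bluestore_prefer_deferred_size_hdd = 65537
  rados bench -p test 10 write -b 4096 -t 1

  # Clean up (requires mon_allow_pool_delete=true)
  ceph osd pool rm test test --yes-i-really-really-mean-it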