[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-14 Thread Dan van der Ster
OK I recreated one OSD. It now has 4k min_alloc_size:

2022-07-14T10:52:58.382+0200 7fe5ec0aa200  1
bluestore(/var/lib/ceph/osd/ceph-0/) _open_super_meta min_alloc_size
0x1000

and I tested all these bluestore_prefer_deferred_size_hdd values:

4096: not deferred
4097: "_do_alloc_write deferring 0x1000 write via deferred"
65536: "_do_alloc_write deferring 0x1000 write via deferred"
65537: "_do_alloc_write deferring 0x1000 write via deferred"

With bluestore_prefer_deferred_size_hdd = 64k, I see that writes up to
0xf000 are deferred, e.g.:

 _do_alloc_write deferring 0xf000 write via deferred
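
To make the pattern above concrete, here is a minimal illustrative C++ sketch
(not actual BlueStore code; the function and variable names are invented) of the
strict less-than check these tests exercise: with a 4 KiB allocation, a threshold
of exactly 4096 never defers, while 4097 and above do. The same check explains the
pre-pacific case discussed later in the thread, where a 64k-min_alloc OSD
preallocates 0x10000, which is not strictly less than the pacific default of 0x10000.

    // Illustrative sketch only -- not BlueStore source. A write takes the
    // deferred path only when the allocated length is strictly less than
    // the prefer_deferred threshold, so equal values never defer.
    #include <cstdint>
    #include <initializer_list>
    #include <iostream>

    static bool would_defer(uint64_t alloc_len, uint64_t prefer_deferred_size) {
      return alloc_len < prefer_deferred_size;  // strict "less than"
    }

    int main() {
      const uint64_t alloc_len = 0x1000;  // 4 KiB allocation on a 4k-min_alloc OSD
      for (uint64_t threshold : {4096, 4097, 65536, 65537}) {
        std::cout << "bluestore_prefer_deferred_size_hdd=" << threshold
                  << (would_defer(alloc_len, threshold) ? ": deferred" : ": not deferred")
                  << "\n";
      }
    }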

Cheers, Dan

On Thu, Jul 14, 2022 at 9:37 AM Konstantin Shalygin  wrote:
>
> Dan, did you test redeploying one of your OSDs with the default pacific
> bluestore_min_alloc_size_hdd (4096)?
> Would that also resolve this issue (i.e. a cluster is simply not affected when
> all options are at their defaults)?
>
>
> Thanks,
> k
>
> On 14 Jul 2022, at 08:43, Dan van der Ster  wrote:
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-14 Thread Konstantin Shalygin
Dan, did you test redeploying one of your OSDs with the default pacific
bluestore_min_alloc_size_hdd (4096)?
Would that also resolve this issue (i.e. a cluster is simply not affected when
all options are at their defaults)?


Thanks,
k

> On 14 Jul 2022, at 08:43, Dan van der Ster  wrote:
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-14 Thread Zakhar Kirpichenko
Many thanks, Dan. Much appreciated!

/Z

On Thu, 14 Jul 2022 at 08:43, Dan van der Ster  wrote:

> Yes, that is correct. No need to restart the osds.
>
> .. Dan
>
>
> On Thu., Jul. 14, 2022, 07:04 Zakhar Kirpichenko, 
> wrote:
>
>> Hi!
>>
>> My apologies for butting in. Please confirm
>> that bluestore_prefer_deferred_size_hdd is a runtime option, which doesn't
>> require OSDs to be stopped or rebuilt?
>>
>> Best regards,
>> Zakhar
>>
>> On Tue, 12 Jul 2022 at 14:46, Dan van der Ster 
>> wrote:
>>
>>> Hi Igor,
>>>
>>> Thank you for the reply and information.
>>> I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd
>>> 65537` correctly defers writes in my clusters.
>>>
>>> Best regards,
>>>
>>> Dan
>>>
>>>
>>>
>>> On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov 
>>> wrote:
>>> >
>>> > Hi Dan,
>>> >
>>> > I can confirm this is a regression introduced by
>>> https://github.com/ceph/ceph/pull/42725.
>>> >
>>> > Indeed strict comparison is a key point in your specific case but
>>> generally  it looks like this piece of code needs more redesign to better
>>> handle fragmented allocations (and issue deferred write for every short
>>> enough fragment independently).
>>> >
>>> > So I'm looking for a way to improve that at the moment. I'll fall back
>>> to the trivial comparison fix if I fail to find a better solution.
>>> >
>>> > Meanwhile you can indeed adjust bluestore_min_alloc_size_hdd, but I'd
>>> prefer not to raise it as high as 128K, to avoid too many writes being
>>> deferred (and hence overburdening the DB).
>>> >
>>> > IMO setting the parameter to 64K+1 should be fine.
>>> >
>>> >
>>> > Thanks,
>>> >
>>> > Igor
>>> >
>>> > On 7/7/2022 12:43 AM, Dan van der Ster wrote:
>>> >
>>> > Hi Igor and others,
>>> >
>>> > (apologies for html, but i want to share a plot ;) )
>>> >
>>> > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple
>>> "rados bench -p test 10 write -b 4096 -t 1" latency probe showed something
>>> is very wrong with deferred writes in pacific.
>>> > Here is an example cluster, upgraded today:
>>> >
>>> >
>>> >
>>> > The OSDs are 12TB HDDs, formatted in nautilus with the default
>>> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.
>>> >
>>> > I found that the performance issue is because 4kB writes are no longer
>>> deferred from those pre-pacific hdds to flash in pacific with the default
>>> config !!!
>>> > Here are example bench writes from both releases:
>>> https://pastebin.com/raw/m0yL1H9Z
>>> >
>>> > I worked out that the issue is fixed if I set
>>> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default.
>>> Note the default was 32k in octopus).
>>> >
>>> > I think this is related to the fixes in
>>> https://tracker.ceph.com/issues/52089 which landed in 16.2.6 --
>>> _do_alloc_write is comparing the prealloc size 0x10000 with
>>> bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly less than"
>>> condition prevents deferred writes from ever happening.
>>> >
>>> > So I think this would impact anyone upgrading clusters with hdd/ssd
>>> mixed osds ... surely we must not be the only clusters impacted by this?!
>>> >
>>> > Should we increase the default bluestore_prefer_deferred_size_hdd up
>>> to 128kB or is there in fact a bug here?
>>> >
>>> > Best Regards,
>>> >
>>> > Dan
>>> >
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-13 Thread Dan van der Ster
Yes, that is correct. No need to restart the osds.

.. Dan


On Thu., Jul. 14, 2022, 07:04 Zakhar Kirpichenko,  wrote:

> Hi!
>
> My apologies for butting in. Please confirm
> that bluestore_prefer_deferred_size_hdd is a runtime option, which doesn't
> require OSDs to be stopped or rebuilt?
>
> Best regards,
> Zakhar
>
> On Tue, 12 Jul 2022 at 14:46, Dan van der Ster  wrote:
>
>> Hi Igor,
>>
>> Thank you for the reply and information.
>> I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd
>> 65537` correctly defers writes in my clusters.
>>
>> Best regards,
>>
>> Dan
>>
>>
>>
>> On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov 
>> wrote:
>> >
>> > Hi Dan,
>> >
>> > I can confirm this is a regression introduced by
>> https://github.com/ceph/ceph/pull/42725.
>> >
>> > Indeed strict comparison is a key point in your specific case but
>> generally  it looks like this piece of code needs more redesign to better
>> handle fragmented allocations (and issue deferred write for every short
>> enough fragment independently).
>> >
>> > So I'm looking for a way to improve that at the moment. I'll fall back
>> to the trivial comparison fix if I fail to find a better solution.
>> >
>> > Meanwhile you can indeed adjust bluestore_min_alloc_size_hdd, but I'd
>> prefer not to raise it as high as 128K, to avoid too many writes being
>> deferred (and hence overburdening the DB).
>> >
>> > IMO setting the parameter to 64K+1 should be fine.
>> >
>> >
>> > Thanks,
>> >
>> > Igor
>> >
>> > On 7/7/2022 12:43 AM, Dan van der Ster wrote:
>> >
>> > Hi Igor and others,
>> >
>> > (apologies for html, but i want to share a plot ;) )
>> >
>> > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple
>> "rados bench -p test 10 write -b 4096 -t 1" latency probe showed something
>> is very wrong with deferred writes in pacific.
>> > Here is an example cluster, upgraded today:
>> >
>> >
>> >
>> > The OSDs are 12TB HDDs, formatted in nautilus with the default
>> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.
>> >
>> > I found that the performance issue is because 4kB writes are no longer
>> deferred from those pre-pacific hdds to flash in pacific with the default
>> config !!!
>> > Here are example bench writes from both releases:
>> https://pastebin.com/raw/m0yL1H9Z
>> >
>> > I worked out that the issue is fixed if I set
>> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default.
>> Note the default was 32k in octopus).
>> >
>> > I think this is related to the fixes in
>> https://tracker.ceph.com/issues/52089 which landed in 16.2.6 --
>> _do_alloc_write is comparing the prealloc size 0x10000 with
>> bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly less than"
>> condition prevents deferred writes from ever happening.
>> >
>> > So I think this would impact anyone upgrading clusters with hdd/ssd
>> mixed osds ... surely we must not be the only clusters impacted by this?!
>> >
>> > Should we increase the default bluestore_prefer_deferred_size_hdd up to
>> 128kB or is there in fact a bug here?
>> >
>> > Best Regards,
>> >
>> > Dan
>> >
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-13 Thread Zakhar Kirpichenko
Hi!

My apologies for butting in. Please confirm
that bluestore_prefer_deferred_size_hdd is a runtime option, which doesn't
require OSDs to be stopped or rebuilt?

Best regards,
Zakhar

On Tue, 12 Jul 2022 at 14:46, Dan van der Ster  wrote:

> Hi Igor,
>
> Thank you for the reply and information.
> I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd
> 65537` correctly defers writes in my clusters.
>
> Best regards,
>
> Dan
>
>
>
> On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov 
> wrote:
> >
> > Hi Dan,
> >
> > I can confirm this is a regression introduced by
> https://github.com/ceph/ceph/pull/42725.
> >
> > Indeed strict comparison is a key point in your specific case but
> generally  it looks like this piece of code needs more redesign to better
> handle fragmented allocations (and issue deferred write for every short
> enough fragment independently).
> >
> > So I'm looking for a way to improve that at the moment. I'll fall back to
> the trivial comparison fix if I fail to find a better solution.
> >
> > Meanwhile you can indeed adjust bluestore_min_alloc_size_hdd, but I'd
> prefer not to raise it as high as 128K, to avoid too many writes being
> deferred (and hence overburdening the DB).
> >
> > IMO setting the parameter to 64K+1 should be fine.
> >
> >
> > Thanks,
> >
> > Igor
> >
> > On 7/7/2022 12:43 AM, Dan van der Ster wrote:
> >
> > Hi Igor and others,
> >
> > (apologies for html, but i want to share a plot ;) )
> >
> > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple "rados
> bench -p test 10 write -b 4096 -t 1" latency probe showed something is very
> wrong with deferred writes in pacific.
> > Here is an example cluster, upgraded today:
> >
> >
> >
> > The OSDs are 12TB HDDs, formatted in nautilus with the default
> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.
> >
> > I found that the performance issue is because 4kB writes are no longer
> deferred from those pre-pacific hdds to flash in pacific with the default
> config !!!
> > Here are example bench writes from both releases:
> https://pastebin.com/raw/m0yL1H9Z
> >
> > I worked out that the issue is fixed if I set
> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default.
> Note the default was 32k in octopus).
> >
> > I think this is related to the fixes in
> https://tracker.ceph.com/issues/52089 which landed in 16.2.6 --
> _do_alloc_write is comparing the prealloc size 0x10000 with
> bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly less than"
> condition prevents deferred writes from ever happening.
> >
> > So I think this would impact anyone upgrading clusters with hdd/ssd
> mixed osds ... surely we must not be the only clusters impacted by this?!
> >
> > Should we increase the default bluestore_prefer_deferred_size_hdd up to
> 128kB or is there in fact a bug here?
> >
> > Best Regards,
> >
> > Dan
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-13 Thread Igor Fedotov
Maybe. My plan is to attempt a general fix and, if that doesn't work out within
a short time frame, publish a 'quick' one.



On 7/13/2022 4:58 PM, David Orman wrote:
Does it make sense to apply the 'quick' fix for the next pacific release, to
minimize the impact on users until the improved iteration can be implemented?


On Tue, Jul 12, 2022 at 6:16 AM Igor Fedotov  
wrote:


Hi Dan,

I can confirm this is a regression introduced by
https://github.com/ceph/ceph/pull/42725.

Indeed strict comparison is a key point in your specific case but
generally  it looks like this piece of code needs more redesign to
better handle fragmented allocations (and issue deferred write for
every
short enough fragment independently).

So I'm looking for a way to improve that at the moment. I'll fall back to
the trivial comparison fix if I fail to find a better solution.

Meanwhile you can indeed adjust bluestore_min_alloc_size_hdd, but I'd prefer
not to raise it as high as 128K, to avoid too many writes being deferred
(and hence overburdening the DB).

IMO setting the parameter to 64K+1 should be fine.


Thanks,

Igor

On 7/7/2022 12:43 AM, Dan van der Ster wrote:
> Hi Igor and others,
>
> (apologies for html, but i want to share a plot ;) )
>
> We're upgrading clusters to v16.2.9 from v15.2.16, and our simple
> "rados bench -p test 10 write -b 4096 -t 1" latency probe showed
> something is very wrong with deferred writes in pacific.
> Here is an example cluster, upgraded today:
>
> image.png
>
> The OSDs are 12TB HDDs, formatted in nautilus with the default
> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash
block.db.
>
> I found that the performance issue is because 4kB writes are no
longer
> deferred from those pre-pacific hdds to flash in pacific with the
> default config !!!
> Here are example bench writes from both releases:
> https://pastebin.com/raw/m0yL1H9Z
>
> I worked out that the issue is fixed if I set
> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific
> default. Note the default was 32k in octopus).
>
> I think this is related to the fixes in
> https://tracker.ceph.com/issues/52089 which landed in 16.2.6 --
> _do_alloc_write is comparing the prealloc size 0x10000 with
> bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly less
> than" condition prevents deferred writes from ever happening.
>
> So I think this would impact anyone upgrading clusters with hdd/ssd
> mixed osds ... surely we must not be the only clusters impacted
by this?!
>
> Should we increase the default
bluestore_prefer_deferred_size_hdd up
> to 128kB or is there in fact a bug here?
>
> Best Regards,
>
> Dan
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-13 Thread David Orman
Does it make sense to apply the 'quick' fix for the next pacific release, to
minimize the impact on users until the improved iteration can be implemented?

On Tue, Jul 12, 2022 at 6:16 AM Igor Fedotov  wrote:

> Hi Dan,
>
> I can confirm this is a regression introduced by
> https://github.com/ceph/ceph/pull/42725.
>
> Indeed strict comparison is a key point in your specific case but
> generally  it looks like this piece of code needs more redesign to
> better handle fragmented allocations (and issue deferred write for every
> short enough fragment independently).
>
> So I'm looking for a way to improve that at the moment. I'll fall back to
> the trivial comparison fix if I fail to find a better solution.
>
> Meanwhile you can indeed adjust bluestore_min_alloc_size_hdd, but I'd
> prefer not to raise it as high as 128K, to avoid too many writes being
> deferred (and hence overburdening the DB).
>
> IMO setting the parameter to 64K+1 should be fine.
>
>
> Thanks,
>
> Igor
>
> On 7/7/2022 12:43 AM, Dan van der Ster wrote:
> > Hi Igor and others,
> >
> > (apologies for html, but i want to share a plot ;) )
> >
> > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple
> > "rados bench -p test 10 write -b 4096 -t 1" latency probe showed
> > something is very wrong with deferred writes in pacific.
> > Here is an example cluster, upgraded today:
> >
> > image.png
> >
> > The OSDs are 12TB HDDs, formatted in nautilus with the default
> > bluestore_min_alloc_size_hdd = 64kB, and each have a large flash
> block.db.
> >
> > I found that the performance issue is because 4kB writes are no longer
> > deferred from those pre-pacific hdds to flash in pacific with the
> > default config !!!
> > Here are example bench writes from both releases:
> > https://pastebin.com/raw/m0yL1H9Z
> >
> > I worked out that the issue is fixed if I set
> > bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific
> > default. Note the default was 32k in octopus).
> >
> > I think this is related to the fixes in
> > https://tracker.ceph.com/issues/52089 which landed in 16.2.6 --
> > _do_alloc_write is comparing the prealloc size 0x10000 with
> > bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly less
> > than" condition prevents deferred writes from ever happening.
> >
> > So I think this would impact anyone upgrading clusters with hdd/ssd
> > mixed osds ... surely we must not be the only clusters impacted by this?!
> >
> > Should we increase the default bluestore_prefer_deferred_size_hdd up
> > to 128kB or is there in fact a bug here?
> >
> > Best Regards,
> >
> > Dan
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-12 Thread Dan van der Ster
Hi Igor,

Thank you for the reply and information.
I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd
65537` correctly defers writes in my clusters.

Best regards,

Dan



On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov  wrote:
>
> Hi Dan,
>
> I can confirm this is a regression introduced by 
> https://github.com/ceph/ceph/pull/42725.
>
> Indeed strict comparison is a key point in your specific case but generally  
> it looks like this piece of code needs more redesign to better handle 
> fragmented allocations (and issue deferred write for every short enough 
> fragment independently).
>
> So I'm looking for a way to improve that at the moment. I'll fall back to
> the trivial comparison fix if I fail to find a better solution.
>
> Meanwhile you can indeed adjust bluestore_min_alloc_size_hdd, but I'd prefer
> not to raise it as high as 128K, to avoid too many writes being deferred
> (and hence overburdening the DB).
>
> IMO setting the parameter to 64K+1 should be fine.
>
>
> Thanks,
>
> Igor
>
> On 7/7/2022 12:43 AM, Dan van der Ster wrote:
>
> Hi Igor and others,
>
> (apologies for html, but i want to share a plot ;) )
>
> We're upgrading clusters to v16.2.9 from v15.2.16, and our simple "rados 
> bench -p test 10 write -b 4096 -t 1" latency probe showed something is very 
> wrong with deferred writes in pacific.
> Here is an example cluster, upgraded today:
>
>
>
> The OSDs are 12TB HDDs, formatted in nautilus with the default 
> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.
>
> I found that the performance issue is because 4kB writes are no longer 
> deferred from those pre-pacific hdds to flash in pacific with the default 
> config !!!
> Here are example bench writes from both releases: 
> https://pastebin.com/raw/m0yL1H9Z
>
> I worked out that the issue is fixed if I set 
> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default. 
> Note the default was 32k in octopus).
>
> I think this is related to the fixes in https://tracker.ceph.com/issues/52089 
> which landed in 16.2.6 -- _do_alloc_write is comparing the prealloc size 
> 0x10000 with bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly 
> less than" condition prevents deferred writes from ever happening.
>
> So I think this would impact anyone upgrading clusters with hdd/ssd mixed 
> osds ... surely we must not be the only clusters impacted by this?!
>
> Should we increase the default bluestore_prefer_deferred_size_hdd up to 128kB 
> or is there in fact a bug here?
>
> Best Regards,
>
> Dan
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-12 Thread Igor Fedotov

yep! Thanks and sorry for the confusion.

On 7/12/2022 2:23 PM, Konstantin Shalygin wrote:

Hi Igor,


On 12 Jul 2022, at 14:16, Igor Fedotov  wrote:

Meanwhile you can indeed adjust bluestore_min_alloc_size_hdd, but I'd
prefer not to raise it as high as 128K, to avoid too many writes
being deferred (and hence overburdening the DB).


For clarification, perhaps you mean bluestore_prefer_deferred_size_hdd?



k

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-12 Thread Konstantin Shalygin
Hi Igor,

> On 12 Jul 2022, at 14:16, Igor Fedotov  wrote:
> 
> Meanwhile you can indeed adjust bluestore_min_alloc_size_hdd, but I'd prefer
> not to raise it as high as 128K, to avoid too many writes being deferred
> (and hence overburdening the DB).

For clarification, perhaps you mean bluestore_prefer_deferred_size_hdd?



k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-12 Thread Igor Fedotov

Hi Dan,

I can confirm this is a regression introduced by 
https://github.com/ceph/ceph/pull/42725.


Indeed strict comparison is a key point in your specific case but 
generally  it looks like this piece of code needs more redesign to 
better handle fragmented allocations (and issue deferred write for every 
short enough fragment independently).
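
A rough, hypothetical sketch of that direction (illustrative only, not a patch
from this thread; the types and names are invented): decide per allocated
fragment instead of once for the whole preallocation, deferring only the short
pieces.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Extent { uint64_t offset; uint64_t length; };

    // Defer the short fragments via the WAL, write the long ones directly.
    static void write_fragments(const std::vector<Extent>& extents,
                                uint64_t prefer_deferred_size) {
      for (const Extent& e : extents) {
        if (e.length < prefer_deferred_size) {
          std::printf("defer  0x%llx @ 0x%llx\n",
                      (unsigned long long)e.length, (unsigned long long)e.offset);
        } else {
          std::printf("direct 0x%llx @ 0x%llx\n",
                      (unsigned long long)e.length, (unsigned long long)e.offset);
        }
      }
    }

    int main() {
      // A fragmented allocation: one short extent and one long extent.
      std::vector<Extent> extents = {{0x0, 0x1000}, {0x10000, 0x40000}};
      write_fragments(extents, 0x10000);  // 64k threshold
    }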


So I'm looking for a way to improve that at the moment. I'll fall back to
the trivial comparison fix if I fail to find a better solution.


Meanwhile you can indeed adjust bluestore_min_alloc_size_hdd, but I'd
prefer not to raise it as high as 128K, to avoid too many writes being
deferred (and hence overburdening the DB).


IMO setting the parameter to 64K+1 should be fine.


Thanks,

Igor

On 7/7/2022 12:43 AM, Dan van der Ster wrote:

Hi Igor and others,

(apologies for html, but i want to share a plot ;) )

We're upgrading clusters to v16.2.9 from v15.2.16, and our simple 
"rados bench -p test 10 write -b 4096 -t 1" latency probe showed 
something is very wrong with deferred writes in pacific.

Here is an example cluster, upgraded today:

image.png

The OSDs are 12TB HDDs, formatted in nautilus with the default 
bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.


I found that the performance issue is because 4kB writes are no longer 
deferred from those pre-pacific hdds to flash in pacific with the 
default config !!!
Here are example bench writes from both releases: 
https://pastebin.com/raw/m0yL1H9Z


I worked out that the issue is fixed if I set 
bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific 
default. Note the default was 32k in octopus).


I think this is related to the fixes in 
https://tracker.ceph.com/issues/52089 which landed in 16.2.6 -- 
_do_alloc_write is comparing the prealloc size 0x10000 with 
bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly less 
than" condition prevents deferred writes from ever happening.


So I think this would impact anyone upgrading clusters with hdd/ssd 
mixed osds ... surely we must not be the only clusters impacted by this?!


Should we increase the default bluestore_prefer_deferred_size_hdd up 
to 128kB or is there in fact a bug here?


Best Regards,

Dan


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-07 Thread Konstantin Shalygin
On 7 Jul 2022, at 15:41, Dan van der Ster  wrote:
> 
> How is one supposed to redeploy OSDs on a multi-PB cluster while the
> performance is degraded?

This is a very strong point of view!

It's good that this case can be fixed by setting bluestore_prefer_deferred_size_hdd
to 128k. I think we need to analyze Igor's answer to determine whether:

* this is a bug, or
* bluestore_prefer_deferred_size_hdd should be increased by the operator until the
migration to 4k min_alloc_size is finished


k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-07 Thread Dan van der Ster
Hi,

On Thu, Jul 7, 2022 at 2:37 PM Konstantin Shalygin  wrote:
>
> Hi,
>
> On 7 Jul 2022, at 13:04, Dan van der Ster  wrote:
>
> I'm not sure the html mail made it to the lists -- resending in plain text.
> I've also opened https://tracker.ceph.com/issues/56488
>
>
> I think with pacific you need to redeploy all OSDs to respect the new
> default bluestore_min_alloc_size_hdd = 4096 [1]
> Or not?
>

Understood, yes, that is another "solution". But it is incredibly
impractical, I would say impossible, for loaded production
installations.
(How is one supposed to redeploy OSDs on a multi-PB cluster while the
performance is degraded?)

-- Dan

>
> [1] https://github.com/ceph/ceph/pull/34588
>
> k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-07 Thread Dan van der Ster
Hi again,

I'm not sure the html mail made it to the lists -- resending in plain text.
I've also opened https://tracker.ceph.com/issues/56488

Cheers, Dan


On Wed, Jul 6, 2022 at 11:43 PM Dan van der Ster  wrote:
>
> Hi Igor and others,
>
> (apologies for html, but i want to share a plot ;) )
>
> We're upgrading clusters to v16.2.9 from v15.2.16, and our simple "rados 
> bench -p test 10 write -b 4096 -t 1" latency probe showed something is very 
> wrong with deferred writes in pacific.
> Here is an example cluster, upgraded today:
>
>
>
> The OSDs are 12TB HDDs, formatted in nautilus with the default 
> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.
>
> I found that the performance issue is because 4kB writes are no longer 
> deferred from those pre-pacific hdds to flash in pacific with the default 
> config !!!
> Here are example bench writes from both releases: 
> https://pastebin.com/raw/m0yL1H9Z
>
> I worked out that the issue is fixed if I set 
> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default. 
> Note the default was 32k in octopus).
>
> I think this is related to the fixes in https://tracker.ceph.com/issues/52089 
> which landed in 16.2.6 -- _do_alloc_write is comparing the prealloc size 
> 0x10000 with bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly 
> less than" condition prevents deferred writes from ever happening.
>
> So I think this would impact anyone upgrading clusters with hdd/ssd mixed 
> osds ... surely we must not be the only clusters impacted by this?!
>
> Should we increase the default bluestore_prefer_deferred_size_hdd up to 128kB 
> or is there in fact a bug here?
>
> Best Regards,
>
> Dan
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io