[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-21 Thread Igor Fedotov

I can't say anything about "write_buffer_size" tuning; I've never tried that.

But I presume that "max_bytes_for_level_base" and
"max_bytes_for_level_multiplier" are the parameters that should be tuned
to modify RocksDB level granularity.


But I have no idea how safe this is in a production environment.
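
To make the effect of these two parameters concrete, here is a minimal Python sketch of how leveled compaction derives per-level target sizes: level 1 gets "max_bytes_for_level_base", and each further level is the previous one multiplied by "max_bytes_for_level_multiplier". The inputs (256 MiB and 10) are RocksDB's stock defaults and are an assumption here, since the thread does not state the effective Ceph values; the helper name level_targets and the alternative multiplier of 5 are purely illustrative.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def level_targets(base_bytes, multiplier, levels):
    """Target size in bytes for levels L1..L<levels> (L1 equals the base)."""
    return [int(base_bytes * multiplier ** i) for i in range(levels)]


# Assumed inputs: RocksDB's stock defaults, not values taken from this thread.
default_targets = level_targets(256 * MIB, 10, 4)

# Hypothetical tuning: a smaller multiplier gives finer-grained level steps.
finer_targets = level_targets(256 * MIB, 5, 4)

for name, targets in (("multiplier=10", default_targets),
                      ("multiplier=5", finer_targets)):
    print(name, [round(t / GIB, 2) for t in targets])
```

With the default multiplier the steps are in the same ballpark as the 300MB / 3GB / 30GB / 300GB figures quoted further down the thread; a smaller multiplier narrows the gaps between levels, which is what "level granularity" refers to. Whether such a change is safe for BlueStore is exactly the open question above.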


Thanks,

Igor

On 8/21/2020 12:51 AM, Seena Fallah wrote:
Ok thanks. And also as you mentioned in the doc you shared from 
cloudferro, It's not good to change `write_buffer_size` for bluestore 
rocksdb to fit our db?


On Fri, Aug 21, 2020 at 1:46 AM Igor Fedotov wrote:


Honestly I don't have any perfect solution for now.

If this is urgent you probably better to proceed with enabling the
new DB space management feature.

But please do that eventually, modify 1-2 OSDs at the first stage
and test them for some period (may be a week or two).


Thanks,

Igor


On 8/20/2020 5:36 PM, Seena Fallah wrote:

So what do you suggest for a short term solution? (I think you
won't backport it to nautilus at least about 6 month)

Changing db size is too expensive because I should buy new NVME
devices with double size and also redeploy all my OSDs.
Manual compaction will still have an impact on performance and
doing it for a month doesn't look very good!

On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov <ifedo...@suse.de> wrote:

Correct.

On 8/20/2020 5:15 PM, Seena Fallah wrote:

So you won't backport it to nautilus until it gets
default to master for a while?

On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov
<ifedo...@suse.de> wrote:

From technical/developer's point of view I don't see any
issues with tuning this option. But since now I wouldn't
recommend to enable it in production as it partially
bypassed our regular development cycle. Being enabled in
master for a while by default allows more develpers to
use/try the feature before release. This can be
considered as an additional implicit QA process. But as
we just discovered this hasn't happened.

Hence you can definitely try it but this exposes your
cluster(s) to some risk as for any new (and incompletely
tested) feature


Thanks,

Igor


On 8/20/2020 4:06 PM, Seena Fallah wrote:

Greate, thanks.

Is it safe to change it manually in ceph.conf
until next nautilus release or should I wait for the
next nautilus release for this change? I mean does qa
run on this value for this config that we could trust
and change it or should we wait until the next nautilus
release that qa ran on this value?

On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov
<ifedo...@suse.de> wrote:

Hi Seena,

this parameter isn't intended to be adjusted in
production environments - it's supposed that
default behavior covers all regular customers' needs.

The issue though is that default setting is
invalid. It should be 'use_some_extra'. Gonna fix
that shortly...


Thanks,

Igor




On 8/20/2020 1:44 PM, Seena Fallah wrote:

Hi Igor.

Could you please tell why this config is in
LEVEL_DEV

(https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
As it is documented in Ceph we can't use LEVEL_DEV
in production environments!

Thanks

On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov
<ifedo...@suse.de> wrote:

Hi Simon,


starting Nautlus v14.2.10 Bluestore is able to
use 'wasted' space at DB
volume.

see this PR:
https://github.com/ceph/ceph/pull/29687

Nice overview on the overall BlueFS/RocksDB
design can be find here:


https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

Which also includes some overview (as well as
additional concerns) for
changes brought by the above-mentioned PR.


Thanks,

Igor


On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> Hi Michael,
>
> thanks for the explanation! So if I
understand correctly, we waste 93
> GB per OSD on unused NVME space, because
   

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Seena Fallah
OK, thanks. Also, as mentioned in the doc you shared from cloudferro,
is it not a good idea to change `write_buffer_size` for BlueStore RocksDB to fit
our DB?

On Fri, Aug 21, 2020 at 1:46 AM Igor Fedotov  wrote:

> Honestly I don't have any perfect solution for now.
>
> If this is urgent you probably better to proceed with enabling the new DB
> space management feature.
>
> But please do that eventually, modify 1-2 OSDs at the first stage and test
> them for some period (may be a week or two).
>
>
> Thanks,
>
> Igor
>
>
> On 8/20/2020 5:36 PM, Seena Fallah wrote:
>
> So what do you suggest for a short term solution? (I think you won't
> backport it to nautilus at least about 6 month)
>
> Changing db size is too expensive because I should buy new NVME devices
> with double size and also redeploy all my OSDs.
> Manual compaction will still have an impact on performance and doing it
> for a month doesn't look very good!
>
> On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov  wrote:
>
>> Correct.
>> On 8/20/2020 5:15 PM, Seena Fallah wrote:
>>
>> So you won't backport it to nautilus until it gets default to master for
>> a while?
>>
>> On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov  wrote:
>>
>>> From technical/developer's point of view I don't see any issues with
>>> tuning this option. But since now I wouldn't  recommend to enable it in
>>> production as it partially bypassed our regular development cycle. Being
>>> enabled in master for a while by default allows more develpers to use/try
>>> the feature before release. This can be considered as an additional
>>> implicit QA process. But as we just discovered this hasn't happened.
>>>
>>> Hence you can definitely try it but this exposes your cluster(s) to some
>>> risk as for any new (and incompletely tested) feature
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 8/20/2020 4:06 PM, Seena Fallah wrote:
>>>
>>> Greate, thanks.
>>>
>>> Is it safe to change it manually in ceph.conf until next nautilus
>>> release or should I wait for the next nautilus release for this change? I
>>> mean does qa run on this value for this config that we could trust and
>>> change it or should we wait until the next nautilus release that qa ran on
>>> this value?
>>>
>>> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov  wrote:
>>>
 Hi Seena,

 this parameter isn't intended to be adjusted in production environments
 - it's supposed that default behavior covers all regular customers' needs.

 The issue though is that default setting is invalid. It should be
 'use_some_extra'. Gonna fix that shortly...


 Thanks,

 Igor




 On 8/20/2020 1:44 PM, Seena Fallah wrote:

 Hi Igor.

 Could you please tell why this config is in LEVEL_DEV (
 https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
 As it is documented in Ceph we can't use LEVEL_DEV in production
 environments!

 Thanks

 On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov  wrote:

> Hi Simon,
>
>
> starting Nautlus v14.2.10 Bluestore is able to use 'wasted' space at
> DB
> volume.
>
> see this PR: https://github.com/ceph/ceph/pull/29687
>
> Nice overview on the overall BlueFS/RocksDB design can be find here:
>
>
> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
>
> Which also includes some overview (as well as additional concerns) for
> changes brought by the above-mentioned PR.
>
>
> Thanks,
>
> Igor
>
>
> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> > Hi Michael,
> >
> > thanks for the explanation! So if I understand correctly, we waste
> 93
> > GB per OSD on unused NVME space, because only 30GB is actually
> used...?
> >
> > And to improve the space for rocksdb, we need to plan for 300GB per
> > rocksdb partition in order to benefit from this advantage
> >
> > Reducing the number of small files is something we always ask of our
> > users, but reality is what it is ;-)
> >
> > I'll have to look into how I can get an informative view on these
> > metrics... It's pretty overwhelming the amount of information coming
> > out of the ceph cluster, even when you look only superficially...
> >
> > Cheers,
> >
> > /Simon
> >
> > On 20/08/2020 10:16, Michael Bisig wrote:
> >> Hi Simon
> >>
> >> As far as I know, RocksDB only uses "leveled" space on the NVME
> >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB.
> Every
> >> DB space above such a limit will automatically end up on slow
> devices.
> >> In your setup where you have 123GB per OSD that means you only use
> >> 30GB of fast device. The DB which spills over this limit will be
> >> offloaded to the HDD and accordingly, it slows down requests and

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Igor Fedotov

Honestly I don't have any perfect solution for now.

If this is urgent, you had probably better proceed with enabling the new
DB space management feature.


But please do that gradually: modify 1-2 OSDs at the first stage and
test them for some period (maybe a week or two).



Thanks,

Igor


On 8/20/2020 5:36 PM, Seena Fallah wrote:
So what do you suggest for a short term solution? (I think you won't 
backport it to nautilus at least about 6 month)


Changing db size is too expensive because I should buy new NVME 
devices with double size and also redeploy all my OSDs.
Manual compaction will still have an impact on performance and doing 
it for a month doesn't look very good!


On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov wrote:


Correct.

On 8/20/2020 5:15 PM, Seena Fallah wrote:

So you won't backport it to nautilus until it gets default to
master for a while?

On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedo...@suse.de> wrote:

From technical/developer's point of view I don't see any
issues with tuning this option. But since now I wouldn't 
recommend to enable it in production as it partially bypassed
our regular development cycle. Being enabled in master for a
while by default allows more develpers to use/try the feature
before release. This can be considered as an additional
implicit QA process. But as we just discovered this hasn't
happened.

Hence you can definitely try it but this exposes your
cluster(s) to some risk as for any new (and incompletely
tested) feature


Thanks,

Igor


On 8/20/2020 4:06 PM, Seena Fallah wrote:

Greate, thanks.

Is it safe to change it manually in ceph.conf until next
nautilus release or should I wait for the next nautilus
release for this change? I mean does qa run on this value
for this config that we could trust and change it or should
we wait until the next nautilus release that qa ran on this
value?

On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov
<ifedo...@suse.de> wrote:

Hi Seena,

this parameter isn't intended to be adjusted in
production environments - it's supposed that default
behavior covers all regular customers' needs.

The issue though is that default setting is invalid. It
should be 'use_some_extra'. Gonna fix that shortly...


Thanks,

Igor




On 8/20/2020 1:44 PM, Seena Fallah wrote:

Hi Igor.

Could you please tell why this config is in LEVEL_DEV

(https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
As it is documented in Ceph we can't use LEVEL_DEV in
production environments!

Thanks

On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov
<ifedo...@suse.de> wrote:

Hi Simon,


starting Nautlus v14.2.10 Bluestore is able to use
'wasted' space at DB
volume.

see this PR: https://github.com/ceph/ceph/pull/29687

Nice overview on the overall BlueFS/RocksDB design
can be find here:


https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

Which also includes some overview (as well as
additional concerns) for
changes brought by the above-mentioned PR.


Thanks,

Igor


On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> Hi Michael,
>
> thanks for the explanation! So if I understand
correctly, we waste 93
> GB per OSD on unused NVME space, because only
30GB is actually used...?
>
> And to improve the space for rocksdb, we need to
plan for 300GB per
> rocksdb partition in order to benefit from this
advantage
>
> Reducing the number of small files is something
we always ask of our
> users, but reality is what it is ;-)
>
> I'll have to look into how I can get an
informative view on these
> metrics... It's pretty overwhelming the amount of
information coming
> out of the ceph cluster, even when you look only
superficially...
>
> Cheers,
>
> /Simon
>
> On 20/08/2020 10:16, Michael Bisig wrote:
>> Hi Simon
>>
>> As far 

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Seena Fallah
So what do you suggest as a short-term solution? (I think you won't
backport it to Nautilus for at least about 6 months.)

Changing the DB size is too expensive because I would have to buy new NVMe devices
of double the size and also redeploy all my OSDs.
Manual compaction will still have an impact on performance, and doing it for
a month doesn't look very good!

On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov  wrote:

> Correct.
> On 8/20/2020 5:15 PM, Seena Fallah wrote:
>
> So you won't backport it to nautilus until it gets default to master for a
> while?
>
> On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov  wrote:
>
>> From technical/developer's point of view I don't see any issues with
>> tuning this option. But since now I wouldn't  recommend to enable it in
>> production as it partially bypassed our regular development cycle. Being
>> enabled in master for a while by default allows more develpers to use/try
>> the feature before release. This can be considered as an additional
>> implicit QA process. But as we just discovered this hasn't happened.
>>
>> Hence you can definitely try it but this exposes your cluster(s) to some
>> risk as for any new (and incompletely tested) feature
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 8/20/2020 4:06 PM, Seena Fallah wrote:
>>
>> Greate, thanks.
>>
>> Is it safe to change it manually in ceph.conf until next nautilus release
>> or should I wait for the next nautilus release for this change? I mean does
>> qa run on this value for this config that we could trust and change it or
>> should we wait until the next nautilus release that qa ran on this value?
>>
>> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov  wrote:
>>
>>> Hi Seena,
>>>
>>> this parameter isn't intended to be adjusted in production environments
>>> - it's supposed that default behavior covers all regular customers' needs.
>>>
>>> The issue though is that default setting is invalid. It should be
>>> 'use_some_extra'. Gonna fix that shortly...
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>>
>>>
>>> On 8/20/2020 1:44 PM, Seena Fallah wrote:
>>>
>>> Hi Igor.
>>>
>>> Could you please tell why this config is in LEVEL_DEV (
>>> https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
>>> As it is documented in Ceph we can't use LEVEL_DEV in production
>>> environments!
>>>
>>> Thanks
>>>
>>> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov  wrote:
>>>
 Hi Simon,


 starting Nautlus v14.2.10 Bluestore is able to use 'wasted' space at DB
 volume.

 see this PR: https://github.com/ceph/ceph/pull/29687

 Nice overview on the overall BlueFS/RocksDB design can be find here:


 https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

 Which also includes some overview (as well as additional concerns) for
 changes brought by the above-mentioned PR.


 Thanks,

 Igor


 On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
 > Hi Michael,
 >
 > thanks for the explanation! So if I understand correctly, we waste 93
 > GB per OSD on unused NVME space, because only 30GB is actually
 used...?
 >
 > And to improve the space for rocksdb, we need to plan for 300GB per
 > rocksdb partition in order to benefit from this advantage
 >
 > Reducing the number of small files is something we always ask of our
 > users, but reality is what it is ;-)
 >
 > I'll have to look into how I can get an informative view on these
 > metrics... It's pretty overwhelming the amount of information coming
 > out of the ceph cluster, even when you look only superficially...
 >
 > Cheers,
 >
 > /Simon
 >
 > On 20/08/2020 10:16, Michael Bisig wrote:
 >> Hi Simon
 >>
 >> As far as I know, RocksDB only uses "leveled" space on the NVME
 >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB.
 Every
 >> DB space above such a limit will automatically end up on slow
 devices.
 >> In your setup where you have 123GB per OSD that means you only use
 >> 30GB of fast device. The DB which spills over this limit will be
 >> offloaded to the HDD and accordingly, it slows down requests and
 >> compactions.
 >>
 >> You can proof what your OSD currently consumes with:
 >>ceph daemon osd.X perf dump
 >>
 >> Informative values are `db_total_bytes`, `db_used_bytes` and
 >> `slow_used_bytes`. This changes regularly because of the ongoing
 >> compactions but Prometheus mgr module exports these values such that
 >> you can track it.
 >>
 >> Small files generally leads to bigger RocksDB, especially when you
 >> use EC, but this depends on the actual amount and file sizes.
 >>
 >> I hope this helps.
 >> Regards,
 >> Michael
 >>
 >> On 20.08.20, 09:10, "Simon Oosthoek" 
 wrote:
 >>
 >>  Hi
 >>
 >>  

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Igor Fedotov

Correct.

On 8/20/2020 5:15 PM, Seena Fallah wrote:
So you won't backport it to nautilus until it gets default to master 
for a while?


On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov wrote:


From technical/developer's point of view I don't see any issues
with tuning this option. But since now I wouldn't recommend to
enable it in production as it partially bypassed our regular
development cycle. Being enabled in master for a while by default
allows more develpers to use/try the feature before release. This
can be considered as an additional implicit QA process. But as we
just discovered this hasn't happened.

Hence you can definitely try it but this exposes your cluster(s)
to some risk as for any new (and incompletely tested) feature


Thanks,

Igor


On 8/20/2020 4:06 PM, Seena Fallah wrote:

Greate, thanks.

Is it safe to change it manually in ceph.conf until next nautilus
release or should I wait for the next nautilus release for this
change? I mean does qa run on this value for this config that we
could trust and change it or should we wait until the next
nautilus release that qa ran on this value?

On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedo...@suse.de> wrote:

Hi Seena,

this parameter isn't intended to be adjusted in production
environments - it's supposed that default behavior covers all
regular customers' needs.

The issue though is that default setting is invalid. It
should be 'use_some_extra'. Gonna fix that shortly...


Thanks,

Igor




On 8/20/2020 1:44 PM, Seena Fallah wrote:

Hi Igor.

Could you please tell why this config is in LEVEL_DEV

(https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
As it is documented in Ceph we can't use LEVEL_DEV in
production environments!

Thanks

On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov
<ifedo...@suse.de> wrote:

Hi Simon,


starting Nautlus v14.2.10 Bluestore is able to use
'wasted' space at DB
volume.

see this PR: https://github.com/ceph/ceph/pull/29687

Nice overview on the overall BlueFS/RocksDB design can
be find here:


https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

Which also includes some overview (as well as additional
concerns) for
changes brought by the above-mentioned PR.


Thanks,

Igor


On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> Hi Michael,
>
> thanks for the explanation! So if I understand
correctly, we waste 93
> GB per OSD on unused NVME space, because only 30GB is
actually used...?
>
> And to improve the space for rocksdb, we need to plan
for 300GB per
> rocksdb partition in order to benefit from this
advantage
>
> Reducing the number of small files is something we
always ask of our
> users, but reality is what it is ;-)
>
> I'll have to look into how I can get an informative
view on these
> metrics... It's pretty overwhelming the amount of
information coming
> out of the ceph cluster, even when you look only
superficially...
>
> Cheers,
>
> /Simon
>
> On 20/08/2020 10:16, Michael Bisig wrote:
>> Hi Simon
>>
>> As far as I know, RocksDB only uses "leveled" space
on the NVME
>> partition. The values are set to be 300MB, 3GB, 30GB
and 300GB. Every
>> DB space above such a limit will automatically end up
on slow devices.
>> In your setup where you have 123GB per OSD that means
you only use
>> 30GB of fast device. The DB which spills over this
limit will be
>> offloaded to the HDD and accordingly, it slows down
requests and
>> compactions.
>>
>> You can proof what your OSD currently consumes with:
>>    ceph daemon osd.X perf dump
>>
>> Informative values are `db_total_bytes`,
`db_used_bytes` and
>> `slow_used_bytes`. This changes regularly because of
the ongoing
>> compactions but Prometheus mgr module exports these
values such that
>> you can track it.
>>
>> Small files generally leads to bigger RocksDB,
especially when you
>> 

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Seena Fallah
So you won't backport it to Nautilus until it has been the default in master for a
while?

On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov  wrote:

> From technical/developer's point of view I don't see any issues with
> tuning this option. But since now I wouldn't  recommend to enable it in
> production as it partially bypassed our regular development cycle. Being
> enabled in master for a while by default allows more develpers to use/try
> the feature before release. This can be considered as an additional
> implicit QA process. But as we just discovered this hasn't happened.
>
> Hence you can definitely try it but this exposes your cluster(s) to some
> risk as for any new (and incompletely tested) feature
>
>
> Thanks,
>
> Igor
>
>
> On 8/20/2020 4:06 PM, Seena Fallah wrote:
>
> Greate, thanks.
>
> Is it safe to change it manually in ceph.conf until next nautilus release
> or should I wait for the next nautilus release for this change? I mean does
> qa run on this value for this config that we could trust and change it or
> should we wait until the next nautilus release that qa ran on this value?
>
> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov  wrote:
>
>> Hi Seena,
>>
>> this parameter isn't intended to be adjusted in production environments -
>> it's supposed that default behavior covers all regular customers' needs.
>>
>> The issue though is that default setting is invalid. It should be
>> 'use_some_extra'. Gonna fix that shortly...
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>>
>>
>> On 8/20/2020 1:44 PM, Seena Fallah wrote:
>>
>> Hi Igor.
>>
>> Could you please tell why this config is in LEVEL_DEV (
>> https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
>> As it is documented in Ceph we can't use LEVEL_DEV in production
>> environments!
>>
>> Thanks
>>
>> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov  wrote:
>>
>>> Hi Simon,
>>>
>>>
>>> starting Nautlus v14.2.10 Bluestore is able to use 'wasted' space at DB
>>> volume.
>>>
>>> see this PR: https://github.com/ceph/ceph/pull/29687
>>>
>>> Nice overview on the overall BlueFS/RocksDB design can be find here:
>>>
>>>
>>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
>>>
>>> Which also includes some overview (as well as additional concerns) for
>>> changes brought by the above-mentioned PR.
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
>>> > Hi Michael,
>>> >
>>> > thanks for the explanation! So if I understand correctly, we waste 93
>>> > GB per OSD on unused NVME space, because only 30GB is actually used...?
>>> >
>>> > And to improve the space for rocksdb, we need to plan for 300GB per
>>> > rocksdb partition in order to benefit from this advantage
>>> >
>>> > Reducing the number of small files is something we always ask of our
>>> > users, but reality is what it is ;-)
>>> >
>>> > I'll have to look into how I can get an informative view on these
>>> > metrics... It's pretty overwhelming the amount of information coming
>>> > out of the ceph cluster, even when you look only superficially...
>>> >
>>> > Cheers,
>>> >
>>> > /Simon
>>> >
>>> > On 20/08/2020 10:16, Michael Bisig wrote:
>>> >> Hi Simon
>>> >>
>>> >> As far as I know, RocksDB only uses "leveled" space on the NVME
>>> >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every
>>> >> DB space above such a limit will automatically end up on slow devices.
>>> >> In your setup where you have 123GB per OSD that means you only use
>>> >> 30GB of fast device. The DB which spills over this limit will be
>>> >> offloaded to the HDD and accordingly, it slows down requests and
>>> >> compactions.
>>> >>
>>> >> You can proof what your OSD currently consumes with:
>>> >>ceph daemon osd.X perf dump
>>> >>
>>> >> Informative values are `db_total_bytes`, `db_used_bytes` and
>>> >> `slow_used_bytes`. This changes regularly because of the ongoing
>>> >> compactions but Prometheus mgr module exports these values such that
>>> >> you can track it.
>>> >>
>>> >> Small files generally leads to bigger RocksDB, especially when you
>>> >> use EC, but this depends on the actual amount and file sizes.
>>> >>
>>> >> I hope this helps.
>>> >> Regards,
>>> >> Michael
>>> >>
>>> >> On 20.08.20, 09:10, "Simon Oosthoek" 
>>> wrote:
>>> >>
>>> >>  Hi
>>> >>
>>> >>  Recently our ceph cluster (nautilus) is experiencing bluefs
>>> >> spillovers,
>>> >>  just 2 osd's and I disabled the warning for these osds.
>>> >>  (ceph config set osd.125 bluestore_warn_on_bluefs_spillover
>>> false)
>>> >>
>>> >>  I'm wondering what causes this and how this can be prevented.
>>> >>
>>> >>  As I understand it the rocksdb for the OSD needs to store more
>>> >> than fits
>>> >>  on the NVME logical volume (123G for 12T OSD). A way to fix it
>>> >> could be
>>> >>  to increase the logical volume on the nvme (if there was space
>>> >> on the
>>> >>  nvme, which 

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Igor Fedotov
From a technical/developer's point of view I don't see any issues with
tuning this option. But for now I wouldn't recommend enabling it in
production, as it partially bypassed our regular development cycle. Being
enabled in master by default for a while allows more developers to
use/try the feature before release. This can be considered an
additional implicit QA process. But as we just discovered, this hasn't
happened.


Hence you can definitely try it, but this exposes your cluster(s) to some
risk, as with any new (and incompletely tested) feature.



Thanks,

Igor


On 8/20/2020 4:06 PM, Seena Fallah wrote:

Greate, thanks.

Is it safe to change it manually in ceph.conf until next nautilus 
release or should I wait for the next nautilus release for this 
change? I mean does qa run on this value for this config that we could 
trust and change it or should we wait until the next nautilus release 
that qa ran on this value?


On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov wrote:


Hi Seena,

this parameter isn't intended to be adjusted in production
environments - it's supposed that default behavior covers all
regular customers' needs.

The issue though is that default setting is invalid. It should be
'use_some_extra'. Gonna fix that shortly...


Thanks,

Igor




On 8/20/2020 1:44 PM, Seena Fallah wrote:

Hi Igor.

Could you please tell why this config is in LEVEL_DEV

(https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
As it is documented in Ceph we can't use LEVEL_DEV in production
environments!

Thanks

On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedo...@suse.de> wrote:

Hi Simon,


starting Nautlus v14.2.10 Bluestore is able to use 'wasted'
space at DB
volume.

see this PR: https://github.com/ceph/ceph/pull/29687

Nice overview on the overall BlueFS/RocksDB design can be
find here:


https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

Which also includes some overview (as well as additional
concerns) for
changes brought by the above-mentioned PR.


Thanks,

Igor


On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> Hi Michael,
>
> thanks for the explanation! So if I understand correctly,
we waste 93
> GB per OSD on unused NVME space, because only 30GB is
actually used...?
>
> And to improve the space for rocksdb, we need to plan for
300GB per
> rocksdb partition in order to benefit from this advantage
>
> Reducing the number of small files is something we always
ask of our
> users, but reality is what it is ;-)
>
> I'll have to look into how I can get an informative view on
these
> metrics... It's pretty overwhelming the amount of
information coming
> out of the ceph cluster, even when you look only
superficially...
>
> Cheers,
>
> /Simon
>
> On 20/08/2020 10:16, Michael Bisig wrote:
>> Hi Simon
>>
>> As far as I know, RocksDB only uses "leveled" space on the
NVME
>> partition. The values are set to be 300MB, 3GB, 30GB and
300GB. Every
>> DB space above such a limit will automatically end up on
slow devices.
>> In your setup where you have 123GB per OSD that means you
only use
>> 30GB of fast device. The DB which spills over this limit
will be
>> offloaded to the HDD and accordingly, it slows down
requests and
>> compactions.
>>
>> You can proof what your OSD currently consumes with:
>>    ceph daemon osd.X perf dump
>>
>> Informative values are `db_total_bytes`, `db_used_bytes` and
>> `slow_used_bytes`. This changes regularly because of the
ongoing
>> compactions but Prometheus mgr module exports these values
such that
>> you can track it.
>>
>> Small files generally leads to bigger RocksDB, especially
when you
>> use EC, but this depends on the actual amount and file sizes.
>>
>> I hope this helps.
>> Regards,
>> Michael
>>
>> On 20.08.20, 09:10, "Simon Oosthoek"
<s.oosth...@science.ru.nl>
wrote:
>>
>>  Hi
>>
>>  Recently our ceph cluster (nautilus) is experiencing
bluefs
>> spillovers,
>>  just 2 osd's and I disabled the warning for these osds.
>>  (ceph config set osd.125
bluestore_warn_on_bluefs_spillover false)
>>
>>  I'm wondering what causes this and how this can be
  

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Seena Fallah
Great, thanks.

Is it safe to change it manually in ceph.conf until the next Nautilus release,
or should I wait for the next Nautilus release for this change? I mean, does
QA run with this value for this config so that we could trust it and change it, or
should we wait until the next Nautilus release, on which QA has run with this value?

On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov  wrote:

> Hi Seena,
>
> This parameter isn't intended to be adjusted in production environments;
> the default behavior is supposed to cover all regular customers' needs.
>
> The issue, though, is that the default setting is invalid. It should be
> 'use_some_extra'. I'm going to fix that shortly...
>
>
> Thanks,
>
> Igor
>
>
>
>
> On 8/20/2020 1:44 PM, Seena Fallah wrote:
>
> Hi Igor.
>
> Could you please tell me why this config is at LEVEL_DEV (
> https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
> According to the Ceph documentation, we can't use LEVEL_DEV options in production
> environments!
>
> Thanks
>
> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov  wrote:
>
>> Hi Simon,
>>
>>
>> Starting with Nautilus v14.2.10, BlueStore is able to use 'wasted' space on the DB
>> volume.
>>
>> See this PR: https://github.com/ceph/ceph/pull/29687
>>
>> A nice overview of the overall BlueFS/RocksDB design can be found here:
>>
>>
>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
>>
>> Which also includes some overview (as well as additional concerns) for
>> changes brought by the above-mentioned PR.
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
>> > Hi Michael,
>> >
>> > thanks for the explanation! So if I understand correctly, we waste 93
>> > GB per OSD on unused NVME space, because only 30GB is actually used...?
>> >
>> > And to improve the space for rocksdb, we need to plan for 300GB per
>> > rocksdb partition in order to benefit from this advantage
>> >
>> > Reducing the number of small files is something we always ask of our
>> > users, but reality is what it is ;-)
>> >
>> > I'll have to look into how I can get an informative view on these
>> > metrics... It's pretty overwhelming the amount of information coming
>> > out of the ceph cluster, even when you look only superficially...
>> >
>> > Cheers,
>> >
>> > /Simon
>> >
>> > On 20/08/2020 10:16, Michael Bisig wrote:
>> >> Hi Simon
>> >>
>> >> As far as I know, RocksDB only uses "leveled" space on the NVME
>> >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every
>> >> DB space above such a limit will automatically end up on slow devices.
>> >> In your setup where you have 123GB per OSD that means you only use
>> >> 30GB of fast device. The DB which spills over this limit will be
>> >> offloaded to the HDD and accordingly, it slows down requests and
>> >> compactions.
>> >>
>> >> You can proof what your OSD currently consumes with:
>> >>ceph daemon osd.X perf dump
>> >>
>> >> Informative values are `db_total_bytes`, `db_used_bytes` and
>> >> `slow_used_bytes`. This changes regularly because of the ongoing
>> >> compactions but Prometheus mgr module exports these values such that
>> >> you can track it.
>> >>
>> >> Small files generally leads to bigger RocksDB, especially when you
>> >> use EC, but this depends on the actual amount and file sizes.
>> >>
>> >> I hope this helps.
>> >> Regards,
>> >> Michael
>> >>
>> >> On 20.08.20, 09:10, "Simon Oosthoek" 
>> wrote:
>> >>
>> >>  Hi
>> >>
>> >>  Recently our ceph cluster (nautilus) is experiencing bluefs
>> >> spillovers,
>> >>  just 2 osd's and I disabled the warning for these osds.
>> >>  (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
>> >>
>> >>  I'm wondering what causes this and how this can be prevented.
>> >>
>> >>  As I understand it the rocksdb for the OSD needs to store more
>> >> than fits
>> >>  on the NVME logical volume (123G for 12T OSD). A way to fix it
>> >> could be
>> >>  to increase the logical volume on the nvme (if there was space
>> >> on the
>> >>  nvme, which there isn't at the moment).
>> >>
>> >>  This is the current size of the cluster and how much is free:
>> >>
>> >>  [root@cephmon1 ~]# ceph df
>> >>  RAW STORAGE:
>> >>   CLASS SIZEAVAIL   USEDRAW USED
>> >> %RAW USED
>> >>   hdd   1.8 PiB 842 TiB 974 TiB  974
>> >> TiB 53.63
>> >>   TOTAL 1.8 PiB 842 TiB 974 TiB  974
>> >> TiB 53.63
>> >>
>> >>  POOLS:
>> >>   POOLID STORED  OBJECTS USED
>> >>  %USED MAX AVAIL
>> >>   cephfs_data  1 572 MiB 121.26M 2.4 GiB
>> >>  0   167 TiB
>> >>   cephfs_metadata  2  56 GiB 5.15M  57 GiB
>> >>  0   167 TiB
>> >>   cephfs_data_3copy8 201 GiB  51.68k 602 GiB
>> >>  0.09   222 TiB
>> >>   cephfs_data_ec83  

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Igor Fedotov

Hi Seena,

This parameter isn't intended to be adjusted in production environments;
the default behavior is supposed to cover all regular customers' needs.


The issue, though, is that the current default setting is invalid. It should
be 'use_some_extra'. I'm going to fix that shortly...



Thanks,

Igor




On 8/20/2020 1:44 PM, Seena Fallah wrote:

Hi Igor.

Could you please tell why this config is in LEVEL_DEV 
(https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)? 
As it is documented in Ceph we can't use LEVEL_DEV in production 
environments!


Thanks

On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov wrote:


Hi Simon,


starting Nautlus v14.2.10 Bluestore is able to use 'wasted' space
at DB
volume.

see this PR: https://github.com/ceph/ceph/pull/29687

Nice overview on the overall BlueFS/RocksDB design can be find here:


https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

Which also includes some overview (as well as additional concerns)
for
changes brought by the above-mentioned PR.


Thanks,

Igor


On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> Hi Michael,
>
> thanks for the explanation! So if I understand correctly, we
waste 93
> GB per OSD on unused NVME space, because only 30GB is actually
used...?
>
> And to improve the space for rocksdb, we need to plan for 300GB per
> rocksdb partition in order to benefit from this advantage
>
> Reducing the number of small files is something we always ask of
our
> users, but reality is what it is ;-)
>
> I'll have to look into how I can get an informative view on these
> metrics... It's pretty overwhelming the amount of information
coming
> out of the ceph cluster, even when you look only superficially...
>
> Cheers,
>
> /Simon
>
> On 20/08/2020 10:16, Michael Bisig wrote:
>> Hi Simon
>>
>> As far as I know, RocksDB only uses "leveled" space on the NVME
>> partition. The values are set to be 300MB, 3GB, 30GB and 300GB.
Every
>> DB space above such a limit will automatically end up on slow
devices.
>> In your setup where you have 123GB per OSD that means you only use
>> 30GB of fast device. The DB which spills over this limit will be
>> offloaded to the HDD and accordingly, it slows down requests and
>> compactions.
>>
>> You can proof what your OSD currently consumes with:
>>    ceph daemon osd.X perf dump
>>
>> Informative values are `db_total_bytes`, `db_used_bytes` and
>> `slow_used_bytes`. This changes regularly because of the ongoing
>> compactions but Prometheus mgr module exports these values such
that
>> you can track it.
>>
>> Small files generally leads to bigger RocksDB, especially when you
>> use EC, but this depends on the actual amount and file sizes.
>>
>> I hope this helps.
>> Regards,
>> Michael
>>
>> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosth...@science.ru.nl> wrote:
>>
>>  Hi
>>
>>  Recently our ceph cluster (nautilus) is experiencing bluefs
>> spillovers,
>>  just 2 osd's and I disabled the warning for these osds.
>>  (ceph config set osd.125
bluestore_warn_on_bluefs_spillover false)
>>
>>  I'm wondering what causes this and how this can be prevented.
>>
>>  As I understand it the rocksdb for the OSD needs to store
more
>> than fits
>>  on the NVME logical volume (123G for 12T OSD). A way to
fix it
>> could be
>>  to increase the logical volume on the nvme (if there was
space
>> on the
>>  nvme, which there isn't at the moment).
>>
>>  This is the current size of the cluster and how much is free:
>>
>>  [root@cephmon1 ~]# ceph df
>>  RAW STORAGE:
>>   CLASS SIZE    AVAIL USED    RAW USED
>> %RAW USED
>>   hdd   1.8 PiB 842 TiB 974 TiB  974
>> TiB 53.63
>>   TOTAL 1.8 PiB 842 TiB 974 TiB  974
>> TiB 53.63
>>
>>  POOLS:
>>   POOL    ID STORED OBJECTS USED
>>  %USED MAX AVAIL
>>   cephfs_data  1 572 MiB 121.26M 2.4 GiB
>>  0   167 TiB
>>   cephfs_metadata  2  56 GiB 5.15M  57 GiB
>>  0   167 TiB
>>   cephfs_data_3copy    8 201 GiB 51.68k 602 GiB
>>  0.09   222 TiB
>>   cephfs_data_ec83    13 643 TiB 279.75M 953 TiB
>>  58.86   485 TiB
>>   rbd 14  21 GiB 5.66k  64 GiB
>>  0   222 TiB
>>   .rgw.root   15 1.2 KiB 4   1 MiB

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Seena Fallah
Hi Igor.

Could you please tell me why this config option is at LEVEL_DEV (
https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
As documented in Ceph, we can't use LEVEL_DEV options in production
environments!

Thanks

On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov  wrote:

> Hi Simon,
>
>
> starting Nautlus v14.2.10 Bluestore is able to use 'wasted' space at DB
> volume.
>
> see this PR: https://github.com/ceph/ceph/pull/29687
>
> Nice overview on the overall BlueFS/RocksDB design can be find here:
>
>
> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
>
> Which also includes some overview (as well as additional concerns) for
> changes brought by the above-mentioned PR.
>
>
> Thanks,
>
> Igor
>
>
> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> > Hi Michael,
> >
> > thanks for the explanation! So if I understand correctly, we waste 93
> > GB per OSD on unused NVME space, because only 30GB is actually used...?
> >
> > And to improve the space for rocksdb, we need to plan for 300GB per
> > rocksdb partition in order to benefit from this advantage
> >
> > Reducing the number of small files is something we always ask of our
> > users, but reality is what it is ;-)
> >
> > I'll have to look into how I can get an informative view on these
> > metrics... It's pretty overwhelming the amount of information coming
> > out of the ceph cluster, even when you look only superficially...
> >
> > Cheers,
> >
> > /Simon
> >
> > On 20/08/2020 10:16, Michael Bisig wrote:
> >> Hi Simon
> >>
> >> As far as I know, RocksDB only uses "leveled" space on the NVME
> >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every
> >> DB space above such a limit will automatically end up on slow devices.
> >> In your setup where you have 123GB per OSD that means you only use
> >> 30GB of fast device. The DB which spills over this limit will be
> >> offloaded to the HDD and accordingly, it slows down requests and
> >> compactions.
> >>
> >> You can proof what your OSD currently consumes with:
> >>ceph daemon osd.X perf dump
> >>
> >> Informative values are `db_total_bytes`, `db_used_bytes` and
> >> `slow_used_bytes`. This changes regularly because of the ongoing
> >> compactions but Prometheus mgr module exports these values such that
> >> you can track it.
> >>
> >> Small files generally leads to bigger RocksDB, especially when you
> >> use EC, but this depends on the actual amount and file sizes.
> >>
> >> I hope this helps.
> >> Regards,
> >> Michael
> >>
> >> On 20.08.20, 09:10, "Simon Oosthoek"  wrote:
> >>
> >>  Hi
> >>
> >>  Recently our ceph cluster (nautilus) is experiencing bluefs
> >> spillovers,
> >>  just 2 osd's and I disabled the warning for these osds.
> >>  (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
> >>
> >>  I'm wondering what causes this and how this can be prevented.
> >>
> >>  As I understand it the rocksdb for the OSD needs to store more
> >> than fits
> >>  on the NVME logical volume (123G for 12T OSD). A way to fix it
> >> could be
> >>  to increase the logical volume on the nvme (if there was space
> >> on the
> >>  nvme, which there isn't at the moment).
> >>
> >>  This is the current size of the cluster and how much is free:
> >>
> >>  [root@cephmon1 ~]# ceph df
> >>  RAW STORAGE:
> >>   CLASS SIZEAVAIL   USEDRAW USED
> >> %RAW USED
> >>   hdd   1.8 PiB 842 TiB 974 TiB  974
> >> TiB 53.63
> >>   TOTAL 1.8 PiB 842 TiB 974 TiB  974
> >> TiB 53.63
> >>
> >>  POOLS:
> >>   POOLID STORED  OBJECTS USED
> >>  %USED MAX AVAIL
> >>   cephfs_data  1 572 MiB 121.26M 2.4 GiB
> >>  0   167 TiB
> >>   cephfs_metadata  2  56 GiB 5.15M  57 GiB
> >>  0   167 TiB
> >>   cephfs_data_3copy8 201 GiB  51.68k 602 GiB
> >>  0.09   222 TiB
> >>   cephfs_data_ec8313 643 TiB 279.75M 953 TiB
> >>  58.86   485 TiB
> >>   rbd 14  21 GiB 5.66k  64 GiB
> >>  0   222 TiB
> >>   .rgw.root   15 1.2 KiB 4   1 MiB
> >>  0   167 TiB
> >>   default.rgw.control 16 0 B 8 0 B
> >>  0   167 TiB
> >>   default.rgw.meta17   765 B 4   1 MiB
> >>  0   167 TiB
> >>   default.rgw.log 18 0 B 207 0 B
> >>  0   167 TiB
> >>   cephfs_data_ec5720 433 MiB 230 1.2 GiB
> >>  0   278 TiB
> >>
> >>  The amount used can still grow a bit before we need to add
> >> nodes, but
> >>  apparently we are running into the limits of our rocskdb
> >> partitions.
> >>
> >>  

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Igor Fedotov

Hi Simon,


Starting with Nautilus v14.2.10, BlueStore is able to use the 'wasted' space
on the DB volume.


see this PR: https://github.com/ceph/ceph/pull/29687

A nice overview of the overall BlueFS/RocksDB design can be found here:

https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

It also includes an overview of (as well as some additional concerns about)
the changes brought by the above-mentioned PR.



Thanks,

Igor


On 8/20/2020 11:39 AM, Simon Oosthoek wrote:

Hi Michael,

thanks for the explanation! So if I understand correctly, we waste 93 
GB per OSD on unused NVME space, because only 30GB is actually used...?


And to improve the space for rocksdb, we need to plan for 300GB per 
rocksdb partition in order to benefit from this advantage


Reducing the number of small files is something we always ask of our 
users, but reality is what it is ;-)


I'll have to look into how I can get an informative view on these 
metrics... It's pretty overwhelming the amount of information coming 
out of the ceph cluster, even when you look only superficially...


Cheers,

/Simon

On 20/08/2020 10:16, Michael Bisig wrote:

Hi Simon

As far as I know, RocksDB only uses "leveled" space on the NVME 
partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every 
DB space above such a limit will automatically end up on slow devices.
In your setup where you have 123GB per OSD that means you only use 
30GB of fast device. The DB which spills over this limit will be 
offloaded to the HDD and accordingly, it slows down requests and 
compactions.


You can proof what your OSD currently consumes with:
   ceph daemon osd.X perf dump

Informative values are `db_total_bytes`, `db_used_bytes` and 
`slow_used_bytes`. This changes regularly because of the ongoing 
compactions but Prometheus mgr module exports these values such that 
you can track it.


Small files generally leads to bigger RocksDB, especially when you 
use EC, but this depends on the actual amount and file sizes.


I hope this helps.
Regards,
Michael

On 20.08.20, 09:10, "Simon Oosthoek"  wrote:

 Hi

 Recently our ceph cluster (nautilus) is experiencing bluefs 
spillovers,

 just 2 osd's and I disabled the warning for these osds.
 (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)

 I'm wondering what causes this and how this can be prevented.

 As I understand it the rocksdb for the OSD needs to store more 
than fits
 on the NVME logical volume (123G for 12T OSD). A way to fix it 
could be
 to increase the logical volume on the nvme (if there was space 
on the

 nvme, which there isn't at the moment).

 This is the current size of the cluster and how much is free:

 [root@cephmon1 ~]# ceph df
 RAW STORAGE:
  CLASS SIZE    AVAIL   USED    RAW USED 
%RAW USED
  hdd   1.8 PiB 842 TiB 974 TiB  974 
TiB 53.63
  TOTAL 1.8 PiB 842 TiB 974 TiB  974 
TiB 53.63


 POOLS:
  POOL    ID STORED  OBJECTS USED
 %USED MAX AVAIL
  cephfs_data  1 572 MiB 121.26M 2.4 GiB
 0   167 TiB
  cephfs_metadata  2  56 GiB 5.15M  57 GiB
 0   167 TiB
  cephfs_data_3copy    8 201 GiB  51.68k 602 GiB
 0.09   222 TiB
  cephfs_data_ec83    13 643 TiB 279.75M 953 TiB
 58.86   485 TiB
  rbd 14  21 GiB 5.66k  64 GiB
 0   222 TiB
  .rgw.root   15 1.2 KiB 4   1 MiB
 0   167 TiB
  default.rgw.control 16 0 B 8 0 B
 0   167 TiB
  default.rgw.meta    17   765 B 4   1 MiB
 0   167 TiB
  default.rgw.log 18 0 B 207 0 B
 0   167 TiB
  cephfs_data_ec57    20 433 MiB 230 1.2 GiB
 0   278 TiB

 The amount used can still grow a bit before we need to add 
nodes, but
 apparently we are running into the limits of our rocskdb 
partitions.


 Did we choose a parameter (e.g. minimal object size) too small, 
so we
 have too much objects on these spillover OSDs? Or is it that too 
many

 small files are stored on the cephfs filesystems?

 When we expand the cluster, we can choose larger nvme devices to 
allow
 larger rocksdb partitions, but is that the right way to deal 
with this,
 or should we adjust some parameters on the cluster that will 
reduce the

 rocksdb size?

 Cheers

 /Simon
 ___
 ceph-users mailing list -- ceph-users@ceph.io
 To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Simon Oosthoek

Hi Michael,

thanks for the pointers! This is our first production ceph cluster and 
we have to learn as we go... Small files are always a problem for all 
(networked) filesystems. Usually they just trash performance, but in 
this case they have another unfortunate side effect on rocksdb :-(


Cheers

/Simon

On 20/08/2020 11:06, Michael Bisig wrote:

Hi Simon

Unfortunately, the other NVME space is wasted or at least, this is the 
information we gathered during our research. This fact is due to the RocksDB 
level management which is explained here 
(https://github.com/facebook/rocksdb/wiki/Leveled-Compaction). I don't think 
it's a hard limit but it will be something above these values. Also consult 
this thread 
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.html).
 It's probably better to go a bit over these limits to be on the safe side.

Exactly, reality is always different. We also struggle with small files which 
lead to further problems. Accordingly, the right initial setting is pretty 
important and depends on your individual usecase.

Regards,
Michael

On 20.08.20, 10:40, "Simon Oosthoek"  wrote:

 Hi Michael,

 thanks for the explanation! So if I understand correctly, we waste 93 GB
 per OSD on unused NVME space, because only 30GB is actually used...?

 And to improve the space for rocksdb, we need to plan for 300GB per
 rocksdb partition in order to benefit from this advantage

 Reducing the number of small files is something we always ask of our
 users, but reality is what it is ;-)

 I'll have to look into how I can get an informative view on these
 metrics... It's pretty overwhelming the amount of information coming out
 of the ceph cluster, even when you look only superficially...

 Cheers,

 /Simon

 On 20/08/2020 10:16, Michael Bisig wrote:
 > Hi Simon
 >
 > As far as I know, RocksDB only uses "leveled" space on the NVME 
partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every DB space above such a 
limit will automatically end up on slow devices.
 > In your setup where you have 123GB per OSD that means you only use 30GB 
of fast device. The DB which spills over this limit will be offloaded to the HDD 
and accordingly, it slows down requests and compactions.
 >
 > You can proof what your OSD currently consumes with:
 >ceph daemon osd.X perf dump
 >
 > Informative values are `db_total_bytes`, `db_used_bytes` and 
`slow_used_bytes`. This changes regularly because of the ongoing compactions but 
Prometheus mgr module exports these values such that you can track it.
 >
 > Small files generally leads to bigger RocksDB, especially when you use 
EC, but this depends on the actual amount and file sizes.
 >
 > I hope this helps.
 > Regards,
 > Michael
 >
 > On 20.08.20, 09:10, "Simon Oosthoek"  wrote:
 >
 >  Hi
 >
 >  Recently our ceph cluster (nautilus) is experiencing bluefs 
spillovers,
 >  just 2 osd's and I disabled the warning for these osds.
 >  (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
 >
 >  I'm wondering what causes this and how this can be prevented.
 >
 >  As I understand it the rocksdb for the OSD needs to store more than 
fits
 >  on the NVME logical volume (123G for 12T OSD). A way to fix it 
could be
 >  to increase the logical volume on the nvme (if there was space on 
the
 >  nvme, which there isn't at the moment).
 >
 >  This is the current size of the cluster and how much is free:
 >
 >  [root@cephmon1 ~]# ceph df
 >  RAW STORAGE:
 >   CLASS SIZEAVAIL   USEDRAW USED 
%RAW USED
 >   hdd   1.8 PiB 842 TiB 974 TiB  974 TiB 
53.63
 >   TOTAL 1.8 PiB 842 TiB 974 TiB  974 TiB 
53.63
 >
 >  POOLS:
 >   POOLID STORED  OBJECTS USED
 >  %USED MAX AVAIL
 >   cephfs_data  1 572 MiB 121.26M 2.4 GiB
 >  0   167 TiB
 >   cephfs_metadata  2  56 GiB   5.15M  57 GiB
 >  0   167 TiB
 >   cephfs_data_3copy8 201 GiB  51.68k 602 GiB
 >  0.09   222 TiB
 >   cephfs_data_ec8313 643 TiB 279.75M 953 TiB
 >  58.86   485 TiB
 >   rbd 14  21 GiB   5.66k  64 GiB
 >  0   222 TiB
 >   .rgw.root   15 1.2 KiB   4   1 MiB
 >  0   167 TiB
 >   default.rgw.control 16 0 B   8 0 B
 >  0   167 TiB
 >   default.rgw.meta17   765 B   4   1 MiB

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Michael Bisig
Hi Simon

Unfortunately, the remaining NVMe space is wasted, or at least that is what 
we gathered during our research. This is due to the RocksDB level management, 
which is explained here 
(https://github.com/facebook/rocksdb/wiki/Leveled-Compaction). I don't think 
these are hard limits, but the space actually needed will be somewhat above 
these values. Also see this thread 
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.html).
It's probably better to go a bit over these limits to be on the safe side.
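
A rough Python sketch of "going a bit over these limits" when planning a new DB partition. The function name and the 10% headroom are assumptions made up for illustration; this thread does not give a concrete margin, and the thresholds are just the rounded values mentioned earlier in the thread.

```python
# Rounded level thresholds as quoted in this thread (in GB).
LEVEL_THRESHOLDS_GB = [0.3, 3, 30, 300]


def suggested_partition_gb(expected_db_gb, headroom_pct=10,
                           thresholds=LEVEL_THRESHOLDS_GB):
    """Pick the smallest threshold that holds the expected DB size,
    then add some headroom on top (headroom_pct is an assumed margin).

    Returns None when the expected DB is larger than every threshold.
    """
    for threshold in sorted(thresholds):
        if threshold >= expected_db_gb:
            return threshold * (100 + headroom_pct) / 100
    return None


if __name__ == "__main__":
    for expected in (20, 56, 1000):
        print(expected, suggested_partition_gb(expected))
```

Under these assumptions, a DB expected to grow past 30GB would call for a partition of a bit more than 300GB, not something in between.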

Exactly, reality is always different. We also struggle with small files, which 
lead to further problems. Accordingly, getting the initial settings right is 
pretty important and depends on your individual use case.

Regards,
Michael

On 20.08.20, 10:40, "Simon Oosthoek"  wrote:

Hi Michael,

thanks for the explanation! So if I understand correctly, we waste 93 GB 
per OSD on unused NVME space, because only 30GB is actually used...?

And to improve the space for rocksdb, we need to plan for 300GB per 
rocksdb partition in order to benefit from this advantage

Reducing the number of small files is something we always ask of our 
users, but reality is what it is ;-)

I'll have to look into how I can get an informative view on these 
metrics... It's pretty overwhelming the amount of information coming out 
of the ceph cluster, even when you look only superficially...

Cheers,

/Simon

On 20/08/2020 10:16, Michael Bisig wrote:
> Hi Simon
> 
> As far as I know, RocksDB only uses "leveled" space on the NVME 
partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every DB space 
above such a limit will automatically end up on slow devices.
> In your setup where you have 123GB per OSD that means you only use 30GB 
of fast device. The DB which spills over this limit will be offloaded to the 
HDD and accordingly, it slows down requests and compactions.
> 
> You can proof what your OSD currently consumes with:
>ceph daemon osd.X perf dump
> 
> Informative values are `db_total_bytes`, `db_used_bytes` and 
`slow_used_bytes`. This changes regularly because of the ongoing compactions 
but Prometheus mgr module exports these values such that you can track it.
> 
> Small files generally leads to bigger RocksDB, especially when you use 
EC, but this depends on the actual amount and file sizes.
> 
> I hope this helps.
> Regards,
> Michael
> 
> On 20.08.20, 09:10, "Simon Oosthoek"  wrote:
> 
>  Hi
> 
>  Recently our ceph cluster (nautilus) is experiencing bluefs 
spillovers,
>  just 2 osd's and I disabled the warning for these osds.
>  (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
> 
>  I'm wondering what causes this and how this can be prevented.
> 
>  As I understand it the rocksdb for the OSD needs to store more than 
fits
>  on the NVME logical volume (123G for 12T OSD). A way to fix it could 
be
>  to increase the logical volume on the nvme (if there was space on the
>  nvme, which there isn't at the moment).
> 
>  This is the current size of the cluster and how much is free:
> 
>  [root@cephmon1 ~]# ceph df
>  RAW STORAGE:
>   CLASS SIZEAVAIL   USEDRAW USED %RAW 
USED
>   hdd   1.8 PiB 842 TiB 974 TiB  974 TiB 
53.63
>   TOTAL 1.8 PiB 842 TiB 974 TiB  974 TiB 
53.63
> 
>  POOLS:
>   POOLID STORED  OBJECTS USED
>  %USED MAX AVAIL
>   cephfs_data  1 572 MiB 121.26M 2.4 GiB
>  0   167 TiB
>   cephfs_metadata  2  56 GiB   5.15M  57 GiB
>  0   167 TiB
>   cephfs_data_3copy8 201 GiB  51.68k 602 GiB
>  0.09   222 TiB
>   cephfs_data_ec8313 643 TiB 279.75M 953 TiB
>  58.86   485 TiB
>   rbd 14  21 GiB   5.66k  64 GiB
>  0   222 TiB
>   .rgw.root   15 1.2 KiB   4   1 MiB
>  0   167 TiB
>   default.rgw.control 16 0 B   8 0 B
>  0   167 TiB
>   default.rgw.meta17   765 B   4   1 MiB
>  0   167 TiB
>   default.rgw.log 18 0 B 207 0 B
>  0   167 TiB
>   cephfs_data_ec5720 433 MiB 230 1.2 GiB
>  0   278 TiB
> 
>  The amount used can still grow a bit before we need to add nodes, but
>  apparently we are running into the limits of our rocskdb 

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Simon Oosthoek

Hi Michael,

thanks for the explanation! So if I understand correctly, we waste 93 GB 
per OSD on unused NVME space, because only 30GB is actually used...?


And to improve the space for rocksdb, we need to plan for 300GB per 
rocksdb partition in order to benefit from this advantage?


Reducing the number of small files is something we always ask of our 
users, but reality is what it is ;-)


I'll have to look into how I can get an informative view on these 
metrics... It's pretty overwhelming the amount of information coming out 
of the ceph cluster, even when you look only superficially...


Cheers,

/Simon

On 20/08/2020 10:16, Michael Bisig wrote:

Hi Simon

As far as I know, RocksDB only uses "leveled" space on the NVME partition. The 
values are set to be 300MB, 3GB, 30GB and 300GB. Every DB space above such a limit will 
automatically end up on slow devices.
In your setup where you have 123GB per OSD that means you only use 30GB of fast 
device. The DB which spills over this limit will be offloaded to the HDD and 
accordingly, it slows down requests and compactions.

You can proof what your OSD currently consumes with:
   ceph daemon osd.X perf dump

Informative values are `db_total_bytes`, `db_used_bytes` and `slow_used_bytes`. 
This changes regularly because of the ongoing compactions but Prometheus mgr 
module exports these values such that you can track it.

Small files generally leads to bigger RocksDB, especially when you use EC, but 
this depends on the actual amount and file sizes.

I hope this helps.
Regards,
Michael

On 20.08.20, 09:10, "Simon Oosthoek"  wrote:

 Hi

 Recently our ceph cluster (nautilus) is experiencing bluefs spillovers,
 just 2 osd's and I disabled the warning for these osds.
 (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)

 I'm wondering what causes this and how this can be prevented.

 As I understand it the rocksdb for the OSD needs to store more than fits
 on the NVME logical volume (123G for 12T OSD). A way to fix it could be
 to increase the logical volume on the nvme (if there was space on the
 nvme, which there isn't at the moment).

 This is the current size of the cluster and how much is free:

 [root@cephmon1 ~]# ceph df
 RAW STORAGE:
     CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
     hdd       1.8 PiB     842 TiB     974 TiB     974 TiB          53.63
     TOTAL     1.8 PiB     842 TiB     974 TiB     974 TiB          53.63

 POOLS:
     POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
     cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
     cephfs_metadata          2      56 GiB       5.15M      57 GiB         0       167 TiB
     cephfs_data_3copy        8     201 GiB      51.68k     602 GiB      0.09       222 TiB
     cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
     rbd                     14      21 GiB       5.66k      64 GiB         0       222 TiB
     .rgw.root               15     1.2 KiB           4       1 MiB         0       167 TiB
     default.rgw.control     16         0 B           8         0 B         0       167 TiB
     default.rgw.meta        17       765 B           4       1 MiB         0       167 TiB
     default.rgw.log         18         0 B         207         0 B         0       167 TiB
     cephfs_data_ec57        20     433 MiB         230     1.2 GiB         0       278 TiB

 The amount used can still grow a bit before we need to add nodes, but
 apparently we are running into the limits of our rocksdb partitions.

 Did we choose a parameter (e.g. minimal object size) too small, so we
 have too many objects on these spillover OSDs? Or is it that too many
 small files are stored on the cephfs filesystems?

 When we expand the cluster, we can choose larger nvme devices to allow
 larger rocksdb partitions, but is that the right way to deal with this,
 or should we adjust some parameters on the cluster that will reduce the
 rocksdb size?

 Cheers

 /Simon


[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Michael Bisig
Hi Simon

As far as I know, RocksDB only uses "leveled" space on the NVMe partition. The 
level sizes are set to 300MB, 3GB, 30GB and 300GB. Any DB data above such a 
limit will automatically end up on the slow device. 
In your setup, where you have 123GB per OSD, that means you only use 30GB of 
the fast device. The part of the DB that spills over this limit is offloaded 
to the HDD and, accordingly, slows down requests and compactions.
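
To make the arithmetic explicit, here is a minimal Python sketch of this simplified threshold model. The function names are made up for illustration, and the thresholds are just the rounded values quoted in this thread, not values read from a running cluster.

```python
# Simplified model from this thread: DB space on the fast device is only
# useful up to the largest "level" threshold that fits on the partition.
LEVEL_THRESHOLDS_GB = [0.3, 3, 30, 300]  # 300MB, 3GB, 30GB, 300GB


def usable_fast_space_gb(partition_gb, thresholds=LEVEL_THRESHOLDS_GB):
    """Return the largest threshold that fits in the partition (0 if none)."""
    fitting = [t for t in thresholds if t <= partition_gb]
    return max(fitting) if fitting else 0


def unused_fast_space_gb(partition_gb, thresholds=LEVEL_THRESHOLDS_GB):
    """Return the partition space beyond the usable threshold."""
    return partition_gb - usable_fast_space_gb(partition_gb, thresholds)


if __name__ == "__main__":
    for size_gb in (123, 300):
        print(size_gb,
              usable_fast_space_gb(size_gb),
              unused_fast_space_gb(size_gb))
```

With a 123GB partition this model gives 30GB usable and 93GB unused. With 300GB the whole partition counts as usable.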

You can check what your OSD currently consumes with:
  ceph daemon osd.X perf dump

Informative values are `db_total_bytes`, `db_used_bytes` and `slow_used_bytes`. 
These change regularly because of the ongoing compactions, but the Prometheus 
mgr module exports these values so that you can track them.
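
A small Python sketch for pulling those three counters out of the perf dump JSON. It assumes the counters sit under a top-level "bluefs" section of the output; the helper name and the sample numbers are made up for illustration and are not from a real OSD.

```python
import json

# Made-up sample in the assumed shape of `ceph daemon osd.X perf dump`;
# the numbers are illustrative only.
SAMPLE_PERF_DUMP = json.dumps({
    "bluefs": {
        "db_total_bytes": 132070244352,
        "db_used_bytes": 31138512896,
        "slow_used_bytes": 4294967296,
    }
})


def bluefs_usage(perf_dump_json):
    """Extract the DB usage counters and flag spillover to the slow device."""
    counters = json.loads(perf_dump_json).get("bluefs", {})
    slow_used = counters.get("slow_used_bytes", 0)
    return {
        "db_total_bytes": counters.get("db_total_bytes", 0),
        "db_used_bytes": counters.get("db_used_bytes", 0),
        "slow_used_bytes": slow_used,
        # Any DB bytes on the slow device mean the DB has spilled over.
        "spillover": slow_used > 0,
    }


if __name__ == "__main__":
    print(bluefs_usage(SAMPLE_PERF_DUMP))
```

To use it against a real OSD, the output of the perf dump command would be passed in as the JSON string instead of the sample.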

Small files generally lead to a bigger RocksDB, especially when you use EC, but 
this depends on the actual number of files and their sizes.

I hope this helps.
Regards,
Michael

On 20.08.20, 09:10, "Simon Oosthoek"  wrote:

Hi

Recently our ceph cluster (nautilus) is experiencing bluefs spillovers, 
just 2 osd's and I disabled the warning for these osds.
(ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)

I'm wondering what causes this and how this can be prevented.

As I understand it the rocksdb for the OSD needs to store more than fits 
on the NVME logical volume (123G for 12T OSD). A way to fix it could be 
to increase the logical volume on the nvme (if there was space on the 
nvme, which there isn't at the moment).

This is the current size of the cluster and how much is free:

[root@cephmon1 ~]# ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd       1.8 PiB     842 TiB     974 TiB     974 TiB          53.63
    TOTAL     1.8 PiB     842 TiB     974 TiB     974 TiB          53.63

POOLS:
    POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
    cephfs_metadata          2      56 GiB       5.15M      57 GiB         0       167 TiB
    cephfs_data_3copy        8     201 GiB      51.68k     602 GiB      0.09       222 TiB
    cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
    rbd                     14      21 GiB       5.66k      64 GiB         0       222 TiB
    .rgw.root               15     1.2 KiB           4       1 MiB         0       167 TiB
    default.rgw.control     16         0 B           8         0 B         0       167 TiB
    default.rgw.meta        17       765 B           4       1 MiB         0       167 TiB
    default.rgw.log         18         0 B         207         0 B         0       167 TiB
    cephfs_data_ec57        20     433 MiB         230     1.2 GiB         0       278 TiB

The amount used can still grow a bit before we need to add nodes, but 
apparently we are running into the limits of our rocksdb partitions.

Did we choose a parameter (e.g. minimal object size) too small, so we 
have too many objects on these spillover OSDs? Or is it that too many 
small files are stored on the cephfs filesystems?

When we expand the cluster, we can choose larger nvme devices to allow 
larger rocksdb partitions, but is that the right way to deal with this, 
or should we adjust some parameters on the cluster that will reduce the 
rocksdb size?

Cheers

/Simon