[ceph-users] Re: BlueFS spillover detected, why, what?

Simon Oosthoek Thu, 20 Aug 2020 02:14:20 -0700

Hi Michael,

thanks for the pointers! This is our first production ceph cluster and we have to learn as we go... Small files are always a problem for all (networked) filesystems; usually they just trash performance, but in this case they have another unfortunate side effect with the rocksdb :-(


Cheers

/Simon

On 20/08/2020 11:06, Michael Bisig wrote:

Hi Simon

Unfortunately, the remaining NVME space is wasted, or at least that is what
we gathered during our research. This is due to the RocksDB level management,
which is explained here
(https://github.com/facebook/rocksdb/wiki/Leveled-Compaction). I don't think
it's a hard limit but it will be something above these values. Also consult
this thread
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.html).
It's probably better to go a bit over these limits to be on the safe side.
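
To make the arithmetic concrete, here is a minimal sketch of the step model discussed in this thread. It assumes the thresholds are exactly the 300MB, 3GB, 30GB and 300GB quoted below and that only the largest threshold fitting in the partition is used on the fast device. The real limits are not hard and sit somewhat above these values, so treat the result as an estimate. The function names are made up for illustration.

```python
# Level thresholds in GB, as quoted in this thread (an approximation,
# not values read from Ceph or RocksDB).
LEVEL_THRESHOLDS_GB = [0.3, 3, 30, 300]


def usable_db_space_gb(partition_gb, thresholds=LEVEL_THRESHOLDS_GB):
    """Return the largest threshold that fits in the partition, or 0 if none fits."""
    fitting = [t for t in thresholds if t <= partition_gb]
    return max(fitting) if fitting else 0


def wasted_db_space_gb(partition_gb, thresholds=LEVEL_THRESHOLDS_GB):
    """Return the part of the partition that this step model leaves unused."""
    return partition_gb - usable_db_space_gb(partition_gb, thresholds)


if __name__ == "__main__":
    for size_gb in (123, 300):
        print(size_gb, usable_db_space_gb(size_gb), wasted_db_space_gb(size_gb))
```

For the 123GB partitions in this thread, the model gives 30GB used and 93GB unused. A partition of 300GB or slightly more would reach the next level.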

Exactly, reality is always different. We also struggle with small files, which
lead to further problems. Accordingly, the right initial setting is pretty
important and depends on your individual use case.

Regards,
Michael

On 20.08.20, 10:40, "Simon Oosthoek" <s.oosth...@science.ru.nl> wrote:

     Hi Michael,

     thanks for the explanation! So if I understand correctly, we waste 93 GB
     per OSD on unused NVME space, because only 30GB is actually used...?

     And to improve the space for rocksdb, we need to plan for 300GB per
     rocksdb partition in order to benefit from this advantage....

     Reducing the number of small files is something we always ask of our
     users, but reality is what it is ;-)

     I'll have to look into how I can get an informative view on these
     metrics... It's pretty overwhelming the amount of information coming out
     of the ceph cluster, even when you look only superficially...

     Cheers,

     /Simon

     On 20/08/2020 10:16, Michael Bisig wrote:
     > Hi Simon
     >
     > As far as I know, RocksDB only uses "leveled" space on the NVME
     > partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every
     > DB space above such a limit will automatically end up on slow devices.
     > In your setup where you have 123GB per OSD that means you only use 30GB
     > of fast device. The DB which spills over this limit will be offloaded to
     > the HDD and accordingly, it slows down requests and compactions.
     >
     > You can check what your OSD currently consumes with:
     >    ceph daemon osd.X perf dump
     >
     > Informative values are `db_total_bytes`, `db_used_bytes` and
     > `slow_used_bytes`. These change regularly because of the ongoing
     > compactions, but the Prometheus mgr module exports these values so that
     > you can track them.
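
As a rough sketch of how these three counters could be read from the perf dump output: the snippet below parses the JSON text and reports whether anything sits on the slow device. It assumes the counters live under a "bluefs" section of the dump. The function name and the sample numbers are hypothetical and only there to make the example runnable.

```python
import json


def bluefs_usage(perf_dump_json):
    """Summarise the BlueFS DB counters from `ceph daemon osd.X perf dump` output."""
    perf = json.loads(perf_dump_json)
    # Assumption: the three counters are found in the "bluefs" section.
    bluefs = perf.get("bluefs", {})
    total = bluefs.get("db_total_bytes", 0)
    used = bluefs.get("db_used_bytes", 0)
    slow = bluefs.get("slow_used_bytes", 0)
    return {
        "db_total_bytes": total,
        "db_used_bytes": used,
        "slow_used_bytes": slow,
        "db_used_fraction": used / total if total else 0.0,
        # Any bytes on the slow device mean the DB has spilled over.
        "spillover": slow > 0,
    }


# Hypothetical sample, not taken from a real OSD.
SAMPLE = json.dumps({
    "bluefs": {
        "db_total_bytes": 132070244352,
        "db_used_bytes": 31000000000,
        "slow_used_bytes": 2500000000,
    }
})

print(bluefs_usage(SAMPLE))
```

In practice the output of the command above would be passed in as the string instead of SAMPLE.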
     >
     > Small files generally lead to a bigger RocksDB, especially when you use
     > EC, but this depends on the actual amount and file sizes.
     >
     > I hope this helps.
     > Regards,
     > Michael
     >
     > On 20.08.20, 09:10, "Simon Oosthoek" <s.oosth...@science.ru.nl> wrote:
     >
     >      Hi
     >
     >      Recently our ceph cluster (nautilus) is experiencing bluefs
     >      spillovers, just 2 osd's and I disabled the warning for these osds.
     >      (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
     >
     >      I'm wondering what causes this and how this can be prevented.
     >
     >      As I understand it the rocksdb for the OSD needs to store more than
     >      fits on the NVME logical volume (123G for 12T OSD). A way to fix it
     >      could be to increase the logical volume on the nvme (if there was
     >      space on the nvme, which there isn't at the moment).
     >
     >      This is the current size of the cluster and how much is free:
     >
     >      [root@cephmon1 ~]# ceph df
     >      RAW STORAGE:
     >           CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
     >           hdd       1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
     >           TOTAL     1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
     >
     >      POOLS:
     >           POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
     >           cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
     >           cephfs_metadata          2      56 GiB       5.15M      57 GiB         0       167 TiB
     >           cephfs_data_3copy        8     201 GiB      51.68k     602 GiB      0.09       222 TiB
     >           cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
     >           rbd                     14      21 GiB       5.66k      64 GiB         0       222 TiB
     >           .rgw.root               15     1.2 KiB           4       1 MiB         0       167 TiB
     >           default.rgw.control     16         0 B           8         0 B         0       167 TiB
     >           default.rgw.meta        17       765 B           4       1 MiB         0       167 TiB
     >           default.rgw.log         18         0 B         207         0 B         0       167 TiB
     >           cephfs_data_ec57        20     433 MiB         230     1.2 GiB         0       278 TiB
     >
     >      The amount used can still grow a bit before we need to add nodes,
     >      but apparently we are running into the limits of our rocksdb
     >      partitions.
     >
     >      Did we choose a parameter (e.g. minimal object size) too small, so
     >      we have too many objects on these spillover OSDs? Or is it that too
     >      many small files are stored on the cephfs filesystems?
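
One rough way to get a feel for the small-file question is to divide STORED by OBJECTS for each pool in the `ceph df` output above. The sketch below does that for the human-readable strings. It assumes the object-count suffixes k and M are decimal (1000 and 1000000) and the size units are binary. The helper names are made up for illustration, and the result is only an average that says nothing about the distribution of file sizes.

```python
# Binary size units as printed by `ceph df`.
SIZE_UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40, "PiB": 2**50}
# Assumption: object counts use decimal suffixes.
COUNT_SUFFIXES = {"k": 1e3, "M": 1e6}


def parse_size(text):
    """Convert a string such as '643 TiB' to bytes."""
    value, unit = text.split()
    return float(value) * SIZE_UNITS[unit]


def parse_count(text):
    """Convert a string such as '279.75M' or '230' to a number of objects."""
    if text[-1] in COUNT_SUFFIXES:
        return float(text[:-1]) * COUNT_SUFFIXES[text[-1]]
    return float(text)


def avg_object_bytes(stored, objects):
    """Return the average stored bytes per object, or 0.0 for an empty pool."""
    count = parse_count(objects)
    return parse_size(stored) / count if count else 0.0


# Figures copied from the POOLS table above.
for pool, stored, objects in [
    ("cephfs_data_ec83", "643 TiB", "279.75M"),
    ("cephfs_data", "572 MiB", "121.26M"),
    ("cephfs_metadata", "56 GiB", "5.15M"),
]:
    print(pool, round(avg_object_bytes(stored, objects)), "bytes per object")
```

Pools with very many objects and very little stored data per object are the ones to look at first, since each object needs its own metadata in rocksdb regardless of its size.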
     >
     >      When we expand the cluster, we can choose larger nvme devices to
     >      allow larger rocksdb partitions, but is that the right way to deal
     >      with this, or should we adjust some parameters on the cluster that
     >      will reduce the rocksdb size?
     >
     >      Cheers
     >
     >      /Simon
     >      _______________________________________________
     >      ceph-users mailing list -- ceph-users@ceph.io
     >      To unsubscribe send an email to ceph-users-le...@ceph.io
     >
