Thanks Igor,

I did see that L4 sizing and thought it seemed auspicious.
Though after looking at a couple of other OSDs with this issue, the sum of L0-L4 appears 
to match a rounded-off version of the metadata size reported in ceph osd df tree.
So I'm not sure whether that is actually showing the size of the level store, or just 
what is currently stored in each level?
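
For reference, the comparison I'm making is roughly the following. The per-level sizes 
are from the RocksDB stats you quoted below, and the grep is just an illustrative way of 
pulling the META column for this one OSD:

    # sum of the per-level Size column from the RocksDB stats:
    #   29.39 MB + 22.31 MB + 94.03 MB + 273.29 MB + 12.82 GB ≈ 13.23 GB
    # compared against the META column for the same OSD in:
    ceph osd df tree | grep -E 'NAME|osd\.36'
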
> No more ideas but do data migration using ceph-bluestore-tool. 
> 
Would this imply backing up the current block.db, re-creating the block.db, and then 
moving the backup onto the new block.db?

Just asking because I have never touched moving the block.db/WAL, and I was actually 
under the impression that it couldn't be done until the last few years, as more people 
kept running into spillovers.

Previously, when I was expanding my block.db, I just re-paved the OSDs, which was my 
likely course of action for this OSD if I was unsuccessful in clearing this as is.

Would that be bluefs-export and then bluefs-bdev-new-db?
Though that doesn't exactly look like it would work.
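
If it helps to be concrete, what I had pictured was something along these lines (the 
device path is a made-up placeholder, and I'm not confident this is the right sequence 
at all):

    systemctl stop ceph-osd@36
    # dump the current BlueFS contents (the RocksDB files) to a scratch directory
    ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-36 --out-dir /root/osd36-bluefs
    # attach a new DB device -- though as I understand it this is meant for OSDs that
    # don't already have a separate block.db, which is why it doesn't look right here
    ceph-bluestore-tool bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-36 --dev-target /dev/nvme0n1pX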

I don't think I could do a migrate, since I don't have another block device to 
migrate from and to.
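
(My possibly-wrong reading of bluefs-bdev-migrate is that it might also accept the 
devices the OSD already has, i.e. pulling the spilled data off the slow device back 
onto the existing block.db. Something like the following, if I'm not misreading it:

    systemctl stop ceph-osd@36
    # move BlueFS data currently sitting on the slow (block) device back to the DB device
    ceph-bluestore-tool bluefs-bdev-migrate \
        --path /var/lib/ceph/osd/ceph-36 \
        --devs-source /var/lib/ceph/osd/ceph-36/block \
        --dev-target /var/lib/ceph/osd/ceph-36/block.db

If that's wrong, please ignore it.)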

Should/could I try bluefs-bdev-expand to see if it sees a bigger partition and 
tries to use it?
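
i.e., after growing the underlying partition, something like this (path again just an 
example):

    systemctl stop ceph-osd@36
    # let BlueFS detect and claim any extra space on its devices
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-36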

Otherwise, at this point I feel like re-paving may be the best path forward; I just 
wanted to provide any possible data points before doing that.

Thanks again for the help,

Reed

> On Jun 12, 2020, at 9:34 AM, Igor Fedotov <ifedo...@suse.de> wrote:
> 
> hmm, RocksDB reports 13GB at L4:
> 
>  "": "Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) 
> Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) 
> Comp(cnt) Avg(sec) KeyIn KeyDrop",
>     "": 
> "----------------------------------------------------------------------------------------------------------------------------------------------------------------------------",
>     "": "  L0      2/0   29.39 MB   0.5      0.0     0.0      0.0       0.0   
>    0.0       0.0   0.0      0.0      0.0      0.00              0.00         
> 0    0.000       0      0",
>     "": "  L1      1/0   22.31 MB   0.6      0.0     0.0      0.0       0.0   
>    0.0       0.0   0.0      0.0      0.0      0.00              0.00         
> 0    0.000       0      0",
>     "": "  L2      2/0   94.03 MB   0.3      0.0     0.0      0.0       0.0   
>    0.0       0.0   0.0      0.0      0.0      0.00              0.00         
> 0    0.000       0      0",
>     "": "  L3     12/0   273.29 MB   0.3      0.0     0.0      0.0       0.0  
>     0.0       0.0   0.0      0.0      0.0      0.00              0.00         
> 0    0.000       0      0",
>     "": "  L4    205/0   12.82 GB   0.1      0.0     0.0      0.0       0.0   
>    0.0       0.0   0.0      0.0      0.0      0.00              0.00         
> 0    0.000       0      0",
>     "": " Sum    222/0   13.23 GB   0.0      0.0     0.0      0.0       0.0   
>    0.0       0.0   0.0      0.0      0.0      0.00              0.00         
> 0    0.000       0      0",
> 
> which is unlikely to be correct...
> 
> No more ideas but do data migration using ceph-bluestore-tool. 
> 
> I would appreciate it if you could share whether it helps in both the short and 
> long term. Will this reappear or not?
> 
> 
> Thanks,
> 
> Igor
> 
> 
> 
> On 6/12/2020 5:17 PM, Reed Dier wrote:
>> Thanks for sticking with me Igor.
>> 
>> Attached is the ceph-kvstore-tool stats output.
>> 
>> Hopefully something interesting in here.
>> 
>> Thanks,
>> 
>> Reed
>> 
>> 
>> 
>> 
>> 
>>> On Jun 12, 2020, at 6:56 AM, Igor Fedotov <ifedo...@suse.de> wrote:
>>> 
>>> Hi Reed,
>>> 
>>> thanks for the log.
>>> 
>>> Nothing much of interest there, though. Just a regular SST file that RocksDB 
>>> instructed to be put on the "slow" device. Presumably it belongs to a higher level, 
>>> hence the desire to put it that "far". Or (which is less likely) RocksDB 
>>> lacked free space when doing compaction at some point and spilled some data 
>>> out. So I was wrong - ceph-kvstore-tool's stats command output might be 
>>> helpful...
>>> 
>>> 
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> On 6/11/2020 5:14 PM, Reed Dier wrote:
>>>> Apologies for the delay Igor,
>>>> 
>>>> Hopefully you are still interested in taking a look.
>>>> 
>>>> Attached is the bluestore bluefs-log-dump output.
>>>> I gzipped it as the log was very large.
>>>> Let me know if there is anything else I can do to help track this down.
>>>> 
>>>> Thanks,
>>>> 
>>>> Reed
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Jun 8, 2020, at 8:04 AM, Igor Fedotov <ifedo...@suse.de> wrote:
>>>>> 
>>>>> Reed,
>>>>> 
>>>>> No, "ceph-kvstore-tool stats" isn't be of any interest.
>>>>> 
>>>>> For the sake of a better understanding of the issue, it might be interesting to 
>>>>> have the bluefs log dump obtained via ceph-bluestore-tool's bluefs-log-dump 
>>>>> command. This will give some insight into which RocksDB files are spilled over. 
>>>>> It's still not clear what the root cause of the issue is. It's not that 
>>>>> frequent or dangerous though, so there is no active investigation into it...
>>>>> 
>>>>> Wondering if migration has helped though?
>>>>> 
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Igor
>>>>> 
>>>>> On 6/6/2020 8:00 AM, Reed Dier wrote:
>>>>>> The WAL/DB was part of the OSD deployment.
>>>>>> 
>>>>>> OSD is running 14.2.9.
>>>>>> 
>>>>>> Would grabbing the ceph-kvstore-tool bluestore-kv <path-to-osd> stats output, as 
>>>>>> in that ticket, be of any use for this?
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Reed
>>>>>> 
>>>>>>> On Jun 5, 2020, at 5:27 PM, Igor Fedotov <ifedo...@suse.de> wrote:
>>>>>>> 
>>>>>>> This might help - see comment #4 at 
>>>>>>> https://tracker.ceph.com/issues/44509
>>>>>>> 
>>>>>>> And just for the sake of information collection - what Ceph version is 
>>>>>>> used in this cluster?
>>>>>>> 
>>>>>>> Did you set up the DB volume along with the OSD deployment, or was it added 
>>>>>>> later, as was done in the ticket above?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Igor
>>>>>>> 
>>>>>>> On 6/6/2020 1:07 AM, Reed Dier wrote:
>>>>>>>> I'm going to piggyback on this somewhat.
>>>>>>>> 
>>>>>>>> I've battled RocksDB spillovers over the course of the life of the 
>>>>>>>> cluster since moving to BlueStore; however, I have always been able to 
>>>>>>>> compact them well enough.
>>>>>>>> 
>>>>>>>> But now I am stumped at getting this one to compact via $ ceph tell 
>>>>>>>> osd.$osd compact, which has always worked in the past.
>>>>>>>> 
>>>>>>>> No matter how many times I compact it, I always spill over exactly 
>>>>>>>> 192 KiB.
>>>>>>>>> BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
>>>>>>>>>      osd.36 spilled over 192 KiB metadata from 'db' device (26 GiB used of 34 GiB) to slow device
>>>>>>>>>      osd.36 spilled over 192 KiB metadata from 'db' device (16 GiB used of 34 GiB) to slow device
>>>>>>>>>      osd.36 spilled over 192 KiB metadata from 'db' device (22 GiB used of 34 GiB) to slow device
>>>>>>>>>      osd.36 spilled over 192 KiB metadata from 'db' device (13 GiB used of 34 GiB) to slow device
>>>>>>>> 
>>>>>>>> The multiple entries are from different attempts at compacting it.
>>>>>>>> 
>>>>>>>> The OSD is a 1.92TB SATA SSD, the WAL/DB is a 36GB partition on NVMe.
>>>>>>>> I tailed and tee'd the OSD's logs during a manual compaction here: 
>>>>>>>> https://pastebin.com/bcpcRGEe
>>>>>>>> This is with the normal logging level.
>>>>>>>> I have no idea how to make heads or tails of that log data, but maybe 
>>>>>>>> someone can figure out why this one OSD just refuses to compact?
>>>>>>>> 
>>>>>>>> OSD is 14.2.9.
>>>>>>>> OS is U18.04.
>>>>>>>> Kernel is 4.15.0-96.
>>>>>>>> 
>>>>>>>> I haven't played with ceph-bluestore-tool or ceph-kvstore-tool, but 
>>>>>>>> after seeing them mentioned above in this thread, I do see 
>>>>>>>> ceph-kvstore-tool <rocksdb|bluestore-kv?> compact, which sounds like 
>>>>>>>> it may be the same thing that ceph tell compact does under the hood?
>>>>>>>>> compact
>>>>>>>>> Subcommand compact is used to compact all data of kvstore. It will 
>>>>>>>>> open the database, and trigger a database's compaction. After 
>>>>>>>>> compaction, some disk space may be released.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Also, not sure if this is helpful:
>>>>>>>>> osd.36 spilled over 192 KiB metadata from 'db' device (13 GiB used of 34 GiB) to slow device
>>>>>>>>> ID   CLASS WEIGHT    REWEIGHT SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
>>>>>>>>>   36   ssd   1.77879  1.00000 1.8 TiB  1.2 TiB 1.2 TiB 6.2 GiB 7.2 GiB 603 GiB 66.88 0.94  85     up          osd.36
>>>>>>>> You can see the breakdown between OMAP data and META data.
>>>>>>>> 
>>>>>>>> After compacting again:
>>>>>>>>> osd.36 spilled over 192 KiB metadata from 'db' device (26 GiB used of 34 GiB) to slow device
>>>>>>>>> ID   CLASS WEIGHT    REWEIGHT SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
>>>>>>>>>   36   ssd   1.77879  1.00000 1.8 TiB  1.2 TiB 1.2 TiB 6.2 GiB  20 GiB 603 GiB 66.88 0.94  85     up          osd.36
>>>>>>>> 
>>>>>>>> So the OMAP size remained the same, while the metadata ballooned 
>>>>>>>> (while still conspicuously spilling over exactly 192 KiB).
>>>>>>>> These OSDs have a few RBD images, cephfs metadata, and librados 
>>>>>>>> objects (not RGW) stored.
>>>>>>>> 
>>>>>>>> The breakdown of OMAP size is pretty widely binned, but the GiB sizes 
>>>>>>>> are definitely the minority.
>>>>>>>> Looking at the breakdown with some simple bash-fu
>>>>>>>> KiB = 147
>>>>>>>> MiB = 105
>>>>>>>> GiB = 24
>>>>>>>> 
>>>>>>>> To further divide that, all of the GiB-sized OMAPs are on SSD OSDs:
>>>>>>>> 
>>>>>>>>        SSD   HDD   TOTAL
>>>>>>>> KiB      0   147     147
>>>>>>>> MiB     36    69     105
>>>>>>>> GiB     24     0      24
>>>>>>>> 
>>>>>>>> I have no idea if any of these data points are pertinent or helpful, 
>>>>>>>> but I want to give as clear a picture as possible to prevent chasing 
>>>>>>>> the wrong thread.
>>>>>>>> Appreciate any help with this.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Reed
>>>>>>>> 
>>>>>>>>> On May 26, 2020, at 9:48 AM, thoralf schulze <t.schu...@tu-berlin.de> wrote:
>>>>>>>>> 
>>>>>>>>> hi there,
>>>>>>>>> 
>>>>>>>>> trying to get my head around rocksdb spillovers and how to deal with
>>>>>>>>> them … in particular, i have one osd which does not have any pools
>>>>>>>>> associated (as per ceph pg ls-by-osd $osd), yet it does show up in ceph
>>>>>>>>> health detail as:
>>>>>>>>> 
>>>>>>>>>     osd.$osd spilled over 2.9 MiB metadata from 'db' device (49 MiB used of 37 GiB) to slow device
>>>>>>>>> 
>>>>>>>>> compaction doesn't help. i am well aware of
>>>>>>>>> https://tracker.ceph.com/issues/38745 , yet find it really
>>>>>>>>> counter-intuitive that an empty osd with a more-or-less optimally sized db
>>>>>>>>> volume can't fit its rocksdb on the db volume.
>>>>>>>>> 
>>>>>>>>> is there any way to repair this, apart from re-creating the osd? fwiw,
>>>>>>>>> dumping the database with
>>>>>>>>> 
>>>>>>>>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd dump >
>>>>>>>>> bluestore_kv.dump
>>>>>>>>> 
>>>>>>>>> yields a file of less than 100mb in size.
>>>>>>>>> 
>>>>>>>>> and, while we're at it, a few more related questions:
>>>>>>>>> 
>>>>>>>>> - am i right to assume that the leveldb and rocksdb arguments to
>>>>>>>>> ceph-kvstore-tool are only relevant for osds with filestore-backend?
>>>>>>>>> - does ceph-kvstore-tool bluestore-kv … also deal with rocksdb-items 
>>>>>>>>> for
>>>>>>>>> osds with bluestore-backend?
>>>>>>>>> 
>>>>>>>>> thank you very much & with kind regards,
>>>>>>>>> thoralf.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>> 

