Hi Igor,

I have now fixed my incorrect OSD debug config to: 
[osd.7] 
        debug bluefs = 20
        debug bdev = 20

and you can download the debug log from: https://we.tl/t-3e4do1PQGj
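
For what it's worth, I believe the same debug levels could also be passed on 
the command line when starting the OSD by hand (just a sketch, assuming a 
bare-metal setup and the default cluster name):

        ceph-osd -f --cluster ceph --id 7 --debug-bluefs 20 --debug-bdev 20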


Thanks,
Sebastian



> On 21.12.2021, at 19:44, Igor Fedotov <igor.fedo...@croit.io> wrote:
> 
> Hi Sebastian,
> 
> first of all, I'm not sure this issue has the same root cause as Francois' one. 
> Highly likely it's just another BlueFS/RocksDB data corruption that manifests 
> in the same way.
> 
> In this respect I would rather mention this one reported just yesterday: 
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/M2ZRZD4725SRPFE5MMZPI7JBNO23FNU6/
> 
> So, similarly, I'd like to ask some questions and collect more data. Please 
> find the list below:
> 
> 1) Is this a bare metal or containerized deployment?
> 
> 2) What's the output of "hdparm -W <dev>" for the devices in question? Is 
> write caching enabled at the disk controller? (A quick sketch for checking 
> this is below, after the questions.)
> 
> 3) Could you please share the broken OSD startup log with debug-bluefs set to 
> 20?
> 
> 4) Could you please export the bluefs files via ceph-bluestore-tool (this 
> might need some extra space to keep all the bluefs data on the target 
> filesystem) and share the content of the db/002182.sst file? The first 4 MB 
> would generally be sufficient if it's huge. (A rough sketch of the commands 
> is below, after the questions.)
> 
> 5) Have you seen RocksDB data corruption on this cluster before?
> 
> 6) What's the disk hardware for these OSDs - disk drives and controllers?
> 
> 7) Did you reboot the nodes or just restart the OSDs? Did all the issues 
> happen on the same node or on different nodes? How many OSDs were restarted 
> in total?
> 
> 8) Is it correct that this is an HDD-only setup, i.e. there is no standalone 
> SSD/NVMe for WAL/DB?
> 
> 9) Would you be able to run some long-lasting (and potentially data-corrupting) 
> experiments on this cluster in an attempt to pinpoint the issue? I'm thinking 
> of periodically shutting down an OSD under load, with a raised debug level for 
> that specific OSD, to catch the corrupting event. The major problem with 
> debugging this bug is that we can see its consequences, but we have no clue 
> about what was happening when the actual corruption occurred. Hence we need to 
> reproduce it somehow. So please let me know if we can use your cluster/your 
> help for that... (A rough sketch of such a loop is below.)
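> 
> Regarding 2): hdparm can both show and, if needed, drop the drive's volatile 
> write cache on SATA devices. Just a quick sketch, the device name is only an 
> example:
> 
>     hdparm -W /dev/sdc      # query the current write-caching state
>     hdparm -W 0 /dev/sdc    # disable the drive's volatile write cache
> 
> For a cache sitting at the controller level, the controller's own management 
> tool would probably be needed instead.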
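> 
> Regarding 4): roughly something along these lines should do; the output 
> directory is only an example and needs enough free space for the whole bluefs 
> contents:
> 
>     ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-7 --out-dir /mnt/osd7-bluefs
>     # then take just the first 4 MB of the suspicious file, e.g.:
>     dd if=/mnt/osd7-bluefs/db/002182.sst of=/tmp/002182.sst.first4m bs=1M count=4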
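> 
> Regarding 9): assuming a bare-metal systemd deployment, even a trivial loop 
> like the one below might be enough (just a sketch; the OSD id, debug levels 
> and restart interval are examples and should be adapted, and the cluster 
> should see client I/O while it runs):
> 
>     ceph config set osd.7 debug_bluefs 20/20
>     ceph config set osd.7 debug_bdev 20/20
>     while true; do
>         systemctl restart ceph-osd@7
>         sleep 600    # keep the OSD under load for ~10 minutes between restarts
>     done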
> 
> 
> Thanks in advance,
> 
> Igor
> 
> On 12/21/2021 7:47 PM, Sebastian Mazza wrote:
>> Hi all,
>> 
>> after a reboot of the cluster, 3 OSDs cannot be started. The OSDs exit with 
>> the following error message:
>>      2021-12-21T01:01:02.209+0100 7fd368cebf00  4 rocksdb: [db_impl/db_impl.cc:396] Shutdown: canceling all background work
>>      2021-12-21T01:01:02.209+0100 7fd368cebf00  4 rocksdb: [db_impl/db_impl.cc:573] Shutdown complete
>>      2021-12-21T01:01:02.209+0100 7fd368cebf00 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 0 in db/002182.sst
>>      2021-12-21T01:01:02.213+0100 7fd368cebf00 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_db erroring opening db:
>>      2021-12-21T01:01:02.213+0100 7fd368cebf00  1 bluefs umount
>>      2021-12-21T01:01:02.213+0100 7fd368cebf00  1 bdev(0x559bbe0ea800 /var/lib/ceph/osd/ceph-7/block) close
>>      2021-12-21T01:01:02.293+0100 7fd368cebf00  1 bdev(0x559bbe0ea400 /var/lib/ceph/osd/ceph-7/block) close
>>      2021-12-21T01:01:02.537+0100 7fd368cebf00 -1 osd.7 0 OSD:init: unable to mount object store
>>      2021-12-21T01:01:02.537+0100 7fd368cebf00 -1  ** ERROR: osd init failed: (5) Input/output error
>> 
>> 
>> I found a similar problem on this mailing list: 
>> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/MJLVS7UPJ5AZKOYN3K2VQW7WIOEQGC5V/#MABLFA4FHG6SX7YN4S6BGSCP6DOAX6UE
>> 
>> In this thread, Francois was able to successfully repair his OSD data with 
>> `ceph-bluestore-tool fsck`. I tried to run:
>> `ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-7 -l /var/log/ceph/bluestore-tool-fsck-osd-7.log --log-level 20 > /var/log/ceph/bluestore-tool-fsck-osd-7.out 2>&1`
>> But that results in:
>>      2021-12-21T16:44:18.455+0100 7fc54ef7a240 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 0 in db/002182.sst
>>      2021-12-21T16:44:18.455+0100 7fc54ef7a240 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_db erroring opening db:
>>      fsck failed: (5) Input/output error
>> 
>> I also tried to run `ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-7 repair`.
>> But that also fails with:
>>      2021-12-21T17:34:06.780+0100 7f35765f7240  0 bluestore(/var/lib/ceph/osd/ceph-7) _open_db_and_around read-only:0 repair:0
>>      2021-12-21T17:34:06.780+0100 7f35765f7240  1 bdev(0x55fce5a1a800 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
>>      2021-12-21T17:34:06.780+0100 7f35765f7240  1 bdev(0x55fce5a1a800 /var/lib/ceph/osd/ceph-7/block) open size 12000134430720 (0xae9ffc00000, 11 TiB) block_size 4096 (4 KiB) rotational discard not supported
>>      2021-12-21T17:34:06.780+0100 7f35765f7240  1 bluestore(/var/lib/ceph/osd/ceph-7) _set_cache_sizes cache_size 1073741824 meta 0.45 kv 0.45 data 0.06
>>      2021-12-21T17:34:06.780+0100 7f35765f7240  1 bdev(0x55fce5a1ac00 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
>>      2021-12-21T17:34:06.780+0100 7f35765f7240  1 bdev(0x55fce5a1ac00 /var/lib/ceph/osd/ceph-7/block) open size 12000134430720 (0xae9ffc00000, 11 TiB) block_size 4096 (4 KiB) rotational discard not supported
>>      2021-12-21T17:34:06.780+0100 7f35765f7240  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-7/block size 11 TiB
>>      2021-12-21T17:34:06.780+0100 7f35765f7240  1 bluefs mount
>>      2021-12-21T17:34:06.780+0100 7f35765f7240  1 bluefs _init_alloc shared, id 1, capacity 0xae9ffc00000, block size 0x10000
>>      2021-12-21T17:34:06.904+0100 7f35765f7240  1 bluefs mount shared_bdev_used = 0
>>      2021-12-21T17:34:06.904+0100 7f35765f7240  1 bluestore(/var/lib/ceph/osd/ceph-7) _prepare_db_environment set db_paths to db,11400127709184 db.slow,11400127709184
>>      2021-12-21T17:34:06.908+0100 7f35765f7240 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 0 in db/002182.sst
>>      2021-12-21T17:34:06.908+0100 7f35765f7240 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_db erroring opening db:
>>      2021-12-21T17:34:06.908+0100 7f35765f7240  1 bluefs umount
>>      2021-12-21T17:34:06.908+0100 7f35765f7240  1 bdev(0x55fce5a1ac00 /var/lib/ceph/osd/ceph-7/block) close
>>      2021-12-21T17:34:07.072+0100 7f35765f7240  1 bdev(0x55fce5a1a800 /var/lib/ceph/osd/ceph-7/block) close
>> 
>> 
>> The cluster is not in production, so I can remove all corrupted pools and 
>> delete the OSDs. However, I would like to understand what went wrong in 
>> order to avoid such a situation in the future.
>> 
>> I will provide the OSD logs from the time around the server reboot at the 
>> following link: https://we.tl/t-fArHXTmSM7
>> 
>> Ceph version: 16.2.6
>> 
>> 
>> Thanks,
>> Sebastian
>> 
> 
> -- 
> Igor Fedotov
> Ceph Lead Developer
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> 

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
