Hi Alexander,

Thank you for your suggestion! All my nodes have ECC memory. However, I have 
now checked that it is recognized correctly on every system (dmesg | grep 
EDAC). Furthermore, I checked whether any errors have occurred, both with 
`edac-util` and by searching the logs of the mainboard BMCs. Everything looks 
perfectly fine, so I think we can now be sure that it is not a memory issue.
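
For reference, this is roughly what I checked on each node (exact invocations 
from memory):

```
# confirm the EDAC driver detected the memory controllers
dmesg | grep -i EDAC

# per-controller corrected/uncorrected error counts (all zero here)
edac-util --report=full

# the same counters straight from sysfs
grep . /sys/devices/system/edac/mc/mc*/[cu]e_count
```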

Thanks for reminding me to check memory!


Best regards,
Sebastian

 

> On 23.02.2022, at 02:03, Alexander E. Patrakov <patra...@gmail.com> wrote:
> 
> I have another suggestion: check the RAM, just in case, with memtest86
> or https://github.com/martinwhitaker/pcmemtest (which is a fork of
> memtest86+). Ignore the suggestion if you have ECC RAM.
> 
> Tue, 22 Feb 2022 at 15:45, Igor Fedotov <igor.fedo...@croit.io>:
>> 
>> Hi Sebastian,
>> 
>> On 2/22/2022 3:01 AM, Sebastian Mazza wrote:
>>> Hey Igor!
>>> 
>>> 
>>>> thanks a lot for the new logs - looks like they provide some insight.
>>> I'm glad the logs are helpful.
>>> 
>>> 
>>>> At this point I think the root cause is apparently a race between deferred 
>>>> write replay and some DB maintenance task happening on OSD startup. It 
>>>> seems that deferred write replay updates a block extent which 
>>>> RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros 
>>>> content. Evidently it's just a matter of chance whether they use a 
>>>> conflicting physical extent or not, hence the occasional nature of the 
>>>> issue...
>>> 
>>> Do I understand that correctly: the corruption of the RocksDB (table 
>>> overwritten by zeros) happens at the first start of the OSD after “*** 
>>> Immediate shutdown (osd_fast_shutdown=true) ***”? Before the system 
>>> launches the OSD service, the RocksDB is still fine?
>> Looks like that. From the logs I can see an unexpected write to a specific
>> extent (LBA 0x630000) which shouldn't occur and at which RocksDB
>> subsequently fails.
>>> 
>>> 
>>>> So first of all I'm curious if you have any particular write patterns that 
>>>> can be culprits? E.g. something like disk wiping procedure which writes 
>>>> all-zeros to an object followed by object truncate or removal comes to my 
>>>> mind. If you can identify something like that - could you please collect 
>>>> OSD log for such an operation (followed by OSD restart) with 
>>>> debug-bluestore set to 20?
>>> To the best of my knowledge the OSD was hardly doing anything, and I do not 
>>> see any pattern that would fit your explanation. However, you certainly 
>>> understand a lot more about this than I do, so I will try to explain 
>>> everything that could be relevant.
>>> 
>>> The cluster has 3 nodes. Each has a 240GB M.2 NVMe SSD as boot drive, which 
>>> should not be relevant. Each node has 3 OSDs: one on a 2TB U.2 NVMe SSD 
>>> and the other two on 12TB HDDs.
>>> 
>>> I have configured two crush rules, ‘c3nvme’ and ‘ec4x2hdd’. The ‘c3nvme’ is 
>>> a replicated rule that uses only OSDs with class ’nvme’. The second rule is 
>>> a tricky erasure rule: it selects exactly 2 OSDs on exactly 4 hosts with 
>>> class ‘hdd’, so it only works for a size of exactly 8. That means that a 
>>> pool using this rule always has only “undersized” placement groups, 
>>> since the cluster has only 3 nodes. (I did not add the fourth server after 
>>> the first crash in December, since we want to reproduce the problem.)
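>>> 
>>> For illustration, the erasure rule looks roughly like this in the 
>>> decompiled CRUSH map (rule id and the tunable steps written from memory, 
>>> so take it as a sketch):
>>> 
>>> ```
>>> rule ec4x2hdd {
>>>     id 2
>>>     type erasure
>>>     step set_chooseleaf_tries 5
>>>     step set_choose_tries 100
>>>     # restrict to HDD OSDs, pick 4 hosts, then 2 OSDs on each = 8 shards
>>>     step take default class hdd
>>>     step choose indep 4 type host
>>>     step choose indep 2 type osd
>>>     step emit
>>> }
>>> ```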
>>> 
>>> The pools device_health_metrics, test-pool, fs.metadata-root-pool, 
>>> fs.data-root-pool, fs.data-nvme.c-pool, and block-nvme.c-pool use the 
>>> crush rule ‘c3nvme’ with a size of 3 and a min size of 2. The pools 
>>> fs.data-hdd.ec-pool and block-hdd.ec-pool use the crush rule ‘ec4x2hdd’ 
>>> with k=5, m=3 and a min size of 6.
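>>> 
>>> If it helps, the EC profile and the EC data pool were created roughly like 
>>> this (pg_num and the profile name reproduced from memory, so they may 
>>> differ from what is actually on the cluster):
>>> 
>>> ```
>>> ceph osd erasure-code-profile set ec4x2hdd-profile k=5 m=3 crush-device-class=hdd
>>> ceph osd pool create fs.data-hdd.ec-pool 128 128 erasure ec4x2hdd-profile ec4x2hdd
>>> ceph osd pool set fs.data-hdd.ec-pool min_size 6
>>> ceph osd pool set fs.data-hdd.ec-pool allow_ec_overwrites true
>>> ```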
>>> 
>>> The pool fs.data-nvme.c-pool is not used, and the pool test-pool was used 
>>> for rados bench a few months ago.
>>> 
>>> The pool fs.metadata-root-pool is used as the metadata pool for the CephFS 
>>> and fs.data-root-pool as its root data pool. The pool 
>>> fs.data-hdd.ec-pool is an additional data pool for the CephFS and is 
>>> specified via ceph.dir.layout for some folders of the CephFS. The whole 
>>> CephFS is mounted by each of the 3 nodes.
>>> 
>>> The pool block-nvme.c-pool hosts two RBD images that are used as boot 
>>> drives for two VMs. The first VM runs Ubuntu Desktop and the second 
>>> Debian. The pool block-hdd.ec-pool hosts one RBD image (the data 
>>> part; metadata on block-nvme.c-pool) that is attached to the Debian VM as a 
>>> second drive formatted with BTRFS. Furthermore, the Debian VM mounts a 
>>> subdirectory of the CephFS that has the fs.data-hdd.ec-pool set as layout. 
>>> Both VMs were doing nothing, apart from being booted, in the last couple of 
>>> days.
>>> 
>>> I'll try to illustrate the pool usage as a tree:
>>> * c3nvme (replicated, size=3, min_size=2)
>>>     + device_health_metrics
>>>     + test-pool
>>>         - rados bench
>>>     + fs.metadata-root-pool
>>>         - CephFS (metadata)
>>>     + fs.data-root-pool
>>>         - CephFS (data root)
>>>     + fs.data-nvme.c-pool
>>>     + block-nvme.c-pool
>>>         - RBD (Ubuntu VM, boot disk with ext4)
>>>         - RBD (Debian VM, boot disk with ext4)
>>> * ec4x2hdd (ec, k=5, m=3, min_size=6)
>>>     + fs.data-hdd.ec-pool
>>>         - CephFS (layout data)
>>>     + block-hdd.ec-pool
>>>         - RBD (Debian VM, disk 2 with BTRFS)
>>> 
>>> 
>>> Last week I wrote about 30TB to the CephFS inside the 
>>> fs.data-hdd.ec-pool and around 50GB to the BTRFS volume on the 
>>> block-hdd.ec-pool. The 50GB written to BTRFS contains hundreds of thousands 
>>> of hard links. Nearly the same data was already on the storage and I 
>>> used rsync to update it. I think somewhere between 1 and 4 TB had changed 
>>> and was updated by rsync.
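>>> 
>>> The update was done with something along these lines (paths made up here; 
>>> -H is what carries over the hard links):
>>> 
>>> ```
>>> rsync -aH --delete /backup/source/ /mnt/target-volume/
>>> ```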
>>> 
>>> I think the cluster was totally unused on Friday, but up and running and 
>>> idling around. Then on Saturday I did a graceful shutdown of all cluster 
>>> nodes. Around 5 minutes later, when I booted the servers again, OSD.7 
>>> crashed. I copied the logs and exported the RocksDB. Then I deleted 
>>> everything from the HDD and deployed OSD.7 again. When I checked for 
>>> the first time today at around 12:00, Ceph had already finished 
>>> backfilling to OSD.7 and the cluster was idle again.
>>> 
>>> I then spent 70 minutes writing 3 small files (one of about 500 bytes 
>>> and two of about 2KB) to the CephFS on the fs.data-hdd.ec-pool and 
>>> shutting down the whole cluster. After I booted all 3 nodes again, I 
>>> overwrote the 3 small files (a few bytes changed) and shut down all 
>>> 3 nodes. I repeated that procedure 4 times within 70 minutes. Then OSD.7 
>>> crashed again.
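>>> 
>>> Per boot cycle the step was essentially just this (file names made up 
>>> here, the real ones don't matter), followed by a graceful shutdown and 
>>> power-on of all 3 nodes:
>>> 
>>> ```
>>> # append a timestamp so the 3 small files differ by a few bytes from the previous run
>>> date >> /mnt/cephfs/hdd-ec/file1.txt
>>> date >> /mnt/cephfs/hdd-ec/file2.txt
>>> date >> /mnt/cephfs/hdd-ec/file3.txt
>>> sync
>>> ```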
>>> 
>>> There may have been some more IO caused by mounting the RBD image with the 
>>> BTRFS volume inside the Debian VM, but I do not know. I can only say that 
>>> no user was using it. Only I was writing the 3 small files 4 times to the 
>>> CephFS on the fs.data-hdd.ec-pool.
>>> 
>>> Does this make any sense to you?
>> Well, that's actually not a write pattern I would expect to cause the
>> issue. But thanks a lot anyway. So could you please repeat that
>> procedure with overwriting the 3 small files and an OSD restart, and
>> collect the logs with debug-bluestore set to 20?
>>> 
>>>> with debug-bluestore set to 20
>>> Do you mean
>>> ```
>>> [osd]
>>>      debug bluestore = 20
>>> ```
>>> in ceph.conf?
>> Yes. Or you might use 'ceph config set osd.7 debug-bluestore 20' instead.
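>> 
>> E.g. something along these lines (and keeping debug-bluefs/debug-bdev at 20 
>> as in your last run won't hurt either):
>> 
>> ```
>> ceph config set osd.7 debug_bluestore 20
>> ceph config set osd.7 debug_bluefs 20
>> ceph config set osd.7 debug_bdev 20
>> ```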
>> 
>>> Thanks,
>>> Sebastian
>>> 
>>> 
>>> 
>>> 
>>>> On 21.02.2022, at 21:19, Igor Fedotov <igor.fedo...@croit.io> wrote:
>>>> 
>>>> Hey Sebastian,
>>>> 
>>>> thanks a lot for the new logs - looks like they provide some insight. At 
>>>> this point I think the root cause is apparently a race between deferred 
>>>> write replay and some DB maintenance task happening on OSD startup. It 
>>>> seems that deferred write replay updates a block extent which 
>>>> RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros 
>>>> content. Evidently it's just a matter of chance whether they use a 
>>>> conflicting physical extent or not, hence the occasional nature of the 
>>>> issue...
>>>> 
>>>> And now I'd like to determine what's wrong with this deferred write replay.
>>>> 
>>>> So first of all I'm curious if you have any particular write patterns that 
>>>> can be culprits? E.g. something like disk wiping procedure which writes 
>>>> all-zeros to an object followed by object truncate or removal comes to my 
>>>> mind. If you can identify something like that - could you please collect 
>>>> OSD log for such an operation (followed by OSD restart) with 
>>>> debug-bluestore set to 20?
>>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> Igor
>>>> 
>>>> On 2/21/2022 5:29 PM, Sebastian Mazza wrote:
>>>>> Hi Igor,
>>>>> 
>>>>> today (21-02-2022) at 13:49:28.452+0100, I crashed OSD 7 again. And 
>>>>> this time I have logs with “debug bluefs = 20” and “debug bdev = 20” for 
>>>>> every OSD in the cluster! It was the OSD with ID 7 again, so this HDD 
>>>>> has now failed for the third time! Coincidence? Probably not…
>>>>> The important thing seems to be that a shutdown, and not only a restart, 
>>>>> of the entire cluster is performed, since this time the OSD failed after 
>>>>> just 4 shutdowns of all nodes in the cluster within 70 minutes.
>>>>> 
>>>>> I redeployed the OSD.7 after the crash from 2 days ago. And I started 
>>>>> this new shutdown and boot series shortly after ceph had finished writing 
>>>>> everything back to OSD.7, earlier today.
>>>>> 
>>>>> The corrupted RocksDB file (crash) is again only 2KB in size.
>>>>> You can download the RocksDB file with the bad table magic number and 
>>>>> the log of OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ
>>>>> Is there anything else you need?
>>>>> 
>>>>> From the log of OSD.7:
>>>>> —————
>>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 
>>>>> /var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
>>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 
>>>>> /var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 
>>>>> 4096 ioc 0x55f0b8dbdd18 with 0 aios left
>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received  signal: Terminated 
>>>>> from /sbin/init  (PID: 1) UID: 0
>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal 
>>>>> Terminated ***
>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate 
>>>>> shutdown (osd_fast_shutdown=true) ***
>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 set uid:gid to 64045:64045 
>>>>> (ceph:ceph)
>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 ceph version 16.2.6 
>>>>> (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process 
>>>>> ceph-osd, pid 1967
>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 pidfile_write: ignore empty 
>>>>> --pid-file
>>>>> 2022-02-21T13:53:40.459+0100 7fc9645f4f00  1 bdev(0x55bd400a0800 
>>>>> /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
>>>>> —————
>>>>> 
>>>>> To me this looks like the OSD did nothing for nearly 2 minutes 
>>>>> before it received the termination request. Shouldn't that be enough time 
>>>>> to flush every imaginable write cache?
>>>>> 
>>>>> 
>>>>> I hope this helps you.
>>>>> 
>>>>> 
>>>>> Best wishes,
>>>>> Sebastian
>>>>> 
>>>> --
>>>> Igor Fedotov
>>>> Ceph Lead Developer
>>>> 
>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>> 
>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>> Com. register: Amtsgericht Munich HRB 231263
>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>> 
>> --
>> Igor Fedotov
>> Ceph Lead Developer
>> 
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>> 
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>> 
> 
> 
> 
> -- 
> Alexander E. Patrakov

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
