Hi Igor,

thank you very much for your explanation. I really appreciate it.

You were right, as always. :-)
There was not a single corrupted object. I ran `time ceph-bluestore-tool 
fsck --path /var/lib/ceph/osd/ceph-$X` and `time ceph-bluestore-tool fsck 
--path /var/lib/ceph/osd/ceph-$X --deep yes` for every HDD-based OSD, and every 
fsck printed “fsck success”. However, I observed something odd during the fsck: 
every OSD except the recovered one was reading at around 44 MByte/s, while the 
recovered OSD.4 was much faster at around 125 MByte/s.
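 
For reference, that check can be scripted as a small loop like this (just a 
sketch: the OSD id list is a placeholder, and each OSD daemon has to be stopped 
before fsck is run against its path):
```
HDD_OSDS="4 5 6 7"   # placeholder: substitute the actual list of HDD-based OSD ids
for X in $HDD_OSDS; do
    systemctl stop ceph-osd@$X    # fsck needs exclusive access to the OSD
    time ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-$X
    time ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-$X --deep yes
    systemctl start ceph-osd@$X
done
```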


Thanks,

Sebastian


> On 15.03.2022, at 09:52, Igor Fedotov <igor.fedo...@croit.io> wrote:
> 
> Hi Sebastian,
> 
> I don't think you have got tons of corrupted objects. The tricky thing about 
> the bug is that corruption might occur only if a new allocation happened during 
> a pretty short period: while the OSD is starting but hasn't applied deferred 
> writes yet. This mostly applies to BlueFS/RocksDB performing some housekeeping 
> on startup, e.g. recovery or compaction.
> 
> Additionally, there is a questionable period shortly before shutdown when a 
> similar issue is apparently possible. But that's just speculation so far.
> 
> So IMO it is highly likely there are no corrupted objects in your cluster at 
> all. You can run fsck with the deep flag on a per-OSD basis to check that, 
> though - it will verify object content using checksums. Expect it to take some 
> time...
> 
> 
> Thanks,
> 
> Igor
> 
> On 3/14/2022 9:33 PM, Sebastian Mazza wrote:
>> Hi Igor,
>> 
>> great that you were able to reproduce it!
>> 
>> I read your comments on issue #54547. Am I right that I probably 
>> have hundreds of corrupted objects on my EC pools (cephFS and RBD)? But I 
>> only ever noticed it when a RocksDB was damaged. A deep scrub should find the 
>> other errors, right? But the PGs of my EC pools were never scrubbed because 
>> they are all “undersized” and “degraded”. Is it possible to scrub an 
>> “undersized” / “degraded” PG? I’m not able to (deep) scrub any of the PGs on 
>> my EC pools. I already set osd_scrub_during_recovery = true, but this 
>> does not help. Scrubbing and deep scrubbing work as expected on the 
>> replicated pools.
>> 
>> 
>> Thank you very much for helping me with my problem so patiently over the 
>> last 3 months!
>> 
>> 
>> Thanks,
>> Sebastian
>> 
>> 
>>> On 14.03.2022, at 17:07, Igor Fedotov <igor.fedo...@croit.io> wrote:
>>> 
>>> Hi Sebastian,
>>> 
>>> the proper parameter name is 'osd fast shutdown'.
>>> 
>>> As with any other OSD config parameter one can use either ceph.conf or 
>>> 'ceph config set osd.N osd_fast_shutdown false' command to adjust it.
>>> 
>>> I'd recommend the latter form.
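>>> 
>>> For example (just a sketch; osd.7 is only an example id, and the setting can 
>>> also be applied to all OSDs at once):
>>> 
>>> ceph config set osd.7 osd_fast_shutdown false
>>> ceph config set osd osd_fast_shutdown false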
>>> 
>>> And yeah, from my last experiments it looks like setting it to false is a 
>>> workaround indeed...
>>> 
>>> I've managed to reproduce the bug locally and it doesn't occur without fast 
>>> shutdown.
>>> 
>>> Thanks for all the information you shared, highly appreciated!
>>> 
>>> FYI: I've created another ticket (I missed that I already had one): 
>>> https://tracker.ceph.com/issues/54547
>>> 
>>> So please expect all the related news there.
>>> 
>>> 
>>> Kind regards,
>>> 
>>> Igor
>>> 
>>> 
>>> 
>>> On 3/14/2022 5:54 PM, Sebastian Mazza wrote:
>>>> Hello Igor,
>>>> 
>>>> I'm glad I could be of help. Thank you for your explanation!
>>>> 
>>>>> And I was right: this is related to the deferred write procedure and 
>>>>> apparently the fast shutdown mode.
>>>> Does that mean I can prevent the error in the meantime, before you can fix 
>>>> the root cause, by disabling osd_fast_shutdown?
>>>> Can it be disabled by the following statements in the ceph.conf?
>>>> ```
>>>> [osd]
>>>>    fast shutdown = false
>>>> ```
>>>> (I wasn’t able to find any documentation for “osd_fast_shutdown”, e.g.: 
>>>> https://docs.ceph.com/en/pacific/rados/configuration/osd-config-ref/ does 
>>>> not contain “osd_fast_shutdown”)
>>>> 
>>>> 
>>>>> 1) From the log I can see you're using RBD over EC pool for the repro, 
>>>>> right? What's the EC profile?
>>>> Yes, one of the two EC pools contains one RBD image and the other EC pool 
>>>> is used for cephFS.
>>>> 
>>>> Rules from my crush map:
>>>> ```
>>>> rule ec4x2hdd {
>>>>    id 0
>>>>    type erasure
>>>>    min_size 8
>>>>    max_size 8
>>>>    step set_chooseleaf_tries 5
>>>>    step set_choose_tries 100
>>>>    step take default class hdd
>>>>    step choose indep 4 type host
>>>>    step choose indep 2 type osd
>>>>    step emit
>>>> }
>>>> rule c3nvme {
>>>>    id 1
>>>>    type replicated
>>>>    min_size 1
>>>>    max_size 10
>>>>    step take default class nvme
>>>>    step chooseleaf firstn 0 type host
>>>>    step emit
>>>> }
>>>> ```
>>>> The EC rule requires 4 hosts, but as I have already mentioned, the cluster 
>>>> only has 3 nodes, since I never added the fourth node after the problem 
>>>> occurred the first time.
>>>> 
>>>> I created the EC profile by:
>>>> ```
>>>> $ ceph osd erasure-code-profile set ec4x2hdd-profile \
>>>>    k=5 \
>>>>    m=3 \
>>>>    crush-root=ec4x2hdd
>>>> ```
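>>>> 
>>>> (For reference, the resulting profile, including the defaults that were 
>>>> picked up, e.g. the plugin and the failure domain, can be inspected with:)
>>>> ```
>>>> $ ceph osd erasure-code-profile get ec4x2hdd-profile
>>>> ```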
>>>> 
>>>> And the EC pools by:
>>>> ```
>>>> $ ceph osd pool create fs.data-hdd.ec-pool 64 64 erasure ec4x2hdd-profile 
>>>> ec4x2hdd warn 36000000000000     # 36TB
>>>> $ ceph osd pool create block-hdd.ec-pool 16 16 erasure ec4x2hdd-profile 
>>>> ec4x2hdd warn 1000000000000        #  1TB
>>>> ```
>>>> 
>>>> Replicated pools on NVMes
>>>> ```
>>>> ceph osd pool create block-nvme.c-pool 32 32 replicated c3nvme 0 warn 
>>>> 549755813888 # 512 GB
>>>> ceph osd pool create fs.metadata-root-pool 8 8 replicated c3nvme 0 warn 
>>>> 4294967296 # 4 GB
>>>> ceph osd pool create fs.data-root-pool 16 16 replicated c3nvme 0 warn 
>>>> 137438953472 # 128 GB
>>>> ```
>>>> 
>>>> enable overwrite support:
>>>> ```
>>>> $ ceph osd pool set fs.data-hdd.ec-pool allow_ec_overwrites true
>>>> $ ceph osd pool set block-hdd.ec-pool allow_ec_overwrites true
>>>> ```
>>>> 
>>>> Set Application:
>>>> ```
>>>> $ ceph osd pool application enable fs.data-hdd.ec-pool cephfs
>>>> $ ceph osd pool application enable "block-hdd.ec-pool" rbd
>>>> ```
>>>> 
>>>> Use the rbd tool to initialize the pool for use by RBD:
>>>> ```
>>>> $ rbd pool init "block-nvme.c-pool"
>>>> $ rbd pool init "block-hdd.ec-pool"
>>>> ```
>>>> 
>>>> Create RBD image:
>>>> ```
>>>> $ rbd --pool "block-nvme.c-pool" create vm-300-disk-1 --data-pool 
>>>> "block-hdd.ec-pool" --size 50G
>>>> ```
>>>> 
>>>> Create cephFS:
>>>> ```
>>>> $ ceph osd pool create fs.metadata-root-pool 8 8 replicated c3nvme 0 warn 
>>>> 4294967296       # 4 GB
>>>> $ ceph osd pool create fs.data-root-pool 16 16 replicated c3nvme 0 warn 
>>>> 137438953472       # 128 GB
>>>> $ ceph fs new cephfs fs.metadata-root-pool fs.data-root-pool
>>>> ```
>>>> 
>>>> Add the EC pool to the cephFS:
>>>> ```
>>>> $ ceph fs add_data_pool cephfs fs.data-hdd.ec-pool
>>>> ```
>>>> 
>>>> Set the layout of the directory “shares” so that files that are created 
>>>> inside that folder use the EC pool:
>>>> ```
>>>> $ setfattr -n ceph.dir.layout.pool -v fs.data-hdd.ec-pool 
>>>> /mnt/cephfs-olymp/shares/
>>>> ```
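>>>> 
>>>> (The layout that is actually in effect on the directory can be checked 
>>>> afterwards with getfattr, e.g.:)
>>>> ```
>>>> $ getfattr -n ceph.dir.layout /mnt/cephfs-olymp/shares/
>>>> ```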
>>>> 
>>>> 
>>>> 
>>>>> 2) How did you generate the load during the last experiments? Was it some 
>>>>> benchmarking tool or any other artificial load generator? If so, could you 
>>>>> share job descriptions or scripts, if any?
>>>> I did not use a benchmarking tool or any other artificial load generator. 
>>>> The RBD image “vm-300-disk-1” is formatted as BTRFS. I used rsync to copy 
>>>> daily website backups to the volume. The backups are versioned and go back 
>>>> 10 days. When a new version is created, a new folder is added and all files 
>>>> that did not change are hard linked to the file in the folder of the 
>>>> previous day. A new or updated file is copied from the production server.
>>>> This backup procedure is not done on the ceph cluster; it is done on 
>>>> another server, and I just copied the web backup directory of that server to 
>>>> the RBD image. However, I used rsync with the "-H” flag to copy the data 
>>>> from the “productive” backup server to the RBD image, so the hard 
>>>> links are preserved.
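>>>> 
>>>> Roughly like this (just a sketch; the host and paths are placeholders, the 
>>>> relevant part is the -H flag, which preserves the hard links):
>>>> ```
>>>> # -a: archive mode, -H: preserve hard links between the daily backup folders
>>>> rsync -aH backupserver:/srv/web-backups/ /mnt/vm-300-disk-1/web-backups/
>>>> ```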
>>>> 
>>>> I copied shared group folders and Apple Time Machine backups to CephFS, 
>>>> also with rsync from another server. The shared user folder contains 
>>>> typical office data: text documents, CSV files, JPEG images, etc. The Apple 
>>>> Time Machine backups are basically folders with lots of 8 MB or 32 MB files 
>>>> that contain encrypted data.
>>>> 
>>>> 
>>>> In order to reproduce the RocksDB corruption I started the rsync process 
>>>> whenever I found some time. I waited until rsync had found at least some 
>>>> files that had changed and copied them to the RBD image or cephFS. Then I 
>>>> triggered a shutdown on all 3 cluster nodes.
>>>> 
>>>> However, to the best of my knowledge, there was no IO at all between the 
>>>> last 3 restarts of the OSDs, except the IO caused by mounting the cephFS 
>>>> on 4 systems and the BTRFS file system on the RBD image on one system.
>>>> 
>>>> 
>>>> If you want to create a minimal example that is able to trigger the error, 
>>>> I would suggest creating an RBD image on an EC pool with k=5, m=3 on a 
>>>> cluster with only 6 HDD-based OSDs. Format the RBD image with BTRFS and 
>>>> start creating and deleting hard links on it. I don’t know if this is 
>>>> really the trigger, but it is currently my best idea.
>>>> 
>>>> 
>>>> Best wishes,
>>>> Sebastian
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 14.03.2022, at 12:49, Igor Fedotov <igor.fedo...@croit.io> wrote:
>>>>> 
>>>>> Hi Sebastian,
>>>>> 
>>>>> this is super helpful indeed. And I was right: this is related to the 
>>>>> deferred write procedure and apparently the fast shutdown mode. I'm still 
>>>>> in the process of obtaining more details on the root cause and looking for 
>>>>> a way to fix (and preferably reproduce locally) the issue.
>>>>> 
>>>>> Some additional questions on your setup meanwhile. Sorry if some have 
>>>>> been asked before...
>>>>> 
>>>>> 1) From the log I can see you're using RBD over EC pool for the repro, 
>>>>> right? What's the EC profile?
>>>>> 
>>>>> 2) How did you generate the load during the last experiments? Was it some 
>>>>> benchmarking tool or any other artificial load generator? If so, could you 
>>>>> share job descriptions or scripts, if any?
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Igor
>>>>> 
>>>>> 
>>>>> On 3/10/2022 10:36 PM, Sebastian Mazza wrote:
>>>>>> Hi Igor!
>>>>>> 
>>>>>> I hope I've hit the jackpot now. I have logs with osd debug level 
>>>>>> 20 for bluefs, bdev, and bluestore. The log file ceph-osd.4.log shows 2 
>>>>>> consecutive startups of osd.4, where the second startup results in:
>>>>>> ```
>>>>>> rocksdb: Corruption: Bad table magic number: expected 
>>>>>> 9863518390377041911, found 0 in db/002193.sst
>>>>>> ```
>>>>>> 
>>>>>> The log ceph-osd.4.log shows the last two starts of the OSD. However, I 
>>>>>> am also sending you the previous log, which contains the “*** Immediate 
>>>>>> shutdown (osd_fast_shutdown=true) ***” that happened before the last 
>>>>>> successful OSD startup as its last line. It is the last line because the 
>>>>>> server was down for about 39 hours. Based on what you explained to me, 
>>>>>> the corruption of the RocksDB must have occurred during the OSD startup 
>>>>>> that begins at the first line of ceph-osd.4.log.
>>>>>> 
>>>>>> Logs and the corrupted RocksDB file: https://we.tl/t-yVsF3akxqX
>>>>>> 
>>>>>> Do the logs contain what you need?
>>>>>> Please tell me if you need more data from the OSD. If not, I will rebuild 
>>>>>> it.
>>>>>> 
>>>>>> 
>>>>>> Best wishes,
>>>>>> Sebastian
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 26.02.2022, at 00:12, Igor Fedotov <igor.fedo...@croit.io> wrote:
>>>>>>> 
>>>>>>> Sebastian,
>>>>>>> 
>>>>>>> On 2/25/2022 7:17 PM, Sebastian Mazza wrote:
>>>>>>>> Hi Igor,
>>>>>>>> 
>>>>>>>>> Unfortunately the last logs (from 24-02-2022) you shared don't 
>>>>>>>>> include the point where the actual corruption happened, as the 
>>>>>>>>> previous log did. These new logs miss the last successful OSD start 
>>>>>>>>> which apparently corrupts the DB data. Do you have any output prior to 
>>>>>>>>> their content?
>>>>>>>> I have logs from the last successful startup of all OSDs, since this 
>>>>>>>> was the cluster boot where OSD.7 failed (21-02-2022 at 
>>>>>>>> 13:49:28.452+0100). That also means that those logs do not include the 
>>>>>>>> bluestore debug info, only bluefs and bdev. I will include all logs 
>>>>>>>> that I have for the crashed OSDs 5 and 6. The last successful startup 
>>>>>>>> is inside the files that end with “4”. While the startups are 
>>>>>>>> collected with bluefs and bdev debug info, the logs between the 
>>>>>>>> startup (21-02-2022 at 13:49:28.452+0100) and the crash 
>>>>>>>> (2022-02-24T00:04:57.874+0100) were collected without any debug 
>>>>>>>> logging, since during that time I was rebuilding OSD 7.
>>>>>>>> 
>>>>>>>> Logs: https://we.tl/t-x0la8gZl0S
>>>>>>> Unfortunately they're of little help without verbose output :(
>>>>>>>> Since I have to manually rebuild the cluster, pools, cephFS, and 
>>>>>>>> RBD images now, it will take several days before I can try to 
>>>>>>>> reproduce the problem.
>>>>>>>> 
>>>>>>>> Should, or could, I enable more verbose logging than
>>>>>>>> ```
>>>>>>>> [osd]
>>>>>>>>         debug bluefs = 20
>>>>>>>>         debug bdev = 20
>>>>>>>>         debug bluestore = 20
>>>>>>>> ```
>>>>>>>> in ceph.conf? Or in some other way?
>>>>>>> This can be done either through ceph.conf or using the "ceph config 
>>>>>>> set" CLI command.
>>>>>>> 
>>>>>>> To make the resulting log smaller you can use slightly less verbose 
>>>>>>> logging levels:
>>>>>>> 
>>>>>>> debug bluefs = 10
>>>>>>> 
>>>>>>> debug bdev = 5
>>>>>>> 
>>>>>>> debug bluestore = 20
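>>>>>>> 
>>>>>>> For example, the same levels applied via the CLI to a single OSD (just a 
>>>>>>> sketch; osd.7 is used as an example id):
>>>>>>> 
>>>>>>> ceph config set osd.7 debug_bluefs 10
>>>>>>> ceph config set osd.7 debug_bdev 5
>>>>>>> ceph config set osd.7 debug_bluestore 20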
>>>>>>> 
>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Sebastian
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 25.02.2022, at 16:18, Igor Fedotov <igor.fedo...@croit.io> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Sebastian,
>>>>>>>>> 
>>>>>>>>> I submitted a ticket, https://tracker.ceph.com/issues/54409, which 
>>>>>>>>> shows my analysis based on your previous log (from 21-02-2022), which 
>>>>>>>>> wasn't verbose enough at the debug-bluestore level to draw a final 
>>>>>>>>> conclusion.
>>>>>>>>> 
>>>>>>>>> Unfortunately the last logs (from 24-02-2022) you shared don't 
>>>>>>>>> include the point where the actual corruption happened, as the 
>>>>>>>>> previous log did. These new logs miss the last successful OSD start 
>>>>>>>>> which apparently corrupts the DB data. Do you have any output prior to 
>>>>>>>>> their content?
>>>>>>>>> 
>>>>>>>>> If not, could you please reproduce that once again? Generally I'd 
>>>>>>>>> like to see the OSD log for a broken startup along with a couple of 
>>>>>>>>> restarts before it - the event sequence for the failure seems to be as 
>>>>>>>>> follows:
>>>>>>>>> 
>>>>>>>>> 1) The OSD is shut down for the first time. It (for unclear reasons) 
>>>>>>>>> keeps a set of deferred writes to be applied once again.
>>>>>>>>> 
>>>>>>>>> 2) The OSD is started up, which triggers deferred write submissions. 
>>>>>>>>> They overlap (again for reasons unclear so far) with DB content 
>>>>>>>>> written shortly before. The OSD starts properly, but DB data 
>>>>>>>>> corruption has happened at this point.
>>>>>>>>> 
>>>>>>>>> 3) The OSD is restarted again, which reveals the data corruption, and 
>>>>>>>>> from that point on the OSD is unable to start.
>>>>>>>>> 
>>>>>>>>> So these last new logs include only 3) for now, while I need 1) & 2) 
>>>>>>>>> as well...
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Igor
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 2/24/2022 3:04 AM, Sebastian Mazza wrote:
>>>>>>>>>> Hi Igor,
>>>>>>>>>> 
>>>>>>>>>> I let ceph rebuild the OSD.7. Then I added
>>>>>>>>>> ```
>>>>>>>>>> [osd]
>>>>>>>>>>      debug bluefs = 20
>>>>>>>>>>         debug bdev = 20
>>>>>>>>>>         debug bluestore = 20
>>>>>>>>>> ```
>>>>>>>>>> to the ceph.conf of all 3 nodes and shut down all 3 nodes without 
>>>>>>>>>> writing anything to the pools on the HDDs (the Debian VM was not 
>>>>>>>>>> even running).
>>>>>>>>>> Immediately at the first boot, OSD.5 and 6 crashed with the same “Bad 
>>>>>>>>>> table magic number” error. OSDs 5 and 6 are on the same node, 
>>>>>>>>>> but not on the node of OSD 7, which crashed the last two times.
>>>>>>>>>> 
>>>>>>>>>> Logs and corrupted RocksDB files: https://we.tl/t-ZBXYp8r4Hq
>>>>>>>>>> I have saved the entire /var/log directory of every node and the 
>>>>>>>>>> result of
>>>>>>>>>> ```
>>>>>>>>>> $ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-5 
>>>>>>>>>> --out-dir /tmp/osd.5-data
>>>>>>>>>> $ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-6 
>>>>>>>>>> --out-dir /tmp/osd.6-data
>>>>>>>>>> ```
>>>>>>>>>> Let me know if you need something else.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I hope you can now track it down. I'm really looking forward to your 
>>>>>>>>>> interpretation of the logs.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Best Regards,
>>>>>>>>>> Sebastian
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 22.02.2022, at 11:44, Igor Fedotov <igor.fedo...@croit.io> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>> 
>>>>>>>>>>> On 2/22/2022 3:01 AM, Sebastian Mazza wrote:
>>>>>>>>>>>> Hey Igor!
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> thanks a lot for the new logs - looks like they provide some 
>>>>>>>>>>>>> insight.
>>>>>>>>>>>> I'm glad the logs are helpful.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> At this point I think the root cause is apparently a race between 
>>>>>>>>>>>>> deferred writes replay and some DB maintenance task happening on 
>>>>>>>>>>>>> OSD startup. It seems that deferred write replay updates a block 
>>>>>>>>>>>>> extent which RocksDB/BlueFS are using. Hence the target BlueFS 
>>>>>>>>>>>>> file gets all-zeros content. Evidently it's just a matter of 
>>>>>>>>>>>>> chance whether they use a conflicting physical extent or not, 
>>>>>>>>>>>>> hence the occasional nature of the issue...
>>>>>>>>>>>> Do I understand that correctly: the corruption of the RocksDB 
>>>>>>>>>>>> (table overwritten by zeros) happens at the first start of the OSD 
>>>>>>>>>>>> after “*** Immediate shutdown (osd_fast_shutdown=true) ***”? Before 
>>>>>>>>>>>> the system launches the OSD service, the RocksDB is still fine?
>>>>>>>>>>> Looks like that. From the logs I can see an unexpected write to a 
>>>>>>>>>>> specific extent (LBA 0x630000) which shouldn't occur and at which 
>>>>>>>>>>> RocksDB subsequently fails.
>>>>>>>>>>>>> So first of all I'm curious if you have any particular write 
>>>>>>>>>>>>> patterns that can be culprits? E.g. something like disk wiping 
>>>>>>>>>>>>> procedure which writes all-zeros to an object followed by object 
>>>>>>>>>>>>> truncate or removal comes to my mind. If you can identify 
>>>>>>>>>>>>> something like that - could you please collect OSD log for such 
>>>>>>>>>>>>> an operation (followed by OSD restart) with debug-bluestore set 
>>>>>>>>>>>>> to 20?
>>>>>>>>>>>> To the best of my knowledge the OSD was hardly doing anything, and 
>>>>>>>>>>>> I do not see any pattern that would fit your explanation.
>>>>>>>>>>>> However, you certainly understand a lot more about it than I do, 
>>>>>>>>>>>> so I will try to explain everything that could be relevant.
>>>>>>>>>>>> 
>>>>>>>>>>>> The cluster has 3 nodes. Each has a 240GB NVMe M.2 SSD as a boot 
>>>>>>>>>>>> drive, which should not be relevant. Each node has 3 OSDs: one is 
>>>>>>>>>>>> on a U.2 NVMe SSD with 2TB and the other two are on 12TB HDDs.
>>>>>>>>>>>> 
>>>>>>>>>>>> I have configured two crush rules, ‘c3nvme’ and ‘ec4x2hdd’. 
>>>>>>>>>>>> ‘c3nvme’ is a replicated rule that uses only OSDs with class 
>>>>>>>>>>>> ‘nvme’. The second rule is a tricky erasure rule. It selects 
>>>>>>>>>>>> exactly 2 OSDs on exactly 4 hosts with class ‘hdd’, so it only 
>>>>>>>>>>>> works for a size of exactly 8. That means that a pool that uses 
>>>>>>>>>>>> this rule always has only “undersized” placement groups, since the 
>>>>>>>>>>>> cluster has only 3 nodes. (I did not add the fourth server after 
>>>>>>>>>>>> the first crash in December, since we want to reproduce the 
>>>>>>>>>>>> problem.)
>>>>>>>>>>>> 
>>>>>>>>>>>> The pools device_health_metrics, test-pool, fs.metadata-root-pool, 
>>>>>>>>>>>> fs.data-root-pool, fs.data-nvme.c-pool, and block-nvme.c-pool use 
>>>>>>>>>>>> the crush rule ‘c3nvme’ with a size of 3 and a min size of 2. The 
>>>>>>>>>>>> pools fs.data-hdd.ec-pool and block-hdd.ec-pool use the crush rule 
>>>>>>>>>>>> ‘ec4x2hdd’ with k=5, m=3 and a min size of 6.
>>>>>>>>>>>> 
>>>>>>>>>>>> The pool fs.data-nvme.c-pool is not used, and the pool test-pool 
>>>>>>>>>>>> was used for rados bench a few months ago.
>>>>>>>>>>>> 
>>>>>>>>>>>> The pool fs.metadata-root-pool is used as metadata pool for cephFS 
>>>>>>>>>>>> and fs.data-root-pool as the root data pool for the cephFS. The 
>>>>>>>>>>>> pool fs.data-hdd.ec-pool is an additional data pool for the cephFS 
>>>>>>>>>>>> and is specified as ceph.dir.layout for some folders of the 
>>>>>>>>>>>> cephFS. The whole cephFS is mounted by each of the 3 nodes.
>>>>>>>>>>>> 
>>>>>>>>>>>> The pool block-nvme.c-pool hosts two RBD images that are used as 
>>>>>>>>>>>> boot drives for two VMs. The first VM runs Ubuntu Desktop and the 
>>>>>>>>>>>> second runs Debian. The pool block-hdd.ec-pool hosts one RBD image 
>>>>>>>>>>>> (the data part; metadata on block-nvme.c-pool) that is attached to 
>>>>>>>>>>>> the Debian VM as a second drive formatted with BTRFS. Furthermore, 
>>>>>>>>>>>> the Debian VM mounts a subdirectory of the cephFS that has 
>>>>>>>>>>>> fs.data-hdd.ec-pool set as its layout. Both VMs were doing nothing, 
>>>>>>>>>>>> except being booted, in the last couple of days.
>>>>>>>>>>>> 
>>>>>>>>>>>> I try to illustrate the pool usage as a tree:
>>>>>>>>>>>> * c3nvme (replicated, size=3, min_size=2)
>>>>>>>>>>>>     + device_health_metrics
>>>>>>>>>>>>     + test-pool
>>>>>>>>>>>>         - rados bench
>>>>>>>>>>>>     + fs.metadata-root-pool
>>>>>>>>>>>>         - CephFS (metadata)
>>>>>>>>>>>>     + fs.data-root-pool
>>>>>>>>>>>>         - CephFS (data root)
>>>>>>>>>>>>     + fs.data-nvme.c-pool
>>>>>>>>>>>>     + block-nvme.c-pool
>>>>>>>>>>>>         - RBD (Ubuntu VM, boot disk with ext4)
>>>>>>>>>>>>         - RBD (Debian VM, boot disk with ext4)
>>>>>>>>>>>> * ec4x2hdd (ec,  k=5, m=3, min_size=6)
>>>>>>>>>>>>     + fs.data-hdd.ec-pool
>>>>>>>>>>>>         - CephFS (layout data)
>>>>>>>>>>>>     + block-hdd.ec-pool
>>>>>>>>>>>>         - RBD (Debian VM, disk 2 with BTRFS)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Last week I wrote about 30TB to the CephFS on the 
>>>>>>>>>>>> fs.data-hdd.ec-pool and around 50GB to the BTRFS volume on the 
>>>>>>>>>>>> block-hdd.ec-pool. The 50GB written to BTRFS contains hundreds of 
>>>>>>>>>>>> thousands of hard links. Nearly the same data was already on the 
>>>>>>>>>>>> storage and I used rsync to update it. I think somewhere between 
>>>>>>>>>>>> 1 and 4 TB had changed and was updated by rsync.
>>>>>>>>>>>> 
>>>>>>>>>>>> I think the cluster was totally unused on Friday, but up and 
>>>>>>>>>>>> running and idling around. Then on Saturday I did a graceful 
>>>>>>>>>>>> shutdown of all cluster nodes. Around 5 minutes later, when I 
>>>>>>>>>>>> booted the servers again, OSD.7 crashed. I copied the logs and 
>>>>>>>>>>>> exported the RocksDB. Then I deleted everything from the HDD and 
>>>>>>>>>>>> deployed OSD.7 again. When I checked for the first time today at 
>>>>>>>>>>>> around 12:00, ceph had already finished backfilling to OSD.7 and 
>>>>>>>>>>>> the cluster was idle again.
>>>>>>>>>>>> 
>>>>>>>>>>>> I then spent 70 minutes writing 3 small files (one of about 
>>>>>>>>>>>> 500 bytes and two of about 2KB) to the cephFS on the 
>>>>>>>>>>>> fs.data-hdd.ec-pool and shutting down the whole cluster. After I 
>>>>>>>>>>>> booted all 3 nodes again I overwrote the 3 small files (some bytes 
>>>>>>>>>>>> had changed) and shut down all 3 nodes. I repeated that procedure 
>>>>>>>>>>>> 4 times within 70 minutes. Then OSD 7 crashed again.
>>>>>>>>>>>> 
>>>>>>>>>>>> There may have been some more IO caused by mounting the RBD images 
>>>>>>>>>>>> with the BTRFS volume inside the Debian VM, but I do not know. I 
>>>>>>>>>>>> can only say that no user was using it. Only I wrote the 3 small 
>>>>>>>>>>>> files 4 times to the cephFS on the fs.data-hdd.ec-pool.
>>>>>>>>>>>> 
>>>>>>>>>>>> Does this make any sense to you?
>>>>>>>>>>> Well, that's actually not a write pattern I would expect to cause 
>>>>>>>>>>> the issue. But thanks a lot anyway. So could you please repeat that 
>>>>>>>>>>> procedure of overwriting the 3 small files and restarting the OSD, 
>>>>>>>>>>> and collect the logs with debug-bluestore set to 20?
>>>>>>>>>>>>> with debug-bluestore set to 20
>>>>>>>>>>>> Do you mean
>>>>>>>>>>>> ```
>>>>>>>>>>>> [osd]
>>>>>>>>>>>>    debug bluestore = 20
>>>>>>>>>>>> ```
>>>>>>>>>>>> in ceph.conf?
>>>>>>>>>>> yes.  Or you might use 'ceph config set osd.7 debug-bluestore 20' 
>>>>>>>>>>> instead
>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Sebastian
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 21.02.2022, at 21:19, Igor Fedotov <igor.fedo...@croit.io> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hey Sebastian,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> thanks a lot for the new logs - looks like they provide some 
>>>>>>>>>>>>> insight. At this point I think the root cause is apparently a 
>>>>>>>>>>>>> race between deferred write replay and some DB maintenance task 
>>>>>>>>>>>>> happening on OSD startup. It seems that deferred write replay 
>>>>>>>>>>>>> updates a block extent which RocksDB/BlueFS are using, hence the 
>>>>>>>>>>>>> target BlueFS file gets all-zeros content. Evidently it's just a 
>>>>>>>>>>>>> matter of chance whether they use a conflicting physical extent 
>>>>>>>>>>>>> or not, hence the occasional nature of the issue...
>>>>>>>>>>>>> 
>>>>>>>>>>>>> And now I'd like to determine what's wrong with this deferred 
>>>>>>>>>>>>> write replay.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So first of all I'm curious if you have any particular write 
>>>>>>>>>>>>> patterns that can be culprits? E.g. something like disk wiping 
>>>>>>>>>>>>> procedure which writes all-zeros to an object followed by object 
>>>>>>>>>>>>> truncate or removal comes to my mind. If you can identify 
>>>>>>>>>>>>> something like that - could you please collect OSD log for such 
>>>>>>>>>>>>> an operation (followed by OSD restart) with debug-bluestore set 
>>>>>>>>>>>>> to 20?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Igor
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 2/21/2022 5:29 PM, Sebastian Mazza wrote:
>>>>>>>>>>>>>> Hi Igor,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> today (21-02-2022) at 13:49:28.452+0100, I crashed OSD 7 again. 
>>>>>>>>>>>>>> And this time I have logs with “debug bluefs = 20” and 
>>>>>>>>>>>>>> “debug bdev = 20” for every OSD in the cluster! It was the OSD 
>>>>>>>>>>>>>> with ID 7 again, so this HDD has now failed for the third time! 
>>>>>>>>>>>>>> Coincidence? Probably not…
>>>>>>>>>>>>>> The important thing seems to be that a shutdown, and not only a 
>>>>>>>>>>>>>> restart, of the entire cluster is performed, since this time the 
>>>>>>>>>>>>>> OSD failed after just 4 shutdowns of all nodes in the cluster 
>>>>>>>>>>>>>> within 70 minutes.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I redeployed the OSD.7 after the crash from 2 days ago. And I 
>>>>>>>>>>>>>> started this new shutdown and boot series shortly after ceph had 
>>>>>>>>>>>>>> finished writing everything back to OSD.7, earlier today.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The corrupted RocksDB file (crash) is again only 2KB in size.
>>>>>>>>>>>>>> You can download the RocksDB file with the bad table magic number 
>>>>>>>>>>>>>> and the log of OSD.7 via this link: 
>>>>>>>>>>>>>> https://we.tl/t-e0NqjpSmaQ
>>>>>>>>>>>>>> What else do you need?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> From the log of the OSD.7:
>>>>>>>>>>>>>> —————
>>>>>>>>>>>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 
>>>>>>>>>>>>>> /var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
>>>>>>>>>>>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 
>>>>>>>>>>>>>> /var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 
>>>>>>>>>>>>>> 0x55f0b8c7c910 r 4096 ioc 0x55f0b8dbdd18 with 0 aios left
>>>>>>>>>>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received  signal: 
>>>>>>>>>>>>>> Terminated from /sbin/init  (PID: 1) UID: 0
>>>>>>>>>>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got 
>>>>>>>>>>>>>> signal Terminated ***
>>>>>>>>>>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** 
>>>>>>>>>>>>>> Immediate shutdown (osd_fast_shutdown=true) ***
>>>>>>>>>>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 set uid:gid to 
>>>>>>>>>>>>>> 64045:64045 (ceph:ceph)
>>>>>>>>>>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 ceph version 16.2.6 
>>>>>>>>>>>>>> (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), 
>>>>>>>>>>>>>> process ceph-osd, pid 1967
>>>>>>>>>>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 pidfile_write: 
>>>>>>>>>>>>>> ignore empty --pid-file
>>>>>>>>>>>>>> 2022-02-21T13:53:40.459+0100 7fc9645f4f00  1 bdev(0x55bd400a0800 
>>>>>>>>>>>>>> /var/lib/ceph/osd/ceph-7/block) open path 
>>>>>>>>>>>>>> /var/lib/ceph/osd/ceph-7/block
>>>>>>>>>>>>>> —————
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> To me this looks like the OSD did nothing for nearly 2 minutes 
>>>>>>>>>>>>>> before it received the termination request. Shouldn't that be 
>>>>>>>>>>>>>> enough time to flush every imaginable write cache?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I hope this helps you.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> Igor Fedotov
>>>>>>>>>>>>> Ceph Lead Developer
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Looking for help with your Ceph cluster? Contact us at 
>>>>>>>>>>>>> https://croit.io
>>>>>>>>>>>>> 
>>>>>>>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>>>>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>>>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>>>>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>>>>>>>> -- 
>>>>>>>>>>> Igor Fedotov
>>>>>>>>>>> Ceph Lead Developer
>>>>>>>>>>> 
>>>>>>>>>>> Looking for help with your Ceph cluster? Contact us at 
>>>>>>>>>>> https://croit.io
>>>>>>>>>>> 
>>>>>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>>>>>> -- 
>>>>>>>>> Igor Fedotov
>>>>>>>>> Ceph Lead Developer
>>>>>>>>> 
>>>>>>>>> Looking for help with your Ceph cluster? Contact us at 
>>>>>>>>> https://croit.io
>>>>>>>>> 
>>>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>>>> -- 
>>>>>>> Igor Fedotov
>>>>>>> Ceph Lead Developer
>>>>>>> 
>>>>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>>>>> 
>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>> -- 
>>>>> Igor Fedotov
>>>>> Ceph Lead Developer
>>>>> 
>>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>>> 
>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>> 
>>> -- 
>>> Igor Fedotov
>>> Ceph Lead Developer
>>> 
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>> 
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>> 
> 
> -- 
> Igor Fedotov
> Ceph Lead Developer
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> 

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
