Hi Igor, thank you very much for your explanation. I really appreciate it.

You were right, as always. :-) There was not a single corrupted object. I ran `time ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-$X` and `time ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-$X --deep yes` for every HDD-based OSD, and every fsck printed "fsck success".
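In case anyone wants to repeat this, it was essentially a loop like the one below. The OSD IDs are placeholders for the HDD-based OSDs, and each OSD has to be stopped first, since ceph-bluestore-tool needs exclusive access to the store:
```
# Deep-fsck every HDD-based OSD on a node (OSD IDs are placeholders).
for X in 4 5 6 7; do
    systemctl stop ceph-osd@$X    # bluestore tools need the OSD offline
    time ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-$X
    time ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-$X --deep yes
    systemctl start ceph-osd@$X
done
```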
However, I observed something odd during the fsck: every OSD except the one that was recovered was reading at around 44 MByte/s, while the recovered OSD.4 was much faster at around 125 MByte/s.

Thanks,
Sebastian

> On 15.03.2022, at 09:52, Igor Fedotov <igor.fedo...@croit.io> wrote:
>
> Hi Sebastian,
>
> I don't think you have got tons of corrupted objects. The tricky thing about the bug is that corruption might occur only if new allocation happened in a pretty short period: when the OSD is starting but hasn't applied deferred writes yet. This mostly applies to BlueFS/RocksDB performing some housekeeping on startup, e.g. recovery or compaction.
>
> Additionally, there is a questionable period shortly before the shutdown when a similar issue is apparently possible. But that's just speculation so far.
>
> So IMO it is highly likely there are no corrupted objects in your cluster at all. You can try to run fsck with the deep flag on a per-OSD basis to check that though - it will verify object content using checksums. Expect it to take some time though...
>
>
> Thanks,
>
> Igor
>
> On 3/14/2022 9:33 PM, Sebastian Mazza wrote:
>> Hi Igor,
>>
>> great that you were able to reproduce it!
>>
>> I read your comments on issue #54547. Am I right that I probably have hundreds of corrupted objects on my EC pools (cephFS and RBD)? But I only ever noticed when a RocksDB was damaged. A deep scrub should find the other errors, right? But the PGs of my EC pools were never scrubbed, because they are all "undersized" and "degraded". Is it possible to scrub an "undersized" / "degraded" PG? I'm not able to (deep) scrub any of the PGs on my EC pools. I already set osd_scrub_during_recovery = true, but this does not help. Scrubbing and deep scrubbing work as expected on the replicated pools.
>>
>>
>> Thank you very much for helping me with my problem so patiently over the last 3 months!
>>
>>
>> Thanks,
>> Sebastian
>>
>>
>>> On 14.03.2022, at 17:07, Igor Fedotov <igor.fedo...@croit.io> wrote:
>>>
>>> Hi Sebastian,
>>>
>>> the proper parameter name is 'osd fast shutdown'.
>>>
>>> As with any other OSD config parameter, one can use either ceph.conf or the 'ceph config set osd.N osd_fast_shutdown false' command to adjust it.
>>>
>>> I'd recommend the latter form.
>>>
>>> And yeah, from my last experiments it looks like setting it to false is a workaround indeed...
>>>
>>> I've managed to reproduce the bug locally and it doesn't occur without fast shutdown.
>>>
>>> Thanks for all the information you shared, highly appreciated!
>>>
>>> FYI: I've created another ticket (missed that I already had one): https://tracker.ceph.com/issues/54547
>>>
>>> So please expect all the related news there.
>>>
>>>
>>> Kind regards,
>>>
>>> Igor
>>>
>>>
>>>
>>> On 3/14/2022 5:54 PM, Sebastian Mazza wrote:
>>>> Hello Igor,
>>>>
>>>> I'm glad I could be of help. Thank you for your explanation!
>>>>
>>>>> And I was right, this is related to the deferred write procedure and apparently fast shutdown mode.
>>>> Does that mean I can prevent the error in the meantime, before you can fix the root cause, by disabling osd_fast_shutdown?
>>>> Can it be disabled by the following statement in ceph.conf?
>>>> ```
>>>> [osd]
>>>> fast shutdown = false
>>>> ```
>>>> (I wasn't able to find any documentation for "osd_fast_shutdown", e.g. https://docs.ceph.com/en/pacific/rados/configuration/osd-config-ref/ does not contain "osd_fast_shutdown".)
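>>>> Or, if ceph.conf is not the right place, I would guess the runtime equivalent is something like the following (just my untested guess):
>>>> ```
>>>> $ ceph config set osd osd_fast_shutdown false    # all OSDs
>>>> $ ceph config set osd.N osd_fast_shutdown false  # or one OSD only
>>>> ```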
>>>>
>>>>> 1) From the log I can see you're using RBD over an EC pool for the repro, right? What's the EC profile?
>>>> Yes, one of the two EC pools contains one RBD image and the other EC pool is used for cephFS.
>>>>
>>>> Rules from my crush map:
>>>> ```
>>>> rule ec4x2hdd {
>>>>     id 0
>>>>     type erasure
>>>>     min_size 8
>>>>     max_size 8
>>>>     step set_chooseleaf_tries 5
>>>>     step set_choose_tries 100
>>>>     step take default class hdd
>>>>     step choose indep 4 type host
>>>>     step choose indep 2 type osd
>>>>     step emit
>>>> }
>>>> rule c3nvme {
>>>>     id 1
>>>>     type replicated
>>>>     min_size 1
>>>>     max_size 10
>>>>     step take default class nvme
>>>>     step chooseleaf firstn 0 type host
>>>>     step emit
>>>> }
>>>> ```
>>>> The EC rule requires 4 hosts, but as I have already mentioned, the cluster only has 3 nodes, since I never added the fourth node after the problem occurred the first time.
>>>>
>>>> I created the EC profile by:
>>>> ```
>>>> $ ceph osd erasure-code-profile set ec4x2hdd-profile \
>>>>     k=5 \
>>>>     m=3 \
>>>>     crush-root=ec4x2hdd
>>>> ```
>>>>
>>>> And the EC pools by:
>>>> ```
>>>> $ ceph osd pool create fs.data-hdd.ec-pool 64 64 erasure ec4x2hdd-profile ec4x2hdd warn 36000000000000 # 36TB
>>>> $ ceph osd pool create block-hdd.ec-pool 16 16 erasure ec4x2hdd-profile ec4x2hdd warn 1000000000000 # 1TB
>>>> ```
>>>>
>>>> Replicated pools on NVMes:
>>>> ```
>>>> $ ceph osd pool create block-nvme.c-pool 32 32 replicated c3nvme 0 warn 549755813888 # 512 GB
>>>> $ ceph osd pool create fs.metadata-root-pool 8 8 replicated c3nvme 0 warn 4294967296 # 4 GB
>>>> $ ceph osd pool create fs.data-root-pool 16 16 replicated c3nvme 0 warn 137438953472 # 128 GB
>>>> ```
>>>>
>>>> Enable overwrite support:
>>>> ```
>>>> $ ceph osd pool set fs.data-hdd.ec-pool allow_ec_overwrites true
>>>> $ ceph osd pool set block-hdd.ec-pool allow_ec_overwrites true
>>>> ```
>>>>
>>>> Set the application:
>>>> ```
>>>> $ ceph osd pool application enable fs.data-hdd.ec-pool cephfs
>>>> $ ceph osd pool application enable "block-hdd.ec-pool" rbd
>>>> ```
>>>>
>>>> Use the rbd tool to initialize the pools for use by RBD:
>>>> ```
>>>> $ rbd pool init "block-nvme.c-pool"
>>>> $ rbd pool init "block-hdd.ec-pool"
>>>> ```
>>>>
>>>> Create the RBD image:
>>>> ```
>>>> $ rbd --pool "block-nvme.c-pool" create vm-300-disk-1 --data-pool "block-hdd.ec-pool" --size 50G
>>>> ```
>>>>
>>>> Create the cephFS:
>>>> ```
>>>> $ ceph osd pool create fs.metadata-root-pool 8 8 replicated c3nvme 0 warn 4294967296 # 4 GB
>>>> $ ceph osd pool create fs.data-root-pool 16 16 replicated c3nvme 0 warn 137438953472 # 128 GB
>>>> $ ceph fs new cephfs fs.metadata-root-pool fs.data-root-pool
>>>> ```
>>>>
>>>> Add the EC pool to the cephFS:
>>>> ```
>>>> $ ceph fs add_data_pool cephfs fs.data-hdd.ec-pool
>>>> ```
>>>>
>>>> Set the layout of the directory "shares" so that files created inside it use the EC pool:
>>>> ```
>>>> $ setfattr -n ceph.dir.layout.pool -v fs.data-hdd.ec-pool /mnt/cephfs-olymp/shares/
>>>> ```
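>>>> The layout can be read back through the same virtual xattr, e.g. (assuming the same mount point):
>>>> ```
>>>> $ getfattr -n ceph.dir.layout.pool /mnt/cephfs-olymp/shares/
>>>> ```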
>>>>
>>>>> 2) How did you generate the load during the last experiments? Was it some benchmarking tool or any other artificial load generator? If so, could you share job descriptions or scripts if any?
>>>> I did not use a benchmarking tool or any other artificial load generator. The RBD image "vm-300-disk-1" is formatted as BTRFS. I used rsync to copy daily website backups to the volume. The backups are versioned and go back 10 days. If a new version is created, a new folder is added and all files that did not change are hard linked to the file in the folder of the last day. A new or updated file is copied from the production server.
>>>> This backup procedure was not done on the Ceph cluster; it is done on another server. I just copied the web backup directory of that server to the RBD image. However, I used rsync with the "-H" flag to copy the data from the "productive" backup server to the RBD image, so the hard links are preserved.
>>>>
>>>> I copied shared group folders and Apple Time Machine backups to the cephFS, also with rsync from another server. The shared user folders contain typical office data: text documents, CSV files, JPEG images, … The Apple Time Machine backups are basically folders with lots of 8MB or 32MB files that contain encrypted data.
>>>>
>>>> In order to reproduce the RocksDB corruption, I started the rsync process whenever I found some time. I waited until rsync found at least some files that had changed and copied them to the RBD image or cephFS. Then I triggered a shutdown on all 3 cluster nodes.
>>>>
>>>> However, to the best of my knowledge, there was no IO at all between the last 3 restarts of the OSDs, except the IO that is caused by mounting the cephFS on 4 systems and the BTRFS file system on the RBD image on one system.
>>>>
>>>> If you want to create a minimal example that is able to trigger the error, I would suggest creating an RBD image on an EC pool with k=5, m=3 on a cluster with only 6 HDD-based OSDs. Format the RBD image with BTRFS and start creating and deleting hard links on it, roughly as sketched below. I don't know if this is really the trigger, but it is currently my best idea.
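>>>> For illustration only; the mount point and the counts are made up:
>>>> ```
>>>> #!/bin/bash
>>>> # Churn hard links on the BTRFS file system backed by the EC-pool RBD image.
>>>> # /mnt/test is a placeholder for wherever the image is mounted.
>>>> cd /mnt/test || exit 1
>>>> dd if=/dev/urandom of=base.dat bs=1M count=8   # one 8MB data file
>>>> for i in $(seq 1 10000); do                    # create many hard links
>>>>     ln base.dat "link.$i"
>>>> done
>>>> for i in $(seq 1 10000); do                    # and delete them again
>>>>     rm "link.$i"
>>>> done
>>>> ```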
>>>>
>>>> Best wishes,
>>>> Sebastian
>>>>
>>>>
>>>>> On 14.03.2022, at 12:49, Igor Fedotov <igor.fedo...@croit.io> wrote:
>>>>>
>>>>> Hi Sebastian,
>>>>>
>>>>> this is super helpful indeed. And I was right, this is related to the deferred write procedure and apparently fast shutdown mode. I'm still in the process of obtaining more details on the root cause and looking for a way to fix (and preferably reproduce locally) that.
>>>>>
>>>>> Some additional questions on your setup meanwhile. Sorry if some have been asked before...
>>>>>
>>>>> 1) From the log I can see you're using RBD over an EC pool for the repro, right? What's the EC profile?
>>>>>
>>>>> 2) How did you generate the load during the last experiments? Was it some benchmarking tool or any other artificial load generator? If so, could you share job descriptions or scripts if any?
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Igor
>>>>>
>>>>>
>>>>> On 3/10/2022 10:36 PM, Sebastian Mazza wrote:
>>>>>> Hi Igor!
>>>>>>
>>>>>> I hope I've cracked the jackpot now. I have logs with OSD debug level 20 for bluefs, bdev, and bluestore. The log file ceph-osd.4.log shows 2 consecutive startups of the osd.4, where the second startup results in:
>>>>>> ```
>>>>>> rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 0 in db/002193.sst
>>>>>> ```
>>>>>>
>>>>>> The log ceph-osd.4.log shows the last two starts of the OSD. However, I am also sending you the previous log, which contains the "*** Immediate shutdown (osd_fast_shutdown=true) ***" that happened before the last successful OSD startup, as its last line. It is the last line because the server was down for about 39 hours. Based on what you explained to me, the corruption of the RocksDB must have occurred during the OSD startup that begins at the first line of ceph-osd.4.log.
>>>>>>
>>>>>> Logs and the corrupted RocksDB file: https://we.tl/t-yVsF3akxqX
>>>>>>
>>>>>> Do the logs contain what you need?
>>>>>> Please tell me if you need more data from the OSD. If not, I would rebuild it.
>>>>>>
>>>>>>
>>>>>> Best wishes,
>>>>>> Sebastian
>>>>>>
>>>>>>
>>>>>>> On 26.02.2022, at 00:12, Igor Fedotov <igor.fedo...@croit.io> wrote:
>>>>>>>
>>>>>>> Sebastian,
>>>>>>>
>>>>>>> On 2/25/2022 7:17 PM, Sebastian Mazza wrote:
>>>>>>>> Hi Igor,
>>>>>>>>
>>>>>>>>> Unfortunately the last logs (from 24-02-2022) you shared don't include the point where the actual corruption happened as the previous log did. These new logs miss the last successful OSD start which apparently corrupts DB data. Do you have any output prior to their content?
>>>>>>>> I have logs from the last successful startup of all OSDs, since this was the cluster boot where OSD.7 failed (21-02-2022 at 13:49:28.452+0100). That also means that the logs do not include the bluestore debug info, only bluefs and bdev. I will include all logs that I have for the crashed OSDs 5 and 6. The last successful startup is inside the files that end with "4". While the startups were collected with bluefs and bdev debug info, the logs between the startup (21-02-2022 at 13:49:28.452+0100) and the crash (2022-02-24T00:04:57.874+0100) were collected without any debug logging, since during that time I rebuilt OSD 7.
>>>>>>>>
>>>>>>>> Logs: https://we.tl/t-x0la8gZl0S
>>>>>>> Unfortunately they're of little help without verbose output :(
>>>>>>>> Since I have to manually rebuild the cluster, pools, cephFS, and RBD images now, it will take several days before I can try to reproduce the problem.
>>>>>>>>
>>>>>>>> Should, or could, I enable more verbose logging than
>>>>>>>> ```
>>>>>>>> [osd]
>>>>>>>> debug bluefs = 20
>>>>>>>> debug bdev = 20
>>>>>>>> debug bluestore = 20
>>>>>>>> ```
>>>>>>>> in ceph.conf? Or somehow else?
>>>>>>> This can be done either through ceph.conf or using the "ceph config set" CLI command.
>>>>>>>
>>>>>>> To make the resulting log smaller you can use a bit less verbose logging levels:
>>>>>>>
>>>>>>> debug bluefs = 10
>>>>>>>
>>>>>>> debug bdev = 5
>>>>>>>
>>>>>>> debug bluestore = 20
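>>>>>>> Via the CLI that would presumably be something along these lines (a sketch only; the target "osd" applies it to all OSDs):
>>>>>>> ```
>>>>>>> ceph config set osd debug_bluefs 10
>>>>>>> ceph config set osd debug_bdev 5
>>>>>>> ceph config set osd debug_bluestore 20
>>>>>>> ```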
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sebastian
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 25.02.2022, at 16:18, Igor Fedotov <igor.fedo...@croit.io> wrote:
>>>>>>>>>
>>>>>>>>> Hi Sebastian,
>>>>>>>>>
>>>>>>>>> I submitted a ticket https://tracker.ceph.com/issues/54409 which shows my analysis based on your previous log (from 21-02-2022), which wasn't verbose enough at the debug-bluestore level to make the final conclusion.
>>>>>>>>>
>>>>>>>>> Unfortunately the last logs (from 24-02-2022) you shared don't include the point where the actual corruption happened as the previous log did. These new logs miss the last successful OSD start which apparently corrupts DB data. Do you have any output prior to their content?
>>>>>>>>>
>>>>>>>>> If not, could you please reproduce that once again? Generally I'd like to see the OSD log for a broken startup along with a couple of restarts back - the event sequence for the failure seems to be as follows:
>>>>>>>>>
>>>>>>>>> 1) OSD is shut down for the first time. It (for unclear reasons) keeps a set of deferred writes to be applied once again.
>>>>>>>>>
>>>>>>>>> 2) OSD is started up, which triggers deferred write submissions. They overlap (again for unclear reasons so far) with DB data content written shortly before. The OSD starts properly, but DB data corruption has happened at this point.
>>>>>>>>>
>>>>>>>>> 3) OSD is restarted again, which reveals the data corruption, and since that point the OSD is unable to start.
>>>>>>>>>
>>>>>>>>> So these new logs include 3) only for now, while I need 1) & 2) as well...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Igor
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2/24/2022 3:04 AM, Sebastian Mazza wrote:
>>>>>>>>>> Hi Igor,
>>>>>>>>>>
>>>>>>>>>> I let Ceph rebuild OSD.7. Then I added
>>>>>>>>>> ```
>>>>>>>>>> [osd]
>>>>>>>>>> debug bluefs = 20
>>>>>>>>>> debug bdev = 20
>>>>>>>>>> debug bluestore = 20
>>>>>>>>>> ```
>>>>>>>>>> to the ceph.conf of all 3 nodes and shut down all 3 nodes without writing anything to the pools on the HDDs (the Debian VM was not even running).
>>>>>>>>>> Immediately at the first boot, OSD.5 and 6 crashed with the same "Bad table magic number" error. OSDs 5 and 6 are on the same node, but not on the node of OSD 7, which crashed the last two times.
>>>>>>>>>>
>>>>>>>>>> Logs and corrupted RocksDB files: https://we.tl/t-ZBXYp8r4Hq
>>>>>>>>>> I have saved the entire /var/log directory of every node and the result of
>>>>>>>>>> ```
>>>>>>>>>> $ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-5 --out-dir /tmp/osd.5-data
>>>>>>>>>> $ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-6 --out-dir /tmp/osd.6-data
>>>>>>>>>> ```
>>>>>>>>>> Let me know if you need something else.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I hope you can now track it down. I'm really looking forward to your interpretation of the logs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> Sebastian
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 22.02.2022, at 11:44, Igor Fedotov <igor.fedo...@croit.io> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>
>>>>>>>>>>> On 2/22/2022 3:01 AM, Sebastian Mazza wrote:
>>>>>>>>>>>> Hey Igor!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> thanks a lot for the new logs - looks like they provide some insight.
>>>>>>>>>>>> I'm glad the logs are helpful.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task happening on OSD startup.
>>>>>>>>>>>>> It seems that deferred write replay updates a block extent which RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros content. Evidently it's just a matter of chance whether they use a conflicting physical extent or not, hence the occasional nature of the issue...
>>>>>>>>>>>> Do I understand that correctly: the corruption of the RocksDB (table overwritten by zeros) happens at the first start of the OSD after "*** Immediate shutdown (osd_fast_shutdown=true) ***"? Before the system launches the OSD service, the RocksDB is still fine?
>>>>>>>>>>> Looks like that. From the logs I can see an unexpected write to a specific extent (LBA 0x630000) which shouldn't occur and at which RocksDB subsequently fails.
>>>>>>>>>>>>> So first of all I'm curious if you have any particular write patterns that can be culprits? E.g. something like a disk wiping procedure which writes all-zeros to an object followed by object truncate or removal comes to my mind. If you can identify something like that - could you please collect the OSD log for such an operation (followed by OSD restart) with debug-bluestore set to 20?
>>>>>>>>>>>> To the best of my knowledge, the OSD was hardly doing anything, and I do not see any pattern that would fit your explanation. However, you certainly understand a lot more about it than I do, so I try to explain everything that could be relevant.
>>>>>>>>>>>>
>>>>>>>>>>>> The cluster has 3 nodes. Each has a 240GB NVMe M.2 SSD as boot drive, which should not be relevant. Each node has 3 OSDs: one is on a 2TB U.2 NVMe SSD and the other two are on 12TB HDDs.
>>>>>>>>>>>>
>>>>>>>>>>>> I have configured two crush rules, 'c3nvme' and 'ec4x2hdd'. The 'c3nvme' is a replicated rule that uses only OSDs with class 'nvme'. The second rule is a tricky erasure rule. It selects exactly 2 OSDs on exactly 4 hosts with class 'hdd', so it only works for a size of exactly 8. That means that a pool that uses this rule always has only "undersized" placement groups, since the cluster has only 3 nodes. (I did not add the fourth server after the first crash in December, since we want to reproduce the problem.)
>>>>>>>>>>>>
>>>>>>>>>>>> The pools device_health_metrics, test-pool, fs.metadata-root-pool, fs.data-root-pool, fs.data-nvme.c-pool, and block-nvme.c-pool use the crush rule 'c3nvme' with a size of 3 and a min size of 2. The pools fs.data-hdd.ec-pool and block-hdd.ec-pool use the crush rule 'ec4x2hdd' with k=5, m=3 and a min size of 6.
>>>>>>>>>>>>
>>>>>>>>>>>> The pool fs.data-nvme.c-pool is not used, and the pool test-pool was used for rados bench a few months ago.
>>>>>>>>>>>>
>>>>>>>>>>>> The pool fs.metadata-root-pool is used as the metadata pool for the cephFS and fs.data-root-pool as the root data pool for the cephFS. The pool fs.data-hdd.ec-pool is an additional data pool for the cephFS and is specified as ceph.dir.layout for some folders of the cephFS. The whole cephFS is mounted by each of the 3 nodes.
>>>>>>>>>>>>
>>>>>>>>>>>> The pool block-nvme.c-pool hosts two RBD images that are used as boot drives for two VMs.
>>>>>>>>>>>> The first VM runs Ubuntu Desktop and the second Debian. The pool block-hdd.ec-pool hosts one RBD image (the data part; metadata on block-nvme.c-pool) that is attached to the Debian VM as a second drive, formatted with BTRFS. Furthermore, the Debian VM mounts a subdirectory of the cephFS that has fs.data-hdd.ec-pool set as its layout. Both VMs were doing nothing, except being booted, in the last couple of days.
>>>>>>>>>>>>
>>>>>>>>>>>> I try to illustrate the pool usage as a tree:
>>>>>>>>>>>> * c3nvme (replicated, size=3, min_size=2)
>>>>>>>>>>>>   + device_health_metrics
>>>>>>>>>>>>   + test-pool
>>>>>>>>>>>>     - rados bench
>>>>>>>>>>>>   + fs.metadata-root-pool
>>>>>>>>>>>>     - CephFS (metadata)
>>>>>>>>>>>>   + fs.data-root-pool
>>>>>>>>>>>>     - CephFS (data root)
>>>>>>>>>>>>   + fs.data-nvme.c-pool
>>>>>>>>>>>>   + block-nvme.c-pool
>>>>>>>>>>>>     - RBD (Ubuntu VM, boot disk with ext4)
>>>>>>>>>>>>     - RBD (Debian VM, boot disk with ext4)
>>>>>>>>>>>> * ec4x2hdd (ec, k=5, m=3, min_size=6)
>>>>>>>>>>>>   + fs.data-hdd.ec-pool
>>>>>>>>>>>>     - CephFS (layout data)
>>>>>>>>>>>>   + block-hdd.ec-pool
>>>>>>>>>>>>     - RBD (Debian VM, disk 2 with BTRFS)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Last week I was writing about 30TB to the cephFS inside the fs.data-hdd.ec-pool and around 50GB to the BTRFS volume on the block-hdd.ec-pool. The 50GB written to BTRFS contain hundreds of thousands of hard links. There was already nearly the same data on the storage, and I used rsync to update it. I think something between 1 and 4 TB has changed and was updated by rsync.
>>>>>>>>>>>>
>>>>>>>>>>>> I think the cluster was totally unused on Friday, but up and running and idling around. Then on Saturday I did a graceful shutdown of all cluster nodes. Around 5 minutes later, when I booted the servers again, OSD.7 crashed. I copied the logs and exported the RocksDB. Then I deleted everything from the HDD and deployed OSD.7 again. When I checked for the first time today at around 12:00, ceph had already finished backfilling to OSD.7 and the cluster was idle again.
>>>>>>>>>>>>
>>>>>>>>>>>> I then spent 70 minutes writing 3 small files (one of about 500 bytes and two of about 2KB) to the cephFS on the fs.data-hdd.ec-pool and shutting down the whole cluster. After I booted all 3 nodes again, I overwrote the 3 small files (some bytes had changed) and shut down all 3 nodes. I repeated that procedure 4 times within 70 minutes. Then OSD 7 crashed again.
>>>>>>>>>>>>
>>>>>>>>>>>> There may have been some more IO caused by mounting the RBD images with the BTRFS volume inside the Debian VM, but I do not know. I can only say that no user used it. Only I wrote the 3 small files 4 times to the cephFS on the fs.data-hdd.ec-pool.
>>>>>>>>>>>>
>>>>>>>>>>>> Does this make any sense to you?
>>>>>>>>>>> Well, that's actually not a write pattern I would expect to cause the issue. But thanks a lot anyway. So could you please repeat that procedure with overwriting the 3 small files and OSD restarts, and collect the logs with debug-bluestore set to 20.
>>>>>>>>>>>>> with debug-bluestore set to 20
>>>>>>>>>>>> Do you mean
>>>>>>>>>>>> ```
>>>>>>>>>>>> [osd]
>>>>>>>>>>>> debug bluestore = 20
>>>>>>>>>>>> ```
>>>>>>>>>>>> in ceph.conf?
>>>>>>>>>>> Yes. Or you might use 'ceph config set osd.7 debug-bluestore 20' instead.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On 21.02.2022, at 21:19, Igor Fedotov <igor.fedo...@croit.io> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Sebastian,
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks a lot for the new logs - looks like they provide some insight. At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task happening on OSD startup. It seems that deferred write replay updates a block extent which RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros content. Evidently it's just a matter of chance whether they use a conflicting physical extent or not, hence the occasional nature of the issue...
>>>>>>>>>>>>>
>>>>>>>>>>>>> And now I'd like to determine what's wrong with this deferred write replay.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So first of all I'm curious if you have any particular write patterns that can be culprits? E.g. something like a disk wiping procedure which writes all-zeros to an object followed by object truncate or removal comes to my mind. If you can identify something like that - could you please collect the OSD log for such an operation (followed by OSD restart) with debug-bluestore set to 20?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Igor
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/21/2022 5:29 PM, Sebastian Mazza wrote:
>>>>>>>>>>>>>> Hi Igor,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> today (21-02-2022) at 13:49:28.452+0100, I crashed the OSD 7 again. And this time I have logs with "debug bluefs = 20" and "debug bdev = 20" for every OSD in the cluster! It was the OSD with the ID 7 again. So the HDD has now failed for the third time! Coincidence? Probably not…
>>>>>>>>>>>>>> The important thing seems to be that a shutdown, and not only a restart, of the entire cluster is performed, since this time the OSD failed after just 4 shutdowns of all nodes in the cluster within 70 minutes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I redeployed OSD.7 after the crash from 2 days ago, and I started this new shutdown-and-boot series shortly after ceph had finished writing everything back to OSD.7 earlier today.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The corrupted RocksDB file (crash) is again only 2KB in size. You can download the RocksDB file with the bad table magic number and the log of the OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ
>>>>>>>>>>>>>> What else do you want?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From the log of the OSD.7:
>>>>>>>>>>>>>> —————
>>>>>>>>>>>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
>>>>>>>>>>>>>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 4096 ioc 0x55f0b8dbdd18 with 0 aios left
>>>>>>>>>>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
>>>>>>>>>>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal Terminated ***
>>>>>>>>>>>>>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate shutdown (osd_fast_shutdown=true) ***
>>>>>>>>>>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00 0 set uid:gid to 64045:64045 (ceph:ceph)
>>>>>>>>>>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00 0 ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process ceph-osd, pid 1967
>>>>>>>>>>>>>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00 0 pidfile_write: ignore empty --pid-file
>>>>>>>>>>>>>> 2022-02-21T13:53:40.459+0100 7fc9645f4f00 1 bdev(0x55bd400a0800 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
>>>>>>>>>>>>>> —————
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To me this looks like the OSD did nothing for nearly 2 minutes before it received the termination request. Shouldn't this be enough time for flushing every imaginable write cache?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I hope this helps you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>> Sebastian
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io