[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2021-12-21 Thread ceph
Hi, This > fsck failed: (5) Input/output error Sounds like a hardware issue. Did you have a look at "dmesg"? Hth Mehmet On 21 December 2021 17:47:35 CET, Sebastian Mazza wrote: >Hi all, > >after a reboot of a cluster 3 OSDs can not be started. The OSDs exit with the >following error messa
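A minimal sketch of the kind of kernel-log and SMART checks suggested here; the device name is a placeholder and should be replaced with the disk backing the failed OSD:

```
# scan the kernel ring buffer for SATA/ATA and block-layer I/O errors around the reboot
dmesg -T | grep -iE 'ata[0-9]|i/o error|medium error|blk_update_request'
# query SMART health of a suspect drive (placeholder device name)
smartctl -a /dev/sdb
```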

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2021-12-21 Thread Igor Fedotov
Hi Sebastian, first of all I'm not sure this issue has the same root cause as Francois' one. Highly likely it's just another BlueFS/RocksDB data corruption which is indicated in the same way. In this respect I would rather mention this one reported just yesterday: https://lists.ceph.io/hyperk

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2021-12-21 Thread Sebastian Mazza
Hi Mehmet, thank you for your suggestion. I have now checked the kernel log, but I didn't see anything interesting. However, I copied the parts that seem to be related to the SATA disks of the failed OSDs. Maybe you see more than I do. [1.815801] ata7: SATA link down (SStatus 0 SControl 300)

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2021-12-21 Thread Sebastian Mazza
Hi Igor, I have now fixed my wrong OSD debug config to: [osd.7] debug bluefs = 20 debug bdev = 20 and you can download the debug log from: https://we.tl/t-3e4do1PQGj Thanks, Sebastian > On 21.12.2021, at 19:44, Igor Fedotov wrote: > > Hi Sebastian, > > first of all I'm not su
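For reference, a sketch of the two equivalent ways to apply the per-OSD debug settings quoted above; osd.7 and the log levels are taken from the thread:

```
# ceph.conf stanza on the node hosting osd.7, as quoted above:
#   [osd.7]
#       debug bluefs = 20
#       debug bdev   = 20
# or the same settings via the monitor config database:
ceph config set osd.7 debug_bluefs 20
ceph config set osd.7 debug_bdev 20
```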

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2021-12-24 Thread Igor Fedotov
Hey Sebastian, On 12/22/2021 1:53 AM, Sebastian Mazza wrote: 9) Would you be able to run some long-lasting (and potentially data-corrupting) experiments on this cluster in an attempt to pinpoint the issue? I'm thinking about periodic OSD shutdowns under load to catch the corrupting event

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2021-12-31 Thread Sebastian Mazza
Hi Mazzystr, thank you very much for your suggestion! The OSDs did find the bluestore block device and I do not use any USB drives. All failed OSDs are on SATA drives connected to AMD CPUs / chipsets. It now seems clear that the problem is that one of the RocksDBs is corrupted on each of the fa

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-01-21 Thread Sebastian Mazza
Hi Igor, I want to give you a short update, since I have now tried for quite some time to reproduce the problem as you suggested. I've tried to simulate every imaginable load that the cluster might have done before the three OSDs crashed. I rebooted the servers many times while the cluster was under lo

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-01-21 Thread Igor Fedotov
Hey Sebastian, thanks a lot for your help and the update. On 1/21/2022 4:58 PM, Sebastian Mazza wrote: Hi Igor, I want to give you a short update, since I have now tried for quite some time to reproduce the problem as you suggested. I've tried to simulate every imaginable load that the cluster mi

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-01-21 Thread Sebastian Mazza
Hey Igor, thank you for your response and your suggestions. >> I've tried to simulate every imaginable load that the cluster might have >> done before the three OSDs crashed. >> I rebooted the servers many times while the cluster was under load. If more >> than a single node was rebooted at the s

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-01-24 Thread Igor Fedotov
Hey Sebastian, thanks a lot for the update, please see more questions inline. Thanks, Igor On 1/22/2022 2:13 AM, Sebastian Mazza wrote: Hey Igor, thank you for your response and your suggestions. I've tried to simulate every imaginable load that the cluster might have done before the thr

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-01-25 Thread Sebastian Mazza
Hey Igor, thank you for your response! >> >> Do you suggest disabling the HDD write-caching and / or the >> bluefs_buffered_io for productive clusters? >> > Generally the upstream recommendation is to disable disk write caching; there > were multiple complaints it might negatively impact the perf
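A minimal sketch of the two knobs being discussed here; /dev/sdX is a placeholder for an OSD's HDD, and whether to change either setting is exactly what this exchange is weighing up:

```
# disable the drive's volatile write cache (may not persist across reboots)
hdparm -W 0 /dev/sdX
# disable buffered BlueFS I/O for all OSDs via the config database
ceph config set osd bluefs_buffered_io false
```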

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-01-28 Thread Igor Fedotov
On 1/26/2022 1:18 AM, Sebastian Mazza wrote: Hey Igor, thank you for your response! Do you suggest disabling the HDD write-caching and / or the bluefs_buffered_io for productive clusters? Generally the upstream recommendation is to disable disk write caching; there were multiple complaints it

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-20 Thread Sebastian Mazza
Hi Igor, it happened again. One of the OSDs that crashed last time has a corrupted RocksDB again. Unfortunately I do not have debug logs from the OSDs again. I was collecting hundreds of gigabytes of OSD debug logs over the last two months. But this week I disabled the debug logging, because I d

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-21 Thread Igor Fedotov
Hi Sebastian, could you please share the failing OSD startup log? Thanks, Igor On 2/20/2022 5:10 PM, Sebastian Mazza wrote: Hi Igor, it happened again. One of the OSDs that crashed last time has a corrupted RocksDB again. Unfortunately I do not have debug logs from the OSDs again. I was coll

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-21 Thread Sebastian Mazza
Hi Igor, please find the startup log under the following link: https://we.tl/t-E6CadpW1ZL It also includes the "normal" log of that OSD from the day before the crash and the RocksDB sst file with the "Bad table magic number" (db/001922.sst) Best regards, Sebastian > On 21.02.2022, at

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-21 Thread Sebastian Mazza
Hi Igor, today (21-02-2022) at 13:49:28.452+0100, I crashed the OSD 7 again. And this time I have logs with "debug bluefs = 20" and "debug bdev = 20" for every OSD in the cluster! It was the OSD with the ID 7 again. So the HDD has now failed for the third time! Coincidence? Probably not… The import

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-21 Thread Igor Fedotov
Hey Sebastian, thanks a lot for the new logs - looks like they provide some insight. At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task happening on OSD startup. It seems that deferred write replay updates a block extent whic

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-21 Thread Sebastian Mazza
Hey Igor! > thanks a lot for the new logs - looks like they provide some insight. I'm glad the logs are helpful. > At this point I think the root cause is apparently a race between deferred > writes replay and some DB maintenance task happening on OSD startup. It seems > that deferred writ

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-22 Thread Igor Fedotov
Hi Sebastian, On 2/22/2022 3:01 AM, Sebastian Mazza wrote: Hey Igor! thanks a lot for the new logs - looks like they provide some insight. I'm glad the logs are helpful. At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-22 Thread Alexander E. Patrakov
I have another suggestion: check the RAM, just in case, with memtest86 or https://github.com/martinwhitaker/pcmemtest (which is a fork of memtest86+). Ignore the suggestion if you have ECC RAM. Tue, 22 Feb 2022 at 15:45, Igor Fedotov : > > Hi Sebastian, > > On 2/22/2022 3:01 AM, Sebastian Mazza

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-23 Thread Sebastian Mazza
Hi Alexander, thank you for your suggestion! All my nodes have ECC memory. However, I have now checked that it was recognized correctly on every system (dmesg | grep EDAC). Furthermore, I checked whether an error occurred by using `edac-util` and also by searching the logs of the mainboard BMCs. Ev
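The ECC checks mentioned above, spelled out as a short sketch:

```
# confirm the EDAC driver registered the memory controllers (ECC present)
dmesg | grep -i EDAC
# report corrected/uncorrected ECC error counts (edac-utils package)
edac-util -v
```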

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-23 Thread Sebastian Mazza
Hi Igor, I let ceph rebuild OSD.7. Then I added ``` [osd] debug bluefs = 20 debug bdev = 20 debug bluestore = 20 ``` to the ceph.conf of all 3 nodes and shut down all 3 nodes without writing anything to the pools on the HDDs (the Debian VM was not even running). Immed
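A sketch of the cluster-wide equivalent of the [osd] ceph.conf section quoted above, using the config database instead of editing ceph.conf on every node:

```
ceph config set osd debug_bluefs 20
ceph config set osd debug_bdev 20
ceph config set osd debug_bluestore 20
```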

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-25 Thread Igor Fedotov
Hi Sebastian, I submitted a ticket https://tracker.ceph.com/issues/54409 which shows my analysis based on your previous log (from 21-02-2022), which wasn't verbose enough at the debug-bluestore level to reach a final conclusion. Unfortunately the last logs (from 24-02-2022) you shared don't incl

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-03-10 Thread Sebastian Mazza
Hi Igor! I hope I've hit the jackpot now. I have logs with OSD debug level 20 for bluefs, bdev, and bluestore. The log file ceph-osd.4.log shows 2 consecutive startups of osd.4, where the second startup results in: ``` rocksdb: Corruption: Bad table magic number: expected 98635183903770

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-03-14 Thread Sebastian Mazza
Hello Igor, I'm glad I could be of help. Thank you for your explanation! > And I was right: this is related to the deferred write procedure and apparently > fast shutdown mode. Does that mean I can prevent the error in the meantime, before you can fix the root cause, by disabling osd_fast_shutdow

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-03-14 Thread Igor Fedotov
Hi Sebastian, the proper parameter name is "osd fast shutdown". As with any other OSD config parameter one can use either ceph.conf or the 'ceph config set osd.N osd_fast_shutdown false' command to adjust it. I'd recommend the latter form. And yeah, from my last experiments it looks like setting
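A minimal sketch of the two forms Igor describes; osd.7 is the OSD from this thread:

```
# ceph.conf form:
#   [osd]
#       osd fast shutdown = false
# recommended form, via the config database (cluster-wide or for a single OSD):
ceph config set osd osd_fast_shutdown false
ceph config set osd.7 osd_fast_shutdown false
```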

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-03-14 Thread Sebastian Mazza
Hi Igor, great that you were able to reproduce it! I did read your comments on issue #54547. Am I right that I probably have hundreds of corrupted objects on my EC pools (CephFS and RBD)? But I only ever noticed when a RocksDB was damaged. A deep scrub should find the other errors, right?
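A sketch of how such a deep scrub could be triggered manually; the OSD id and the placement group id are only examples:

```
# deep-scrub all PGs for which osd.7 is primary
ceph osd deep-scrub osd.7
# or target a single placement group
ceph pg deep-scrub 2.1f
```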

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-03-15 Thread Igor Fedotov
Hi Sebastian, I don't think you have got tons of corrupted objects. The tricky thing about the bug is that corruption might occur only if a new allocation happened in a pretty short window: when the OSD is starting but hasn't applied deferred writes yet. This mostly applies to BlueFS/RocksDB perfo

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-03-17 Thread Sebastian Mazza
Hi Igor, thank you very much for your explanation. I much appreciate it. You were right, as always. :-) There was not a single corrupted object. I did run `time ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-$X` and `time ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-$X --deep y
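The fsck invocations quoted above, written out as a sketch; OSD id 7 is an example, and the OSD must be stopped before running them:

```
systemctl stop ceph-osd@7
time ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-7
time ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-7 --deep y
systemctl start ceph-osd@7
```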

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-03-17 Thread Igor Fedotov
Hi Sebastian, actually it's hard to tell what's happening with this OSD... Maybe it's less fragmented and hence benefits from sequential reading. IIRC you're using spinning drives, which are very susceptible to access patterns. Thanks, Igor On 3/17/2022 11:54 PM, Sebastian Mazza wr