[ceph-users] Re: OSDs crashing

2020-02-05 Thread Raymond Clotfelter
I have found that if I set norecovery, then I can get almost all OSDs to come up and stay up, but the moment I unset norecovery so that the cluster can heal itself, tons of OSDs go offline again. The OSD host servers have plenty of available RAM, and they are not maxing out on CPU or I/O as near as I can t…
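The toggle described above can be sketched with the standard cluster flag commands; a minimal sequence, assuming admin access to the cluster (the flag names are the real Ceph ones, the workflow itself is just an illustration of what the poster describes):

```shell
# Pause recovery so OSDs can boot and stabilize
ceph osd set norecovery

# Check that OSDs are coming up and staying up
ceph osd stat
ceph -s

# Once OSDs are stable, allow recovery again and watch
# whether they start flapping as described above
ceph osd unset norecovery
ceph -s
```

Related flags like `nobackfill` and `noout` are often set alongside `norecovery` during this kind of triage, to keep the cluster from marking struggling OSDs out while they restart.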

[ceph-users] Re: OSDs crashing/flapping

2022-08-04 Thread Torkil Svensgaard
On 8/4/22 09:17, Torkil Svensgaard wrote: Hi. We have a lot of OSDs flapping during recovery, and eventually they don't come up again until kicked with "ceph orch daemon restart osd.x". This is the end of the log for one OSD going down for good: " 2022-08-04T09:57:31.752+ 7f3812cb2700 1…
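A small sketch of the kick described above, for a cephadm-managed cluster; the OSD id 12 is a placeholder, and `ceph osd tree down` is just one way to find which daemons need the restart:

```shell
# List OSDs currently marked down
ceph osd tree down

# Restart a stuck OSD daemon via the orchestrator
# (replace 12 with the actual OSD id from the tree output)
ceph orch daemon restart osd.12

# Verify it rejoined
ceph osd tree up | grep 'osd.12'
```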

[ceph-users] Re: OSDs crashing/flapping

2022-08-04 Thread Igor Fedotov
Hi Torkil, it looks like you're facing a pretty well-known problem: RocksDB performance degradation caused by bulk data removal. This has been discussed multiple times on this mailing list, and here is one of the relevant trackers: https://tracker.ceph.com/issues/40741 To eliminate the ef…
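The usual remediation discussed around that tracker is an offline compaction of the OSD's RocksDB. A hedged sketch, assuming a non-containerized OSD with its data at the default path (the OSD id 2 is a placeholder; stop the daemon before touching its store):

```shell
# Stop the affected OSD first -- the store must not be in use
systemctl stop ceph-osd@2

# Compact the RocksDB sitting on BlueStore for this OSD
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-2 compact

# Bring the OSD back
systemctl start ceph-osd@2
```

Compaction can take a long time on a large, fragmented store, so it is normally done one OSD at a time with `noout` set.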

[ceph-users] Re: OSDs crashing after server reboot.

2021-03-11 Thread Igor Fedotov
Hi Cassiano, the backtrace you've provided relates to the bug fixed by: https://github.com/ceph/ceph/pull/37793 This fix is going to be released with the upcoming v14.2.17. But I doubt that your original crashes have the same root cause - this issue appears during shutdown only. Anyway, yo…

[ceph-users] Re: OSDs crashing after server reboot.

2021-03-11 Thread Cassiano Pilipavicius
Hi, really this error was only showing up when I tried to run ceph-bluestore-tool repair. On my 3 OSDs that keep crashing, it shows the following log... please let me know if there is something I can do to get the pool back to a functioning state. Uptime(secs): 0.0 total, 0.0 interval Flush(GB)…
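For reference, the repair invocation mentioned above takes this general shape; a sketch assuming the OSD is stopped and lives at the default path (the OSD id 3 is a placeholder, and `fsck` is the usual read-only first step before `repair`):

```shell
# The OSD daemon must be stopped before offline checks
systemctl stop ceph-osd@3

# Read-only consistency check first
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-3

# Only then attempt the repair that produced the log above
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-3
```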

[ceph-users] Re: OSDs crashing after server reboot.

2021-03-12 Thread Cassiano Pilipavicius
Thanks again Igor, using the ceph-bluestore-tool with the CEPH_ARGS="--bluestore_hybrid_alloc_mem_cap=0" I was able to detect two OSDs returning IO errors. This OSDs crashing has caused backfills operations that triggered some OSDs marking others as down due to some kind of slow operations and the