Hello all,
I'm having some stability issues with my ceph cluster at the moment.
I'm running CentOS 7 and Ceph 12.2.4.
I have osds that are segfaulting regularly, roughly every minute or
so, and it seems to be getting worse, now with cascading failures.
Backtraces look like this:
ceph version 12.2.4
Well, the cascading crashes are getting worse. I'm routinely seeing
8-10 of my 518 osds crash. I cannot start 2 of them without triggering
14 or so of them to crash repeatedly for more than an hour.
I've run another one of them with more logging, debug osd = 20; debug
ms = 1 (definitely more than
On 04/05/2018 06:15 PM, Adam Tygart wrote:
> Well, the cascading crashes are getting worse. I'm routinely seeing
> 8-10 of my 518 osds crash. I cannot start 2 of them without triggering
> 14 or so of them to crash repeatedly for more than an hour.
> I've ran another one of them with more logging, debug
On 04/05/2018 08:11 PM, Josh Durgin wrote:
> On 04/05/2018 06:15 PM, Adam Tygart wrote:
>> Well, the cascading crashes are getting worse. I'm routinely seeing
>> 8-10 of my 518 osds crash. I cannot start 2 of them without triggering
>> 14 or so of them to crash repeatedly for more than an hour.
>> I've ran a
Thank you! Setting norecover seems to have worked in terms of keeping
the osds up. I am glad my logs were of use in tracking this down. I am
looking forward to future updates.
Let me know if you need anything else.
--
Adam
On Thu, Apr 5, 2018 at 10:13 PM, Josh Durgin wrote:
> On 04/05/2018 08:11
You should be able to avoid the crash by setting:
osd recovery max single start = 1
osd recovery max active = 1
With that, you can unset norecover to let recovery start again.
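
As a rough sketch, the two settings above could be persisted in ceph.conf
like this. Placing them under the [osd] section is an assumption; the
message itself does not say which section to use:

```ini
[osd]
# Throttle recovery to one operation at a time (workaround values from above)
osd recovery max single start = 1
osd recovery max active = 1
```

Values in ceph.conf are read when a daemon starts, so already-running osds
would still need a runtime change or a restart to pick them up.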
A fix so you don't need those settings is here:
https://github.com/ceph/ceph/pull/21273
If you see any other backtra
I set this about 15 minutes ago, with the following:
ceph tell osd.* injectargs '--osd-recovery-max-single-start 1 --osd-recovery-max-active 1'
ceph osd unset noout
ceph osd unset norecover
I also added those settings to ceph.conf, just in case the "not observed"
response from injectargs was accurate.
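
One way to check whether a running osd actually picked up the injected
values is to query its admin socket. The following is only a sketch: the
function name show_recovery_throttles is made up for illustration, and it
has to be run on the host where the given osd's admin socket lives.

```shell
# Hypothetical helper (name is illustrative, not part of Ceph):
# print the running value of both recovery throttles for one osd.
# Uses "ceph daemon", which talks to the local admin socket, so it only
# works on the host that runs the osd in question.
show_recovery_throttles() {
    osd_id="$1"
    for opt in osd_recovery_max_single_start osd_recovery_max_active; do
        ceph daemon "osd.${osd_id}" config get "$opt"
    done
}
```

For example, show_recovery_throttles 0 on the host running osd.0 would
print the value each option currently has in that daemon, regardless of
what ceph.conf says.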
Things have been