[ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-05 Thread Adam Tygart
Hello all, I'm having some stability issues with my Ceph cluster at the moment, using CentOS 7 and Ceph 12.2.4. I have OSDs that are segfaulting regularly, roughly every minute or so, and it seems to be getting worse, now with cascading failures. Backtraces look like this: ceph version 12.2.4…
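For context, backtraces like the one mentioned normally land in the OSD's own log file; a minimal sketch of pulling one out, assuming the default log path and a placeholder OSD id:

    # osd.12 is a placeholder id; adjust the path if logs live elsewhere
    grep -B 5 -A 30 'Segmentation fault' /var/log/ceph/ceph-osd.12.log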

Re: [ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-05 Thread Adam Tygart
Well, the cascading crashes are getting worse. I'm routinely seeing 8-10 of my 518 OSDs crash. I cannot start 2 of them without triggering 14 or so of them to crash repeatedly for more than an hour. I've run another one of them with more logging, debug osd = 20; debug ms = 1 (definitely more than…
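For reference, the extra logging mentioned above can be raised at runtime or persisted in ceph.conf; a minimal sketch, with osd.12 standing in for the crashing OSD:

    # Raise verbosity on a single running OSD (osd.12 is a placeholder)
    ceph tell osd.12 injectargs '--debug-osd 20 --debug-ms 1'

    # Or persist it under [osd] in ceph.conf and restart the daemon:
    # [osd]
    #     debug osd = 20
    #     debug ms = 1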

Re: [ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-05 Thread Josh Durgin
On 04/05/2018 06:15 PM, Adam Tygart wrote: Well, the cascading crashes are getting worse. I'm routinely seeing 8-10 of my 518 OSDs crash. I cannot start 2 of them without triggering 14 or so of them to crash repeatedly for more than an hour. I've run another one of them with more logging, debug…

Re: [ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-05 Thread Josh Durgin
On 04/05/2018 08:11 PM, Josh Durgin wrote: On 04/05/2018 06:15 PM, Adam Tygart wrote: Well, the cascading crashes are getting worse. I'm routinely seeing 8-10 of my 518 OSDs crash. I cannot start 2 of them without triggering 14 or so of them to crash repeatedly for more than an hour. I've run a…

Re: [ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-05 Thread Adam Tygart
Thank you! Setting norecover seems to have worked in terms of keeping the OSDs up. I am glad my logs were of use in tracking this down. I am looking forward to future updates. Let me know if you need anything else. -- Adam On Thu, Apr 5, 2018 at 10:13 PM, Josh Durgin wrote: > On 04/05/2018 08:11…
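For anyone following along, norecover is a cluster-wide OSD flag; a minimal sketch of setting and later clearing it:

    # Pause recovery cluster-wide while the crashing OSDs are investigated
    ceph osd set norecover
    # Re-enable recovery once the throttles suggested below are in place
    ceph osd unset norecover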

Re: [ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-06 Thread Josh Durgin
You should be able to avoid the crash by setting: osd recovery max single start = 1 and osd recovery max active = 1. With that, you can unset norecover to let recovery start again. A fix so you don't need those settings is here: https://github.com/ceph/ceph/pull/21273. If you see any other backtraces…
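A minimal sketch of applying those settings across the cluster at runtime and resuming recovery (values as suggested above; the injectargs form matches the follow-up below):

    # Throttle recovery to a single op per OSD at a time
    ceph tell osd.* injectargs '--osd-recovery-max-single-start 1 --osd-recovery-max-active 1'
    # Then let recovery resume
    ceph osd unset norecover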

Re: [ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-06 Thread Adam Tygart
I set this about 15 minutes ago with the following: ceph tell osd.* injectargs '--osd-recovery-max-single-start 1 --osd-recovery-max-active 1'; ceph osd unset noout; ceph osd unset norecover. I also set those settings in ceph.conf just in case the "not observed" response was true. Things have been…
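A sketch of persisting the settings and checking whether a running OSD actually picked them up (osd.12 and the config path are placeholders):

    # /etc/ceph/ceph.conf
    [osd]
    osd recovery max single start = 1
    osd recovery max active = 1

    # Confirm the live value via the admin socket on that OSD's host
    ceph daemon osd.12 config get osd_recovery_max_active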