Yes, you can set it on just the one node. That configuration is for an entirely
internal mechanism, so it can differ across OSDs without causing any trouble.
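Untested, but something along these lines should do it (assuming a
systemd-managed deployment and that osd.34 lives on that node; adjust the id
to match your setup). Add to ceph.conf on that node:

    [osd.34]
        osd_op_thread_suicide_timeout = 300

    # restart just that OSD so the value is picked up at startup
    systemctl restart ceph-osd@34

    # then confirm it took effect via the admin socket
    ceph daemon osd.34 config get osd_op_thread_suicide_timeout

After the restart, config show should report the new value instead of the
default 150.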

On Tue, Aug 15, 2017 at 4:25 PM Andreas Calminder <
andreas.calmin...@klarna.com> wrote:

> Thanks, I'll try and do that. Since I'm running a cluster with
> multiple nodes, do I have to set this in ceph.conf on all nodes, or is it
> enough to set it on just the node with that particular OSD?
>
> On 15 August 2017 at 22:51, Gregory Farnum <gfar...@redhat.com> wrote:
> >
> >
> > On Tue, Aug 15, 2017 at 7:03 AM Andreas Calminder
> > <andreas.calmin...@klarna.com> wrote:
> >>
> >> Hi,
> >> I got hit with OSD suicide timeouts while deep-scrub runs on a
> >> specific pg. There's an RH article
> >> (https://access.redhat.com/solutions/2127471) suggesting raising
> >> osd_scrub_thread_suicide_timeout from 60s to a higher value. The problem
> >> is that the article is for Hammer, and osd_scrub_thread_suicide_timeout
> >> doesn't exist in the output of
> >> ceph daemon osd.34 config show
> >> and the default timeout (60s) suggested in the article doesn't really
> >> match the suicide timeout in the logs:
> >>
> >> 2017-08-15 15:39:37.512216 7fb293137700  1 heartbeat_map is_healthy
> >> 'OSD::osd_op_tp thread 0x7fb231adf700' had suicide timed out after 150
> >> 2017-08-15 15:39:37.518543 7fb293137700 -1 common/HeartbeatMap.cc: In
> >> function 'bool ceph::HeartbeatMap::_check(const
> >> ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fb293137700
> >> time 2017-08-15 15:39:37.512230
> >> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
> >>
> >> The suicide timeout (150) does match
> >> osd_op_thread_suicide_timeout; however, when I try changing it I get:
> >> ceph daemon osd.34 config set osd_op_thread_suicide_timeout 300
> >> {
> >>     "success": "osd_op_thread_suicide_timeout = '300' (unchangeable) "
> >> }
> >>
> >> And the deep scrub still hits the suicide timeout after 150 seconds,
> >> just like before.
> >>
> >> The cluster is left with osd.34 flapping. Is there any way to let the
> >> deep-scrub finish and get out of the infinite deep-scrub loop?
> >
> >
> > You can set that option in ceph.conf. It's "unchangeable" because it's
> > used to initialize some other structures at boot, so you can't edit it
> > live.
> >
> >>
> >>
> >> Regards,
> >> Andreas