We have been running with the following pid_max and conntrack_max settings for months. Are these values you would deem too low?

kernel.pid_max = 4194303
net.nf_conntrack_max = 262144
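For anyone who wants to compare those limits against actual usage on an OSD host, something like the following should work (a rough sketch; nf_conntrack_count only exists while the conntrack module is loaded, and the thread count is a whole-host total):

    # configured limits
    sysctl kernel.pid_max kernel.threads-max net.nf_conntrack_max

    # live conntrack entries vs. the configured maximum
    cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max

    # total threads on the host (each OSD can run hundreds)
    ps -eo nlwp --no-headers | awk '{sum += $1} END {print sum}'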
I went back through the logs to see where the slow requests and timeouts started; they seem to have come out of nowhere. I have included the log from 8 minutes before the OSD op thread timeouts and 8.5 minutes before the first slow request message. I don't see any indication at all that the OSD was scrubbing or anything, and I'm also not seeing any SMART errors on the drives.

2017-09-19 01:14:25.539030 7f134326a700 1 leveldb: Delete type=2 #125088
2017-09-19 01:16:25.544797 7f134326a700 1 leveldb: Level-0 table #125303: started
2017-09-19 01:16:25.571103 7f134326a700 1 leveldb: Level-0 table #125303: 1509406 bytes OK
2017-09-19 01:16:25.572424 7f134326a700 1 leveldb: Delete type=0 #125277
2017-09-19 01:18:14.329887 7f134326a700 1 leveldb: Level-0 table #125305: started
2017-09-19 01:18:14.359580 7f134326a700 1 leveldb: Level-0 table #125305: 1503050 bytes OK
2017-09-19 01:18:14.361107 7f134326a700 1 leveldb: Delete type=0 #125302
2017-09-19 01:20:13.239215 7f134326a700 1 leveldb: Level-0 table #125307: started
2017-09-19 01:20:13.268553 7f134326a700 1 leveldb: Level-0 table #125307: 1500713 bytes OK
2017-09-19 01:20:13.270018 7f134326a700 1 leveldb: Delete type=0 #125304
2017-09-19 01:22:24.023066 7f13c142b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f139a4e0700' had timed out after 15
2017-09-19 01:22:24.023081 7f13c142b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f139ace1700' had timed out after 15

--- these lines repeated thousands of times before we see the first slow request message ---

2017-09-19 01:22:37.770750 7f134ddac700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f139d4e6700' had timed out after 15
2017-09-19 01:22:37.770751 7f134ddac700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f139e4e8700' had timed out after 15
2017-09-19 01:22:37.772928 7f13bc0b6700 0 log_channel(cluster) log [WRN] : 25 slow requests, 5 included below; oldest blocked for > 30.890114 secs
2017-09-19 01:22:37.772946 7f13bc0b6700 0 log_channel(cluster) log [WRN] : slow request 30.726969 seconds old, received at 2017-09-19 01:22:07.045866: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428263, reqid=client.12636870.0:480468942, at_version=48111'1958916, trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
2017-09-19 01:22:37.772957 7f13bc0b6700 0 log_channel(cluster) log [WRN] : slow request 30.668230 seconds old, received at 2017-09-19 01:22:07.104606: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428264, reqid=client.12636870.0:480468943, at_version=48111'1958917, trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
2017-09-19 01:22:37.772984 7f13bc0b6700 0 log_channel(cluster) log [WRN] : slow request 30.057066 seconds old, received at 2017-09-19 01:22:07.715770: MOSDECSubOpWrite(97.2e6s2 48111 ECSubWrite(tid=3560867, reqid=client.18422767.0:328744332, at_version=48111'1953782, trim_to=48066'1950689, trim_rollback_to=48111'1953780)) currently started
2017-09-19 01:22:37.772989 7f13bc0b6700 0 log_channel(cluster) log [WRN] : slow request 30.664709 seconds old, received at 2017-09-19 01:22:07.108126: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428265, reqid=client.12636870.0:480468944, at_version=48111'1958918, trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
2017-09-19 01:22:37.773000 7f13bc0b6700 0 log_channel(cluster) log [WRN] : slow request 30.657596 seconds old, received at 2017-09-19 01:22:07.115240: MOSDECSubOpWrite(97.51s3 48111 ECSubWrite(tid=5500858, reqid=client.24892528.0:35633485, at_version=48111'1958524, trim_to=48066'1955458, trim_rollback_to=48111'1958513)) currently started
2017-09-19 01:22:37.785581 7f1381ee9700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f139a4e0700' had timed out after 15
2017-09-19 01:22:37.785583 7f1381ee9700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f139ace1700' had timed out after 15
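In case it helps anyone triaging something similar, I located the first slow request and gauged the volume of timeout spam with something like the following (a rough sketch assuming the default log path; substitute the OSD's id for $ID):

    # first slow request message in the log
    grep -m1 'slow request' /var/log/ceph/ceph-osd.$ID.log

    # number of op thread timeout lines in the same log
    grep -c 'had timed out after 15' /var/log/ceph/ceph-osd.$ID.log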
On Tue, Sep 19, 2017 at 7:31 PM Jordan Share <jordan.sh...@gmail.com> wrote:

> We had suicide timeouts, but unfortunately I can't remember the specific
> root cause at this point.
>
> It was definitely one of two things:
> * net.netfilter.nf_conntrack_max set too low (preventing the OSDs from
> opening new connections to each other)
> * kernel.pid_max or kernel.threads-max set too low (preventing new
> threads from starting)
>
> I am pretty sure we hit pid_max early on, but conntrack_max didn't cause
> trouble until we had enough VMs running to push the total number of
> connections past the default limit.
>
> Our cluster is approximately the same size as yours.
>
> Jordan
>
> On Tue, Sep 19, 2017 at 3:05 PM, Stanley Zhang <stanley.zh...@smxemail.com> wrote:
>
>> We don't use EC pools, but my experience with similar slow requests on
>> RGW + replicated pools is that you need to find the first slow request
>> in the logs and identify where it came from, for example a deep-scrub,
>> a client accessing corrupted objects, or disk errors.
>>
>> On 20/09/17 8:13 AM, David Turner wrote:
>>
>> Starting 3 nights ago, we began seeing OSDs randomly going down in our
>> cluster (Jewel 10.2.7). At first I checked each OSD that had recently
>> been marked down in the cluster (`ceph osd dump | grep -E '^osd\.[0-9]+\s' | sort -nrk11`
>> gives a list of OSDs sorted by the OSD map epoch in which they were
>> most recently marked down), and all of them had been wrongly marked
>> down. Leading up to each of these events there is a lot of evidence of
>> slow requests, OSD op threads timing out, FileStore op threads timing
>> out, and other errors. An excerpt of these errors is at the bottom of
>> this email; it is definitely not a comprehensive log.
>>
>> The epoch in which each OSD was last marked down in the OSD map matches
>> the point in its log where it reports being wrongly marked down. After
>> thousands more lines of op thread timeouts and slow requests, the OSD
>> finally asserts with "hit suicide timeout". The first 2 nights this was
>> localized to 2 OSDs on the same host, but last night it happened on an
>> OSD on a second host. 2 of the 3 OSDs have hit this 3 times and the
>> other has hit it twice.
>>
>> There are 15 OSD nodes and 224 total OSDs in the cluster. So far only
>> these 3 OSDs have hit this FAILED assert. Does anyone have any ideas of
>> what to do next, or is there any information missing here that would
>> help understand things further? It's too intermittent to run the OSD
>> with log level 20.
>>
>> 2017-09-19 02:06:19.712334 7fdfeae86700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdfc7235700' had timed out after 15
>> 2017-09-19 02:06:19.712358 7fdfeae86700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fdfde58f700' had timed out after 60
>> 2017-09-19 02:06:55.974099 7fdfe5b11700 0 log_channel(cluster) log [WRN] : 891 slow requests, 5 included below; oldest blocked for > 150.987639 secs
>> 2017-09-19 02:06:55.974114 7fdfe5b11700 0 log_channel(cluster) log [WRN] : slow request 150.985487 seconds old, received at 2017-09-19 02:04:24.987412: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=4204267, reqid=client.24351228.0:139832730, at_version=48121'1955383, trim_to=48111'1952339, trim_rollback_to=48121'1955373)) currently started
>> 2017-09-19 02:06:55.974123 7fdfe5b11700 0 log_channel(cluster) log [WRN] : slow request 150.510718 seconds old, received at 2017-09-19 02:04:25.462180: MOSDECSubOpWrite(97.4es1 48121 ECSubWrite(tid=5407590, reqid=client.24892528.0:36410074, at_version=48121'1960640, trim_to=48111'1957612, trim_rollback_to=48121'1960635)) currently started
>> 2017-09-19 02:06:55.974128 7fdfe5b11700 0 log_channel(cluster) log [WRN] : slow request 149.177285 seconds old, received at 2017-09-19 02:04:26.795614: MOSDECSubOpWrite(97.1f4s1 48121 ECSubWrite(tid=8793253, reqid=client.24892528.0:36410582, at_version=48121'1964351, trim_to=48111'1961282, trim_rollback_to=48121'1964344)) currently queued_for_pg
>> 2017-09-19 02:06:55.974134 7fdfe5b11700 0 log_channel(cluster) log [WRN] : slow request 147.359320 seconds old, received at 2017-09-19 02:04:28.613578: MOSDECSubOpWrite(97.8s5 48121 ECSubWrite(tid=3228587, reqid=client.18422767.0:329511916, at_version=48121'1965073, trim_to=48111'1962055, trim_rollback_to=48121'1965066)) currently queued_for_pg
>> 2017-09-19 02:06:55.974139 7fdfe5b11700 0 log_channel(cluster) log [WRN] : slow request 146.404241 seconds old, received at 2017-09-19 02:04:29.568657: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=4204291, reqid=client.18422767.0:329512018, at_version=48121'1955389, trim_to=48111'1952339, trim_rollback_to=48121'1955377)) currently queued_for_pg
>> 2017-09-19 02:06:55.974276 7fdf7aa6a700 0 -- 10.10.113.29:6826/3332721 >> 10.10.13.32:6822/130263 pipe(0x7fe003b1c800 sd=67 :24892 s=1 pgs=7268 cs=1 l=0 c=0x7fe01b44be00).connect got RESETSESSION
>> 2017-09-19 02:06:55.974626 7fdf78141700 0 -- 10.10.13.29:6826/3332721 >> 10.10.13.33:6802/2651380 pipe(0x7fe008140800 sd=70 :61127 s=1 pgs=7059 cs=1 l=0 c=0x7fe01be72600).connect got RESETSESSION
>> 2017-09-19 02:06:55.974657 7fdf78e4e700 0 -- 10.10.13.29:6826/3332721 >> 10.10.13.35:6827/477826 pipe(0x7fe004aef400 sd=91 :19464 s=1 pgs=7316 cs=1 l=0 c=0x7fe01b179080).connect got RESETSESSION
>> 2017-09-19 02:06:55.977120 7fdf95310700 0 -- 10.10.13.29:6826/3332721 >> 10.10.13.22:6812/2276428 pipe(0x7fe003d15400 sd=86 :58892 s=1 pgs=15908 cs=1 l=0 c=0x7fe01b2e1500).connect got RESETSESSION
>> 2017-09-19 02:06:55.979722 7fdf9500d700 0 -- 10.10.13.29:6826/3332721 >> 10.10.13.27:6830/2018590 pipe(0x7fe003a06800 sd=98 :42191 s=1 pgs=12697 cs=1 l=0 c=0x7fe01b44d780).connect got RESETSESSION
>> 2017-09-19 02:06:56.106436 7fdfba1dc700 0 -- 10.10.13.29:6826/3332721 >> 10.10.13.27:6811/2018593 pipe(0x7fe009e79400 sd=137 :54582 s=1 pgs=11500 cs=1 l=0 c=0x7fe005820880).connect got RESETSESSION
>> 2017-09-19 02:06:56.107146 7fdfbbaf5700 0 -- 10.10.13.29:6826/3332721 >> 10.10.13.27:6811/2018593 pipe(0x7fe009e79400 sd=137 :54582 s=2 pgs=11602 cs=1 l=0 c=0x7fe005820880).fault, initiating reconnect
>> ---
>> 2017-09-19 02:06:56.213980 7fdfdd58d700 0 log_channel(cluster) log [WRN] : map e48123 wrongly marked me down
>> ---
>> 2017-09-19 03:06:34.778837 7fdfeae86700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fdfde58f700' had timed out after 60
>> 2017-09-19 03:06:34.778840 7fdfeae86700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fdfddd8e700' had timed out after 60
>> 2017-09-19 03:06:39.778908 7fdfeae86700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdfc422f700' had timed out after 15
>> 2017-09-19 03:06:39.778921 7fdfeae86700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdfc5231700' had timed out after 15
>> 2017-09-19 03:06:39.778930 7fdfeae86700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdfc5231700' had suicide timed out after 150
>> 2017-09-19 03:06:39.782749 7fdfeae86700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fdfeae86700 time 2017-09-19 03:06:39.778940
>> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>>
>> --
>> Stanley Zhang | Senior Operations Engineer, SMX Limited
>> Web: http://smxemail.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com