We do have our nf_conntrack_max set to 1048576, but our count is actually
only between 65k and 101k, so your 256k is probably fine.  You'd have
noticed in dmesg anyway, I figure, but I thought I'd mention it.
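
For what it's worth, a quick way to compare the live count against the
limit and to check dmesg for drops (a minimal sketch; it assumes the
standard netfilter sysctl names):

  # current tracked connections vs. the configured ceiling
  sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
  # the kernel logs a "table full, dropping packet" message if the limit is hit
  dmesg | grep -i 'table full'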

I am out of ideas. :)

Jordan

On Wed, Sep 20, 2017 at 1:28 PM, David Turner <drakonst...@gmail.com> wrote:

> We have been running with the following settings for months for pid_max
> and conntrack_max.  Are these at values you would deem too low?
>   kernel.pid_max = 4194303
>   net.nf_conntrack_max = 262144
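>
> For reference, a quick way to compare thread usage against those limits
> (a rough sketch; the only assumption is the ceph-osd process name):
>
>   # total threads on the host vs. the kernel ceilings
>   ps -eLf | wc -l
>   sysctl kernel.pid_max kernel.threads-max
>   # threads belonging to the OSD daemons specifically
>   ps -eLf | grep '[c]eph-osd' | wc -l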
>
>
> I went back in the logs to see where the slow requests and timeouts
> started.  It seems to have come out of nowhere.  I included the log from 8
> minutes prior to the OSD op thread timeouts and 8.5 minutes before the
> first slow request message.  I don't see any indication at all that it was
> scrubbing or anything like that.  I'm also not seeing any SMART errors on
> the drives.
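>
> (For the SMART side, a quick pass over the drives looks something like the
> sketch below; the /dev/sd? glob matching every data disk is an assumption
> about the device naming here.)
>
>   # smartctl is from the smartmontools package
>   for dev in /dev/sd?; do
>     echo "== $dev =="
>     smartctl -H "$dev"
>   done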
>
> 2017-09-19 01:14:25.539030 7f134326a700  1 leveldb: Delete type=2 #125088
>
> 2017-09-19 01:16:25.544797 7f134326a700  1 leveldb: Level-0 table #125303:
> started
> 2017-09-19 01:16:25.571103 7f134326a700  1 leveldb: Level-0 table #125303:
> 1509406 bytes OK
> 2017-09-19 01:16:25.572424 7f134326a700  1 leveldb: Delete type=0 #125277
>
> 2017-09-19 01:18:14.329887 7f134326a700  1 leveldb: Level-0 table #125305:
> started
> 2017-09-19 01:18:14.359580 7f134326a700  1 leveldb: Level-0 table #125305:
> 1503050 bytes OK
> 2017-09-19 01:18:14.361107 7f134326a700  1 leveldb: Delete type=0 #125302
>
> 2017-09-19 01:20:13.239215 7f134326a700  1 leveldb: Level-0 table #125307:
> started
> 2017-09-19 01:20:13.268553 7f134326a700  1 leveldb: Level-0 table #125307:
> 1500713 bytes OK
> 2017-09-19 01:20:13.270018 7f134326a700  1 leveldb: Delete type=0 #125304
>
> 2017-09-19 01:22:24.023066 7f13c142b700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139a4e0700' had timed out after 15
> 2017-09-19 01:22:24.023081 7f13c142b700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139ace1700' had timed out after 15
>
> --- these lines repeated thousands of times before we see the first slow
> request message ---
>
> 2017-09-19 01:22:37.770750 7f134ddac700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139d4e6700' had timed out after 15
> 2017-09-19 01:22:37.770751 7f134ddac700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139e4e8700' had timed out after 15
> 2017-09-19 01:22:37.772928 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : 25 slow requests, 5 included below; oldest blocked for > 30.890114 secs
> 2017-09-19 01:22:37.772946 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : slow request 30.726969 seconds old, received at 2017-09-19
> 01:22:07.045866: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428263,
> reqid=client.12636870.0:480468942, at_version=48111'1958916,
> trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
> 2017-09-19 01:22:37.772957 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : slow request 30.668230 seconds old, received at 2017-09-19
> 01:22:07.104606: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428264,
> reqid=client.12636870.0:480468943, at_version=48111'1958917,
> trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
> 2017-09-19 01:22:37.772984 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : slow request 30.057066 seconds old, received at 2017-09-19
> 01:22:07.715770: MOSDECSubOpWrite(97.2e6s2 48111 ECSubWrite(tid=3560867,
> reqid=client.18422767.0:328744332, at_version=48111'1953782,
> trim_to=48066'1950689, trim_rollback_to=48111'1953780)) currently started
> 2017-09-19 01:22:37.772989 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : slow request 30.664709 seconds old, received at 2017-09-19
> 01:22:07.108126: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428265,
> reqid=client.12636870.0:480468944, at_version=48111'1958918,
> trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
> 2017-09-19 01:22:37.773000 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : slow request 30.657596 seconds old, received at 2017-09-19
> 01:22:07.115240: MOSDECSubOpWrite(97.51s3 48111 ECSubWrite(tid=5500858,
> reqid=client.24892528.0:35633485, at_version=48111'1958524,
> trim_to=48066'1955458, trim_rollback_to=48111'1958513)) currently started
> 2017-09-19 01:22:37.785581 7f1381ee9700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139a4e0700' had timed out after 15
> 2017-09-19 01:22:37.785583 7f1381ee9700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139ace1700' had timed out after 15
>
>
> On Tue, Sep 19, 2017 at 7:31 PM Jordan Share <jordan.sh...@gmail.com>
> wrote:
>
>> We had suicide timeouts, but unfortunately I can't remember the specific
>> root cause at this point.
>>
>> It was definitely one of two things:
>>    * too low of net.netfilter.nf_conntrack_max (preventing the osds from
>> opening new connections to each other)
>>    * too low of kernel.pid_max or kernel.threads-max (preventing new
>> threads from starting)
>>
>> I am pretty sure we hit the pid_max early on, but the conntrack_max
>> didn't cause trouble until we had enough VMs running to push the total
>> number of connections past the default limit.
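>>
>> If they do need raising, a minimal sketch of persisting new values (the
>> file name and the numbers below are only examples, not a recommendation):
>>
>>   # /etc/sysctl.d/90-ceph.conf  (example values only)
>>   kernel.pid_max = 4194303
>>   net.netfilter.nf_conntrack_max = 1048576
>>
>>   # load all sysctl.d files without a reboot
>>   sysctl --system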
>>
>> Our cluster is approximately the same size as yours.
>>
>> Jordan
>>
>> On Tue, Sep 19, 2017 at 3:05 PM, Stanley Zhang <
>> stanley.zh...@smxemail.com> wrote:
>>
>>> We don't use EC pools, but my experience with similar slow requests on
>>> RGW + replicated pools is that you need to find the first slow request in
>>> the logs and identify where it came from: for example, a deep-scrub, a
>>> client accessing corrupted objects, disk errors, etc.
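>>>
>>> Something along these lines is usually enough to find that first
>>> occurrence (a rough sketch; the default /var/log/ceph path and the OSD_ID
>>> placeholder are assumptions):
>>>
>>>   # first slow request logged by the affected OSD
>>>   grep -m1 'slow request' /var/log/ceph/ceph-osd.$OSD_ID.log
>>>   # any deep-scrub activity around that time?
>>>   grep 'deep-scrub' /var/log/ceph/ceph-osd.$OSD_ID.log | tail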
>>>
>>> On 20/09/17 8:13 AM, David Turner wrote:
>>>
>>> Starting 3 nights ago we began seeing OSDs randomly going down in our
>>> cluster (Jewel 10.2.7).  At first I looked at each OSD that had recently
>>> been marked down in the cluster (`ceph osd dump | grep -E '^osd\.[0-9]+\s'
>>> | sort -nrk11` gives a list of OSDs sorted by the OSD map epoch in which
>>> they were most recently marked down), and all of them had been wrongly
>>> marked down.  Prior to this there is a lot of evidence of slow requests,
>>> OSD op threads timing out, FileStore op threads timing out, and other
>>> errors.  At the bottom of this email is an excerpt of those errors; it is
>>> definitely not a comprehensive log.
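>>>
>>> (Spelled out, that one-liner is just the following; as far as I can tell
>>> field 11 of each osd line in `ceph osd dump` is the down_at epoch, which
>>> is why the sort puts the most recently marked-down OSDs first:)
>>>
>>>   # OSDs sorted by the epoch in which they were last marked down, newest first
>>>   ceph osd dump | grep -E '^osd\.[0-9]+\s' | sort -nrk11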
>>>
>>> The epoch in which each OSD was last marked down in the OSD map matches
>>> the epoch in which it logs that it was wrongly marked down.  After
>>> thousands more lines of op threads timing out and slow requests, the OSD
>>> finally asserts with "hit suicide timeout".  The first 2 nights this was
>>> localized to 2 OSDs on the same host, but last night it also happened on
>>> an OSD on a second host.  2 of the 3 OSDs have hit this 3 times and the
>>> other has hit it twice.
>>>
>>> There are 15 OSD nodes and 224 total OSDs in the cluster.  So far only
>>> these 3 OSDs have hit this FAILED assert.  Does anyone have any ideas of
>>> what to do next, or is there any information missing here that would help
>>> understand this further?  It's too intermittent to run the OSD at log
>>> level 20.
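>>>
>>> (A middle ground might be raising debug only on the affected OSDs at
>>> runtime once the problem starts, rather than running at level 20 all the
>>> time; roughly like the sketch below, where NN stands for the OSD id:)
>>>
>>>   # raise logging on one OSD while the problem is happening
>>>   ceph tell osd.NN injectargs '--debug-osd 20 --debug-ms 1'
>>>   # and drop it back down afterwards
>>>   ceph tell osd.NN injectargs '--debug-osd 0 --debug-ms 0'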
>>>
>>> 2017-09-19 02:06:19.712334 7fdfeae86700  1 heartbeat_map is_healthy
>>> 'OSD::osd_op_tp thread 0x7fdfc7235700' had timed out after 15
>>> 2017-09-19 02:06:19.712358 7fdfeae86700  1 heartbeat_map is_healthy
>>> 'FileStore::op_tp thread 0x7fdfde58f700' had timed out after 60
>>> 2017-09-19 02:06:55.974099 7fdfe5b11700  0 log_channel(cluster) log
>>> [WRN] : 891 slow requests, 5 included below; oldest blocked for >
>>> 150.987639 secs
>>> 2017-09-19 02:06:55.974114 7fdfe5b11700  0 log_channel(cluster) log
>>> [WRN] : slow request 150.985487 seconds old, received at 2017-09-19
>>> 02:04:24.987412: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=420
>>> 4267, reqid=client.24351228.0:139832730, at_version=48121'1955383,
>>> trim_to=48111'1952339, trim_rollback_to=48121'1955373)) currently
>>> started
>>> 2017-09-19 02:06:55.974123 7fdfe5b11700  0 log_channel(cluster) log
>>> [WRN] : slow request 150.510718 seconds old, received at 2017-09-19
>>> 02:04:25.462180: MOSDECSubOpWrite(97.4es1 48121 ECSubWrite(tid=540
>>> 7590, reqid=client.24892528.0:36410074, at_version=48121'1960640,
>>> trim_to=48111'1957612, trim_rollback_to=48121'1960635)) currently
>>> started
>>> 2017-09-19 02:06:55.974128 7fdfe5b11700  0 log_channel(cluster) log
>>> [WRN] : slow request 149.177285 seconds old, received at 2017-09-19
>>> 02:04:26.795614: MOSDECSubOpWrite(97.1f4s1 48121 ECSubWrite(tid=87
>>> 93253, reqid=client.24892528.0:36410582, at_version=48121'1964351,
>>> trim_to=48111'1961282, trim_rollback_to=48121'1964344)) currently
>>> queued_for_pg
>>> 2017-09-19 02:06:55.974134 7fdfe5b11700  0 log_channel(cluster) log
>>> [WRN] : slow request 147.359320 seconds old, received at 2017-09-19
>>> 02:04:28.613578: MOSDECSubOpWrite(97.8s5 48121 ECSubWrite(tid=3228
>>> 587, reqid=client.18422767.0:329511916, at_version=48121'1965073,
>>> trim_to=48111'1962055, trim_rollback_to=48121'1965066)) currently
>>> queued_for_pg
>>> 2017-09-19 02:06:55.974139 7fdfe5b11700  0 log_channel(cluster) log
>>> [WRN] : slow request 146.404241 seconds old, received at 2017-09-19
>>> 02:04:29.568657: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=420
>>> 4291, reqid=client.18422767.0:329512018, at_version=48121'1955389,
>>> trim_to=48111'1952339, trim_rollback_to=48121'1955377)) currently
>>> queued_for_pg
>>> 2017-09-19 02:06:55.974276 7fdf7aa6a700  0 -- 10.10.113.29:6826/3332721
>>> >> 10.10.13.32:6822/130263 pipe(0x7fe003b1c800 sd=67 :24892 s=1
>>> pgs=7268 cs=1 l=0 c=0x7fe01b44be00).connect got RESETSESSION
>>> 2017-09-19 02:06:55.974626 7fdf78141700  0 -- 10.10.13.29:6826/3332721
>>> >> 10.10.13.33:6802/2651380 pipe(0x7fe008140800 sd=70 :61127 s=1
>>> pgs=7059 cs=1 l=0 c=0x7fe01be72600).connect got RESETSESSION
>>> 2017-09-19 02:06:55.974657 7fdf78e4e700  0 -- 10.10.13.29:6826/3332721
>>> >> 10.10.13.35:6827/477826 pipe(0x7fe004aef400 sd=91 :19464 s=1
>>> pgs=7316 cs=1 l=0 c=0x7fe01b179080).connect got RESETSESSION
>>> 2017-09-19 02:06:55.977120 7fdf95310700  0 -- 10.10.13.29:6826/3332721
>>> >> 10.10.13.22:6812/2276428 pipe(0x7fe003d15400 sd=86 :58892 s=1
>>> pgs=15908 cs=1 l=0 c=0x7fe01b2e1500).connect got RESETSESSION
>>> 2017-09-19 02:06:55.979722 7fdf9500d700  0 -- 10.10.13.29:6826/3332721
>>> >> 10.10.13.27:6830/2018590 pipe(0x7fe003a06800 sd=98 :42191 s=1
>>> pgs=12697 cs=1 l=0 c=0x7fe01b44d780).connect got RESETSESSION
>>> 2017-09-19 02:06:56.106436 7fdfba1dc700  0 -- 10.10.13.29:6826/3332721
>>> >> 10.10.13.27:6811/2018593 pipe(0x7fe009e79400 sd=137 :54582 s=1
>>> pgs=11500 cs=1 l=0 c=0x7fe005820880).connect got RESETSESSION
>>> 2017-09-19 02:06:56.107146 7fdfbbaf5700  0 -- 10.10.13.29:6826/3332721
>>> >> 10.10.13.27:6811/2018593 pipe(0x7fe009e79400 sd=137 :54582 s=2
>>> pgs=11602 cs=1 l=0 c=0x7fe005820880).fault, initiating reconnect
>>> ---
>>> 2017-09-19 02:06:56.213980 7fdfdd58d700  0 log_channel(cluster) log
>>> [WRN] : map e48123 wrongly marked me down
>>> ---
>>> 2017-09-19 03:06:34.778837 7fdfeae86700  1 heartbeat_map is_healthy
>>> 'FileStore::op_tp thread 0x7fdfde58f700' had timed out after 60
>>> 2017-09-19 03:06:34.778840 7fdfeae86700  1 heartbeat_map is_healthy
>>> 'FileStore::op_tp thread 0x7fdfddd8e700' had timed out after 60
>>> 2017-09-19 03:06:39.778908 7fdfeae86700  1 heartbeat_map is_healthy
>>> 'OSD::osd_op_tp thread 0x7fdfc422f700' had timed out after 15
>>> 2017-09-19 03:06:39.778921 7fdfeae86700  1 heartbeat_map is_healthy
>>> 'OSD::osd_op_tp thread 0x7fdfc5231700' had timed out after 15
>>> 2017-09-19 03:06:39.778930 7fdfeae86700  1 heartbeat_map is_healthy
>>> 'OSD::osd_op_tp thread 0x7fdfc5231700' had suicide timed out after 150
>>> 2017-09-19 03:06:39.782749 7fdfeae86700 -1 common/HeartbeatMap.cc: In
>>> function 'bool ceph::HeartbeatMap::_check(const
>>> ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fdfeae86700 time
>>> 2017-09-19
>>>  03:06:39.778940
>>> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>>>
>>>
>>>
>>>
>>> --
>>>
>>> *Stanley Zhang | * Senior Operations Engineer
>>> *Telephone:* +64 9 302 0515  *Fax:* +64 9 302 0518
>>> *Mobile:* +64 22 318 3664  *Freephone:* 0800 SMX SMX (769 769)
>>> *SMX Limited:* Level 15, 19 Victoria Street West, Auckland, New Zealand
>>> *Web:* http://smxemail.com
>>> This email has been filtered by SMX. For more information visit
>>> smxemail.com.
>>>
>>>
>>>
>>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
