Re: [ceph-users] OSD assert hit suicide timeout

2017-09-20 Thread Brad Hubbard
Start gathering something akin to what sysstat gathers (there are of course
numerous options for this) and go over it carefully. Tools like perf or
oprofile can also provide important clues.

This "looks" like a network issue or some sort of resource shortage.
Comprehensive monitoring and gathering of stats from the period that
coincides with an "event" is definitely where I'd start, as well as looking
at Ceph data such as dump_historic_ops and dump_ops_in_flight and trying to
establish any commonality or indication of where to focus.

There's a trail there, we just need to find it.
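
For example, something like this on an affected OSD host around the time of
an event (just a sketch; osd.12 is a placeholder id and the sampling
intervals are arbitrary):

  # Capture in-flight and recently completed ops from a suspect OSD's admin socket
  ceph daemon osd.12 dump_ops_in_flight > /tmp/osd.12.ops_in_flight.json
  ceph daemon osd.12 dump_historic_ops  > /tmp/osd.12.historic_ops.json

  # System-level sampling (sysstat package): network, disks, run queue
  sar -n DEV 1 60     # per-interface throughput and errors
  sar -d -p 1 60      # per-disk await, utilisation, queue size
  sar -q 1 60         # load / run queue length

  # A quick CPU profile of the whole host while the OSD is struggling
  perf top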

On Thu, Sep 21, 2017 at 7:48 AM, Jordan Share wrote:

> We do have our nf_conntrack_max set to 1048576, but our count is actually
> only between 65k and 101k, so your 256k is probably fine.  You'd have
> noticed in dmesg anyway, I figure, but I thought I'd mention it.
>
> I am out of ideas. :)
>
> Jordan
>
> On Wed, Sep 20, 2017 at 1:28 PM, David Turner wrote:
>
>> We have been running with the following settings for months for pid_max
>> and conntrack_max.  Are these at values you would deem too low?
>>   kernel.pid_max = 4194303
>>   net.nf_conntrack_max = 262144
>>
>>
>> I went back in the logs to see where the slow requests and timeouts
>> started.  It seems to have come out of nowhere.  I included the log for 8
>> minutes prior to the osd op thread timeouts and 8.5 minutes before the
>> first slow request message.  I don't see any indicators at all that it was
>> scrubbing or anything.  I'm also not seeing any SMART errors on the drives.
>>
>> 2017-09-19 01:14:25.539030 7f134326a700  1 leveldb: Delete type=2 #125088
>>
>> 2017-09-19 01:16:25.544797 7f134326a700  1 leveldb: Level-0 table
>> #125303: started
>> 2017-09-19 01:16:25.571103 7f134326a700  1 leveldb: Level-0 table
>> #125303: 1509406 bytes OK
>> 2017-09-19 01:16:25.572424 7f134326a700  1 leveldb: Delete type=0 #125277
>>
>> 2017-09-19 01:18:14.329887 7f134326a700  1 leveldb: Level-0 table
>> #125305: started
>> 2017-09-19 01:18:14.359580 7f134326a700  1 leveldb: Level-0 table
>> #125305: 1503050 bytes OK
>> 2017-09-19 01:18:14.361107 7f134326a700  1 leveldb: Delete type=0 #125302
>>
>> 2017-09-19 01:20:13.239215 7f134326a700  1 leveldb: Level-0 table
>> #125307: started
>> 2017-09-19 01:20:13.268553 7f134326a700  1 leveldb: Level-0 table
>> #125307: 1500713 bytes OK
>> 2017-09-19 01:20:13.270018 7f134326a700  1 leveldb: Delete type=0 #125304
>>
>> 2017-09-19 01:22:24.023066 7f13c142b700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f139a4e0700' had timed out after 15
>> 2017-09-19 01:22:24.023081 7f13c142b700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f139ace1700' had timed out after 15
>>
>> --- these lines repeated thousands of times before we see the first slow
>> request message ---
>>
>> 2017-09-19 01:22:37.770750 7f134ddac700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f139d4e6700' had timed out after 15
>> 2017-09-19 01:22:37.770751 7f134ddac700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f139e4e8700' had timed out after 15
>> 2017-09-19 01:22:37.772928 7f13bc0b6700  0 log_channel(cluster) log [WRN]
>> : 25 slow requests, 5 included below; oldest blocked for > 30.890114 secs
>> 2017-09-19 01:22:37.772946 7f13bc0b6700  0 log_channel(cluster) log [WRN]
>> : slow request 30.726969 seconds old, received at 2017-09-19
>> 01:22:07.045866: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428263,
>> reqid=client.12636870.0:480468942, at_version=48111'1958916,
>> trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
>> 2017-09-19 01:22:37.772957 7f13bc0b6700  0 log_channel(cluster) log [WRN]
>> : slow request 30.668230 seconds old, received at 2017-09-19
>> 01:22:07.104606: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428264,
>> reqid=client.12636870.0:480468943, at_version=48111'1958917,
>> trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
>> 2017-09-19 01:22:37.772984 7f13bc0b6700  0 log_channel(cluster) log [WRN]
>> : slow request 30.057066 seconds old, received at 2017-09-19
>> 01:22:07.715770: MOSDECSubOpWrite(97.2e6s2 48111 ECSubWrite(tid=3560867,
>> reqid=client.18422767.0:328744332, at_version=48111'1953782,
>> trim_to=48066'1950689, trim_rollback_to=48111'1953780)) currently started
>> 2017-09-19 01:22:37.772989 7f13bc0b6700  0 log_channel(cluster) log [WRN]
>> : slow request 30.664709 seconds old, received at 2017-09-19
>> 01:22:07.108126: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428265,
>> reqid=client.12636870.0:480468944, at_version=48111'1958918,
>> trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
>> 2017-09-19 01:22:37.773000 7f13bc0b6700  0 log_channel(cluster) log [WRN]
>> : slow request 30.657596 seconds old, received at 2017-09-19
>> 01:22:07.115240: MOSDECSubOpWrite(97.51s3 48111 ECSubWrite(tid=5500858,
>> reqid=client.24892528.0:35633485, at_version=48111'1958524,
>> trim_to=48066'1955458, trim_rollback_to=48111'19

Re: [ceph-users] OSD assert hit suicide timeout

2017-09-20 Thread Jordan Share
We do have our nf_conntrack_max set to 1048576, but our count is actually
only between 65k and 101k, so your 256k is probably fine.  You'd have
noticed in dmesg anyway, I figure, but I thought I'd mention it.
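
For reference, a quick way to compare the live count against the limit on
each node is something like this (a sketch; on some kernels the knob is
net.nf_conntrack_max rather than net.netfilter.nf_conntrack_max):

  # Current conntrack usage vs. the configured ceiling
  sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

  # The kernel logs this when the table is actually overflowing
  dmesg | grep -i 'nf_conntrack: table full'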

I am out of ideas. :)

Jordan

On Wed, Sep 20, 2017 at 1:28 PM, David Turner wrote:

> We have been running with the following settings for months for pid_max
> and conntrack_max.  Are these at values you would deem too low?
>   kernel.pid_max = 4194303
>   net.nf_conntrack_max = 262144
>
>
> I went back in the logs to see where the slow requests and timeouts
> started.  It seems to have come out of nowhere.  I included the log for 8
> minutes prior to the osd op thread timeouts and 8.5 minutes before the
> first slow request message.  I don't see any indicators at all that it was
> scrubbing or anything.  I'm also not seeing any SMART errors on the drives.
>
> 2017-09-19 01:14:25.539030 7f134326a700  1 leveldb: Delete type=2 #125088
>
> 2017-09-19 01:16:25.544797 7f134326a700  1 leveldb: Level-0 table #125303:
> started
> 2017-09-19 01:16:25.571103 7f134326a700  1 leveldb: Level-0 table #125303:
> 1509406 bytes OK
> 2017-09-19 01:16:25.572424 7f134326a700  1 leveldb: Delete type=0 #125277
>
> 2017-09-19 01:18:14.329887 7f134326a700  1 leveldb: Level-0 table #125305:
> started
> 2017-09-19 01:18:14.359580 7f134326a700  1 leveldb: Level-0 table #125305:
> 1503050 bytes OK
> 2017-09-19 01:18:14.361107 7f134326a700  1 leveldb: Delete type=0 #125302
>
> 2017-09-19 01:20:13.239215 7f134326a700  1 leveldb: Level-0 table #125307:
> started
> 2017-09-19 01:20:13.268553 7f134326a700  1 leveldb: Level-0 table #125307:
> 1500713 bytes OK
> 2017-09-19 01:20:13.270018 7f134326a700  1 leveldb: Delete type=0 #125304
>
> 2017-09-19 01:22:24.023066 7f13c142b700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139a4e0700' had timed out after 15
> 2017-09-19 01:22:24.023081 7f13c142b700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139ace1700' had timed out after 15
>
> --- these lines repeated thousands of times before we see the first slow
> request message ---
>
> 2017-09-19 01:22:37.770750 7f134ddac700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139d4e6700' had timed out after 15
> 2017-09-19 01:22:37.770751 7f134ddac700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139e4e8700' had timed out after 15
> 2017-09-19 01:22:37.772928 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : 25 slow requests, 5 included below; oldest blocked for > 30.890114 secs
> 2017-09-19 01:22:37.772946 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : slow request 30.726969 seconds old, received at 2017-09-19
> 01:22:07.045866: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428263,
> reqid=client.12636870.0:480468942, at_version=48111'1958916,
> trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
> 2017-09-19 01:22:37.772957 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : slow request 30.668230 seconds old, received at 2017-09-19
> 01:22:07.104606: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428264,
> reqid=client.12636870.0:480468943, at_version=48111'1958917,
> trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
> 2017-09-19 01:22:37.772984 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : slow request 30.057066 seconds old, received at 2017-09-19
> 01:22:07.715770: MOSDECSubOpWrite(97.2e6s2 48111 ECSubWrite(tid=3560867,
> reqid=client.18422767.0:328744332, at_version=48111'1953782,
> trim_to=48066'1950689, trim_rollback_to=48111'1953780)) currently started
> 2017-09-19 01:22:37.772989 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : slow request 30.664709 seconds old, received at 2017-09-19
> 01:22:07.108126: MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428265,
> reqid=client.12636870.0:480468944, at_version=48111'1958918,
> trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
> 2017-09-19 01:22:37.773000 7f13bc0b6700  0 log_channel(cluster) log [WRN]
> : slow request 30.657596 seconds old, received at 2017-09-19
> 01:22:07.115240: MOSDECSubOpWrite(97.51s3 48111 ECSubWrite(tid=5500858,
> reqid=client.24892528.0:35633485, at_version=48111'1958524,
> trim_to=48066'1955458, trim_rollback_to=48111'1958513)) currently started
> 2017-09-19 01:22:37.785581 7f1381ee9700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139a4e0700' had timed out after 15
> 2017-09-19 01:22:37.785583 7f1381ee9700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f139ace1700' had timed out after 15
>
>
> On Tue, Sep 19, 2017 at 7:31 PM Jordan Share wrote:
>
>> We had suicide timeouts, but unfortunately I can't remember the specific
>> root cause at this point.
>>
>> It was definitely one of two things:
>>* too low of net.netfilter.nf_conntrack_max (preventing the osds from
>> opening new connections to each other)
>>* too low of kernel.pid_max or kernel.threads-max (preventing new
>> threads from starting)
>>
>> I am p

Re: [ceph-users] OSD assert hit suicide timeout

2017-09-20 Thread David Turner
We have been running with the following settings for months for pid_max and
conntrack_max.  Are these at values you would deem too low?
  kernel.pid_max = 4194303
  net.nf_conntrack_max = 262144


I went back in the logs to see where the slow requests and timeouts
started.  It seems to have come out of nowhere.  I included the log for 8
minutes prior to the osd op thread timeouts and 8.5 minutes before the
first slow request message.  I don't see any indicators at all that it was
scrubbing or anything.  I'm also not seeing any SMART errors on the drives.
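
For anyone wanting to repeat the same checks, something like this covers it
(osd.NN, the log path, and /dev/sdX are placeholders for the actual OSD ids
and devices):

  # First occurrences of slow requests / op thread timeouts in the OSD log
  grep -m1 'slow request'  /var/log/ceph/ceph-osd.NN.log
  grep -m1 'had timed out' /var/log/ceph/ceph-osd.NN.log

  # Any scrub activity logged around that window
  grep -i 'scrub' /var/log/ceph/ceph-osd.NN.log | tail

  # SMART health of the underlying drive
  smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrect|overall-health'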

2017-09-19 01:14:25.539030 7f134326a700  1 leveldb: Delete type=2 #125088

2017-09-19 01:16:25.544797 7f134326a700  1 leveldb: Level-0 table #125303:
started
2017-09-19 01:16:25.571103 7f134326a700  1 leveldb: Level-0 table #125303:
1509406 bytes OK
2017-09-19 01:16:25.572424 7f134326a700  1 leveldb: Delete type=0 #125277

2017-09-19 01:18:14.329887 7f134326a700  1 leveldb: Level-0 table #125305:
started
2017-09-19 01:18:14.359580 7f134326a700  1 leveldb: Level-0 table #125305:
1503050 bytes OK
2017-09-19 01:18:14.361107 7f134326a700  1 leveldb: Delete type=0 #125302

2017-09-19 01:20:13.239215 7f134326a700  1 leveldb: Level-0 table #125307:
started
2017-09-19 01:20:13.268553 7f134326a700  1 leveldb: Level-0 table #125307:
1500713 bytes OK
2017-09-19 01:20:13.270018 7f134326a700  1 leveldb: Delete type=0 #125304

2017-09-19 01:22:24.023066 7f13c142b700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f139a4e0700' had timed out after 15
2017-09-19 01:22:24.023081 7f13c142b700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f139ace1700' had timed out after 15

--- these lines repeated thousands of times before we see the first slow
request message ---

2017-09-19 01:22:37.770750 7f134ddac700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f139d4e6700' had timed out after 15
2017-09-19 01:22:37.770751 7f134ddac700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f139e4e8700' had timed out after 15
2017-09-19 01:22:37.772928 7f13bc0b6700  0 log_channel(cluster) log [WRN] :
25 slow requests, 5 included below; oldest blocked for > 30.890114 secs
2017-09-19 01:22:37.772946 7f13bc0b6700  0 log_channel(cluster) log [WRN] :
slow request 30.726969 seconds old, received at 2017-09-19 01:22:07.045866:
MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428263,
reqid=client.12636870.0:480468942, at_version=48111'1958916,
trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
2017-09-19 01:22:37.772957 7f13bc0b6700  0 log_channel(cluster) log [WRN] :
slow request 30.668230 seconds old, received at 2017-09-19 01:22:07.104606:
MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428264,
reqid=client.12636870.0:480468943, at_version=48111'1958917,
trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
2017-09-19 01:22:37.772984 7f13bc0b6700  0 log_channel(cluster) log [WRN] :
slow request 30.057066 seconds old, received at 2017-09-19 01:22:07.715770:
MOSDECSubOpWrite(97.2e6s2 48111 ECSubWrite(tid=3560867,
reqid=client.18422767.0:328744332, at_version=48111'1953782,
trim_to=48066'1950689, trim_rollback_to=48111'1953780)) currently started
2017-09-19 01:22:37.772989 7f13bc0b6700  0 log_channel(cluster) log [WRN] :
slow request 30.664709 seconds old, received at 2017-09-19 01:22:07.108126:
MOSDECSubOpWrite(97.2ebs4 48111 ECSubWrite(tid=6428265,
reqid=client.12636870.0:480468944, at_version=48111'1958918,
trim_to=48066'1955886, trim_rollback_to=48111'1958910)) currently started
2017-09-19 01:22:37.773000 7f13bc0b6700  0 log_channel(cluster) log [WRN] :
slow request 30.657596 seconds old, received at 2017-09-19 01:22:07.115240:
MOSDECSubOpWrite(97.51s3 48111 ECSubWrite(tid=5500858,
reqid=client.24892528.0:35633485, at_version=48111'1958524,
trim_to=48066'1955458, trim_rollback_to=48111'1958513)) currently started
2017-09-19 01:22:37.785581 7f1381ee9700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f139a4e0700' had timed out after 15
2017-09-19 01:22:37.785583 7f1381ee9700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f139ace1700' had timed out after 15


On Tue, Sep 19, 2017 at 7:31 PM Jordan Share wrote:

> We had suicide timeouts, but unfortunately I can't remember the specific
> root cause at this point.
>
> It was definitely one of two things:
>* too low of net.netfilter.nf_conntrack_max (preventing the osds from
> opening new connections to each other)
>* too low of kernel.pid_max or kernel.threads-max (preventing new
> threads from starting)
>
> I am pretty sure we hit the pid_max early on, but the conntrack_max didn't
> cause trouble until we had enough VMs running to push the total number of
> connections past the default limit.
>
> Our cluster is approximately the same size as yours.
>
> Jordan
>
> On Tue, Sep 19, 2017 at 3:05 PM, Stanley Zhang wrote:
>
>> We don't use EC pools, but my experience with similar slow requests on
>> RGW+replicated_pools is that in the logs you need to find out t

Re: [ceph-users] OSD assert hit suicide timeout

2017-09-19 Thread Jordan Share
We had suicide timeouts, but unfortunately I can't remember the specific
root cause at this point.

It was definitely one of two things:
   * too low of net.netfilter.nf_conntrack_max (preventing the osds from
opening new connections to each other)
   * too low of kernel.pid_max or kernel.threads-max (preventing new
threads from starting)

I am pretty sure we hit the pid_max early on, but the conntrack_max didn't
cause trouble until we had enough VMs running to push the total number of
connections past the default limit.
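
If it helps, a rough way to compare those limits against actual usage on an
OSD node is something like:

  # Configured ceilings
  sysctl kernel.pid_max kernel.threads-max net.netfilter.nf_conntrack_max

  # Roughly how much of them is in use right now
  ps -eLf | wc -l                                  # total threads on the box
  cat /proc/sys/net/netfilter/nf_conntrack_count   # current tracked connections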

Our cluster is approximately the same size as yours.

Jordan

On Tue, Sep 19, 2017 at 3:05 PM, Stanley Zhang wrote:

> We don't use EC pools, but my experience with similar slow requests on
> RGW+replicated_pools is that you need to find the first slow request in the
> logs and identify where it came from: for example, a deep-scrub, a client
> accessing corrupted objects, disk errors, etc.
>
> On 20/09/17 8:13 AM, David Turner wrote:
>
> Starting 3 nights ago we began seeing OSDs randomly going down in our
> cluster (Jewel 10.2.7).  At first I looked at each OSD that was recently
> marked down in the cluster (`ceph osd dump | grep -E '^osd\.[0-9]+\s' |
> sort -nrk11` lists OSDs sorted by the OSD map epoch in which they were most
> recently marked down) and all of them had been wrongly marked down.
> Prior to this there is a lot of evidence of slow requests, the OSD op thread
> timing out, the FileStore op thread timing out, and other errors.  At the
> bottom of the email is an excerpt of such errors.  It is definitely not a
> comprehensive log of these errors.
>
> The map epoch of the last time the OSD is marked down in the OSD map
> matches when it logs that it was wrongly marked down.  After thousands more
> lines of op threads timing out and slow requests, the OSD finally asserts
> with "hit suicide timeout".  The first 2 nights this was localized to 2
> osds on the same host, but last night this happened on an osd that is on a
> second host.  2 of the 3 OSDs have hit this 3 times and the other has hit
> it twice.
>
> There are 15 OSD nodes and 224 total OSDs in the cluster.  So far only
> these 3 OSDs have hit this FAILED assert.  Does anyone have any ideas of
> what to do next?  Or have any information that is missing here to be able
> to understand things further?  It's too intermittent to run the OSD with
> log level 20.
>
> 2017-09-19 02:06:19.712334 7fdfeae86700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fdfc7235700' had timed out after 15
> 2017-09-19 02:06:19.712358 7fdfeae86700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fdfde58f700' had timed out after 60
> 2017-09-19 02:06:55.974099 7fdfe5b11700  0 log_channel(cluster) log [WRN]
> : 891 slow requests, 5 included below; oldest blocked for > 150.987639 secs
> 2017-09-19 02:06:55.974114 7fdfe5b11700  0 log_channel(cluster) log [WRN]
> : slow request 150.985487 seconds old, received at 2017-09-19
> 02:04:24.987412: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=420
> 4267, reqid=client.24351228.0:139832730, at_version=48121'1955383,
> trim_to=48111'1952339, trim_rollback_to=48121'1955373)) currently started
> 2017-09-19 02:06:55.974123 7fdfe5b11700  0 log_channel(cluster) log [WRN]
> : slow request 150.510718 seconds old, received at 2017-09-19
> 02:04:25.462180: MOSDECSubOpWrite(97.4es1 48121 ECSubWrite(tid=540
> 7590, reqid=client.24892528.0:36410074, at_version=48121'1960640,
> trim_to=48111'1957612, trim_rollback_to=48121'1960635)) currently started
> 2017-09-19 02:06:55.974128 7fdfe5b11700  0 log_channel(cluster) log [WRN]
> : slow request 149.177285 seconds old, received at 2017-09-19
> 02:04:26.795614: MOSDECSubOpWrite(97.1f4s1 48121 ECSubWrite(tid=87
> 93253, reqid=client.24892528.0:36410582, at_version=48121'1964351,
> trim_to=48111'1961282, trim_rollback_to=48121'1964344)) currently
> queued_for_pg
> 2017-09-19 02:06:55.974134 7fdfe5b11700  0 log_channel(cluster) log [WRN]
> : slow request 147.359320 seconds old, received at 2017-09-19
> 02:04:28.613578: MOSDECSubOpWrite(97.8s5 48121 ECSubWrite(tid=3228
> 587, reqid=client.18422767.0:329511916, at_version=48121'1965073,
> trim_to=48111'1962055, trim_rollback_to=48121'1965066)) currently
> queued_for_pg
> 2017-09-19 02:06:55.974139 7fdfe5b11700  0 log_channel(cluster) log [WRN]
> : slow request 146.404241 seconds old, received at 2017-09-19
> 02:04:29.568657: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=420
> 4291, reqid=client.18422767.0:329512018, at_version=48121'1955389,
> trim_to=48111'1952339, trim_rollback_to=48121'1955377)) currently
> queued_for_pg
> 2017-09-19 02:06:55.974276 7fdf7aa6a700  0 -- 10.10.113.29:6826/3332721
> >> 10.10.13.32:6822/130263 pipe(0x7fe003b1c800 sd=67 :24892 s=1 pgs=7268
> cs=1 l=0 c=0x7fe01b44be00).connect got RESETSESSION
> 2017-09-19 02:06:55.974626 7fdf78141700  0 -- 10.10.13.29:6826/3332721 >>
> 10.10.13.33:6802/2651380 pipe(0x7fe008140800 sd=70 :61127 s=1 pgs=7059
> cs=1 l=0 c=0x7fe01be7260

Re: [ceph-users] OSD assert hit suicide timeout

2017-09-19 Thread Stanley Zhang
We don't use EC pools, but my experience with similar slow requests on
RGW+replicated_pools is that you need to find the first slow request in the
logs and identify where it came from: for example, a deep-scrub, a client
accessing corrupted objects, disk errors, etc.
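
For example, something along these lines against the OSD log usually narrows
it down (a sketch; adjust the log path and OSD id):

  # The very first slow request the OSD reported
  grep -m1 'slow request' /var/log/ceph/ceph-osd.NN.log

  # Tally which client reqids the blocked ops belong to
  grep 'slow request' /var/log/ceph/ceph-osd.NN.log \
      | grep -oE 'client\.[0-9]+\.[0-9]+:[0-9]+' \
      | sort | uniq -c | sort -rn | head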



On 20/09/17 8:13 AM, David Turner wrote:
Starting 3 nights ago we began seeing OSDs randomly going down in our
cluster (Jewel 10.2.7).  At first I looked at each OSD that was recently
marked down in the cluster (`ceph osd dump | grep -E '^osd\.[0-9]+\s' |
sort -nrk11` lists OSDs sorted by the OSD map epoch in which they were most
recently marked down) and all of them had been wrongly marked down.  Prior
to this there is a lot of evidence of slow requests, the OSD op thread
timing out, the FileStore op thread timing out, and other errors.  At the
bottom of the email is an excerpt of such errors.  It is definitely not a
comprehensive log of these errors.


The map epoch of the last time the OSD is marked down in the OSD map 
matches when it logs that it was wrongly marked down.  After thousands 
more lines of op threads timing out and slow requests, the OSD finally 
asserts with "hit suicide timeout".  The first 2 nights this was 
localized to 2 osds on the same host, but last night this happened on 
an osd that is on a second host.  2 of the 3 OSDs have hit this 3 
times and the other has hit it twice.


There are 15 OSD nodes and 224 total OSDs in the cluster. So far only 
these 3 OSDs have hit this FAILED assert.  Does anyone have any ideas 
of what to do next?  Or have any information that is missing here to 
be able to understand things further?  It's too intermittent to run 
the OSD with log level 20.


2017-09-19 02:06:19.712334 7fdfeae86700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fdfc7235700' had timed out after 15
2017-09-19 02:06:19.712358 7fdfeae86700  1 heartbeat_map is_healthy 
'FileStore::op_tp thread 0x7fdfde58f700' had timed out after 60
2017-09-19 02:06:55.974099 7fdfe5b11700  0 log_channel(cluster) log 
[WRN] : 891 slow requests, 5 included below; oldest blocked for > 
150.987639 secs
2017-09-19 02:06:55.974114 7fdfe5b11700  0 log_channel(cluster) log 
[WRN] : slow request 150.985487 seconds old, received at 2017-09-19 
02:04:24.987412: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=420
4267, reqid=client.24351228.0:139832730, at_version=48121'1955383, 
trim_to=48111'1952339, trim_rollback_to=48121'1955373)) currently started
2017-09-19 02:06:55.974123 7fdfe5b11700  0 log_channel(cluster) log 
[WRN] : slow request 150.510718 seconds old, received at 2017-09-19 
02:04:25.462180: MOSDECSubOpWrite(97.4es1 48121 ECSubWrite(tid=540
7590, reqid=client.24892528.0:36410074, at_version=48121'1960640, 
trim_to=48111'1957612, trim_rollback_to=48121'1960635)) currently started
2017-09-19 02:06:55.974128 7fdfe5b11700  0 log_channel(cluster) log 
[WRN] : slow request 149.177285 seconds old, received at 2017-09-19 
02:04:26.795614: MOSDECSubOpWrite(97.1f4s1 48121 ECSubWrite(tid=87
93253, reqid=client.24892528.0:36410582, at_version=48121'1964351, 
trim_to=48111'1961282, trim_rollback_to=48121'1964344)) currently 
queued_for_pg
2017-09-19 02:06:55.974134 7fdfe5b11700  0 log_channel(cluster) log 
[WRN] : slow request 147.359320 seconds old, received at 2017-09-19 
02:04:28.613578: MOSDECSubOpWrite(97.8s5 48121 ECSubWrite(tid=3228
587, reqid=client.18422767.0:329511916, at_version=48121'1965073, 
trim_to=48111'1962055, trim_rollback_to=48121'1965066)) currently 
queued_for_pg
2017-09-19 02:06:55.974139 7fdfe5b11700  0 log_channel(cluster) log 
[WRN] : slow request 146.404241 seconds old, received at 2017-09-19 
02:04:29.568657: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=420
4291, reqid=client.18422767.0:329512018, at_version=48121'1955389, 
trim_to=48111'1952339, trim_rollback_to=48121'1955377)) currently 
queued_for_pg
2017-09-19 02:06:55.974276 7fdf7aa6a700  0 -- 10.10.113.29:6826/3332721 >>
10.10.13.32:6822/130263 pipe(0x7fe003b1c800 sd=67 :24892 s=1 pgs=7268 cs=1
l=0 c=0x7fe01b44be00).connect got RESETSESSION
2017-09-19 02:06:55.974626 7fdf78141700  0 -- 10.10.13.29:6826/3332721 >>
10.10.13.33:6802/2651380 pipe(0x7fe008140800 sd=70 :61127 s=1 pgs=7059 cs=1
l=0 c=0x7fe01be72600).connect got RESETSESSION
2017-09-19 02:06:55.974657 7fdf78e4e700  0 -- 10.10.13.29:6826/3332721 >>
10.10.13.35:6827/477826 pipe(0x7fe004aef400 sd=91 :19464 s=1 pgs=7316 cs=1
l=0 c=0x7fe01b179080).connect got RESETSESSION
2017-09-19 02:06:55.977120 7fdf95310700  0 -- 10.10.13.29:6826/3332721 >>
10.10.13.22:6812/2276428 pipe(0x7fe003d15400 sd=86 :58892 s=1 pgs=15908 cs=1
l=0 c=0x7fe01b2e1500).connect got RESETSESSION
2017-09-19 02:06:55.979722 7fdf