Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-11 Thread Alex Gorbachev
On Mon, Mar 5, 2018 at 11:20 PM, Brad Hubbard  wrote:
>
> Working on this, thanks.
>
> See https://tracker.ceph.com/issues/23236

As noted in the tracker, but I think this would be useful to others as well:

If something like the below comes up, how do you troubleshoot the cause
of past events, especially when the problem is specific to just a
handful of OSDs out of thousands spread across many hosts?

2018-03-11 22:00:00.000132 mon.roc-vm-sc3c234 [INF] overall HEALTH_OK
2018-03-11 22:44:46.173825 mon.roc-vm-sc3c234 [WRN] Health check
failed: 12 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:44:52.245738 mon.roc-vm-sc3c234 [WRN] Health check
update: 9 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:44:57.925686 mon.roc-vm-sc3c234 [WRN] Health check
update: 10 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:45:02.926031 mon.roc-vm-sc3c234 [WRN] Health check
update: 14 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:45:06.413741 mon.roc-vm-sc3c234 [INF] Health check
cleared: REQUEST_SLOW (was: 14 slow requests are blocked > 32 sec)
2018-03-11 22:45:06.413814 mon.roc-vm-sc3c234 [INF] Cluster is now healthy
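
A rough sketch of what can still be checked after the fact, assuming
admin socket access on the OSD hosts (osd.12 is illustrative, and the
JSON field names may differ between releases):

# Recent completed ops retained by a suspect OSD, with their durations
ceph daemon osd.12 dump_historic_ops

# Ops currently in flight (only useful while the slow requests are live)
ceph daemon osd.12 dump_ops_in_flight

# Quick per-OSD latency overview to pick candidate OSDs to query
ceph osd perf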

Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-05 Thread Brad Hubbard
On Fri, Mar 2, 2018 at 3:54 PM, Alex Gorbachev  wrote:
> So where are the details?

Working on this, thanks.

See https://tracker.ceph.com/issues/23236


-- 
Cheers,
Brad


Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-03 Thread Alex Gorbachev
On Fri, Mar 2, 2018 at 9:56 AM, Alex Gorbachev  wrote:

Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-02 Thread Alex Gorbachev
On Fri, Mar 2, 2018 at 4:17 AM Maged Mokhtar  wrote:

> Note that VMotion generates an I/O pattern that keeps hitting the same
> OSD with concurrent small writes; if you can use RBD striping, creating
> 64k stripes may help. A controller with a write-back cache will also help.
>
> Maged
>

Thank you, Maged. This does seem to be related to compression. I did two

Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-02 Thread Maged Mokhtar
On 2018-03-02 07:54, Alex Gorbachev wrote:

Hi Alex,

I agree, the health detail should display OSD info.
Regarding the slow requests: ESX is very sensitive to I/O delay, and
this could be due to cluster overload or a disk going bad. If you have a
monitoring tool, do you see high load spikes (such as disk % busy) on
your disks? Generally it is good, if you can, to simulate the maximum
client workload plus concurrent recovery and scrubbing, monitor your
resources, and see whether they get overloaded. Most if not all issues
with ESX will go away if you have enough hardware. I suspect this is
what you are seeing; the VMotion and compression could be the extra load
that triggered the issue, in which case the problem is not specific to a
particular OSD. The other possibility is that a particular disk is going
bad, which the monitoring tool should be able to identify.
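
A rough sketch of that kind of test using only standard tools (the pool
name, runtime, thread count and OSD id below are illustrative; adjust to
your environment):

# Per-disk utilisation on each OSD host while the test runs
iostat -x 5

# Client-style write load against a test pool
rados bench -p testpool 60 write -t 16

# Kick off a deep scrub on one OSD to add scrub-style background load
ceph osd deep-scrub 0

# Per-OSD commit/apply latency snapshot while loaded
ceph osd perf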

Note that VMotion generates an I/O pattern that keeps hitting the same
OSD with concurrent small writes; if you can use RBD striping, creating
64k stripes may help. A controller with a write-back cache will also help.
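
For example, something along these lines when the image is created
(names and sizes are illustrative; striping is set at creation time):

# 64 KiB stripe unit spread across 16 objects (requires image format 2)
rbd create testpool/esx-lun --size 1024G --stripe-unit 65536 --stripe-count 16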

Maged


Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-01 Thread Alex Gorbachev
On Thu, Mar 1, 2018 at 10:57 PM, David Turner  wrote:
> Blocked requests and slow requests are synonyms in ceph. They are 2 names
> for the exact same thing.

I was wrong about the pool compression: it does not matter, since an
uncompressed pool also generates these slow-request messages.

The question is why there is no subsequent message identifying the
specific OSDs, as in Jewel and prior releases, like this example from RH:

2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN]
6 slow requests, 6 included below; oldest blocked for > 61.758455 secs

2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds
old, received at {date-time}: osd_op(client.4240.0:8
benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4
currently waiting for subops from [610]

In comparison, my Luminous cluster only shows the general slow/blocked message:

2018-03-01 21:52:54.237270 7f7e419e3700  0 log_channel(cluster) log
[WRN] : Health check failed: 116 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:00.282721 7f7e419e3700  0 log_channel(cluster) log
[WRN] : Health check update: 66 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:08.534244 7f7e419e3700  0 log_channel(cluster) log
[WRN] : Health check update: 5 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:10.382510 7f7e419e3700  0 log_channel(cluster) log
[INF] : Health check cleared: REQUEST_SLOW (was: 5 slow requests are
blocked > 32 sec)
2018-03-01 21:53:10.382546 7f7e419e3700  0 log_channel(cluster) log
[INF] : Cluster is now healthy

So where are the details?

Thanks,
Alex


Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-01 Thread David Turner
Blocked requests and slow requests are synonyms in ceph. They are 2 names
for the exact same thing.
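
Both terms surface under the same health check (REQUEST_SLOW in the logs
you posted), e.g.:

ceph health detail | grep REQUEST_SLOW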

On Thu, Mar 1, 2018, 10:21 PM Alex Gorbachev  wrote:

> Hi David,
>
> Thank you for the reply.  Unfortunately, the health detail only shows
> blocked requests.  This seems to be related to a compression setting
> on the pool; there is nothing in the OSD logs.


Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-01 Thread Alex Gorbachev
On Thu, Mar 1, 2018 at 2:47 PM, David Turner  wrote:
> `ceph health detail` should show you more information about the slow
> requests.  If the output is too verbose, you can grep it for 'blocked'
> or similar.  It should tell you which OSDs are involved, how long they've
> been slow, etc.  The default threshold shown is '> 32 sec', but requests
> may have been blocked much longer, and `ceph health detail` will show that.

Hi David,

Thank you for the reply.  Unfortunately, the health detail only shows
blocked requests.  This seems to be related to a compression setting
on the pool; there is nothing in the OSD logs.

I replied to another compression thread.  This makes sense, since
compression is new; in the past, all such issues were reflected in the
OSD logs and related to either the network or OSD hardware.

Regards,
Alex



Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-01 Thread David Turner
`ceph health detail` should show you more information about the slow
requests.  If the output is too verbose, you can grep it for 'blocked'
or similar.  It should tell you which OSDs are involved, how long they've
been slow, etc.  The default threshold shown is '> 32 sec', but requests
may have been blocked much longer, and `ceph health detail` will show that.
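
For example (osd.0 is illustrative; osd_op_complaint_time is the option
that controls when an op starts being counted as slow):

# Filter the detailed health output for the slow/blocked lines
ceph health detail | grep -Ei 'slow|blocked'

# Check the complaint threshold an OSD is actually using
ceph daemon osd.0 config get osd_op_complaint_time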

On Thu, Mar 1, 2018 at 2:23 PM Alex Gorbachev  wrote:

> Is there a switch to turn on the display of specific OSD issues?  Or
> does the output below indicate a generic problem, e.g. the network,
> rather than any specific OSD?
>
> 2018-02-28 18:09:36.438300 7f6dead56700  0
> mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56%
> total 15997 MB, used 6154 MB, avail 9008 MB
> 2018-02-28 18:09:41.477216 7f6dead56700  0 log_channel(cluster) log
> [WRN] : Health check failed: 73 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2018-02-28 18:09:47.552669 7f6dead56700  0 log_channel(cluster) log
> [WRN] : Health check update: 74 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2018-02-28 18:09:53.794882 7f6de8551700  0
> mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix":
> "status", "format": "json"} v 0) v1
>
> --
> Alex Gorbachev
> Storcium
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-01 Thread Alex Gorbachev
Is there a switch to turn on the display of specific OSD issues?  Or
does the output below indicate a generic problem, e.g. the network,
rather than any specific OSD?

2018-02-28 18:09:36.438300 7f6dead56700  0
mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56%
total 15997 MB, used 6154 MB, avail 9008 MB
2018-02-28 18:09:41.477216 7f6dead56700  0 log_channel(cluster) log
[WRN] : Health check failed: 73 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-28 18:09:47.552669 7f6dead56700  0 log_channel(cluster) log
[WRN] : Health check update: 74 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-28 18:09:53.794882 7f6de8551700  0
mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix":
"status", "format": "json"} v 0) v1

--
Alex Gorbachev
Storcium
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com