Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing
On Mon, Mar 5, 2018 at 11:20 PM, Brad Hubbard wrote:
> Working on this, thanks.
>
> See https://tracker.ceph.com/issues/23236

As in the tracker, but I think it would be useful to others: if
something like the below comes up, how do you troubleshoot the cause of
past events, especially when it is specific to just a handful out of
1000s of OSDs on many hosts?

2018-03-11 22:00:00.000132 mon.roc-vm-sc3c234 [INF] overall HEALTH_OK
2018-03-11 22:44:46.173825 mon.roc-vm-sc3c234 [WRN] Health check failed:
12 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:44:52.245738 mon.roc-vm-sc3c234 [WRN] Health check update:
9 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:44:57.925686 mon.roc-vm-sc3c234 [WRN] Health check update:
10 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:45:02.926031 mon.roc-vm-sc3c234 [WRN] Health check update:
14 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:45:06.413741 mon.roc-vm-sc3c234 [INF] Health check cleared:
REQUEST_SLOW (was: 14 slow requests are blocked > 32 sec)
2018-03-11 22:45:06.413814 mon.roc-vm-sc3c234 [INF] Cluster is now healthy
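Absent that fix, the closest substitute for after-the-fact analysis is
the OSD admin socket: every OSD keeps a rolling window of its slowest
recent ops. A rough sketch of sweeping it, run on each OSD host before
the window ages out (the OSD IDs and output path here are hypothetical):

    # For every OSD ID hosted on this machine (IDs are examples only);
    # ceph daemon talks to the local admin socket, so this must run on
    # the host that carries the OSD
    for id in 12 13 14; do
        ceph daemon osd.$id dump_historic_ops > /tmp/osd.$id.historic.json
    done

Each dump lists the op description, its duration, and the per-event
timeline, which is where detail like "waiting for subops from ..."
should still show up.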
Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing
On Fri, Mar 2, 2018 at 3:54 PM, Alex Gorbachev wrote:
> The question is why there is no subsequent message relating to
> specific OSDs, like in Jewel and prior. For example, from RH:
>
> 2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN]
> 6 slow requests, 6 included below; oldest blocked for > 61.758455 secs
>
> 2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds
> old, received at {date-time}: osd_op(client.4240.0:8
> benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4
> currently waiting for subops from [610]
>
> In comparison, my Luminous cluster only shows the general slow/blocked
> REQUEST_SLOW messages. So where are the details?

Working on this, thanks.

See https://tracker.ceph.com/issues/23236

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing
On Fri, Mar 2, 2018 at 4:17 AM Maged Mokhtar wrote:
> Hi Alex,
>
> I agree, the health detail should display osd info.
>
> Regarding the slow requests: ESX is very sensitive to io delay; this
> could be due to cluster overload or a disk going bad. I suspect this
> is what you are seeing; especially the VMotion and compression could
> be the extra load which triggered the issue, in which case the problem
> is not specific to a particular osd.

Thank you, Maged. This does seem to be related to compression. I did two
Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing
On 2018-03-02 07:54, Alex Gorbachev wrote:
> I was wrong that the pool compression setting matters: an uncompressed
> pool also generates these slow messages.
>
> The question is why there is no subsequent message relating to
> specific OSDs, like in Jewel and prior. In comparison, my Luminous
> cluster only shows the general slow/blocked REQUEST_SLOW messages.
>
> So where are the details?

Hi Alex,

I agree, the health detail should display osd info.

Regarding the slow requests: ESX is very sensitive to io delay; this
could be due to cluster overload or a disk going bad. If you have a
monitoring tool, do you see high load spikes (such as disk % busy) on
your disks? Generally it is good (if you can) to simulate the max
client workload plus concurrent recovery and scrubbing, and monitor
your resources to see whether they get overloaded. Most if not all
issues with ESX will go away if you have enough hardware. I suspect
this is what you are seeing; especially the VMotion and compression
could be the extra load which triggered the issue, in which case the
problem is not specific to a particular osd. The other possibility is
that a particular disk is going bad; the monitoring tool should be able
to identify this.

Note that VMotion generates an io pattern that keeps hitting the same
osd with concurrent small writes. If you can use rbd striping, creating
64k stripes may help. A controller with write back cache will also
help.

Maged
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
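To make the striping suggestion concrete, a hypothetical example (pool
and image names invented; assumes format 2 images, the default since
Jewel):

    # 64 KB stripe unit, 16 objects per stripe set: sequential small
    # writes fan out over 16 objects instead of filling one 4 MB object
    # at a time (--size is in MB, so this is a 100 GB image)
    rbd create --size 102400 --stripe-unit 65536 --stripe-count 16 rbd/esx-vol1

Striping parameters are fixed at image creation, so this only applies
to newly created volumes; existing images would have to be copied over.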
Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing
On Thu, Mar 1, 2018 at 10:57 PM, David Turner wrote:
> Blocked requests and slow requests are synonyms in ceph. They are 2
> names for the exact same thing.

I was wrong that the pool compression setting matters: an uncompressed
pool also generates these slow messages.

The question is why there is no subsequent message relating to specific
OSDs, like in Jewel and prior. For example, from RH:

2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN]
6 slow requests, 6 included below; oldest blocked for > 61.758455 secs

2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds
old, received at {date-time}: osd_op(client.4240.0:8
benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4
currently waiting for subops from [610]

In comparison, my Luminous cluster only shows the general slow/blocked
message:

2018-03-01 21:52:54.237270 7f7e419e3700 0 log_channel(cluster) log
[WRN] : Health check failed: 116 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:00.282721 7f7e419e3700 0 log_channel(cluster) log
[WRN] : Health check update: 66 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:08.534244 7f7e419e3700 0 log_channel(cluster) log
[WRN] : Health check update: 5 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:10.382510 7f7e419e3700 0 log_channel(cluster) log
[INF] : Health check cleared: REQUEST_SLOW (was: 5 slow requests are
blocked > 32 sec)
2018-03-01 21:53:10.382546 7f7e419e3700 0 log_channel(cluster) log
[INF] : Cluster is now healthy

So where are the details?

Thanks,
Alex
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
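For what it is worth, the per-op detail that Jewel used to log should
still be retrievable on the OSD side via the admin socket, run on the
host carrying that OSD (osd.50 here is just the ID borrowed from the
example above):

    # Ops currently blocked or in progress on this OSD
    ceph daemon osd.50 dump_ops_in_flight
    # Recently completed slow ops, with per-event timestamps
    ceph daemon osd.50 dump_historic_ops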
Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing
Blocked requests and slow requests are synonyms in ceph. They are 2
names for the exact same thing.

On Thu, Mar 1, 2018, 10:21 PM Alex Gorbachev wrote:
> Thank you for the reply. Unfortunately, the health detail only shows
> blocked requests. This seems to be related to a compression setting
> on the pool, nothing in OSD logs.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing
On Thu, Mar 1, 2018 at 2:47 PM, David Turner wrote:
> `ceph health detail` should show you more information about the slow
> requests. If the output is too much stuff, you can grep out for
> blocked or something. It should tell you which OSDs are involved, how
> long they've been slow, etc. The default is for them to show '> 32
> sec' but that may very well be much longer and `ceph health detail`
> will show that.

Hi David,

Thank you for the reply. Unfortunately, the health detail only shows
blocked requests. This seems to be related to a compression setting on
the pool; there is nothing in the OSD logs.

I replied to another compression thread. This makes sense since
compression is new; in the past, all such issues were reflected in the
OSD logs and related to either the network or OSD hardware.

Regards,
Alex
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
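If it helps anyone reproduce this, the per-pool BlueStore compression
settings in Luminous can be inspected and switched off as below (the
pool name is hypothetical):

    # Current mode (none|passive|aggressive|force) and algorithm
    # (e.g. snappy, zlib; lz4/zstd availability varies by build)
    ceph osd pool get esx-pool compression_mode
    ceph osd pool get esx-pool compression_algorithm
    # Disable compression on the pool to rule it out
    ceph osd pool set esx-pool compression_mode none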
Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing
`ceph health detail` should show you more information about the slow
requests. If the output is too much stuff, you can grep out for blocked
or something. It should tell you which OSDs are involved, how long
they've been slow, etc. The default is for them to show '> 32 sec' but
that may very well be much longer and `ceph health detail` will show
that.

On Thu, Mar 1, 2018 at 2:23 PM Alex Gorbachev wrote:
> Is there a switch to turn on the display of specific OSD issues? Or
> does the below indicate a generic problem, e.g. the network, and not
> any specific OSD?
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
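For example, something along these lines (the exact wording of the
health output varies a little between releases):

    # Full detail, then narrow it down to the slow-request lines
    ceph health detail
    ceph health detail | grep -Ei 'request_slow|slow|blocked'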
[ceph-users] Slow requests troubleshooting in Luminous - details missing
Is there a switch to turn on the display of specific OSD issues? Or
does the below indicate a generic problem, e.g. the network, and not
any specific OSD?

2018-02-28 18:09:36.438300 7f6dead56700 0
mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56%
total 15997 MB, used 6154 MB, avail 9008 MB
2018-02-28 18:09:41.477216 7f6dead56700 0 log_channel(cluster) log
[WRN] : Health check failed: 73 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-28 18:09:47.552669 7f6dead56700 0 log_channel(cluster) log
[WRN] : Health check update: 74 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-28 18:09:53.794882 7f6de8551700 0
mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix":
"status", "format": "json"} v 0) v1

--
Alex Gorbachev
Storcium
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
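One quick way to spot a single misbehaving OSD, independent of the
health output, is the per-OSD latency table; an OSD whose commit/apply
latency sits consistently far above its peers is the suspect:

    # Per-OSD commit and apply latency, in milliseconds
    ceph osd perf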