Re: [ceph-users] slow requests going up and down
I don't think there were any stale or unclean PGs (when there are, I have seen health detail list them, and it did not in this case). I have since restarted the 2 OSDs and the health went immediately to HEALTH_OK.

-- Tom

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] slow requests going up and down
In my experience, I have seen something like this happen twice.

The first time, there were unclean PGs because Ceph was down to one replica of a PG. When that happens, Ceph blocks IO to the remaining replicas once their number falls below the 'min_size' parameter. That will manifest as blocked ops.

The second time, the disk was 'soft-failing': gaining many bad sectors while SMART still reported the drive as OK. Maybe check osd.5 and osd.7 for low-level media errors with a tool like MegaCli, or whatever controller management tool comes with your hardware.

At any rate, restarting the problem-child OSD is probably troubleshooting step #1, which you have done.
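The 'min_size' behaviour described above can be pictured with a toy model. This is only a sketch of the rule as stated in the message, not Ceph's actual code; the `PoolSettings` class, the `pg_serves_io` function and the values size=3 / min_size=2 are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PoolSettings:
    """Hypothetical stand-in for a pool's replication settings."""
    size: int      # replicas the pool tries to keep
    min_size: int  # replicas that must be available before IO is served


def pg_serves_io(available_replicas: int, pool: PoolSettings) -> bool:
    """Return True if a PG with this many available replicas accepts IO.

    Models only the rule described in the message: IO is blocked once the
    number of available replicas falls below min_size.
    """
    return available_replicas >= pool.min_size


# Assumed example values, not taken from the cluster in this thread.
pool = PoolSettings(size=3, min_size=2)

for replicas in (3, 2, 1):
    state = "serves IO" if pg_serves_io(replicas, pool) else "blocks IO"
    print(f"{replicas} replica(s) available: PG {state}")
```

On a real cluster the configured value can be read with `ceph osd pool get <pool> min_size`.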
Re: [ceph-users] slow requests going up and down
Does the ceph health detail show anything about stale or unclean PGs, or are you just getting the blocked ops messages?
Re: [ceph-users] slow requests going up and down
Hello,

to quote Sherlock Holmes: "Data, data, data. I cannot make bricks without clay."

That the number of blocked requests is varying is indeed interesting, but I presume you're more interested in fixing this than dissecting this particular tidbit? If so...

Start with the basics: all relevant software versions, a description of your cluster, full outputs of ceph osd tree and ceph -s, etc.

The same 2 OSDs are affected; anything peculiar going on in their logs? How about their SMART status? Are they being deep-scrubbed (logs above) or otherwise busy (atop, iostat)?

You may find something in the performance counters, blocked requests section, see:
http://ceph.com/docs/v0.69/dev/perf_counters/

Lastly, the most likely fix will be restarting the affected OSDs. See also:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg15410.html

Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
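The "start with the basics" step above could be scripted so that every report to the list carries the same outputs. The following is a minimal sketch, assuming the `ceph` CLI is on the PATH of the machine where it runs; the helper names `run` and `collect` are invented here, and the command list only covers the commands mentioned in this thread plus `ceph --version`.

```python
import subprocess

# Commands named in the thread, plus the version query.
CLUSTER_COMMANDS = [
    ["ceph", "--version"],
    ["ceph", "-s"],
    ["ceph", "osd", "tree"],
    ["ceph", "health", "detail"],
]


def run(cmd):
    """Run one command and return its stdout, or a short note if it failed."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"(could not run: {exc})"
    if done.returncode != 0:
        return f"(exit {done.returncode}) {done.stderr.strip()}"
    return done.stdout.strip()


def collect(commands=CLUSTER_COMMANDS, runner=run):
    """Build one plain-text report: each command followed by its output."""
    sections = []
    for cmd in commands:
        sections.append("$ " + " ".join(cmd))
        sections.append(runner(cmd))
        sections.append("")  # blank line between sections
    return "\n".join(sections)
```

Calling `print(collect())` on a node with admin access prints a report that can be pasted into a mail. The `runner` argument exists only so the function can be exercised without a cluster.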
[ceph-users] slow requests going up and down
I have a cluster where over the weekend something happened and successive calls to ceph health detail show things like below. What does it mean when the number of blocked requests goes up and down like this? Some clients are still running successfully.

-- Tom Deneau, AMD

HEALTH_WARN 20 requests are blocked > 32 sec; 2 osds have slow requests
20 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
18 ops are blocked > 536871 sec on osd.7
2 osds have slow requests

HEALTH_WARN 4 requests are blocked > 32 sec; 2 osds have slow requests
4 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
2 ops are blocked > 536871 sec on osd.7
2 osds have slow requests

HEALTH_WARN 27 requests are blocked > 32 sec; 2 osds have slow requests
27 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
25 ops are blocked > 536871 sec on osd.7
2 osds have slow requests

HEALTH_WARN 34 requests are blocked > 32 sec; 2 osds have slow requests
34 ops are blocked > 536871 sec
9 ops are blocked > 536871 sec on osd.5
25 ops are blocked > 536871 sec on osd.7
2 osds have slow requests
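To see how the counts move per OSD, the four snapshots above can be tallied with a short script. This is a sketch only: the sample text is transcribed from the output in this message, the function names are invented, and the pattern accepts the per-OSD lines with or without the `>` before the duration, since archived copies of such output sometimes lose that character.

```python
import re

# The four "ceph health detail" snapshots from the message above.
SNAPSHOTS = [
    """HEALTH_WARN 20 requests are blocked > 32 sec; 2 osds have slow requests
20 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
18 ops are blocked > 536871 sec on osd.7
2 osds have slow requests""",
    """HEALTH_WARN 4 requests are blocked > 32 sec; 2 osds have slow requests
4 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
2 ops are blocked > 536871 sec on osd.7
2 osds have slow requests""",
    """HEALTH_WARN 27 requests are blocked > 32 sec; 2 osds have slow requests
27 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
25 ops are blocked > 536871 sec on osd.7
2 osds have slow requests""",
    """HEALTH_WARN 34 requests are blocked > 32 sec; 2 osds have slow requests
34 ops are blocked > 536871 sec
9 ops are blocked > 536871 sec on osd.5
25 ops are blocked > 536871 sec on osd.7
2 osds have slow requests""",
]

# Matches only the per-OSD lines, e.g. "18 ops are blocked > 536871 sec on osd.7".
PER_OSD = re.compile(r"^(\d+) ops are blocked (?:> )?([\d.]+) sec on (osd\.\d+)$")


def blocked_per_osd(text):
    """Return {osd name: blocked op count} for one health detail snapshot."""
    counts = {}
    for line in text.splitlines():
        match = PER_OSD.match(line.strip())
        if match:
            osd = match.group(3)
            counts[osd] = counts.get(osd, 0) + int(match.group(1))
    return counts


def history(snapshots):
    """Return {osd name: [count in snapshot 1, count in snapshot 2, ...]}."""
    per_snapshot = [blocked_per_osd(s) for s in snapshots]
    osds = sorted({osd for counts in per_snapshot for osd in counts})
    return {osd: [counts.get(osd, 0) for counts in per_snapshot] for osd in osds}


for osd, series in history(SNAPSHOTS).items():
    print(osd, series)
```

For the data above this prints `osd.5 [2, 2, 2, 9]` and `osd.7 [18, 2, 25, 25]`, which makes it easier to see that the same two OSDs account for every blocked op in each snapshot.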