Re: [ceph-users] slow requests going up and down

2015-07-14 Thread Deneau, Tom
I don't think there were any stale or unclean PGs (when there are, I have
seen health detail list them, and it did not in this case).
I have since restarted the 2 OSDs and the health went immediately to HEALTH_OK.

-- Tom

 -Original Message-
 From: Will.Boege [mailto:will.bo...@target.com]
 Sent: Monday, July 13, 2015 10:19 PM
 To: Deneau, Tom; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] slow requests going up and down
 
 Does the ceph health detail show anything about stale or unclean PGs, or
 are you just getting the blocked ops messages?
 
 On 7/13/15, 5:38 PM, Deneau, Tom tom.den...@amd.com wrote:
 
 I have a cluster where something happened over the weekend, and successive
 calls to ceph health detail show output like the following.
 What does it mean when the number of blocked requests goes up and down
 like this?
 Some clients are still running successfully.
 
 -- Tom Deneau, AMD
 
 
 
 HEALTH_WARN 20 requests are blocked > 32 sec; 2 osds have slow requests
 20 ops are blocked > 536871 sec
 2 ops are blocked > 536871 sec on osd.5
 18 ops are blocked > 536871 sec on osd.7
 2 osds have slow requests
 
 HEALTH_WARN 4 requests are blocked > 32 sec; 2 osds have slow requests
 4 ops are blocked > 536871 sec
 2 ops are blocked > 536871 sec on osd.5
 2 ops are blocked > 536871 sec on osd.7
 2 osds have slow requests
 
 HEALTH_WARN 27 requests are blocked > 32 sec; 2 osds have slow requests
 27 ops are blocked > 536871 sec
 2 ops are blocked > 536871 sec on osd.5
 25 ops are blocked > 536871 sec on osd.7
 2 osds have slow requests
 
 HEALTH_WARN 34 requests are blocked > 32 sec; 2 osds have slow requests
 34 ops are blocked > 536871 sec
 9 ops are blocked > 536871 sec on osd.5
 25 ops are blocked > 536871 sec on osd.7
 2 osds have slow requests
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] slow requests going up and down

2015-07-14 Thread Will . Boege
In my experience I have seen something like this happen twice. The first
time there were unclean PGs because Ceph was down to one replica of a PG.
Ceph blocks I/O to a PG when the number of available replicas falls below
the pool's 'min_size' parameter, and that manifests as blocked ops.
The second time, the disk was 'soft-failing': it was gaining many bad
sectors, but SMART still reported the drive as OK. Maybe check osd.5 and
osd.7 for low-level media errors with a tool like MegaCli, or whatever
controller management tool comes with your hardware.
At any rate, restarting the problem-child OSD is probably troubleshooting
step #1, which you have done.
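The 'min_size' behaviour described above can be modelled in a few lines. This is a deliberately simplified sketch, assuming a replicated pool and ignoring peering, recovery and erasure-coded pools; `ReplicatedPool` and `pg_accepts_io` are hypothetical names, not Ceph APIs.

```python
from dataclasses import dataclass


@dataclass
class ReplicatedPool:
    size: int      # desired number of replicas (the pool's "size" setting)
    min_size: int  # replicas required before a PG serves I/O ("min_size")


def pg_accepts_io(pool: ReplicatedPool, replicas_up: int) -> bool:
    """Simplified rule: a PG serves client I/O only while at least
    min_size of its replicas are up; below that, ops block."""
    return replicas_up >= pool.min_size


if __name__ == "__main__":
    pool = ReplicatedPool(size=3, min_size=2)
    for up in range(pool.size, -1, -1):
        state = "serves I/O" if pg_accepts_io(pool, up) else "blocks I/O"
        print(f"{up} of {pool.size} replicas up: {state}")
```

With size=3 and min_size=2 (example values only), the loop reports that the PG blocks I/O once a single replica is left, which is the first scenario above.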


Re: [ceph-users] slow requests going up and down

2015-07-13 Thread Will . Boege
Does the ceph health detail show anything about stale or unclean PGs, or
are you just getting the blocked ops messages?
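This question can be checked mechanically by scanning the same output for per-PG "stuck" lines. A sketch, assuming such lines look like the example in the code comment (the exact wording can vary between releases, and `ceph pg dump_stuck` is the more direct tool); `stuck_pgs` is a hypothetical helper. For output that contains only blocked-ops lines, like the samples in this thread, it returns an empty result.

```python
import re

# Assumed line shape, e.g.:
#   "pg 2.1f is stuck unclean for 1234.567, current state active+remapped, last acting [5,7]"
STUCK = re.compile(r"^pg (\S+) is stuck (stale|unclean|inactive)\b")


def stuck_pgs(health_detail: str) -> dict:
    """Map each stuck state found in `ceph health detail` output to the
    list of PG ids reported in that state."""
    result = {}
    for line in health_detail.splitlines():
        m = STUCK.match(line.strip())
        if m:
            result.setdefault(m.group(2), []).append(m.group(1))
    return result
```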


Re: [ceph-users] slow requests going up and down

2015-07-13 Thread Christian Balzer

Hello,

to quote Sherlock Holmes:

Data, data, data. I cannot make bricks without clay.

That the number of blocked requests is varying is indeed interesting, but
I presume you're more interested in fixing this than dissecting this
particular tidbit?

If so...

Start with the basics: all relevant software versions, a description of
your cluster, full outputs of ceph osd tree and ceph -s, etc.

The same 2 OSDs are affected, anything peculiar going on in their logs?

How about their SMART status?
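Because a soft-failing disk can still pass SMART's overall health check (as noted elsewhere in this thread), the raw attribute counters are often more telling than the PASSED/FAILED verdict. A sketch, assuming ATA drives whose `smartctl -A` output has the usual attribute table with RAW_VALUE as the tenth column; the watched attributes are an illustrative choice, and the function names are hypothetical. SAS drives report differently, and disks behind a RAID controller may need smartctl's `-d` option or the controller's own tool (e.g. MegaCli).

```python
import subprocess

# Attributes whose non-zero raw values commonly point at media trouble.
# This selection is an illustrative assumption, not an exhaustive list.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def suspicious_attributes(smartctl_output: str) -> dict:
    """Return {name: raw_value} for watched attributes with a non-zero raw
    value, parsed from the attribute table printed by `smartctl -A`."""
    found = {}
    for line in smartctl_output.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID; RAW_VALUE is the 10th column.
        if len(parts) >= 10 and parts[0].isdigit() and parts[1] in WATCHED:
            raw = parts[9]
            if raw.isdigit() and int(raw) > 0:
                found[parts[1]] = int(raw)
    return found


def read_smart_attributes(device: str) -> str:
    """Run `smartctl -A` on a site-specific device path (usually needs root).
    smartctl's exit status is a bit mask that can be non-zero even when the
    output is usable, so it is deliberately not treated as an error here."""
    return subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    ).stdout
```

For example, `suspicious_attributes(read_smart_attributes("/dev/sdX"))`, where `/dev/sdX` is a placeholder for the disk behind osd.5 or osd.7.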

Are they being deep-scrubbed (logs above) or otherwise busy (atop, iostat)?

You may find something in the performance counters, blocked requests
section, see: http://ceph.com/docs/v0.69/dev/perf_counters/
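Besides the performance counters, the OSD admin socket can list the individual requests an OSD is still holding, which shows what the blocked ops actually are. A sketch, assuming it runs on the host of the affected OSD and that `ceph daemon osd.<id> dump_ops_in_flight` returns JSON with an `ops` list whose entries carry `age` (seconds) and `description` fields; check the field names against your release. The helper names are hypothetical, and the 32-second default only mirrors the health output in this thread.

```python
import json
import subprocess


def fetch_ops_in_flight(osd_id: int) -> dict:
    """Query the local admin socket of osd.<osd_id> for its in-flight ops."""
    raw = subprocess.run(
        ["ceph", "daemon", f"osd.{osd_id}", "dump_ops_in_flight"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(raw)


def old_ops(dump: dict, min_age_sec: float = 32.0) -> list:
    """Return (age, description) pairs for ops older than min_age_sec,
    oldest first. Field names "ops", "age", "description" are assumed."""
    ops = [
        (float(op.get("age", 0)), op.get("description", "?"))
        for op in dump.get("ops", [])
    ]
    return sorted((op for op in ops if op[0] > min_age_sec), reverse=True)
```

On the osd.7 host, `old_ops(fetch_ops_in_flight(7))` would list the stuck requests; `dump_historic_ops` is the related admin-socket command for recently completed slow ops.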

Lastly, the most likely fix will be restarting the affected OSDs. 

See also:

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg15410.html

Christian

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


[ceph-users] slow requests going up and down

2015-07-13 Thread Deneau, Tom
I have a cluster where something happened over the weekend, and successive
calls to ceph health detail show output like the following.
What does it mean when the number of blocked requests goes up and down like this?
Some clients are still running successfully.

-- Tom Deneau, AMD



HEALTH_WARN 20 requests are blocked > 32 sec; 2 osds have slow requests
20 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
18 ops are blocked > 536871 sec on osd.7
2 osds have slow requests

HEALTH_WARN 4 requests are blocked > 32 sec; 2 osds have slow requests
4 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
2 ops are blocked > 536871 sec on osd.7
2 osds have slow requests

HEALTH_WARN 27 requests are blocked > 32 sec; 2 osds have slow requests
27 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
25 ops are blocked > 536871 sec on osd.7
2 osds have slow requests

HEALTH_WARN 34 requests are blocked > 32 sec; 2 osds have slow requests
34 ops are blocked > 536871 sec
9 ops are blocked > 536871 sec on osd.5
25 ops are blocked > 536871 sec on osd.7
2 osds have slow requests
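Tallying successive samples makes the pattern easier to see: in every sample, only osd.5 and osd.7 report blocked ops, and the per-OSD counts add up to the total (2 + 18 = 20, 2 + 2 = 4, 2 + 25 = 27, 9 + 25 = 34). A minimal sketch for parsing one sample, assuming lines in the format shown here; `blocked_ops` is a hypothetical helper, not part of Ceph.

```python
import re
from collections import Counter

# Lines such as "18 ops are blocked > 536871 sec on osd.7" and the
# un-attributed total "20 ops are blocked > 536871 sec".
PER_OSD = re.compile(r"^\s*(\d+) ops are blocked\s*>?\s*[\d.]+ sec on (osd\.\d+)\s*$")
TOTAL = re.compile(r"^\s*(\d+) ops are blocked\s*>?\s*[\d.]+ sec\s*$")


def blocked_ops(health_detail: str):
    """Return (total, Counter of blocked ops per OSD) for one sample of
    `ceph health detail` output."""
    total = 0
    per_osd = Counter()
    for line in health_detail.splitlines():
        m = PER_OSD.match(line)
        if m:
            per_osd[m.group(2)] += int(m.group(1))
            continue
        m = TOTAL.match(line)
        if m:
            total += int(m.group(1))
    return total, per_osd
```

For the first sample this yields a total of 20, split as 18 on osd.7 and 2 on osd.5.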