Re: [ceph-users] osd_op_tp timeouts

2017-06-13 Thread Eric Choi
I realized I sent this under the wrong thread; here I am sending it again:

---

Hello all,

I work on the same team as Tyler, and I can provide some more info here.

The cluster is indeed an RGW cluster, with many small (100 KB) objects,
similar to your use case, Bryan.  But this particular bucket is set up as a
blind bucket ("index_type": 1), since we wanted to avoid the bucket index
bottleneck to begin with (we didn't need the listing feature).  Would bucket
index sharding still be a problem for blind buckets?
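
For context, the relevant bit of the zone config looks roughly like the sketch
below (the pool and placement names here are just the defaults from a stock
setup, not necessarily ours verbatim); index_type 1 is the indexless / blind
mode:

    $ radosgw-admin zone get
    ...
    "placement_pools": [
        { "key": "default-placement",
          "val": { "index_pool": "default.rgw.buckets.index",
                   "data_pool": "default.rgw.buckets.data",
                   "data_extra_pool": "default.rgw.buckets.non-ec",
                   "index_type": 1 } } ],
    ...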

Mark, would setting the OSD logging level to 20 give any insight into what the
threads are doing?
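
(What I had in mind was just bumping debug on the affected OSD at runtime with
something like

    $ ceph tell osd.30 injectargs '--debug-osd 20/20'

and turning it back down once we've caught an occurrence; osd.30 is only used
here because that's the OSD in Tyler's logs.)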


Eric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_op_tp timeouts

2017-06-13 Thread Bryan Stillwell
Is this on an RGW cluster?

If so, you might be running into the same problem I was seeing with large 
bucket sizes:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018504.html

The solution is to shard your buckets so the bucket index doesn't get too big.
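
For example (just a sketch, and the right shard count depends on how many
objects you expect per bucket), on Jewel you can set the default shard count
for newly created buckets in ceph.conf:

    [client.rgw.<gateway-name>]
        rgw override bucket index max shards = 16

Existing buckets can be resharded offline with 'radosgw-admin bucket reshard
--bucket=<name> --num-shards=<n>' on recent point releases, if I remember the
syntax right.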

Bryan

From: ceph-users  on behalf of Tyler Bischel 

Date: Monday, June 12, 2017 at 5:12 PM
To: "ceph-us...@ceph.com" 
Subject: [ceph-users] osd_op_tp timeouts

Hi,
  We've been having this ongoing problem with threads timing out on the OSDs.
Typically we'll see the OSD become unresponsive for about a minute, as threads
from other OSDs time out.  The timeouts don't seem to be correlated with high
load.  We turned up the logs to 10/10 for part of a day to catch some of these
in progress, and saw the pattern below in the logs several times (grepping for
individual threads involved in the timeouts).

We are using Jewel 10.2.7.

Logs:

2017-06-12 18:45:12.530698 7f82ebfa8700 10 osd.30 pg_epoch: 5484 pg[10.6d2( v 
5484'12967030 (5469'12963946,5484'12967030] local-les=5476 n=419 ec=593 les/c/f 
5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475 pi=4780-5474/109 luod=0'0 
lua=5484'12967019 crt=5484'12967027 lcod 5484'12967028 active] add_log_entry 
5484'12967030 (0'0) modify   
10:4b771c01:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0634p1mdw1-11203-593EE138-2E:head
 by client.1274027169.0:3107075054 2017-06-12 18:45:12.523899

2017-06-12 18:45:12.530718 7f82ebfa8700 10 osd.30 pg_epoch: 5484 pg[10.6d2( v 
5484'12967030 (5469'12963946,5484'12967030] local-les=5476 n=419 ec=593 les/c/f 
5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475 pi=4780-5474/109 luod=0'0 
lua=5484'12967019 crt=5484'12967028 lcod 5484'12967028 active] append_log: 
trimming to 5484'12967028 entries 5484'12967028 (5484'12967026) delete   
10:4b796a74:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0469p1mdw1-21390-593EE137-57:head
 by client.1274027164.0:3183456083 2017-06-12 18:45:12.491741

2017-06-12 18:45:12.530754 7f82ebfa8700  5 write_log with: dirty_to: 0'0, 
dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: false, 
divergent_priors: 0, writeout_from: 5484'12967030, trimmed:

2017-06-12 18:45:28.171843 7f82dc503700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.171877 7f82dc402700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.174900 7f82d8887700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.174979 7f82d8786700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.248499 7f82df05e700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.248651 7f82df967700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.261044 7f82d8483700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15
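
(As far as I can tell, the "timed out after 15" above is the op thread pool
heartbeat timeout, i.e. osd_op_thread_timeout with its default of 15 seconds;
the running value can be checked on the OSD host with something like

    $ ceph daemon osd.30 config get osd_op_thread_timeout

if I have the option name right.)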



Metrics:
OSD Disk IO Wait spikes from 2 ms to 1 s, CPU Procs Blocked spikes from 0 to 16,
IO In Progress spikes from 0 to hundreds, and IO Time Weighted and IO Time both
spike.  Average Queue Size on the device spikes as well.  One minute later,
Write Time, Reads, and Read Time spike briefly.

Any thoughts on what may be causing this behavior?

--Tyler

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_op_tp timeouts

2017-06-13 Thread Mark Nelson

Hi Tyler,

I wanted to make sure you got a reply to this, but unfortunately I don't 
have much to give you.  It sounds like you have already looked at the 
disk metrics, and based on your description Ceph is probably not waiting 
on disk IO.  If you can easily reproduce the problem, you could attach gdb 
to the OSD and do a "thread apply all bt" to see what the threads are 
doing when it's timing out.  Also, please open a tracker ticket if one 
doesn't already exist, so we can make sure it gets recorded in case 
other people see the same thing.
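
Roughly something like this (just a sketch; <osd-pid> is the pid of the
ceph-osd process in question, and you'll want the debug symbol packages
installed for the backtraces to be readable):

    # attach, dump backtraces for every thread, then detach
    $ gdb -p <osd-pid> -batch -ex 'thread apply all bt' > osd-backtraces.txt

Run it while the heartbeat_map warnings are firing so the stuck thread shows
up in the output.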


Mark



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com