It doesn't seem like the problem here, but I've noticed that slow OSDs have
a large fan-out.  I have fewer than 100 OSDs, so every OSD talks to every
other OSD in my cluster.

I was getting slow notices from all of my OSDs.  Nothing jumped out, so I
started looking at disk write latency graphs.  I noticed that all the OSDs
in one node had 10x the write latency of the other nodes.  After that, I
graphed the number of slow notices per OSD, and saw a much higher number
of slow requests on that node.
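
In case it's useful to anyone else, a rough pipeline like this will give that
per-OSD count straight from the log (it assumes the default
/var/log/ceph/ceph.log location and tallies by the reporting OSD in each
warning line):

  grep 'slow request' /var/log/ceph/ceph.log | grep -oE 'osd\.[0-9]+' | sort | uniq -c | sort -rn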

Long story short, I lost a battery on my write cache.  But it wasn't at all
obvious from the slow request notices until I dug deeper.
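
If anyone else ends up chasing something similar, it's worth querying the RAID
controller directly rather than inferring cache health from Ceph.  On
LSI/PERC-style controllers something along these lines should report the BBU
state and the current write policy (the exact tool and flags depend on your
setup, so treat this as a pointer rather than a recipe):

  MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL    # battery/BBU health
  MegaCli64 -LDGetProp -Cache -LAll -aALL     # has the cache dropped to write-through?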



On Mon, Jan 5, 2015 at 4:07 PM, Sanders, Bill <bill.sand...@teradata.com>
wrote:

>  Thanks for the reply.
>
> 14 and 18 happened to show up during that run, but it's certainly not only
> those OSDs.  It seems to vary each run.  Just from the runs I've done
> today I've seen the following pairs of OSDs:
>
> ['0,13', '0,18', '0,24', '0,25', '0,32', '0,34', '0,36', '10,22', '11,30',
> '12,28', '13,30', '14,22', '14,24', '14,27', '14,30', '14,31', '14,33',
> '14,34', '14,35', '14,39', '16,20', '16,27', '18,38', '19,30', '19,31',
> '19,39', '20,38', '22,30', '26,37', '26,38', '27,33', '27,34', '27,36',
> '28,32', '28,34', '28,36', '28,37', '3,18', '3,27', '3,29', '3,37', '4,10',
> '4,29', '5,19', '5,37', '6,25', '9,28', '9,29', '9,37']
>
> Which is almost all of the OSDs in the system.
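>
> (For reference, a pipeline like this will pull those pairs out of ceph.log;
> it assumes the 'waiting for subops from X,Y' wording you can see in the log
> excerpt further down:
>
>   grep 'waiting for subops from' /var/log/ceph/ceph.log | sed 's/.*subops from //' | sort -u
> )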
>
> Bill
>
> From: Lincoln Bryant [linco...@uchicago.edu]
> Sent: Monday, January 05, 2015 3:40 PM
> To: Sanders, Bill
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Slow/Hung IOs
>
>  Hi Bill,
>
>  From your log excerpt, it looks like your slow requests are happening on
> OSDs 14 and 18. Is it always these two OSDs?
>
>  If you don't have a long recovery time (e.g., the cluster is just full
> of test data), maybe you could try setting OSDs 14 and 18 out and
> re-benching?
>
>  Alternatively, I suppose you could just use bonnie++ or dd, etc., to write
> to the disks backing those OSDs (being careful not to clobber any Ceph dirs)
> and see how the performance looks.
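>
>  Roughly something like the below, assuming the default data path under
> /var/lib/ceph/osd/; the scratch filename is just an example (anything on the
> OSD's filesystem that doesn't collide with Ceph's own files will do), and
> oflag=direct keeps the page cache out of the measurement:
>
>   ceph osd out 14
>   ceph osd out 18
>   # re-run rados bench, then bring them back with 'ceph osd in 14' etc.
>
>   dd if=/dev/zero of=/var/lib/ceph/osd/ceph-14/dd-scratch.tmp bs=4M count=256 oflag=direct
>   rm /var/lib/ceph/osd/ceph-14/dd-scratch.tmp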
>
>  Cheers,
> Lincoln
>
>   On Jan 5, 2015, at 4:36 PM, Sanders, Bill wrote:
>
>   Hi Ceph Users,
>
> We've got a Ceph cluster we've built, and we're experiencing issues with
> slow or hung IOs, even running 'rados bench' on the OSD cluster.  Things
> start out great, ~600 MB/s, then throughput rapidly drops off as the test
> waits for IOs.  Nothing seems to be taxed... the system just seems to be
> waiting.  Any help figuring out what could cause the slow IOs is appreciated.
>
> For example, 'rados -p rbd bench 60 write -t 32' takes over 900s to
> complete:
>
> A typical rados bench:
>  Total time run:         957.458274
> Total writes made:      9251
> Write size:             4194304
> Bandwidth (MB/sec):     38.648
>
> Stddev Bandwidth:       157.323
> Max bandwidth (MB/sec): 964
> Min bandwidth (MB/sec): 0
> Average Latency:        3.21126
> Stddev Latency:         51.9546
> Max latency:            910.72
> Min latency:            0.04516
>
>
> According to ceph.log, we're not experiencing any OSD flapping or monitor
> election cycles, just slow requests:
>
> # grep slow /var/log/ceph/ceph.log
> 2015-01-05 13:42:42.937678 osd.18 39.7.48.7:6803/11185 220 : [WRN] 3 slow
> requests, 1 included below; oldest blocked for > 513.611379 secs
> 2015-01-05 13:42:42.937685 osd.18 39.7.48.7:6803/11185 221 : [WRN] slow
> request 30.136429 seconds old, received at 2015-01-05 13:42:12.801205:
> osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.00000000000f [write
> 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops
> from 3,37
> 2015-01-05 13:42:49.938681 osd.18 39.7.48.7:6803/11185 222 : [WRN] 3 slow
> requests, 1 included below; oldest blocked for > 520.612372 secs
> 2015-01-05 13:42:49.938688 osd.18 39.7.48.7:6803/11185 223 : [WRN] slow
> request 480.636547 seconds old, received at 2015-01-05 13:34:49.302080:
> osd_op(client.92008.1:3100010 rb.0.140d.238e1f29.000000000c77 [write
> 3622400~512] 3.d031a69f ondisk+write e994) v4 currently waiting for subops
> from 26,37
> 2015-01-05 13:43:12.941838 osd.18 39.7.48.7:6803/11185 224 : [WRN] 3 slow
> requests, 1 included below; oldest blocked for > 543.615545 secs
> 2015-01-05 13:43:12.941844 osd.18 39.7.48.7:6803/11185 225 : [WRN] slow
> request 60.140595 seconds old, received at 2015-01-05 13:42:12.801205:
> osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.00000000000f [write
> 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops
> from 3,37
> 2015-01-05 13:44:04.933440 osd.14 39.7.48.7:6818/11640 251 : [WRN] 4 slow
> requests, 1 included below; oldest blocked for > 606.941954 secs
> 2015-01-05 13:44:04.933469 osd.14 39.7.48.7:6818/11640 252 : [WRN] slow
> request 240.101138 seconds old, received at 2015-01-05 13:40:04.832272:
> osd_op(client.92008.1:3101102 rb.0.142b.238e1f29.000000000010 [write
> 475136~512] 3.5e623815 ondisk+write e994) v4 currently waiting for subops
> from 27,33
> 2015-01-05 13:44:12.950805 osd.18 39.7.48.7:6803/11185 226 : [WRN] 3 slow
> requests, 1 included below; oldest blocked for > 603.624511 secs
> 2015-01-05 13:44:12.950812 osd.18 39.7.48.7:6803/11185 227 : [WRN] slow
> request 120.149561 seconds old, received at 2015-01-05 13:42:12.801205:
> osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.00000000000f [write
> 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops
> from 3,37
> 2015-01-05 13:46:12.988010 osd.18 39.7.48.7:6803/11185 228 : [WRN] 3 slow
> requests, 1 included below; oldest blocked for > 723.661722 secs
> 2015-01-05 13:46:12.988017 osd.18 39.7.48.7:6803/11185 229 : [WRN] slow
> request 240.186772 seconds old, received at 2015-01-05 13:42:12.801205:
> osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.00000000000f [write
> 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops
> from 3,37
> 2015-01-05 13:46:18.971570 osd.14 39.7.48.7:6818/11640 253 : [WRN] 4 slow
> requests, 1 included below; oldest blocked for > 740.980083 secs
> 2015-01-05 13:46:18.971577 osd.14 39.7.48.7:6818/11640 254 : [WRN] slow
> request 480.063439 seconds old, received at 2015-01-05 13:38:18.908100:
> osd_op(client.91911.1:3113675 rb.0.13f5.238e1f29.000000000010 [write
> 475136~512] 3.679a939d ondisk+write e994) v4 currently waiting for subops
> from 27,34
> 2015-01-05 13:48:05.030581 osd.14 39.7.48.7:6818/11640 255 : [WRN] 4 slow
> requests, 1 included below; oldest blocked for > 847.039098 secs
> 2015-01-05 13:48:05.030587 osd.14 39.7.48.7:6818/11640 256 : [WRN] slow
> request 480.198282 seconds old, received at 2015-01-05 13:40:04.832272:
> osd_op(client.92008.1:3101102 rb.0.142b.238e1f29.000000000010 [write
> 475136~512] 3.5e623815 ondisk+write e994) v4 currently waiting for subops
> from 27,33
>
>
> iostat -dxm:
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda            5662.99     3.63  155.17    3.37    22.94     0.03   296.72     1.05    6.64   1.79  28.32
> sdb               0.73     0.04    3.62   70.66     0.04    10.85   300.04     0.06    0.79   0.08   0.61
> sdc               0.70     0.00    1.70   65.45     0.01    10.34   315.60     0.06    0.85   0.09   0.57
> sdd               0.07     0.95   16.78   13.47     1.03     2.28   224.25     0.37   12.32   3.84  11.62
> sde               0.26     0.80   25.79   10.89     2.11     1.85   220.74     2.82   76.94   4.73  17.35
> sdf               0.23     0.94   21.16   12.39     1.56     2.14   225.94     2.81   83.84   4.70  15.75
> sdg               0.02     0.91    9.22   12.30     0.33     2.06   227.64     0.26   12.26   3.67   7.89
> sdh               0.07     0.86   14.31   12.35     0.84     2.06   223.02     2.17   81.27   4.79  12.76
> sdk               0.19     0.85   15.92   10.48     1.06     1.78   220.09     3.11  114.07   4.68  12.35
> sdl               0.30     0.87   20.32   12.07     1.45     2.06   221.99     2.71   82.65   4.68  15.16
> sdi               0.13     0.88   16.03   11.51     1.07     1.94   224.27     2.99  108.40   4.64  12.78
> sdj               0.15     0.90   15.63   11.83     0.95     2.01   220.32     2.85  103.83   4.65  12.76
> sdm               0.19     0.86   20.48   12.04     1.46     2.04   220.42     2.64   81.17   4.72  15.34
>
> Our ceph cluster is 4x Dell R720xd nodes:
> 2x1TB spinners configured in RAID for the OS
> 10x4TB spinners for OSDs (XFS)
> 2x400GB SSDs, each with 5x ~50GB OSD journals
> 2x Xeon E5-2620 CPU (/proc/cpuinfo reports 24 cores)
> 128GB RAM
> Two networks (public+cluster), both over infiniband
>
> Three monitors are configured on the first three nodes, and each uses a chunk
> of one of the SSDs for its data, on an XFS partition.
>
> Software:
> SLES 11 SP3, with some in-house patching (3.0.1 kernel, "ceph-client"
> backported from 3.10)
> Ceph version: ceph-0.80.5-0.9.2, packaged by SUSE
>
> ceph.conf:
> fsid = 3e8dbfd8-c3c8-4d30-80e2-cd059619d757
> mon initial members = tvsaq1, tvsaq2, tvsar1
> mon host = 39.7.48.6, 39.7.48.7, 39.7.48.8
>
> cluster network = 39.64.0.0/12
> public network = 39.0.0.0/12
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> osd journal size = 9000
> filestore xattr use omap = true
> osd crush update on start = false
> osd pool default size = 3
> osd pool default min size = 1
> osd pool default pg num = 4096
> osd pool default pgp num = 4096
>
> mon clock drift allowed = .100
> osd mount options xfs = rw,noatime,inode64
>
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
