I'm starting to wonder if this has to do with some OSDs getting full
or with the 0.94.3 code. Earlier this afternoon I cleared out my test
cluster so there were no pools. I created a new rbd pool and started
filling it with six 1 TB fio jobs at replication 3, with 6 spindles
across six servers. The cluster was running 0.94.2 at the time. After
several hours of writes, we had the newly patched 0.94.3 binaries
ready for testing, so I rolled the update onto the test cluster while
the fio jobs were running. There were a few blocked I/Os as the
services were restarted (nothing I'm concerned about). Now that the
OSDs are about 60% full, the blocked I/O is becoming very frequent
even with the backports. Write bandwidth was a consistent 200 MB/s
until this point; now it fluctuates between 75 MB/s and 200 MB/s,
mostly hovering around 100 MB/s. Our production cluster uses XFS on
the OSDs; this test cluster uses EXT4.
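
For reference, each fio job is roughly of this shape -- I'm paraphrasing
from memory, and the image name here is made up, not the real one:

  # pre-create a 1 TB RBD image (rbd sizes are in MB), then write it out
  rbd create fill1 --pool rbd --size 1048576
  fio --name=fill1 --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=fill1 --rw=write --bs=4M --iodepth=16 --size=1T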

I'll see if I can go back to 0.94.2 and fill the cluster up again....
Going back to 0.94.2 and 0.94.0 still shows the issue (although I
didn't refill the cluster; I didn't delete what was already there).
I'm building the latest hammer-backports now and will see if it
resolves the issue.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Sep 21, 2015 at 4:33 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> On Mon, Sep 21, 2015 at 3:14 PM, Robert LeBlanc <rob...@leblancnet.us> wrote:
>>
>> In my lab cluster I can saturate the disks and I'm not seeing any of
>> the blocked I/Os from the Ceph side, although the client shows that
>> I/O stops for a while. I'm not convinced that it is load related.
>>
>> I was looking through the logs using the technique you described as
>> well as looking for the associated PG. There is a lot of data to go
>> through and it is taking me some time.
>>
>> We are rolling some of the backports slated for 0.94.4 into a build:
>> one for the PG split problem and five others that might help. The one
>> I'm really hopeful about is http://tracker.ceph.com/issues/12843, but
>> I'm not sure the messages we are seeing are exactly related. We are
>> planning to roll the new binaries tomorrow night, and I'll update this
>> thread after the new code has been rolled.
>
> Ah, yep, I didn't realize we still had any of those in hammer. That
> bug is indeed a good bet for what you're seeing.
> -Greg
>
>>
>> Thanks,
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Mon, Sep 21, 2015 at 4:00 PM, Gregory Farnum <gfar...@redhat.com> wrote:
>>> So it sounds like you've got two different things here:
>>> 1) You get a lot of slow operations that show up as warnings.
>>>
>>> 2) Rarely, you get blocked op warnings that don't seem to go away
>>> until the cluster state changes somehow.
>>>
>>> (2) is the interesting one. Since you say the cluster is under heavy
>>> load, I presume (1) is just you overloading your servers and getting
>>> some hot spots that take time to clear up.
>>>
>>> There have been bugs in the past where slow op warnings weren't
>>> getting removed when they should have. I don't *think* any are in
>>> .94.3 but could be wrong. Have you observed these from the other
>>> direction, where a client has blocked operations?
>>> If you want to go through the logs yourself, you should try to find
>>> all the lines about one of the operations that seems to be blocked.
>>> They aren't the most readable, but if you grep for the operation ID
>>> (client.4267090.0:3510311) and then, once you're in the right area,
>>> look at what the threads processing it are doing, you should get some
>>> idea of where things are going wrong that you can share.
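>>>
>>> Something along these lines, for example (adjust the path to wherever
>>> your OSD logs live):
>>>
>>>   grep 'client.4267090.0:3510311' /var/log/ceph/ceph-osd.72.log
>>>
>>> then note the timestamps and thread IDs (the 7f... hex column) on the
>>> matching lines and look at what those same threads are doing around
>>> that time.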
>>> -Greg
>>>
>>> On Sun, Sep 20, 2015 at 10:43 PM, Robert LeBlanc <rob...@leblancnet.us> 
>>> wrote:
>>>> We turned up the logging on an OSD that hits the problem pretty
>>>> frequently, although it clears up in less than 30 seconds. The logs
>>>> are at http://162.144.87.113/files/ceph-osd.112.log.xz and are 8.6 GB
>>>> uncompressed. Some of the messages we were seeing in ceph -w are:
>>>>
>>>> 2015-09-20 20:55:44.029041 osd.112 [WRN] 10 slow requests, 10 included
>>>> below; oldest blocked for > 30.132696 secs
>>>> 2015-09-20 20:55:44.029047 osd.112 [WRN] slow request 30.132696
>>>> seconds old, received at 2015-09-20 20:55:13.896286:
>>>> osd_op(client.3289538.0:62497509
>>>> rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint
>>>> object_size 8388608 write_size 8388608,write 2588672~4096] 17.118f0c67
>>>> ack+ondisk+write+known_if_redirected e57590) currently reached_pg
>>>> 2015-09-20 20:55:44.029051 osd.112 [WRN] slow request 30.132619
>>>> seconds old, received at 2015-09-20 20:55:13.896363:
>>>> osd_op(client.3289538.0:62497510
>>>> rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint
>>>> object_size 8388608 write_size 8388608,write 2908160~12288]
>>>> 17.118f0c67 ack+ondisk+write+known_if_redirected e57590) currently
>>>> waiting for rw locks
>>>> 2015-09-20 20:55:44.029054 osd.112 [WRN] slow request 30.132520
>>>> seconds old, received at 2015-09-20 20:55:13.896462:
>>>> osd_op(client.3289538.0:62497511
>>>> rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint
>>>> object_size 8388608 write_size 8388608,write 2949120~4096] 17.118f0c67
>>>> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
>>>> locks
>>>> 2015-09-20 20:55:44.029058 osd.112 [WRN] slow request 30.132415
>>>> seconds old, received at 2015-09-20 20:55:13.896567:
>>>> osd_op(client.3289538.0:62497512
>>>> rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint
>>>> object_size 8388608 write_size 8388608,write 2957312~4096] 17.118f0c67
>>>> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
>>>> locks
>>>> 2015-09-20 20:55:44.029061 osd.112 [WRN] slow request 30.132302
>>>> seconds old, received at 2015-09-20 20:55:13.896680:
>>>> osd_op(client.3289538.0:62497513
>>>> rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint
>>>> object_size 8388608 write_size 8388608,write 2998272~4096] 17.118f0c67
>>>> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
>>>> locks
>>>> 2015-09-20 20:55:45.029290 osd.112 [WRN] 9 slow requests, 5 included
>>>> below; oldest blocked for > 31.132843 secs
>>>> 2015-09-20 20:55:45.029298 osd.112 [WRN] slow request 31.132447
>>>> seconds old, received at 2015-09-20 20:55:13.896759:
>>>> osd_op(client.3289538.0:62497514
>>>> rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint
>>>> object_size 8388608 write_size 8388608,write 3035136~4096] 17.118f0c67
>>>> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
>>>> locks
>>>> 2015-09-20 20:55:45.029303 osd.112 [WRN] slow request 31.132362
>>>> seconds old, received at 2015-09-20 20:55:13.896845:
>>>> osd_op(client.3289538.0:62497515
>>>> rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint
>>>> object_size 8388608 write_size 8388608,write 3047424~4096] 17.118f0c67
>>>> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
>>>> locks
>>>> 2015-09-20 20:55:45.029309 osd.112 [WRN] slow request 31.132276
>>>> seconds old, received at 2015-09-20 20:55:13.896931:
>>>> osd_op(client.3289538.0:62497516
>>>> rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint
>>>> object_size 8388608 write_size 8388608,write 3072000~4096] 17.118f0c67
>>>> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
>>>> locks
>>>> 2015-09-20 20:55:45.029315 osd.112 [WRN] slow request 31.132199
>>>> seconds old, received at 2015-09-20 20:55:13.897008:
>>>> osd_op(client.3289538.0:62497517
>>>> rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint
>>>> object_size 8388608 write_size 8388608,write 3211264~4096] 17.118f0c67
>>>> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
>>>> locks
>>>> 2015-09-20 20:55:45.029326 osd.112 [WRN] slow request 31.132127
>>>> seconds old, received at 2015-09-20 20:55:13.897079:
>>>> osd_op(client.3289538.0:62497518
>>>> rbd_data.29b9ae3f960770.0000000000000200 [stat,set-alloc-hint
>>>> object_size 8388608 write_size 8388608,write 3235840~4096] 17.118f0c67
>>>> ack+ondisk+write+known_if_redirected e57590) currently waiting for rw
>>>> locks
>>>> ----------------
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>
>>>>
>>>> On Sun, Sep 20, 2015 at 7:02 PM, Robert LeBlanc <rob...@leblancnet.us> 
>>>> wrote:
>>>>>
>>>>> I was able to catch the tail end of one of these and increase the
>>>>> logging on it. I had to kill it a minute or two after the logging was
>>>>> increased because of the time of day.
>>>>>
>>>>> I've put the logs at https://robert.leblancnet.us/ceph-osd.8.log.xz .
>>>>>
>>>>> Thanks,
>>>>> ----------------
>>>>> Robert LeBlanc
>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>
>>>>>
>>>>> On Sun, Sep 20, 2015 at 9:03 AM, Robert LeBlanc  wrote:
>>>>>>
>>>>>> We had another incident of about 100 long-blocked I/Os this morning,
>>>>>> but I didn't get to it in time; it wound up clearing itself after
>>>>>> almost 1,000 seconds. One interesting note is that the blocked I/O
>>>>>> count kept creeping up until I saw a bunch of entries in the log like:
>>>>>>
>>>>>> 2015-09-20 08:20:01.870141 7f5fbe05c700  0 -- 10.217.72.12:6812/1468
>>>>>> >> 10.217.72.35:0/4027675 pipe(0x2f7a3000 sd=363 :6812 s=0 pgs=0 cs=0
>>>>>> l=1 c=0x1bd40840).accept replacing existing (lossy) channel (new one
>>>>>> lossy=1)
>>>>>> 2015-09-20 08:20:02.061539 7f5f43de3700  0 -- 10.217.72.12:6812/1468
>>>>>> >> 10.217.72.33:0/2012691 pipe(0x28857000 sd=408 :6812 s=0 pgs=0 cs=0
>>>>>> l=1 c=0x1bd43020).accept replacing existing (lossy) channel (new one
>>>>>> lossy=1)
>>>>>> 2015-09-20 08:20:02.817884 7f5fb2ca9700  0 -- 10.217.72.12:6812/1468
>>>>>> >> 10.217.72.33:0/2040360 pipe(0x283ff000 sd=605 :6812 s=0 pgs=0 cs=0
>>>>>> l=1 c=0x1bd402c0).accept replacing existing (lossy) channel (new one
>>>>>> lossy=1)
>>>>>>
>>>>>> After almost 100 of these in about a two-second period, the I/O starts
>>>>>> draining over the next 3-4 seconds. What does this message mean? 1468
>>>>>> appears to be the PID of one of the OSD processes, but the number in
>>>>>> the same position on the destination is not a PID. What would cause a
>>>>>> channel to be dropped and recreated?
>>>>>>
>>>>>> There are 10 OSDs on this host, but only two other OSDs show anything,
>>>>>> and then only two messages each (and those come many seconds AFTER the
>>>>>> problem clears up).
>>>>>>
>>>>>> OSD.128
>>>>>>
>>>>>> 2015-09-20 08:25:32.331268 7fe92e7e0700  0 -- 10.217.72.12:6816/7060
>>>>>> submit_message osd_op_reply(5553069
>>>>>> rbd_data.3e30b0275d493f.000000000000d800 [set-alloc-hint object_size
>>>>>> 4194304 write_size 4194304,write 143360~4096] v57084'16557680
>>>>>> uv16557680 ack = 0) v6 remote, 10.217.72.33:0/4005620, failed lossy
>>>>>> con, dropping message 0xa099180
>>>>>> 2015-09-20 08:25:32.331401 7fe92e7e0700  0 -- 10.217.72.12:6816/7060
>>>>>> submit_message osd_op_reply(5553069
>>>>>> rbd_data.3e30b0275d493f.000000000000d800 [set-alloc-hint object_size
>>>>>> 4194304 write_size 4194304,write 143360~4096] v57084'16557680
>>>>>> uv16557680 ondisk = 0) v6 remote, 10.217.72.33:0/4005620, failed lossy
>>>>>> con, dropping message 0x16217600
>>>>>>
>>>>>> OSD.121
>>>>>>
>>>>>> 2015-09-20 08:29:15.192055 7f51c07e2700  0 -- 10.217.72.12:6802/25568
>>>>>> submit_message osd_op_reply(1483497
>>>>>> rbd_data.3fc0847f32a8f4.0000000000003200 [set-alloc-hint object_size
>>>>>> 4194304 write_size 4194304,write 135168~4096] v57086'14949171
>>>>>> uv14949171 ack = 0) v6 remote, 10.217.72.31:0/5002159, failed lossy
>>>>>> con, dropping message 0x1dd8a840
>>>>>> 2015-09-20 08:29:15.192612 7f51c07e2700  0 -- 10.217.72.12:6802/25568
>>>>>> submit_message osd_op_reply(1483497
>>>>>> rbd_data.3fc0847f32a8f4.0000000000003200 [set-alloc-hint object_size
>>>>>> 4194304 write_size 4194304,write 135168~4096] v57086'14949171
>>>>>> uv14949171 ondisk = 0) v6 remote, 10.217.72.31:0/5002159, failed lossy
>>>>>> con, dropping message 0x1476ec00
>>>>>>
>>>>>> Our Ceph processes are started with some really high limits:
>>>>>> root      1465  0.0  0.0 127700  3492 ?        Ss   Jun11   0:00
>>>>>> /bin/bash -c ulimit -n 32768; /usr/bin/ceph-osd -i 126 --pid-file
>>>>>> /var/run/ceph/osd.126.pid -c /etc/ceph/ceph.conf --cluster ceph -f
>>>>>> root      1468  9.7  1.4 4849560 923536 ?      Sl   Jun11 14190:59
>>>>>> /usr/bin/ceph-osd -i 126 --pid-file /var/run/ceph/osd.126.pid -c
>>>>>> /etc/ceph/ceph.conf --cluster ceph -f
>>>>>>
>>>>>> [root@ceph2 1468]# cat /etc/security/limits.d/90-local.conf
>>>>>> *       -       nofile  16384
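>>>>>>
>>>>>> (The limit the running daemon actually got can be double-checked from
>>>>>> /proc, e.g.:
>>>>>>
>>>>>>   grep 'Max open files' /proc/1468/limits
>>>>>>
>>>>>> to confirm whether the 32768 from the start script took effect rather
>>>>>> than the 16384 from limits.d.)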
>>>>>>
>>>>>> We have 130 OSDs on 13 hosts. We also have ~50 KVM VMs running on this
>>>>>> storage.
>>>>>>
>>>>>>     cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
>>>>>>      health HEALTH_WARN
>>>>>>             198 pgs backfill
>>>>>>             4 pgs backfilling
>>>>>>             169 pgs degraded
>>>>>>             150 pgs recovery_wait
>>>>>>             169 pgs stuck degraded
>>>>>>             352 pgs stuck unclean
>>>>>>             12 pgs stuck undersized
>>>>>>             12 pgs undersized
>>>>>>             recovery 161065/41285858 objects degraded (0.390%)
>>>>>>             recovery 2871014/41285858 objects misplaced (6.954%)
>>>>>>             noscrub,nodeep-scrub flag(s) set
>>>>>>      monmap e2: 3 mons at
>>>>>> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
>>>>>>             election epoch 180, quorum 0,1,2 mon1,mon2,mon3
>>>>>>      osdmap e57086: 130 osds: 130 up, 130 in; 270 remapped pgs
>>>>>>             flags noscrub,nodeep-scrub
>>>>>>       pgmap v10921036: 2308 pgs, 3 pools, 39046 GB data, 9735 kobjects
>>>>>>             151 TB used, 320 TB / 472 TB avail
>>>>>>             161065/41285858 objects degraded (0.390%)
>>>>>>             2871014/41285858 objects misplaced (6.954%)
>>>>>>                 1956 active+clean
>>>>>>                  183 active+remapped+wait_backfill
>>>>>>                   82 active+recovery_wait+degraded
>>>>>>                   67 active+recovery_wait+degraded+remapped
>>>>>>                    9 active+undersized+degraded+remapped+wait_backfill
>>>>>>                    6 active+degraded+remapped+wait_backfill
>>>>>>                    2 active+undersized+degraded+remapped+backfilling
>>>>>>                    2 active+degraded+remapped+backfilling
>>>>>>                    1 active+recovery_wait+undersized+degraded+remapped
>>>>>> recovery io 25770 kB/s, 6 objects/s
>>>>>>   client io 78274 kB/s rd, 119 MB/s wr, 5949 op/s
>>>>>>
>>>>>> Any ideas would be helpful.
>>>>>> ----------------
>>>>>> Robert LeBlanc
>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>
>>>>>>
>>>>>> On Sat, Sep 19, 2015 at 1:03 AM, Robert LeBlanc  wrote:
>>>>>>>
>>>>>>> We have had two situations today where I/O just seems to be blocked
>>>>>>> indefinitely on our production cluster (0.94.3). In the case this
>>>>>>> morning, it was just normal I/O traffic, no recovery or backfill. In
>>>>>>> the case this evening, we were backfilling to some new OSDs. I would
>>>>>>> have loved to bump up the debugging to get an idea of what was going
>>>>>>> on, but time was exhausted. During the incident this evening I was
>>>>>>> able to do some additional troubleshooting, but I got really anxious
>>>>>>> after I/O had been blocked for 10 minutes and Ops was getting hot
>>>>>>> under the collar.
>>>>>>>
>>>>>>> Here are the important parts of the logs:
>>>>>>> [osd.30]
>>>>>>> 2015-09-18 23:05:36.188251 7efed0ef0700  0 log_channel(cluster) log
>>>>>>> [WRN] : slow request 30.662958 seconds old,
>>>>>>>  received at 2015-09-18 23:05:05.525220: 
>>>>>>> osd_op(client.3117179.0:18654441
>>>>>>>  rbd_data.1099d2f67aaea.0000000000000f62 [set-alloc-hint object_size
>>>>>>> 8388608 write_size 8388608,write 1048576~643072] 4.5ba1672c
>>>>>>> ack+ondisk+write+known_if_redirected e55919)
>>>>>>>  currently waiting for subops from 32,70,72
>>>>>>>
>>>>>>> [osd.72]
>>>>>>> 2015-09-18 23:05:19.302985 7f3fa19f8700  0 log_channel(cluster) log
>>>>>>> [WRN] : slow request 30.200408 seconds old,
>>>>>>>  received at 2015-09-18 23:04:49.102519: osd_op(client.4267090.0:3510311
>>>>>>>  rbd_data.3f41d41bd65b28.0000000000009e2b [set-alloc-hint object_size
>>>>>>> 4194304 write_size 4194304,write 1048576~421888] 17.40adcada
>>>>>>> ack+ondisk+write+known_if_redirected e55919)
>>>>>>>  currently waiting for subops from 2,30,90
>>>>>>>
>>>>>>> The other OSDs listed (32,70,2,90) did not have any errors in their
>>>>>>> logs about blocked I/O. It seems that osd.30 was waiting for osd.72
>>>>>>> and vice versa. I looked at top and iostat on these two hosts, and
>>>>>>> the OSD processes and disk I/O were pretty idle.
>>>>>>>
>>>>>>> I know that this isn't a lot to go on. Our cluster is under very heavy
>>>>>>> load and we get several blocked I/Os every hour, but they usually
>>>>>>> clear up within 15 seconds. We seem to get I/O blocked when the op
>>>>>>> latency of the cluster goes above 1 (average from all OSDs as seen by
>>>>>>> Graphite).
>>>>>>>
>>>>>>> Has anyone seen this kind of indefinitely blocked I/O? Bouncing
>>>>>>> osd.72 immediately cleared all the blocked I/O, and it was fine after
>>>>>>> rejoining the cluster. Which logs, and at what level, would be most
>>>>>>> beneficial to increase for troubleshooting a case like this?
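>>>>>>>
>>>>>>> I'm guessing something like bumping the OSD and messenger debug on
>>>>>>> the stuck OSD while it is wedged and dumping the in-flight ops, e.g.:
>>>>>>>
>>>>>>>   ceph tell osd.72 injectargs '--debug_osd 20 --debug_ms 1'
>>>>>>>   ceph daemon osd.72 dump_ops_in_flight   # on the host holding osd.72
>>>>>>>
>>>>>>> but I'd like to know what is actually worth capturing before it
>>>>>>> happens again.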
>>>>>>>
>>>>>>> I hope this makes sense; it has been a long day.
>>>>>>>
>>>>>>> ----------------
>>>>>>> Robert LeBlanc
>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
