The appearance of these socket closed messages seems to coincide with
the slowdown symptoms. What is the cause?
2015-04-23T14:08:47.111838+00:00 i-65062482 kernel: [ 4229.485489] libceph: osd1 192.168.160.4:6800 socket closed (con state OPEN)
2015-04-23T14:09:06.961823+00:00 i-65062482 kernel: [ 4249.332547] libceph: osd2 192.168.96.4:6800 socket closed (con state OPEN)
2015-04-23T14:09:09.701819+00:00 i-65062482 kernel: [ 4252.070594] libceph: osd4 192.168.64.4:6800 socket closed (con state OPEN)
2015-04-23T14:09:10.381817+00:00 i-65062482 kernel: [ 4252.755400] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)
2015-04-23T14:09:14.831817+00:00 i-65062482 kernel: [ 4257.200257] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)
2015-04-23T14:13:57.061877+00:00 i-65062482 kernel: [ 4539.431624] libceph: osd4 192.168.64.4:6800 socket closed (con state OPEN)
2015-04-23T14:13:57.541842+00:00 i-65062482 kernel: [ 4539.913284] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)
2015-04-23T14:13:59.801822+00:00 i-65062482 kernel: [ 4542.177187] libceph: osd3 192.168.0.4:6800 socket closed (con state OPEN)
2015-04-23T14:14:11.361819+00:00 i-65062482 kernel: [ 4553.733566] libceph: osd4 192.168.64.4:6800 socket closed (con state OPEN)
2015-04-23T14:14:47.871829+00:00 i-65062482 kernel: [ 4590.242136] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)
2015-04-23T14:14:47.991826+00:00 i-65062482 kernel: [ 4590.364078] libceph: osd2 192.168.96.4:6800 socket closed (con state OPEN)
2015-04-23T14:15:00.081817+00:00 i-65062482 kernel: [ 4602.452980] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)
2015-04-23T14:16:21.301820+00:00 i-65062482 kernel: [ 4683.671614] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)
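One way to see whether the messages cluster on a particular OSD is to count them per OSD from the syslog. A minimal sketch (the function name is mine, and the log path in the usage comment is an assumption; adjust to wherever your kernel messages land):

```shell
# Count libceph "socket closed" events per OSD from syslog lines on stdin.
count_socket_closed() {
  awk '/libceph/ && /socket closed/ {
    # Find the "osdN" field on the line and tally it.
    for (i = 1; i <= NF; i++)
      if ($i ~ /^osd[0-9]+$/) count[$i]++
  }
  END { for (o in count) print o, count[o] }'
}
# Usage (log path is an assumption): count_socket_closed < /var/log/syslog
```

Comparing the per-minute distribution of these counts against the times when writes stall would show whether the correlation holds.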
Jeff
On 04/23/2015 12:26 AM, Jeff Epstein wrote:
Do you have some idea how I can diagnose this problem?
I'd look at ceph -s output while you have these stuck processes, to see
if there's any unusual activity (scrub/deep
scrub/recovery/backfills/...). Is it correlated in any way with rbd
removal (i.e., write blocking doesn't appear unless you removed at least
one rbd in, say, the hour before the write performance problems)?
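To capture that correlation, it may help to log timestamped cluster status snapshots while reproducing the stall. A minimal sketch (the function name, default log path, and 30-second interval are my own choices, not anything from the thread):

```shell
# Append one timestamped status snapshot to a log file.
# Defaults ("ceph -s", /tmp/ceph-status.log) are assumptions; override as needed.
snapshot_once() {
  cmd=${1:-'ceph -s'} log=${2:-/tmp/ceph-status.log}
  { date -u +'%Y-%m-%dT%H:%M:%SZ'; sh -c "$cmd"; } >> "$log" 2>&1
}
# During a stall window: while true; do snapshot_once; sleep 30; done
```

Lining the snapshot timestamps up against the libceph messages and the stuck-process appearances should show whether scrub/recovery activity coincides with the slowdowns.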
I'm not familiar with Amazon VMs. If you map the rbds to local block
devices using the kernel driver, do you have control over the kernel
you run? (I've seen reports of various problems with older kernels,
and you probably want the latest possible.)
ceph status shows nothing unusual. However, on the problematic node,
we typically see entries in ps like this:
1468 12329 root D 0.0 mkfs.ext4 wait_on_page_bit
1468 12332 root D 0.0 mkfs.ext4 wait_on_buffer
Notice the "D" (uninterruptible sleep) state. Here, mkfs is blocked in
wait functions for long periods of time. (Also, we format the RBDs as
ext4 even though the OSDs use XFS; I assume this shouldn't be a
problem?)
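A quick way to spot all such blocked tasks, not just mkfs, is to filter ps output on the D state and show the kernel wait channel. A minimal sketch (the function name is mine; the format specifiers are standard procps ones):

```shell
# List uninterruptible-sleep (D-state) tasks and the kernel function
# they are currently waiting in (wchan).
list_d_state() {
  ps -eo pid,stat,wchan:32,comm | awk 'NR > 1 && $2 ~ /^D/'
}
```

Repeated samples showing the same tasks stuck in wait_on_page_bit/wait_on_buffer would point at writeback to the rbd device stalling, rather than mkfs itself misbehaving.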
We're on kernel 3.18.4pl2, which is pretty recent. Still, an outdated
kernel driver isn't out of the question; if anyone has any concrete
information, I'd be grateful.
Jeff
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com