The appearance of these socket closed messages seems to coincide with the slowdown symptoms. What is the cause?

2015-04-23T14:08:47.111838+00:00 i-65062482 kernel: [ 4229.485489] libceph: osd1 192.168.160.4:6800 socket closed (con state OPEN)

2015-04-23T14:09:06.961823+00:00 i-65062482 kernel: [ 4249.332547] libceph: osd2 192.168.96.4:6800 socket closed (con state OPEN)

2015-04-23T14:09:09.701819+00:00 i-65062482 kernel: [ 4252.070594] libceph: osd4 192.168.64.4:6800 socket closed (con state OPEN)

2015-04-23T14:09:10.381817+00:00 i-65062482 kernel: [ 4252.755400] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)

2015-04-23T14:09:14.831817+00:00 i-65062482 kernel: [ 4257.200257] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)

2015-04-23T14:13:57.061877+00:00 i-65062482 kernel: [ 4539.431624] libceph: osd4 192.168.64.4:6800 socket closed (con state OPEN)

2015-04-23T14:13:57.541842+00:00 i-65062482 kernel: [ 4539.913284] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)

2015-04-23T14:13:59.801822+00:00 i-65062482 kernel: [ 4542.177187] libceph: osd3 192.168.0.4:6800 socket closed (con state OPEN)

2015-04-23T14:14:11.361819+00:00 i-65062482 kernel: [ 4553.733566] libceph: osd4 192.168.64.4:6800 socket closed (con state OPEN)

2015-04-23T14:14:47.871829+00:00 i-65062482 kernel: [ 4590.242136] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)

2015-04-23T14:14:47.991826+00:00 i-65062482 kernel: [ 4590.364078] libceph: osd2 192.168.96.4:6800 socket closed (con state OPEN)

2015-04-23T14:15:00.081817+00:00 i-65062482 kernel: [ 4602.452980] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)

2015-04-23T14:16:21.301820+00:00 i-65062482 kernel: [ 4683.671614] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)
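
For what it's worth, a quick way to tally these events per OSD and per minute, so they can be lined up against the stall times (a rough sketch; /var/log/syslog is an assumption, use whichever file receives kernel messages on your system):

  # Count "socket closed" events per OSD per minute.
  # /var/log/syslog is an assumption; adjust to your syslog configuration.
  grep 'libceph:.*socket closed' /var/log/syslog |
    awk '{ for (i = 1; i <= NF; i++) if ($i == "libceph:") print $(i+1), substr($1, 1, 16) }' |
    sort | uniq -c | sort -rn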



Jeff

On 04/23/2015 12:26 AM, Jeff Epstein wrote:

Do you have some idea how I can diagnose this problem?

I'd look at ceph -s output while you get these stuck processes to see if there's any unusual activity (scrub/deep scrub/recovery/backfills/...). Is it correlated in any way with rbd removal (i.e., does the write blocking only appear if you removed at least one rbd in, say, the hour before the performance problems)?
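
For example, something along these lines records a timestamped ceph -s every few seconds while the writes are stuck, so the two can be compared afterwards (the output file name is only an example):

  # Capture cluster state while reproducing the stall.
  # ceph-status.log is just an example output file.
  while true; do
      date -u +%FT%TZ >> ceph-status.log
      ceph -s         >> ceph-status.log 2>&1
      sleep 10
  done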

I'm not familiar with Amazon VMs. If you map the rbds using the kernel driver to local block devices, do you have control over the kernel you run? (I've seen reports of various problems with older kernels, and you probably want the latest possible.)
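
(For reference, the kernel-driver workflow looks roughly like this; the pool/image names below are placeholders:)

  uname -r                # kernel actually running the rbd/libceph client code
  rbd map rbd/myimage     # rbd/myimage is a placeholder pool/image; maps to a /dev/rbdN device
  rbd showmapped          # lists images currently mapped via the kernel driver
  mkfs.ext4 /dev/rbd0     # format the mapped device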

ceph status shows nothing unusual. However, on the problematic node, we typically see entries in ps like this:

 1468 12329 root     D     0.0 mkfs.ext4       wait_on_page_bit
 1468 12332 root     D     0.0 mkfs.ext4       wait_on_buffer

Notice the "D" blocking state. Here, mkfs is stopped on some wait functions for long periods of time. (Also, we are formatting the RBDs as ext4 even though the OSDs are xfs; I assume this shouldn't be a problem?)

We're on kernel 3.18.4pl2, which is pretty recent. Still, an outdated kernel driver isn't out of the question; if anyone has any concrete information, I'd be grateful.

Jeff

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
