Re: [ceph-users] help troubleshooting some osd communication problems

2016-04-29 Thread Mike Lovell
On Fri, Apr 29, 2016 at 9:34 AM, Mike Lovell 
wrote:

> On Fri, Apr 29, 2016 at 5:54 AM, Alexey Sheplyakov <
> asheplya...@mirantis.com> wrote:
>
>> Hi,
>>
>> > i also wonder if just taking 148 out of the cluster (probably just
>> marking it out) would help
>>
>> As far as I understand this can only harm your data. The acting set of PG
>> 17.73 is [41, 148], so after stopping or taking out OSD 148, OSD 41 would
>> hold the only copy of the objects in PG 17.73 (and the PG would stop
>> accepting writes).
>>
>> > since there are other osds in the up set (140 and 5)
>>
>> These OSDs are not in the acting set; they are missing at least some of the
>> objects in PG 17.73 and are copying them from OSDs 41 and 148. Naturally
>> this slows down or even blocks writes to PG 17.73.
>>
>
> k. i didn't know if it could just use the members of the up set that are
> not in the acting set for completing writes. when thinking through it in my
> head it seemed reasonable, but i could also see pitfalls with doing it.
> that's why i was asking if it was possible.
>
>
> > the only thing holding things together right now is a while loop doing
>> a 'ceph osd down 41' every minute
>>
>> As far as I understand this disturbs the backfilling and further delays
>> writes to that poor PG.
>>
>
> it definitely does seem to have an impact similar to that. the only upside
> is that it clears the slow io messages, though i don't know if it actually
> lets the client io complete. recovery doesn't make any progress in between
> the down commands, though. it's not making any progress on its own anyway.
>

i went to check things this morning and noticed that the number of misplaced
objects had dropped from what i was expecting, and i was occasionally seeing
lines from ceph -w saying a number of objects were recovering. the only PG
in a state other than active+clean was the one that 41 and 148 were
bickering about, so it looks like they were now passing traffic. it appeared
to start just after one of the osd down events from the loop i had running.
a little while after the backfill started making progress, it completed, so
it's fine now. i would still like to find out the cause since this has
happened twice now, but at least it's not an emergency for me at the moment.
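
for anyone following along, something like this is enough to watch it (the
grep is just a rough filter, not necessarily exactly what i ran):

  # stream cluster events and recovery rates as they happen
  ceph -w
  # list any PGs that are not active+clean (pgs_brief keeps the output small)
  ceph pg dump pgs_brief 2>/dev/null | grep -v 'active+clean'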

one other thing that was odd was that i saw the misplaced objects go
negative during the backfill. this is one of the lines from ceph -w.

2016-04-29 10:38:15.011241 mon.0 [INF] pgmap v27055697: 6144 pgs: 6143
active+clean, 1 active+undersized+degraded+remapped+backfilling; 123 TB
data, 372 TB used, 304 TB / 691 TB avail; 130 MB/s rd, 135 MB/s wr, 11210
op/s; 14547/93845634 objects degraded (0.016%); -13959/93845634 objects
misplaced (-0.015%); 27358 kB/s, 7 objects/s recovering

it seemed to complete around the point where it got to -14.5k misplaced.
i'm guessing this is just a reporting error but i immediately started a
deep-scrub on the pg just to make sure things are consistent.
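
for reference, kicking off the deep scrub is just the standard per-pg
command, 17.73 being the PG from the pg dump earlier:

  # manually deep-scrub the PG to verify the replicas are consistent
  ceph pg deep-scrub 17.73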

mike


Re: [ceph-users] help troubleshooting some osd communication problems

2016-04-29 Thread Mike Lovell
On Fri, Apr 29, 2016 at 5:54 AM, Alexey Sheplyakov  wrote:

> Hi,
>
> > i also wonder if just taking 148 out of the cluster (probably just
> marking it out) would help
>
> As far as I understand this can only harm your data. The acting set of PG
> 17.73 is [41, 148], so after stopping or taking out OSD 148, OSD 41 would
> hold the only copy of the objects in PG 17.73 (and the PG would stop
> accepting writes).
>
> > since there are other osds in the up set (140 and 5)
>
> These OSDs are not in the acting set; they are missing at least some of the
> objects in PG 17.73 and are copying them from OSDs 41 and 148. Naturally
> this slows down or even blocks writes to PG 17.73.
>

k. i didn't know if it could just use the members of the up set that are
not in the acting set for completing writes. when thinking through it in my
head it seemed reasonable, but i could also see pitfalls with doing it.
that's why i was asking if it was possible.


> the only thing holding things together right now is a while loop doing a
> 'ceph osd down 41' every minute
>
> As far as I understand this disturbs the backfilling and further delays
> writes to that poor PG.
>

it definitely does seem to have an impact similar to that. the only upside
is that it clears the slow io messages, though i don't know if it actually
lets the client io complete. recovery doesn't make any progress in between
the down commands, though. it's not making any progress on its own anyway.

mike


Re: [ceph-users] help troubleshooting some osd communication problems

2016-04-29 Thread Alexey Sheplyakov
Hi,

> i also wonder if just taking 148 out of the cluster (probably just
> marking it out) would help

As far as I understand this can only harm your data. The acting set of PG
17.73 is [41, 148], so after stopping or taking out OSD 148, OSD 41 would
hold the only copy of the objects in PG 17.73 (and the PG would stop
accepting writes).

> since there are other osds in the up set (140 and 5)

These OSDs are not in the acting set; they are missing at least some of the
objects in PG 17.73 and are copying them from OSDs 41 and 148. Naturally
this slows down or even blocks writes to PG 17.73.
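
You can see the difference between the up and acting sets for that PG with
something like the following; the output shown is only an illustration based
on the pg dump in your earlier mail, not a capture from your cluster:

  # show the current up set and acting set for the PG
  ceph pg map 17.73
  # output is along the lines of:
  # osdmap e<N> pg 17.73 (17.73) -> up [140,5,41] acting [41,148]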

> the only thing holding things together right now is a while loop doing a
> 'ceph osd down 41' every minute

As far as I understand this disturbs the backfilling and further delays
writes to that poor PG.

Best regards,
   Alexey


On Fri, Apr 29, 2016 at 8:06 AM, Mike Lovell 
wrote:

> i attempted to grab some logs from the two osds in question with debug_ms
> and debug_osd at 20. i have looked through them a little bit but digging
> through the logs at this verbosity is something i don't have much
> experience with. hopefully someone on the list can help make sense of it.
> the logs are at these urls.
>
> http://stuff.dev-zero.net/ceph-osd.148.debug.log.gz
> http://stuff.dev-zero.net/ceph-osd.41.debug.log.gz
> http://stuff.dev-zero.net/ceph.mon.log.gz
>
> the last one is a trimmed portion of the ceph.log from one of the monitors
> for the time frame the osd logs cover. to make these, i moved the existing
> log file, set the increased verbosity, had the osds reopen their log files,
> gave it a few minutes, moved the log files again, and had the osds reopen
> their logs a second time. this resulted in something that is hopefully just
> enough context to see what's going on.
>
> i did a 'ceph osd down 41' at about the 20:40:06 mark and the cluster
> seems to report normal data for the next 30 seconds. after that, the slow
> io messages from both osds about ops from each other start appearing. i
> tried tracing a few ops in both logs but couldn't make sense of it. can
> anyone take a look and/or give pointers on how to understand what's going
> on?
>
> oh. this is 0.94.5. the basic cluster layout is two racks with 9 nodes in
> each rack with either 12 or 14 osds per node. ssd cache tiering is being
> used. the pools are just replicated ones with a size of 3. here is the data
> from pg dump for the pg that isn't making progress on recovery, which i'm
> guessing is a result of the same problem. the workload is a bunch of vms
> with rbd.
>
> pg_stat objects mip degr misp unf bytes log disklog state state_stamp v
> reported up up_primary acting acting_primary last_scrub scrub_stamp
> last_deep_scrub deep_scrub_stamp
> 17.73 14545 0 14545 14547 0 62650182662 10023 10023
> active+undersized+degraded+remapped+backfilling 2016-04-29 01:59:42.148644
> 161768'26740604 161768:37459478 [140,5,41] 140 [41,148] 41 55547'11246156 
> 2015-09-15
> 08:53:32.724322 53282'7470580 2015-09-01 07:19:45.054261
>
> i also wonder if just taking 148 out of the cluster (probably just marking
> it out) would help. the min size is 2 but, since there are other osds in
> the up set (140 and 5), will the cluster keep working? or will it block
> until the PG has finished with recovery to the new osds?
>
> thanks in advance. hopefully someone can help soon because right now the
> only thing holding things together is a while loop doing a 'ceph osd down
> 41' every minute. :(
>
> mike
>
> On Thu, Apr 28, 2016 at 5:49 PM, Samuel Just  wrote:
>
>> I'd guess that to make any progress we'll need debug ms = 20 on both
>> sides of the connection when a message is lost.
>> -Sam
>>
>> On Thu, Apr 28, 2016 at 2:38 PM, Mike Lovell 
>> wrote:
>> > there was a problem on one of the clusters i manage a couple weeks ago
>> where
>> > pairs of OSDs would wait indefinitely on subops from the other OSD in
>> the
>> > pair. we used a liberal dose of "ceph osd down ##" on the osds and
>> > eventually things just sorted them out a couple days later.
>> >
>> > it seems to have come back today and co-workers and i are stuck on
>> trying to
>> > figure out why this is happening. here are the details that i know.
>> > currently 2 OSDs, 41 and 148, keep waiting on subops from each other
>> > resulting in lines such as the following in ceph.log.
>> >
>> > 2016-04-28 13:29:26.875797 osd.41 10.217.72.22:6802/3769 56283 :
>> cluster
>> > [WRN] slow request 30.642736 seconds old, received at 2016-04-28
>> > 13:28:56.233001: osd_op(client.11172360.0:516946146
>> > rbd_data.36bfe359c4998.0d08 [set-alloc-hint object_size
>> 4194304
>> > write_size 4194304,write 1835008~143360] 17.3df49873 RETRY=1
>> > ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
>> > waiting for subops from 5,140,148
>> >
>> > 2016-04-28 13:29:28.031452 osd.148 

Re: [ceph-users] help troubleshooting some osd communication problems

2016-04-29 Thread Mike Lovell
i attempted to grab some logs from the two osds in question with debug_ms
and debug_osd at 20. i have looked through them a little bit but digging
through the logs at this verbosity is something i don't have much
experience with. hopefully someone on the list can help make sense of it.
the logs are at these urls.

http://stuff.dev-zero.net/ceph-osd.148.debug.log.gz
http://stuff.dev-zero.net/ceph-osd.41.debug.log.gz
http://stuff.dev-zero.net/ceph.mon.log.gz

the last one is a trimmed portion of the ceph.log from one of the monitors
for the time frame the osd logs cover. to make these, i moved the existing
log file, set the increased verbosity, had the osds reopen their log files,
gave it a few minutes, moved the log files again, and had the osds reopen
their logs a second time. this resulted in something that is hopefully just
enough context to see what's going on.
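
for anyone wanting to do the same thing, it was roughly along these lines
(osd ids and paths are just examples, and the exact admin-socket syntax may
vary a bit between versions, so treat this as a sketch):

  # bump messenger and osd debugging on the two osds involved
  ceph tell osd.41 injectargs '--debug_ms 20 --debug_osd 20'
  ceph tell osd.148 injectargs '--debug_ms 20 --debug_osd 20'
  # on the host running osd.41: move the current log aside and reopen it
  mv /var/log/ceph/ceph-osd.41.log /var/log/ceph/ceph-osd.41.log.pre-debug
  ceph daemon osd.41 log reopen
  # ...let it run for a few minutes, move the new log aside, reopen again,
  # and then drop the debug levels back down
  ceph tell osd.41 injectargs '--debug_ms 1 --debug_osd 1'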

i did a 'ceph osd down 41' at about the 20:40:06 mark and the cluster seems
to report normal data for the next 30 seconds. after that, the slow io
messages from both osds about ops from each other start appearing. i tried
tracing a few ops in both logs but couldn't make sense of it. can anyone
take a look and/or give pointers on how to understand what's going on?

oh. this is 0.94.5. the basic cluster layout is two racks with 9 nodes in
each rack with either 12 or 14 osds per node. ssd cache tiering is being
used. the pools are just replicated ones with a size of 3. here is the data
from pg dump for the pg that isn't making progress on recovery, which i'm
guessing is a result of the same problem. the workload is a bunch of vms
with rbd.

pg_stat objects mip degr misp unf bytes log disklog state state_stamp v
reported up up_primary acting acting_primary last_scrub scrub_stamp
last_deep_scrub deep_scrub_stamp
17.73 14545 0 14545 14547 0 62650182662 10023 10023
active+undersized+degraded+remapped+backfilling 2016-04-29 01:59:42.148644
161768'26740604 161768:37459478 [140,5,41] 140 [41,148] 41
55547'11246156 2015-09-15
08:53:32.724322 53282'7470580 2015-09-01 07:19:45.054261

i also wonder if just taking 148 out of the cluster (probably just marking
it out) would help. the min size is 2 but, since there are other osds in
the up set (140 and 5), will the cluster keep working? or will it block
until the PG has finished with recovery to the new osds?
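
for context, these are the sort of commands i'd use to check that; the pool
name is a placeholder for whatever pool 17 actually is:

  # confirm the pool's min_size (2 in this case)
  ceph osd pool get <pool-name> min_size
  # look at the PG's recovery state and which osds it is waiting on
  ceph pg 17.73 query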

thanks in advance. hopefully someone can help soon because right now the
only thing holding things together is a while loop doing a 'ceph osd down
41' every minute. :(
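
(the loop itself is nothing fancy; roughly this, with the interval being
approximate:)

  # band-aid: mark osd.41 down every minute as a temporary workaround
  # for the stuck subops
  while true; do
    ceph osd down 41
    sleep 60
  done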

mike

On Thu, Apr 28, 2016 at 5:49 PM, Samuel Just  wrote:

> I'd guess that to make any progress we'll need debug ms = 20 on both
> sides of the connection when a message is lost.
> -Sam
>
> On Thu, Apr 28, 2016 at 2:38 PM, Mike Lovell 
> wrote:
> > there was a problem on one of the clusters i manage a couple weeks ago
> where
> > pairs of OSDs would wait indefinitely on subops from the other OSD in the
> > pair. we used a liberal dose of "ceph osd down ##" on the osds and
> > eventually things just sorted them out a couple days later.
> >
> > it seems to have come back today and co-workers and i are stuck on
> trying to
> > figure out why this is happening. here are the details that i know.
> > currently 2 OSDs, 41 and 148, keep waiting on subops from each other
> > resulting in lines such as the following in ceph.log.
> >
> > 2016-04-28 13:29:26.875797 osd.41 10.217.72.22:6802/3769 56283 : cluster
> > [WRN] slow request 30.642736 seconds old, received at 2016-04-28
> > 13:28:56.233001: osd_op(client.11172360.0:516946146
> > rbd_data.36bfe359c4998.0d08 [set-alloc-hint object_size
> 4194304
> > write_size 4194304,write 1835008~143360] 17.3df49873 RETRY=1
> > ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
> > waiting for subops from 5,140,148
> >
> > 2016-04-28 13:29:28.031452 osd.148 10.217.72.11:6820/6487 25324 :
> cluster
> > [WRN] slow request 30.960922 seconds old, received at 2016-04-28
> > 13:28:57.070471: osd_op(client.24127500.0:2960618
> > rbd_data.38178d8adeb4d.10f8 [set-alloc-hint object_size
> 8388608
> > write_size 8388608,write 3194880~4096] 17.fb41a37c RETRY=1
> > ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
> > waiting for subops from 41,115
> >
> > from digging in the logs, it appears like some messages are being lost
> > between the OSDs. this is what osd.41 sees:
> > -
> > 2016-04-28 13:28:56.233702 7f3b171e0700  1 -- 10.217.72.22:6802/3769 <==
> > client.11172360 10.217.72.41:0/6031968 6 
> > osd_op(client.11172360.0:516946146
> rbd_data.36bfe359c4998.0d08
> > [set-alloc-hint object_size 4194304 write_size 4194304,write
> 1835008~143360]
> > 17.3df49873 RETRY=1 ack+ondisk+retry+write+redirected+known_if_redirected
> > e159001) v5  236+0+143360 (781016428 0 3953649960) 0x1d551c00 con
> > 0x1a78d9c0
> > 2016-04-28 

Re: [ceph-users] help troubleshooting some osd communication problems

2016-04-28 Thread Samuel Just
I'd guess that to make any progress we'll need debug ms = 20 on both
sides of the connection when a message is lost.
-Sam

On Thu, Apr 28, 2016 at 2:38 PM, Mike Lovell  wrote:
> there was a problem on one of the clusters i manage a couple weeks ago where
> pairs of OSDs would wait indefinitely on subops from the other OSD in the
> pair. we used a liberal dose of "ceph osd down ##" on the osds and
> eventually things just sorted them out a couple days later.
>
> it seems to have come back today and co-workers and i are stuck on trying to
> figure out why this is happening. here are the details that i know.
> currently 2 OSDs, 41 and 148, keep waiting on subops from each other
> resulting in lines such as the following in ceph.log.
>
> 2016-04-28 13:29:26.875797 osd.41 10.217.72.22:6802/3769 56283 : cluster
> [WRN] slow request 30.642736 seconds old, received at 2016-04-28
> 13:28:56.233001: osd_op(client.11172360.0:516946146
> rbd_data.36bfe359c4998.0d08 [set-alloc-hint object_size 4194304
> write_size 4194304,write 1835008~143360] 17.3df49873 RETRY=1
> ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
> waiting for subops from 5,140,148
>
> 2016-04-28 13:29:28.031452 osd.148 10.217.72.11:6820/6487 25324 : cluster
> [WRN] slow request 30.960922 seconds old, received at 2016-04-28
> 13:28:57.070471: osd_op(client.24127500.0:2960618
> rbd_data.38178d8adeb4d.10f8 [set-alloc-hint object_size 8388608
> write_size 8388608,write 3194880~4096] 17.fb41a37c RETRY=1
> ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
> waiting for subops from 41,115
>
> from digging in the logs, it appears like some messages are being lost
> between the OSDs. this is what osd.41 sees:
> -
> 2016-04-28 13:28:56.233702 7f3b171e0700  1 -- 10.217.72.22:6802/3769 <==
> client.11172360 10.217.72.41:0/6031968 6 
> osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0d08
> [set-alloc-hint object_size 4194304 write_size 4194304,write 1835008~143360]
> 17.3df49873 RETRY=1 ack+ondisk+retry+write+redirected+known_if_redirected
> e159001) v5  236+0+143360 (781016428 0 3953649960) 0x1d551c00 con
> 0x1a78d9c0
> 2016-04-28 13:28:56.233983 7f3b49020700  1 -- 10.217.89.22:6825/313003769
> --> 10.217.89.18:6806/1010441 -- osd_repop(client.11172360.0:516946146 17.73
> 3df49873/rbd_data.36bfe359c4998.0d08/head//17 v 159001'26722799)
> v1 -- ?+46 0x1d6db200 con 0x21add440
> 2016-04-28 13:28:56.234017 7f3b49020700  1 -- 10.217.89.22:6825/313003769
> --> 10.217.89.11:6810/4543 -- osd_repop(client.11172360.0:516946146 17.73
> 3df49873/rbd_data.36bfe359c4998.0d08/head//17 v 159001'26722799)
> v1 -- ?+46 0x1d6dd000 con 0x21ada000
> 2016-04-28 13:28:56.234046 7f3b49020700  1 -- 10.217.89.22:6825/313003769
> --> 10.217.89.11:6812/43006487 -- osd_repop(client.11172360.0:516946146
> 17.73 3df49873/rbd_data.36bfe359c4998.0d08/head//17 v
> 159001'26722799) v1 -- ?+144137 0x14becc00 con 0xf2cd4a0
> 2016-04-28 13:28:56.243555 7f3b35976700  1 -- 10.217.89.22:6825/313003769
> <== osd.140 10.217.89.11:6810/4543 23 
> osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1
>  83+0+0 (494696391 0 0) 0x28ea7b00 con 0x21ada000
> 2016-04-28 13:28:56.257816 7f3b27d9b700  1 -- 10.217.89.22:6825/313003769
> <== osd.5 10.217.89.18:6806/1010441 35 
> osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1
>  83+0+0 (2393425574 0 0) 0xfe82fc0 con 0x21add440
>
>
> this, however is what osd.148 sees:
> -
> [ulhglive-root@ceph1 ~]# grep :516946146 /var/log/ceph/ceph-osd.148.log
> 2016-04-28 13:29:33.470156 7f195fcfc700  1 -- 10.217.72.11:6820/6487 <==
> client.11172360 10.217.72.41:0/6031968 460 
> osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0d08
> [set-alloc-hint object_size 4194304 write_size 4194304,write 1835008~143360]
> 17.3df49873 RETRY=2 ack+ondisk+retry+write+redirected+known_if_redirected
> e159002) v5  236+0+143360 (129493315 0 3953649960) 0x1edf2300 con
> 0x24dc0d60
>
> also, due to the ceph osd down commands, there is recovery that needs to
> happen for a PG shared between these OSDs that is never making any progress.
> it's probably due to whatever is causing the repops to fail.
>
> i did some tcpdump on both sides limiting things to the ip addresses and
> ports being used by these two OSDs and see packets flowing between the two
> osds. i attempted to have wireshark decode the actual ceph traffic but it
> was only able to get bits and pieces of the ceph protocol bits but at least
> for the moment i'm blaming that on the ceph dissector for wireshark. there
> aren't any dropped or error packets on any of the network interfaces
> involved.
>
> does anyone have any ideas of where to look next or other tips for this?
> we've put debug_ms and debug_osd 

[ceph-users] help troubleshooting some osd communication problems

2016-04-28 Thread Mike Lovell
there was a problem on one of the clusters i manage a couple weeks ago
where pairs of OSDs would wait indefinitely on subops from the other OSD in
the pair. we used a liberal dose of "ceph osd down ##" on the osds and
eventually things just sorted them out a couple days later.

it seems to have come back today and co-workers and i are stuck on trying
to figure out why this is happening. here are the details that i know.
currently 2 OSDs, 41 and 148, keep waiting on subops from each other
resulting in lines such as the following in ceph.log.

2016-04-28 13:29:26.875797 osd.41 10.217.72.22:6802/3769 56283 : cluster
[WRN] slow request 30.642736 seconds old, received at 2016-04-28
13:28:56.233001: osd_op(client.11172360.0:516946146
rbd_data.36bfe359c4998.0d08 [set-alloc-hint object_size 4194304
write_size 4194304,write 1835008~143360] 17.3df49873 RETRY=1
ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
waiting for subops from 5,140,148

2016-04-28 13:29:28.031452 osd.148 10.217.72.11:6820/6487 25324 : cluster
[WRN] slow request 30.960922 seconds old, received at 2016-04-28
13:28:57.070471: osd_op(client.24127500.0:2960618
rbd_data.38178d8adeb4d.10f8 [set-alloc-hint object_size 8388608
write_size 8388608,write 3194880~4096] 17.fb41a37c RETRY=1
ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
waiting for subops from 41,115

from digging in the logs, it appears like some messages are being lost
between the OSDs. this is what osd.41 sees:
-
2016-04-28 13:28:56.233702 7f3b171e0700  1 -- 10.217.72.22:6802/3769 <==
client.11172360 10.217.72.41:0/6031968 6 
osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0d08
[set-alloc-hint object_size 4194304 write_size 4194304,write
1835008~143360] 17.3df49873 RETRY=1
ack+ondisk+retry+write+redirected+known_if_redirected e159001) v5 
236+0+143360 (781016428 0 3953649960) 0x1d551c00 con 0x1a78d9c0
2016-04-28 13:28:56.233983 7f3b49020700  1 -- 10.217.89.22:6825/313003769
--> 10.217.89.18:6806/1010441 -- osd_repop(client.11172360.0:516946146
17.73 3df49873/rbd_data.36bfe359c4998.0d08/head//17 v
159001'26722799) v1 -- ?+46 0x1d6db200 con 0x21add440
2016-04-28 13:28:56.234017 7f3b49020700  1 -- 10.217.89.22:6825/313003769
--> 10.217.89.11:6810/4543 -- osd_repop(client.11172360.0:516946146 17.73
3df49873/rbd_data.36bfe359c4998.0d08/head//17 v
159001'26722799) v1 -- ?+46 0x1d6dd000 con 0x21ada000
2016-04-28 13:28:56.234046 7f3b49020700  1 -- 10.217.89.22:6825/313003769
--> 10.217.89.11:6812/43006487 -- osd_repop(client.11172360.0:516946146
17.73 3df49873/rbd_data.36bfe359c4998.0d08/head//17 v
159001'26722799) v1 -- ?+144137 0x14becc00 con 0xf2cd4a0
2016-04-28 13:28:56.243555 7f3b35976700  1 -- 10.217.89.22:6825/313003769
<== osd.140 10.217.89.11:6810/4543 23 
osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1
 83+0+0 (494696391 0 0) 0x28ea7b00 con 0x21ada000
2016-04-28 13:28:56.257816 7f3b27d9b700  1 -- 10.217.89.22:6825/313003769
<== osd.5 10.217.89.18:6806/1010441 35 
osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1
 83+0+0 (2393425574 0 0) 0xfe82fc0 con 0x21add440


this, however is what osd.148 sees:
-
[ulhglive-root@ceph1 ~]# grep :516946146 /var/log/ceph/ceph-osd.148.log
2016-04-28 13:29:33.470156 7f195fcfc700  1 -- 10.217.72.11:6820/6487 <==
client.11172360 10.217.72.41:0/6031968 460 
osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0d08
[set-alloc-hint object_size 4194304 write_size 4194304,write
1835008~143360] 17.3df49873 RETRY=2
ack+ondisk+retry+write+redirected+known_if_redirected e159002) v5 
236+0+143360 (129493315 0 3953649960) 0x1edf2300 con 0x24dc0d60

also, due to the ceph osd down commands, there is recovery that needs to
happen for a PG shared between these OSDs that is never making any
progress. it's probably due to whatever is causing the repops to fail.

i did some tcpdump on both sides, limiting things to the ip addresses and
ports being used by these two OSDs, and i see packets flowing between the
two osds. i attempted to have wireshark decode the actual ceph traffic but
it was only able to pick out bits and pieces of the ceph protocol, though
for the moment i'm blaming that on the ceph dissector for wireshark. there
aren't any dropped or error packets on any of the network interfaces
involved.
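
for reference, the captures were nothing more exotic than something along
these lines (the interface name and output path are placeholders; the
addresses are the cluster-network ips of the two osd hosts from the logs
above):

  # capture traffic between the two osd hosts for later analysis in wireshark
  tcpdump -i eth0 -s 0 -w /tmp/osd41-osd148.pcap 'host 10.217.89.22 and host 10.217.89.11'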

does anyone have any ideas on where to look next or other tips for this?
we've put debug_ms and debug_osd at 1/1 to get the bits of info mentioned
above. putting them at 20 probably isn't going to be helpful, so does anyone
have a suggestion for another level that might be useful? go figure that
this would happen while i'm at the openstack summit and keep me from paying
attention to some interesting presentations.

thanks in advance for any help.

mike