Re: Mon losing touch with OSDs

2013-03-08 Thread Chris Dunlop
On Fri, Mar 08, 2013 at 02:12:40PM +1100, Chris Dunlop wrote:
 On Thu, Feb 28, 2013 at 09:00:24PM -0800, Sage Weil wrote:
 On Fri, 1 Mar 2013, Chris Dunlop wrote:
 On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
 On Sat, 23 Feb 2013, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
 On Sat, 23 Feb 2013, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
 On Sat, 23 Feb 2013, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
 On Sat, 23 Feb 2013, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
 I just looked at the logs.  I can't tell what happened to cause that 10
 second delay.. strangely, messages were passing from 0 -> 1, but nothing
 came back from 1 -> 0 (although 1 was queuing, if not sending, them).
 
 Is there any way of telling where they were delayed, i.e. in osd.1's
 output queue or osd.0's input queue?
 
 Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
 generate a lot of logging, though.
 
 I really don't want to load the system with too much logging, but I'm
 happy modifying code...  Are there specific interesting debug outputs
 which I can modify so they're output under ms = 1?
 
 I'm basically interested in everything in writer() and write_message(), 
 and reader() and read_message()...
 
 Like this?
 
 Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect
 that this is the lion's share of what debug 20 will spam to the log, but
 hopefully the load is manageable!
 
 Good idea on the '2'. I'll get that installed and wait for it to happen 
 again.
 
 FYI...
 
 To avoid running out of disk space for the massive logs, I
 started using logrotate on the ceph logs every two hours, which
 does a 'service ceph reload' to re-open the log files.
 
 In the week since doing that I haven't seen any 'slow requests'
 at all (the load has stayed the same as before the change),
 which means the issue with the osds dropping out, then the
 system not recovering properly, also hasn't happened.
 
 That's a bit suspicious, no?
 
 I suspect the logging itself is changing the timing.  Let's wait and see 
 if we get lucky... 
 
 We got lucky...
 
 ceph-osd.0.log:
 2013-03-08 03:46:44.786682 7fcc62172700  1 -- 192.168.254.132:0/20298 --> 192.168.254.133:6801/23939 -- osd_ping(ping e815 stamp 2013-03-08 03:46:44.786679) v2 -- ?+0 0x765b180 con 0x6ab6160
   [no ping_reply logged, then later...]
 2013-03-08 03:46:56.211993 7fcc71190700 -1 osd.0 815 heartbeat_check: no reply from osd.1 since 2013-03-08 03:46:35.986327 (cutoff 2013-03-08 03:46:36.211992)

Bugger. I just realised that the cluster had come up without
the ms = 2 logging enabled, making these logs no more useful
than the previous ones for working out who dropped the missing
ping_reply.

Injected ms = 2 into the osds and mons, and added it to the
config files.
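
For the record, that amounts to roughly the following (a sketch of my setup;
adjust ids and paths to taste):

  # runtime, as used earlier in the thread (the mons got the same value):
  ceph osd tell '*' injectargs '--debug-ms 2'

  # persistent across restarts, in /etc/ceph/ceph.conf on each box:
  [mon]
          debug ms = 2
  [osd]
          debug ms = 2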

Sigh. And we're waiting again...

Chris.


Re: Mon losing touch with OSDs

2013-03-07 Thread Chris Dunlop
On Thu, Feb 28, 2013 at 09:00:24PM -0800, Sage Weil wrote:
 On Fri, 1 Mar 2013, Chris Dunlop wrote:
 On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
 On Sat, 23 Feb 2013, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
 On Sat, 23 Feb 2013, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
 On Sat, 23 Feb 2013, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
 On Sat, 23 Feb 2013, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
 I just looked at the logs.  I can't tell what happened to cause that 10
 second delay.. strangely, messages were passing from 0 -> 1, but nothing
 came back from 1 -> 0 (although 1 was queuing, if not sending, them).
 
 Is there any way of telling where they were delayed, i.e. in osd.1's
 output queue or osd.0's input queue?
 
 Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
 generate a lot of logging, though.
 
 I really don't want to load the system with too much logging, but I'm
 happy modifying code...  Are there specific interesting debug outputs
 which I can modify so they're output under ms = 1?
 
 I'm basically interested in everything in writer() and write_message(), 
 and reader() and read_message()...
 
 Like this?
 
 Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect
 that this is the lion's share of what debug 20 will spam to the log, but
 hopefully the load is manageable!
 
 Good idea on the '2'. I'll get that installed and wait for it to happen 
 again.
 
 FYI...
 
 To avoid running out of disk space for the massive logs, I
 started using logrotate on the ceph logs every two hours, which
 does a 'service ceph reload' to re-open the log files.
 
 In the week since doing that I haven't seen any 'slow requests'
 at all (the load has stayed the same as before the change),
 which means the issue with the osds dropping out, then the
 system not recovering properly, also hasn't happened.
 
 That's a bit suspicious, no?
 
 I suspect the logging itself is changing the timing.  Let's wait and see 
 if we get lucky... 

We got lucky...

ceph-osd.0.log:
2013-03-08 03:46:44.786682 7fcc62172700  1 -- 192.168.254.132:0/20298 --> 192.168.254.133:6801/23939 -- osd_ping(ping e815 stamp 2013-03-08 03:46:44.786679) v2 -- ?+0 0x765b180 con 0x6ab6160
  [no ping_reply logged, then later...]
2013-03-08 03:46:56.211993 7fcc71190700 -1 osd.0 815 heartbeat_check: no reply from osd.1 since 2013-03-08 03:46:35.986327 (cutoff 2013-03-08 03:46:36.211992)

ceph-osd.1.log:
2013-03-08 03:46:44.786848 7fe6f47a4700  1 -- 192.168.254.133:6801/23939 <== osd.0 192.168.254.132:0/20298 178549 ==== osd_ping(ping e815 stamp 2013-03-08 03:46:44.786679) v2 ==== 47+0+0 (1298645350 0 0) 0x98256c0 con 0x7bd2160
2013-03-08 03:46:44.786880 7fe6f47a4700  1 -- 192.168.254.133:6801/23939 --> 192.168.254.132:0/20298 -- osd_ping(ping_reply e815 stamp 2013-03-08 03:46:44.786679) v2 -- ?+0 0x29876c0 con 0x7bd2160

Interestingly, the matching ping_reply from osd.1 never appears in the
osd.0 log, in contrast to the previous incident upthread where the
missing ping replies were all seen in a rush (but after osd.1 had been
marked down).
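
For what it's worth, the heartbeat_check line seems to be simple arithmetic:
assuming the default osd heartbeat grace of 20 seconds, the cutoff is just
"now" minus the grace,

  cutoff = 03:46:56.211993 - 20s = 03:46:36.211992

and osd.1's last recorded reply (03:46:35.986327) falls just the wrong side
of it, hence osd.0 reporting it.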

The missing ping_reply caused osd.1 to get marked down; it then marked
itself up again a bit later ("map e818 wrongly marked me down"). However,
the system still hadn't recovered by 07:46:29, when a 'service ceph
restart' was done on the machine holding mon.b5 and osd.1, bringing things
back to life.

Before the restart:

# ceph -s
   health HEALTH_WARN 273 pgs peering; 2 pgs recovery_wait; 273 pgs stuck 
inactive; 576 pgs stuck unclean; recovery 43/293224 degraded (0.015%)
   monmap e9: 3 mons at 
{b2=10.200.63.130:6789/0,b4=10.200.63.132:6789/0,b5=10.200.63.133:6789/0}, 
election epoch 898, quorum 0,1,2 b2,b4,b5
   osdmap e825: 2 osds: 2 up, 2 in
pgmap v3545580: 576 pgs: 301 active, 2 active+recovery_wait, 273 peering; 
560 GB data, 1348 GB used, 2375 GB / 3724 GB avail; 43/293224 degraded (0.015%)
   mdsmap e1: 0/0/1 up

After the restart:

# ceph -s
   health HEALTH_WARN 19 pgs recovering; 24 pgs recovery_wait; 43 pgs stuck 
unclean; recovery 66/293226 degraded (0.023%)
   monmap e9: 3 mons at 
{b2=10.200.63.130:6789/0,b4=10.200.63.132:6789/0,b5=10.200.63.133:6789/0}, 
election epoch 902, quorum 0,1,2 b2,b4,b5
   osdmap e828: 2 osds: 2 up, 2 in
pgmap v3545603: 576 pgs: 533 active+clean, 24 active+recovery_wait, 19 
active+recovering; 560 GB data, 1348 GB used, 2375 GB / 3724 GB avail; 0B/s rd, 
8135KB/s wr, 224op/s; 66/293226 degraded (0.023%)
   mdsmap e1: 0/0/1 up

Logs covering 00:00 to 09:00 are in:

https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2

Cheers,

Chris

Re: Mon losing touch with OSDs

2013-02-28 Thread Chris Dunlop
On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
 On Sat, 23 Feb 2013, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
 On Sat, 23 Feb 2013, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
 On Sat, 23 Feb 2013, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
 On Sat, 23 Feb 2013, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
 I just looked at the logs.  I can't tell what happened to cause that 10
 second delay.. strangely, messages were passing from 0 -> 1, but nothing
 came back from 1 -> 0 (although 1 was queuing, if not sending, them).
 
 Is there any way of telling where they were delayed, i.e. in osd.1's
 output queue or osd.0's input queue?
 
 Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
 generate a lot of logging, though.
 
 I really don't want to load the system with too much logging, but I'm
 happy modifying code...  Are there specific interesting debug outputs
 which I can modify so they're output under ms = 1?
 
 I'm basically interested in everything in writer() and write_message(), 
 and reader() and read_message()...
 
 Like this?
 
 Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect
 that this is the lion's share of what debug 20 will spam to the log, but
 hopefully the load is manageable!
 
 Good idea on the '2'. I'll get that installed and wait for it to happen again.

FYI...

To avoid running out of disk space for the massive logs, I
started using logrotate on the ceph logs every two hours, which
does a 'service ceph reload' to re-open the log files.
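
Roughly like this, in case anyone wants to do the same: a cron entry forcing
logrotate against a dedicated config (the file names are just what I happen
to use):

  # /etc/cron.d/ceph-logrotate-2h
  0 */2 * * * root logrotate -f /etc/logrotate.d/ceph-2hourly

  # /etc/logrotate.d/ceph-2hourly
  /var/log/ceph/*.log {
          rotate 24
          compress
          missingok
          notifempty
          sharedscripts
          postrotate
                  service ceph reload >/dev/null 2>&1 || true
          endscript
  }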

In the week since doing that I haven't seen any 'slow requests'
at all (the load has stayed the same as before the change),
which means the issue with the osds dropping out, then the
system not recovering properly, also hasn't happened.

That's a bit suspicious, no?

I've now put the log dirs on each machine on their own 2TB
partition and reverted back to the default daily rotates.

And once more we're waiting... Godot, is that you?


Chris


Re: Mon losing touch with OSDs

2013-02-28 Thread Sage Weil
On Fri, 1 Mar 2013, Chris Dunlop wrote:
 On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
  On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
  On Sat, 23 Feb 2013, Chris Dunlop wrote:
  On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
  On Sat, 23 Feb 2013, Chris Dunlop wrote:
  On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
  On Sat, 23 Feb 2013, Chris Dunlop wrote:
  On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
  On Sat, 23 Feb 2013, Chris Dunlop wrote:
  On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
  I just looked at the logs.  I can't tell what happened to cause that 10
  second delay.. strangely, messages were passing from 0 -> 1, but nothing
  came back from 1 -> 0 (although 1 was queuing, if not sending, them).
  
  Is there any way of telling where they were delayed, i.e. in osd.1's
  output queue or osd.0's input queue?
  
  Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
  generate a lot of logging, though.
  
  I really don't want to load the system with too much logging, but I'm
  happy modifying code...  Are there specific interesting debug outputs
  which I can modify so they're output under ms = 1?
  
  I'm basically interested in everything in writer() and write_message(), 
  and reader() and read_message()...
  
  Like this?
  
  Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect
  that this is the lion's share of what debug 20 will spam to the log, but
  hopefully the load is manageable!
  
  Good idea on the '2'. I'll get that installed and wait for it to happen 
  again.
 
 FYI...
 
 To avoid running out of disk space for the massive logs, I
 started using logrotate on the ceph logs every two hours, which
 does a 'service ceph reload' to re-open the log files.
 
 In the week since doing that I haven't seen any 'slow requests'
 at all (the load has stayed the same as before the change),
 which means the issue with the osds dropping out, then the
 system not recovering properly, also hasn't happened.
 
 That's a bit suspicious, no?

I suspect the logging itself is changing the timing.  Let's wait and see 
if we get lucky... 

sage

 
 I've now put the log dirs on each machine on their own 2TB
 partition and reverted back to the default daily rotates.
 
 And once more we're waiting... Godot, is that you?
 
 
 Chris


Re: Mon losing touch with OSDs

2013-02-22 Thread Sage Weil
On Fri, 22 Feb 2013, Chris Dunlop wrote:
 G'day,
 
 It seems there might be two issues here: the first being the delayed
 receipt of echo replies causing a seemingly otherwise healthy osd to be
 marked down, the second being the lack of recovery once the downed osd is
 recognised as up again.
 
 Is it worth my opening tracker reports for this, just so it doesn't get
 lost?

I just looked at the logs.  I can't tell what happened to cause that 10
second delay.. strangely, messages were passing from 0 -> 1, but nothing
came back from 1 -> 0 (although 1 was queuing, if not sending, them).

The strange bit is that after this, you get those indefinite hangs.  From 
the logs it looks like the OSD rebound to an old port that was previously 
open from osd.0.. probably from way back.  Do you have logs going further 
back than what you posted?  Also, do you have osdmaps, say, 750 and 
onward?  It looks like there is a bug in the connection handling code 
(that is unrelated to the delay above).

Thanks!
sage


 
 Cheers,
 
 Chris
 
 On Wed, Feb 20, 2013 at 01:07:03PM +1100, Chris Dunlop wrote:
  On Tue, Feb 19, 2013 at 02:02:03PM +1100, Chris Dunlop wrote:
  On Sun, Feb 17, 2013 at 05:44:29PM -0800, Sage Weil wrote:
  On Mon, 18 Feb 2013, Chris Dunlop wrote:
  On Sat, Feb 16, 2013 at 09:05:21AM +1100, Chris Dunlop wrote:
  On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
  On Fri, 15 Feb 2013, Chris Dunlop wrote:
  In an otherwise seemingly healthy cluster (ceph 0.56.2), what might 
  cause the
  mons to lose touch with the osds?
  
  Can you enable 'debug ms = 1' on the mons and leave them that way, in 
  the 
  hopes that this happens again?  It will give us more information to go 
  on.
  
  Debug turned on.
  
  We haven't experienced the cluster losing touch with the osds completely
  since upgrading from 0.56.2 to 0.56.3, but we did lose touch with osd.1
  for a few seconds before it recovered. See below for logs (reminder: 3
  boxes, b2 is mon-only, b4 is mon+osd.0, b5 is mon+osd.1).
  
  Hrm, I don't see any obvious clues.  You could enable 'debug ms = 1' on 
  the osds as well.  That will give us more to go on if/when it happens 
  again, and should not affect performance significantly.
  
  Done: ceph osd tell '*' injectargs '--debug-ms 1'
  
  Now to wait for it to happen again.
  
  OK, we got it again. Full logs covering the incident available at:
  
  https://www.dropbox.com/s/kguzwyjfglv3ypl/ceph-logs.zip
  
  Archive:  /tmp/ceph-logs.zip
   Length   MethodSize  CmprDateTime   CRC-32   Name
    --  ---  -- -   
 11492  Defl:X 1186  90% 2013-02-20 12:04 c0cba4ae  ceph-mon.b2.log
   1270789  Defl:X89278  93% 2013-02-20 12:00 2208d035  ceph-mon.b4.log
   1375858  Defl:X   104025  92% 2013-02-20 12:05 c64c1dad  ceph-mon.b5.log
   2020042  Defl:X   215000  89% 2013-02-20 10:40 f74ae4ca  ceph-osd.0.log
   2075512  Defl:X   224098  89% 2013-02-20 12:05 b454d2ec  ceph-osd.1.log
154938  Defl:X12989  92% 2013-02-20 12:04 d2729b05  ceph.log
    ---  ------
   6908631   646576  91%6 files
  
  My naive analysis, based on the log extracts below (best viewed on a wide
  screen!)...
  
  Osd.0 starts hearing much-delayed ping_replies from osd.1 and tells the mon,
  which marks osd.1 down.
  
  However the whole time, the osd.1 log indicates that it's receiving and
  responding to each ping from osd.0 in a timely fashion. In contrast, the
  osd.0 log indicates it isn't seeing the osd.1 replies for a while, then
  sees them all arrive in a flurry, until they're delayed enough to cause
  osd.0 to tell the mon.
  
  During the time osd.0 is not seeing the osd.1 ping_replies, there's other
  traffic (osd_op, osd_sub_op, osd_sub_op_reply etc.) between osd.0 and
  osd.1, indicating that it's not a network problem.
  
  The load on both osds during this period was 90% idle and 1% iow.
  
  Is this pointing to osd.0 experiencing some kind of scheduling or priority
  starvation on the ping thread (assuming the ping is in its own thread)?
  
  The next odd thing is that, although the osds are both back by 04:38:50 (2
  osds: 2 up, 2 in), the system still wasn't working (see the disk stats for
  both osd.0 and osd.1) and didn't recover until ceph (mon + osd) was 
  restarted
  on one of the boxes at around 05:50 (not shown in the logs, but full logs
  available if needed).
  
  Prior to the restart:
  
  # ceph health
  HEALTH_WARN 281 pgs peering; 281 pgs stuck inactive; 576 pgs stuck unclean
  
  (Sorry, once again didn't get a 'ceph -s' prior to the restart.)
  
  Chris.
  
  --
  ceph.log
  --
  2013-02-20 04:37:51.074128 mon.0 10.200.63.130:6789/0 120771 : [INF] pgmap 
  v3000932: 576 pgs: 575 

Re: Mon losing touch with OSDs

2013-02-22 Thread Sage Weil
On Sat, 23 Feb 2013, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
  On Fri, 22 Feb 2013, Chris Dunlop wrote:
  G'day,
  
  It seems there might be two issues here: the first being the delayed
  receipt of echo replies causing a seemingly otherwise healthy osd to be
  marked down, the second being the lack of recovery once the downed osd is
  recognised as up again.
  
  Is it worth my opening tracker reports for this, just so it doesn't get
  lost?
  
  I just looked at the logs.  I can't tell what happened to cause that 10
  second delay.. strangely, messages were passing from 0 -> 1, but nothing
  came back from 1 -> 0 (although 1 was queuing, if not sending, them).
  
  The strange bit is that after this, you get those indefinite hangs.  From 
  the logs it looks like the OSD rebound to an old port that was previously 
  open from osd.0.. probably from way back.  Do you have logs going further 
  back than what you posted?  Also, do you have osdmaps, say, 750 and 
  onward?  It looks like there is a bug in the connection handling code 
  (that is unrelated to the delay above).
 
 Currently uploading logs starting midnight to dropbox, will send
 links when they're up.
 
 How would I retrieve the interesting osdmaps?

They are in the monitor data directory, in the osdmap_full dir.
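
Something along these lines should do it (the path is whatever 'mon data'
points at in your ceph.conf, and the per-epoch file naming below is from
memory, so adjust to what you actually see there):

  # on one of the mon hosts:
  cd /var/lib/ceph/mon/ceph-b4          # or wherever 'mon data' points
  tar czf ~/osdmaps.tgz osdmap_full

  # individual epochs can also be inspected locally with osdmaptool, e.g.:
  osdmaptool --print osdmap_full/770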

Thanks!
sage


Re: Mon losing touch with OSDs

2013-02-22 Thread Chris Dunlop
On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
 On Sat, 23 Feb 2013, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
 On Fri, 22 Feb 2013, Chris Dunlop wrote:
 G'day,
 
 It seems there might be two issues here: the first being the delayed
 receipt of echo replies causing a seemingly otherwise healthy osd to be
 marked down, the second being the lack of recovery once the downed osd is
 recognised as up again.
 
 Is it worth my opening tracker reports for this, just so it doesn't get
 lost?
 
 I just looked at the logs.  I can't tell what happened to cause that 10
 second delay.. strangely, messages were passing from 0 -> 1, but nothing
 came back from 1 -> 0 (although 1 was queuing, if not sending, them).

Is there any way of telling where they were delayed, i.e. in osd.1's output
queue or osd.0's input queue?

 The strange bit is that after this, you get those indefinite hangs.  From 
 the logs it looks like the OSD rebound to an old port that was previously 
 open from osd.0.. probably from way back.  Do you have logs going further 
 back than what you posted?  Also, do you have osdmaps, say, 750 and 
 onward?  It looks like there is a bug in the connection handling code 
 (that is unrelated to the delay above).
 
 Currently uploading logs starting midnight to dropbox, will send
 links when they're up.
 
 How would I retrieve the interesting osdmaps?
 
 They are in the monitor data directory, in the osdmap_full dir.

Logs from midnight onwards and osdmaps are in this folder:

https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2

  ceph-mon.b2.log.bz2
  ceph-mon.b4.log.bz2
  ceph-mon.b5.log.bz2
  ceph-osd.0.log.bz2
  ceph-osd.1.log.bz2 (still uploading as I type)
  osdmaps.zip

Cheers,

Chris


Re: Mon losing touch with OSDs

2013-02-22 Thread Sage Weil
On Sat, 23 Feb 2013, Chris Dunlop wrote:
 On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
  On Sat, 23 Feb 2013, Chris Dunlop wrote:
  On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
  On Fri, 22 Feb 2013, Chris Dunlop wrote:
  G'day,
  
  It seems there might be two issues here: the first being the delayed
  receipt of echo replies causing a seemingly otherwise healthy osd to be
  marked down, the second being the lack of recovery once the downed osd is
  recognised as up again.
  
  Is it worth my opening tracker reports for this, just so it doesn't get
  lost?
  
  I just looked at the logs.  I can't tell what happened to cause that 10
  second delay.. strangely, messages were passing from 0 -> 1, but nothing
  came back from 1 -> 0 (although 1 was queuing, if not sending, them).
 
 Is there any way of telling where they were delayed, i.e. in osd.1's output
 queue or osd.0's input queue?

Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
generate a lot of logging, though.
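
You can flip it at runtime with the same injectargs trick you used for
debug-ms 1, and drop it back down once you've caught an incident, e.g.:

  ceph osd tell '*' injectargs '--debug-ms 20'
  # ... wait for it to reproduce, then:
  ceph osd tell '*' injectargs '--debug-ms 1'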

  The strange bit is that after this, you get those indefinite hangs.  From 
  the logs it looks like the OSD rebound to an old port that was previously 
  open from osd.0.. probably from way back.  Do you have logs going further 
  back than what you posted?  Also, do you have osdmaps, say, 750 and 
  onward?  It looks like there is a bug in the connection handling code 
  (that is unrelated to the delay above).
  
  Currently uploading logs starting midnight to dropbox, will send
  links when they're up.
  
  How would I retrieve the interesting osdmaps?
  
  They are in the monitor data directory, in the osdmap_full dir.
 
 Logs from midnight onwards and osdmaps are in this folder:
 
 https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2
 
   ceph-mon.b2.log.bz2
   ceph-mon.b4.log.bz2
   ceph-mon.b5.log.bz2
   ceph-osd.0.log.bz2
   ceph-osd.1.log.bz2 (still uploading as I type)
   osdmaps.zip

I'll take a look...

sage


Re: Mon losing touch with OSDs

2013-02-19 Thread Chris Dunlop
On Tue, Feb 19, 2013 at 02:02:03PM +1100, Chris Dunlop wrote:
 On Sun, Feb 17, 2013 at 05:44:29PM -0800, Sage Weil wrote:
 On Mon, 18 Feb 2013, Chris Dunlop wrote:
 On Sat, Feb 16, 2013 at 09:05:21AM +1100, Chris Dunlop wrote:
 On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
 On Fri, 15 Feb 2013, Chris Dunlop wrote:
 In an otherwise seemingly healthy cluster (ceph 0.56.2), what might 
 cause the
 mons to lose touch with the osds?
 
 Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
 hopes that this happens again?  It will give us more information to go on.
 
 Debug turned on.
 
 We haven't experienced the cluster losing touch with the osds completely
 since upgrading from 0.56.2 to 0.56.3, but we did lose touch with osd.1
 for a few seconds before it recovered. See below for logs (reminder: 3
 boxes, b2 is mon-only, b4 is mon+osd.0, b5 is mon+osd.1).
 
 Hrm, I don't see any obvious clues.  You could enable 'debug ms = 1' on 
 the osds as well.  That will give us more to go on if/when it happens 
 again, and should not affect performance significantly.
 
 Done: ceph osd tell '*' injectargs '--debug-ms 1'
 
 Now to wait for it to happen again.

OK, we got it again. Full logs covering the incident available at:

https://www.dropbox.com/s/kguzwyjfglv3ypl/ceph-logs.zip

Archive:  /tmp/ceph-logs.zip
 Length   MethodSize  CmprDateTime   CRC-32   Name
  --  ---  -- -   
   11492  Defl:X 1186  90% 2013-02-20 12:04 c0cba4ae  ceph-mon.b2.log
 1270789  Defl:X89278  93% 2013-02-20 12:00 2208d035  ceph-mon.b4.log
 1375858  Defl:X   104025  92% 2013-02-20 12:05 c64c1dad  ceph-mon.b5.log
 2020042  Defl:X   215000  89% 2013-02-20 10:40 f74ae4ca  ceph-osd.0.log
 2075512  Defl:X   224098  89% 2013-02-20 12:05 b454d2ec  ceph-osd.1.log
  154938  Defl:X12989  92% 2013-02-20 12:04 d2729b05  ceph.log
  ---  ------
 6908631   646576  91%6 files

My naive analysis, based on the log extracts below (best viewed on a wide
screen!)...

Osd.0 starts hearing much-delayed ping_replies from osd.1 and tells the mon,
which marks osd.1 down.

However the whole time, the osd.1 log indicates that it's receiving and
responding to each ping from osd.0 in a timely fashion. In contrast, the osd.0
log indicates it isn't seeing the osd.1 replies for a while, then sees them all
arrive in a flurry, until they're delayed enough to cause osd.0 to tell the
mon.

During the time osd.0 is not seeing the osd.1 ping_replies, there's other
traffic (osd_op, osd_sub_op, osd_sub_op_reply etc.) between osd.0 and osd.1,
indicating that it's not a network problem.

The load on both osds during this period was 90% idle and 1% iow.

Is this pointing to osd.0 experiencing some kind of scheduling or priority
starvation on the ping thread (assuming the ping is in its own thread)?

The next odd thing is that, although the osds are both back by 04:38:50 (2
osds: 2 up, 2 in), the system still wasn't working (see the disk stats for
both osd.0 and osd.1) and didn't recover until ceph (mon + osd) was restarted
on one of the boxes at around 05:50 (not shown in the logs, but full logs
available if needed).

Prior to the restart:

# ceph health
HEALTH_WARN 281 pgs peering; 281 pgs stuck inactive; 576 pgs stuck unclean

(Sorry, once again didn't get a 'ceph -s' prior to the restart.)
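
Next time I'll try to grab a fuller snapshot before restarting anything,
something like:

  ceph -s
  ceph health detail
  ceph pg dump_stuck inactive
  ceph pg dump_stuck unclean

so the stuck pgs and their states are on record.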

Chris.

--
ceph.log
--
2013-02-20 04:37:51.074128 mon.0 10.200.63.130:6789/0 120771 : [INF] pgmap 
v3000932: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 
GB used, 2881 GB / 3724 GB avail
2013-02-20 04:37:53.541471 mon.0 10.200.63.130:6789/0 120772 : [INF] pgmap 
v3000933: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 
GB used, 2881 GB / 3724 GB avail
2013-02-20 04:37:56.063059 mon.0 10.200.63.130:6789/0 120773 : [INF] pgmap 
v3000934: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 
GB used, 2881 GB / 3724 GB avail
2013-02-20 04:37:58.532763 mon.0 10.200.63.130:6789/0 120774 : [INF] pgmap 
v3000935: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 
GB used, 2881 GB / 3724 GB avail
 [osd.0 sees delayed ping_replies from here]
2013-02-20 04:38:01.057939 mon.0 10.200.63.130:6789/0 120775 : [INF] pgmap 
v3000936: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 
GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:03.541404 mon.0 10.200.63.130:6789/0 120776 : [INF] pgmap 
v3000937: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 
GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:06.133004 mon.0 10.200.63.130:6789/0 120777 : [INF] pgmap 
v3000938: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 
GB used, 2881 GB / 3724 GB avail

Re: Mon losing touch with OSDs

2013-02-18 Thread Chris Dunlop
On Sun, Feb 17, 2013 at 05:44:29PM -0800, Sage Weil wrote:
 On Mon, 18 Feb 2013, Chris Dunlop wrote:
 On Sat, Feb 16, 2013 at 09:05:21AM +1100, Chris Dunlop wrote:
 On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
 On Fri, 15 Feb 2013, Chris Dunlop wrote:
 In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause 
 the
 mons to lose touch with the osds?
 
 Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
 hopes that this happens again?  It will give us more information to go on.
 
 Debug turned on.
 
 We haven't experienced the cluster losing touch with the osds completely
 since upgrading from 0.56.2 to 0.56.3, but we did lose touch with osd.1
 for a few seconds before it recovered. See below for logs (reminder: 3
 boxes, b2 is mon-only, b4 is mon+osd.0, b5 is mon+osd.1).
 
 Hrm, I don't see any obvious clues.  You could enable 'debug ms = 1' on 
 the osds as well.  That will give us more to go on if/when it happens 
 again, and should not affect performance significantly.

Done: ceph osd tell '*' injectargs '--debug-ms 1'

Now to wait for it to happen again.

Chris


Re: Mon losing touch with OSDs

2013-02-15 Thread Chris Dunlop
G'day Sage,

On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
 On Fri, 15 Feb 2013, Chris Dunlop wrote:
 In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
 mons to lose touch with the osds?
 
 Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
 hopes that this happens again?  It will give us more information to go on.

Debug turned on.

 Perhaps the mon lost osd.1 because it was too slow, but that hadn't happened
 in any of the many previous 'slow requests' instances, and the timing doesn't
 look quite right: the mon complains it hasn't heard from osd.0 since 20:11:19,
 but the osd.0 log shows no problems at all, then the mon complains about not
 having heard from osd.1 since 20:11:21, whereas the first indication of
 trouble on osd.1 was the request from 20:26:20 not being processed in a
 timely fashion.
 
 My guess is the above was a side-effect of osd.0 being marked out.   On 
 0.56.2 there is some strange peering workqueue lagginess that could 
 potentially contribute as well.  I recommend moving to 0.56.3.

Upgraded to 0.56.3.

 Trying to manually set the osds in (e.g. ceph osd in 0) didn't help, nor did
 restarting the osds ('service ceph restart osd' on each osd host).
 
 The immediate issue was resolved by restarting ceph completely on one of the
 mon/osd hosts (service ceph restart). Possibly a restart of just the mon 
 would
 have been sufficient.
 
 Did you notice that the osds you restarted didn't immediately mark 
 themselves in?  Again, it could be explained by the peering wq issue, 
 especially if there are pools in your cluster that are not getting any IO.

Sorry, no. I was kicking myself later for losing the 'ceph -s' output 
when I killed that terminal session but in the heat of the moment...

I can't see anything in the logs from that time (with no debugging enabled)
about the osds marking themselves in, but I'm on my iPad at the moment so I
could easily have missed it. Should that info be in the logs somewhere?

There's certainly unused pools: we're only using the rbd pool and so the
default data and metadata pools are unused.

Thanks for your attention!

Cheers,

Chris


Mon losing touch with OSDs

2013-02-14 Thread Chris Dunlop
G'day,

In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
mons to lose touch with the osds?

I imagine a network glitch could cause it, but I can't see any issues in any
other system logs on any of the machines on the network.

Having (mostly?) resolved my previous slow requests issue
(http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13076) at around
13:45, there were no problems until the mon lost osd.0 at 20:26 and lost osd.1
5 seconds later:

ceph-mon.b2.log:
2013-02-14 20:11:19.892060 7fa48d4f8700  0 log [INF] : pgmap v2822096: 576 pgs: 
576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:11:21.719513 7fa48d4f8700  0 log [INF] : pgmap v2822097: 576 pgs: 
576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:26:20.656162 7fa48dcf9700 -1 mon.b2@0(leader).osd e768 no osd or 
pg stats from osd.0 since 2013-02-14 20:11:19.720812, 900.935345 seconds ago.  
marking down
2013-02-14 20:26:20.780244 7fa48d4f8700  1 mon.b2@0(leader).osd e769 e769: 2 
osds: 1 up, 2 in
2013-02-14 20:26:20.837123 7fa48d4f8700  0 log [INF] : osdmap e769: 2 osds: 1 
up, 2 in
2013-02-14 20:26:20.947523 7fa48d4f8700  0 log [INF] : pgmap v2822098: 576 pgs: 
304 active+clean, 272 stale+active+clean; 407 GB data, 835 GB used, 2889 GB / 
3724 GB avail
2013-02-14 20:26:25.709341 7fa48dcf9700 -1 mon.b2@0(leader).osd e769 no osd or 
pg stats from osd.1 since 2013-02-14 20:11:21.523741, 904.185596 seconds ago.  
marking down
2013-02-14 20:26:25.822773 7fa48d4f8700  1 mon.b2@0(leader).osd e770 e770: 2 
osds: 0 up, 2 in
2013-02-14 20:26:25.863493 7fa48d4f8700  0 log [INF] : osdmap e770: 2 osds: 0 
up, 2 in
2013-02-14 20:26:25.954799 7fa48d4f8700  0 log [INF] : pgmap v2822099: 576 pgs: 
576 stale+active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:31:30.772360 7fa48dcf9700  0 log [INF] : osd.1 out (down for 
304.933403)
2013-02-14 20:31:30.893521 7fa48d4f8700  1 mon.b2@0(leader).osd e771 e771: 2 
osds: 0 up, 1 in
2013-02-14 20:31:30.933439 7fa48d4f8700  0 log [INF] : osdmap e771: 2 osds: 0 
up, 1 in
2013-02-14 20:31:31.055408 7fa48d4f8700  0 log [INF] : pgmap v2822100: 576 pgs: 
576 stale+active+clean; 407 GB data, 417 GB used, 1444 GB / 1862 GB avail
2013-02-14 20:35:05.831221 7fa48dcf9700  0 log [INF] : osd.0 out (down for 
525.033581)
2013-02-14 20:35:05.989724 7fa48d4f8700  1 mon.b2@0(leader).osd e772 e772: 2 
osds: 0 up, 0 in
2013-02-14 20:35:06.031409 7fa48d4f8700  0 log [INF] : osdmap e772: 2 osds: 0 
up, 0 in
2013-02-14 20:35:06.129046 7fa48d4f8700  0 log [INF] : pgmap v2822101: 576 pgs: 
576 stale+active+clean; 407 GB data, 0 KB used, 0 KB / 0 KB avail

The other 2 mons both have messages like this in their logs, starting at around 
20:12:

2013-02-14 20:12:26.534977 7f2092b86700  0 -- 10.200.63.133:6789/0 >> 10.200.63.133:6800/6466 pipe(0xade76500 sd=22 :6789 s=0 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)
2013-02-14 20:13:24.741092 7f2092d88700  0 -- 10.200.63.133:6789/0 >> 10.200.63.132:6800/2456 pipe(0x9f8b7180 sd=28 :6789 s=0 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)
2013-02-14 20:13:56.551908 7f2090560700  0 -- 10.200.63.133:6789/0 >> 10.200.63.133:6800/6466 pipe(0x9f8b6000 sd=41 :6789 s=0 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)
2013-02-14 20:14:24.752356 7f209035e700  0 -- 10.200.63.133:6789/0 >> 10.200.63.132:6800/2456 pipe(0x9f8b6500 sd=42 :6789 s=0 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)

(10.200.63.132 is mon.b4/osd.0, 10.200.63.133 is mon.b5/osd.1)

...although Greg Farnum indicates these messages are normal:

http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/5989/focus=5993

Osd.0 doesn't show any signs of distress at all:

ceph-osd.0.log:
2013-02-14 20:00:10.280601 7ffceb012700  0 log [INF] : 2.7e scrub ok
2013-02-14 20:14:19.923490 7ffceb012700  0 log [INF] : 2.5b scrub ok
2013-02-14 20:14:50.571980 7ffceb012700  0 log [INF] : 2.7b scrub ok
2013-02-14 20:17:48.475129 7ffceb012700  0 log [INF] : 2.7d scrub ok
2013-02-14 20:28:22.601594 7ffceb012700  0 log [INF] : 2.91 scrub ok
2013-02-14 20:28:32.839278 7ffceb012700  0 log [INF] : 2.92 scrub ok
2013-02-14 20:28:46.992226 7ffceb012700  0 log [INF] : 2.93 scrub ok
2013-02-14 20:29:12.330668 7ffceb012700  0 log [INF] : 2.95 scrub ok

...although osd.1 started seeing problems around this time:

ceph-osd.1.log:
2013-02-14 20:03:11.413352 7fd1d8f0a700  0 log [INF] : 2.23 scrub ok
2013-02-14 20:26:51.601425 7fd1e6f26700  0 log [WRN] : 6 slow requests, 6 
included below; oldest blocked for  30.750063 secs
2013-02-14 20:26:51.601432 7fd1e6f26700  0 log [WRN] : slow request 30.750063 
seconds old, received at 2013-02-14 20:26:20.851304: osd_op(client.9983.0:28173 
xxx.rbd [watch 1~0] 2.10089424) v4 currently wait for new map
2013-02-14 20:26:51.601437 7fd1e6f26700  0 log [WRN] : slow request 30.749947 
seconds old, 

Re: Mon losing touch with OSDs

2013-02-14 Thread Sage Weil
Hi Chris,

On Fri, 15 Feb 2013, Chris Dunlop wrote:
 G'day,
 
 In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
 mons to lose touch with the osds?
 
 I imagine a network glitch could cause it, but I can't see any issues in any
 other system logs on any of the machines on the network.
 
 Having (mostly?) resolved my previous slow requests issue
 (http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13076) at around
 13:45, there were no problems until the mon lost osd.0 at 20:26 and lost osd.1
 5 seconds later:
 
 ceph-mon.b2.log:
 2013-02-14 20:11:19.892060 7fa48d4f8700  0 log [INF] : pgmap v2822096: 576 
 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
 2013-02-14 20:11:21.719513 7fa48d4f8700  0 log [INF] : pgmap v2822097: 576 
 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
 2013-02-14 20:26:20.656162 7fa48dcf9700 -1 mon.b2@0(leader).osd e768 no osd 
 or pg stats from osd.0 since 2013-02-14 20:11:19.720812, 900.935345 seconds 
 ago.  marking down

There is a safety check that if the osd doesn't check in for a long period 
of time we assume it is dead.  But it seems as though that shouldn't 
happen, since osd.0 has some PGs assigned and is scrubbing away.
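
(The ~900 seconds in that log line matches the default for that check, which
if memory serves is 'mon osd report timeout' = 900s. In other words the mon
had not received any osd/pg stats from osd.0 for

  20:26:20.656162 - 20:11:19.720812 = 900.935345s > 900s

and so declared it down.)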

Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
hopes that this happens again?  It will give us more information to go on.

 ...although osd.1 started seeing problems around this time:
 
 ceph-osd.1.log:
 2013-02-14 20:03:11.413352 7fd1d8f0a700  0 log [INF] : 2.23 scrub ok
 2013-02-14 20:26:51.601425 7fd1e6f26700  0 log [WRN] : 6 slow requests, 6 
 included below; oldest blocked for  30.750063 secs
 2013-02-14 20:26:51.601432 7fd1e6f26700  0 log [WRN] : slow request 30.750063 
 seconds old, received at 2013-02-14 20:26:20.851304: 
 osd_op(client.9983.0:28173 xxx.rbd [watch 1~0] 2.10089424) v4 currently wait 
 for new map
 2013-02-14 20:26:51.601437 7fd1e6f26700  0 log [WRN] : slow request 30.749947 
 seconds old, received at 2013-02-14 20:26:20.851420: 
 osd_op(client.10001.0:618473 yy.rbd [watch 1~0] 2.3854277a) v4 currently 
 wait for new map
 2013-02-14 20:26:51.601440 7fd1e6f26700  0 log [WRN] : slow request 30.749938 
 seconds old, received at 2013-02-14 20:26:20.851429: 
 osd_op(client.9998.0:39716 zz.rbd [watch 1~0] 2.71731007) v4 currently 
 wait for new map
 2013-02-14 20:26:51.601442 7fd1e6f26700  0 log [WRN] : slow request 30.749907 
 seconds old, received at 2013-02-14 20:26:20.851460: 
 osd_op(client.10007.0:59572 aa.rbd [watch 1~0] 2.320eebb8) v4 currently 
 wait for new map
 2013-02-14 20:26:51.601445 7fd1e6f26700  0 log [WRN] : slow request 30.749630 
 seconds old, received at 2013-02-14 20:26:20.851737: 
 osd_op(client.9980.0:86883 bb.rbd [watch 1~0] 2.ab9b579f) v4 currently 
 wait for new map
 
 Perhaps the mon lost osd.1 because it was too slow, but that hadn't happened
 in any of the many previous 'slow requests' instances, and the timing doesn't
 look quite right: the mon complains it hasn't heard from osd.0 since 20:11:19,
 but the osd.0 log shows no problems at all, then the mon complains about not
 having heard from osd.1 since 20:11:21, whereas the first indication of
 trouble on osd.1 was the request from 20:26:20 not being processed in a
 timely fashion.

My guess is the above was a side-effect of osd.0 being marked out.   On 
0.56.2 there is some strange peering workqueue lagginess that could 
potentially contribute as well.  I recommend moving to 0.56.3.

 Not knowing enough about how the various pieces of ceph talk to each other
 makes it difficult to distinguish cause and effect!
 
 Trying to manually set the osds in (e.g. ceph osd in 0) didn't help, nor did
 restarting the osds ('service ceph restart osd' on each osd host).
 
 The immediate issue was resolved by restarting ceph completely on one of the
 mon/osd hosts (service ceph restart). Possibly a restart of just the mon would
 have been sufficient.

Did you notice that the osds you restarted didn't immediately mark 
themselves in?  Again, it could be explained by the peering wq issue, 
especially if there are pools in your cluster that are not getting any IO.

sage