Re: Mon losing touch with OSDs
On Fri, Mar 08, 2013 at 02:12:40PM +1100, Chris Dunlop wrote:
On Thu, Feb 28, 2013 at 09:00:24PM -0800, Sage Weil wrote:
On Fri, 1 Mar 2013, Chris Dunlop wrote:
On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
On Sat, 23 Feb 2013, Chris Dunlop wrote:
On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
On Sat, 23 Feb 2013, Chris Dunlop wrote:
On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
On Sat, 23 Feb 2013, Chris Dunlop wrote:
On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
On Sat, 23 Feb 2013, Chris Dunlop wrote:
On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:

I just looked at the logs. I can't tell what happened to cause that 10 second delay.. strangely, messages were passing from 0 -> 1, but nothing came back from 1 -> 0 (although 1 was queuing, if not sending, them).

Is there any way of telling where they were delayed, i.e. in the 1's output queue or 0's input queue?

Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will generate a lot of logging, though.

I really don't want to load the system with too much logging, but I'm happy modifying code... Are there specific interesting debug outputs which I can modify so they're output under ms = 1?

I'm basically interested in everything in writer() and write_message(), and reader() and read_message()... Like this?

Yeah. You could do 2 instead of 1 so you can turn it down. I suspect that this is the lion's share of what debug 20 will spam to the log, but hopefully the load is manageable!

Good idea on the '2'. I'll get that installed and wait for it to happen again.

FYI... To avoid running out of disk space for the massive logs, I started using logrotate on the ceph logs every two hours, which does a 'service ceph reload' to re-open the log files.
In the week since doing that I haven't seen any 'slow requests' at all (the load has stayed the same as before the change), which means the issue with the osds dropping out, then the system not recovering properly, also hasn't happened. That's a bit suspicious, no?

I suspect the logging itself is changing the timing. Let's wait and see if we get lucky...

We got lucky...

ceph-osd.0.log:

2013-03-08 03:46:44.786682 7fcc62172700  1 -- 192.168.254.132:0/20298 --> 192.168.254.133:6801/23939 -- osd_ping(ping e815 stamp 2013-03-08 03:46:44.786679) v2 -- ?+0 0x765b180 con 0x6ab6160

[no ping_reply logged, then later...]

2013-03-08 03:46:56.211993 7fcc71190700 -1 osd.0 815 heartbeat_check: no reply from osd.1 since 2013-03-08 03:46:35.986327 (cutoff 2013-03-08 03:46:36.211992)

Bugger. I just realised that the cluster had come up without the ms = 2 logging enabled, making these logs no more useful than the previous ones for working out who dropped the missing ping_reply. Injected ms = 2 into the osds and mons, and added it to the config files. Sigh.

And we're waiting again...

Chris.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
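(For anyone wanting to replicate the two-hourly rotation described above, it can be sketched as a logrotate stanza like the following. The log glob, retention count, and reload command are assumptions about this kind of setup, not the actual config; logrotate itself would also need to be driven from cron every two hours, e.g. "0 */2 * * * root logrotate /etc/logrotate.d/ceph-frequent".)

```
# hypothetical /etc/logrotate.d/ceph-frequent
/var/log/ceph/*.log {
    rotate 12           # keep ~24h of logs at a two-hourly cadence
    compress
    missingok
    notifempty
    sharedscripts
    postrotate
        # make the daemons re-open their (renamed) log files
        service ceph reload >/dev/null 2>&1 || true
    endscript
}
```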
Re: Mon losing touch with OSDs
On Thu, Feb 28, 2013 at 09:00:24PM -0800, Sage Weil wrote:
On Fri, 1 Mar 2013, Chris Dunlop wrote:
On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:

FYI... To avoid running out of disk space for the massive logs, I started using logrotate on the ceph logs every two hours, which does a 'service ceph reload' to re-open the log files.
In the week since doing that I haven't seen any 'slow requests' at all (the load has stayed the same as before the change), which means the issue with the osds dropping out, then the system not recovering properly, also hasn't happened. That's a bit suspicious, no?

I suspect the logging itself is changing the timing. Let's wait and see if we get lucky...

We got lucky...

ceph-osd.0.log:

2013-03-08 03:46:44.786682 7fcc62172700  1 -- 192.168.254.132:0/20298 --> 192.168.254.133:6801/23939 -- osd_ping(ping e815 stamp 2013-03-08 03:46:44.786679) v2 -- ?+0 0x765b180 con 0x6ab6160

[no ping_reply logged, then later...]

2013-03-08 03:46:56.211993 7fcc71190700 -1 osd.0 815 heartbeat_check: no reply from osd.1 since 2013-03-08 03:46:35.986327 (cutoff 2013-03-08 03:46:36.211992)

ceph-osd.1.log:

2013-03-08 03:46:44.786848 7fe6f47a4700  1 -- 192.168.254.133:6801/23939 <== osd.0 192.168.254.132:0/20298 178549 ==== osd_ping(ping e815 stamp 2013-03-08 03:46:44.786679) v2 ==== 47+0+0 (1298645350 0 0) 0x98256c0 con 0x7bd2160

2013-03-08 03:46:44.786880 7fe6f47a4700  1 -- 192.168.254.133:6801/23939 --> 192.168.254.132:0/20298 -- osd_ping(ping_reply e815 stamp 2013-03-08 03:46:44.786679) v2 -- ?+0 0x29876c0 con 0x7bd2160

Interestingly, the matching ping_reply from osd.1 never appears in the osd.0 log, in contrast to the previous incident upthread where the missing ping replies were all seen in a rush (but only after osd.1 had been marked down). The missing ping_reply caused osd.1 to be marked down; it then marked itself up again a bit later ("map e818 wrongly marked me down"). However, the system still hadn't recovered by 07:46:29, when a 'service ceph restart' was done on the machine holding mon.b5 and osd.1, bringing things back to life.
Before the restart:

# ceph -s
   health HEALTH_WARN 273 pgs peering; 2 pgs recovery_wait; 273 pgs stuck inactive; 576 pgs stuck unclean; recovery 43/293224 degraded (0.015%)
   monmap e9: 3 mons at {b2=10.200.63.130:6789/0,b4=10.200.63.132:6789/0,b5=10.200.63.133:6789/0}, election epoch 898, quorum 0,1,2 b2,b4,b5
   osdmap e825: 2 osds: 2 up, 2 in
    pgmap v3545580: 576 pgs: 301 active, 2 active+recovery_wait, 273 peering; 560 GB data, 1348 GB used, 2375 GB / 3724 GB avail; 43/293224 degraded (0.015%)
   mdsmap e1: 0/0/1 up

After the restart:

# ceph -s
   health HEALTH_WARN 19 pgs recovering; 24 pgs recovery_wait; 43 pgs stuck unclean; recovery 66/293226 degraded (0.023%)
   monmap e9: 3 mons at {b2=10.200.63.130:6789/0,b4=10.200.63.132:6789/0,b5=10.200.63.133:6789/0}, election epoch 902, quorum 0,1,2 b2,b4,b5
   osdmap e828: 2 osds: 2 up, 2 in
    pgmap v3545603: 576 pgs: 533 active+clean, 24 active+recovery_wait, 19 active+recovering; 560 GB data, 1348 GB used, 2375 GB / 3724 GB avail; 0B/s rd, 8135KB/s wr, 224op/s; 66/293226 degraded (0.023%)
   mdsmap e1: 0/0/1 up

Logs covering 00:00 to 09:00 are in:

https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2

Cheers,

Chris
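(As an aside: with 'debug ms = 1' lines like the above, unanswered pings can be picked out mechanically by pairing the stamp of each osd_ping(ping ...) with the matching osd_ping(ping_reply ...) in the same log. A rough sketch; the regex is an assumption based on the quoted log format, not something from the ceph tree:)

```python
import re

# stamp field of osd_ping(ping ...) / osd_ping(ping_reply ...) lines
PING_RE = re.compile(r"osd_ping\((ping|ping_reply) e\d+ stamp (\S+ \S+)\)")

def unanswered_pings(lines):
    """Return stamps of pings sent for which no ping_reply was seen."""
    sent, replied = set(), set()
    for line in lines:
        m = PING_RE.search(line)
        if m:
            kind, stamp = m.groups()
            (sent if kind == "ping" else replied).add(stamp)
    return sorted(sent - replied)

log = [
    "... osd_ping(ping e815 stamp 2013-03-08 03:46:35.986327) v2 ...",
    "... osd_ping(ping_reply e815 stamp 2013-03-08 03:46:35.986327) v2 ...",
    "... osd_ping(ping e815 stamp 2013-03-08 03:46:44.786679) v2 ...",
    # no ping_reply for the 03:46:44 stamp
]
print(unanswered_pings(log))  # -> ['2013-03-08 03:46:44.786679']
```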
Re: Mon losing touch with OSDs
On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:

Yeah. You could do 2 instead of 1 so you can turn it down. I suspect that this is the lion's share of what debug 20 will spam to the log, but hopefully the load is manageable!

Good idea on the '2'. I'll get that installed and wait for it to happen again.

FYI... To avoid running out of disk space for the massive logs, I started using logrotate on the ceph logs every two hours, which does a 'service ceph reload' to re-open the log files.

In the week since doing that I haven't seen any 'slow requests' at all (the load has stayed the same as before the change), which means the issue with the osds dropping out, then the system not recovering properly, also hasn't happened. That's a bit suspicious, no?
I've now put the log dirs on each machine on their own 2TB partition and reverted to the default daily rotates.

And once more we're waiting... Godot, is that you?

Chris
Re: Mon losing touch with OSDs
On Fri, 1 Mar 2013, Chris Dunlop wrote:

FYI... To avoid running out of disk space for the massive logs, I started using logrotate on the ceph logs every two hours, which does a 'service ceph reload' to re-open the log files.

In the week since doing that I haven't seen any 'slow requests' at all (the load has stayed the same as before the change), which means the issue with the osds dropping out, then the system not recovering properly, also hasn't happened. That's a bit suspicious, no?
I suspect the logging itself is changing the timing. Let's wait and see if we get lucky...

sage

I've now put the log dirs on each machine on their own 2TB partition and reverted to the default daily rotates.

And once more we're waiting... Godot, is that you?

Chris
Re: Mon losing touch with OSDs
On Fri, 22 Feb 2013, Chris Dunlop wrote:

G'day, It seems there might be two issues here: the first being the delayed receipt of echo replies causing a seemingly otherwise healthy osd to be marked down, the second being the lack of recovery once the downed osd is recognised as up again. Is it worth my opening tracker reports for this, just so it doesn't get lost?

I just looked at the logs. I can't tell what happened to cause that 10 second delay.. strangely, messages were passing from 0 -> 1, but nothing came back from 1 -> 0 (although 1 was queuing, if not sending, them).

The strange bit is that after this, you get those indefinite hangs. From the logs it looks like the OSD rebound to an old port that was previously open from osd.0.. probably from way back. Do you have logs going further back than what you posted? Also, do you have osdmaps, say, 750 and onward? It looks like there is a bug in the connection handling code (that is unrelated to the delay above).

Thanks!
sage
Re: Mon losing touch with OSDs
On Sat, 23 Feb 2013, Chris Dunlop wrote:

Currently uploading logs starting midnight to dropbox, will send links when they're up. How would I retrieve the interesting osdmaps?

They are in the monitor data directory, in the osdmap_full dir.

Thanks!
sage
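(For reference, in a mon data dir of this era that means one file per epoch under something like /var/lib/ceph/mon/ceph-b2/osdmap_full/ — the exact path and per-epoch file naming are assumptions here; check your own mon data dir. A small helper to pull out epochs 750 onward might look like:)

```python
import glob
import os
import shutil

def collect_osdmaps(mon_data_dir, first_epoch, dest):
    """Copy full osdmap files with epoch >= first_epoch out of a mon
    data dir (assumes one file per epoch, named by epoch number)."""
    os.makedirs(dest, exist_ok=True)
    copied = []
    for path in glob.glob(os.path.join(mon_data_dir, "osdmap_full", "*")):
        name = os.path.basename(path)
        if name.isdigit() and int(name) >= first_epoch:
            shutil.copy(path, os.path.join(dest, name))
            copied.append(int(name))
    return sorted(copied)

# e.g. collect_osdmaps("/var/lib/ceph/mon/ceph-b2", 750, "/tmp/osdmaps")
```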
Re: Mon losing touch with OSDs
On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
On Sat, 23 Feb 2013, Chris Dunlop wrote:

Currently uploading logs starting midnight to dropbox, will send links when they're up. How would I retrieve the interesting osdmaps?

They are in the monitor data directory, in the osdmap_full dir.

Logs from midnight onwards and osdmaps are in this folder:

https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2

ceph-mon.b2.log.bz2
ceph-mon.b4.log.bz2
ceph-mon.b5.log.bz2
ceph-osd.0.log.bz2
ceph-osd.1.log.bz2 (still uploading as I type)
osdmaps.zip

Cheers,

Chris
Re: Mon losing touch with OSDs
On Sat, 23 Feb 2013, Chris Dunlop wrote:

Logs from midnight onwards and osdmaps are in this folder:

https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2

ceph-mon.b2.log.bz2
ceph-mon.b4.log.bz2
ceph-mon.b5.log.bz2
ceph-osd.0.log.bz2
ceph-osd.1.log.bz2 (still uploading as I type)
osdmaps.zip

I'll take a look...
sage
Re: Mon losing touch with OSDs
On Tue, Feb 19, 2013 at 02:02:03PM +1100, Chris Dunlop wrote:
On Sun, Feb 17, 2013 at 05:44:29PM -0800, Sage Weil wrote:

Hrm, I don't see any obvious clues. You could enable 'debug ms = 1' on the osds as well. That will give us more to go on if/when it happens again, and should not affect performance significantly.

Done: ceph osd tell '*' injectargs '--debug-ms 1'

Now to wait for it to happen again.

OK, we got it again. Full logs covering the incident are available at:

https://www.dropbox.com/s/kguzwyjfglv3ypl/ceph-logs.zip

Archive:  /tmp/ceph-logs.zip
  Length  Method    Size  Cmpr    Date    Time   CRC-32   Name
 --------  ------  ------- ---- ---------- ----- --------  ----
    11492  Defl:X     1186  90% 2013-02-20 12:04 c0cba4ae  ceph-mon.b2.log
  1270789  Defl:X    89278  93% 2013-02-20 12:00 2208d035  ceph-mon.b4.log
  1375858  Defl:X   104025  92% 2013-02-20 12:05 c64c1dad  ceph-mon.b5.log
  2020042  Defl:X   215000  89% 2013-02-20 10:40 f74ae4ca  ceph-osd.0.log
  2075512  Defl:X   224098  89% 2013-02-20 12:05 b454d2ec  ceph-osd.1.log
   154938  Defl:X    12989  92% 2013-02-20 12:04 d2729b05  ceph.log
 --------          -------  ---                            -------
  6908631           646576  91%                            6 files

My naive analysis, based on the log extracts below (best viewed on a wide screen!)...

osd.0 starts hearing much-delayed ping_replies from osd.1 and tells the mon, which marks osd.1 down.
However, the whole time the osd.1 log indicates that it's receiving and responding to each ping from osd.0 in a timely fashion. In contrast, the osd.0 log indicates it isn't seeing the osd.1 replies for a while, then sees them all arrive in a flurry, until they're delayed enough to cause osd.0 to tell the mon. During the time osd.0 is not seeing the osd.1 ping_replies there is other traffic (osd_op, osd_sub_op, osd_sub_op_reply etc.) between osd.0 and osd.1, indicating that it's not a network problem. The load on both osds during this period was 90% idle and 1% iowait. Is this pointing to osd.0 experiencing some kind of scheduling or priority starvation on the ping thread (assuming the ping is in its own thread)?

The next odd thing is that, although the osds are both back by 04:38:50 (2 osds: 2 up, 2 in), the system still wasn't working (see the disk stats for both osd.0 and osd.1) and didn't recover until ceph (mon + osd) was restarted on one of the boxes at around 05:50 (not shown in the logs, but full logs available if needed).

Prior to the restart:

# ceph health
HEALTH_WARN 281 pgs peering; 281 pgs stuck inactive; 576 pgs stuck unclean

(Sorry, once again didn't get a 'ceph -s' prior to the restart.)

Chris.
-- ceph.log --
2013-02-20 04:37:51.074128 mon.0 10.200.63.130:6789/0 120771 : [INF] pgmap v3000932: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:37:53.541471 mon.0 10.200.63.130:6789/0 120772 : [INF] pgmap v3000933: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:37:56.063059 mon.0 10.200.63.130:6789/0 120773 : [INF] pgmap v3000934: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:37:58.532763 mon.0 10.200.63.130:6789/0 120774 : [INF] pgmap v3000935: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
    <-- osd.0 sees delayed ping_replies from here
2013-02-20 04:38:01.057939 mon.0 10.200.63.130:6789/0 120775 : [INF] pgmap v3000936: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:03.541404 mon.0 10.200.63.130:6789/0 120776 : [INF] pgmap v3000937: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:06.133004 mon.0 10.200.63.130:6789/0 120777 : [INF] pgmap v3000938: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
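(For anyone following along, the heartbeat_check complaint discussed here boils down to: the last reply stamp from the peer is older than a cutoff of now minus the heartbeat grace. The 20s figure below is the usual 'osd heartbeat grace' default and is an assumption, not taken from this cluster's config. A toy model, using the stamps from the heartbeat_check line quoted elsewhere in this thread:)

```python
from datetime import datetime, timedelta

GRACE = timedelta(seconds=20)  # assumed 'osd heartbeat grace' default

def heartbeat_check(now, last_reply, grace=GRACE):
    """Mimic the osd heartbeat check: a peer is reported once its last
    reply is older than the cutoff (now - grace)."""
    cutoff = now - grace
    return last_reply < cutoff

fmt = "%Y-%m-%d %H:%M:%S.%f"
now = datetime.strptime("2013-03-08 03:46:56.211993", fmt)
last = datetime.strptime("2013-03-08 03:46:35.986327", fmt)
print(heartbeat_check(now, last))  # -> True: 'no reply from osd.N since ...'
```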
Re: Mon losing touch with OSDs
On Sun, Feb 17, 2013 at 05:44:29PM -0800, Sage Weil wrote:
On Mon, 18 Feb 2013, Chris Dunlop wrote:
On Sat, Feb 16, 2013 at 09:05:21AM +1100, Chris Dunlop wrote:
On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
On Fri, 15 Feb 2013, Chris Dunlop wrote:

In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the mons to lose touch with the osds?

Can you enable 'debug ms = 1' on the mons and leave them that way, in the hopes that this happens again? It will give us more information to go on.

Debug turned on. We haven't experienced the cluster losing touch with the osds completely since upgrading from 0.56.2 to 0.56.3, but we did lose touch with osd.1 for a few seconds before it recovered. See below for logs (reminder: 3 boxes, b2 is mon-only, b4 is mon+osd.0, b5 is mon+osd.1).

Hrm, I don't see any obvious clues. You could enable 'debug ms = 1' on the osds as well. That will give us more to go on if/when it happens again, and should not affect performance significantly.

Done: ceph osd tell '*' injectargs '--debug-ms 1'

Now to wait for it to happen again.

Chris
Re: Mon losing touch with OSDs
G'day Sage,

On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
On Fri, 15 Feb 2013, Chris Dunlop wrote:

In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the mons to lose touch with the osds?

Can you enable 'debug ms = 1' on the mons and leave them that way, in the hopes that this happens again? It will give us more information to go on.

Debug turned on.

Perhaps the mon lost osd.1 because it was too slow, but that hadn't happened in any of the many previous slow requests instances, and the timing doesn't look quite right: the mon complains it hasn't heard from osd.0 since 20:11:19, but the osd.0 log shows no problems at all; then the mon complains about not having heard from osd.1 since 20:11:21, whereas the first indication of trouble on osd.1 was the request from 20:26:20 not being processed in a timely fashion.

My guess is the above was a side-effect of osd.0 being marked out. On 0.56.2 there is some strange peering workqueue lagginess that could potentially contribute as well. I recommend moving to 0.56.3.

Upgraded to 0.56.3.

Trying to manually set the osds in (e.g. ceph osd in 0) didn't help, nor did restarting the osds ('service ceph restart osd' on each osd host). The immediate issue was resolved by restarting ceph completely on one of the mon/osd hosts (service ceph restart). Possibly a restart of just the mon would have been sufficient.

Did you notice that the osds you restarted didn't immediately mark themselves in? Again, it could be explained by the peering wq issue, especially if there are pools in your cluster that are not getting any IO.

Sorry, no. I was kicking myself later for losing the 'ceph -s' output when I killed that terminal session, but in the heat of the moment... I can't see anything about osds marking themselves in in the logs from the time (with no debugging), but I'm on my ipad at the moment so I could easily have missed it. Should that info be in the logs somewhere?
There are certainly unused pools: we're only using the rbd pool, so the default data and metadata pools are unused.

Thanks for your attention!

Cheers,

Chris
Mon losing touch with OSDs
G'day,

In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the mons to lose touch with the osds? I imagine a network glitch could cause it, but I can't see any issues in any other system logs on any of the machines on the network.

Having (mostly?) resolved my previous slow requests issue (http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13076) at around 13:45, there were no problems until the mon lost osd.0 at 20:26 and lost osd.1 5 seconds later:

ceph-mon.b2.log:
2013-02-14 20:11:19.892060 7fa48d4f8700 0 log [INF] : pgmap v2822096: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:11:21.719513 7fa48d4f8700 0 log [INF] : pgmap v2822097: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:26:20.656162 7fa48dcf9700 -1 mon.b2@0(leader).osd e768 no osd or pg stats from osd.0 since 2013-02-14 20:11:19.720812, 900.935345 seconds ago. marking down
2013-02-14 20:26:20.780244 7fa48d4f8700 1 mon.b2@0(leader).osd e769 e769: 2 osds: 1 up, 2 in
2013-02-14 20:26:20.837123 7fa48d4f8700 0 log [INF] : osdmap e769: 2 osds: 1 up, 2 in
2013-02-14 20:26:20.947523 7fa48d4f8700 0 log [INF] : pgmap v2822098: 576 pgs: 304 active+clean, 272 stale+active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:26:25.709341 7fa48dcf9700 -1 mon.b2@0(leader).osd e769 no osd or pg stats from osd.1 since 2013-02-14 20:11:21.523741, 904.185596 seconds ago. marking down
2013-02-14 20:26:25.822773 7fa48d4f8700 1 mon.b2@0(leader).osd e770 e770: 2 osds: 0 up, 2 in
2013-02-14 20:26:25.863493 7fa48d4f8700 0 log [INF] : osdmap e770: 2 osds: 0 up, 2 in
2013-02-14 20:26:25.954799 7fa48d4f8700 0 log [INF] : pgmap v2822099: 576 pgs: 576 stale+active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:31:30.772360 7fa48dcf9700 0 log [INF] : osd.1 out (down for 304.933403)
2013-02-14 20:31:30.893521 7fa48d4f8700 1 mon.b2@0(leader).osd e771 e771: 2 osds: 0 up, 1 in
2013-02-14 20:31:30.933439 7fa48d4f8700 0 log [INF] : osdmap e771: 2 osds: 0 up, 1 in
2013-02-14 20:31:31.055408 7fa48d4f8700 0 log [INF] : pgmap v2822100: 576 pgs: 576 stale+active+clean; 407 GB data, 417 GB used, 1444 GB / 1862 GB avail
2013-02-14 20:35:05.831221 7fa48dcf9700 0 log [INF] : osd.0 out (down for 525.033581)
2013-02-14 20:35:05.989724 7fa48d4f8700 1 mon.b2@0(leader).osd e772 e772: 2 osds: 0 up, 0 in
2013-02-14 20:35:06.031409 7fa48d4f8700 0 log [INF] : osdmap e772: 2 osds: 0 up, 0 in
2013-02-14 20:35:06.129046 7fa48d4f8700 0 log [INF] : pgmap v2822101: 576 pgs: 576 stale+active+clean; 407 GB data, 0 KB used, 0 KB / 0 KB avail

The other 2 mons both have messages like this in their logs, starting at around 20:12:

2013-02-14 20:12:26.534977 7f2092b86700 0 -- 10.200.63.133:6789/0 >> 10.200.63.133:6800/6466 pipe(0xade76500 sd=22 :6789 s=0 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)
2013-02-14 20:13:24.741092 7f2092d88700 0 -- 10.200.63.133:6789/0 >> 10.200.63.132:6800/2456 pipe(0x9f8b7180 sd=28 :6789 s=0 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)
2013-02-14 20:13:56.551908 7f2090560700 0 -- 10.200.63.133:6789/0 >> 10.200.63.133:6800/6466 pipe(0x9f8b6000 sd=41 :6789 s=0 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)
2013-02-14 20:14:24.752356 7f209035e700 0 -- 10.200.63.133:6789/0 >> 10.200.63.132:6800/2456 pipe(0x9f8b6500 sd=42 :6789 s=0 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)

(10.200.63.132 is mon.b4/osd.0, 10.200.63.133 is mon.b5/osd.1)

...although Greg Farnum indicates these messages are normal: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/5989/focus=5993

osd.0 doesn't show any signs of distress at all:

ceph-osd.0.log:
2013-02-14 20:00:10.280601 7ffceb012700 0 log [INF] : 2.7e scrub ok
2013-02-14 20:14:19.923490 7ffceb012700 0 log [INF] : 2.5b scrub ok
2013-02-14 20:14:50.571980 7ffceb012700 0 log [INF] : 2.7b scrub ok
2013-02-14 20:17:48.475129 7ffceb012700 0 log [INF] : 2.7d scrub ok
2013-02-14 20:28:22.601594 7ffceb012700 0 log [INF] : 2.91 scrub ok
2013-02-14 20:28:32.839278 7ffceb012700 0 log [INF] : 2.92 scrub ok
2013-02-14 20:28:46.992226 7ffceb012700 0 log [INF] : 2.93 scrub ok
2013-02-14 20:29:12.330668 7ffceb012700 0 log [INF] : 2.95 scrub ok

...although osd.1 started seeing problems around this time:

ceph-osd.1.log:
2013-02-14 20:03:11.413352 7fd1d8f0a700 0 log [INF] : 2.23 scrub ok
2013-02-14 20:26:51.601425 7fd1e6f26700 0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for 30.750063 secs
2013-02-14 20:26:51.601432 7fd1e6f26700 0 log [WRN] : slow request 30.750063 seconds old, received at 2013-02-14 20:26:20.851304: osd_op(client.9983.0:28173 xxx.rbd [watch 1~0] 2.10089424) v4 currently wait for new map
2013-02-14 20:26:51.601437 7fd1e6f26700 0 log [WRN] : slow request 30.749947 seconds old,
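As an aside, the two intervals visible in the mon log above look like they correspond to two mon tunables. This is a sketch only: the option names are taken from the Ceph documentation of this era, and the values shown are (as far as I know) the defaults, which match the ~900 s "marking down" and ~300 s "out (down for ...)" figures in the log:

```ini
; Sketch of a ceph.conf fragment with the assumed defaults.
[mon]
    ; mark an osd down after it reports no osd/pg stats for this long
    ; (cf. "900.935345 seconds ago. marking down")
    mon osd report timeout = 900
    ; mark a down osd out after it has been down this long
    ; (cf. "osd.1 out (down for 304.933403)")
    mon osd down out interval = 300
```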
Re: Mon losing touch with OSDs
Hi Chris,

On Fri, 15 Feb 2013, Chris Dunlop wrote:
> G'day,
>
> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the mons to lose touch with the osds? I imagine a network glitch could cause it, but I can't see any issues in any other system logs on any of the machines on the network.
>
> Having (mostly?) resolved my previous slow requests issue (http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13076) at around 13:45, there were no problems until the mon lost osd.0 at 20:26 and lost osd.1 5 seconds later:
>
> ceph-mon.b2.log:
> 2013-02-14 20:11:19.892060 7fa48d4f8700 0 log [INF] : pgmap v2822096: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
> 2013-02-14 20:11:21.719513 7fa48d4f8700 0 log [INF] : pgmap v2822097: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
> 2013-02-14 20:26:20.656162 7fa48dcf9700 -1 mon.b2@0(leader).osd e768 no osd or pg stats from osd.0 since 2013-02-14 20:11:19.720812, 900.935345 seconds ago. marking down

There is a safety check: if the osd doesn't check in for a long period of time, we assume it is dead. But it seems as though that shouldn't happen here, since osd.0 has some PGs assigned and is scrubbing away.

Can you enable 'debug ms = 1' on the mons and leave them that way, in the hopes that this happens again? It will give us more information to go on.
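The safety check described here is simple to picture: the mon keeps the time of each osd's last stats report and flags any osd whose report is older than a timeout. A rough model in Python — illustrative only, not the Ceph source; the 900-second figure is assumed from the numbers in the quoted log:

```python
from datetime import datetime, timedelta

# Assumed timeout; the mon log above reports "900.935345 seconds ago"
# before marking osd.0 down, consistent with a 900 s limit.
TIMEOUT = timedelta(seconds=900)

def stale_osds(last_stats, now):
    """Return (osd, age_in_seconds) for each osd whose most recent
    stats report is older than TIMEOUT at time 'now'."""
    stale = []
    for osd, last in sorted(last_stats.items()):
        age = now - last
        if age > TIMEOUT:
            stale.append((osd, age.total_seconds()))
    return stale

# Timestamps from the quoted mon log: osd.0 last reported at
# 20:11:19.720812; the check at 20:26:20.656162 finds it ~900.94 s
# stale, so it gets marked down.
now = datetime(2013, 2, 14, 20, 26, 20, 656162)
flagged = stale_osds({"osd.0": datetime(2013, 2, 14, 20, 11, 19, 720812)}, now)
```

Nothing here is the actual implementation; it just makes the arithmetic in the "900.935345 seconds ago" log line checkable.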
> ...although osd.1 started seeing problems around this time:
>
> ceph-osd.1.log:
> 2013-02-14 20:03:11.413352 7fd1d8f0a700 0 log [INF] : 2.23 scrub ok
> 2013-02-14 20:26:51.601425 7fd1e6f26700 0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for 30.750063 secs
> 2013-02-14 20:26:51.601432 7fd1e6f26700 0 log [WRN] : slow request 30.750063 seconds old, received at 2013-02-14 20:26:20.851304: osd_op(client.9983.0:28173 xxx.rbd [watch 1~0] 2.10089424) v4 currently wait for new map
> 2013-02-14 20:26:51.601437 7fd1e6f26700 0 log [WRN] : slow request 30.749947 seconds old, received at 2013-02-14 20:26:20.851420: osd_op(client.10001.0:618473 yy.rbd [watch 1~0] 2.3854277a) v4 currently wait for new map
> 2013-02-14 20:26:51.601440 7fd1e6f26700 0 log [WRN] : slow request 30.749938 seconds old, received at 2013-02-14 20:26:20.851429: osd_op(client.9998.0:39716 zz.rbd [watch 1~0] 2.71731007) v4 currently wait for new map
> 2013-02-14 20:26:51.601442 7fd1e6f26700 0 log [WRN] : slow request 30.749907 seconds old, received at 2013-02-14 20:26:20.851460: osd_op(client.10007.0:59572 aa.rbd [watch 1~0] 2.320eebb8) v4 currently wait for new map
> 2013-02-14 20:26:51.601445 7fd1e6f26700 0 log [WRN] : slow request 30.749630 seconds old, received at 2013-02-14 20:26:20.851737: osd_op(client.9980.0:86883 bb.rbd [watch 1~0] 2.ab9b579f) v4 currently wait for new map
>
> Perhaps the mon lost osd.1 because it was too slow, but that hadn't happened in any of the many previous slow request instances, and the timing doesn't look quite right: the mon complains it hasn't heard from osd.0 since 20:11:19, but the osd.0 log shows no problems at all; then the mon complains about not having heard from osd.1 since 20:11:21, whereas the first indication of trouble on osd.1 was the request from 20:26:20 not being processed in a timely fashion.

My guess is the above was a side-effect of osd.0 being marked out.
On 0.56.2 there is some strange peering workqueue lagginess that could potentially contribute as well. I recommend moving to 0.56.3.

> Not knowing enough about how the various pieces of ceph talk to each other makes it difficult to distinguish cause and effect!
>
> Trying to manually set the osds in (e.g. ceph osd in 0) didn't help, nor did restarting the osds ('service ceph restart osd' on each osd host). The immediate issue was resolved by restarting ceph completely on one of the mon/osd hosts (service ceph restart). Possibly a restart of just the mon would have been sufficient.

Did you notice that the osds you restarted didn't immediately mark themselves in? Again, it could be explained by the peering wq issue, especially if there are pools in your cluster that are not getting any IO.

sage