Sounds like you've got a few different things happening here.

On Tue, Aug 15, 2017 at 4:23 AM Sean Purdy <s.pu...@cv-library.co.uk> wrote:
> Luminous 12.1.1 rc1
>
> Hi,
>
> I have a three-node cluster with 6 OSDs and 1 mon per node.
>
> I had to turn off one node for rack reasons. While the node was down, the
> cluster was still running and accepting files via radosgw. However, when I
> turned the machine back on, radosgw uploads stopped working and things like
> "ceph status" started timing out. It took 20 minutes for "ceph status" to
> be OK.
>
> In the recent past I've rebooted one or other node and the cluster kept
> working, and when the machine came back, the OSDs and monitor rejoined the
> cluster and things went on as usual.
>
> The machine was off for 21 hours or so.
>
> Any idea what might be happening, and how to mitigate the effects of this
> next time a machine has to be down for any length of time?
>
> "ceph status" said:
>
> 2017-08-15 11:28:29.835943 7fdf2d74b700  0 monclient(hunting):
> authenticate timed out after 300
> 2017-08-15 11:28:29.835993 7fdf2d74b700  0 librados: client.admin
> authentication error (110) Connection timed out

That just means the client couldn't connect to an in-quorum monitor. It
should have tried them all in sequence, though. Did you check whether you
had *any* functioning quorum?

> monitor log said things like this before everything came together:
>
> 2017-08-15 11:23:07.180123 7f11c0fcc700  0 -- 172.16.0.43:0/2471 >>
> 172.16.0.45:6812/1904 conn(0x556eeaf4d000 :-1
> s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
> l=0).handle_connect_reply connect got BADAUTHORIZER

This one's odd. We did get one other report of something like that, but I
tend to think it's a clock sync issue.

> but "ceph --admin-daemon /var/run/ceph/ceph-mon.xxx.asok quorum_status"
> did work.

This monitor node was detected but not yet in quorum.
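For what it's worth, when the cluster-wide commands hang like that you can usually still interrogate each monitor directly through its admin socket, and rule clock skew in or out. A rough sketch (the mon ID "xxx" is a placeholder for your own, as in your quorum_status command above):

```shell
# Ask the local monitor for its own view of the quorum, bypassing the
# normal client path. This works even while "ceph status" is timing out.
ceph --admin-daemon /var/run/ceph/ceph-mon.xxx.asok mon_status
ceph --admin-daemon /var/run/ceph/ceph-mon.xxx.asok quorum_status

# BADAUTHORIZER is often a symptom of clock skew between nodes, since
# cephx authorizers are time-sensitive. Check time sync on every node:
ntpq -p            # or: chronyc tracking
ceph time-sync-status   # once a quorum exists, reports per-mon skew
```

Running mon_status on each of the three nodes would tell you how many monitors were actually up and whether they could see each other.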
> OSDs had 15 minutes of
>
> ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-9: (2) No
> such file or directory

That would appear to be something happening underneath Ceph: your data
directory wasn't actually mounted yet, or something along those lines.

Anyway, the cluster should have survived that transition without any
noticeable impact (unless you are running so close to capacity that merely
getting the downed node up to date overwhelmed your disks/CPUs). But
without some basic information about what the cluster as a whole was doing,
I can't speculate further.
-Greg
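On the mitigation question: for planned downtime of this length, the usual approach is to tell Ceph not to mark the node's OSDs out and start rebalancing while it's away. A sketch of that flag dance (the ceph-9 path is taken from your error message; adjust for your OSDs):

```shell
# Before shutting the node down: prevent OSDs from being marked "out",
# so no recovery/backfill is triggered while the node is away.
ceph osd set noout

# ... shut down, do the rack work, power the node back on ...

# Check the OSD data directory is actually mounted before the daemon
# starts; "unable to open OSD superblock" usually means it wasn't.
findmnt /var/lib/ceph/osd/ceph-9

# Once all OSDs have rejoined and PGs are active+clean again:
ceph osd unset noout
```

With noout set, the downed OSDs stay "down" but "in", so the cluster serves I/O from the remaining replicas and only has to catch the returning OSDs up on the writes they missed, rather than backfilling whole PGs.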
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com