That definitely sounds like a time sync issue. Are you *sure* the clocks matched each other? Is it reproducible on restart?
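It's worth checking what the mons themselves measure rather than trusting ntpd's word for it. Something like the below should show whether the clocks really agree (run from any node with an admin keyring; the ssh loop is just a sketch that assumes key-based access and uses the hostnames from your "ceph mon stat" output):

    # skew and latency each mon measures against the others
    ceph time-sync-status

    # cross-check the raw clocks and NTP peers on each node
    for h in store01 store02 store03; do
        ssh "$h" 'hostname; date +%s.%N; ntpq -p'
    done

Bear in mind "ceph status" only raises a clock skew warning once the skew exceeds mon_clock_drift_allowed (0.05s by default), so "no warning" isn't quite the same as "in sync".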
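BADAUTHORIZER means the peer rejected the cephx authorizer. Clock skew is the usual suspect, but a daemon that's been down for several hours can also come back with stale cephx tickets/rotating keys, which produces the same symptom from the outside. If it recurs, bumping auth/messenger debugging on the stuck mon should log the actual reason the authorizer is rejected; the levels here are only a suggestion:

    # on store03, through the mon's admin socket
    ceph daemon mon.store03 config set debug_auth 20/20
    ceph daemon mon.store03 config set debug_ms 1/5

    # restore the defaults once you've caught it in the act
    ceph daemon mon.store03 config set debug_auth 1/5
    ceph daemon mon.store03 config set debug_ms 0/5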
On Wed, Sep 20, 2017 at 2:50 AM Sean Purdy <[email protected]> wrote:
>
> Hi,
>
> Luminous 12.2.0
>
> Three node cluster, 18 OSD, debian stretch.
>
> One node is down for maintenance for several hours. When bringing it back up, the OSDs rejoin after 5 minutes, but health is still warning. The monitor has not joined quorum after 40 minutes, and the logs show a BADAUTHORIZER message every time the monitor tries to connect to the leader.
>
> 2017-09-20 09:46:05.581590 7f49e2b29700 0 -- 172.16.0.45:0/2243 >> 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
>
> Then after ~45 minutes the monitor *does* join quorum.
>
> I'm presuming this isn't normal behaviour? Or if it is, let me know and I won't worry.
>
> All three nodes are using ntp and look OK timewise.
>
> ceph-mon log:
>
> (.43 is leader, .45 is rebooted node, .44 is other live node in quorum)
>
> Boot:
>
> 2017-09-20 09:45:21.874152 7f49efeb8f80 0 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), pid 2243
>
> 2017-09-20 09:46:01.824708 7f49e1b27700 0 -- 172.16.0.45:6789/0 >> 172.16.0.44:6789/0 conn(0x56007244d000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 3 vs existing csq=0 existing_state=STATE_CONNECTING
> 2017-09-20 09:46:01.824723 7f49e1b27700 0 -- 172.16.0.45:6789/0 >> 172.16.0.44:6789/0 conn(0x56007244d000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 3, 0x5600722c0000.cseq = 0), sending RESETSESSION
> 2017-09-20 09:46:01.825247 7f49e1b27700 0 -- 172.16.0.45:6789/0 >> 172.16.0.44:6789/0 conn(0x56007244d000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
> 2017-09-20 09:46:01.828053 7f49e1b27700 0 -- 172.16.0.45:6789/0 >> 172.16.0.44:6789/0 conn(0x5600722c0000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=21872 cs=1 l=0).process missed message? skipped from seq 0 to 552717734
>
> 2017-09-20 09:46:05.580342 7f49e1b27700 0 -- 172.16.0.45:6789/0 >> 172.16.0.43:6789/0 conn(0x5600720fe800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=49261 cs=1 l=0).process missed message? skipped from seq 0 to 1151972199
> 2017-09-20 09:46:05.581097 7f49e2b29700 0 -- 172.16.0.45:0/2243 >> 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
> 2017-09-20 09:46:05.581590 7f49e2b29700 0 -- 172.16.0.45:0/2243 >> 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
> ...
> [message repeats for 45 minutes]
> ...
> 2017-09-20 10:23:38.818767 7f49e2b29700 0 -- 172.16.0.45:0/2243 >> 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
>
> At this point, "ceph mon stat" says .45/store03 not in quorum:
>
> e5: 3 mons at {store01=172.16.0.43:6789/0,store02=172.16.0.44:6789/0,store03=172.16.0.45:6789/0}, election epoch 376, leader 0 store01, quorum 0,1 store01,store02
>
> Then suddenly a valid connection is made and sync happens:
>
> 2017-09-20 10:23:43.041009 7f49e5b2f700 1 mon.store03@2(synchronizing).mds e1 Unable to load 'last_metadata'
> 2017-09-20 10:23:43.041967 7f49e5b2f700 1 mon.store03@2(synchronizing).osd e2381 e2381: 18 total, 13 up, 14 in
> ...
> 2017-09-20 10:23:43.045961 7f49e5b2f700 1 mon.store03@2(synchronizing).osd e2393 e2393: 18 total, 15 up, 15 in
> ...
> 2017-09-20 10:23:43.049255 7f49e5b2f700 1 mon.store03@2(synchronizing).osd e2406 e2406: 18 total, 18 up, 18 in
> ...
> 2017-09-20 10:23:43.054828 7f49e5b2f700 0 log_channel(cluster) log [INF] : mon.store03 calling new monitor election
> 2017-09-20 10:23:43.054901 7f49e5b2f700 1 mon.store03@2(electing).elector(372) init, last seen epoch 372
>
> Now "ceph mon stat" says:
>
> e5: 3 mons at {store01=172.16.0.43:6789/0,store02=172.16.0.44:6789/0,store03=172.16.0.45:6789/0}, election epoch 378, leader 0 store01, quorum 0,1,2 store01,store02,store03
>
> and everything's happy.
>
> What should I look for/fix? It's a fairly vanilla system.
>
> Thanks in advance,
>
> Sean Purdy
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
