That definitely sounds like a time sync issue. Are you *sure* the clocks matched each other? Is it reproducible on restart?
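It's worth checking what the mons themselves measure rather than trusting ntpd's word for it. Something like the below should show whether the clocks really agree (run from any node with an admin keyring; the ssh loop is just a sketch that assumes key-based access and uses the hostnames from your "ceph mon stat" output):

    # skew and latency each mon measures against the others
    ceph time-sync-status

    # cross-check the raw clocks and NTP peers on each node
    for h in store01 store02 store03; do
        ssh "$h" 'hostname; date +%s.%N; ntpq -p'
    done

Bear in mind "ceph status" only raises a clock skew warning once the skew exceeds mon_clock_drift_allowed (0.05s by default), so "no warning" isn't quite the same as "in sync".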
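BADAUTHORIZER means the peer rejected the cephx authorizer. Clock skew is the usual suspect, but a daemon that's been down for several hours can also come back with stale cephx tickets/rotating keys, which produces the same symptom from the outside. If it recurs, bumping auth/messenger debugging on the stuck mon should log the actual reason the authorizer is rejected; the levels here are only a suggestion:

    # on store03, through the mon's admin socket
    ceph daemon mon.store03 config set debug_auth 20/20
    ceph daemon mon.store03 config set debug_ms 1/5

    # restore the defaults once you've caught it in the act
    ceph daemon mon.store03 config set debug_auth 1/5
    ceph daemon mon.store03 config set debug_ms 0/5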
On Wed, Sep 20, 2017 at 2:50 AM Sean Purdy <[email protected]> wrote:
>
> Hi,
>
> Luminous 12.2.0
>
> Three node cluster, 18 OSD, debian stretch.
>
> One node is down for maintenance for several hours. When bringing it back up, the OSDs rejoin after 5 minutes, but health is still warning. The monitor has not joined quorum after 40 minutes, and the logs show a BADAUTHORIZER message every time the monitor tries to connect to the leader.
>
> 2017-09-20 09:46:05.581590 7f49e2b29700 0 -- 172.16.0.45:0/2243 >> 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
>
> Then after ~45 minutes the monitor *does* join quorum.
>
> I'm presuming this isn't normal behaviour? Or if it is, let me know and I won't worry.
>
> All three nodes are using ntp and look OK timewise.
>
> ceph-mon log:
>
> (.43 is leader, .45 is rebooted node, .44 is other live node in quorum)
>
> Boot:
>
> 2017-09-20 09:45:21.874152 7f49efeb8f80 0 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), pid 2243
>
> 2017-09-20 09:46:01.824708 7f49e1b27700 0 -- 172.16.0.45:6789/0 >> 172.16.0.44:6789/0 conn(0x56007244d000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 3 vs existing csq=0 existing_state=STATE_CONNECTING
> 2017-09-20 09:46:01.824723 7f49e1b27700 0 -- 172.16.0.45:6789/0 >> 172.16.0.44:6789/0 conn(0x56007244d000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 3, 0x5600722c0000.cseq = 0), sending RESETSESSION
> 2017-09-20 09:46:01.825247 7f49e1b27700 0 -- 172.16.0.45:6789/0 >> 172.16.0.44:6789/0 conn(0x56007244d000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
> 2017-09-20 09:46:01.828053 7f49e1b27700 0 -- 172.16.0.45:6789/0 >> 172.16.0.44:6789/0 conn(0x5600722c0000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=21872 cs=1 l=0).process missed message? skipped from seq 0 to 552717734
>
> 2017-09-20 09:46:05.580342 7f49e1b27700 0 -- 172.16.0.45:6789/0 >> 172.16.0.43:6789/0 conn(0x5600720fe800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=49261 cs=1 l=0).process missed message? skipped from seq 0 to 1151972199
> 2017-09-20 09:46:05.581097 7f49e2b29700 0 -- 172.16.0.45:0/2243 >> 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
> 2017-09-20 09:46:05.581590 7f49e2b29700 0 -- 172.16.0.45:0/2243 >> 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
> ...
> [message repeats for 45 minutes]
> ...
> 2017-09-20 10:23:38.818767 7f49e2b29700 0 -- 172.16.0.45:0/2243 >> 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
>
> At this point, "ceph mon stat" says .45/store03 not in quorum:
>
> e5: 3 mons at {store01=172.16.0.43:6789/0,store02=172.16.0.44:6789/0,store03=172.16.0.45:6789/0}, election epoch 376, leader 0 store01, quorum 0,1 store01,store02
>
> Then suddenly a valid connection is made and sync happens:
>
> 2017-09-20 10:23:43.041009 7f49e5b2f700 1 mon.store03@2(synchronizing).mds e1 Unable to load 'last_metadata'
> 2017-09-20 10:23:43.041967 7f49e5b2f700 1 mon.store03@2(synchronizing).osd e2381 e2381: 18 total, 13 up, 14 in
> ...
> 2017-09-20 10:23:43.045961 7f49e5b2f700 1 mon.store03@2(synchronizing).osd e2393 e2393: 18 total, 15 up, 15 in
> ...
> 2017-09-20 10:23:43.049255 7f49e5b2f700 1 mon.store03@2(synchronizing).osd e2406 e2406: 18 total, 18 up, 18 in
> ...
> 2017-09-20 10:23:43.054828 7f49e5b2f700 0 log_channel(cluster) log [INF] : mon.store03 calling new monitor election
> 2017-09-20 10:23:43.054901 7f49e5b2f700 1 mon.store03@2(electing).elector(372) init, last seen epoch 372
>
> Now "ceph mon stat" says:
>
> e5: 3 mons at {store01=172.16.0.43:6789/0,store02=172.16.0.44:6789/0,store03=172.16.0.45:6789/0}, election epoch 378, leader 0 store01, quorum 0,1,2 store01,store02,store03
>
> and everything's happy.
>
> What should I look for/fix? It's a fairly vanilla system.
>
> Thanks in advance,
>
> Sean Purdy
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
