On Thu, Aug 13, 2020 at 5:26 PM Yun Zhou <y...@nvidia.com> wrote:

> Hi,
>
> Need expert's view to address a problem we are seeing now and then: A
> ovsdb-server node in a 3-nodes raft cluster keeps printing out the
> "raft_is_connected: false" message, and its "connected" state in its
> _Server DB stays as false.
>
> According to the ovsdb-server(5) manpage, it means this server is not
> contacting with a majority of its cluster.
>
> Except its "connected" state, from what we can see, this server is in the
> follower state and works fine, and connection between it and the other two
> servers appear healthy as well.
>
> Below is its raft structure snapshot at the time of the problem. Note that
> its candidate_retrying field stays as true.
>
> Hopefully the provide information can help to figure out what goes wrong
> here. Unfortunately we don't have a solid case to reproduce it:
>

Thanks for reporting the issue. This looks really strange. In the state
below, leader_sid is non-zero, but candidate_retrying is true.

According to the latest code, whenever leader_sid is set to a non-zero
value (in raft_set_leader()), candidate_retrying is set to false; whenever
candidate_retrying is set to true (in raft_start_election()), leader_sid is
set to UUID_ZERO. The struct is also allocated with xzalloc, which
guarantees that candidate_retrying starts out false.

So, sorry, I can't explain how it ended up in this contradictory state. It
would help if there were a way to reproduce it. How often does it happen?

Thanks,
Han

> (gdb) print *(struct raft *)0xa872c0
> $19 = {
>   hmap_node = {
>     hash = 2911123117,
>     next = 0x0
>   },
>   log = 0xa83690,
>   cid = {
>     parts = {2699238234, 2258650653, 3035282424, 813064186}
>   },
>   sid = {
>     parts = {1071328836, 400573240, 2626104521, 1746414343}
>   },
>   local_address = 0xa874e0 "tcp:10.8.51.55:6643",
>   local_nickname = 0xa876d0 "3fdb",
>   name = 0xa876b0 "OVN_Northbound",
>   servers = {
>     buckets = 0xad4bc0,
>     one = 0x0,
>     mask = 3,
>     n = 3
>   },
>   election_timer = 1000,
>   election_timer_new = 0,
>   term = 3,
>   vote = {
>     parts = {1071328836, 400573240, 2626104521, 1746414343}
>   },
>   synced_term = 3,
>   synced_vote = {
>     parts = {1071328836, 400573240, 2626104521, 1746414343}
>   },
>   entries = 0xbf0fe0,
>   log_start = 2,
>   log_end = 312,
>   log_synced = 311,
>   allocated_log = 512,
>   snap = {
>     term = 1,
>     data = 0xaafb10,
>     eid = {
>       parts = {1838862864, 1569866528, 2969429118, 3021055395}
>     },
>     servers = 0xaafa70,
>     election_timer = 1000
>   },
>   role = RAFT_FOLLOWER,
>   commit_index = 311,
>   last_applied = 311,
>   leader_sid = {
>     parts = {642765114, 43797788, 2533161504, 3088745929}
>   },
>   election_base = 6043283367,
>   election_timeout = 6043284593,
>   joining = false,
>   remote_addresses = {
>     map = {
>       buckets = 0xa87410,
>       one = 0xa879c0,
>       mask = 0,
>       n = 1
>     }
>   },
>   join_timeout = 6037634820,
>   leaving = false,
>   left = false,
>   leave_timeout = 0,
>   failed = false,
>   waiters = {
>     prev = 0xa87448,
>     next = 0xa87448
>   },
>   listener = 0xaafad0,
>   listen_backoff = -9223372036854775808,
>   conns = {
>     prev = 0xbcd660,
>     next = 0xaafc20
>   },
>   add_servers = {
>     buckets = 0xa87480,
>     one = 0x0,
>     mask = 0,
>     n = 0
>   },
>   remove_server = 0x0,
>   commands = {
>     buckets = 0xa874a8,
>     one = 0x0,
>     mask = 0,
>     n = 0
>   },
>   ping_timeout = 6043283700,
>   n_votes = 1,
>   candidate_retrying = true,
>   had_leader = false,
>   ever_had_leader = true
> }
>
> Thanks
> - Yun
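
For readers following the argument in the reply, the invariant it describes
(a non-zero leader_sid implies candidate_retrying is false) can be written
down as a small model. This is only a sketch under stated assumptions, not
the OVS implementation: every model_* name is hypothetical, the struct keeps
just the two fields under discussion, and the election function models only
the case in which candidate_retrying ends up true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the OVS types; not the real code. */
struct model_uuid {
    uint32_t parts[4];
};

struct model_raft {
    struct model_uuid leader_sid;
    bool candidate_retrying;
};

static bool
model_uuid_is_zero(const struct model_uuid *u)
{
    return !(u->parts[0] | u->parts[1] | u->parts[2] | u->parts[3]);
}

/* Zero-filled allocation, standing in for xzalloc(): both fields start
 * out as "no leader" and "not retrying". */
static struct model_raft *
model_raft_new(void)
{
    struct model_raft *r = calloc(1, sizeof *r);
    if (!r) {
        abort();
    }
    return r;
}

/* Models the effect the reply attributes to raft_set_leader(). */
static void
model_set_leader(struct model_raft *r, const struct model_uuid *sid)
{
    r->leader_sid = *sid;
    r->candidate_retrying = false;
}

/* Models the effect the reply attributes to raft_start_election(), in the
 * case where candidate_retrying becomes true. */
static void
model_start_election_retry(struct model_raft *r)
{
    memset(&r->leader_sid, 0, sizeof r->leader_sid);
    r->candidate_retrying = true;
}

/* The invariant: a known leader implies that we are not retrying. */
static bool
model_invariant_holds(const struct model_raft *r)
{
    return model_uuid_is_zero(&r->leader_sid) || !r->candidate_retrying;
}
```

In this model no sequence of the two setters can break the invariant. The
dumped state, with a non-zero leader_sid and candidate_retrying = true,
makes model_invariant_holds() return false, which is why the reply calls
the state contradictory.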
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss