Hi, I’ve got a 3-node RAFT cluster, serving ovn northbound DB and trying to understand what triggered ovsdb-server leadership change. Can somebody help explain that?
This cluster runs ovs 2.13.4 and has ~25 active clients: 3 x ovn-northd, 3 x ovn-ic, 11 python-ovsdbapp clients (CMS). Server logs (per-node) listed above. Node 1: 2021-08-12T07:45:55.189Z|03131|raft|INFO|server be61 is leader for term 357 2021-08-12T07:46:04.187Z|03132|raft|INFO|term 358: 1454 ms timeout expired, starting election 2021-08-12T07:46:04.191Z|03133|raft|INFO|term 358: elected leader by 2+ of 3 servers 2021-08-12T07:46:05.558Z|03134|raft|INFO|rejecting term 357 < current term 358 received in vote_reply message from server be61 2021-08-12T07:46:29.181Z|03135|timeval|WARN|Unreasonably long 2122ms poll interval (2120ms user, 0ms system) 2021-08-12T07:46:29.181Z|03136|timeval|WARN|context switches: 0 voluntary, 7 involuntary 2021-08-12T07:46:29.181Z|03137|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=3b99bba2: 2021-08-12T07:46:29.181Z|03138|coverage|INFO|hmap_pathological 94.0/sec 9.917/sec 0.1819/sec total: 85714 2021-08-12T07:46:29.181Z|03139|coverage|INFO|hmap_expand 39281.4/sec 4318.283/sec 112.9786/sec total: 193444648 2021-08-12T07:46:29.181Z|03140|coverage|INFO|lockfile_lock 0.0/sec 0.000/sec 0.0000/sec total: 1 2021-08-12T07:46:29.181Z|03141|coverage|INFO|poll_create_node 201.6/sec 109.617/sec 49.4042/sec total: 425254534 2021-08-12T07:46:29.181Z|03142|coverage|INFO|poll_zero_timeout 2.0/sec 1.567/sec 0.9350/sec total: 2938208 2021-08-12T07:46:29.181Z|03143|coverage|INFO|seq_change 13.0/sec 10.433/sec 5.5928/sec total: 22320268 2021-08-12T07:46:29.181Z|03144|coverage|INFO|pstream_open 0.0/sec 0.000/sec 0.0000/sec total: 4 2021-08-12T07:46:29.181Z|03145|coverage|INFO|stream_open 0.0/sec 0.000/sec 0.0000/sec total: 20 2021-08-12T07:46:29.181Z|03146|coverage|INFO|unixctl_received 0.0/sec 0.017/sec 0.0164/sec total: 54512 2021-08-12T07:46:29.181Z|03147|coverage|INFO|unixctl_replied 0.0/sec 0.017/sec 0.0164/sec total: 54512 2021-08-12T07:46:29.181Z|03148|coverage|INFO|util_xalloc 1693863.2/sec 181037.517/sec 3963.9467/sec total: 4830747405 2021-08-12T07:46:29.181Z|03149|coverage|INFO|97 events never hit 2021-08-12T07:46:29.186Z|03150|raft|WARN|ignoring vote request received as leader 2021-08-12T07:46:29.186Z|03151|raft|INFO|server be61 is leader for term 359 2021-08-12T07:46:29.187Z|03152|raft|INFO|1076 truncating 1 entries from end of log 2021-08-12T07:46:29.187Z|03153|raft|INFO|rejected append_reply (not leader) 2021-08-12T07:46:29.187Z|03154|raft|INFO|rejected append_reply (not leader) 2021-08-12T07:46:29.187Z|03155|raft|INFO|rejected append_reply (not leader) 2021-08-12T07:46:29.191Z|03156|raft|INFO|rejected append_reply (not leader) 2021-08-12T07:46:29.221Z|03157|jsonrpc|WARN|Dropped 4 log messages in last 14866 seconds (most recently, 14865 seconds ago) due to excessive rate 2021-08-12T07:46:29.221Z|03158|jsonrpc|WARN|tcp:client-1:42402: receive error: Connection reset by peer 2021-08-12T07:46:29.222Z|03159|reconnect|WARN|tcp:client-1:42402: connection dropped (Connection reset by peer) 2021-08-12T07:46:29.222Z|03160|jsonrpc|WARN|tcp:client-2:45746: receive error: Connection reset by peer 2021-08-12T07:46:29.222Z|03161|reconnect|WARN|tcp:client-2:45746: connection dropped (Connection reset by peer) 2021-08-12T07:46:29.225Z|03162|jsonrpc|WARN|tcp:client-3:54218: receive error: Connection reset by peer 2021-08-12T07:46:29.225Z|03163|reconnect|WARN|tcp:client-3:54218: connection dropped (Connection reset by peer) 2021-08-12T07:46:29.232Z|03164|jsonrpc|WARN|tcp:client-4:48064: receive error: Connection reset by peer 2021-08-12T07:46:29.232Z|03165|reconnect|WARN|tcp:client-4:48064: connection dropped (Connection reset by peer) 2021-08-12T07:46:29.232Z|03166|jsonrpc|WARN|tcp:client-5:49022: receive error: Connection reset by peer 2021-08-12T07:46:29.232Z|03167|reconnect|WARN|tcp:client-5:49022: connection dropped (Connection reset by peer) 2021-08-12T07:46:29.566Z|03168|reconnect|WARN|tcp:node-3:40902: connection dropped (Connection reset by peer) 2021-08-12T07:46:30.047Z|03169|poll_loop|INFO|Dropped 64 log messages in last 14879 seconds (most recently, 14876 seconds ago) due to excessive rate 2021-08-12T07:46:30.047Z|03170|poll_loop|INFO|wakeup due to [POLLOUT] on fd 49 (node-1:6641<->ip7:48658) at lib/stream-fd.c:153 (72% CPU usage) 2021-08-12T07:46:30.195Z|03171|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage) 2021-08-12T07:46:30.351Z|03172|poll_loop|INFO|wakeup due to [POLLOUT] on fd 43 (node-1:6641<->ip8:48656) at lib/stream-fd.c:153 (72% CPU usage) 2021-08-12T07:46:30.529Z|03173|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage) 2021-08-12T07:46:30.753Z|03174|poll_loop|INFO|wakeup due to [POLLOUT] on fd 39 (node-1:6641<->ip9:48062) at lib/stream-fd.c:153 (72% CPU usage) 2021-08-12T07:46:30.864Z|03175|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage) 2021-08-12T07:46:30.980Z|03176|poll_loop|INFO|wakeup due to [POLLOUT] on fd 49 (node-1:6641<->ip10:48658) at lib/stream-fd.c:153 (72% CPU usage) 2021-08-12T07:46:31.156Z|03177|poll_loop|INFO|wakeup due to [POLLOUT] on fd 45 (node-1:6641<->ip11:54216) at lib/stream-fd.c:153 (72% CPU usage) 2021-08-12T07:46:31.197Z|03178|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage) 2021-08-12T07:46:31.198Z|03179|poll_loop|INFO|wakeup due to 0-ms timeout at lib/reconnect.c:664 (72% CPU usage) 2021-08-12T07:46:46.654Z|03180|raft|INFO|server 9f6e is leader for term 360 2021-08-12T07:46:46.954Z|03181|raft|INFO|rejecting term 359 < current term 360 received in append_request message from server be61 Node 2: 2021-08-12T07:45:54.748Z|02385|raft|INFO|Transferring leadership to write a snapshot. 2021-08-12T07:45:55.058Z|02386|raft|INFO|server be61 is leader for term 357 2021-08-12T07:45:55.185Z|02387|raft|INFO|rejecting term 356 < current term 357 received in append_reply message from server 1076 2021-08-12T07:46:04.192Z|02388|raft|INFO|server 1076 is leader for term 358 2021-08-12T07:46:28.339Z|02389|raft|INFO|server be61 is leader for term 359 2021-08-12T07:46:29.177Z|02390|raft|INFO|rejecting term 358 < current term 359 received in append_request message from server 1076 2021-08-12T07:46:29.183Z|02391|raft|INFO|rejecting term 358 < current term 359 received in append_request message from server 1076 2021-08-12T07:46:46.650Z|02392|raft|INFO|term 360: 1425 ms timeout expired, starting election 2021-08-12T07:46:46.654Z|02393|raft|INFO|term 360: elected leader by 2+ of 3 servers 2021-08-12T07:46:46.954Z|02394|raft|INFO|rejecting term 359 < current term 360 received in append_request message from server be61 Node 3: 2021-08-12T07:45:54.750Z|03526|raft|INFO|received leadership transfer from 9f6e in term 356 2021-08-12T07:45:54.750Z|03527|raft|INFO|term 357: starting election 2021-08-12T07:45:55.057Z|03528|raft|INFO|term 357: elected leader by 2+ of 3 servers 2021-08-12T07:46:05.558Z|03529|raft|WARN|ignoring vote request received as leader 2021-08-12T07:46:05.558Z|03530|raft|INFO|server 1076 is leader for term 358 2021-08-12T07:46:05.558Z|03531|timeval|WARN|Unreasonably long 2735ms poll interval (2733ms user, 0ms system) 2021-08-12T07:46:05.558Z|03532|timeval|WARN|faults: 5583 minor, 0 major 2021-08-12T07:46:05.558Z|03533|timeval|WARN|disk: 0 reads, 8 writes 2021-08-12T07:46:05.558Z|03534|timeval|WARN|context switches: 0 voluntary, 19 involuntary 2021-08-12T07:46:05.558Z|03535|coverage|INFO|Skipping details of duplicate event coverage for hash=3b99bba2 2021-08-12T07:46:05.559Z|03536|poll_loop|INFO|Dropped 59 log messages in last 8160 seconds (most recently, 8157 seconds ago) due to excessive rate 2021-08-12T07:46:05.559Z|03537|poll_loop|INFO|wakeup due to [POLLOUT] on fd 37 (node-3:6641<->client-1:49952) at lib/stream-fd.c:153 (57% CPU usage) 2021-08-12T07:46:05.559Z|03538|poll_loop|INFO|wakeup due to [POLLOUT] on fd 46 (node-3:6641<->client-2:39512) at lib/stream-fd.c:153 (57% CPU usage) 2021-08-12T07:46:05.559Z|03539|poll_loop|INFO|wakeup due to [POLLOUT] on fd 35 (node-3:6641<->client-3:35686) at lib/stream-fd.c:153 (57% CPU usage) 2021-08-12T07:46:05.559Z|03540|poll_loop|INFO|wakeup due to [POLLOUT] on fd 47 (node-3:6641<->client-4:49954) at lib/stream-fd.c:153 (57% CPU usage) 2021-08-12T07:46:05.559Z|03541|poll_loop|INFO|wakeup due to [POLLOUT] on fd 39 (node-3:6641<->client-5:33490) at lib/stream-fd.c:153 (57% CPU usage) 2021-08-12T07:46:05.559Z|03542|poll_loop|INFO|wakeup due to [POLLOUT] on fd 45 (node-3:6641<->client-6:41156) at lib/stream-fd.c:153 (57% CPU usage) 2021-08-12T07:46:05.559Z|03543|poll_loop|INFO|wakeup due to [POLLOUT] on fd 40 (node-3:6641<->client-7:40098) at lib/stream-fd.c:153 (57% CPU usage) 2021-08-12T07:46:05.559Z|03544|poll_loop|INFO|wakeup due to [POLLOUT] on fd 43 (node-3:6641<->client-8:38632) at lib/stream-fd.c:153 (57% CPU usage) 2021-08-12T07:46:05.559Z|03545|poll_loop|INFO|wakeup due to [POLLOUT] on fd 36 (node-3:6641<->client-9:34912) at lib/stream-fd.c:153 (57% CPU usage) 2021-08-12T07:46:05.559Z|03546|poll_loop|INFO|wakeup due to [POLLOUT] on fd 41 (node-3:6641<->client-10:55638) at lib/stream-fd.c:153 (57% CPU usage) 2021-08-12T07:46:05.896Z|03547|jsonrpc|WARN|tcp:client-11:43772: send error: Broken pipe 2021-08-12T07:46:05.896Z|03548|reconnect|WARN|tcp:client-11:43772: connection dropped (Broken pipe) 2021-08-12T07:46:05.928Z|03549|jsonrpc|WARN|tcp:client-12:55640: receive error: Connection reset by peer 2021-08-12T07:46:05.929Z|03550|reconnect|WARN|tcp:client-12:55640: connection dropped (Connection reset by peer) 2021-08-12T07:46:19.068Z|03551|jsonrpc|WARN|tcp:client-13:33490: receive error: Connection reset by peer 2021-08-12T07:46:19.069Z|03552|reconnect|WARN|tcp:client-13:33490: connection dropped (Connection reset by peer) 2021-08-12T07:46:19.176Z|03553|jsonrpc|WARN|tcp:client-14:38632: receive error: Connection reset by peer 2021-08-12T07:46:19.177Z|03554|reconnect|WARN|tcp:client-14:38632: connection dropped (Connection reset by peer) 2021-08-12T07:46:20.712Z|03555|jsonrpc|WARN|tcp:client-15:34912: receive error: Connection reset by peer 2021-08-12T07:46:20.713Z|03556|reconnect|WARN|tcp:client-15:34912: connection dropped (Connection reset by peer) 2021-08-12T07:46:23.453Z|03557|jsonrpc|WARN|tcp:client-16:41158: receive error: Connection reset by peer 2021-08-12T07:46:23.453Z|03558|reconnect|WARN|tcp:client-16:41158: connection dropped (Connection reset by peer) 2021-08-12T07:46:28.336Z|03559|raft|INFO|term 359: 1318 ms timeout expired, starting election 2021-08-12T07:46:28.339Z|03560|raft|INFO|term 359: elected leader by 2+ of 3 servers 2021-08-12T07:46:29.178Z|03561|raft|INFO|rejecting term 358 < current term 359 received in append_request message from server 1076 2021-08-12T07:46:29.182Z|03562|raft|INFO|rejecting term 358 < current term 359 received in append_request message from server 1076 2021-08-12T07:46:29.187Z|03563|raft|INFO|rejecting term 358 < current term 359 received in vote_reply message from server 1076 2021-08-12T07:46:46.957Z|03564|timeval|WARN|Unreasonably long 1645ms poll interval (1642ms user, 0ms system) 2021-08-12T07:46:46.957Z|03565|timeval|WARN|faults: 163 minor, 0 major 2021-08-12T07:46:46.957Z|03566|timeval|WARN|context switches: 0 voluntary, 22 involuntary 2021-08-12T07:46:46.957Z|03567|coverage|INFO|Skipping details of duplicate event coverage for hash=3b99bba2 2021-08-12T07:46:47.233Z|03568|raft|INFO|rejected append_reply (not leader) 2021-08-12T07:46:47.233Z|03569|raft|WARN|ignoring vote request received after only 3 ms (minimum election time is 1000 ms) 2021-08-12T07:46:47.233Z|03570|raft|INFO|server 9f6e is leader for term 360 2021-08-12T07:46:47.234Z|03571|raft|INFO|rejected append_reply (not leader) 2021-08-12T07:46:48.290Z|03572|poll_loop|INFO|Dropped 49 log messages in last 43 seconds (most recently, 40 seconds ago) due to excessive rate 2021-08-12T07:46:48.290Z|03573|poll_loop|INFO|wakeup due to [POLLOUT] on fd 43 (node-3:6641<->client-17:34970) at lib/stream-fd.c:153 (64% CPU usage) 2021-08-12T07:46:48.446Z|03574|poll_loop|INFO|wakeup due to [POLLIN] on fd 30 (node-3:6643<->node-2:54684) at lib/stream-fd.c:157 (64% CPU usage) 2021-08-12T07:46:48.647Z|03575|poll_loop|INFO|wakeup due to [POLLOUT] on fd 41 (node-3:6641<->client-18:38648) at lib/stream-fd.c:153 (64% CPU usage) 2021-08-12T07:46:48.780Z|03576|poll_loop|INFO|wakeup due to [POLLIN] on fd 30 (node-3:6643<->node-2:54684) at lib/stream-fd.c:157 (64% CPU usage) 2021-08-12T07:46:48.919Z|03577|poll_loop|INFO|wakeup due to [POLLOUT] on fd 41 (node-3:6641<->client-19:38648) at lib/stream-fd.c:153 (64% CPU usage) 2021-08-12T07:46:49.113Z|03578|poll_loop|INFO|wakeup due to [POLLIN] on fd 30 (1node-32.20.1.125:6643<->node-2:54684) at lib/stream-fd.c:157 (64% CPU usage) 2021-08-12T07:46:49.296Z|03579|poll_loop|INFO|wakeup due to [POLLOUT] on fd 41 (node-3:6641<->client-20:38648) at lib/stream-fd.c:153 (64% CPU usage) 2021-08-12T07:46:50.837Z|03580|jsonrpc|WARN|tcp:client-21:35700: receive error: Connection reset by peer 2021-08-12T07:46:50.837Z|03581|reconnect|WARN|tcp:client-21:35700: connection dropped (Connection reset by peer) 2021-08-12T07:46:55.042Z|03582|jsonrpc|WARN|tcp:client-22:39840: receive error: Connection reset by peer 2021-08-12T07:46:55.043Z|03583|reconnect|WARN|tcp:client-22:39840: connection dropped (Connection reset by peer) 2021-08-12T07:47:01.015Z|03584|jsonrpc|WARN|tcp:client-23:55826: receive error: Connection reset by peer 2021-08-12T07:47:01.015Z|03585|reconnect|WARN|tcp:client-23:55826: connection dropped (Connection reset by peer) First leader ownership transfer (at 07:45:54) is understandable. node-3 received leadership transfer from node-1 because node-1 was prepearing to take a DB snapshot/compaction. But then at 07:46:04.187Z node-1 wrote: 2021-08-12T07:46:04.187Z|03132|raft|INFO|term 358: 1454 ms timeout expired, starting election 2021-08-12T07:46:04.191Z|03133|raft|INFO|term 358: elected leader by 2+ of 3 servers I’m not sure, but I suppose node-1 didn’t get heartbet from leader (node-3), started leader election and won it. Right? My question is: how to understand the reason why node-3 couldn’t send raft heartbeat in time? Can this long poll be the indicator that server was not answering heartbeats? 2021-08-12T07:46:05.558Z|03531|timeval|WARN|Unreasonably long 2735ms poll interval (2733ms user, 0ms system) 2021-08-12T07:46:05.558Z|03532|timeval|WARN|faults: 5583 minor, 0 major 2021-08-12T07:46:05.558Z|03533|timeval|WARN|disk: 0 reads, 8 writes 2021-08-12T07:46:05.558Z|03534|timeval|WARN|context switches: 0 voluntary, 19 involuntary Thanks in advance for your help. Regards, Vladislav Odintsov
_______________________________________________ discuss mailing list disc...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-discuss