Further analysis indicates that the cluster is not able to elect a new leader, and hence every node keeps retrying:
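One quick way to confirm this from each server is `ovs-appctl -t <db>.ctl cluster/status OVN_Southbound`, whose output includes `Role:` and `Leader:` fields. A minimal sketch of scripting that check (the helper function and the sample output below are illustrative, not captured from the failing cluster):

```python
def cluster_has_leader(status_text: str) -> bool:
    """Parse the output of `ovs-appctl ... cluster/status OVN_Southbound`
    and report whether this server sees an elected leader: either its own
    Role is "leader", or its Leader field names another server."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Role" and value == "leader":
            return True
        if key == "Leader":
            # "Leader: unknown" means no elected leader is visible here.
            return value not in ("unknown", "")
    return False

# Illustrative output from a node stuck mid-election (no leader known):
sample = """\
Name: OVN_Southbound
Status: cluster member
Role: candidate
Term: 5
Leader: unknown
"""
print(cluster_has_leader(sample))  # → False
```

Running this against all three nodes during the retry loop should show no node ever settling into the leader role.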
The ovn-northd logs from around the same time on all three nodes complain about disconnection from the cluster and failure to choose a leader:

#Node1
2018-04-06T01:19:06.481Z|00466|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2018-04-06T01:19:07.668Z|00467|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2018-04-06T01:19:07.671Z|00468|ovsdb_idl|INFO|tcp:192.168.220.102:6642: clustered database server is not cluster leader; trying another server
2018-04-06T01:19:15.673Z|00469|reconnect|INFO|tcp:192.168.220.103:6642: connected
2018-04-06T01:19:15.673Z|00470|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.

#Node2
2018-04-06T01:19:58.249Z|00487|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2018-04-06T01:19:58.255Z|00488|ovsdb_idl|INFO|tcp:192.168.220.103:6642: clustered database server is disconnected from cluster; trying another server
2018-04-06T01:20:09.261Z|00489|reconnect|INFO|tcp:192.168.220.102:6642: connected
2018-04-06T01:20:09.261Z|00490|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.

#Node3
2018-04-06T01:19:01.654Z|00964|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2018-04-06T01:19:01.711Z|00965|ovsdb_idl|INFO|tcp:192.168.220.102:6642: clustered database server is disconnected from cluster; trying another server
2018-04-06T01:19:09.715Z|00966|reconnect|INFO|tcp:192.168.220.103:6642: connected
2018-04-06T01:19:09.716Z|00967|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.

Regards,
Aliasgar

On Fri, Apr 6, 2018 at 2:16 PM, aginwala <aginw...@asu.edu> wrote:
> Hi Ben/Numan:
>
> So I went ahead to try out the clustered db in a scale env, and the results
> are not as expected.
> OVS branch used is master; commit-id(2062840612904ff0873d46b2f4f226bc23f2296d)
>
> Setup is using 3 nodes.
>
> Also disabled inactivity_probe:
> ovn-nbctl --db="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:192.168.220.103:6641" set NB . external_ids:inactivity_probe=0
> ovn-sbctl --db="tcp:192.168.220.101:6642,tcp:192.168.220.102:6642,tcp:192.168.220.103:6642" set SB . external_ids:inactivity_probe=0
>
> 1. With wait_up = true for 10k lports and 1k HVs, which internally uses
> wait-until, it was able to create around 1k lports before rally exited due
> to a db connection error.
> The cause of failure ("database connection failed") is at
> https://raw.githubusercontent.com/noah8713/ovn-scale-test/e398ccdd77b5c0bfed3d1cfe37c7e7ac5d8d8f81/results/output_raft_fail.html
>
> Some data from the sb db logs:
> 2018-04-05T06:29:32.628Z|00065|poll_loop|INFO|wakeup due to 9-ms timeout at lib/reconnect.c:643 (90% CPU usage)
> 2018-04-05T06:30:19.432Z|00066|raft|INFO|current entry eid 23a85453-70a1-49ae-bf8f-22a1eeda6a60 does not match prerequisite 966d3234-961a-4dc5-b2ad-4f12766fd610 in execute_command_request
> 2018-04-05T06:31:03.428Z|00067|raft|INFO|current entry eid f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> 2018-04-05T06:31:03.469Z|00068|raft|INFO|current entry eid f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> 2018-04-05T06:31:03.492Z|00069|raft|INFO|current entry eid f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> 2018-04-05T06:31:03.505Z|00070|raft|INFO|current entry eid f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> 2018-04-05T06:31:03.510Z|00071|raft|INFO|current entry eid f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
>
> 2018-04-05T06:32:04.056Z|00072|memory|INFO|peak resident set size grew 51% in last 35031.7 seconds, from 24748 kB to 37332 kB
> 2018-04-05T06:32:04.056Z|00073|memory|INFO|cells:59908 json-caches:1 monitors:3 sessions:431
> 2018-04-05T06:32:27.956Z|00074|raft|WARN|ignoring vote request received as leader
> 2018-04-05T06:32:27.958Z|00075|timeval|WARN|Unreasonably long 1278ms poll interval (1044ms user, 232ms system)
> 2018-04-05T06:32:27.958Z|00076|timeval|WARN|faults: 61669 minor, 0 major
> 2018-04-05T06:32:27.958Z|00077|timeval|WARN|context switches: 0 voluntary, 3 involuntary
> 2018-04-05T06:32:27.958Z|00078|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=1dd2af20:
> 2018-04-05T06:32:27.959Z|00079|coverage|INFO|hmap_pathological 25.0/sec 24.633/sec 1.5550/sec total: 19274
> 2018-04-05T06:32:27.959Z|00080|coverage|INFO|hmap_expand 4202.6/sec 4041.417/sec 578.9233/sec total: 14158936
> 2018-04-05T06:32:27.959Z|00081|coverage|INFO|lockfile_lock 0.0/sec 0.000/sec 0.0000/sec total: 1
> 2018-04-05T06:32:27.960Z|00082|coverage|INFO|poll_create_node 50281.4/sec 47195.233/sec 29424.6886/sec total: 1001238470
> 2018-04-05T06:32:27.960Z|00083|coverage|INFO|poll_zero_timeout 9.8/sec 12.167/sec 8.2392/sec total: 312540
> 2018-04-05T06:32:27.960Z|00084|coverage|INFO|seq_change 122.8/sec 115.967/sec 68.2761/sec total: 2309640
> 2018-04-05T06:32:27.961Z|00085|coverage|INFO|pstream_open 0.0/sec 0.000/sec 0.0000/sec total: 4
> 2018-04-05T06:32:27.961Z|00086|coverage|INFO|stream_open 0.0/sec 0.000/sec 0.0000/sec total: 9
> 2018-04-05T06:32:27.961Z|00087|coverage|INFO|util_xalloc 229362.2/sec 218086.767/sec 46494.5058/sec total: 1344641131
>
>
> Some data from the nb db logs:
> 2018-04-05T23:35:33.458Z|00020|raft|INFO|adding b7bd (b7bd at tcp:192.168.220.103:6645) to cluster a36d succeeded (completed)
> 2018-04-05T23:53:06.360Z|00021|jsonrpc|WARN|tcp:192.168.10.10:57298: receive error: Connection reset by peer
> 2018-04-05T23:53:06.360Z|00022|reconnect|WARN|tcp:192.168.10.10:57298: connection dropped (Connection reset by peer)
> 2018-04-05T23:53:09.071Z|00023|jsonrpc|WARN|tcp:192.168.10.10:57348: receive error: Connection reset by peer
> 2018-04-05T23:53:09.071Z|00024|reconnect|WARN|tcp:192.168.10.10:57348: connection dropped (Connection reset by peer)
> 2018-04-06T00:02:03.864Z|00076|raft|INFO|current entry eid 0b9eeff2-8377-4f6d-a907-cb4e98396d14 does not match prerequisite 277c09b4-96b3-43e9-b482-a9df29576988
> 2018-04-06T00:02:06.882Z|00077|poll_loop|INFO|Dropped 6187 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
> 2018-04-06T00:02:06.882Z|00078|poll_loop|INFO|wakeup due to [POLLIN] on fd 31 (192.168.220.102:6645<->192.168.10.10:54782) at lib/stream-fd.c:157 (62% CPU usage)
> 2018-04-06T00:02:09.130Z|00079|jsonrpc|WARN|tcp:192.168.10.10:33592: receive error: Connection reset by peer
> 2018-04-06T00:02:09.131Z|00080|reconnect|WARN|tcp:192.168.10.10:33592: connection dropped (Connection reset by peer)
> 2018-04-06T00:02:33.986Z|00081|poll_loop|INFO|Dropped 1509 log messages in last 26 seconds (most recently, 24 seconds ago) due to excessive rate
> 2018-04-06T00:02:33.986Z|00082|poll_loop|INFO|wakeup due to [POLLIN] on fd 24 (FIFO pipe:[40290445]) at ovsdb/log.c:994 (51% CPU usage)
> 2018-04-06T00:02:33.986Z|00083|poll_loop|INFO|wakeup due to [POLLIN] on fd 29 (192.168.220.102:6645<->192.168.10.10:55738) at lib/stream-fd.c:157 (51% CPU usage)
>
>
> 2018-04-06T00:13:48.141Z|00037|raft|INFO|rejecting term 4 < current term 5 received in vote_reply message from server b7bd
> 2018-04-06T00:14:21.138Z|00038|jsonrpc|WARN|tcp:192.168.10.10:33990: receive error: Connection reset by peer
> 2018-04-06T00:14:21.138Z|00039|reconnect|WARN|tcp:192.168.10.10:33990: connection dropped (Connection reset by peer)
> 2018-04-06T00:17:01.532Z|00040|raft|INFO|current entry eid e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite d47dbdf1-0981-4077-baab-372b05aeb0bb
> 2018-04-06T00:17:01.532Z|00041|raft|INFO|current entry eid e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite d47dbdf1-0981-4077-baab-372b05aeb0bb
> 2018-04-06T00:17:01.532Z|00042|raft|INFO|current entry eid e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite d47dbdf1-0981-4077-baab-372b05aeb0bb
> 2018-04-06T00:17:01.533Z|00043|raft|INFO|current entry eid e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite d47dbdf1-0981-4077-baab-372b05aeb0bb
> 2018-04-06T00:17:01.533Z|00044|raft|INFO|current entry eid e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite d47dbdf1-0981-4077-baab-372b05aeb0bb
> 2018-04-06T00:19:32.039Z|00045|timeval|WARN|Unreasonably long 1373ms poll interval (1340ms user, 24ms system)
> 2018-04-06T00:19:32.039Z|00046|timeval|WARN|faults: 319 minor, 0 major
> 2018-04-06T00:19:32.039Z|00047|timeval|WARN|disk: 0 reads, 368 writes
> 2018-04-06T00:19:32.039Z|00048|timeval|WARN|context switches: 5 voluntary, 19 involuntary
> 2018-04-06T00:19:32.039Z|00049|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=1dd2af20:
> 2018-04-06T00:19:32.039Z|00050|coverage|INFO|hmap_pathological 0.0/sec 0.000/sec 3.4164/sec total: 12306
> 2018-04-06T00:19:32.039Z|00051|coverage|INFO|hmap_expand 73.6/sec 71.267/sec 729.9656/sec total: 2628980
> 2018-04-06T00:19:32.039Z|00052|coverage|INFO|lockfile_lock 0.0/sec 0.000/sec 0.0003/sec total: 1
> 2018-04-06T00:19:32.039Z|00053|coverage|INFO|poll_create_node 129.6/sec 122.333/sec 866.7178/sec total: 3120304
> 2018-04-06T00:19:32.039Z|00054|coverage|INFO|poll_zero_timeout 0.0/sec 0.267/sec 27.4550/sec total: 98838
> 2018-04-06T00:19:32.039Z|00055|coverage|INFO|seq_change 10.8/sec 10.217/sec 135.4125/sec total: 487498
> 2018-04-06T00:19:32.039Z|00056|coverage|INFO|pstream_open 0.0/sec 0.000/sec 0.0011/sec total: 4
> 2018-04-06T00:19:32.039Z|00057|coverage|INFO|stream_open 0.0/sec 0.000/sec 0.0019/sec total: 7
> 2018-04-06T00:19:32.039Z|00058|coverage|INFO|util_xalloc 1542.2/sec 1488.750/sec 21346.2483/sec total: 76872987
> 2018-04-06T00:19:32.039Z|00059|coverage|
>
>
> To confirm that wait-until was the only bottleneck, I skipped wait-until,
> and the test completed in 2.4 hours, but things still did not go as
> expected. All the chassis ovn-controllers crashed, and ports did not
> actually get bound all the way down to the compute nodes, as the OVN sb db
> was running super hot.
>
> Below is the controller log from a sample HV without wait-until:
> 2018-04-06T04:50:40.274Z|01703|poll_loop|INFO|Dropped 5903 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
> 2018-04-06T04:50:40.274Z|01704|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 13 (<->/root/sandbox-192.168.72.45/br-int.mgmt) at ../lib/stream-fd.c:157 (100% CPU usage)
> 2018-04-06T04:50:41.060Z|01705|ovsdb_idl|INFO|tcp:192.168.220.102:6642: clustered database server is disconnected from cluster; trying another server
> 2018-04-06T04:50:41.066Z|01706|binding|INFO|Claiming lport lport_a3060c_mlAOEq for this chassis.
> 2018-04-06T04:50:41.066Z|01707|binding|INFO|lport_a3060c_mlAOEq: Claiming a2:1c:47:32:30:52 172.145.23.37
> 2018-04-06T04:50:41.066Z|01708|vconn_stream|ERR|send: Broken pipe
> 2018-04-06T04:50:41.066Z|01709|rconn|WARN|unix:/root/sandbox-192.168.72.45/br-int.mgmt: connection dropped (Broken pipe)
> 2018-04-06T04:50:41.067Z|01710|rconn|INFO|unix:/root/sandbox-192.168.72.45/br-int.mgmt: connected
> 2018-04-06T04:50:41.067Z|01711|rconn|WARN|unix:/root/sandbox-192.168.72.45/br-int.mgmt: connection dropped (Broken pipe)
> 2018-04-06T04:50:41.067Z|01712|util|EMER|../lib/ofp-msgs.c:1062: assertion version >= info->min_version && version <= info->max_version failed in raw_instance_get()
>
>
> Also, northd sees many transaction errors without wait-until:
> northd.log
> 2018-04-06T04:25:26.594Z|01076|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Port_Binding\" table to have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 104) for index on columns \"datapath\" and \"tunnel_key\". First row, with UUID 489f777f-d7f6-4b92-80b9-c1c64c083ea4, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID b59a2500-3da1-40ab-932c-5e2dd4acc3b4, was inserted by this transaction.","error":"constraint violation"}
> 2018-04-06T04:25:27.202Z|01077|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Port_Binding\" table to have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 26) for index on columns \"datapath\" and \"tunnel_key\". First row, with UUID 0668b300-55f8-4bdd-b0a8-8038258d4502, was inserted by this transaction. Second row, with UUID 63972d45-8845-4489-ac3e-916909a593c7, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}
> 2018-04-06T04:25:32.036Z|01078|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Port_Binding\" table to have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 224) for index on columns \"datapath\" and \"tunnel_key\". First row, with UUID 528dea47-f284-40d6-bc7e-a97a084f49eb, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID 6fcc2500-fedb-4ee9-b54d-ea548dd36936, was inserted by this transaction.","error":"constraint violation"}
> 2018-04-06T04:25:32.482Z|01079|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Port_Binding\" table to have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 23) for index on columns \"datapath\" and \"tunnel_key\". First row, with UUID 925eb001-adeb-4748-a55f-6c5812e70a7e, was inserted by this transaction. Second row, with UUID 5ed7bbc5-b7f5-4d4e-8e99-d8c4adcb2b34, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}
> 2018-04-06T04:25:35.204Z|01080|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Port_Binding\" table to have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 18) for index on columns \"datapath\" and \"tunnel_key\". First row, with UUID 4b429b00-75ca-4e38-904d-80a11e6c0f38, was inserted by this transaction. Second row, with UUID c2db03fe-18b6-4a2b-8db0-bf37ef93db92, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}
> 2018-04-06T04:25:35.933Z|01081|ovsdb_idl|INFO|tcp:192.168.220.101:6642: clustered database server is not cluster leader; trying another server
>
>
> To summarize:
>
> - ovn nb db has issues with wait-until and works fine without wait-until.
> - Even without wait-until, ovn sb db causes the controllers to crash with
>   1k HVs and just a few hundred ports, with no solid root cause yet. Even
>   sbctl commands hang, as the sb db is at 100% CPU.
> - ovn northd sees transaction errors without wait-until.
>
> Will continue to troubleshoot, and would love to get any
> improvements/suggestions.
>
> Regards,

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev