On Tue, Apr 17, 2018 at 5:52 AM, aginwala <aginw...@asu.edu> wrote:
> Hi Ben/Numan:
>
> Just following up again. Let me know of any further
> improvements/suggestions on this.

Hi Aliasgar,

I need to educate myself first by studying raft and Ben's patches :). I
hope I will do it soon.

Thanks
Numan

> Regards,
>
> On Mon, Apr 9, 2018 at 10:53 AM, aginwala <aginw...@asu.edu> wrote:
>
> > Further analysis indicates that the cluster is not able to elect a new
> > leader, and hence every node keeps retrying.
> >
> > The ovn-northd logs around the same time on all three nodes complain
> > about disconnection from the cluster and failure to choose a leader:
> >
> > #Node1
> > 2018-04-06T01:19:06.481Z|00466|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
> > 2018-04-06T01:19:07.668Z|00467|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
> > 2018-04-06T01:19:07.671Z|00468|ovsdb_idl|INFO|tcp:192.168.220.102:6642: clustered database server is not cluster leader; trying another server
> > 2018-04-06T01:19:15.673Z|00469|reconnect|INFO|tcp:192.168.220.103:6642: connected
> > 2018-04-06T01:19:15.673Z|00470|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
> >
> > #Node2
> > 2018-04-06T01:19:58.249Z|00487|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
> > 2018-04-06T01:19:58.255Z|00488|ovsdb_idl|INFO|tcp:192.168.220.103:6642: clustered database server is disconnected from cluster; trying another server
> > 2018-04-06T01:20:09.261Z|00489|reconnect|INFO|tcp:192.168.220.102:6642: connected
> > 2018-04-06T01:20:09.261Z|00490|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
> >
> > #Node3
> > 2018-04-06T01:19:01.654Z|00964|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
> > 2018-04-06T01:19:01.711Z|00965|ovsdb_idl|INFO|tcp:192.168.220.102:6642: clustered database server is disconnected from cluster; trying another server
> > 2018-04-06T01:19:09.715Z|00966|reconnect|INFO|tcp:192.168.220.103:6642: connected
> > 2018-04-06T01:19:09.716Z|00967|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
> >
> > Regards,
> > Aliasgar
> >
> > On Fri, Apr 6, 2018 at 2:16 PM, aginwala <aginw...@asu.edu> wrote:
> >
> >> Hi Ben/Numan:
> >>
> >> So I went ahead to try out the clustered db in a scale env, and the
> >> results are not as expected.
> >> The OVS branch used is master;
> >> commit-id (2062840612904ff0873d46b2f4f226bc23f2296d).
> >>
> >> The setup uses 3 nodes.
> >>
> >> Also disabled inactivity_probe:
> >> ovn-nbctl --db="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:192.168.220.103:6641" set NB . external_ids:inactivity_probe=0
> >> ovn-sbctl --db="tcp:192.168.220.101:6642,tcp:192.168.220.102:6642,tcp:192.168.220.103:6642" set SB . external_ids:inactivity_probe=0
> >>
> >> 1. With wait_up = true for 10k lports and 1k HVs, which internally uses
> >> wait-until, it was able to create around 1k lports before rally exited
> >> due to a db connection error.
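> >> (For reference, the wait_up step boils down to one blocking wait per
> >> lport, roughly like the sketch below; the exact invocation lives in the
> >> rally plugin, and the lport name here is only an example:
> >>
> >>   # Block until the port is reported up, going through the same three
> >>   # NB remotes as above.
> >>   ovn-nbctl --db="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:192.168.220.103:6641" \
> >>       wait-until Logical_Switch_Port lport_a3060c_mlAOEq up=true
> >>
> >> So at 10k lports this means on the order of 10k such round trips
> >> through whichever server the client lands on.)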
> >> The cause of failure ("database connection failed") is at:
> >> https://raw.githubusercontent.com/noah8713/ovn-scale-test/e398ccdd77b5c0bfed3d1cfe37c7e7ac5d8d8f81/results/output_raft_fail.html
> >>
> >> Some data from the SB DB logs is below:
> >> 2018-04-05T06:29:32.628Z|00065|poll_loop|INFO|wakeup due to 9-ms timeout at lib/reconnect.c:643 (90% CPU usage)
> >> 2018-04-05T06:30:19.432Z|00066|raft|INFO|current entry eid 23a85453-70a1-49ae-bf8f-22a1eeda6a60 does not match prerequisite 966d3234-961a-4dc5-b2ad-4f12766fd610 in execute_command_request
> >> 2018-04-05T06:31:03.428Z|00067|raft|INFO|current entry eid f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> >> 2018-04-05T06:31:03.469Z|00068|raft|INFO|current entry eid f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> >> 2018-04-05T06:31:03.492Z|00069|raft|INFO|current entry eid f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> >> 2018-04-05T06:31:03.505Z|00070|raft|INFO|current entry eid f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> >> 2018-04-05T06:31:03.510Z|00071|raft|INFO|current entry eid f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
> >> 2018-04-05T06:32:04.056Z|00072|memory|INFO|peak resident set size grew 51% in last 35031.7 seconds, from 24748 kB to 37332 kB
> >> 2018-04-05T06:32:04.056Z|00073|memory|INFO|cells:59908 json-caches:1 monitors:3 sessions:431
> >> 2018-04-05T06:32:27.956Z|00074|raft|WARN|ignoring vote request received as leader
> >> 2018-04-05T06:32:27.958Z|00075|timeval|WARN|Unreasonably long 1278ms poll interval (1044ms user, 232ms system)
> >> 2018-04-05T06:32:27.958Z|00076|timeval|WARN|faults: 61669 minor, 0 major
> >> 2018-04-05T06:32:27.958Z|00077|timeval|WARN|context switches: 0 voluntary, 3 involuntary
> >> 2018-04-05T06:32:27.958Z|00078|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=1dd2af20:
> >> 2018-04-05T06:32:27.959Z|00079|coverage|INFO|hmap_pathological      25.0/sec    24.633/sec     1.5550/sec  total: 19274
> >> 2018-04-05T06:32:27.959Z|00080|coverage|INFO|hmap_expand          4202.6/sec  4041.417/sec   578.9233/sec  total: 14158936
> >> 2018-04-05T06:32:27.959Z|00081|coverage|INFO|lockfile_lock           0.0/sec     0.000/sec     0.0000/sec  total: 1
> >> 2018-04-05T06:32:27.960Z|00082|coverage|INFO|poll_create_node    50281.4/sec 47195.233/sec 29424.6886/sec  total: 1001238470
> >> 2018-04-05T06:32:27.960Z|00083|coverage|INFO|poll_zero_timeout       9.8/sec    12.167/sec     8.2392/sec  total: 312540
> >> 2018-04-05T06:32:27.960Z|00084|coverage|INFO|seq_change            122.8/sec   115.967/sec    68.2761/sec  total: 2309640
> >> 2018-04-05T06:32:27.961Z|00085|coverage|INFO|pstream_open            0.0/sec     0.000/sec     0.0000/sec  total: 4
> >> 2018-04-05T06:32:27.961Z|00086|coverage|INFO|stream_open             0.0/sec     0.000/sec     0.0000/sec  total: 9
> >> 2018-04-05T06:32:27.961Z|00087|coverage|INFO|util_xalloc         229362.2/sec 218086.767/sec 46494.5058/sec total: 1344641131
> >>
> >> Some data from the NB DB logs:
> >> 2018-04-05T23:35:33.458Z|00020|raft|INFO|adding b7bd (b7bd at tcp:192.168.220.103:6645) to cluster a36d succeeded (completed)
> >> 2018-04-05T23:53:06.360Z|00021|jsonrpc|WARN|tcp:192.168.10.10:57298: receive error: Connection reset by peer
> >> 2018-04-05T23:53:06.360Z|00022|reconnect|WARN|tcp:192.168.10.10:57298: connection dropped (Connection reset by peer)
> >> 2018-04-05T23:53:09.071Z|00023|jsonrpc|WARN|tcp:192.168.10.10:57348: receive error: Connection reset by peer
> >> 2018-04-05T23:53:09.071Z|00024|reconnect|WARN|tcp:192.168.10.10:57348: connection dropped (Connection reset by peer)
> >> 2018-04-06T00:02:03.864Z|00076|raft|INFO|current entry eid 0b9eeff2-8377-4f6d-a907-cb4e98396d14 does not match prerequisite 277c09b4-96b3-43e9-b482-a9df29576988
> >> 2018-04-06T00:02:06.882Z|00077|poll_loop|INFO|Dropped 6187 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
> >> 2018-04-06T00:02:06.882Z|00078|poll_loop|INFO|wakeup due to [POLLIN] on fd 31 (192.168.220.102:6645<->192.168.10.10:54782) at lib/stream-fd.c:157 (62% CPU usage)
> >> 2018-04-06T00:02:09.130Z|00079|jsonrpc|WARN|tcp:192.168.10.10:33592: receive error: Connection reset by peer
> >> 2018-04-06T00:02:09.131Z|00080|reconnect|WARN|tcp:192.168.10.10:33592: connection dropped (Connection reset by peer)
> >> 2018-04-06T00:02:33.986Z|00081|poll_loop|INFO|Dropped 1509 log messages in last 26 seconds (most recently, 24 seconds ago) due to excessive rate
> >> 2018-04-06T00:02:33.986Z|00082|poll_loop|INFO|wakeup due to [POLLIN] on fd 24 (FIFO pipe:[40290445]) at ovsdb/log.c:994 (51% CPU usage)
> >> 2018-04-06T00:02:33.986Z|00083|poll_loop|INFO|wakeup due to [POLLIN] on fd 29 (192.168.220.102:6645<->192.168.10.10:55738) at lib/stream-fd.c:157 (51% CPU usage)
> >>
> >> 2018-04-06T00:13:48.141Z|00037|raft|INFO|rejecting term 4 < current term 5 received in vote_reply message from server b7bd
> >> 2018-04-06T00:14:21.138Z|00038|jsonrpc|WARN|tcp:192.168.10.10:33990: receive error: Connection reset by peer
> >> 2018-04-06T00:14:21.138Z|00039|reconnect|WARN|tcp:192.168.10.10:33990: connection dropped (Connection reset by peer)
> >> 2018-04-06T00:17:01.532Z|00040|raft|INFO|current entry eid e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite d47dbdf1-0981-4077-baab-372b05aeb0bb
> >> 2018-04-06T00:17:01.532Z|00041|raft|INFO|current entry eid e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite d47dbdf1-0981-4077-baab-372b05aeb0bb
> >> 2018-04-06T00:17:01.532Z|00042|raft|INFO|current entry eid e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite d47dbdf1-0981-4077-baab-372b05aeb0bb
> >> 2018-04-06T00:17:01.533Z|00043|raft|INFO|current entry eid e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite d47dbdf1-0981-4077-baab-372b05aeb0bb
> >> 2018-04-06T00:17:01.533Z|00044|raft|INFO|current entry eid e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite d47dbdf1-0981-4077-baab-372b05aeb0bb
> >> 2018-04-06T00:19:32.039Z|00045|timeval|WARN|Unreasonably long 1373ms poll interval (1340ms user, 24ms system)
> >> 2018-04-06T00:19:32.039Z|00046|timeval|WARN|faults: 319 minor, 0 major
> >> 2018-04-06T00:19:32.039Z|00047|timeval|WARN|disk: 0 reads, 368 writes
> >> 2018-04-06T00:19:32.039Z|00048|timeval|WARN|context switches: 5 voluntary, 19 involuntary
> >> 2018-04-06T00:19:32.039Z|00049|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=1dd2af20:
> >> 2018-04-06T00:19:32.039Z|00050|coverage|INFO|hmap_pathological       0.0/sec     0.000/sec     3.4164/sec  total: 12306
> >> 2018-04-06T00:19:32.039Z|00051|coverage|INFO|hmap_expand            73.6/sec    71.267/sec   729.9656/sec  total: 2628980
> >> 2018-04-06T00:19:32.039Z|00052|coverage|INFO|lockfile_lock           0.0/sec     0.000/sec     0.0003/sec  total: 1
> >> 2018-04-06T00:19:32.039Z|00053|coverage|INFO|poll_create_node      129.6/sec   122.333/sec   866.7178/sec  total: 3120304
> >> 2018-04-06T00:19:32.039Z|00054|coverage|INFO|poll_zero_timeout       0.0/sec     0.267/sec    27.4550/sec  total: 98838
> >> 2018-04-06T00:19:32.039Z|00055|coverage|INFO|seq_change             10.8/sec    10.217/sec   135.4125/sec  total: 487498
> >> 2018-04-06T00:19:32.039Z|00056|coverage|INFO|pstream_open            0.0/sec     0.000/sec     0.0011/sec  total: 4
> >> 2018-04-06T00:19:32.039Z|00057|coverage|INFO|stream_open             0.0/sec     0.000/sec     0.0019/sec  total: 7
> >> 2018-04-06T00:19:32.039Z|00058|coverage|INFO|util_xalloc          1542.2/sec  1488.750/sec 21346.2483/sec  total: 76872987
> >> 2018-04-06T00:19:32.039Z|00059|coverage|
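> >> (To watch the leadership churn behind all these "not cluster leader" /
> >> "disconnected from cluster" messages, each member's raft state can be
> >> polled with something like the command below; the ctl socket path is
> >> the default one, so adjust it for the sandbox layout:
> >>
> >>   # Shows this member's role (leader/follower/candidate), current raft
> >>   # term, and connections to the other servers, for the SB database.
> >>   ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl cluster/status OVN_Southbound
> >>
> >> A term that keeps climbing across the three nodes would confirm
> >> repeated elections rather than a single stable leader.)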
> >> To confirm that wait-until was the only bottleneck, I skipped
> >> wait-until, and the test completed in 2.4 hours, but things did not go
> >> as expected: all the chassis ovn-controllers crashed, and ports did not
> >> actually get bound all the way down to the computes, as the OVN SB DB
> >> was running super hot.
> >>
> >> Below is the controller log from a sample HV without wait-until:
> >> 2018-04-06T04:50:40.274Z|01703|poll_loop|INFO|Dropped 5903 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
> >> 2018-04-06T04:50:40.274Z|01704|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 13 (<->/root/sandbox-192.168.72.45/br-int.mgmt) at ../lib/stream-fd.c:157 (100% CPU usage)
> >> 2018-04-06T04:50:41.060Z|01705|ovsdb_idl|INFO|tcp:192.168.220.102:6642: clustered database server is disconnected from cluster; trying another server
> >> 2018-04-06T04:50:41.066Z|01706|binding|INFO|Claiming lport lport_a3060c_mlAOEq for this chassis.
> >> 2018-04-06T04:50:41.066Z|01707|binding|INFO|lport_a3060c_mlAOEq: Claiming a2:1c:47:32:30:52 172.145.23.37
> >> 2018-04-06T04:50:41.066Z|01708|vconn_stream|ERR|send: Broken pipe
> >> 2018-04-06T04:50:41.066Z|01709|rconn|WARN|unix:/root/sandbox-192.168.72.45/br-int.mgmt: connection dropped (Broken pipe)
> >> 2018-04-06T04:50:41.067Z|01710|rconn|INFO|unix:/root/sandbox-192.168.72.45/br-int.mgmt: connected
> >> 2018-04-06T04:50:41.067Z|01711|rconn|WARN|unix:/root/sandbox-192.168.72.45/br-int.mgmt: connection dropped (Broken pipe)
> >> 2018-04-06T04:50:41.067Z|01712|util|EMER|../lib/ofp-msgs.c:1062: assertion version >= info->min_version && version <= info->max_version failed in raw_instance_get()
> >>
> >> ovn-northd also sees many transaction errors without wait-until:
> >> northd.log
> >> 2018-04-06T04:25:26.594Z|01076|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Port_Binding\" table to have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 104) for index on columns \"datapath\" and \"tunnel_key\". First row, with UUID 489f777f-d7f6-4b92-80b9-c1c64c083ea4, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID b59a2500-3da1-40ab-932c-5e2dd4acc3b4, was inserted by this transaction.","error":"constraint violation"}
> >> 2018-04-06T04:25:27.202Z|01077|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Port_Binding\" table to have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 26) for index on columns \"datapath\" and \"tunnel_key\". First row, with UUID 0668b300-55f8-4bdd-b0a8-8038258d4502, was inserted by this transaction. Second row, with UUID 63972d45-8845-4489-ac3e-916909a593c7, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}
> >> 2018-04-06T04:25:32.036Z|01078|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Port_Binding\" table to have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 224) for index on columns \"datapath\" and \"tunnel_key\". First row, with UUID 528dea47-f284-40d6-bc7e-a97a084f49eb, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID 6fcc2500-fedb-4ee9-b54d-ea548dd36936, was inserted by this transaction.","error":"constraint violation"}
> >> 2018-04-06T04:25:32.482Z|01079|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Port_Binding\" table to have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 23) for index on columns \"datapath\" and \"tunnel_key\". First row, with UUID 925eb001-adeb-4748-a55f-6c5812e70a7e, was inserted by this transaction. Second row, with UUID 5ed7bbc5-b7f5-4d4e-8e99-d8c4adcb2b34, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}
> >> 2018-04-06T04:25:35.204Z|01080|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Port_Binding\" table to have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 18) for index on columns \"datapath\" and \"tunnel_key\". First row, with UUID 4b429b00-75ca-4e38-904d-80a11e6c0f38, was inserted by this transaction. Second row, with UUID c2db03fe-18b6-4a2b-8db0-bf37ef93db92, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}
> >> 2018-04-06T04:25:35.933Z|01081|ovsdb_idl|INFO|tcp:192.168.220.101:6642: clustered database server is not cluster leader; trying another server
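> >> (To check whether any duplicate (datapath, tunnel_key) pairs actually
> >> landed in the SB DB, as opposed to only being rejected in flight, a
> >> quick sketch along these lines should do; same three SB remotes as
> >> above, and the paste/uniq step just pairs up the two bare columns per
> >> record:
> >>
> >>   # Dump datapath and tunnel_key for every Port_Binding, one record
> >>   # per output line, and print only duplicated pairs.
> >>   ovn-sbctl --db="tcp:192.168.220.101:6642,tcp:192.168.220.102:6642,tcp:192.168.220.103:6642" \
> >>       --bare --columns=datapath,tunnel_key list Port_Binding \
> >>       | paste - - - | sort | uniq -d
> >>
> >> Empty output would mean northd's retries eventually converged despite
> >> the transaction errors.)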
> >>
> >> To summarize:
> >>
> >>    - The OVN NB DB has issues with wait-until and works fine without
> >>      wait-until.
> >>    - The OVN SB DB causes ovn-controller to crash with more HVs and
> >>      just a few hundred ports, with no solid root cause yet, even
> >>      without wait-until. Even sbctl commands hang, as the SB DB is at
> >>      100% CPU.
> >>    - ovn-northd sees transaction errors without wait-until.
> >>
> >> I will continue to troubleshoot, and would love to get any
> >> improvements/suggestions.
> >>
> >> Regards,
> >>
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev