On Wed, Feb 7, 2018 at 12:47 PM, Han Zhou wrote:
>
> When doing scale testing for OVN (using
https://github.com/openvswitch/ovn-scale-test), we had some interesting
findings, and need some help here.
>
> We ran the test "create and bind lports" against branch 2.9 and branch
2.6, and we found that 2.6 was much faster. With some analysis, we found
out that the reason is not that OVN got slower in 2.9, but that the
bottleneck of this test on branch 2.9 is ovs-vswitchd.
>
> The testing was run in an environment with 20 farm nodes, each hosting 50
sandbox HVs (referred to below simply as HVs). Before the test, 9500 lports
were already bound on 950 HVs across 19 farm nodes. The test runs against
the last farm node, binding lports on the 50 HVs there. The steps in the
test scenario are:
>
> 1. Create 5 new LSs in NB (so that these LSs are not shared with any HVs
on other farm nodes)
> 2. Create 100 lports in NB on one LS
> 3. Bind these lports on the HVs, 2 per HV. They are bound sequentially
across the HVs, and for each HV the 2 ports are bound with a single
command: ovs-vsctl add-port -- set Interface external-ids:... --
add-port -- set Interface external-ids:... (the script didn't set
type to internal, but I hope that is not an issue for this test)
> 4. Wait for the port state to change to up in NB for all 100 lports (with
a single ovn-nbctl command)
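> The per-HV binding in step 3 and the wait in step 4 can be sketched
roughly as below. This is a dry run: the bridge name br-int and the lport
names lp1/lp2 are illustrative assumptions (the external-ids values were
elided in the original command), and the commands are only printed, not
executed.

```shell
# Dry-run sketch of steps 3-4; br-int and lp1/lp2 are illustrative names.
# Step 3: bind two lports on one HV with a single ovs-vsctl invocation,
# so ovs-vswitchd reconfigures once for both ports.
BIND_CMD="ovs-vsctl add-port br-int lp1 \
 -- set Interface lp1 external-ids:iface-id=lp1 \
 -- add-port br-int lp2 \
 -- set Interface lp2 external-ids:iface-id=lp2"

# Step 4: wait in one ovn-nbctl call until the lports report up in NB.
WAIT_CMD="ovn-nbctl wait-until Logical_Switch_Port lp1 up=true \
 -- wait-until Logical_Switch_Port lp2 up=true"

echo "$BIND_CMD"   # on a real HV: eval "$BIND_CMD"
echo "$WAIT_CMD"
```

> Chaining both add-port/set pairs with `--` matters for the measurement:
it means one OVSDB transaction and one reconfiguration pass in
ovs-vswitchd per HV instead of two.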
>
> These steps are repeated 5 times, once for each LS. So in the end we get
500 more lports created and bound (the total scale is then 1k HVs and
10k lports).
>
> When running with 2.6, the ovn-controllers were taking most of the CPU
time. With 2.9, however, the CPU of the ovn-controllers spikes too, but
ovs-vswitchd is always at the top with 100% CPU, which means ovs-vswitchd
is the bottleneck in this testing. Only one ovs-vswitchd is at 100% at any
given time, and different ovs-vswitchd processes spike one after another,
since the ports are bound sequentially across the HVs. From the rally log,
each binding of 2 ports takes around 4-5 seconds, and that is just the
ovs-vsctl command execution time. The 100% CPU of ovs-vswitchd explains
the slowness.
>
> So, based on this result, we cannot use the total time to evaluate the
efficiency of OVN; instead we can evaluate it by the CPU cost of the
ovn-controller processes. In fact, ovn-controller on 2.9 costs around 70%
less CPU than on 2.6, which I think is due to some optimizations we made
earlier. (With my work-in-progress patch it saves much more; I will post
it later as an RFC.)
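> One rough way to compare runs by CPU cost rather than wall time is to sum
the cumulative CPU time of all ovn-controller processes on a farm node.
This is only a sketch of my assumption about how such a comparison could be
scripted, not the harness we actually used; it assumes the standard HH:MM:SS
output of ps -o time and ignores a possible DD- days prefix:

```shell
# Sum the cumulative CPU time, in seconds, of all ovn-controller
# processes. ps -o time= prints HH:MM:SS per matching process; this
# sketch ignores a possible DD- days prefix on very long-lived processes.
total_cpu_seconds() {
    awk -F: '{ s += $1 * 3600 + $2 * 60 + $3 } END { print s + 0 }'
}

ps -C ovn-controller -o time= | total_cpu_seconds
```

> For example, feeding it two processes at 00:01:30 and 00:00:30 prints
120. Sampling this total before and after a test run gives a CPU-cost
delta that is insensitive to which process happens to be the wall-clock
bottleneck.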
>
> However, I cannot explain why ovs-vswitchd got slower than 2.6 at
port binding. We need expert suggestions here on what the possible reasons
for this slowness could be. We can do more testing with versions between
2.6 and 2.9 to narrow down the related change, but some pointers might
save some effort. Below are some ovs-vswitchd logs from when port binding
is happening:
>
> ==
> 2018-02-07T00:12:54.558Z|01767|bridge|INFO|bridge br-int: added interface lport_bc65cd_QFOU3v on port 1028
> 2018-02-07T00:12:55.629Z|01768|timeval|WARN|Unreasonably long 1112ms poll interval (1016ms user, 4ms system)
> 2018-02-07T00:12:55.629Z|01769|timeval|WARN|faults: 336 minor, 0 major
> 2018-02-07T00:12:55.629Z|01770|timeval|WARN|context switches: 0 voluntary, 13 involuntary
> 2018-02-07T00:12:55.629Z|01771|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=b256889c:
> 2018-02-07T00:12:55.629Z|01772|coverage|INFO|bridge_reconfigure        0.0/sec    0.000/sec   0.0056/sec   total: 29
> 2018-02-07T00:12:55.629Z|01773|coverage|INFO|ofproto_flush             0.0/sec    0.000/sec   0./sec       total: 1
> 2018-02-07T00:12:55.629Z|01774|coverage|INFO|ofproto_packet_out        0.0/sec    0.000/sec   0.0111/sec   total: 90
> 2018-02-07T00:12:55.629Z|01775|coverage|INFO|ofproto_recv_openflow     0.2/sec    0.033/sec   0.4858/sec   total: 6673
> 2018-02-07T00:12:55.629Z|01776|coverage|INFO|ofproto_update_port       0.0/sec    0.000/sec   5.5883/sec   total: 28258
> 2018-02-07T00:12:55.629Z|01777|coverage|INFO|rev_reconfigure           0.0/sec    0.000/sec   0.0056/sec   total: 32
> 2018-02-07T00:12:55.629Z|01778|coverage|INFO|rev_port_toggled          0.0/sec    0.000/sec   0.0011/sec   total: 6
> 2018-02-07T00:12:55.629Z|01779|coverage|INFO|rev_flow_table            0.0/sec    0.000/sec   0.0078/sec   total: 124
> 2018-02-07T00:12:55.629Z|01780|coverage|INFO|rev_mac_learning          0.0/sec    0.000/sec   0.0036/sec   total: 13
> 2018-02-07T00:12:55.629Z|01781|coverage|INFO|xlate_actions             0.0/sec    0.000/sec   0.0111/sec   total: 90
> 2018-02-07T00:12:55.629Z|01782|coverage|INFO|ccmap_expand              0.0/sec    0.000/sec   0.0050/sec   total: 36
> 2018-02-07T00:12:55.629Z|01783|coverage|INFO|ccmap_shrink              0.0/sec    0.000/sec   0./sec       total: 49
> 2018-02-07T00:12:55.629Z|01784|coverage|INFO|cmap_expand               0.0/sec    0.000/sec   0.0217/sec   total: 257
> 2018-02-07T00:12:55.629Z|01785|coverage|INFO|c