This series refactors netdev-linux to make more use of event-driven
netlink notifications instead of polling for device state updates, significantly
improving performance under RTNL lock contention.
## Background
The RTNL mutex is used to serialize rtnetlink requests in the Linux kernel.
Its widespread use in many network configuration paths makes it a bottleneck
that the kernel community is well aware of.
When there is a lot of network configuration activity, e.g., when many
interfaces are being created or deleted, especially interfaces that interact
with HW resources such as SR-IOV VFs, contention on the RTNL mutex can make
rtnetlink requests quite slow.
The impact of RTNL contention on OVS's main thread can be high, making the
entire loop take several seconds (even minutes!) to complete, affecting other
periodic tasks such as OVSDB updates, OpenFlow flow programming, etc.
After analyzing what requests were being sent by OVS, it was observed that most
of them came from netdev-linux state checking mechanisms. While some state is
cached (such as the MTU or MAC address), the netdev's flags are not, and they
are checked very often.
On the other hand, Netlink provides a reliable notification mechanism via
multicast groups that allows userspace to receive asynchronous updates when
device state changes, and OVS already has some infrastructure for that
purpose.
## Approach
The series aims to change two aspects of netdev-linux operations:
A - Cache the netdev's flags
B - Reuse existing rtnetlink event infrastructure in netdev-linux's main
loop run to avoid races.
In order to accomplish A (commit 2), some refactoring is done first (commit 1).
In order to accomplish B (commit 5), the existing rtnetlink notifier
infrastructure is first enhanced (commits 3-4).
Finally, there are some extra consolidation and cleanups (commits 6-8).
The following diagram represents the resulting infrastructure:
                   +---------------------------------+
                   |            bridge.c             |
                   |                                 |
                   |  to call bridge_reconfigure()   |
                   |  if ifaces changed              |
                   +----------------+----------------+
                                    |
                             +------v------+
                             | if-notifier |
                             +------+------+
                                    |
+-------------------------+ +------v------------------+ +-------------------------+
|       route_table       | | rtnetlink_notifier.{c,h}| |      netdev_linux       |
|                         | |                         | |                         |
| for route change        | | for link change         | | for address change      |
| detection               | | detection               | | detection, to update    |
|                         | |                         | | netdev internal         |
| family: NETLINK_ROUTE   | | family: NETLINK_ROUTE   | | (cached) state          |
| mcast:                  | | mcast: RTNLGRP_LINK     | |                         |
|  RTNLGRP_IPV{4,6}_      | | all_ns: true            | | family: NETLINK_ROUTE   |
|      {ROUTE,RULE}       | |                         | | mcast:                  |
| all_ns: false           | |                         | |  RTNLGRP_IPV4_IFADDR,   |
|                         | |                         | |  RTNLGRP_IPV6_          |
|                         | |                         | |   {IFADDR,IFINFO}       |
|                         | |                         | | all_ns: true            |
+------------+------------+ +------------+------------+ +------------+------------+
             |                           |                           |
+------------v---------------------------v---------------------------v------------+
|                          nln (netlink_notifier.{h,c})                           |
+---------------------------------------------------------------------------------+
## Testing and results
In order to test this series, I have written a small script that churns
(deletes and recreates) some OVS ports (veths) the way an SDN controller would.
The number of interfaces to churn was varied from 10 to 100.
In order to simulate RTNL mutex contention I used delay-kfunc [1] to
introduce latency to 'rtnl_lock'. The following table shows the time it
takes to complete the test:
==============================================================================
  N ifaces   RTNL Delay(µs)        Main (s)      Series (s)    Delta (%)
------------------------------------------------------------------------------
        10                0    0.275(0.008)    0.234(0.014)       -14.9%
        10               50    0.269(0.009)    0.249(0.012)        -7.3%
        10              100    0.278(0.011)    0.266(0.007)        -4.6%
        10              500    0.423(0.039)    0.395(0.046)        -6.7%
        10             1000    0.695(0.060)    0.586(0.045)       -15.6%
        10             5000    1.855(0.099)    1.818(0.041)        -2.0%
        10            10000    4.361(0.074)    3.106(0.111)       -28.8%
        20                0    0.485(0.014)    0.424(0.019)       -12.6%
        20               50    0.478(0.018)    0.472(0.015)        -1.3%
        20              100    0.504(0.018)    0.493(0.020)        -2.3%
        20              500    0.716(0.022)    0.678(0.031)        -5.3%
        20             1000    0.994(0.026)    0.926(0.083)        -6.9%
        20             5000    3.313(0.133)    2.851(0.039)       -13.9%
        20            10000    6.803(0.093)    4.875(0.117)       -28.3%
        30                0    0.716(0.024)    0.645(0.033)       -10.0%
        30               50    0.723(0.018)    0.692(0.019)        -4.2%
        30              100    0.744(0.024)    0.745(0.031)        +0.1%
        30              500    0.981(0.031)    0.997(0.034)        +1.6%
        30             1000    1.328(0.046)    1.222(0.040)        -8.0%
        30             5000    4.838(0.059)    3.865(0.079)       -20.1%
        30            10000    9.146(0.110)    6.653(0.110)       -27.3%
        40                0    0.974(0.042)    0.864(0.065)       -11.3%
        40               50    0.963(0.032)    0.961(0.044)        -0.2%
        40              100    0.997(0.040)    1.004(0.043)        +0.7%
        40              500    1.397(0.105)    1.359(0.035)        -2.7%
        40             1000    1.990(0.107)    1.805(0.096)        -9.3%
        40             5000    7.240(1.751)    4.967(0.587)       -31.4%
        40            10000   11.657(0.131)    8.289(0.308)       -28.9%
        50                0    1.340(0.111)    1.253(0.167)        -6.5%
        50               50    1.410(0.196)    1.274(0.059)        -9.7%
        50              100    1.411(0.108)    1.329(0.111)        -5.8%
        50              500    1.788(0.060)    1.779(0.079)        -0.5%
        50             1000    2.656(0.220)    2.446(0.097)        -7.9%
        50             5000   11.532(0.132)    8.216(0.094)       -28.8%
        50            10000   22.685(1.157)   14.098(0.186)       -37.8%
        60                0    1.760(0.249)    1.738(0.333)        -1.3%
        60               50    1.945(0.283)    1.851(0.305)        -4.8%
        60              100    1.777(0.340)    1.613(0.116)        -9.2%
        60              500    2.525(0.184)    2.330(0.125)        -7.7%
        60             1000    3.497(0.327)    3.247(0.174)        -7.2%
        60             5000   14.390(0.172)   10.093(0.138)       -29.9%
        60            10000   27.980(0.545)   17.383(0.211)       -37.9%
        80                0    3.977(0.767)    3.632(0.651)        -8.7%
        80               50    3.550(0.667)    3.294(0.645)        -7.2%
        80              100    3.854(0.679)    3.182(0.763)       -17.4%
        80              500    4.571(0.685)    3.998(0.619)       -12.5%
        80             1000    6.445(0.490)    4.955(0.281)       -23.1%
        80             5000   27.107(0.331)   17.348(0.197)       -36.0%
        80            10000   54.738(0.971)   31.525(1.116)       -42.4%
       100                0    8.509(2.392)    7.452(0.138)       -12.4%
       100               50    7.730(0.552)    7.278(1.877)        -5.8%
       100              100    8.084(2.648)    7.342(1.124)        -9.2%
       100              500    7.543(0.551)    6.851(0.997)        -9.2%
       100             1000   10.784(0.782)    7.990(0.651)       -25.9%
       100             5000   36.393(0.626)   25.800(0.363)       -29.1%
       100            10000   72.916(2.488)   45.648(1.929)       -37.4%
==============================================================================
Notes about the above results:
- Values are shown as "{mean}({std})".
- I did not perform any kind of tuning or CPU isolation on the test server.
- delay-kfunc does not always introduce exactly the same delay, so it adds
  some variance as well.
- Beyond 200 interfaces, limitations of the test script itself make the results
rather unreliable.
All in all, a fairly consistent improvement is observed, which grows with the
number of interfaces being churned and with the amount of external RTNL
pressure applied.
## Future work
This is part of a larger effort to improve robustness against RTNL contention.
I plan to work on more optimizations in future series.
[1] https://github.com/xdp-project/bpf-examples/tree/main/delay-kfunc
Adrian Moreno (8):
netdev_linux: Refactor netdev flag update.
netdev-linux: Cache netdev flags.
netlink-notifier: Drain socket on overflow.
netlink-notifier: Include nsid in callbacks.
netdev-linux: Use rtnetlink to update state.
netdev-linux: Consolidate RTM_GETLINK parsing.
linux-netdev: Check status when reading stats.
netdev-linux: Consolidate netlink updates.
lib/if-notifier.c | 3 +-
lib/netdev-afxdp.c | 2 +-
lib/netdev-linux-private.h | 5 +-
lib/netdev-linux.c | 416 +++++++++++++++------------------
lib/netdev-linux.h | 1 +
lib/netlink-notifier.c | 15 +-
lib/netlink-notifier.h | 9 +-
lib/netnsid.h | 1 +
lib/route-table.c | 12 +-
lib/route-table.h | 2 +-
lib/rtnetlink.c | 40 +++-
lib/rtnetlink.h | 56 ++++-
lib/tnl-ports.c | 2 +-
tests/system-interface.at | 2 +
tests/system-tap.at | 5 +-
tests/system-traffic.at | 3 +-
tests/test-lib-route-table.c | 9 +-
tests/test-netlink-conntrack.c | 9 +-
18 files changed, 338 insertions(+), 254 deletions(-)
--
2.53.0