Ilya Maximets <i.maxim...@ovn.org> writes:

> On 3/19/24 20:56, Aaron Conole wrote:
>> Ilya Maximets <i.maxim...@ovn.org> writes:
>> 
>>> Currently, ovs-vswitchd is subscribed to all the routing changes in the
>>> kernel.  On each change, it marks the internal routing table cache as
>>> invalid, then resets it and dumps all the routes from the kernel from
>>> scratch.  The reason for that is kernel routing updates not being
>>> reliable in a sense that it's hard to tell which route is getting
>>> removed or modified.  Userspace application has to track the order in
>>> which route entries are dumped from the kernel.  Updates can get lost
>>> or even duplicated and the kernel doesn't provide a good mechanism to
>>> distinguish one route from another.  To my knowledge, dumping all the
>>> routes from a kernel after each change is the only way to keep the
>>> cache consistent.  Some more info can be found in the following never
>>> addressed issues:
>>>   https://bugzilla.redhat.com/1337860
>>>   https://bugzilla.redhat.com/1337855
>>>
>>> It seems to be believed that NetworkManager "mostly" does incremental
>>> updates right.  But it is still not completely correct, will re-dump
>>> the whole table in certain cases, and it takes a huge amount of very
>>> complicated code to do the accounting and route comparisons.
>>>
>>> Going back to ovs-vswitchd, it currently dumps routes from all the
>>> routing tables.  If it will get conflicting routes from multiple
>>> tables, the cache will not be useful.  The routing cache in userspace
>>> is primarily used for checking the egress port for tunneled traffic
>>> and this way also detecting link state changes for a tunnel port.
>>> For userspace datapath it is used for actual routing of the packet
>>> after sending to a native tunnel.
>>> With kernel datapath we don't really have a mechanism to know which
>>> routing table will actually be used by the kernel after encapsulation,
>>> so our lookups on a cache may be incorrect because of this as well.
>>>
>>> So, unless all the relevant routes are in the standard tables, the
>>> lookup in userspace route cache is unreliable.
>>>
>>> Luckily, most setups are not using any complicated routing in
>>> non-standard tables that OVS has to be aware of.
>>>
>>> It is possible, but unlikely, that standard routing tables are
>>> completely empty while some other custom table is not, and all the OVS
>>> tunnel traffic is directed to that table.  That would be the only
>>> scenario where dumping non-standard tables would make sense.  But it
>>> seems like this kind of setup will likely need a way to tell OVS from
>>> which table the routes should be taken, or we'll need to dump routing
>>> rules and keep a separate cache for each table, so we can first match
>>> on rules and then lookup correct routes in a specific table.  I'm not
>>> sure if trying to implement all that is justified.
>>>
>>> For now, stop considering routes from non-standard tables to avoid
>>> mixing different tables together and also wasting CPU resources.
>>>
>>> This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is
>>> running on a same host and in a same network namespace with OVS using
>>> its own custom routing table.
>>>
>>> Unfortunately, there seems to be no way to tell the kernel to send
>>> updates only for particular tables.  So, we'll still receive and parse
>>> all of them.  But they will not result in a full cache invalidation in
>>> most cases.
>>>
>>> Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps.
>>> So, we can make use of it and dump only standard tables when we get a
>>> relevant route update.  NETLINK_GET_STRICT_CHK has to be enabled on
>>> the socket for filtering to work.  There is no reason to not enable it
>>> by default, if supported.  It is not used outside of NETLINK_ROUTE.
>>>
>>> Fixes: f0e167f0dbad ("route-table: Handle route updates more robustly.")
>>> Fixes: ea83a2fcd0d3 ("lib: Show tunnel egress interface in ovsdb")
>>> Reported-at: https://github.com/openvswitch/ovs-issues/issues/185
>>> Reported-at: 
>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html
>>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
>>> ---
>>>  lib/netlink-protocol.h | 10 ++++++
>>>  lib/netlink-socket.c   |  8 +++++
>>>  lib/route-table.c      | 80 +++++++++++++++++++++++++++++++++---------
>>>  tests/system-route.at  | 64 +++++++++++++++++++++++++++++++++
>>>  4 files changed, 146 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/lib/netlink-protocol.h b/lib/netlink-protocol.h
>>> index 6eaa7035a..e4bb28ac9 100644
>>> --- a/lib/netlink-protocol.h
>>> +++ b/lib/netlink-protocol.h
>>> @@ -155,6 +155,11 @@ enum {
>>>  #define NLA_TYPE_MASK       ~(NLA_F_NESTED | NLA_F_NET_BYTEORDER)
>>>  #endif
>>>  
>>> +/* Introduced in v4.4. */
>>> +#ifndef NLM_F_DUMP_FILTERED
>>> +#define NLM_F_DUMP_FILTERED 0x20
>>> +#endif
>>> +
>>>  /* These were introduced all together in 2.6.14.  (We want our programs to
>>>   * support the newer kernel features even if compiled with older headers.) 
>>> */
>>>  #ifndef NETLINK_ADD_MEMBERSHIP
>>> @@ -168,6 +173,11 @@ enum {
>>>  #define NETLINK_LISTEN_ALL_NSID 8
>>>  #endif
>>>  
>>> +/* Strict checking of netlink arguments introduced in Linux kernel v4.20. 
>>> */
>>> +#ifndef NETLINK_GET_STRICT_CHK
>>> +#define NETLINK_GET_STRICT_CHK 12
>>> +#endif
>>> +
>>>  /* These were introduced all together in 2.6.23.  (We want our programs to
>>>   * support the newer kernel features even if compiled with older headers.) 
>>> */
>>>  #ifndef CTRL_ATTR_MCAST_GRP_MAX
>>> diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
>>> index 80da20d9f..eea880b2c 100644
>>> --- a/lib/netlink-socket.c
>>> +++ b/lib/netlink-socket.c
>>> @@ -205,6 +205,14 @@ nl_sock_create(int protocol, struct nl_sock **sockp)
>>>          }
>>>      }
>>>  
>>> +    /* Strict checking only supported for NETLINK_ROUTE. */
>>> +    if (protocol == NETLINK_ROUTE
>>> +        && setsockopt(sock->fd, SOL_NETLINK, NETLINK_GET_STRICT_CHK,
>>> +                      &one, sizeof one) < 0) {
>>> +        VLOG_INFO("netlink: could not enable strict checking (%s)",
>>> +                  ovs_strerror(errno));
>> 
>> Maybe this message is a bit scary for users - but I'm not sure what
>> message might be less alarming.  Something like:
>> 
>> netlink: strict checking off (%s)
>> 
>> Not a big deal, just a kindof nit.  I hope it doesn't become VLOG_WARN
>> later or something, and want to avoid someone reading it and thinking
>> that it is a misclassified warning.
>
> How about we do:
>
>   VLOG(errno == ENOPROTOOPT ? VLL_DBG : VLL_WARN, ... );
>
> i.e. keep it as a debug message for kernels that do not support it
> and WARN if something actually went wrong.  What do you think?

Seems reasonable.

>> 
>> I don't think it could happen often, but if we have opened a large
>> number of nl sockets (maybe some large nl_pool_alloc) then we end up
>> with quite a few of these messages as well.  Maybe we can just vlog this
>> once?
>
> We currently have a single place that opens NETLINK_ROUTE sockets, and
> pools are per-protocol, so we should not have more than one of these.
> Also should not be a problem with DBG.

Ack.

>> 
>>> +    }
>>> +
>>>      retval = get_socket_rcvbuf(sock->fd);
>>>      if (retval < 0) {
>>>          retval = -retval;
>>> diff --git a/lib/route-table.c b/lib/route-table.c
>>> index 9927dcc18..f1fe32714 100644
>>> --- a/lib/route-table.c
>>> +++ b/lib/route-table.c
>>> @@ -26,6 +26,7 @@
>>>  #include <linux/rtnetlink.h>
>>>  #include <net/if.h>
>>>  
>>> +#include "coverage.h"
>>>  #include "hash.h"
>>>  #include "netdev.h"
>>>  #include "netlink.h"
>>> @@ -44,6 +45,8 @@
>>>  
>>>  VLOG_DEFINE_THIS_MODULE(route_table);
>>>  
>>> +COVERAGE_DEFINE(route_table_dump);
>>> +
>>>  struct route_data {
>>>      /* Copied from struct rtmsg. */
>>>      unsigned char rtm_dst_len;
>>> @@ -80,7 +83,7 @@ static struct nln_notifier *name_notifier = NULL;
>>>  
>>>  static bool route_table_valid = false;
>>>  
>>> -static int route_table_reset(void);
>>> +static void route_table_reset(void);
>>>  static void route_table_handle_msg(const struct route_table_msg *);
>>>  static int route_table_parse(struct ofpbuf *, struct route_table_msg *);
>>>  static void route_table_change(const struct route_table_msg *, void *);
>>> @@ -153,26 +156,22 @@ route_table_wait(void)
>>>      ovs_mutex_unlock(&route_table_mutex);
>>>  }
>>>  
>>> -static int
>>> -route_table_reset(void)
>>> +static bool
>>> +route_table_dump_one_table(unsigned char id)
>>>  {
>>> -    struct nl_dump dump;
>>> -    struct rtgenmsg *rtgenmsg;
>>>      uint64_t reply_stub[NL_DUMP_BUFSIZE / 8];
>>>      struct ofpbuf request, reply, buf;
>>> -
>>> -    route_map_clear();
>>> -    netdev_get_addrs_list_flush();
>>> -    route_table_valid = true;
>>> -    rt_change_seq++;
>>> +    struct rtmsg *rq_msg;
>>> +    bool filtered = true;
>>> +    struct nl_dump dump;
>>>  
>>>      ofpbuf_init(&request, 0);
>>>  
>>> -    nl_msg_put_nlmsghdr(&request, sizeof *rtgenmsg, RTM_GETROUTE,
>>> -                        NLM_F_REQUEST);
>>> +    nl_msg_put_nlmsghdr(&request, sizeof *rq_msg, RTM_GETROUTE, 
>>> NLM_F_REQUEST);
>>>  
>>> -    rtgenmsg = ofpbuf_put_zeros(&request, sizeof *rtgenmsg);
>>> -    rtgenmsg->rtgen_family = AF_UNSPEC;
>>> +    rq_msg = ofpbuf_put_zeros(&request, sizeof *rq_msg);
>>> +    rq_msg->rtm_family = AF_UNSPEC;
>>> +    rq_msg->rtm_table = id;
>> 
>> Any reason to not consider setting RTA_TABLE attribute rather than use
>> rtm_table.  RTA_TABLE supports the extended range, rather than just the
>> 8bits range that rtm_table can fit.
>
> I just wanted to avoid unnecessary memory allocation for an attribute.
> All the tables we care about are in 8 bit range.  And compiler protects
> us from an integer overflow, since all types are set to unsigned char.
>
> We can add RTA_TABLE attribute, but we'll basically add the same field
> twice as we're still putting the rtm_table in the message.
>
> We could put RTA_TABLE just to be extra safe, but I'm not sure.
>
> Does that make sense?

I guess it makes sense.  I think RTA_TABLE has been considered best
practice for quite some time, but since we are only interested in main
routing table, it doesn't matter.

>> 
>>>  
>>>      nl_dump_start(&dump, NETLINK_ROUTE, &request);
>>>      ofpbuf_uninit(&request);
>>> @@ -182,12 +181,43 @@ route_table_reset(void)
>>>          struct route_table_msg msg;
>>>  
>>>          if (route_table_parse(&reply, &msg)) {
>>> +            struct nlmsghdr *nlmsghdr = nl_msg_nlmsghdr(&reply);
>>> +
>>> +            /* Older kernels do not support filtering. */
>>> +            if (!(nlmsghdr->nlmsg_flags & NLM_F_DUMP_FILTERED)) {
>>> +                filtered = false;
>>> +            }
>>>              route_table_handle_msg(&msg);
>>>          }
>>>      }
>>>      ofpbuf_uninit(&buf);
>>> +    nl_dump_done(&dump);
>>> +
>>> +    return filtered;
>>> +}
>>> +
>>> +static void
>>> +route_table_reset(void)
>>> +{
>>> +    unsigned char tables[] = {
>>> +        RT_TABLE_DEFAULT,
>>> +        RT_TABLE_MAIN,
>>> +        RT_TABLE_LOCAL,
>>> +    };
>>>  
>>> -    return nl_dump_done(&dump);
>>> +    route_map_clear();
>>> +    netdev_get_addrs_list_flush();
>>> +    route_table_valid = true;
>>> +    rt_change_seq++;
>>> +
>>> +    COVERAGE_INC(route_table_dump);
>>> +
>>> +    for (size_t i = 0; i < ARRAY_SIZE(tables); i++) {
>>> +        if (!route_table_dump_one_table(tables[i])) {
>>> +            /* Got unfiltered reply, no need to dump further. */
>>> +            break;
>>> +        }
>>> +    }
>>>  }
>>>  
>>>  /* Return RTNLGRP_IPV4_ROUTE or RTNLGRP_IPV6_ROUTE on success, 0 on parse
>>> @@ -203,6 +233,7 @@ route_table_parse(struct ofpbuf *buf, struct 
>>> route_table_msg *change)
>>>          [RTA_GATEWAY] = { .type = NL_A_U32, .optional = true },
>>>          [RTA_MARK] = { .type = NL_A_U32, .optional = true },
>>>          [RTA_PREFSRC] = { .type = NL_A_U32, .optional = true },
>>> +        [RTA_TABLE] = { .type = NL_A_U32, .optional = true },
>>>      };
>>>  
>>>      static const struct nl_policy policy6[] = {
>>> @@ -211,6 +242,7 @@ route_table_parse(struct ofpbuf *buf, struct 
>>> route_table_msg *change)
>>>          [RTA_MARK] = { .type = NL_A_U32, .optional = true },
>>>          [RTA_GATEWAY] = { .type = NL_A_IPV6, .optional = true },
>>>          [RTA_PREFSRC] = { .type = NL_A_IPV6, .optional = true },
>>> +        [RTA_TABLE] = { .type = NL_A_U32, .optional = true },
>>>      };
>>>  
>>>      struct nlattr *attrs[ARRAY_SIZE(policy)];
>>> @@ -232,6 +264,7 @@ route_table_parse(struct ofpbuf *buf, struct 
>>> route_table_msg *change)
>>>  
>>>      if (parsed) {
>>>          const struct nlmsghdr *nlmsg;
>>> +        uint32_t table_id;
>>>          int rta_oif;      /* Output interface index. */
>>>  
>>>          nlmsg = buf->data;
>>> @@ -247,6 +280,19 @@ route_table_parse(struct ofpbuf *buf, struct 
>>> route_table_msg *change)
>>>              rtm->rtm_type != RTN_LOCAL) {
>>>              change->relevant = false;
>>>          }
>>> +
>>> +        table_id = rtm->rtm_table;
>>> +        if (attrs[RTA_TABLE]) {
>>> +            table_id = nl_attr_get_u32(attrs[RTA_TABLE]);
>>> +        }
>>> +        /* Do not consider changes in non-standard routing tables. */
>>> +        if (table_id
>>> +            && table_id != RT_TABLE_DEFAULT
>>> +            && table_id != RT_TABLE_MAIN
>>> +            && table_id != RT_TABLE_LOCAL) {
>>> +            change->relevant = false;
>>> +        }
>>> +
>>>          change->nlmsg_type     = nlmsg->nlmsg_type;
>>>          change->rd.rtm_dst_len = rtm->rtm_dst_len + (ipv4 ? 96 : 0);
>>>          change->rd.local = rtm->rtm_type == RTN_LOCAL;
>>> @@ -312,7 +358,9 @@ static void
>>>  route_table_change(const struct route_table_msg *change OVS_UNUSED,
>>>                     void *aux OVS_UNUSED)
>>>  {
>>> -    route_table_valid = false;
>>> +    if (!change || change->relevant) {
>>> +        route_table_valid = false;
>>> +    }
>>>  }
>>>  
>>>  static void
>>> diff --git a/tests/system-route.at b/tests/system-route.at
>>> index 114aaebc7..c0ecad6cf 100644
>>> --- a/tests/system-route.at
>>> +++ b/tests/system-route.at
>>> @@ -64,3 +64,67 @@ Cached: fc00:db8:beef::13/128 dev br0 GW
> fc00:db8:cafe::1 SRC fc00:db8:cafe::2])
>>>  
>>>  OVS_TRAFFIC_VSWITCHD_STOP
>>>  AT_CLEANUP
>>> +
>>> +dnl Checks that OVS doesn't use routes from non-standard tables.
>>> +AT_SETUP([ovs-route - route tables])
>>> +AT_KEYWORDS([route])
>>> +OVS_TRAFFIC_VSWITCHD_START()
>>> +
>>> +dnl Create tap port.
>>> +on_exit 'ip link del p1-route'
>>> +AT_CHECK([ip tuntap add name p1-route mode tap])
>>> +AT_CHECK([ip link set p1-route up])
>>> +
>>> +dnl Add ip address.
>>> +AT_CHECK([ip addr add 10.0.0.17/24 dev p1-route], [0], [stdout])
>>> +
>>> +dnl Check that OVS catches route updates.
>>> +OVS_WAIT_UNTIL_EQUAL([ovs-appctl ovs/route/show | grep 'p1-route' | sort], 
>>> [dnl
>>> +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17
>>> +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local])
>>> +
>>> +dnl Add a route to the main routing table and check that OVS caches
>>> +dnl this new route.
>>> +AT_CHECK([ip route add 10.0.0.18/32 dev p1-route])
>>> +OVS_WAIT_UNTIL_EQUAL([ovs-appctl ovs/route/show | grep 'p1-route' | sort], 
>>> [dnl
>>> +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17
>>> +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local
>>> +Cached: 10.0.0.18/32 dev p1-route SRC 10.0.0.17])
>>> +
>>> +dnl Add a route to a custom routing table and check that OVS doesn't cache 
>>> it.
>>> +AT_CHECK([ip route add 10.0.0.19/32 dev p1-route table 42])
>>> +AT_CHECK([ip route show table 42 | grep 'p1-route' | grep -q '10.0.0.19'])
>>> +dnl Give the main thread a chance to act.
>>> +AT_CHECK([ovs-appctl revalidator/wait])
>>> +dnl Check that OVS didn't learn this route.
>>> +AT_CHECK([ovs-appctl ovs/route/show | grep 'p1-route' | sort], [0], [dnl
>>> +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17
>>> +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local
>>> +Cached: 10.0.0.18/32 dev p1-route SRC 10.0.0.17
>>> +])
>>> +
>>> +dnl Delete a route from the main table and check that OVS removes the route
>>> +dnl from the cache.
>>> +AT_CHECK([ip route del 10.0.0.18/32 dev p1-route])
>>> +OVS_WAIT_UNTIL_EQUAL([ovs-appctl ovs/route/show | grep 'p1-route' | sort], 
>>> [dnl
>>> +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17
>>> +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local])
>>> +
>>> +dnl Delete a route from a custom routing table and check that the cache
>>> +dnl dosn't change.
>>> +AT_CHECK([ip route del 10.0.0.19/32 dev p1-route table 42])
>>> +dnl Give the main thread a chance to act.
>>> +AT_CHECK([ovs-appctl revalidator/wait])
>>> +dnl Check that the cache is still the same.
>>> +AT_CHECK([ovs-appctl ovs/route/show | grep 'p1-route' | sort], [0], [dnl
>>> +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17
>>> +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local
>>> +])
>>> +
>>> +dnl Delete ip address.
>>> +AT_CHECK([ip addr del 10.0.0.17/24 dev p1-route], [0], [stdout])
>>> +dnl Check that routes were removed from OVS.
>>> +OVS_WAIT_UNTIL([test $(ovs-appctl ovs/route/show | grep -c 'p1-route') -eq 
>>> 0 ])
>>> +
>>> +OVS_TRAFFIC_VSWITCHD_STOP
>>> +AT_CLEANUP
>> 

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to