[ovs-dev] [PATCH v4] tc: fix crash on EAGAIN return from recvmsg on netlink socket.

Frode Nordahl Tue, 23 May 2023 03:39:41 -0700

The tc module combines the use of the `tc_transact` helper
function for communication with the in-kernel tc infrastructure
with assertions on the reply data by `ofpbuf_at_assert` on the
received data prior to further processing.


`tc_transact` in turn calls `nl_transact`, which via
`nl_transact_multiple__` ultimately calls and handles return
value from `recvmsg`.  On error a check for EAGAIN is performed
and a consequence of this condition is effectively to provide a
non-error (0) result and an empty reply buffer.

Before this change, the `tc_transact` and, consumers of it, were
unaware of this condition.  The use of assertions on the reply
buffer can as such lead to a fatal crash of OVS.

To be fair, the behavior of `nl_transact` when handling an EAGAIN
return is currently not documented, so this change also adds that
documentation.

While fixing the problem, it led me to find potential problems
with the one-time initialization functions in the netdev-offload-tc
module.  Make use of the information now available about an EAGAIN
condition to retry one-time initialization, and resort to logging
a warning if probing of tc features fails due to temporary
situations such as resource depletion.

For the record, the symptom of the crash is this in the log:
EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size 
failed in ofpbuf_at_assert()

And an excerpt of the backtrace looks like this:
0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, 
offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
tc_replace_flower (id=<optimized out>, flower=<optimized out>) at 
../lib/tc.c:3223
0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, 
match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at 
../lib/netdev-offload-tc.c:2096
0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, 
info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, 
actions=<optimized out>,
match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at 
../lib/dpif-netlink.c:2297
try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at 
../lib/dpif-netlink.c:2384

Reported-At: https://launchpad.net/bugs/2018500
Fixes: f98e418fbdb6 ("tc: Add tc flower functions")
Fixes: 407556ac6c90 ("netlink-socket: Avoid forcing a reply for final message 
in a transaction.")
Acked-by: Roi Dayan <r...@nvidia.com>
Reviewed-by: Simon Horman <simon.hor...@corigine.com>
Signed-off-by: Frode Nordahl <frode.nord...@canonical.com>
---
 lib/dpif.c              | 12 +++++--
 lib/netdev-linux.c      |  7 ++--
 lib/netdev-offload-tc.c | 73 +++++++++++++++++++++++++++++++++++++----
 lib/netlink-socket.c    |  5 +++
 lib/tc.c                | 26 +++++++++++++++
 5 files changed, 109 insertions(+), 14 deletions(-)

diff --git a/lib/dpif.c b/lib/dpif.c
index 3305401fe..f97a57e7b 100644
--- a/lib/dpif.c
+++ b/lib/dpif.c
@@ -1749,9 +1749,15 @@ flow_message_log_level(int error)
     /* If flows arrive in a batch, userspace may push down multiple
      * unique flow definitions that overlap when wildcards are applied.
      * Kernels that support flow wildcarding will reject these flows as
-     * duplicates (EEXIST), so lower the log level to debug for these
-     * types of messages. */
-    return (error && error != EEXIST) ? VLL_WARN : VLL_DBG;
+     * duplicates (EEXIST).
+     *
+     * Some subsystems expose temporary error conditions such as EAGAIN return
+     * for operations on non-blocking sockets.  This is done to make the right
+     * decissions during processing.
+     *
+     * If they bubble up here we ought to not log those as a warning, so lower
+     * the log level to debug for these types of messages. */
+    return (error && error != EEXIST && error != EAGAIN) ? VLL_WARN : VLL_DBG;
 }
 
 static bool
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 36620199e..f5c241474 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -6235,8 +6235,7 @@ tc_query_qdisc(const struct netdev *netdev_)
      * On Linux 2.6.35+ we use the straightforward method because it allows us
      * to handle non-builtin qdiscs without handle 1:0 (e.g. codel).  However,
      * in such a case we get no response at all from the kernel (!) if a
-     * builtin qdisc is in use (which is later caught by "!error &&
-     * !qdisc->size"). */
+     * builtin qdisc is in use (which is later caught by "error == EAGAIN" */
     tcmsg = netdev_linux_tc_make_request(netdev_, RTM_GETQDISC, NLM_F_ECHO,
                                          &request);
     if (!tcmsg) {
@@ -6247,7 +6246,7 @@ tc_query_qdisc(const struct netdev *netdev_)
 
     /* Figure out what tc class to instantiate. */
     error = tc_transact(&request, &qdisc);
-    if (!error && qdisc->size) {
+    if (!error) {
         const char *kind;
 
         error = tc_parse_qdisc(qdisc, &kind, NULL);
@@ -6262,7 +6261,7 @@ tc_query_qdisc(const struct netdev *netdev_)
                 ops = &tc_ops_other;
             }
         }
-    } else if ((!error && !qdisc->size) || error == ENOENT) {
+    } else if (error == ENOENT || error == EAGAIN) {
         /* Either it's a built-in qdisc, or (on Linux pre-2.6.35) it's a qdisc
          * set up by some other entity that doesn't have a handle 1:0.  We will
          * assume that it's the system default qdisc. */
diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
index 4f26dd8cc..c8ca280a7 100644
--- a/lib/netdev-offload-tc.c
+++ b/lib/netdev-offload-tc.c
@@ -2571,6 +2571,28 @@ netdev_tc_get_n_flows(struct netdev *netdev, uint64_t 
*n_flows)
     return 0;
 }
 
+/* This macro is for use by one-time initialization functions, where we have
+ * one shot per thread/process to perform a pertinent initialization task that
+ * may return a temporary error (EAGAIN).
+ *
+ * With min/max values of 1/64 we would retry 7 times, spending at the
+ * most 127 * 1E6 nsec (0.127s) sleeping.
+ */
+#define NETDEV_OFFLOAD_TC_BACKOFF_MIN 1
+#define NETDEV_OFFLOAD_TC_BACKOFF_MAX 64
+#define NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(ERROR, CONDITION, FUNCTION, ...) \
+    for (uint64_t backoff = NETDEV_OFFLOAD_TC_BACKOFF_MIN;                    \
+         backoff <= NETDEV_OFFLOAD_TC_BACKOFF_MAX;                            \
+         backoff <<= 1)                                                       \
+    {                                                                         \
+        ERROR = (FUNCTION)(__VA_ARGS__);                                      \
+        if (CONDITION) {                                                      \
+            xnanosleep(backoff * 1E6);                                        \
+            continue;                                                         \
+        }                                                                     \
+        break;                                                                \
+    }
+
 static void
 probe_multi_mask_per_prio(int ifindex)
 {
@@ -2594,8 +2616,13 @@ probe_multi_mask_per_prio(int ifindex)
     memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
 
     id1 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
-    error = tc_replace_flower(&id1, &flower);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(error, error == EAGAIN,
+                                         tc_replace_flower, &id1, &flower);
     if (error) {
+        if (error == EAGAIN) {
+            VLOG_WARN("probe tc: unable to probe for multiple mask "
+                      "support: %s", ovs_strerror(error));
+        }
         goto out;
     }
 
@@ -2603,10 +2630,15 @@ probe_multi_mask_per_prio(int ifindex)
     memset(&flower.mask.src_mac, 0xff, sizeof flower.mask.src_mac);
 
     id2 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
-    error = tc_replace_flower(&id2, &flower);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(error, error == EAGAIN,
+                                         tc_replace_flower, &id2, &flower);
     tc_del_flower_filter(&id1);
 
     if (error) {
+        if (error == EAGAIN) {
+            VLOG_WARN("probe tc: unable to probe for multiple mask "
+                      "support: %s", ovs_strerror(error));
+        }
         goto out;
     }
 
@@ -2657,8 +2689,13 @@ probe_ct_state_support(int ifindex)
         goto out;
     }
 
-    error = tc_get_flower(&id, &flower);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(error, error == EAGAIN,
+                                         tc_get_flower, &id, &flower);
     if (error || flower.mask.ct_state != ct_state) {
+        if (error == EAGAIN) {
+            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
+                      ovs_strerror(error));
+        }
         goto out_del;
     }
 
@@ -2670,10 +2707,16 @@ probe_ct_state_support(int ifindex)
 
     /* Test for reject, ct_state >= MAX */
     ct_state = ~0;
-    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(error, error == EAGAIN,
+                                         probe_insert_ct_state_rule, ifindex,
+                                         ct_state, &id);
     if (!error) {
         /* No reject, can't continue probing other flags */
         goto out_del;
+    } else if (error == EAGAIN) {
+        VLOG_WARN("probe tc: unable to probe ct_state support: %s",
+                  ovs_strerror(error));
+        goto out_del;
     }
 
     tc_del_flower_filter(&id);
@@ -2682,8 +2725,14 @@ probe_ct_state_support(int ifindex)
     memset(&flower, 0, sizeof flower);
     ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
                TCA_FLOWER_KEY_CT_FLAGS_INVALID;
-    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(error, error == EAGAIN,
+                                         probe_insert_ct_state_rule, ifindex,
+                                         ct_state, &id);
     if (error) {
+        if (error == EAGAIN) {
+            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
+                      ovs_strerror(error));
+        }
         goto out;
     }
 
@@ -2695,8 +2744,14 @@ probe_ct_state_support(int ifindex)
     ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
                TCA_FLOWER_KEY_CT_FLAGS_ESTABLISHED |
                TCA_FLOWER_KEY_CT_FLAGS_REPLY;
-    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(error, error == EAGAIN,
+                                         probe_insert_ct_state_rule, ifindex,
+                                         ct_state, &id);
     if (error) {
+        if (error == EAGAIN) {
+            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
+                      ovs_strerror(error));
+        }
         goto out;
     }
 
@@ -2732,13 +2787,17 @@ probe_tc_block_support(int ifindex)
     memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
 
     id = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
-    error = tc_replace_flower(&id, &flower);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(error, error == EAGAIN,
+                                         tc_replace_flower, &id, &flower);
 
     tc_add_del_qdisc(ifindex, false, block_id, TC_INGRESS);
 
     if (!error) {
         block_support = true;
         VLOG_INFO("probe tc: block offload is supported.");
+    } else if (error == EAGAIN) {
+        VLOG_WARN("probe tc: unable to probe block offload support: %s",
+                  ovs_strerror(error));
     }
 }
 
diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
index 80da20d9f..dea060fc3 100644
--- a/lib/netlink-socket.c
+++ b/lib/netlink-socket.c
@@ -1798,6 +1798,11 @@ nl_pool_release(struct nl_sock *sock)
  *
  *      2. Resending the request causes it to be re-executed, so the request
  *         needs to be idempotent.
+ *
+ *      3. In the event that the kernel is too busy to handle the request to
+ *         receive the response (i.e. EAGAIN), this function will still return
+ *         0.  The caller can check for this condition by checking for a zero
+ *         size of the 'replyp' ofpbuf buffer.
  */
 int
 nl_transact(int protocol, const struct ofpbuf *request,
diff --git a/lib/tc.c b/lib/tc.c
index 5c32c6f97..0396570ac 100644
--- a/lib/tc.c
+++ b/lib/tc.c
@@ -36,6 +36,7 @@
 #include <unistd.h>
 
 #include "byte-order.h"
+#include "coverage.h"
 #include "netlink-socket.h"
 #include "netlink.h"
 #include "openvswitch/ofpbuf.h"
@@ -67,6 +68,8 @@
 
 VLOG_DEFINE_THIS_MODULE(tc);
 
+COVERAGE_DEFINE(tc_transact_eagain);
+
 static struct vlog_rate_limit error_rl = VLOG_RATE_LIMIT_INIT(60, 5);
 
 static enum tc_offload_policy tc_policy = TC_POLICY_NONE;
@@ -237,11 +240,34 @@ static void request_from_tcf_id(struct tcf_id *id, 
uint16_t eth_type,
     }
 }
 
+/* The `tc_transact` function is a wrapper around `nl_transact` with the
+ * addition of:
+ *
+ * 1. the 'request' ofpbuf buffer is freed after `nl_transact` returns,
+ *    regardless of success or failure.
+ *
+ * 2. When a 'replyp' pointer is provided; in the event of the kernel
+ *    being too busy to process the request for the response, a positive
+ *    error return will be provided with the value of EAGAIN.
+ *
+ * Please acquaint yourself with the documentation of the `nl_transact`
+ * function in the netlink-socket module before making use of this function.
+ */
 int
 tc_transact(struct ofpbuf *request, struct ofpbuf **replyp)
 {
     int error = nl_transact(NETLINK_ROUTE, request, replyp);
     ofpbuf_uninit(request);
+
+    if (!error && replyp && !(*replyp)->size) {
+        COVERAGE_INC(tc_transact_eagain);
+        /* We replicate the behavior of `nl_transact` for error conditions and
+         * free any allocations before setting the 'replyp' buffer to NULL. */
+        ofpbuf_delete(*replyp);
+        *replyp = NULL;
+        return EAGAIN;
+    }
+
     return error;
 }
 
-- 
2.39.2

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

[ovs-dev] [PATCH v4] tc: fix crash on EAGAIN return from recvmsg on netlink socket.

Reply via email to