On 19/05/2026 14:42, Eelco Chaudron wrote:
External email: Use caution opening links or attachments
On 18 May 2026, at 13:23, Eli Britstein wrote:
On 11/05/2026 13:24, Eelco Chaudron wrote:
External email: Use caution opening links or attachments
On 1 Apr 2026, at 11:13, Eli Britstein wrote:
Introduce a new netdev type - "doca".
The code is placed in new files.
- ovs-doca: initialization of doca library and utility functions that
are used currently by netdev-doca and also will be used for future
hw-offload code.
- netdev-doca: implementation of the new netdev.
Supported ports are mlx5 ports in switch-dev mode only that with a NIC
that supports hw-steering.
The netdev has the concept of ESW manager. A representor port is
functional only if its ESW manager is attached to OVS. In case it is
not, the representor appears as functional in ovs-vsctl show, but it is
not. Upon initializing of an ESW manager port, each representor is
reconfigured to be functional, and upon destruction, they are first stopped.
Steering infrastructure:
- RX packets of all ports are steered to a common queue. This queue is
polled using dpdk API and the packets are classified to a per-port
memory structure.
- TX packets are marked with the target port as metadata and sent to a
common queue. The egress pipe matches on the metadata and forwards the
packets accordingly.
Hi Eli,
Thanks for this smaller patch. It took way longer than anticipated
to review. See some comments below.
//Eelco
See a few more questions and remarks below. I noticed you already
sent out a v4, so if any changes are required based on the
discussion below, you may be able to send out a v5 by the time I
start my review, as I’ll need some time to get to it anyway.
OK. I'll address the comments below.
Those are not major changes, so I'll wait for more comments on v4 before
sending v5.
Thanks.
//Eelco
[...]
+static int
+netdev_doca_esw_init(struct netdev *netdev)
...
+
+ if (doca_rdma_bridge_get_dev_pd(dev->esw_ctx->dev, &pd)) {
+ VLOG_ERR("Could not get pd for %s", devargs);
Can 'get pd for' be more user friendly, what does pd mean?
No. 'pd' is "Protection Domain". for a developer (who will have to fix a bug if
occurs) i think pd would be clearer.
Yes, but this is a user error message, so would it not be nice to be
a bit more verbose? Something like this:
"Could not get protection domain (PD) for %s"
OK
+ rv = EINVAL;
+ goto out;
+ }
+
+ if (dev->esw_ctx->cmd_fd == -1) {
+ dev->esw_ctx->cmd_fd = dup(pd->context->cmd_fd);
+ if (dev->esw_ctx->cmd_fd == -1) {
+ VLOG_ERR("Could not dup fd for %s. Error %s", devargs,
+ ovs_strerror(errno));
+ rv = EBADF;
+ goto out;
+ }
+ }
+
+ ds_put_format(&rte_devargs, "%s,cmd_fd=%d,pd_handle=%u", devargs,
+ dev->esw_ctx->cmd_fd, pd->handle);
+
+ VLOG_DBG("Probing '%s'", ds_cstr(&rte_devargs));
+ if (rte_dev_probe(ds_cstr(&rte_devargs))) {
+ VLOG_ERR("%s: rte_dev_probe failed for %s", OVS_SOURCE_LOCATOR,
+ ds_cstr(&rte_devargs));
+ close(dev->esw_ctx->cmd_fd);
+ dev->esw_ctx->cmd_fd = -1;
+ rv = ENODEV;
+ goto out;
+ }
+
+out:
+ ds_destroy(&rte_devargs);
+ if (rv) {
Could we add a comment here why we close the dev(),
i.e. the relation with refmap_ref() as it might not be
clear at the first glance.
I don't understand your meaning. netdev_doca_dev_close() is the "opposite" of
netdev_doca_dev_probe().
I added a simple comment to rollback, but maybe it's not what you meant.
It can be used as a rollback. Would it be clearer to do the stages explicitly
here?
While carefully reviewing, I fixed a leak of esw->cmd_fd, if it was dup
successfully, but failed on rte_dev_probe.
Normally it is closed inside rte_dev_close(), but in case of an error need to
do it explicitly.
I was referring to a comment something like this:
/* In case of an error, rollback the above refmap_ref()
* which initialized the device. */
OK
+ netdev_doca_dev_close(dev);
+ }
+ return rv;
+}
[...]
+ /* HPF's phys_port_name is pf0/pf1. */
+ if (len == 3 && !strncmp(phys_port_name, "pf", 2)) {
+ /* "" to workaround a false positive checkpatch issue. */
+ if (snprintf(devargs, maxlen, "%s,%s,representor=(pf%d)""vf65535", pci,
What does ""vf65535" do here? If its there for the specific case the
"" continuation is not needed.
vf65535 is for HPF port. that's the to probe it in dpdk syntax.
"" is to WA a false-possitive checkpatch issue. I put a comment for that.
+ /* HPF's phys_port_name is pf0/pf1. */
+ if (len == 3 && !strncmp(phys_port_name, "pf", 2)) {
+ /* "" to workaround a false positive checkpatch issue. */
+ if (snprintf(devargs, maxlen, "%s,%s,representor=(pf%d)""vf65535", pci,
What does ""vf65535" do here? If its there for the specific case the
"" continuation is not needed.
vf65535 is for HPF port. that's the to probe it in dpdk syntax.
"" is to WA a false-possitive checkpatch issue. I put a comment for that.
For this one, I would rather have the warning/error (it probably
needs fixing in checkpatch) than working around false positives.
OK
[...]
diff --git a/lib/ovs-doca.c b/lib/ovs-doca.c
index eae361a21..f78d0efdd 100644
--- a/lib/ovs-doca.c
+++ b/lib/ovs-doca.c
@@ -24,13 +24,34 @@
I guess the compiler.h above can be removed (probably in the earlier patch).
It is there for "OVS_UNUSED"
I see, it's due to the odd #ifdef use case.
Maybe it's better to include util.h here.
[...]
I see it's not needed when there is include "vswitch-idl.h" since it
indirectly includes "util.h". Removing it.
+}
+
+static const char *
+ovs_doca_log_level_to_str(uint32_t log_level)
+{
+ for (int i = 0; i < ARRAY_SIZE(levels); ++i) {
+ if (i == log_level && levels[i]) {
+ return levels[i];
+ }
+ }
+
+ OVS_NOT_REACHED();
+ return "UNKONWN";
"UNKONWN" -> "UNKNOWN"
Ack
We might need to remove the OVS_NOT_REACHED, see below.
I think it should be kept. see below.
There is no compile-time check that all levels are covered in
the levels[] array. A future SDK version could add a new level,
causing OVS_NOT_REACHED() to crash OVS. Just returning "UNKNOWN"
would be sufficient.
OK
+}
+
+static enum doca_log_level
+get_buf_log_level(const char *buf, size_t size)
+{
+ const char *p = buf;
+ int level;
+
+ for (int i = 0; i < 4; i++) {
The 4 hear makes no sense without any context, should something like
the below be added?
/* DOCA log format: [time][source][line][level] message.
* Skip the first 3 '[' tokens to reach the level field. */
#define DOCA_LOG_LEVEL_FIELD 4
Or simply a comment above the loop:
/* Skip [time], [source], [line], then read [level]. */
for (int i = 0; i < 4; i++) {
Added a comment (not this one).
+ while (size && *p && *p != '[') {
+ size--;
+ p++;
+ }
+
+ if (!size || !*p) {
+ return DOCA_LOG_LEVEL_DISABLE;
+ }
+
+ size--;
+ p++;
+ }
+
+ level = ovs_doca_parse_log_level(p);
+ if (level < 0) {
+ return DOCA_LOG_LEVEL_DISABLE;
+ }
+
+ return level;
+}
+
+static ssize_t
+ovs_doca_log_write(void *c OVS_UNUSED, const char *buf, size_t size)
+{
+ static struct vlog_rate_limit dbg_rl = VLOG_RATE_LIMIT_INIT(600, 600);
+ enum doca_log_level level = get_buf_log_level(buf, size);
+
+ switch (level) {
+ case DOCA_LOG_LEVEL_DISABLE:
+ VLOG_EMER("(Failed to parse level): %.*s", (int) size, buf);
Using VLOG_EMER() for a log parsing failure overstates the severity.
VLOG_ERR() would be more appropriate here. See also the OVS error level
definitions.
This should never happen. we could even abort here. If it happens something is
very very wrong.
This happens if for some reason the log message is corrupted, or
the level cannot be found. This is a parsing failure, not a system
emergency. According to OVS log levels, we should not crash OVS
because of this, we should just report an error.
It is a system failure because something is very wrong, but I won't
fight for it. OK.
+ break;
+ case DOCA_LOG_LEVEL_TRACE:
+ case DOCA_LOG_LEVEL_DEBUG:
+ VLOG_DBG_RL(&dbg_rl, "%.*s", (int) size, buf);
+ break;
+ case DOCA_LOG_LEVEL_INFO:
+ VLOG_INFO_RL(&rl, "%.*s", (int) size, buf);
+ break;
+ case DOCA_LOG_LEVEL_WARNING:
+ VLOG_WARN_RL(&rl, "%.*s", (int) size, buf);
+ break;
+ case DOCA_LOG_LEVEL_ERROR:
+ VLOG_ERR_RL(&rl, "%.*s", (int) size, buf);
+ break;
+ case DOCA_LOG_LEVEL_CRIT:
+ VLOG_EMER("%.*s", (int) size, buf);
+ break;
+ default:
+ OVS_NOT_REACHED();
Maybe we should remove the default: level, this way we get a compile warning
if we have a log level we do not support.
Ack
+ }
+
+ return size;
[...]
+static void
+ovs_doca_log_get(FILE *stream)
+{
+ uint32_t log_level;
+
+ log_level = doca_log_level_get_global_sdk_limit();
+ fprintf(stream, "DOCA log level is %s",
+ ovs_doca_log_level_to_str(log_level));
ovs_doca_log_level_to_str() calls OVS_NOT_REACHED() if the value
returned by doca_log_level_get_global_sdk_limit() does not match any
of the known DOCA log levels. Could a future DOCA SDK version
return a level value outside the current set, causing an abort in a
production system just because a user ran "doca/log-get"?
Yes. Specific OVS version must be linked against compatible DOCA version.
If future DOCA version changes this, we will fail compilation as we'll have
more enums.
I do not see how this would fail compilation. The levels[] array
uses designated initializers, not a switch. Adding a new enum to
the DOCA header would compile fine; the new value simply would not
have an entry in levels[], and ovs_doca_log_level_to_str() would
hit OVS_NOT_REACHED() at runtime.
I meant other place would fail (if indeed removing default:
OVS_NOT_REACHED()). Anyway, I removed the one to just return UNKNOWN
addressing this as well.
[...]
+ }
+
+ doca_log_level_set_global_sdk_limit(DOCA_LOG_LEVEL_WARNING);
+ }
+
+ unixctl_command_register("doca/log-set", "{level}. "
+ "level=CRT/ERR/WRN/INF/DBG/TRC", 0, 1,
The level names should be consistent with dpdk/log-set, which uses
full lowercase names: 'critical', 'error', 'warning', 'info',
'debug'. DOCA also adds 'trace', which has no DPDK equivalent, but
this should be fine.
dpdk names are derived from dpdk error levels. in doca they are a bit different. I don't
think there is a justification for "dpdk alignment" here. This is why I added
it in the help (missing in dpdk case btw).
You are right, DPDK has different values than OVS, and maybe those
should also be aligned. But as we are adding a new command, we
should try to be uniform with OVS as much as we can. Something
like:
emer, err, warn, info, and dbg
I guess for trace we can use 'trc', and 'crit' for critical.
Ack. no "emer" though.
+ ovs_doca_unixctl_log_set, NULL);
+ unixctl_command_register("doca/log-get", "", 0, 0,
+ unixctl_mem_stream, ovs_doca_log_get);
Should this be log-list similar to DPDK? In addition the output here should
also use full names.
There is no much meaning for a "list" here, as there is just one level. In dpdk
there are many.
ACK, I might do a patch for DPDK.
In addition we should add a test case for these two command to the
doca test framework when included in this series.
I'd prefer not to include the framework in this series.
Any reason why not to? We should apply this series with some test
harness, so either we include it in this series, or wait to apply
until the test infrastructure is ready to be included.
This is what I meant. There are still comments for the dpdk
infrstructure, so doca would be affected as well.
This is not a small series as it is, so I prefer not to make it even
larger - we might even get high as 40 commits :;
I would prefer to complete the review on it, also stabilizing the test
infrastructure and apply it later. Of course there might be fixes that
would have been better to be done on the first place, but I hope not many.
+ /* DOCA configuration happens earlier than dpif-netdev's.
+ * To avoid reorganizing them, read the relevant item directly. */
+ ovs_doca_max_megaflows_counters =
+ smap_get_uint(ovs_other_config, "flow-limit",
+ OVS_DOCA_MAX_MEGAFLOWS_COUNTERS);
When 'flow-limit' changes after initialization, the new value is
silently ignored since DOCA is initialized only once. Consider
logging a warning if the configured value differs from the one used
at startup, to indicate that a restart of ovs-vswitchd is required
for the change to take effect.
Ack, though it requires a change in vswitchd/bridge.c, otherwise we'll call
smap function each iteration that might be expensive.
I do not think it's too expensive, however, you can move this check
to the general config change callback, which is only run on config
changes and does not require bridge.c changes.
OK. Also removing the doca default OVS_DOCA_MAX_MEGAFLOWS_COUNTERS and
take ofproto's default.
[...]
+ RV_TEST(doca_flow_cfg_set_queue_depth(cfg, OVS_DOCA_QUEUE_DEPTH));
+ RV_TEST(doca_flow_cfg_set_cb_entry_process(
+ cfg, ovs_doca_offload_entry_process));
+ RV_TEST(ovs_doca_init_defs(cfg, &defs, &defs_cfg));
+
+ VLOG_INFO("DOCA Enabled - initializing...");
This message appears twice: once here inside ovs_doca_init__ and once
in ovs_doca_init before calling ovs_doca_init__. This results in this
being logged twice on each successful DOCA start.
There are potential errors below, after succefull initialization,
which might seem odd.
It follows the same approach as in dpdk.c. not twice.
For DPDK I see a single initializing message; for DOCA I see it twice.
Right. I missed it. I removed the one inside ovs_doca_init__()
[...]
+
+doca_error_t
+ovs_doca_remove_entry(struct netdev_doca_esw_ctx *esw,
+ unsigned int qid, uint32_t flags,
+ struct doca_flow_pipe_entry **entry)
+{
+ doca_error_t err;
+
+ if (!*entry) {
+ return DOCA_SUCCESS;
+ }
+
+ ovs_assert(qid < OVS_DOCA_MAX_OFFLOAD_QUEUES);
+
+ err = doca_flow_pipe_remove_entry(qid, flags, *entry);
+ if (err == DOCA_SUCCESS) {
+ esw->offload_queues[qid].n_waiting_entries++;
+ if (DOCA_FLOW_FLAGS_IS_SET(flags, DOCA_FLOW_ENTRY_FLAGS_NO_WAIT)) {
+ /* Ignore potential errors here, as even if the queue completion
+ * failed, the entry removal would still be issued. The caller
+ * requires knowing so. */
+ ovs_doca_complete_queue_esw(esw, qid);
+ }
+ *entry = NULL;
+ } else {
+ VLOG_ERR("Failed to remove entry %p qid=%d. Error: %d (%s)",
+ *entry, qid, err, doca_error_get_descr(err));
Failed to remove entry is rather vague, we should make this message more user
friendly.
And add details in a debug message maybe?
The only more detail we can have here is the ESW netdev. Added this. Other than
this, no other details
I meant more clear what failed; a pointer is not something that helps
a user in any way. For example:
"Failed to remove DOCA flow pipe entry for qid=%d. Error: %d (%s)"
OK, though trimmed "DOCA flow pipe". We see in the log it's from ovs_doca.
[...]
+void
+ovs_doca_init(const struct smap *ovs_other_config)
+{
+ static bool enabled = false;
+ int rv;
+
+ if (enabled || !ovs_other_config) {
Don't think we need the endabled logic here, the once_enable logic should take
care of this.
Same schema as in dpdk.c. If we don't have this, smap_get_bool() will be called
each iteration
Not sure if that is a real problem, but since DPDK uses the same
approach, let's keep it for now.
+ return;
+ }
+
+ if (smap_get_bool(ovs_other_config, "doca-init", false)) {
+ static struct ovsthread_once once_enable = OVSTHREAD_ONCE_INITIALIZER;
+
+ if (!ovsthread_once_start(&once_enable)) {
+ return;
+ }
+
+ VLOG_INFO("Using DOCA %s", doca_version_runtime());
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev