Re: [ovs-dev] [PATCH v3 11/11] netdev-doca: Introduce doca netdev.

Eelco Chaudron via dev Tue, 19 May 2026 04:42:51 -0700


On 18 May 2026, at 13:23, Eli Britstein wrote:


> On 11/05/2026 13:24, Eelco Chaudron wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 1 Apr 2026, at 11:13, Eli Britstein wrote:
>>
>>> Introduce a new netdev type - "doca".
>>> The code is placed in new files.
>>> - ovs-doca: initialization of doca library and utility functions that
>>>    are used currently by netdev-doca and also will be used for future
>>>    hw-offload code.
>>> - netdev-doca: implementation of the new netdev.
>>>
>>> Supported ports are mlx5 ports in switch-dev mode only that with a NIC
>>> that supports hw-steering.
>>>
>>> The netdev has the concept of ESW manager. A representor port is
>>> functional only if its ESW manager is attached to OVS. In case it is
>>> not, the representor appears as functional in ovs-vsctl show, but it is
>>> not. Upon initializing of an ESW manager port, each representor is
>>> reconfigured to be functional, and upon destruction, they are first stopped.
>>>
>>> Steering infrastructure:
>>> - RX packets of all ports are steered to a common queue. This queue is
>>>    polled using dpdk API and the packets are classified to a per-port
>>>    memory structure.
>>> - TX packets are marked with the target port as metadata and sent to a
>>>    common queue. The egress pipe matches on the metadata and forwards the
>>>    packets accordingly.
>> Hi Eli,
>>
>> Thanks for this smaller patch. It took way longer than anticipated
>> to review. See some comments below.
>>
>> //Eelco

See a few more questions and remarks below. I noticed you already
sent out a v4, so if any changes are required based on the
discussion below, you may be able to send out a v5 by the time I
start my review, as I’ll need some time to get to it anyway.

//Eelco


[...]

>>> +static int
>>> +netdev_doca_esw_init(struct netdev *netdev)
>>> +{
>>> +    struct netdev_doca *dev = netdev_doca_cast(netdev);
>>> +    struct netdev_dpdk_common *common = &dev->common;
>>> +    struct netdev_doca_esw_ctx *esw = dev->esw_ctx;
>>> +    uint16_t pid;
>>> +    int rv;
>>> +
>>> +    esw->esw_port = dev->port;
>>> +    esw->esw_netdev = netdev;
>>> +    esw->port_id = common->port_id;
>>> +    esw->n_rxq = netdev->n_rxq;
>>> +
>>> +    rv = netdev_doca_slowpath_esw_init(netdev);
>>> +    if (rv) {
>>> +        return rv;
>>> +    }
>>> +
>>> +    for (pid = 0; pid < RTE_MAX_ETHPORTS; pid++) {
>> This allocates ~2.8 MB of hugepage memory per ESW manager port (with 4
>> RX queues), as port_queues and rings are allocated for all 1024 possible
>> port IDs regardless of how many ports are actually in use.  Was there a
>> specific reason to pre-allocate for all port IDs upfront? I guess this
>> should not be a real problem for memory resources. Just curious.
>
> The actual RX is done only on the ESW port. If we want to allocate the 
> per-port on demand we will have to sync it somehow with the PMD thread that 
> does the actual RX.
>
> This way, we avoid it.

Thanks for the clarification.

[...]

>>> +        return;
>>> +    }
>>> +
>>> +    pci_addr = xstrdup(esw->pci_addr);
>> pci_addr is allocated with xstrdup and used only for logging inside
>> the if (last && esw->dev) block.  Could esw->pci_addr be used
>> directly, avoiding the allocation?
> No. we call refmap_unref() that will zero esw->pci_addr in 
> netdev_doca_esw_ctx_uninit(). Also, even if we don't, it's a kind of UAF, 
> though the free is done using ovsrcu_postpone in refmap.

Thanks, I missed this part...

>>> +    if (common->port_id != dev->esw_mgr_port_id && dev->dev_rep) {
>>> +        VLOG_DBG("%s: Closing doca dev_rep for port_id "DPDK_PORT_ID_FMT
>>> +                 ". %p", netdev_get_name(&common->up),
>>> +                 common->port_id, dev->dev_rep);
>>> +        err = doca_dev_rep_close(dev->dev_rep);
>>> +        if (err) {
>>> +            VLOG_ERR("Failed to close doca dev_rep with port id "
>>> +                     DPDK_PORT_ID_FMT". Error: %d (%s)",
>>> +                     common->port_id, err, doca_error_get_descr(err));
>>> +        }
>>> +
>>> +        dev->dev_rep = NULL;
>>> +    }
>>> +
>>> +    last = refmap_unref(netdev_doca_esw_rfm, esw);
>>> +    /* The last is the ESW. */
>>> +    if (last && esw->dev) {
>>> +        /* The esw->cmd_fd is closed inside. */
>>> +        if (dev_info.device) {
>>> +            err = rte_dev_remove(dev_info.device);
>>> +            if (err) {
>>> +                VLOG_ERR("Failed to remove device %s: %s", common->devargs,
>>> +                         rte_strerror(-err));
>>> +            }
>>> +        }
>>> +
>>> +        VLOG_DBG("Closing '%s'", pci_addr);
>>> +        err = doca_dev_close(esw->dev);
>>> +        if (err) {
>> See comment on doca errno handling. DOCA_SUCCESS vs err vs DOCA_IS_ERROR().
>> Please check/update all code related to this.
> I moved all to (err != DOCA_SUCCESS).

Thanks.

[...]

>>> +static int
>>> +dpdk_eth_dev_port_config_complete(struct netdev_doca *dev,
>> doca_eth_dev_port_config...?
> Ack
>>
>>> +                                  int n_rxq, int n_txq)
>>> +{
>>> +    struct netdev_dpdk_common *common = &dev->common;
>>> +    uint16_t conf_mtu;
>>> +    int diag;
>>> +
>>> +    free(dev->sw_tx_stats);
>>> +    dev->sw_tx_stats = xcalloc(n_txq, sizeof *dev->sw_tx_stats);
>>> +    for (int i = 0; i < n_txq; i++) {
>>> +        atomic_init(&dev->sw_tx_stats[i].n_packets, 0);
>>> +        atomic_init(&dev->sw_tx_stats[i].n_bytes, 0);
>>> +    }
>>> +
>>> +    common->up.n_rxq = n_rxq;
>>> +    common->up.n_txq = n_txq;
>>> +
>>> +    diag = rte_eth_dev_set_mtu(common->port_id, common->mtu);
>>> +    if (diag) {
>>> +        /* A device may not support rte_eth_dev_set_mtu, in this case
>>> +         * flag a warning to the user and include the devices configured
>>> +         * MTU value that will be used instead. */
>>> +        if (-ENOTSUP == diag) {
>>> +            rte_eth_dev_get_mtu(common->port_id, &conf_mtu);
>>> +            VLOG_WARN("Interface %s does not support MTU configuration, "
>>> +                      "max packet size supported is %"PRIu16".",
>>> +                      common->up.name, conf_mtu);
>>> +        } else {
>>> +            VLOG_ERR("Interface %s MTU (%d) setup error: %s",
>>> +                     common->up.name, common->mtu, rte_strerror(-diag));
>>> +        }
>>> +    }
>>> +
>>> +    return diag;
>> Should this be -diag, as positive errors should be return by the
>> netdev_reconfigure() API.
> No. It will be also the returned value of doca_eth_dev_port_config. the 
> return value of doca_eth_dev_init() is -diag.

Thanks, missed this...

>>> +
>>> +static int
>>> +netdev_doca_dev_probe(struct netdev_doca *dev, const char *devargs)
>>> +{

[...]

>>> +
>>> +    if (doca_rdma_bridge_get_dev_pd(dev->esw_ctx->dev, &pd)) {
>>> +        VLOG_ERR("Could not get pd for %s", devargs);
>> Can 'get pd for' be more user friendly, what does pd mean?
> No. 'pd' is "Protection Domain". for a developer (who will have to fix a bug 
> if occurs) i think pd would be clearer.

Yes, but this is a user error message, so would it not be nice to be
a bit more verbose?  Something like this:

  "Could not get protection domain (PD) for %s"

>>
>>> +        rv = EINVAL;
>>> +        goto out;
>>> +    }
>>> +
>>> +    if (dev->esw_ctx->cmd_fd == -1) {
>>> +        dev->esw_ctx->cmd_fd = dup(pd->context->cmd_fd);
>>> +        if (dev->esw_ctx->cmd_fd == -1) {
>>> +            VLOG_ERR("Could not dup fd for %s. Error %s", devargs,
>>> +                     ovs_strerror(errno));
>>> +            rv = EBADF;
>>> +            goto out;
>>> +        }
>>> +    }
>>> +
>>> +    ds_put_format(&rte_devargs, "%s,cmd_fd=%d,pd_handle=%u", devargs,
>>> +                  dev->esw_ctx->cmd_fd, pd->handle);
>>> +
>>> +    VLOG_DBG("Probing '%s'", ds_cstr(&rte_devargs));
>>> +    if (rte_dev_probe(ds_cstr(&rte_devargs))) {
>>> +        VLOG_ERR("%s: rte_dev_probe failed for %s", OVS_SOURCE_LOCATOR,
>>> +                 ds_cstr(&rte_devargs));
>>> +        close(dev->esw_ctx->cmd_fd);
>>> +        dev->esw_ctx->cmd_fd = -1;
>>> +        rv = ENODEV;
>>> +        goto out;
>>> +    }
>>> +
>>> +out:
>>> +    ds_destroy(&rte_devargs);
>>> +    if (rv) {
>> Could we add a comment here why we close the dev(),
>> i.e. the relation with  refmap_ref() as it might not be
>> clear at the first glance.
>
> I don't understand your meaning. netdev_doca_dev_close() is the "opposite" of 
> netdev_doca_dev_probe().
>
> I added a simple comment to rollback, but maybe it's not what you meant.
>
> It can be used as a rollback. Would it be clearer to do the stages explicitly 
> here?
>
> While carefully reviewing, I fixed a leak of esw->cmd_fd, if it was dup 
> successfully, but failed on rte_dev_probe.
>
> Normally it is closed inside rte_dev_close(), but in case of an error need to 
> do it explicitly.


I was referring to a comment something like this:

  /* In case of an error, rollback the above refmap_ref()
   * which initialized the device. */

>>> +        netdev_doca_dev_close(dev);
>>> +    }
>>> +    return rv;
>>> +}

[...]

>>> +    /* HPF's phys_port_name is pf0/pf1. */
>>> +    if (len == 3 && !strncmp(phys_port_name, "pf", 2)) {
>>> +        /* "" to workaround a false positive checkpatch issue. */
>>> +        if (snprintf(devargs, maxlen, "%s,%s,representor=(pf%d)""vf65535", 
>>> pci,
>> What does ""vf65535" do here? If its there for the specific case the
>> "" continuation is not needed.
>
> vf65535 is for HPF port. that's the to probe it in dpdk syntax.
>
> "" is to WA a false-possitive checkpatch issue. I put a comment for that.

>>> +    /* HPF's phys_port_name is pf0/pf1. */
>>> +    if (len == 3 && !strncmp(phys_port_name, "pf", 2)) {
>>> +        /* "" to workaround a false positive checkpatch issue. */
>>> +        if (snprintf(devargs, maxlen, "%s,%s,representor=(pf%d)""vf65535", 
>>> pci,
>> What does ""vf65535" do here? If its there for the specific case the
>> "" continuation is not needed.
>
> vf65535 is for HPF port. that's the to probe it in dpdk syntax.
>
> "" is to WA a false-possitive checkpatch issue. I put a comment for that.


For this one, I would rather have the warning/error (it probably
needs fixing in checkpatch) than working around false positives.

[...]

>>> diff --git a/lib/ovs-doca.c b/lib/ovs-doca.c
>>> index eae361a21..f78d0efdd 100644
>>> --- a/lib/ovs-doca.c
>>> +++ b/lib/ovs-doca.c
>>> @@ -24,13 +24,34 @@
>>>
>> I guess the compiler.h above can be removed (probably in the earlier patch).
> It is there for "OVS_UNUSED"

I see, it's due to the odd #ifdef use case.
Maybe it's better to include util.h here.

[...]

>>
>>> +}
>>> +
>>> +static const char *
>>> +ovs_doca_log_level_to_str(uint32_t log_level)
>>> +{
>>> +    for (int i = 0; i < ARRAY_SIZE(levels); ++i) {
>>> +        if (i == log_level && levels[i]) {
>>> +            return levels[i];
>>> +        }
>>> +    }
>>> +
>>> +    OVS_NOT_REACHED();
>>> +    return "UNKONWN";
>> "UNKONWN" -> "UNKNOWN"
> Ack
>>
>> We might need to remove the OVS_NOT_REACHED, see below.
> I think it should be kept. see below.

There is no compile-time check that all levels are covered in
the levels[] array.  A future SDK version could add a new level,
causing OVS_NOT_REACHED() to crash OVS.  Just returning "UNKNOWN"
would be sufficient.

>>> +}
>>> +
>>> +static enum doca_log_level
>>> +get_buf_log_level(const char *buf, size_t size)
>>> +{
>>> +    const char *p = buf;
>>> +    int level;
>>> +
>>> +    for (int i = 0; i < 4; i++) {
>> The 4 hear makes no sense without any context, should something like
>> the below be added?
>>
>>    /* DOCA log format: [time][source][line][level] message.
>>     * Skip the first 3 '[' tokens to reach the level field. */
>>    #define DOCA_LOG_LEVEL_FIELD 4
>>
>> Or simply a comment above the loop:
>>    /* Skip [time], [source], [line], then read [level]. */
>>    for (int i = 0; i < 4; i++) {
> Added a comment (not this one).
>>
>>> +        while (size && *p && *p != '[') {
>>> +            size--;
>>> +            p++;
>>> +        }
>>> +
>>> +        if (!size || !*p) {
>>> +            return DOCA_LOG_LEVEL_DISABLE;
>>> +        }
>>> +
>>> +        size--;
>>> +        p++;
>>> +    }
>>> +
>>> +    level = ovs_doca_parse_log_level(p);
>>> +    if (level < 0) {
>>> +        return DOCA_LOG_LEVEL_DISABLE;
>>> +    }
>>> +
>>> +    return level;
>>> +}
>>> +
>>> +static ssize_t
>>> +ovs_doca_log_write(void *c OVS_UNUSED, const char *buf, size_t size)
>>> +{
>>> +    static struct vlog_rate_limit dbg_rl = VLOG_RATE_LIMIT_INIT(600, 600);
>>> +    enum doca_log_level level = get_buf_log_level(buf, size);
>>> +
>>> +    switch (level) {
>>> +        case DOCA_LOG_LEVEL_DISABLE:
>>> +            VLOG_EMER("(Failed to parse level): %.*s", (int) size, buf);
>> Using VLOG_EMER() for a log parsing failure overstates the severity.
>> VLOG_ERR() would be more appropriate here. See also the OVS error level
>> definitions.
> This should never happen. we could even abort here. If it happens something 
> is very very wrong.

This happens if for some reason the log message is corrupted, or
the level cannot be found.  This is a parsing failure, not a system
emergency.  According to OVS log levels, we should not crash OVS
because of this, we should just report an error.

>>> +            break;
>>> +        case DOCA_LOG_LEVEL_TRACE:
>>> +        case DOCA_LOG_LEVEL_DEBUG:
>>> +            VLOG_DBG_RL(&dbg_rl, "%.*s", (int) size, buf);
>>> +            break;
>>> +        case DOCA_LOG_LEVEL_INFO:
>>> +            VLOG_INFO_RL(&rl, "%.*s", (int) size, buf);
>>> +            break;
>>> +        case DOCA_LOG_LEVEL_WARNING:
>>> +            VLOG_WARN_RL(&rl, "%.*s", (int) size, buf);
>>> +            break;
>>> +        case DOCA_LOG_LEVEL_ERROR:
>>> +            VLOG_ERR_RL(&rl, "%.*s", (int) size, buf);
>>> +            break;
>>> +        case DOCA_LOG_LEVEL_CRIT:
>>> +            VLOG_EMER("%.*s", (int) size, buf);
>>> +            break;
>>> +        default:
>>> +            OVS_NOT_REACHED();

Maybe we should remove the default: level, this way we get a compile warning
if we have a log level we do not support.

>>> +    }
>>> +
>>> +    return size;

[...]

>>> +static void
>>> +ovs_doca_log_get(FILE *stream)
>>> +{
>>> +    uint32_t log_level;
>>> +
>>> +    log_level = doca_log_level_get_global_sdk_limit();
>>> +    fprintf(stream, "DOCA log level is %s",
>>> +            ovs_doca_log_level_to_str(log_level));
>> ovs_doca_log_level_to_str() calls OVS_NOT_REACHED() if the value
>> returned by doca_log_level_get_global_sdk_limit() does not match any
>> of the known DOCA log levels.  Could a future DOCA SDK version
>> return a level value outside the current set, causing an abort in a
>> production system just because a user ran "doca/log-get"?
>
> Yes. Specific OVS version must be linked against compatible DOCA version.
>
> If future DOCA version changes this, we will fail compilation as we'll have 
> more enums.

I do not see how this would fail compilation. The levels[] array
uses designated initializers, not a switch. Adding a new enum to
the DOCA header would compile fine; the new value simply would not
have an entry in levels[], and ovs_doca_log_level_to_str() would
hit OVS_NOT_REACHED() at runtime.

[...]

>>> +        }
>>> +
>>> +        doca_log_level_set_global_sdk_limit(DOCA_LOG_LEVEL_WARNING);
>>> +    }
>>> +
>>> +    unixctl_command_register("doca/log-set", "{level}. "
>>> +                             "level=CRT/ERR/WRN/INF/DBG/TRC", 0, 1,
>> The level names should be consistent with dpdk/log-set, which uses
>> full lowercase names: 'critical', 'error', 'warning', 'info',
>> 'debug'. DOCA also adds 'trace', which has no DPDK equivalent, but
>> this should be fine.
> dpdk names are derived from dpdk error levels. in doca they are a bit 
> different. I don't think there is a justification for "dpdk alignment" here. 
> This is why I added it in the help (missing in dpdk case btw).

You are right, DPDK has different values than OVS, and maybe those
should also be aligned.  But as we are adding a new command, we
should try to be uniform with OVS as much as we can.  Something
like:

  emer, err, warn, info, and dbg

I guess for trace we can use 'trc', and 'crit' for critical.

>>> +                             ovs_doca_unixctl_log_set, NULL);
>>> +    unixctl_command_register("doca/log-get", "", 0, 0,
>>> +                             unixctl_mem_stream, ovs_doca_log_get);
>> Should this be log-list similar to DPDK? In addition the output here should
>> also use full names.
> There is no much meaning for a "list" here, as there is just one level. In 
> dpdk there are many.

ACK, I might do a patch for DPDK.

>>
>> In addition we should add a test case for these two command to the
>> doca test framework when included in this series.
> I'd prefer not to include the framework in this series.

Any reason why not to?  We should apply this series with some test
harness, so either we include it in this series, or wait to apply
until the test infrastructure is ready to be included.

>>> +    /* DOCA configuration happens earlier than dpif-netdev's.
>>> +     * To avoid reorganizing them, read the relevant item directly. */
>>> +    ovs_doca_max_megaflows_counters =
>>> +        smap_get_uint(ovs_other_config, "flow-limit",
>>> +                      OVS_DOCA_MAX_MEGAFLOWS_COUNTERS);
>> When 'flow-limit' changes after initialization, the new value is
>> silently ignored since DOCA is initialized only once. Consider
>> logging a warning if the configured value differs from the one used
>> at startup, to indicate that a restart of ovs-vswitchd is required
>> for the change to take effect.
> Ack, though it requires a change in vswitchd/bridge.c, otherwise we'll call 
> smap function each iteration that might be expensive.

I do not think it's too expensive, however, you can move this check
to the general config change callback, which is only run on config
changes and does not require bridge.c changes.

[...]

>>
>>> +    RV_TEST(doca_flow_cfg_set_queue_depth(cfg, OVS_DOCA_QUEUE_DEPTH));
>>> +    RV_TEST(doca_flow_cfg_set_cb_entry_process(
>>> +                cfg, ovs_doca_offload_entry_process));
>>> +    RV_TEST(ovs_doca_init_defs(cfg, &defs, &defs_cfg));
>>> +
>>> +    VLOG_INFO("DOCA Enabled - initializing...");
>> This message appears twice: once here inside ovs_doca_init__ and once
>> in ovs_doca_init before calling ovs_doca_init__.  This results in this
>> being logged twice on each successful DOCA start.
>>
>> There are potential errors below, after succefull initialization,
>> which might seem odd.
> It follows the same approach as in dpdk.c. not twice.

For DPDK I see a single initializing message; for DOCA I see it twice.

[...]

>>> +
>>> +doca_error_t
>>> +ovs_doca_remove_entry(struct netdev_doca_esw_ctx *esw,
>>> +                      unsigned int qid, uint32_t flags,
>>> +                      struct doca_flow_pipe_entry **entry)
>>> +{
>>> +    doca_error_t err;
>>> +
>>> +    if (!*entry) {
>>> +        return DOCA_SUCCESS;
>>> +    }
>>> +
>>> +    ovs_assert(qid < OVS_DOCA_MAX_OFFLOAD_QUEUES);
>>> +
>>> +    err = doca_flow_pipe_remove_entry(qid, flags, *entry);
>>> +    if (err == DOCA_SUCCESS) {
>>> +        esw->offload_queues[qid].n_waiting_entries++;
>>> +        if (DOCA_FLOW_FLAGS_IS_SET(flags, DOCA_FLOW_ENTRY_FLAGS_NO_WAIT)) {
>>> +            /* Ignore potential errors here, as even if the queue 
>>> completion
>>> +             * failed, the entry removal would still be issued.  The caller
>>> +             * requires knowing so. */
>>> +            ovs_doca_complete_queue_esw(esw, qid);
>>> +        }
>>> +        *entry = NULL;
>>> +    } else {
>>> +        VLOG_ERR("Failed to remove entry %p qid=%d. Error: %d (%s)",
>>> +                 *entry, qid, err, doca_error_get_descr(err));
>> Failed to remove entry is rather vague, we should make this message more 
>> user friendly.
>> And add details in a debug message maybe?
> The only more detail we can have here is the ESW netdev. Added this. Other 
> than this, no other details

I meant more clear what failed; a pointer is not something that helps
a user in any way.  For example:
  "Failed to remove DOCA flow pipe entry for qid=%d. Error: %d (%s)"

[...]

>>> +void
>>> +ovs_doca_init(const struct smap *ovs_other_config)
>>> +{
>>> +    static bool enabled = false;
>>> +    int rv;
>>> +
>>> +    if (enabled || !ovs_other_config) {
>> Don't think we need the endabled logic here, the once_enable logic should 
>> take care of this.
> Same schema as in dpdk.c. If we don't have this, smap_get_bool() will be 
> called each iteration

Not sure if that is a real problem, but since DPDK uses the same
approach, let's keep it for now.

>>> +        return;
>>> +    }
>>> +
>>> +    if (smap_get_bool(ovs_other_config, "doca-init", false)) {
>>> +        static struct ovsthread_once once_enable = 
>>> OVSTHREAD_ONCE_INITIALIZER;
>>> +
>>> +        if (!ovsthread_once_start(&once_enable)) {
>>> +            return;
>>> +        }
>>> +
>>> +        VLOG_INFO("Using DOCA %s", doca_version_runtime());

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH v3 11/11] netdev-doca: Introduce doca netdev.

Reply via email to