On 1/12/26 11:15 AM, Eli Britstein wrote:
>
> On 12/01/2026 11:23, Ilya Maximets wrote:
>> On 1/11/26 5:29 PM, Eli Britstein wrote:
>>> On 11/01/2026 17:54, Aaron Conole wrote:
>>>> Eelco Chaudron via dev <[email protected]> writes:
>>>>
>>>>> This RFC patch series introduces a major architectural
>>>>> refactoring of Open vSwitch's hardware offload
>>>>> infrastructure. It replaces the tightly coupled
>>>>> `netdev-offload` implementation with a new, modular
>>>>> `dpif-offload-provider` framework.
>>>>>
>>>>> MOTIVATION
>>>>> -------------------------------------------------------------
>>>>> The existing `netdev-offload` API tightly couples datapath
>>>>> implementations (like `dpif-netdev`) with specific offload
>>>>> technologies (DPDK's rte_flow). This design has several
>>>>> limitations:
>>>>>
>>>>> - Rigid Architecture: It creates complex dependencies,
>>>>> making the code difficult to maintain and extend.
>>>>>
>>>>> - Limited Flexibility: Supporting multiple offload backends
>>>>> simultaneously or adding new ones is cumbersome.
>>>>>
>>>>> - Inconsistent APIs: The logic for handling different
>>>>> offload types is scattered, leading to an inconsistent
>>>>> and hard-to-follow API surface.
>>>>>
>>>>> This refactoring aims to resolve these issues by creating a
>>>>> clean separation of concerns, improving modularity, and
>>>>> establishing a clear path for future hardware offload
>>>>> integrations.
>>>> Thanks for all the work you've done on this over the last 8 months.
>>> I'm joining the thanks.
>>>>> PROPOSED SOLUTION: THE `DPIF-OFFLOAD-PROVIDER` FRAMEWORK
>>>>> -------------------------------------------------------------
>>>>> This series introduces the `dpif-offload-provider`
>>>>> framework, which functions similarly to the existing
>>>>> `dpif-provider` pattern. It treats hardware offload as a
>>>>> distinct layer with multiple, dynamically selectable
>>>>> backends.
>>>>>
>>>>> Key features of the new framework include:
>>>>>
>>>>> 1. Modular Architecture: A clean separation between the
>>>>> generic datapath interface and specific offload
>>>>> provider implementations (e.g., `dpif-offload-tc`,
>>>>> `dpif-offload-dpdk`). `dpif` layers are now generic
>>>>> clients of the offload API (see the sketch after this list).
>>>>>
>>>>> 2. Provider-based System: Allows multiple offload backends
>>>>> to coexist.
>>>>>
>>>>> 3. Unified and Asynchronous API: Establishes a consistent
>>>>> API across all offload providers. For userspace
>>>>> datapaths, the API is extended to support asynchronous
>>>>> flow operations with callbacks, making `dpif-netdev` a
>>>>> more efficient client.
>>>>>
>>>>> 4. Enhanced Configuration: Provides granular control over
>>>>> offload provider selection through a global and per-port
>>>>> priority system (`hw-offload-priority`), allowing
>>>>> fine-tuned policies for different hardware.
>>>>>
>>>>> 5. Improved Testing: Includes a new test framework
>>>>> specifically for validating DPDK's rte_flow offloads,
>>>>> enhancing long-term maintainability.
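>>>>>
>>>>> To give a feel for the provider pattern, the sketch below shows
>>>>> roughly how a provider plugs into the framework, modeled on the
>>>>> existing `dpif_class` pattern. It is a simplified, hypothetical
>>>>> illustration; the structure and function names are placeholders,
>>>>> not the exact definitions from the patches:
>>>>>
>>>>>     /* Hypothetical provider interface (simplified). */
>>>>>     struct dpif_offload_class_sketch {
>>>>>         const char *type;   /* e.g. "tc", "dpdk", "dummy". */
>>>>>
>>>>>         int (*init)(void);
>>>>>         int (*port_add)(struct netdev *netdev, odp_port_t port_no);
>>>>>         int (*port_del)(odp_port_t port_no);
>>>>>
>>>>>         /* Synchronous flow operation, as used by kernel datapaths. */
>>>>>         int (*flow_put)(const struct flow_put_request *request);
>>>>>
>>>>>         /* Asynchronous variant for userspace datapaths: the result
>>>>>          * is delivered through a callback instead of blocking the
>>>>>          * caller, making dpif-netdev a more efficient client. */
>>>>>         int (*flow_put_async)(const struct flow_put_request *request,
>>>>>                               void (*done)(int error, void *aux),
>>>>>                               void *aux);
>>>>>     };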
>>>>>
>>>>> PATCH SERIES ORGANIZATION
>>>>> -------------------------------------------------------------
>>>>> This large series is organized logically to facilitate
>>>>> review:
>>>>>
>>>>> 1. Framework Foundation: The initial patches establish the
>>>>> core `dpif-offload-provider` framework, including the
>>>>> necessary APIs for port management, flow mark
>>>>> allocation, configuration, and a dummy provider for
>>>>> testing.
>>>>>
>>>>> 2. Provider Implementation: These patches introduce the new
>>>>> `dpif-offload-tc` and `dpif-offload-dpdk` providers,
>>>>> building out their specific implementations on top of
>>>>> the new framework.
>>>>>
>>>>> 3. API Migration and Decoupling: The bulk of the series
>>>>> systematically migrates functionality from the legacy
>>>>> `netdev-offload` layer to the new providers. Key
>>>>> commits here decouple `dpif-netdev` and, crucially,
>>>>> `dpif-netlink` from their hardware offload
>>>>> entanglements.
>>>>>
>>>>> 4. Cleanup: The final patches remove the now-redundant
>>>>> global APIs and structures from `netdev-offload`,
>>>>> completing the transition.
>>>>>
>>>>> BACKWARD COMPATIBILITY
>>>>> -------------------------------------------------------------
>>>>> This refactoring maintains full API compatibility from a
>>>>> user's perspective. All existing `ovs-vsctl` and
>>>>> `ovs-appctl` commands continue to function as before. The
>>>>> changes are primarily internal architectural improvements
>>>>> designed to make OVS more robust and extensible.
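>>>>>
>>>>> For example, an existing setup keeps working unchanged, and the new
>>>>> provider-priority knob is purely opt-in (the value format below is
>>>>> illustrative; the authoritative description is in the vswitch.xml
>>>>> changes in this series):
>>>>>
>>>>>     # Works exactly as before this series:
>>>>>     ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
>>>>>
>>>>>     # New and optional: prefer one provider over another.
>>>>>     ovs-vsctl set Open_vSwitch . \
>>>>>         other_config:hw-offload-priority=tc,dpdk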
>>>>>
>>>>> REQUEST FOR COMMENTS
>>>>> -------------------------------------------------------------
>>>>> This is a significant architectural change that affects
>>>>> core OVS infrastructure. We welcome feedback on:
>>>>>
>>>>> - The overall architectural approach and the
>>>>> `dpif-offload-provider` concept.
>>>>> - The API design, particularly the new asynchronous model
>>>>> for `dpif-netdev`.
>>>>> - The migration strategy and any potential backward
>>>>> compatibility concerns.
>>>>> - Performance implications of the new framework.
>>>> Just a few quick thoughts while I'm finishing up (I'm on patch 29):
>>>>
>>>> 1. There are a few groups that might make more sense to squash together
>>>> while applying. For example, patches 23-25 are a particularly egregious
>>>> example: a temporary change is introduced and then deleted by a
>>>> different change later on. While seeing the steps is nice, it is *really*
>>>> confusing - even on the review side, it's a bit jarring to see.
>>>> Typically we don't allow patches that clean up earlier patches in the
>>>> same series. It may be worth spending the time to squash some of
>>>> these, regardless of how messy it can be.
>>>>
>>>> 2. The dpdk offload stuff - my opinion is that it should be named after
>>>> rte_flow rather than dpdk. It could make things difficult in the
>>>> future if a second offload type becomes available for DPDK ports.
>>>> WDYT (I realize that's a rename that takes a bit of time to do)?
Hi, Aaron. I requested the rename on v2 of the set, and I still think
it should be 'dpdk' in all the user-facing parts, and internally by extension.
>> See the arguments here:
>> https://mail.openvswitch.org/pipermail/ovs-dev/2025-October/426681.html
>> In short, 'rte_flow' is a very user-unfriendly name as it doesn't mean much
>> to the average OVS user. And we already expose 'dpdk' to users today as the
>> name of the offload type. Changing that would be a backward-incompatible
>> change, while this set aims to be a refactor without major changes to the
>> current use of the feature. Having 'dpdk' on the user-facing side and
>> 'rte_flow' internally is confusing.
>>
>> If we ever need to have implementations of two separate DPDK APIs for flow
>> offload and have to give users a choice, we can go with a suffix on 'dpdk',
>> e.g. 'dpdk-ng' or 'dpdk-new-shiny-offload', but that is, IMO, very unlikely
>> to happen. More on that below.
>>
>>> Actually, on second thought, there is already a second offload
>>> type in DPDK - rte_flow_async_XXX.
>> It's not a different API; async is just a mode of the same API. Having
>> a second offload API in DPDK would be strange, as rte_flow is supposed
>> to be an abstraction layer covering offload for all supported HW.
>> Having a second offload API within DPDK would mean a complete failure of
>> rte_flow, and we'd likely just migrate to the new API instead, with
>> rte_flow getting deprecated and removed from DPDK in that situation.
>
> It was first introduced to support the HW steering mode of mlx cards.
>
> New mlx5 cards (>=CX9) support only HW steering, so the legacy rte_flow
> indeed could not work there.
>
> Since then, more work has been done to do the required things under
> the hood of the legacy rte_flow, so it can now work.
>
> It is a different API, with another set of requirements, semantics, etc.
> It's not just a "mode".
>
> The concept of this API is that there are QPs (Queue Pairs) in the
> SW->HW and HW->SW directions, carrying WQEs/CQEs
> (Work-Queue-Elements/Completion-Queue-Elements).
>
> The async API forms a WQE and puts it on the SW->HW queue, but the HW
> doesn't do anything with it unless explicitly requested (see the comment
> for struct rte_flow_op_attr).
>
> When the HW completes the operation, it puts a CQE on the other queue.
> The SW then must poll for it. The "user_data" specified in the async API
> is returned as a cookie in the poll API.
>
> Also, tables and templates must be created first.
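>
> In code, the sequence looks roughly like this (a simplified sketch,
> without the template/table setup and error handling; port_id, queue_id,
> table, the pattern/action arrays, cookie, and handle_completion() are
> assumed to exist):
>
>     /* Enqueue a WQE on the SW->HW queue; "postpone" defers the
>      * doorbell so that several operations can be batched. */
>     struct rte_flow_op_attr op_attr = { .postpone = 1 };
>     struct rte_flow_error error;
>     struct rte_flow *flow;
>
>     flow = rte_flow_async_create(port_id, queue_id, &op_attr, table,
>                                  patterns, pattern_template_index,
>                                  actions, actions_template_index,
>                                  cookie /* user_data */, &error);
>
>     /* Ring the doorbell: ask the HW to process the queued WQEs. */
>     rte_flow_push(port_id, queue_id, &error);
>
>     /* Later, poll the HW->SW queue for CQEs; the "user_data" cookie
>      * comes back with each completion. */
>     struct rte_flow_op_result results[32];
>     int n = rte_flow_pull(port_id, queue_id, results, 32, &error);
>     for (int i = 0; i < n; i++) {
>         handle_completion(results[i].user_data, results[i].status);
>     }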
OK. Thanks for the context! Though it's still part of rte_flow,
as otherwise it would not have this prefix. I understand that the async
and non-async functions are not really related to each other. But
if the new cards only supported the async way, then we'd likely end up
detecting this somehow and using async for these cards and non-async for
the others. Though I'd consider that an API failure, as rte_flow is
supposed to be a single API abstracting the hardware internals from the
application, even though it never actually succeeded on that front
(bug #958 is a prime example).
There would still be no need for OVS users to know what is
happening under the hood. If some card supports both, we'd just choose
the one that is better for one reason or another, as it would be
strange if the same hardware supported feature A only in async mode and
feature B only in non-async mode at the same time, with both of them
worth supporting in different scenarios. Of course, anything is
possible, but it seems unlikely.
> Regarding "not likely to happen" - I agree with that (since legacy
> rte_flow can support new mlx5 cards now, and also this API is not
> implemented by most PMDs). However, in the unlikely case it does,
> "dpdk-ng"/"new-shiny" are worse than calling it "rte-flow" from the start.
I'd say the "dpdk-async" version that Eelco suggested in a branch
of this thread is a good enough name if we ever need one. And I believe
we could come up with something decent if it were a brand new
API unrelated to rte_flow in any way.
> This naming is not for the "average OVS user", but for an OVS developer
> who wants to handle offloads.
I'd hope that most people who turn on offload in OVS are not OVS
developers. Or at least that should be the goal eventually. :) But I
suppose you mean people developing software platforms on top of OVS,
which is fair to some extent. Even for those, it was always painful
to set up a DPDK environment and configure OVS for good performance, and
we have tried to make it easier, providing sane defaults for most things
and easier-to-use configs. IMO, we should try to keep it that way,
unless extra complexity is necessary.
> I'm OK either way, just FYI.
Ack.
Best regards, Ilya Maximets.
>
>>
>>> If someone wants to implement an OVS dpif-offload-X based on the
>>> rte_flow_async API, it will indeed be confusing.
>> We'd likely need a separate set of high-level APIs on the OVS side in
>> order to support async operation. And our offload already happens
>> asynchronously with the traffic processing, so I'm not sure we need to
>> use the rte_flow_async functions. And if we do migrate to that
>> implementation, I'm not sure why we would let users choose rather than
>> choosing the best implementation ourselves.
>
> The "async" in rte_flow_async is not related to OVS mode of async
> threads. See above.
>
>>
>> Best regards, Ilya Maximets.
>>
>>>> 3. During my read-ahead, I got the heebie-jeebies with
>>>> [34/40] dpif_netdev: Fix nullable memcpy in queue_netdev_flow_put().
>>>> I haven't completed my review of that portion, but I suspect that if
>>>> it's a real issue, it could be moved to the front of the series, or
>>>> even split out and applied separately. WDYT?
>>>>
>>>>> -------------------------------------------------------------
>>>>> Changes from v1:
>>>>> - Fixed issues reported by Aaron and my AI experiments.
>>>>> - See individual patches for specific changes.
>>>>>
>>>>> Changes from v2:
>>>>> - See individual patches for specific changes.
>>>>>
>>>>> Changes from v3:
>>>>> - In patch 7, removed leftover netdev_close(port->netdev)
>>>>> causing netdev reference count issues.
>>>>> - Changed naming for dpif_offload_impl_type enum entries.
>>>>> - Merged previous patch 36, 'dpif-netdev: Add full name
>>>>> to the dp_netdev structure', with the next patch.
>>>>>
>>>>> Eelco Chaudron (40):
>>>>> dpif-offload-provider: Add dpif-offload-provider implementation.
>>>>> dpif-offload: Add provider for tc offload.
>>>>> dpif-offload: Add provider for dpdk (rte_flow).
>>>>> dpif-offload: Allow configuration of offload provider priority.
>>>>> dpif-offload: Move hw-offload configuration to dpif-offload.
>>>>> dpif-offload: Add offload provider set_config API.
>>>>> dpif-offload: Add port registration and management APIs.
>>>>> dpif-offload-tc: Add port management framework.
>>>>> dpif-offload-dpdk: Add port management framework.
>>>>> dpif-offload: Validate mandatory port class callbacks on registration.
>>>>> dpif-offload: Allow per-port offload provider priority config.
>>>>> dpif-offload: Introduce provider debug information API.
>>>>> dpif-offload: Call flow-flush netdev-offload APIs via dpif-offload.
>>>>> dpif-offload: Call meter netdev-offload APIs via dpif-offload.
>>>>> dpif-offload: Move the flow_get_n_flows() netdev-offload API to dpif.
>>>>> dpif-offload: Move hw_post_process netdev API to dpif.
>>>>> dpif-offload: Add flow dump APIs to dpif-offload.
>>>>> dpif-offload: Move the tc flow dump netdev APIs to dpif-offload.
>>>>> dpif-netlink: Remove netlink-offload integration.
>>>>> dpif-netlink: Add API to get offloaded netdev from port_id.
>>>>> dpif-offload: Add API to find offload implementation type.
>>>>> dpif-offload: Add operate implementation to dpif-offload.
>>>>> netdev-offload: Temporarily move thread-related APIs to dpif-netdev.
>>>>> dpif-offload: Add port dump APIs to dpif-offload.
>>>>> dpif-netdev: Remove indirect DPDK netdev offload API calls.
>>>>> dpif: Add dpif_get_features() API.
>>>>> dpif-offload: Add flow operations to dpif-offload-tc.
>>>>> dpif-netlink: Remove entangled hardware offload.
>>>>> dpif-offload-tc: Remove netdev-offload dependency.
>>>>> netdev_dummy: Remove hardware offload override.
>>>>> dpif-offload: Move the netdev_any_oor() API to dpif-offload.
>>>>> netdev-offload: Remove the global netdev-offload API.
>>>>> dpif-offload: Add inline flow APIs for userspace datapaths.
>>>>> dpif_netdev: Fix nullable memcpy in queue_netdev_flow_put().
>>>>> dpif-offload: Move offload_stats_get() API to dpif-offload.
>>>>> dpif-offload-dpdk: Abstract rte_flow implementation from dpif-netdev.
>>>>> dpif-offload-dummy: Add flow add/del/get APIs.
>>>>> netdev-offload: Fold netdev-offload APIs and files into dpif-offload.
>>>>> tests: Fix NSH decap header test for real Ethernet devices.
>>>>> tests: Add a simple DPDK rte_flow test framework.
>>>>>
>>>>> Documentation/topics/testing.rst | 19 +
>>>>> include/openvswitch/json.h | 1 +
>>>>> include/openvswitch/netdev.h | 1 +
>>>>> lib/automake.mk | 17 +-
>>>>> lib/dp-packet.h | 1 +
>>>>> lib/dpctl.c | 50 +-
>>>>> lib/dpdk.c | 2 -
>>>>> lib/dpif-netdev-avx512.c | 4 +-
>>>>> lib/dpif-netdev-private-flow.h | 9 +-
>>>>> lib/dpif-netdev.c | 1244 +++---------
>>>>> lib/dpif-netlink.c | 557 +----
>>>>> ...load-dpdk.c => dpif-offload-dpdk-netdev.c} | 592 ++++--
>>>>> lib/dpif-offload-dpdk-private.h | 73 +
>>>>> lib/dpif-offload-dpdk.c | 1186 +++++++++++
>>>>> lib/dpif-offload-dummy.c | 920 +++++++++
>>>>> lib/dpif-offload-provider.h | 421 ++++
>>>>> ...-offload-tc.c => dpif-offload-tc-netdev.c} | 238 ++-
>>>>> lib/dpif-offload-tc-private.h | 76 +
>>>>> lib/dpif-offload-tc.c | 877 ++++++++
>>>>> lib/dpif-offload.c | 1790 +++++++++++++++++
>>>>> lib/dpif-offload.h | 221 ++
>>>>> lib/dpif-provider.h | 65 +-
>>>>> lib/dpif.c | 166 +-
>>>>> lib/dpif.h | 14 +-
>>>>> lib/dummy.h | 9 +
>>>>> lib/json.c | 7 +
>>>>> lib/netdev-dpdk.c | 9 +-
>>>>> lib/netdev-dpdk.h | 2 +-
>>>>> lib/netdev-dummy.c | 199 +-
>>>>> lib/netdev-linux.c | 3 +-
>>>>> lib/netdev-offload-provider.h | 148 --
>>>>> lib/netdev-offload.c | 910 ---------
>>>>> lib/netdev-offload.h | 169 --
>>>>> lib/netdev-provider.h | 10 +-
>>>>> lib/netdev.c | 71 +-
>>>>> lib/netdev.h | 22 +
>>>>> lib/tc.c | 2 +-
>>>>> ofproto/ofproto-dpif-upcall.c | 50 +-
>>>>> ofproto/ofproto-dpif.c | 90 +-
>>>>> tests/.gitignore | 3 +
>>>>> tests/automake.mk | 24 +
>>>>> tests/dpif-netdev.at | 40 +-
>>>>> tests/ofproto-dpif.at | 170 ++
>>>>> tests/ofproto-macros.at | 17 +-
>>>>> tests/sendpkt.py | 12 +-
>>>>> tests/system-dpdk-offloads-macros.at | 236 +++
>>>>> tests/system-dpdk-offloads-testsuite.at | 28 +
>>>>> tests/system-dpdk-offloads.at | 223 ++
>>>>> tests/system-dpdk.at | 35 +
>>>>> tests/system-kmod-macros.at | 5 +
>>>>> tests/system-offloads-testsuite-macros.at | 5 +
>>>>> tests/system-offloads-traffic.at | 48 +
>>>>> tests/system-traffic.at | 9 +-
>>>>> tests/system-userspace-macros.at | 5 +
>>>>> vswitchd/bridge.c | 7 +-
>>>>> vswitchd/vswitch.xml | 43 +
>>>>> 56 files changed, 7808 insertions(+), 3347 deletions(-)
>>>>> rename lib/{netdev-offload-dpdk.c => dpif-offload-dpdk-netdev.c} (83%)
>>>>> create mode 100644 lib/dpif-offload-dpdk-private.h
>>>>> create mode 100644 lib/dpif-offload-dpdk.c
>>>>> create mode 100644 lib/dpif-offload-dummy.c
>>>>> create mode 100644 lib/dpif-offload-provider.h
>>>>> rename lib/{netdev-offload-tc.c => dpif-offload-tc-netdev.c} (95%)
>>>>> create mode 100644 lib/dpif-offload-tc-private.h
>>>>> create mode 100644 lib/dpif-offload-tc.c
>>>>> create mode 100644 lib/dpif-offload.c
>>>>> create mode 100644 lib/dpif-offload.h
>>>>> delete mode 100644 lib/netdev-offload-provider.h
>>>>> delete mode 100644 lib/netdev-offload.c
>>>>> delete mode 100644 lib/netdev-offload.h
>>>>> create mode 100644 tests/system-dpdk-offloads-macros.at
>>>>> create mode 100644 tests/system-dpdk-offloads-testsuite.at
>>>>> create mode 100644 tests/system-dpdk-offloads.at