On 2/8/2021 6:21 PM, Sriharsha Basavapatna wrote:
On Mon, Feb 8, 2021 at 7:33 PM Eli Britstein <el...@nvidia.com> wrote:

On 2/8/2021 3:11 PM, Sriharsha Basavapatna wrote:
On Sun, Feb 7, 2021 at 4:58 PM Eli Britstein <el...@nvidia.com> wrote:
On 2/5/2021 8:26 PM, Sriharsha Basavapatna wrote:
On Fri, Feb 5, 2021 at 4:55 PM Sriharsha Basavapatna
<sriharsha.basavapa...@broadcom.com> wrote:
On Wed, Jan 27, 2021 at 11:40 PM Eli Britstein <el...@nvidia.com> wrote:
VXLAN decap in an OVS-DPDK configuration consists of two flows:
F1: in_port(ens1f0),eth(),ipv4(),udp(), actions:tnl_pop(vxlan_sys_4789)
F2: tunnel(),in_port(vxlan_sys_4789),eth(),ipv4(), actions:ens1f0_0

F1 is a classification flow. It matches on the outer headers, classifies
the packet as a VXLAN packet, and through the tnl_pop action hands the
packet over to F2 for further processing.
F2 matches on the tunnel metadata as well as on the inner packet headers
(like any other flow).

In order to fully offload the VXLAN decap path, both F1 and F2 should be
offloaded. As there is more than one flow in HW, it is possible that F1
is executed by HW but F2 is not. In that case the packet is received by
SW and should be processed starting from F2, as F1 was already done by HW.
rte_flow rules are applicable only to physical ports, so vport flows
(e.g. F2) are applied on the uplink ports attached to OVS.
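
To illustrate the miss path in SW, here is a minimal sketch (the
function name and surrounding plumbing are illustrative, not the
literal patch code) of recovering the packet state with the restore
API from the DPDK tunnel offload series ([1] below):

    #include <rte_flow.h>
    #include <rte_mbuf.h>

    /* Called for packets received on an uplink port, before SW
     * classification.  If HW already executed F1 (matched the outer
     * headers and popped the tunnel), restore the tunnel metadata it
     * attached to the packet, so SW processing can start at F2. */
    static int
    vport_hw_miss_packet_recover(uint16_t phy_port_id,
                                 struct rte_mbuf *mbuf)
    {
        struct rte_flow_restore_info info;
        struct rte_flow_error error;

        if (rte_flow_get_restore_info(phy_port_id, mbuf, &info, &error)) {
            return -1;  /* No HW state on this packet; SW starts at F1. */
        }
        if (info.flags & RTE_FLOW_RESTORE_INFO_TUNNEL) {
            /* info.tunnel holds the outer headers matched by F1
             * (tun_id, outer IPs, tp_dst, ...).  Copy them into the
             * packet's tunnel metadata before classification. */
        }
        return 0;
    }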

This patch-set makes use of [1], introduced in DPDK 20.11, which adds an
API for tunnel offloads.
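
For reference, a rough sketch of how the rule-setup side of that API is
meant to be used (assumed usage, simplified; error handling omitted):

    #include <rte_byteorder.h>
    #include <rte_flow.h>

    uint16_t phy_port_id = 0;            /* The uplink port. */
    struct rte_flow_tunnel tunnel = {
        .type = RTE_FLOW_ITEM_TYPE_VXLAN,
        .tp_dst = RTE_BE16(4789),
    };
    struct rte_flow_action *pmd_actions;
    struct rte_flow_item *pmd_items;
    uint32_t num_pmd_actions, num_pmd_items;
    struct rte_flow_error error;

    /* F1: ask the PMD for the actions that implement tnl_pop for this
     * tunnel; they are added to F1's rte_flow action list. */
    rte_flow_tunnel_decap_set(phy_port_id, &tunnel, &pmd_actions,
                              &num_pmd_actions, &error);

    /* F2: ask the PMD for the items that match packets belonging to
     * this tunnel; they are prepended to F2's rte_flow match items. */
    rte_flow_tunnel_match(phy_port_id, &tunnel, &pmd_items,
                          &num_pmd_items, &error);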

Travis:
v1: https://travis-ci.org/github/elibritstein/OVS/builds/756418552

GitHub Actions:
v1: https://github.com/elibritstein/OVS/actions/runs/515334647

[1] https://mails.dpdk.org/archives/dev/2020-October/187314.html

Eli Britstein (13):
     netdev-offload: Add HW miss packet state recover API
     netdev-dpdk: Introduce DPDK tunnel APIs
     netdev-offload-dpdk: Implement flow dump create/destroy APIs
     netdev-dpdk: Add flow_api support for netdev vxlan vports
     netdev-offload-dpdk: Implement HW miss packet recover for vport
     dpif-netdev: Add HW miss packet state recover logic
     netdev-offload-dpdk: Change log rate limits
     netdev-offload-dpdk: Support tunnel pop action
     netdev-offload-dpdk: Refactor offload rule creation
     netdev-dpdk: Introduce an API to query if a dpdk port is an uplink
       port
     netdev-offload-dpdk: Map netdev and ufid to offload objects
     netdev-offload-dpdk: Support vports flows offload
     netdev-dpdk-offload: Add vxlan pattern matching function

Ilya Maximets (2):
     netdev-offload: Allow offloading to netdev without ifindex.
     netdev-offload: Disallow offloading to unrelated tunneling vports.

    Documentation/howto/dpdk.rst  |   1 +
    NEWS                          |   2 +
    lib/dpif-netdev.c             |  49 +-
    lib/netdev-dpdk.c             | 135 ++++++
    lib/netdev-dpdk.h             | 104 ++++-
    lib/netdev-offload-dpdk.c     | 851 +++++++++++++++++++++++++++++-----
    lib/netdev-offload-provider.h |   5 +
    lib/netdev-offload-tc.c       |   8 +
    lib/netdev-offload.c          |  29 +-
    lib/netdev-offload.h          |   1 +
    10 files changed, 1033 insertions(+), 152 deletions(-)

--
2.28.0.546.g385c171

Hi Eli,

Thanks for posting this new patchset to support tunnel decap action offload.

I haven't looked at the entire patchset yet, but I focused on the
patches that introduce a 1-to-many mapping between an OVS flow (f2) and
HW offloaded flows.

Here is a representation of the design proposed in this patchset. A
flow f2 (inner flow) between the VxLAN-vPort and VFRep-1, for which
the underlying uplink/physical port is P0, gets offloaded not only to
P0, but also to the other physical ports P1, P2, ... Pn.

       P0 <----> VxLAN-vPort <----> VFRep-1    (the flow's actual path)

       P1
       P2        (unrelated physical ports that also get f2 offloaded)
       ...
       Pn
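
In code terms, since the offload layer cannot tell which uplink the
tunnel traffic actually uses, it effectively does something like this
for every f2 (an illustrative sketch, not the literal patch code;
assume attr/items/actions were built for f2 earlier):

    struct rte_flow *flows[RTE_MAX_ETHPORTS];
    struct rte_flow_error error;
    uint16_t port_id;
    int n = 0;

    /* Install the same f2 rule on every physical port, whether it is
     * related to the flow or not. */
    RTE_ETH_FOREACH_DEV(port_id) {
        flows[n++] = rte_flow_create(port_id, &attr, items, actions,
                                     &error);
    }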

IMO, the problems with this design are:

- Offloading a flow to an unrelated physical device that has nothing
to do with that flow (invalid device for the flow).
- Offloading to not just one, but several such invalid physical devices.
- Consuming HW resources for a flow that is never seen or intended to
be processed by those physical devices.
- Impacts flow scale on other physical devices, since it would consume
their HW resources with a large number of such invalid flows.
- The indirect list used to track these multiple mappings complicates
the offload layer implementation.
- The addition of flow_dump_create() to the offload APIs, just to parse
and get a list of user datapath netdevs, is confusing and not needed.

I have been exploring an alternate design to address this problem of
figuring out the right physical device for a given tunnel inner-flow.
I will send a patch, please take a look so we can continue the discussion.
I just posted this patch, please see the link below; it is currently
based on the decap offload patchset (though it can be rebased without
the decap patchset if needed). The patch provides changes to pass the
physical port (orig_in_port) information to the offload layer in the
context of flow F2. Note that additional changes would be needed in
the decap patchset to utilize this, if we agree on this approach.

https://patchwork.ozlabs.org/project/openvswitch/patch/20210205180413.43337-1-sriharsha.basavapa...@broadcom.com/
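
A minimal sketch of the idea (the names below are illustrative; the
actual structures are in the patch above):

    /* Extra metadata handed from the datapath to the offload layer
     * along with flow F2 (odp_port_t is OVS's datapath port number
     * type).  The datapath already knows which physical port received
     * the packet that triggered flow creation, so the offload layer
     * can install the rule on that one port only. */
    struct flow_offload_md {
        odp_port_t in_port;       /* The vport, e.g. vxlan_sys_4789. */
        odp_port_t orig_in_port;  /* The physical port, e.g. P0. */
    };
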
Thanks for that.

However, we cannot say that packets on that flow *cannot* arrive from
other PFs. For example, consider ECMP: packets can arrive from either
port, depending on external routing decisions.
Let us say P0's IP is 192.168.1.x and P1's IP is 172.16.1.x. How can
packets for P0 be received over P1 ? With ECMP, packets could take a
different route in the network. But the destination remains the same.
Suppose both P0 and P1 are under br-phy, and the IP is that of br-phy.
How can you have multiple uplink ports connected to the same bridge,
unless you have them in some kind of bonding/link-aggregation
configuration ? Also, for br-phy you need to provide the mac address
of the underlying physical interface to the bridge.
Take a look at step-5 in this document:
https://docs.openvswitch.org/en/latest/howto/userspace-tunneling/
If you have 2 uplink ports, which mac address will you specify ?

Another example: each port is on a different bridge, and the IP is
moved from one to the other.
Here you are changing the tunnel configuration; this has to go through
some kind of reconfiguration or flow modification, since you are
changing the IP/mac of br-phy.

Thus, if we want to support such cases, we should keep the 1-to-many
indirection, and not assume that packets will always arrive on the
same port as the first packet that created the flow.
The rationale behind offloading to multiple ports in your patches is
not to support ECMP, but that there is no way to figure out the uplink
port for the flow in the context of a vport. I don't see ECMP
mentioned anywhere in your patchset?
ECMP is just an example. The point is we cannot guarantee the "correct"
in_port of a vport.
The patch that I sent out provides the correct in_port for a given
flow [because we already know the port on which it was received; it
just needs to be passed as metadata!].

Also, please note that the HW rule is applied on PFs only, so in your
example P1, ..., Pn, "n" is expected to be quite a small number.
You can't predict or control the number of physical devices that can
be configured in the system (e.g., multiple physical bridges each
connecting to a different network). So you cannot assume that "n" is a
small number. The point here is that with the current proposal, we end
up offloading a flow incorrectly to unrelated physical devices.
"n" is limited by the number of physical cards there are in the system
(and used by OVS), unlike VFs (or SFs) that can be much more, so though
I cannot predict it, the point is that the number of "unrelated" rules
is probably not high.
There are cards with multiple physical ports and there could be
multiple such physical cards in the system. And since we are talking
about vport flows (F2), it could be thousands of such flows
incorrectly offloaded on each of those other ports, on each card.

Another point regarding this is that some of the "unrelated" rules may
not be offloaded after all, as they can be rejected by the PMDs, for
example if PF->Px cannot be supported. On the other hand, drop is
probably supported on all PMDs.

Right, we may end up with more flows, as we cannot be sure what is a
"related" and what is an "unrelated" physical device.
Thanks for agreeing that we will end up with more flows. This is
precisely what we want to avoid: ending up with more flows offloaded
on devices that are not in the picture at all w.r.t. the flows in
question.

BTW, note that the Linux kernel also uses the same approach, using
flow_indr_dev_register, which is used by the broadcom, mlx5 and
netronome drivers.
I've seen that, and it is also complicated (independently of how many
drivers use it); I'm sure it can be simplified too.

I agree that your proposal works for the simple use cases. As there
must be some kind of indirection anyway, because after all we must
apply the rule on a physical device that is not the vport itself, I
think the cost of this approach to cover more use cases is worth it.
Can we go ahead with this approach for now and build on it for more
use cases as needed later?

OK. Let's start with supporting the simple use cases. I will take your patch as a reference and re-spin the series.

We may enhance it in the future.

Thanks.


Thanks,
-Harsha