On Tue, Feb 18, 2020 at 3:30 PM Eli Britstein <el...@mellanox.com> wrote:
>
> On 2/10/2020 11:16 PM, Hemal Shah wrote:
> > Eli,
> >
> > There are some fundamental architecture issues (multiple HW tables vs. a single HW table, use of the rte_flow JUMP action for mapping the OVS SW tnl_pop action, etc.) that we need to discuss before we can get into the details of the patchset. I'm inlining my comments on the discussion between you and Harsha below.
> >
> > Hemal
> >
> > On Wed, Feb 5, 2020 at 6:31 AM Eli Britstein <el...@mellanox.com> wrote:
> >
> > On 2/3/2020 7:05 PM, Sriharsha Basavapatna wrote:
> > > Hi Eli,
> > >
> > > Thanks for sending this patchset. I have some questions about the design, please see my comments below.
> > >
> > > Thanks,
> > > -Harsha
> > >
> > > On Mon, Jan 20, 2020 at 8:39 PM Eli Britstein <el...@mellanox.com> wrote:
> > >> In the netdev datapath, packets arriving over a tunnel are processed by more than one flow. For example, a packet that should be decapsulated and forwarded to another port is processed by two flows:
> > >> 1. in_port=<PF>, outer-header matches, dst-port=4789 (VXLAN), actions:tnl_pop(vxlan_sys_4789).
> > >> 2. in_port=vxlan_sys_4789, tunnel matches (VNI for example), inner-header matches, actions: vm1.
> > >>
> > >> In order to offload such a multi-flow processing path, a multi-table HW model is used.
> > >
> > > Did you explore other ways to avoid this multi-flow processing model? For example, maybe support some additional logic to merge the two datapath flow rules for decap into a single flow rule in the offload layer?
> >
> > Yes, we did explore this approach, and our research led us to conclude that following the OVS multi-table SW model in HW proves to be a more robust architecture.
> >
> > [Hemal] It will be good and prudent for you to share details of that research to help the community understand how you arrived at the conclusion above. Can you share it?
>
> The HW offload scheme should follow the SW model. This is not only about the current VXLAN patch set, but rather about an infrastructure for offloading more actions. Once HW offloading does not follow the SW model, it becomes difficult or impractical to support them. Squashing multi-table flows into single rules might work, but only for specific use cases. VXLAN with a fixed-size outer header is one of them, but other use cases are not. There are actions, like DP_HASH, that perform calculations on the packets to be used in the next recirculation. For example:
>
> recirc_id(0),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(proto=6,frag=no), packets:26, bytes:1924, used:0.105s, flags:S, actions:hash(sym_l4(0)),recirc(0x5)
>
> recirc_id(0x5),dp_hash(0x315c2571/0x3),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(proto=6,frag=no), packets:0, bytes:0, used:never, actions:3
> recirc_id(0x5),dp_hash(0x315ca562/0x3),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(proto=6,frag=no), packets:0, bytes:0, used:never, actions:2
> ..
> 4 rules
>
> 1000000 TCP/UDP sessions will not require more data plane flows. In a single table, every separate 5-tuple must be matched, so 1000000 sessions will create 1000000 data plane flows.
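(As a rough illustration of the scaling argument above: with B dp_hash buckets and N active sessions, the multi-table program needs about 1 + B rules, 4 in this example, independent of N; a flattened single table needs one rule per distinct 5-tuple, i.e. on the order of N rules.)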
[Harsha] I understand your concerns and the requirement from different use cases (like dp-hash) that might need multi-table flows. But for VXLAN decap offload, we are questioning the need for multi-table. We are looking for the design options that you considered to support the VXLAN use case with a single table. Can you please provide specific details on the challenges?

> > Single-table architecture for tunnel offloads, for example, has the following issues:
> >
> > 1. Statistics – flow #1 counts bytes including the outer header. If the outer header is IPv6, it can have extensions, which must be counted per packet. Flow #2 counts the bytes of the inner packet only.
> >
> > [Hemal] For fixed-format tunnels, you can calculate the byte counts of flow #1 from the flow #2 counts.
>
> For fixed-format tunnels, yes. However, this cannot be done for variable-size header protocols such as IPv6, VXLAN over VLAN, Geneve and others.

[Harsha] For variable-size headers, maybe we could do some approximations, and the trade-off is reduced code complexity. In the long term we want to explore the single-table option for fixed headers; meanwhile, we can continue to work with the proposed two-table model.

> > 2. Update rate – flattening multiple tables into a single one means a cartesian product of all the flows. Any change of a single flow in one table means a lot of removals and insertions of rules in HW.
> >
> > [Hemal] Can you quantify "a lot" with specific examples? Flow #2 can't be offloaded without flow #1 being offloaded. It is less likely that flow #1 will change as frequently as flow #2. If flow #1 is updated (tunnel change) and flow #2 is using fields from the tunnel headers that are changed in flow #1, then you have to update flow #2 anyway. This is no different from the software case when a tunnel is changed.
>
> This is correct for the VXLAN use case.

[Harsha] Ok, so update rate is not really a concern for supporting a single table in the VXLAN use case.

> > The single-table model also breaks when offloading more complex actions such as CT, dp_hash and others.
> >
> > [Hemal] Can you be more specific about what it breaks?
>
> The cartesian product of rules can scale to a very large number of single rules. See the DP_HASH example above.

[Harsha] The offload design currently being discussed in this patch set is for VXLAN. We can certainly discuss other features like CT and dp-hash in their own context at the appropriate time. Like we said earlier, we are not saying there is no use case for multi-table; we are just questioning its relevance to the VXLAN use case.

> > >> Flow #1 matches are offloaded as is, but tnl_pop is translated to MARK and JUMP (to a tnl-port table) actions.
> > >
> > > The response to the tnl_pop action is now different between the non-offload and offload datapaths, since the offload layer changes it into entirely different actions. The user would expect a consistent response to the tnl_pop action, especially if you treat this flow rule as an independent rule by itself (i.e., assuming there's no flow #2 subsequently).
> > >
> > > Regarding the JUMP action: IMO, the offload framework should avoid anything that is too vendor specific and make use of simple/common abstractions. I'm not sure if the JUMP action is really needed to support offload of the tnl_pop action. Also, it may not map to the real use case for which the JUMP action seems to have been defined, which is to support flow grouping (see rte_flow.h).
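(For concreteness, a minimal sketch of how the flow #1 translation described above could look in rte_flow terms. The group and mark values and the helper name are illustrative placeholders, not the patch set's actual code; the real series derives them from its table-id and miss-context mapping APIs.)

#include <rte_byteorder.h>
#include <rte_flow.h>

/* Flow #1 sketch: match the outer headers in group 0, MARK the packet so
 * SW can recover it on a HW miss, and JUMP to the tnl-port table. */
static struct rte_flow *
offload_tnl_pop_sketch(uint16_t port_id, uint32_t tnl_port_group,
                       uint32_t tnl_pop_mark)
{
    struct rte_flow_attr attr = { .group = 0, .ingress = 1 };

    /* Outer-header matches; a real rule also matches dst MAC/IP, etc. */
    struct rte_flow_item_udp udp_spec = {
        .hdr = { .dst_port = RTE_BE16(4789) },  /* VXLAN */
    };
    struct rte_flow_item_udp udp_mask = {
        .hdr = { .dst_port = RTE_BE16(0xffff) },
    };
    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
        { .type = RTE_FLOW_ITEM_TYPE_UDP,
          .spec = &udp_spec, .mask = &udp_mask },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };

    struct rte_flow_action_mark mark = { .id = tnl_pop_mark };
    struct rte_flow_action_jump jump = { .group = tnl_port_group };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark },
        { .type = RTE_FLOW_ACTION_TYPE_JUMP, .conf = &jump },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    struct rte_flow_error error;
    return rte_flow_create(port_id, &attr, pattern, actions, &error);
}

A miss after this rule leaves the packet marked, so SW can map the MARK value back to the vport and continue normal processing, as the cover letter describes.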
> > The rte_flow API does not offer an action equivalent to moving the outer headers to some metadata buffer, like the SW tnl_pop action does. The JUMP action is equivalent to recirculating on the vport, and the MARK action is there to make sure the user gets a consistent response in SW/HW (again, sticking to the SW model), so that in case of a miss the packet is recovered to continue SW processing normally.
> >
> > [Hemal] This seems like a gap in the rte_flow API mapping of the OVS SW tnl_pop action. Should we look into extending the rte_flow API to cover this case rather than changing the use case for JUMP?
>
> It is not only the rte_flow API. It would also require providing the tunnel info that was parsed by the hardware to the software, via the mbuf, so this may need to change as well. Furthermore, the HW should also support this kind of action, so it gets complicated.

[Harsha] Ok.

> > >> Flow #2 is offloaded to that tnl-port table; its tunnel matches are translated to outer-header matches, followed by the inner-header ones. The actions start with an implicit DECAP action followed by the flow's actions. The DECAP is done in flow #2. If it was done in flow #1, the outer headers of the packets would have been lost, and they could not have been matched in flow #2.
> > >
> > > The only missing match field that would allow decap in flow #1 itself seems to be the tunnel-id. If that is provided in flow #1, then we may not need to match it again in flow #2, and flow #2 could just match on the inner packet headers. This would also make handling a flow miss simple, since you wouldn't need the recovery logic (the offload layer popping the header, etc.). The packet would be in the right state for the next stage of processing in SW if there's a flow miss after the first stage in HW. It also means tnl_pop can be done in flow #1, making the behavior consistent with non-offload processing.
> > >
> > > The other advantage of providing the tunnel-id in flow #1 is that we wouldn't need to match on the entire packet, including the outer headers, in the second stage; only the inner-header fields would need to be matched in flow #2.
> >
> > Even if we could somehow get the matches from flow #2 into flow #1 (leading back to the single-flow model...) and do the DECAP in flow #1, we would then have to change the SW matches in flow #2 in case it is processed in SW, as it won't have the outer header/tunnel anymore.
> >
> > [Hemal] Flow matches are defined by some policy. How is changing the match fields of flow #2 an issue? Can you elaborate? Aren't these matches defined by a user or an agent?
>
> In flow #2 there are matches on the tunnel (outer headers). If those are moved into flow #1 in HW together with a DECAP action, then for processing flow #2 in SW those matches would have to be omitted when flow #1 is processed in HW (they are already matched, and gone after the DECAP), but kept when it is processed in SW.

[Harsha] The current offload proposal introduces the group-id in the match, which is outside of the actual OVS (SW) flow match. Have you thought about a solution that does not require a group-id? For the long term, we should explore a unified flow #2 approach where a group-id is not needed; any thoughts on that?
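(Continuing the sketch above: flow #2 could land in the tnl-port group roughly as follows. It re-matches the outer headers and the VNI, matches the inner headers, and starts with the implicit DECAP; the group id is the only extra match key relative to the SW flow. The example VNI, the PORT_ID output and the helper name are illustrative assumptions, not the patch set's exact code.)

/* Flow #2 sketch: offloaded into the tnl-port group, with an implicit
 * VXLAN DECAP preceding the flow's own actions. */
static struct rte_flow *
offload_tnl_flow_sketch(uint16_t port_id, uint32_t tnl_port_group,
                        uint32_t dst_port_id)
{
    struct rte_flow_attr attr = { .group = tnl_port_group, .ingress = 1 };

    struct rte_flow_item_vxlan vxlan_spec = {
        .vni = { 0x00, 0x00, 0x64 },  /* example VNI 100 */
    };
    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH },   /* outer headers, translated */
        { .type = RTE_FLOW_ITEM_TYPE_IPV4 },  /* from the SW tunnel match  */
        { .type = RTE_FLOW_ITEM_TYPE_UDP },
        { .type = RTE_FLOW_ITEM_TYPE_VXLAN, .spec = &vxlan_spec },
        { .type = RTE_FLOW_ITEM_TYPE_ETH },   /* inner-header matches */
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };

    struct rte_flow_action_port_id out = { .id = dst_port_id };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_VXLAN_DECAP },
        { .type = RTE_FLOW_ACTION_TYPE_PORT_ID, .conf = &out },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    struct rte_flow_error error;
    return rte_flow_create(port_id, &attr, pattern, actions, &error);
}

On a HW miss in this group the packet reaches SW carrying only the MARK set by flow #1, which is why the series needs the recover-by-mark logic discussed above.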
Thanks,
-Harsha

> > >> Furthermore, once the HW processing path is composed of multi-table processing, there is a need to handle the case where processing starts in HW but is not completed in HW. For example, only flow #1 is offloaded while flow #2 doesn't exist. For that, the MARK value set in the tnl_pop action is received in SW, the packet is recovered by the mapped vport, and processing continues in SW.
> > >
> > > If it is not possible to provide the tunnel-id in flow #1, then the other alternative might be to defer offloading until flow #2 and offload only flow #2. You are anyway matching almost the entire outer headers and the inner header in flow #2, and the decap action is also added implicitly to flow #2. Why not avoid offloading flow #1 entirely? That way, none of the miss-handling code is required.
> >
> > See the first argument about following the SW model.
> >
> > >> As vports are virtual and cannot have rte_flows, offloading a vport generates rte_flows for every PF in the system, as we cannot know from which one the packets arrive.
> > >
> > > Is it possible to track the "original in-port" through this two-stage flow processing and generate the rte_flow only on the specific PF that received the packet?
> >
> > We could "track" it only if we tried to merge both flow #1 and flow #2. See the first argument about following the SW model.
> >
> > > Thanks,
> > > -Harsha
> > >
> > >> This series adds support for IPv6, tunnel push actions, and the multi-table model for the decap processing.
> > >>
> > >> v1 Travis passed: https://travis-ci.org/elibritstein/OVS/builds/639484233
> > >>
> > >> Eli Britstein (14):
> > >>   dpif-netdev: Add mega ufid in flow add log
> > >>   netdev-offload-dpdk: Remove pre-validate of patterns function
> > >>   netdev-offload-dpdk: Add IPv6 pattern matching
> > >>   netdev-offload-dpdk: Support offload of clone tnl_push/output actions
> > >>   netdev-offload-dpdk: Support tnl/push using vxlan encap attribute
> > >>   netdev-offload: Don't use zero flow mark
> > >>   netdev-offload-dpdk: Introduce map APIs for table id and miss context
> > >>   netdev-offload-dpdk: Implement HW miss packet recover for vport
> > >>   netdev-offload-dpdk: Refactor disassociate ufid function
> > >>   netdev-offload-dpdk: Support tunnel pop action
> > >>   netdev-offload-dpdk: Implement flow dump create/destroy APIs
> > >>   netdev-dpdk: Getter function for dpdk devargs API
> > >>   netdev-dpdk: Introduce get netdev by devargs function
> > >>   netdev-dpdk-offload: Add vxlan pattern matching function
> > >>
> > >> Ilya Maximets (4):
> > >>   netdev: Allow storing dpif type into netdev structure.
> > >>   netdev-offload: Use dpif type instead of class.
> > >>   netdev-offload: Allow offloading to netdev without ifindex.
> > >>   netdev-offload: Disallow offloading to unrelated tunneling vports.
> > >>
> > >> Ophir Munk (6):
> > >>   dpif-netdev: Make mark allocation public API
> > >>   dpif-netdev: Add HW miss packet state recover logic
> > >>   netdev-dpdk: Introduce an API to query if a dpdk port is an uplink port
> > >>   netdev-dpdk-offload: Infrastructure for multiple rte_flows per UFID
> > >>   netdev-dpdk: Add flow_api support for netdev vxlan vports
> > >>   netdev-offload-dpdk: Support offload for vxlan vport
> > >>
> > >> Oz Shlomo (1):
> > >>   netdev-offload: Add HW miss packet state recover API
> > >>
> > >>  Documentation/howto/dpdk.rst  |    5 +-
> > >>  NEWS                          |    6 +-
> > >>  lib/dpif-netdev.c             |   70 +--
> > >>  lib/dpif-netlink.c            |   23 +-
> > >>  lib/dpif.c                    |   21 +-
> > >>  lib/netdev-dpdk.c             |   68 +++
> > >>  lib/netdev-dpdk.h             |    6 +
> > >>  lib/netdev-offload-dpdk.c     | 1275 +++++++++++++++++++++++++++++++++++------
> > >>  lib/netdev-offload-provider.h |    6 +
> > >>  lib/netdev-offload-tc.c       |   11 +-
> > >>  lib/netdev-offload.c          |  115 ++--
> > >>  lib/netdev-offload.h          |   23 +-
> > >>  lib/netdev-provider.h         |    3 +-
> > >>  lib/netdev.c                  |   16 +
> > >>  lib/netdev.h                  |    2 +
> > >>  ofproto/ofproto-dpif-upcall.c |    5 +-
> > >>  tests/dpif-netdev.at          |   14 +-
> > >>  tests/ofproto-macros.at       |    3 +-
> > >>  18 files changed, 1394 insertions(+), 278 deletions(-)
> > >>
> > >> --
> > >> 2.14.5

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev