On 12/4/23 15:21, Vladislav Odintsov wrote:
> Hi!
> 
> I’m starting this thread as a continuation of the earlier one, which was
> about the OVN Interconnection use case [0] ;)
> Tests are performed with OVN 22.09.x and latest OVS master branch with applied
> [1] and [2] patches.
> 
> OVS configuration:
> 
> 
>  ┌───────────────────────────────────────────────┐
>  │ OVS (l3gw chassis)                            │
>  │                                               │
>  │ ┌────────────────────┐   ┌──────────────────┐ │
>  │ │ br-int           patch │ br-ext           │ │
>  │ │               ◄────┬───┼──►               │ │
>  │ │ ┌──────────────────┤   │                  │ │
>  │ │ │ovn-hv2    geneve │   │                  │ │
>  │ │ │                  │   │ ┌──────────────┐ │ │
>  │ │ │remote_ip=10.0.0.2│   │ │ bond0.2      │ │ │
>  │ │ └──────────────────┤   │ └─────────┬────┘ │ │
>  │ │                    │   │           │      │ │
>  │ └────────────────────┘   └───────────┼──────┘ │
>  │                                      │        │
>  │                                      │        │
>  │ ┌────────────────────┬───────────────▼─┐      │
>  │ │  bond0 (10.0.0.1)  │  bond0.2        │      │
>  │ ├─────────┬┬─────────┼─────────────────┘      │
>  │ │   eth0  ││   eth1  │                        │
>  └─┴─────────┴┴─────────┴────────────────────────┘
> 
> Now I’m gonna run different OVN NAT functions:
> 
> 1. SNAT (network to one public IP)
> 2. 1-to-1 NAT (stateful and stateless)
> 
> In both scenarios traffic comes from one chassis (hypervisor; IP 10.0.0.2) to
> another chassis ("centralized" L3GW node; IP 10.0.0.1 in the diagram above),
> which then performs address translation and sends the packets out to the
> external network.
> 
> ------
> 
> In the 1st scenario (SNAT) egress traffic (from OVN to outside) is offloaded
> partially - decap from geneve is done in HW, while recirculation is done in SW:
> 
> # ovs-appctl dpctl/dump-flows type=offloaded
> recirc_id(0),tunnel(tun_id=0x5,src=10.1.0.107,dst=10.1.0.109,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x30002/0x7fffffff}),flags(+key)),in_port(2),ct_label(0/0x2),eth(),eth_type(0x0800),ipv4(src=172.31.0.0/255.255.0.0,dst=0.0.0.0/128.0.0.0,proto=1,frag=no),
>  packets:1, bytes:84, used:0.291s, 
> actions:ct(commit,zone=3,nat(src=x.x.x.x)),recirc(0x9baa)

I'm actually surprised by this flow being offloaded.
I don't think ct actions with 'commit' flag can be offloaded.
Looks like a TC or a driver bug.

> 
> # ovs-appctl dpctl/dump-flows type=non-offloaded | grep 0x9baa
> recirc_id(0x9baa),tunnel(tun_id=0x5,src=10.1.0.107,dst=10.1.0.109,tp_dst=6081,geneve({len=4}),flags(+key)),in_port(2),ct_state(+est-rel-rpl+trk),ct_label(0/0x3),eth(src=0e:00:86:b6:92:6f,dst=00:00:0c:07:ac:07),eth_type(0x0800),ipv4(dst=0.0.0.0/128.0.0.0,frag=no),
>  packets:171, bytes:14364, used:0.210s, actions:ct_clear,5
> 
> Port 5 is the bond vlan output port.
> 
> In tcpdump I see no geneve packets targeted to the outside network, but I do
> see packets on the vlan interface (bond0.2 from the example) going out to the
> external network.
> 
> The ingress pipeline is not offloaded at all. I see reply packets on the vlan
> interface (bond0.2), and I also see geneve tunneled packets going from this GW
> node to the hypervisor.
> 
> ------
> 
> In the 2nd scenario (dnat_and_snat) the behaviour is exactly the same when the
> default stateful NAT is used (only geneve decap is offloaded).
> 
> egress dp flow:
> 
> # ovs-appctl dpctl/dump-flows type=offloaded
> recirc_id(0),tunnel(tun_id=0x5,src=10.1.0.107,dst=10.1.0.109,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x30002/0x7fffffff}),flags(+key)),in_port(2),ct_state(-new-est-rel-rpl-trk),ct_label(0/0x3),eth(src=0e:00:86:b6:92:6f,dst=00:00:0c:07:ac:07),eth_type(0x0800),ipv4(src=172.31.32.6,dst=0.0.0.0/128.0.0.0,proto=1,frag=no),
>  packets:1308, bytes:101694, used:0.860s, 
> actions:ct(zone=3,nat),recirc(0xa1fe)
> 
> ingress dp flow:
> 
> # ovs-appctl dpctl/dump-flows type=non-offloaded | grep proto=1, | grep X.X.X.X | grep -v 'packets:0,'
> recirc_id(0),in_port(5),ct_state(-new-est-rel-rpl-trk),ct_label(0/0x3),eth(src=xx:xx:xx:xx:xx:xx,dst=0e:00:86:b6:92:6f),eth_type(0x0800),ipv4(src=8.0.0.0/254.0.0.0,dst=X.X.X.X,proto=1,tos=0/0x3,ttl=59,frag=no),
>  packets:1050, bytes:88200, used:0.130s, actions:ct(zone=3,nat),recirc(0xa1fa)
> 
> If I make this OVN NAT rule stateless (ovn-nbctl set nat <uuid>
> options:stateless=true), it will become totally non-offloaded.
> 
> # ovs-appctl dpctl/dump-flows type=non-offloaded | grep proto=1, | grep X.X.X.X | grep -v 'packets:0,'
> recirc_id(0),in_port(5),ct_state(-new-est-rel-rpl-trk),ct_label(0/0x3),eth(src=xx:xx:xx:xx:xx:xx,dst=0e:00:86:b6:92:6f),eth_type(0x0800),ipv4(src=8.0.0.0/254.0.0.0,dst=X.X.X.X,proto=1,tos=0/0x3,ttl=59,frag=no),
>  packets:157, bytes:13188, used:0.850s, 
> actions:set(tunnel(tun_id=0x7,src=10.1.0.109,dst=10.1.0.107,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x10002}),flags(csum|key))),set(eth(src=d0:fe:00:00:00:04,dst=0a:00:23:4e:32:a0)),set(ipv4(dst=172.31.32.6,ttl=57)),2
> recirc_id(0),tunnel(tun_id=0x5,src=10.1.0.107,dst=10.1.0.109,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x30002/0x7fffffff}),flags(+key)),in_port(2),ct_state(-new-est-rel-rpl-trk),ct_label(0/0x3),eth(src=0e:00:86:b6:92:6f,dst=00:00:0c:07:ac:07),eth_type(0x0800),ipv4(src=172.31.32.6,dst=0.0.0.0/128.0.0.0,proto=1,frag=no),
>  packets:200, bytes:15854, used:0.010s, actions:set(ipv4(src=X.X.X.X)),5
> 
> ------
> 
> So, I’ve got some questions here:
> 
> 1. Should these scenarios work with HW offload?

Hard to tell.  One thing I already mentioned is that the first flow with
'commit' should not really be offloaded.

> 2. If yes, what could I have configured wrong, so that offload is only
>    partial/not working?

Nvidia's driver is doing a lot of very hacky/incorrect things in order to
get tunnel offloading working.  One thing that might confuse it is the
external Linux bonding interface.  You may try replacing it with OVS bonding.
However, IIRC, the driver will also ignore the bonding processing in br-ex
and just send packets to one of the ports directly.  So, I don't know if
that's a good option.  The MLX driver doesn't implement tunnel offloading
correctly as soon as the setup is anything but a very simple bridge with no
extra logic.  So, maybe try to get rid of the bonding entirely.

> 3. How does conntrack in the kernel interact with the conntrack module inside
>    SmartNICs?  Is there any documentation on this?

I'm not an expert, but from what I know only established connections
are getting sent to the HW.  The NIC doesn't process new connections or
commit them; all the logic is done by the kernel conntrack invoked from
TC.  The HW NIC only matches packets against established connections.
I'm not sure if there is any documentation.  Maybe Marcelo knows?
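
One way to check this on the gateway node is to look at the offload markers
in the kernel conntrack table.  This is only a rough sketch: the helper name
ct_offload_summary is made up for illustration, and it assumes the kernel
exposes /proc/net/nf_conntrack and tags entries with [OFFLOAD] (in the
software flow table) or [HW_OFFLOAD] (pushed to the NIC).

```shell
# Hypothetical helper: count conntrack entries by offload marker.
# Reads /proc/net/nf_conntrack-style lines on stdin.
ct_offload_summary() {
    awk '
        NF == 0          { next }            # skip blank lines
        /\[HW_OFFLOAD\]/ { hw++; next }      # entry offloaded to hardware
        /\[OFFLOAD\]/    { sw++; next }      # entry in the software flow table only
                         { none++ }          # regular conntrack entry
        END {
            printf "hw_offload:%d offload:%d not_offloaded:%d\n", hw, sw, none
        }
    '
}

# Usage (as root on the gateway node; assumes the procfs file exists):
#   ct_offload_summary < /proc/net/nf_conntrack
```

If hw_offload stays at zero while traffic is flowing, the established
connections are most likely not reaching the NIC at all.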

> 
> As always, I’m ready to provide any additional information or do some extra
> checks.

It might be useful to run the flow dumps with --more, so we can see whether
these flows are in the OVS kernel module, or installed in TC but not
offloaded to the hardware.
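
Something along these lines could summarize the verbose dump.  It is just a
sketch: summarize_dp is a made-up helper name, and it relies on the
"dp:<layer>" and "offloaded:<state>" fields that the verbose dump prints per
flow.  As far as I recall, "offloaded:" is only printed when the flow is
offloaded, so a missing field is counted as "no" here.

```shell
# Hypothetical helper: tally datapath flows by datapath layer and offload state.
# Reads the output of 'ovs-appctl dpctl/dump-flows -m' on stdin.
summarize_dp() {
    awk '
        /dp:[a-z]+/ {
            # Extract the datapath layer, e.g. "tc" or "ovs".
            match($0, /dp:[a-z]+/)
            dp = substr($0, RSTART + 3, RLENGTH - 3)
            # Assumption: no "offloaded:" field means the flow is not offloaded.
            off = "no"
            if (match($0, /offloaded:[a-z]+/))
                off = substr($0, RSTART + 10, RLENGTH - 10)
            count[dp "," off]++
        }
        END {
            for (k in count) {
                split(k, p, ",")
                printf "dp:%s offloaded:%s flows:%d\n", p[1], p[2], count[k]
            }
        }
    ' | sort
}

# Usage (needs a running ovs-vswitchd):
#   ovs-appctl dpctl/dump-flows -m | summarize_dp
```

Flows reported with dp:tc but without offloaded:yes are the interesting ones:
they made it into TC, but the driver did not accept them.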

> Thanks in advance!
> 
> 
> 0: https://mail.openvswitch.org/pipermail/ovs-discuss/2023-October/052744.html
> 1: https://patchwork.ozlabs.org/project/openvswitch/patch/20231201230836.3093792-1-i.maxim...@ovn.org/
> 2: https://patchwork.ozlabs.org/project/openvswitch/patch/20231201210523.3085560-1-i.maxim...@ovn.org/
> 
> Regards,
> Vladislav Odintsov
> 
