Re: [ovs-discuss] OVS HW offload not working

2023-05-25 Thread Robert Navarro via discuss
Hi Frode,

Thanks for the fast reply!

Replies in-line as well.

On Wed, May 24, 2023 at 11:41 PM Frode Nordahl 
wrote:

> Hello, Robert,
>
> See my response in-line below.
>
> On Thu, May 25, 2023 at 8:20 AM Robert Navarro via discuss
>  wrote:
> >
> > Hello,
> >
> > I've followed the directions here:
> >
> https://docs.nvidia.com/networking/pages/viewpage.action?pageId=119763689
> >
> > But I can't seem to get HW offload to work on my system.
> >
> > I'm using the latest OFED drivers with a ConnectX-5 SFP28 card running
> on kernel 5.15.107-2-pve
>
> Note that if you plan to use this feature with OVN you may find that
> the ConnectX-5 does not provide all the features required. Among other
> things it does not support the `dec_ttl` action which is a
> prerequisite for processing L3 routing, I'm also not sure whether it
> fully supports connection tracking offload. I'd go with CX-6 DX or
> above if this is one of your use cases.

Good to know, I'll keep that in mind as I progress with testing.


>
> > I have two hosts directly connected to each other, running a simple ping
> between hosts shows the following flows:
> >
> > root@pvet1:~# ovs-appctl dpctl/dump-flows -m type=tc
> > ufid:be1670c1-b36b-4f0f-8aba-e9415b9d0fb1,
> skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens15f0np0),packet_type(ns=0/0,id=0/0),eth(src=6e:56:fd:40:6e:22,dst=f6:36:11:c6:04:f0),eth_type(0x0800),ipv4(src=
> 0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no),
> packets:639, bytes:53676, used:0.710s, dp:tc, actions:tap103i1
> >
> > root@pvet2:~# ovs-appctl dpctl/dump-flows -m type=tc
> > ufid:f4d0ebd2-7ba9-4e21-9bf8-090f90bac072,
> skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens15f0np0),packet_type(ns=0/0,id=0/0),eth(src=f6:36:11:c6:04:f0,dst=6e:56:fd:40:6e:22),eth_type(0x0800),ipv4(src=
> 0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no),
> packets:656, bytes:55104, used:0.390s, dp:tc, actions:tap100i1
>
> The two flows listed above have an action towards what looks like a
> virtual Ethernet tap interface. That is not a supported configuration
> for hardware offload.

I'm using Proxmox as the hypervisor.

It seems like Proxmox is attaching the VM (using a virtio NIC) as a tap
interface (tap100i1) to the OVS bridge:

root@pvet2:~# ovs-dpctl show
system@ovs-system:
  lookups: hit:143640674 missed:1222 lost:45
  flows: 2
  masks: hit:143642152 total:2 hit/pkt:1.00
  port 0: ovs-system (internal)
  port 1: vmbr1 (internal)
  port 2: ens15f0np0
  port 3: vlan44 (internal)
  port 4: vlan66 (internal)
  port 5: ens15f0npf0vf0
  port 6: ens15f0npf0vf1
  port 7: tap100i1

Given that Proxmox uses KVM for virtualization, what's the correct way to
link a KVM VM to OVS?


> The instance needs to be connected to an
>
When you say instance here, do you mean the KVM virtual machine or the
instance of OVS?


> interface wired directly to the embedded switch in the card by
> attaching a VF or SF to the instance.
>
OVS is attached to the physical NIC using the PF and 2 VFs, as shown in the
ovs-dpctl output above.


>
> --
> Frode Nordahl
>
> > Neither of which shows offloaded
> >
> > The commands I'm using to setup the interfaces are:
> >
> > echo 2 | tee /sys/class/net/ens15f0np0/device/sriov_numvfs
> >
> > lspci -nn | grep Mellanox
> >
> > echo 0000:03:00.2 | tee /sys/bus/pci/drivers/mlx5_core/unbind
> > echo 0000:03:00.3 | tee /sys/bus/pci/drivers/mlx5_core/unbind
> >
> > devlink dev eswitch set pci/0000:03:00.0 mode switchdev
> >
> > echo 0000:03:00.2 | tee /sys/bus/pci/drivers/mlx5_core/bind
> > echo 0000:03:00.3 | tee /sys/bus/pci/drivers/mlx5_core/bind
> >
> > ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
> >
> > systemctl restart openvswitch-switch.service
> >
> > ovs-vsctl add-port vmbr1 ens15f0np0
> > ovs-vsctl add-port vmbr1 ens15f0npf0vf0
> > ovs-vsctl add-port vmbr1 ens15f0npf0vf1
> >
> > ethtool -K ens15f0np0 hw-tc-offload on
> > ethtool -K ens15f0npf0vf0 hw-tc-offload on
> > ethtool -K ens15f0npf0vf1 hw-tc-offload on
> >
> > ip link set dev ens15f0np0 up
> > ip link set dev ens15f0npf0vf0 up
> > ip link set dev ens15f0npf0vf1 up
> >
> > ovs-dpctl show
> >
> > Any ideas on what else to check?
> >
> > --
> > Robert Navarro
> > ___
> > discuss mailing list
> > disc...@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>


-- 
Robert Navarro


Re: [ovs-discuss] OVS HW offload not working

2023-05-25 Thread Frode Nordahl via discuss
On Thu, May 25, 2023 at 9:03 AM Robert Navarro  wrote:
>
> Hi Frode,
>
> Thanks for the fast reply!
>
> Replies in-line as well.
>
> On Wed, May 24, 2023 at 11:41 PM Frode Nordahl  
> wrote:
>>
>> Hello, Robert,
>>
>> See my response in-line below.
>>
>> On Thu, May 25, 2023 at 8:20 AM Robert Navarro via discuss
>>  wrote:
>> >
>> > Hello,
>> >
>> > I've followed the directions here:
>> > https://docs.nvidia.com/networking/pages/viewpage.action?pageId=119763689
>> >
>> > But I can't seem to get HW offload to work on my system.
>> >
>> > I'm using the latest OFED drivers with a ConnectX-5 SFP28 card running on 
>> > kernel 5.15.107-2-pve
>>
>> Note that if you plan to use this feature with OVN you may find that
>> the ConnectX-5 does not provide all the features required. Among other
>> things it does not support the `dec_ttl` action which is a
>> prerequisite for processing L3 routing, I'm also not sure whether it
>> fully supports connection tracking offload. I'd go with CX-6 DX or
>> above if this is one of your use cases.
>
> Good to know, I'll keep that in mind as I progress with testing.
>
>>
>>
>> > I have two hosts directly connected to each other, running a simple ping 
>> > between hosts shows the following flows:
>> >
>> > root@pvet1:~# ovs-appctl dpctl/dump-flows -m type=tc
>> > ufid:be1670c1-b36b-4f0f-8aba-e9415b9d0fb1, 
>> > skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens15f0np0),packet_type(ns=0/0,id=0/0),eth(src=6e:56:fd:40:6e:22,dst=f6:36:11:c6:04:f0),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no),
>> >  packets:639, bytes:53676, used:0.710s, dp:tc, actions:tap103i1
>> >
>> > root@pvet2:~# ovs-appctl dpctl/dump-flows -m type=tc
>> > ufid:f4d0ebd2-7ba9-4e21-9bf8-090f90bac072, 
>> > skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens15f0np0),packet_type(ns=0/0,id=0/0),eth(src=f6:36:11:c6:04:f0,dst=6e:56:fd:40:6e:22),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no),
>> >  packets:656, bytes:55104, used:0.390s, dp:tc, actions:tap100i1
>>
>> The two flows listed above have an action towards what looks like a
>> virtual Ethernet tap interface. That is not a supported configuration
>> for hardware offload.
>
> I'm using Proxmox as the hypervisor.
>
> It seems like Proxmox is attaching the VM (using a virtio nic) as a tap 
> (tap100i1) to the OVS:
>
> root@pvet2:~# ovs-dpctl show
> system@ovs-system:
>   lookups: hit:143640674 missed:1222 lost:45
>   flows: 2
>   masks: hit:143642152 total:2 hit/pkt:1.00
>   port 0: ovs-system (internal)
>   port 1: vmbr1 (internal)
>   port 2: ens15f0np0
>   port 3: vlan44 (internal)
>   port 4: vlan66 (internal)
>   port 5: ens15f0npf0vf0
>   port 6: ens15f0npf0vf1
>   port 7: tap100i1
>
> Given that proxmox uses KVM for virtualization, what's the correct way to 
> link a KVM VM to OVS?

I do not have detailed knowledge about proxmox, but what you would be
looking for is PCI Passthrough and SR-IOV. Once you get the instance
to use a VF instead of a virtio nic you should add its representor
port to the OVS bridge.

You can find the names of the representor ports by issuing the
`devlink port show` command.
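
An illustrative transcript (editor's sketch: the PCI addresses and interface names are hypothetical, modeled on the setup earlier in this thread; actual output and port flavours differ per system):

```
root@pvet2:~# devlink port show
pci/0000:03:00.0/65535: type eth netdev ens15f0np0 flavour physical
pci/0000:03:00.0/1: type eth netdev ens15f0npf0vf0 flavour pcivf pfnum 0 vfnum 0
pci/0000:03:00.0/2: type eth netdev ens15f0npf0vf1 flavour pcivf pfnum 0 vfnum 1
root@pvet2:~# ovs-vsctl add-port vmbr1 ens15f0npf0vf0   # plug the representor, not the VF itself
```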

>>
>> The instance needs to be connected to an
>
> When you say instance here, do you mean the KVM virtual machine or the 
> instance of OVS?

Instance as in the KVM virtual machine instance.

>>
>> interface wired directly to the embedded switch in the card by
>> attaching a VF or SF to the instance.
>
> OVS is attached to the physical nic using the PF and 2 VFs as shown in the 
> ovs-dpctl output above

Beware that there are multiple types of representor ports in play here
and you would be interested in plugging the ports with flavour
`virtual` into the OVS bridge.

-- 
Frode Nordahl

>>
>>
>> --
>> Frode Nordahl
>>
>> > Neither of which shows offloaded
>> >
>> > The commands I'm using to setup the interfaces are:
>> >
>> > echo 2 | tee /sys/class/net/ens15f0np0/device/sriov_numvfs
>> >
>> > lspci -nn | grep Mellanox
>> >
>> > echo 0000:03:00.2 | tee /sys/bus/pci/drivers/mlx5_core/unbind
>> > echo 0000:03:00.3 | tee /sys/bus/pci/drivers/mlx5_core/unbind
>> >
>> > devlink dev eswitch set pci/0000:03:00.0 mode switchdev
>> >
>> > echo 0000:03:00.2 | tee /sys/bus/pci/drivers/mlx5_core/bind
>> > echo 0000:03:00.3 | tee /sys/bus/pci/drivers/mlx5_core/bind
>> >
>> > ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
>> >
>> > systemctl restart openvswitch-switch.service
>> >
>> > ovs-vsctl add-port vmbr1 ens15f0np0
>> > ovs-vsctl add-port vmbr1 ens15f0npf0vf0
>> > ovs-vsctl add-port vmbr1 ens15f0npf0vf1
>> >
>> > ethtool -K ens15f0np0 hw-tc-offload on
>> > ethtool -K ens15f0npf0vf0 hw-tc-offload on
>> > ethtool -K ens15f0npf0vf1 hw-tc-offload on
>> >
>> > ip link set dev ens15f0np0 up
>> > ip link set dev ens15f0npf0v

[ovs-discuss] BFD WITH ECMP NOT WORK!

2023-05-25 Thread wangchuanlei via discuss
Hello,
I created a static route with BFD using the following commands:
(1) uuid=`ovn-nbctl create bfd logical_port=lrp0 dst_ip=192.168.3.2 status=down`
(2) ovn-nbctl --bfd=$uuid --ecmp lr-route-add r0 240.0.0.0/8 192.168.3.2

The lr-port lrp0 and the VM (192.168.3.2) are on different compute nodes, and I found that northd does not install the static route because the BFD session is down.
Why is the BFD session down? The BFD packet sent by the VM is processed by the local ovn-controller instead of being forwarded to the other node, so lrp0 never gets a reply.
I think this is a bug; can anyone help me fix it?
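
As a starting point for debugging, the BFD state that northd acts on can be inspected directly (editor's sketch; assumes ovn-nbctl/ovn-sbctl access to the NB and SB databases):

```
# the BFD row created above; ovn-controller updates the "status" column,
# and northd only installs the ECMP route once it reports "up"
ovn-nbctl list bfd

# confirm which chassis the relevant ports are bound to
ovn-sbctl --columns=logical_port,chassis list port_binding
```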

Thanks 
Best regards!
wangchuanlei


[ovs-discuss] MAC binding aging refresh mechanism

2023-05-25 Thread Ales Musil via discuss
Hi,

to improve the MAC binding aging mechanism we need a way to ensure that
rows which are still in use are preserved. This doesn't happen with current
implementation.

I propose the following solution which should solve the issue, any
questions or comments are welcome. If there isn't anything major that would
block this approach I would start to implement it so it can be available on
23.09.

For the approach itself:

Add "mac_cache_use" action into "lr_in_learn_neighbor" table (only the flow
that continues on known MAC binding):
match=(REGBIT_LOOKUP_NEIGHBOR_RESULT == 1 ||
REGBIT_LOOKUP_NEIGHBOR_IP_RESULT == 0), action=(next;)  ->
match=(REGBIT_LOOKUP_NEIGHBOR_RESULT == 1 ||
REGBIT_LOOKUP_NEIGHBOR_IP_RESULT == 0), action=(mac_cache_use; next;)

The "mac_cache_use" would translate to resubmit into separate table with
flows per MAC binding as follows:
match=(ip.src=, eth.src=, datapath=),
action=(drop;)

This should bump the statistics every time for the correct MAC binding. In
ovn-controller we could periodically dump the flows from this table. The
period would be set to MIN(mac_binding_age_threshold/2) from all local
datapaths. The dump would happen from a different thread with its own rconn
to prevent backlogging issues. The thread would receive mapped data from
I-P node that would keep track of mapping datapath -> cookies -> mac
bindings. This allows us to avoid constant lookups, but at the cost of
keeping track of all local MAC bindings. To save some computation time this
I-P could be relevant only for datapaths that actually have the threshold
set.

If the "idle_age" of the particular flow is smaller than the datapath
"mac_binding_age_threshold" it means that it is still in use. To prevent a
lot of updates, if the traffic is still relevant on multiple controllers,
we would check if the timestamp is older than the "dump period"; if not we
don't have to update it, because someone else did.

Also to "desync" the controllers there would be a random delay added to the
"dump period".

All of this would be applicable to FDB aging as well.
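
The period computation described above can be sketched as follows (editor's sketch with hypothetical threshold values; a real implementation would live in ovn-controller and use a proper RNG for the jitter):

```shell
# Dump period = MIN(mac_binding_age_threshold / 2) over the local
# datapaths, plus a small random delay to desynchronize controllers.
thresholds="300 120 600"     # hypothetical per-datapath thresholds, seconds

period=""
for t in $thresholds; do
    half=$((t / 2))
    if [ -z "$period" ] || [ "$half" -lt "$period" ]; then
        period="$half"
    fi
done

# derive a 0-9 s jitter from the PID (stand-in for a random delay)
jitter=$(( $$ % 10 ))

echo "dump period: ${period}s + ${jitter}s jitter"
```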

Does that sound reasonable?
Please let me know if you have any comments/suggestions.

Thanks,
Ales
-- 

Ales Musil

Senior Software Engineer - OVN Core

Red Hat EMEA 

amu...@redhat.comIM: amusil



[ovs-discuss] OVS-DPDK ConnTrack Update Racing Condition

2023-05-25 Thread Lazuardi Nasution via discuss
Hi,

Continuing my posting on "ovs-vswitchd crashes several times a day", it
seems that I find some racing conditions on the conntrack update. Without
enabling debugging logs, I find logs like the following frequently.

2023-05-25T12:48:07.270Z|02757|conntrack(pmd-c47/id:101)|WARN|Unable to NAT
due to tuple space exhaustion - if DoS attack, use firewalling and/or zone
partitioning.
2023-05-25T12:48:09.318Z|02758|conntrack(pmd-c47/id:101)|WARN|Unable to NAT
due to tuple space exhaustion - if DoS attack, use firewalling and/or zone
partitioning.

After enabling debugging logs, I find the logs like the following before
the above logs.

2023-05-25T12:48:06.979Z|00030|conntrack_tp(pmd-c71/id:103)|DBG|Update
timeout TCP_ESTABLISHED zone=4 with policy id=0 val=86400 sec.

At that time, the conntrack table is only like the following with only a
single ESTABLISHED entry.

root@controller02:~# ovs-appctl dpctl/dump-conntrack -s
icmp,orig=(src=192.168.14.14,dst=8.8.8.8,id=6,type=8,code=0),reply=(src=8.8.8.8,dst=10.10.141.153,id=6,type=0,code=0),zone=4,timeout=29
icmp,orig=(src=192.168.14.11,dst=10.10.41.70,id=4,type=8,code=0),reply=(src=10.10.41.70,dst=10.10.141.153,id=4,type=0,code=0),zone=4,timeout=27
tcp,orig=(src=192.168.14.14,dst=10.10.41.73,sport=49852,dport=3306),reply=(src=10.10.41.73,dst=10.10.141.153,sport=3306,dport=49852),zone=4,timeout=86399,protoinfo=(state=ESTABLISHED)

It seems that OVS fails to look up the matching conntrack entry whenever
there is an update. Any ideas on how to deal with this?
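
When chasing tuple-space exhaustion on the userspace (DPDK) datapath, it can help to compare the live connection count against the configured limits (editor's sketch; these ovs-appctl commands exist on recent OVS releases, but availability can vary by version):

```
ovs-appctl dpctl/ct-stats-show         # per-protocol connection counts
ovs-appctl dpctl/ct-get-maxconns       # userspace conntrack table limit
ovs-appctl dpctl/ct-get-limits zone=4  # per-zone limits, if configured
```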

Best regards.


Re: [ovs-discuss] Help: reproduce CVE-2023-1668

2023-05-25 Thread David Morel via discuss
> I think, the issue should still be there, though I didn't check.
> Why exactly porting of the mf_set_mask_l3_prereqs() is a problem?
> do_xlate_actions() looks different in 2.5.3, but it still performs
> same mf_are_prereqs_ok() check.  Can't you just add the call in the
> body of the if as it is done on newer versions?
That's what I originally did, but I'm not sure I fully grasp the
implementation enough yet.

The old implementation does mf_mask_and_prereqs() before the if whereas
the new one does the if, then the fix adds the mf_set_mask_l3_prereqs()
and finally it does mf_mask_field_masked(). So I was second guessing
adding it inside the if as I was afraid it was already too late to do
mf_set_mask_l3_prereqs().

> On the other note, do you plan to migrate to any supported version?
> 2.5.3 is six years old at this point and not supported for more than
> two years.
I surely hope so. On our side we're following what is in XenServer, but
we also contribute, so we're wondering about contributing an update as
the task is from my understanding existing but not planned yet on their
side. I know someone internally tried a newer version and it "didn't
work". Unfortunately it was a quick test and there is no report so I
don't know which version was tested.

> FWIW, we removed [1] support for XenServer integration in 3.0,
> because nobody was using it, so it slowly decayed and likely didn't
> work anyway.
> 
> Can be brought back, if someone is willing to support it.
> 2.17 is the current LTS and it still has XenServer integration,
> I'm not sure if it's working though as I have no way to test it.
Thanks for the heads up, and pointing to versions and this commit, that
will likely be helpful if we go ahead and try to contribute that update
(I guess I'll be the most likely to do this).


Re: [ovs-discuss] Help: reproduce CVE-2023-1668

2023-05-25 Thread Ilya Maximets via discuss
On 5/25/23 15:35, David Morel wrote:
>> I think, the issue should still be there, though I didn't check.
>> Why exactly porting of the mf_set_mask_l3_prereqs() is a problem?
>> do_xlate_actions() looks different in 2.5.3, but it still performs
>> same mf_are_prereqs_ok() check.  Can't you just add the call in the
>> body of the if as it is done on newer versions?
> That's what I originally did, but I'm not sure I fully grasp the
> implementation enough yet.
> 
> The old implementation does mf_mask_and_prereqs() before the if whereas
> the new one does the if, then the fix adds the mf_set_mask_l3_prereqs()
> and finally it does mf_mask_field_masked(). So I was second guessing
> adding it inside the if as I was afraid it was already too late to do
> mf_set_mask_l3_prereqs().

The mf_set_mask_l3_prereqs() should be executed before the
mf_set_flow_value_masked().  The only thing that is necessary is that
we mask the "l3 prerequisites" whenever we set the actual fields.
The order is not very important.

> 
>> On the other note, do you plan to migrate to any supported version?
>> 2.5.3 is six years old at this point and not supported for more than
>> two years.
> I surely hope so. On our side we're following what is in XenServer, but
> we also contribute, so we're wondering about contributing an update as
> the task is from my understanding existing but not planned yet on their
> side. I know someone internally tried a newer version and it "didn't
> work". Unfortunately it was a quick test and there is no report so I
> don't know which version was tested.

In the past I saw some custom patches present in XenServer packages
fixing various bugs in our xapi support, but these were never sent
upstream and didn't have commit messages descriptive enough to understand
them.  So, my guess is that some changes are needed.  Also, as OVS
evolved over time we might have broken something due to lack of
proper testing on Xen...

> 
>> FWIW, we removed [1] support for XenServer integration in 3.0,
>> because nobody was using it, so it slowly decayed and likely didn't
>> work anyway.
>>
>> Can be brought back, if someone is willing to support it.
>> 2.17 is the current LTS and it still has XenServer integration,
>> I'm not sure if it's working though as I have no way to test it.
> Thanks for the heads up, and pointing to versions and this commit, that
> will likely be helpful if we go ahead and try to contribute that update
> (I guess I'll be the most likely to do this).

Ack.


Re: [ovs-discuss] Help: reproduce CVE-2023-1668

2023-05-25 Thread David Morel via discuss
> The mf_set_mask_l3_prereqs() should be executed before the
> mf_set_flow_value_masked().  The only thing that necessary is that
> we mask "l3 prerequisites" whenever we set the actual fields.
> The order is not very important.
Ok, should be good with the way I did my first version then, thanks for
the clarifications!


Re: [ovs-discuss] MAC binding aging refresh mechanism

2023-05-25 Thread Ilya Maximets via discuss
On 5/25/23 14:08, Ales Musil via discuss wrote:
> Hi,
> 
> to improve the MAC binding aging mechanism we need a way to ensure that rows 
> which are still in use are preserved. This doesn't happen with current 
> implementation.
> 
> I propose the following solution which should solve the issue, any questions 
> or comments are welcome. If there isn't anything major that would block this 
> approach I would start to implement it so it can be available on 23.09.
> 
> For the approach itself:
> 
> Add "mac_cache_use" action into "lr_in_learn_neighbor" table (only the flow 
> that continues on known MAC binding):
> match=(REGBIT_LOOKUP_NEIGHBOR_RESULT == 1 || REGBIT_LOOKUP_NEIGHBOR_IP_RESULT 
> == 0), action=(next;)  -> match=(REGBIT_LOOKUP_NEIGHBOR_RESULT == 1 || 
> REGBIT_LOOKUP_NEIGHBOR_IP_RESULT == 0), action=(mac_cache_use; next;)
> 
> The "mac_cache_use" would translate to resubmit into separate table with 
> flows per MAC binding as follows:
> match=(ip.src=, eth.src=, datapath=), 
> action=(drop;)

One concern here would be that it will likely cause a packet clone
in the datapath just to immediately drop it.  So, might have a
noticeable performance impact.

> 
> This should bump the statistics every time for the correct MAC binding. In 
> ovn-controller we could periodically dump the flows from this table. the 
> period would be set to MIN(mac_binding_age_threshold/2) from all local 
> datapaths. The dump would happen from a different thread with its own rconn 
> to prevent backlogging issues. The thread would receive mapped data from I-P 
> node that would keep track of mapping datapath -> cookies -> mac bindings. 
> This allows us to avoid constant lookups, but at the cost of keeping track of 
> all local MAC bindings. To save some computation time this I-P could be 
> relevant only for datapaths that actually have the threshold set.
> 
> If the "idle_age" of the particular flow is smaller than the datapath 
> "mac_binding_age_threshold" it means that it is still in use. To prevent a 
> lot of updates, if the traffic is still relevant on multiple controllers, we 
> would check if the timestamp is older than the "dump period"; if not we don't 
> have to update it, because someone else did.
> 
> Also to "desync" the controllers there would be a random delay added to the 
> "dump period".
> 
> All of this would be applicable to FDB aging as well.
> 
> Does that sound reasonable?
> Please let me know if you have any comments/suggestions.
> 
> Thanks,
> Ales



Re: [ovs-discuss] OVS NDP proxy / nd_options_type

2023-05-25 Thread Sesterhenn, Maximilian via discuss
Hello together,

I would like to follow up on this.

Meanwhile I was able to find the following discussion [1] from this mailing list
some time ago, which looks like it is about the same topic.

Is this something that's currently not possible within OVS?
If so, are there plans to add this?

I had a look at how OVN is doing it, and it seems like OVN sends this type
of traffic to its controller for external processing.

It would be nice if we could answer NDP NS packets with OVS flows directly, like
we already can for ARP requests.


[1] https://www.mail-archive.com/ovs-dev@openvswitch.org/msg46880.html
[2] 
https://github.com/openvswitch/ovs/blob/8045c0f8de5192355ca438ed7eef77457c3c1625/ofproto/ofproto-dpif.c#L4695

Thanks in advance for your time!


Mit freundlichen Grüßen / Best Regards

i.A. Maximilian Sesterhenn
Cloud Engineer

EPG – Ehrhardt Partner Group

__

EPX - Ehrhardt + Partner Xtended GmbH
Alte Römerstraße 3
56154 Boppard-Buchholz
Germany
Phone: (+49) 67 42 / 87 27 0
Fax: (+49) 67 42 / 87 27 50
E-Mail: i...@epg.com
Internet: www.epg.com
__

CEO:
Marco Ehrhardt, Markus Derksen
Commercial register Koblenz HRB 22546
Registered office: Boppard


From: discuss  on behalf of Sesterhenn, 
Maximilian via discuss 
Sent: Monday, May 22, 2023 22:16
To: ovs-discuss@openvswitch.org 
Subject: [ovs-discuss] OVS NDP proxy / nd_options_type


Hey there,

maybe someone from this list can help me.

I'm currently trying to implement a simple NDP proxy using OVS.

For that, I defined two basic flows that should do the trick:

cookie=0x3e6,priority=1100,icmp6,icmp_type=135,icmp_code=0,in_port=patch-provnet-f,dl_src=MAC
 actions=set_field:136->icmpv6_type,set_field:0->icmpv6_code,goto_table:10

cookie=0x3e6,priority=1000,table=10,icmp6,icmp_type=136,icmp_code=0,in_port=patch-provnet-f,dl_src=MAC
 actions=move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[], 
mod_dl_src:MAC,move:NXM_NX_IPV6_SRC[]->NXM_NX_IPV6_DST[],move:NXM_NX_ND_TARGET[]->NXM_NX_IPV6_SRC[],set_field:MAC->nd_tll,set_field:2->nd_options_type,in_port

However, I'm unable to install the second flow; it works without the
nd_options_type field.
It errors out with the message "OFPT_ERROR (xid=0xa): OFPBAC_BAD_SET_ARGUMENT".
The problem is that without that field my NA messages are incorrect.
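
One way to narrow down an OFPBAC_BAD_SET_ARGUMENT is to try the set_field action alone on a scratch bridge with an explicit OpenFlow version, since which fields are settable can depend on the negotiated version (editor's sketch; the bridge name is arbitrary):

```
ovs-vsctl add-br br-probe
ovs-ofctl -O OpenFlow15 add-flow br-probe \
    'icmp6,icmp_type=136,icmp_code=0 actions=set_field:2->nd_options_type,drop'
ovs-vsctl del-br br-probe
```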

I dug further and realized that this is something that older versions of
OVS do not support.
I think I found the feature on the release notes of OVS 2.12.0 [1].

However, I'm on a quite recent system so that should not be a problem.

Is someone with more experience able to tell me what the reason could be?

Versioning:
Rocky Linux 9.1
Linux 5.14.0-162.23.1.el9_1.x86_64

# ovs-ofctl --version
ovs-ofctl (Open vSwitch) 3.1.2
OpenFlow versions 0x1:0x6

# ovs-vsctl --version
ovs-vsctl (Open vSwitch) 3.1.2
DB Schema 8.3.1

# modinfo openvswitch
filename:   
/lib/modules/5.14.0-162.23.1.el9_1.x86_64/kernel/net/openvswitch/openvswitch.ko.xz
alias:  net-pf-16-proto-16-family-ovs_ct_limit
alias:  net-pf-16-proto-16-family-ovs_meter
alias:  net-pf-16-proto-16-family-ovs_packet
alias:  net-pf-16-proto-16-family-ovs_flow
alias:  net-pf-16-proto-16-family-ovs_vport
alias:  net-pf-16-proto-16-family-ovs_datapath
license:GPL
description:Open vSwitch switching datapath
rhelversion:9.1
srcversion: C1E5F3D9CD0C9A09006C69E
depends:nf_conntrack,nf_nat,nf_conncount,libcrc32c,nf_defrag_ipv6
retpoline:  Y
intree: Y
name:   openvswitch
vermagic:   5.14.0-162.23.1.el9_1.x86_64 SMP preempt mod_unload modversions

[1] 
https://mail.openvswitch.org/pipermail/ovs-announce/2019-September/000255.html





Re: [ovs-discuss] OVS HW offload not working

2023-05-25 Thread Robert Navarro via discuss
>
> Once you get the instance
> to use a VF instead of a virtio nic you should add its representor
> port to the OVS bridge.


Interesting, I didn't know this.

Given that the instance has to use PCI Passthrough does that mean live
migrations are no longer possible?

I think that was one of the biggest reasons for wanting to use the virtio
NIC.

I'm learning a lot very quickly, thanks for the information Frode!

On Thu, May 25, 2023 at 1:34 AM Frode Nordahl 
wrote:

> On Thu, May 25, 2023 at 9:03 AM Robert Navarro  wrote:
> >
> > Hi Frode,
> >
> > Thanks for the fast reply!
> >
> > Replies in-line as well.
> >
> > On Wed, May 24, 2023 at 11:41 PM Frode Nordahl <
> frode.nord...@canonical.com> wrote:
> >>
> >> Hello, Robert,
> >>
> >> See my response in-line below.
> >>
> >> On Thu, May 25, 2023 at 8:20 AM Robert Navarro via discuss
> >>  wrote:
> >> >
> >> > Hello,
> >> >
> >> > I've followed the directions here:
> >> >
> https://docs.nvidia.com/networking/pages/viewpage.action?pageId=119763689
> >> >
> >> > But I can't seem to get HW offload to work on my system.
> >> >
> >> > I'm using the latest OFED drivers with a ConnectX-5 SFP28 card
> running on kernel 5.15.107-2-pve
> >>
> >> Note that if you plan to use this feature with OVN you may find that
> >> the ConnectX-5 does not provide all the features required. Among other
> >> things it does not support the `dec_ttl` action which is a
> >> prerequisite for processing L3 routing, I'm also not sure whether it
> >> fully supports connection tracking offload. I'd go with CX-6 DX or
> >> above if this is one of your use cases.
> >
> > Good to know, I'll keep that in mind as I progress with testing.
> >
> >>
> >>
> >> > I have two hosts directly connected to each other, running a simple
> ping between hosts shows the following flows:
> >> >
> >> > root@pvet1:~# ovs-appctl dpctl/dump-flows -m type=tc
> >> > ufid:be1670c1-b36b-4f0f-8aba-e9415b9d0fb1,
> skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens15f0np0),packet_type(ns=0/0,id=0/0),eth(src=6e:56:fd:40:6e:22,dst=f6:36:11:c6:04:f0),eth_type(0x0800),ipv4(src=
> 0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no),
> packets:639, bytes:53676, used:0.710s, dp:tc, actions:tap103i1
> >> >
> >> > root@pvet2:~# ovs-appctl dpctl/dump-flows -m type=tc
> >> > ufid:f4d0ebd2-7ba9-4e21-9bf8-090f90bac072,
> skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens15f0np0),packet_type(ns=0/0,id=0/0),eth(src=f6:36:11:c6:04:f0,dst=6e:56:fd:40:6e:22),eth_type(0x0800),ipv4(src=
> 0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no),
> packets:656, bytes:55104, used:0.390s, dp:tc, actions:tap100i1
> >>
> >> The two flows listed above have an action towards what looks like a
> >> virtual Ethernet tap interface. That is not a supported configuration
> >> for hardware offload.
> >
> > I'm using Proxmox as the hypervisor.
> >
> > It seems like Proxmox is attaching the VM (using a virtio nic) as a tap
> (tap100i1) to the OVS:
> >
> > root@pvet2:~# ovs-dpctl show
> > system@ovs-system:
> >   lookups: hit:143640674 missed:1222 lost:45
> >   flows: 2
> >   masks: hit:143642152 total:2 hit/pkt:1.00
> >   port 0: ovs-system (internal)
> >   port 1: vmbr1 (internal)
> >   port 2: ens15f0np0
> >   port 3: vlan44 (internal)
> >   port 4: vlan66 (internal)
> >   port 5: ens15f0npf0vf0
> >   port 6: ens15f0npf0vf1
> >   port 7: tap100i1
> >
> > Given that proxmox uses KVM for virtualization, what's the correct way
> to link a KVM VM to OVS?
>
> I do not have detailed knowledge about proxmox, but what you would be
> looking for is PCI Passthrough and SR-IOV. Once you get the instance
> to use a VF instead of a virtio nic you should add its representor
> port to the OVS bridge.
>
> You can find the names of the representor ports by issuing the
> `devlink port show` command.
>
> >>
> >> The instance needs to be connected to an
> >
> > When you say instance here, do you mean the KVM virtual machine or the
> instance of OVS?
>
> Instance as in the KVM virtual machine instance.
>
> >>
> >> interface wired directly to the embedded switch in the card by
> >> attaching a VF or SF to the instance.
> >
> > OVS is attached to the physical nic using the PF and 2 VFs as shown in
> the ovs-dpctl output above
>
> Beware that there are multiple types of representor ports in play here
> and you would be interested in plugging the ports with flavour
> `virtual` into the OVS bridge.
>
> --
> Frode Nordahl
>
> >>
> >>
> >> --
> >> Frode Nordahl
> >>
> >> > Neither of which shows offloaded
> >> >
> >> > The commands I'm using to setup the interfaces are:
> >> >
> >> > echo 2 | tee /sys/class/net/ens15f0np0/device/sriov_numvfs
> >> >
> >> > lspci -nn | grep Mellanox
> >> >
> >> > echo :03:00.2 | tee /sys/bus/pci/drivers/mlx5_core/unbind
> >> > echo :03:00.3 | tee /sys/bus/pci/drivers/m

Re: [ovs-discuss] Ping over dpdk bridge failed after upgrade to OVS 2.17.3

2023-05-25 Thread Alex Yeh (ayeh) via discuss
Hi Ilya,
  Thanks for your reply. We did further investigation, and from the findings it 
seems related to the QEMU/libvirt version. The ping starts to work on the DPDK 
bridge after we roll back the QEMU/libvirt version. Are you aware of any new 
config needed to use the newer QEMU version?

Alex

Non-working: The ping from host to VM caused the ovs_tx_failure_drops counter 
to go up.

[root@nfvis ~]# virsh version
Compiled against library: libvirt 8.0.0
Using library: libvirt 8.0.0
Using API: QEMU 8.0.0
Running hypervisor: QEMU 6.2.0

[root@nfvis ~]# ovs-vswitchd --version
ovs-vswitchd (Open vSwitch) 2.17.0
DPDK 21.11.0
[root@nfvis ~]# ovs-vsctl get Interface vnic1 statistics
{ovs_rx_qos_drops=0, ovs_tx_failure_drops=15, ovs_tx_invalid_hwol_drops=0, 
ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0, ovs_tx_retries=0, 
rx_1024_to_1522_packets=0, rx_128_to_255_packets=0, rx_1523_to_max_packets=0, 
rx_1_to_64_packets=0, rx_256_to_511_packets=0, rx_512_to_1023_packets=0, 
rx_65_to_127_packets=0, rx_bytes=0, rx_dropped=0, rx_errors=0, rx_packets=0, 
tx_bytes=0, tx_dropped=15, tx_packets=0}
[root@nfvis ~]#


Working:

[root@nfvis ~]# virsh version
Compiled against library: libvirt 6.0.0  <-- rolled back to 6.0.0
Using library: libvirt 6.0.0
Using API: QEMU 6.0.0
Running hypervisor: QEMU 4.2.0

[root@nfvis ~]# ovs-vswitchd --version
ovs-vswitchd (Open vSwitch) 2.17.0
DPDK 21.11.0
[root@nfvis ~]# ovs-vsctl get Interface vnic1 statistics
{ovs_rx_qos_drops=0, ovs_tx_failure_drops=0, ovs_tx_invalid_hwol_drops=0, 
ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0, ovs_tx_retries=0, 
rx_1024_to_1522_packets=0, rx_128_to_255_packets=0, rx_1523_to_max_packets=0, 
rx_1_to_64_packets=1, rx_256_to_511_packets=0, rx_512_to_1023_packets=0, 
rx_65_to_127_packets=73, rx_bytes=6318, rx_dropped=0, rx_errors=0, 
rx_packets=74, tx_bytes=6318, tx_dropped=0, tx_packets=74}
[root@nfvis ~]#


On 5/24/23, 12:37 PM, "Ilya Maximets" <i.maxim...@ovn.org> wrote:


On 5/24/23 20:04, Alex Yeh (ayeh) via discuss wrote:
> Hi All,
> 
> A little more info on the qemu version of the working and not working setup.
> 
> Thanks
> Alex
> 
> Working:
> [root@nfvis ~]# /usr/libexec/qemu-kvm --version
> QEMU emulator version 4.2.0 (qemu-kvm-4.2.0-48.el8)
> Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
> 
> [root@nfvis ~]#
> Non-working:
> 
> [root@nfvis ~]# /usr/libexec/qemu-kvm --version
> QEMU emulator version 6.2.0 (qemu-kvm-6.2.0-11.el8)
> Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
> 
> Thanks
> Alex
> 
> *From:* "Alex Yeh (ayeh)" <a...@cisco.com>
> *Date:* Tuesday, May 23, 2023 at 2:58 PM
> *To:* "ovs-discuss@openvswitch.org" <ovs-discuss@openvswitch.org>
> *Subject:* Ping over dpdk bridge failed after upgrade to OVS 2.17.3
> 
> Hi All,
> 
> We were running OVS 2.13.0/DPDK 19.11.1 and the VMs were able to ping over 
> the DPDK bridge. After upgrading to OVS 2.17.3 the VMs can’t ping over the DPDK 
> bridge with the same OVS config. Has anyone seen the same issue and found a 
> way to fix it?


Nothing in particular comes to mind. You need to check the logs for warnings
or errors, and check the datapath flows installed to see if there is something
abnormal there. Check the QEMU logs for vhost-user related messages.
If everything looks correct but there is no traffic, check whether you have a
shared memory backend in QEMU (a common issue when everything seems normal).
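For what it's worth, the shared-memory requirement Ilya mentions usually shows up in the libvirt domain XML. A sketch of the relevant fragment (element availability depends on the libvirt version, so treat this as an assumption to verify against your deployment):

```xml
<!-- Fragment of the libvirt domain XML: back guest RAM with shared
     hugepages so the vhost-user backend can map the virtqueues. -->
<memoryBacking>
  <hugepages/>
  <access mode='shared'/>
</memoryBacking>
```

Without shared guest memory, vhost-user ports can come up but pass no traffic, which could match the ovs_tx_failure_drops symptom above.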


Best regards, Ilya Maximets.


> 
> Thanks
> Alex
> 
> OVS bridge setup, VMs are connected to vnic1 and vnic3:
> 
> Bridge dpdk-br
> datapath_type: netdev
> Port dpdk-br
> Interface dpdk-br
> type: internal
> 
> Port vnic1
> Interface vnic1
> type: dpdkvhostuserclient
> options: {vhost-server-path="/run/vhostfd/vnic1"}
> 
> Port vnic3
> Interface vnic3
> type: dpdkvhostuserclient
> options: {vhost-server-path="/run/vhostfd/vnic3"}
> 
> ovs_version: "2.17.7"
> 
> Working:
> [root@nfvis-csp-45 ~]# ovs-vswitchd --version
> ovs-vswitchd (Open vSwitch) 2.13.0
> DPDK 19.11.1
> 
> Not working:
> [root@nfvis nfvos-confd]# ovs-vswitchd --version
> ovs-vswitchd (Open vSwitch) 2.17.3
> DPDK 21.11.0
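For reference, a bridge like the one quoted above would typically be created with `ovs-vsctl` commands along these lines (the bridge name, port names, and socket paths are taken from the quoted config; the exact commands used are an assumption):

```shell
# Create a userspace (netdev) datapath bridge for DPDK:
ovs-vsctl add-br dpdk-br -- set bridge dpdk-br datapath_type=netdev

# Add vhost-user client ports; OVS connects to sockets created by QEMU:
ovs-vsctl add-port dpdk-br vnic1 -- set Interface vnic1 \
    type=dpdkvhostuserclient options:vhost-server-path=/run/vhostfd/vnic1
ovs-vsctl add-port dpdk-br vnic3 -- set Interface vnic3 \
    type=dpdkvhostuserclient options:vhost-server-path=/run/vhostfd/vnic3
```

These run against a live ovs-vswitchd instance, so they are a sketch of the setup rather than a verified reproduction.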





___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] OVS HW offload not working

2023-05-25 Thread Frode Nordahl via discuss
tor. 25. mai 2023, 22:29 skrev Robert Navarro :

> Once you get the instance
>> to use a VF instead of a virtio nic you should add its representor
>> port to the OVS bridge.
>
>
> Interesting, I didn't know this.
>
> Given that the instance has to use PCI Passthrough does that mean live
> migrations are no longer possible?
>
> I think that was one of the biggest reasons for wanting to use the virtio
> nic
>

The technique du jour for making live migration work with hardware offload
would be vDPA, where the instance does indeed use the virtio driver.

The underlying plumbing to make it work is a bit more involved though, so
I'd recommend getting the simpler setup described so far in this thread to
work before embarking on that journey.

> I'm learning a lot very quickly, thanks for the information Frode!

Thank you for the feedback, and hth.

--
Frode Nordahl


> On Thu, May 25, 2023 at 1:34 AM Frode Nordahl 
> wrote:
>
>> On Thu, May 25, 2023 at 9:03 AM Robert Navarro  wrote:
>> >
>> > Hi Frode,
>> >
>> > Thanks for the fast reply!
>> >
>> > Replies in-line as well.
>> >
>> > On Wed, May 24, 2023 at 11:41 PM Frode Nordahl <
>> frode.nord...@canonical.com> wrote:
>> >>
>> >> Hello, Robert,
>> >>
>> >> See my response in-line below.
>> >>
>> >> On Thu, May 25, 2023 at 8:20 AM Robert Navarro via discuss
>> >>  wrote:
>> >> >
>> >> > Hello,
>> >> >
>> >> > I've followed the directions here:
>> >> >
>> https://docs.nvidia.com/networking/pages/viewpage.action?pageId=119763689
>> >> >
>> >> > But I can't seem to get HW offload to work on my system.
>> >> >
>> >> > I'm using the latest OFED drivers with a ConnectX-5 SFP28 card
>> running on kernel 5.15.107-2-pve
>> >>
>> >> Note that if you plan to use this feature with OVN you may find that
>> >> the ConnectX-5 does not provide all the features required. Among other
>> >> things it does not support the `dec_ttl` action which is a
>> >> prerequisite for processing L3 routing, I'm also not sure whether it
>> >> fully supports connection tracking offload. I'd go with CX-6 DX or
>> >> above if this is one of your use cases.
>> >
>> > Good to know, I'll keep that in mind as I progress with testing.
>> >
>> >>
>> >>
>> >> > I have two hosts directly connected to each other, running a simple
>> ping between hosts shows the following flows:
>> >> >
>> >> > root@pvet1:~# ovs-appctl dpctl/dump-flows -m type=tc
>> >> > ufid:be1670c1-b36b-4f0f-8aba-e9415b9d0fb1,
>> skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens15f0np0),packet_type(ns=0/0,id=0/0),eth(src=6e:56:fd:40:6e:22,dst=f6:36:11:c6:04:f0),eth_type(0x0800),ipv4(src=
>> 0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no),
>> packets:639, bytes:53676, used:0.710s, dp:tc, actions:tap103i1
>> >> >
>> >> > root@pvet2:~# ovs-appctl dpctl/dump-flows -m type=tc
>> >> > ufid:f4d0ebd2-7ba9-4e21-9bf8-090f90bac072,
>> skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens15f0np0),packet_type(ns=0/0,id=0/0),eth(src=f6:36:11:c6:04:f0,dst=6e:56:fd:40:6e:22),eth_type(0x0800),ipv4(src=
>> 0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no),
>> packets:656, bytes:55104, used:0.390s, dp:tc, actions:tap100i1
>> >>
>> >> The two flows listed above have an action towards what looks like a
>> >> virtual Ethernet tap interface. That is not a supported configuration
>> >> for hardware offload.
>> >
>> > I'm using Proxmox as the hypervisor.
>> >
>> > It seems like Proxmox is attaching the VM (using a virtio nic) as a tap
>> (tap100i1) to the OVS:
>> >
>> > root@pvet2:~# ovs-dpctl show
>> > system@ovs-system:
>> >   lookups: hit:143640674 missed:1222 lost:45
>> >   flows: 2
>> >   masks: hit:143642152 total:2 hit/pkt:1.00
>> >   port 0: ovs-system (internal)
>> >   port 1: vmbr1 (internal)
>> >   port 2: ens15f0np0
>> >   port 3: vlan44 (internal)
>> >   port 4: vlan66 (internal)
>> >   port 5: ens15f0npf0vf0
>> >   port 6: ens15f0npf0vf1
>> >   port 7: tap100i1
>> >
>> > Given that proxmox uses KVM for virtualization, what's the correct way
>> to link a KVM VM to OVS?
>>
>> I do not have detailed knowledge about proxmox, but what you would be
>> looking for is PCI Passthrough and SR-IOV. Once you get the instance
>> to use a VF instead of a virtio nic you should add its representor
>> port to the OVS bridge.
>>
>> You can find the names of the representor ports by issuing the
>> `devlink port show` command.
>>
>> >>
>> >> The instance needs to be connected to an
>> >
>> > When you say instance here, do you mean the KVM virtual machine or the
>> instance of OVS?
>>
>> Instance as in the KVM virtual machine instance.
>>
>> >>
>> >> interface wired directly to the embedded switch in the card by
>> >> attaching a VF or SF to the instance.
>> >
>> > OVS is attached to the physical nic using the PF and 2 VFs as shown in
>> the ovs-dpctl output above
>>
>> Beware that th

Re: [ovs-discuss] MAC binding aging refresh mechanism

2023-05-25 Thread Han Zhou via discuss
On Thu, May 25, 2023 at 9:19 AM Ilya Maximets  wrote:
>
> On 5/25/23 14:08, Ales Musil via discuss wrote:
> > Hi,
> >
> > to improve the MAC binding aging mechanism we need a way to ensure that
rows which are still in use are preserved. This doesn't happen with the
current implementation.
> >
> > I propose the following solution which should solve the issue, any
questions or comments are welcome. If there isn't anything major that would
block this approach I would start to implement it so it can be available on
23.09.
> >
> > For the approach itself:
> >
> > Add "mac_cache_use" action into the "lr_in_learn_neighbor" table (only on the
flow that continues on a known MAC binding):
> > match=(REGBIT_LOOKUP_NEIGHBOR_RESULT == 1 ||
REGBIT_LOOKUP_NEIGHBOR_IP_RESULT == 0), action=(next;)  ->
match=(REGBIT_LOOKUP_NEIGHBOR_RESULT == 1 ||
REGBIT_LOOKUP_NEIGHBOR_IP_RESULT == 0), action=(mac_cache_use; next;)
> >
> > The "mac_cache_use" would translate to resubmit into separate table
with flows per MAC binding as follows:
> > match=(ip.src=, eth.src=, datapath=),
action=(drop;)

It is possible that some workload has heavy traffic in the ingress direction
only, such as some UDP streams, while not sending anything out for a long
interval. So I am not sure whether using "src" only would be sufficient.

>
> One concern here would be that it will likely cause a packet clone
> in the datapath just to immediately drop it.  So, might have a
> noticeable performance impact.
>
+1. We need to be careful to avoid any dataplane performance impact, which
doesn't sound justified for the value.

> >
> > This should bump the statistics every time for the correct MAC binding.
In ovn-controller we could periodically dump the flows from this table. The
period would be set to MIN(mac_binding_age_threshold/2) from all local
datapaths. The dump would happen from a different thread with its own rconn
to prevent backlogging issues. The thread would receive mapped data from
I-P node that would keep track of mapping datapath -> cookies -> mac
bindings. This allows us to avoid constant lookups, but at the cost of
keeping track of all local MAC bindings. To save some computation time this
I-P could be relevant only for datapaths that actually have the threshold
set.
> >
> > If the "idle_age" of the particular flow is smaller than the datapath
"mac_binding_age_threshold" it means that it is still in use. To prevent a
lot of updates, if the traffic is still relevant on multiple controllers,
we would check if the timestamp is older than the "dump period"; if not we
don't have to update it, because someone else did.
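The per-flow idle-age check described in the quoted proposal could be sketched in shell against a single line of `ovs-ofctl dump-flows` output (the sample flow line, table number, cookie, and threshold here are all illustrative, not taken from an actual deployment):

```shell
# Sample dump-flows line for one MAC binding's cache flow (illustrative):
flow='cookie=0xab12, duration=125.5s, table=68, n_packets=42, idle_age=17, priority=100 actions=drop'

# Extract the idle_age value from the flow line:
idle=$(printf '%s\n' "$flow" | grep -o 'idle_age=[0-9]*' | cut -d= -f2)

# Hypothetical per-datapath mac_binding_age_threshold, in seconds:
threshold=300

# If the flow has been idle for less than the threshold, the MAC
# binding is still in use and its timestamp should be refreshed:
if [ "$idle" -lt "$threshold" ]; then
  echo "still in use: refresh the MAC binding timestamp"
else
  echo "idle past threshold: let the row age out"
fi
```

This is only a sketch of the decision logic; the real implementation would run inside ovn-controller, keyed by the datapath-to-cookie mapping described above.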

Thanks for trying to reduce the number of updates to SB DB, but I still
have some concerns for this. In theory, to prevent the records being
deleted while it is still used, at least one timestamp update is required
per threshold for each record. Even if we bundle the updates from each
node, assume that the workloads that own the IP/MAC of the mac_binding
records are distributed across 1000 nodes, and the aging threshold is 30s,
there will still be ~30 updates/s (if we can evenly distribute the updates
from different nodes). That's still a lot, which may keep the SB server and
all the ovn-controllers busy just for these messages. If the aging threshold
is set to 300s (5min), it may look better: ~3updates/s, but this still
could contribute to the major part of the SB <-> ovn-controller messages,
e.g. in ovn-k8s deployment the cluster LR is distributed on all nodes so
all nodes would need to monitor all mac-binding timestamp updates related
to the cluster LR, which means all mac-binding updates from all nodes. In
reality the amount of messages may be doubled if we use the proposed
dump-and-check interval mac_binding_age_threshold/2.

So, I'd evaluate both the dataplane and control plane cost before going for
formal implementation.

Thanks,
Han

>
> >
> > Also to "desync" the controllers there would be a random delay added to
the "dump period".
> >
> > All of this would be applicable to FDB aging as well.
> >
> > Does that sound reasonable?
> > Please let me know if you have any comments/suggestions.
> >
> > Thanks,
> > Ales
>


Re: [ovs-discuss] MAC binding aging refresh mechanism

2023-05-25 Thread Ales Musil via discuss
On Fri, May 26, 2023 at 7:58 AM Han Zhou  wrote:

>
>
> On Thu, May 25, 2023 at 9:19 AM Ilya Maximets  wrote:
> >
> > On 5/25/23 14:08, Ales Musil via discuss wrote:
> > > Hi,
> > >
> > > to improve the MAC binding aging mechanism we need a way to ensure
> that rows which are still in use are preserved. This doesn't happen with
> current implementation.
> > >
> > > I propose the following solution which should solve the issue, any
> questions or comments are welcome. If there isn't anything major that would
> block this approach I would start to implement it so it can be available on
> 23.09.
> > >
> > > For the approach itself:
> > >
> > > Add "mac_cache_use" action into "lr_in_learn_neighbor" table (only the
> flow that continues on known MAC binding):
> > > match=(REGBIT_LOOKUP_NEIGHBOR_RESULT == 1 ||
> REGBIT_LOOKUP_NEIGHBOR_IP_RESULT == 0), action=(next;)  ->
> match=(REGBIT_LOOKUP_NEIGHBOR_RESULT == 1 ||
> REGBIT_LOOKUP_NEIGHBOR_IP_RESULT == 0), action=(mac_cache_use; next;)
> > >
> > > The "mac_cache_use" would translate to resubmit into separate table
> with flows per MAC binding as follows:
> > > match=(ip.src=, eth.src=, datapath=),
> action=(drop;)
>
> It is possible that some workload has heavy traffic for ingress direction
> only, such as some UDP streams, but not sending anything out for a long
> interval. So I am not sure if using "src" only would be sufficient.
>

Using table 67 (for the "dst" check) still has the black hole problem. I'm
not sure there is a universal solution that would satisfy UDP traffic that
goes only one way, whichever direction that is.
Also, the negative effect of re-ARPing on UDP traffic should be a lot
smaller than for connection-oriented protocols.


>
> >
> > One concern here would be that it will likely cause a packet clone
> > in the datapath just to immediately drop it.  So, might have a
> > noticeable performance impact.
> >
> +1. We need to be careful to avoid any dataplane performance impact, which
> doesn't sound justified for the value.
>

We can test whether this causes a clone or not; if so, we might need to
include it as a regular table in the pipeline.


>
> > >
> > > This should bump the statistics every time for the correct MAC
> binding. In ovn-controller we could periodically dump the flows from this
> table. the period would be set to MIN(mac_binding_age_threshold/2) from all
> local datapaths. The dump would happen from a different thread with its own
> rconn to prevent backlogging issues. The thread would receive mapped data
> from I-P node that would keep track of mapping datapath -> cookies -> mac
> bindings. This allows us to avoid constant lookups, but at the cost of
> keeping track of all local MAC bindings. To save some computation time this
> I-P could be relevant only for datapaths that actually have the threshold
> set.
> > >
> > > If the "idle_age" of the particular flow is smaller than the datapath
> "mac_binding_age_threshold" it means that it is still in use. To prevent a
> lot of updates, if the traffic is still relevant on multiple controllers,
> we would check if the timestamp is older than the "dump period"; if not we
> don't have to update it, because someone else did.
>
> Thanks for trying to reduce the number of updates to SB DB, but I still
> have some concerns for this. In theory, to prevent the records being
> deleted while it is still used, at least one timestamp update is required
> per threshold for each record. Even if we bundle the updates from each
> node, assume that the workloads that own the IP/MAC of the mac_binding
> records are distributed across 1000 nodes, and the aging threshold is 30s,
> there will still be ~30 updates/s (if we can evenly distribute the updates
> from different nodes). That's still a lot, which may keep the SB server and
> all the ovn-controller busy just for these messages. If the aging threshold
> is set to 300s (5min), it may look better: ~3updates/s, but this still
> could contribute to the major part of the SB <-> ovn-controller messages,
> e.g. in ovn-k8s deployment the cluster LR is distributed on all nodes so
> all nodes would need to monitor all mac-binding timestamp updates related
> to the cluster LR, which means all mac-binding updates from all nodes. In
> reality the amount of messages may be doubled if we use the proposed
> dump-and-check interval mac_binding_age_threshold/2.
>

Arguably, the 30 second threshold is way too low. You are right that a double
update might happen in some cases; that can still be mitigated by shifting
the timestamp check to, for example, 3/4 of the threshold. That should still
give us enough time to perform the update while greatly reducing the chance
of doing it twice.

One thing to keep in mind is that these updates would happen only for MAC
bindings that have aging enabled. For some parts of the cluster that are
fixed enough, or where the MAC binding table grows very slowly, it might not
make sense to enable it; if so, it might be with a pretty big th