Re: [ovs-dev] [PATCH net-next v3 0/7] selftests: net: Switch pmtu.sh to use the internal ovs script.

2024-06-28 Thread Jakub Kicinski
On Fri, 28 Jun 2024 14:04:09 -0400 Aaron Conole wrote:
> > I also added the OvS tests themselves, and those are not passing, yet:
> > https://netdev.bots.linux.dev/contest.html?test=openvswitch-sh
> > Could you take a look and LMK if these are likely env issues or
> > something bad in the test itself?  
> 
> I saw that.  I was looking for a place in the nipa repository where I
> could submit a small fix, because I noticed in the stdout:
> 
>   make -C tools/testing/selftests TARGETS="net/openvswitch"
>   TEST_PROGS=openvvswitch.sh TEST_GEN_PROGS="" run_tests
>   
> and I think the TEST_PROGS=openvvswitch.sh is misspelled (but it seems
> to not matter too much for the run_test target).

:o that's a weird bug, whatever is echoing back the input from the VMs
stdin to stdout is duplicating 74th character?! but as you say looks
like the VM gets the right input, it's just the echo.

> From what I understand, there are two things causing it to be flaky.
> First, the module detection is a bit flaky (and that's why it results is
> some 'skip' reports).  Additionally, the connection oriented tests
> include negative cases and those hit timeouts.  The default is to
> declare failure after 45s.  That can be seen in:
> 
>   
> https://netdev-3.bots.linux.dev/vmksft-net/results/659601/91-openvswitch-sh/stdout
>   ...
>   # timeout set to 45
>   ...
>   # TEST: nat_connect_v4
> [START]
>   # Terminated
>   # Terminated
> 
> This is showing that the timeout is too short.
> 
> I have patches ready for these issues, but I didn't know if you would
> like me to submit config 

config meaning make OvS built in? We have a number of tests using
module auto-loading, I don't think there were major issues with it.
(well, the rebuilding of modules is a bit questionable with vng,
but we do 'make mrproper' between each build to work around that).

> and settings files to go under net/openvswitch,
> or if you would prefer to see the openvswitch.sh script, and
> ovs-dpctl.py utilities move out of their net/openvswitch/ directory.  If
> the latter, I can submit patches quickly with config and settings (and a
> small change to the script itself) that addresses these.  If you'd
> prefer the former (moving around the files), I'll need to spend some
> additional time modifying pmtu and doing a larger test.  I don't have a
> strong opinion on either approach.

Since we talked about it a while back I gave in and implemented support
for combining multiple targets in a single runner. So it doesn't matter
any more (barring bugs in NIPA), up to you.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] vswitchd: Only lock pages that are faulted in.

2024-06-28 Thread Ilya Maximets
On 6/28/24 16:51, Eelco Chaudron wrote:
> 
> 
> On 14 Jun 2024, at 14:22, Ilya Maximets wrote:
> 
>> The main purpose of locking the memory is to ensure that OVS can keep
>> doing what it did before in case of increased memory pressure, e.g.,
>> during VM ingest / migration.  Fulfilling this requirement can be
>> achieved without locking all the allocated memory, but only the pages
>> already accessed in the past (faulted in).  Processing of the new
>> traffic involves new memory allocations.  Latency on these operations
>> can't be guaranteed by the locking.  The main difference would be
>> the pre-faulting of the stack memory.  However, in order to revalidate
>> or process upcalls on the same traffic, the same amount of stack is
>> likely needed, so all the necessary memory will already be faulted in.
>>
>> Switch 'mlockall' to MCL_ONFAULT to avoid consuming unnecessarily
>> large amounts of RAM on systems with high core counts.  For example,
>> in a densely populated OVN cluster this saves about 650 MB of RAM per
>> node on a system with 64 cores.  This equates to 320 GB of allocated
>> but unused RAM in a 500 node cluster.
>>
>> This also makes OVS better suited by default for small systems with
>> limited amount of memory.
>>
>> The MCL_ONFAULT flag was introduced in Linux kernel 4.4 and wasn't
>> available at the time of '--mlockall' introduction, but we can use it
>> now.  Falling back to an old way of locking in case we're running on
>> an older kernel just in case.
>>
>> Only locking the faulted in pages also makes locking compatible with
>> vhost post-copy live migration by default, because we'll no longer
>> pre-fault all the guest's memory.  Post-copy relies on userfaultfd
>> to work on shared huge pages, which is only available in 4.11+ kernels.
>> So, technically, it should not be possible for MCL_ONFAULT to fail and
>> the call without it to succeed.  But keeping the check just in case
>> for now.
>>
>> Signed-off-by: Ilya Maximets 
> 
> This change looks good. I did some performance testing and I noticed not 
> negative impacts.
> 
> Acked-by: Eelco Chaudron 
> 

Thanks, Eelco and Simon!  Applied.

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v3] netdev-offload-dpdk: Support offload of set dscp action.

2024-06-28 Thread Ilya Maximets
On 6/28/24 23:08, Ilya Maximets wrote:
> On 6/20/24 09:35, Sunyang Wu via dev wrote:
>> Add the "set dscp action" parsing function,
>> so that the "set dscp action" can be offloaded.
>>
>> Signed-off-by: Sunyang Wu 
>> ---
>>  lib/netdev-offload-dpdk.c | 19 ++-
>>  1 file changed, 18 insertions(+), 1 deletion(-)
>>
> 
> Hi, Sunyang Wu.  Thanks for the patch!
> See some comments inline.
> 
> Best regards, Ilya Maximets.
> 
>> diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
>> index 623005b1c..524942457 100644
>> --- a/lib/netdev-offload-dpdk.c
>> +++ b/lib/netdev-offload-dpdk.c
>> @@ -791,6 +791,17 @@ dump_flow_action(struct ds *s, struct ds *s_extra,
>>  ds_put_format(s, "port %"PRIu16" ", ntohs(set_tp->port));
>>  }
>>  ds_put_cstr(s, "/ ");
>> +} else if (actions->type == RTE_FLOW_ACTION_TYPE_SET_IPV4_DSCP ||
>> +   actions->type == RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP) {
>> +const struct rte_flow_action_set_dscp *set_dscp = actions->conf;
>> +char *dirstr = actions->type == RTE_FLOW_ACTION_TYPE_SET_IPV4_DSCP
>> +   ? "set_ipv4_dscp " : "set_ipv6_dscp ";
>> +
>> +ds_put_cstr(s, dirstr);
>> +if (set_dscp) {
>> +ds_put_format(s, "dscp_value %d ", set_dscp->dscp);
>> +}
>> +ds_put_cstr(s, "/ ");
>>  } else if (actions->type == RTE_FLOW_ACTION_TYPE_OF_PUSH_VLAN) {
>>  const struct rte_flow_action_of_push_vlan *of_push_vlan =
>>  actions->conf;
>> @@ -1835,7 +1846,10 @@ add_set_flow_action__(struct flow_actions *actions,
>>  if (is_all_zeros(mask, size)) {
>>  return 0;
>>  }
>> -if (!is_all_ones(mask, size)) {
>> +if (!is_all_ones(mask, size) &&
>> +/* set dscp need patial mask */
>> +attr != RTE_FLOW_ACTION_TYPE_SET_IPV4_DSCP &&
>> +attr != RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP) {
> 
> This doesn't seem right.  We need to check that the mask contains
> all the DSCP bits and that it doesn't have anything else inside.
> 
> For example, if we try to set ECN bits, we'll have them in a mask.
> In this case is_all_zeros() check will fail and then this check
> will allow us to proceed as well.  In the end, we'll end up setting
> DSCP bits to all zeroes, or worse, to whatever was in the first
> six bits of nw_tos.
> 

Also, this function will clear the whole mask below, but we should
only clear DSCP bits in this case.

>>  VLOG_DBG_RL(&rl, "Partial mask is not supported");
>>  return -1;
>>  }
>> @@ -1912,6 +1926,7 @@ parse_set_actions(struct flow_actions *actions,
>>  add_set_flow_action(ipv4_src, 
>> RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC);
>>  add_set_flow_action(ipv4_dst, 
>> RTE_FLOW_ACTION_TYPE_SET_IPV4_DST);
>>  add_set_flow_action(ipv4_ttl, RTE_FLOW_ACTION_TYPE_SET_TTL);
>> +add_set_flow_action(ipv4_tos, 
>> RTE_FLOW_ACTION_TYPE_SET_IPV4_DSCP);z
>>  
>>  if (mask && !is_all_zeros(mask, sizeof *mask)) {
>>  VLOG_DBG_RL(&rl, "Unsupported IPv4 set action");
>> @@ -1924,6 +1939,8 @@ parse_set_actions(struct flow_actions *actions,
>>  add_set_flow_action(ipv6_src, 
>> RTE_FLOW_ACTION_TYPE_SET_IPV6_SRC);
>>  add_set_flow_action(ipv6_dst, 
>> RTE_FLOW_ACTION_TYPE_SET_IPV6_DST);
>>  add_set_flow_action(ipv6_hlimit, RTE_FLOW_ACTION_TYPE_SET_TTL);
>> +add_set_flow_action(ipv6_tclass,
>> +RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP);
>>  
>>  if (mask && !is_all_zeros(mask, sizeof *mask)) {
>>  VLOG_DBG_RL(&rl, "Unsupported IPv6 set action");
> 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v5 2/2] pinctrl: Handle arp/nd for other address families.

2024-06-28 Thread Numan Siddique
On Wed, Jun 12, 2024 at 2:50 AM Felix Huettner via dev
 wrote:
>
> Previously we could only generate ARP requests from IPv4 packets
> and NS requests from IPv6 packets. This was the case because we rely on
> information in the packet to generate the ARP/NS requests.
>
> However in case of ARP/NS requests originating from the Logical_Router
> pipeline for nexthop lookups we overwrite the affected fields
> afterwards. This overwrite is done by the userdata openflow actions.
> Because of this we actually do not rely on any information of the IPv4/6
> packets in these cases.
>
> Unfortunately we can not easily determine if we are actually later
> overwriting the affected fields. The approach now is to use the fields
> from the IP header if we have a matching IP version and default to some
> values otherwise. In case we overwrite this data afterwards we are
> generally good. If we do not overwrite this data because of some bug we
> will send out invalid ARP/NS requests. They will hopefully be dropped by
> the rest of the network.
>
> The alternative would have been to introduce new arp/nd_ns actions where
> we guarantee this overwrite. This would not suffer from the above
> limitations, but would require a coordination on upgrades between all
> ovn-controllers and northd.
>
> Signed-off-by: Felix Huettner 

Hi Felix,

Thanks for the patch series and sorry for the delay in reviews.

Looks like this patch series requires another rebase.

I tested this patch series using the ovn-fake-mutlinode - [1] .

After starting it,  I added the below static route

ovn-nbctl lr-route-add lr1 172.15.0.0/24 3001::b

And in the 'ovn-chassis-1' container I ran this command

[root@ovn-chassis-1 ~]# ip netns exec sw01p1 ping 172.15.0.50

Since we don't know the next hop,  the below logical flow is hit
(which is as expected)

  table=24(lr_in_arp_request  ), priority=200  , match=(eth.dst ==
00:00:00:00:00:00 && reg9[9] == 0 && xxreg0 == 3001::4), action=(nd_ns
{ eth.dst = 33:33:ff:00:00:04; ip6.dst = ff02::1:ff00:4; nd.target =
3001::b; output; }; output;)

And If I look into the ovn-controller logs (after enabling vconn:dbg)
I see that ovn-controller receives the icmp ping packet and it
generates IPv6 NS packet which is also as expected


2024-06-28T21:39:06.303Z|00020|vconn(ovn_pinctrl0)|DBG|unix:/var/run/openvswitch/br-int.mgmt:
received: NXT_PACKET_IN2 (OF1.5) (xid=0x0): cookie=0x77e2d8b
total_len=98 
reg0=0x3001,reg1=0xac10016e,reg3=0xb,reg4=0x3001,reg7=0xa,reg9=0x4,reg10=0x1,reg11=0x1,reg12=0x3,reg14=0x1,reg15=0x3,metadata=0x3,in_port=5
(via action) data_len=98 (unbuffered)
 
userdata=00.00.00.09.00.00.00.00.00.1c.00.18.00.80.00.00.00.00.00.00.00.01.de.10.80.00.3e.10.00.00.00.00.ff.ff.00.10.00.00.23.20.00.0e.ff.f8.25.00.00.00
 continuation.bridge=b72b4e65-0756-4215-bfbf-c3db3652e8ee
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,37)
 continuation.odp_port=5
icmp,vlan_tci=0x,dl_src=30:51:00:00:00:03,dl_dst=00:00:00:00:00:00,nw_src=11.0.0.3,nw_dst=172.15.0.50,nw_tos=0,nw_ecn=0,nw_ttl=63,nw_frag=no,icmp_type=8,icmp_code=0
icmp_csum:cf9b
2024-06-28T21:39:06.304Z|00021|vconn(ovn_pinctrl0)|DBG|unix:/var/run/openvswitch/br-int.mgmt:
sent (Success): OFPT_PACKET_OUT (OF1.5) (xid=0x413):
in_port=CONTROLLER
actions=set_field:0x3001->reg0,set_field:0xac10016e->reg1,set_field:0xb->reg3,set_field:0x3001->reg4,set_field:0xa->reg7,set_field:0x4->reg9,set_field:0x1->reg10,set_field:0x1->reg11,set_field:0x3->reg12,set_field:0x1->reg14,set_field:0x3->reg15,set_field:0x3->metadata,move:NXM_NX_XXREG0[]->NXM_NX_ND_TARGET[],resubmit(,37)
data_len=86
icmp6,vlan_tci=0x,dl_src=30:51:00:00:00:03,dl_dst=33:33:ff:ff:ff:ff,ipv6_src=fe80::3251:ff:fe00:3,ipv6_dst=ff02::1::,ipv6_label=0x0,nw_tos=0,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=135,icmp_code=0,nd_target=:::::::,nd_sll=30:51:00:00:00:03,nd_tll=00:00:00:00:00:00
icmp6_csum:1877
--

But strangely when this packet leaves the physical interface 'eth2' on
ovn-chassis-1,  the nd_target is wrongly populated

tcpdump on the physical interface (connected to br-ex)
---
17:37:40.289881 30:51:00:00:00:03 > 33:33:ff:ff:ff:ff, ethertype IPv6
(0x86dd), length 86: (hlim 255, next-header ICMPv6 (58) payload
length: 32) fe80::3251:ff:fe00:3 > ff02::1::: [icmp6 sum ok]
ICMP6, neighbor solicitation, length 32, who has 3001:0:ac10:16e::b
  source link-address option (1), length 8 (1): 30:51:00:00:00:03

---

You can see that the nd_target is '3001:0:ac10:16e::b  and not
'3001::b', which is strange.  The inner action of "nd_ns" clearly sets
nd.target = 3001::b.

Can you please verify in your end if that is the case ?  You can
perhaps use ovn-fake-multinode if you can't  reproduce in your setup.

Thanks
Numan




> ---
> v4->v5: rebase
> v4: newly added
>
>  controller/pinctrl.c |  52 +++--
>  lib/actions.c|   4 +-
>  northd/northd.c  |   9 +-
>  tests/ovn-northd.at  |   8 +-
>  tests/ovn.at | 268 +

Re: [ovs-dev] [PATCH 1/1] netdev-dpdk: Check pending reset when adding device.

2024-06-28 Thread Ilya Maximets
On 6/13/24 11:40, David Marchand wrote:
> On Wed, Jun 12, 2024 at 4:33 PM Kevin Traynor  wrote:
>>
>> When a device reset interrupt event (RTE_ETH_EVENT_INTR_RESET)
>> is detected for a DPDK device added to OVS, a device reset is
>> performed.
>>
>> If a device reset interrupt event is detected for a device before
>> it is added to OVS, device reset is not called.
>>
>> If that device is later attempted to be added to OVS, it may fail
>> while being configured if it is still pending a reset as pending
>> reset is not checked when adding a device.
>>
>> A simple way to force a reset event from the ice driver for an
>> iavf device is to set the mac address after binding iavf dev to
>> vfio but before adding to OVS. (note: should not be set like this
>> in normal case). e.g.
>>
>> $ echo 2 > /sys/class/net/ens3f0/device/sriov_numvfs
>> $ ./devbind.py -b vfio-pci :d8:01.1
>> $ ip link set ens3f0 vf 1 mac 26:ab:e6:6f:79:4d
>> $ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk \
>>   options:dpdk-devargs=:d8:01.1
>>
>> |dpdk|ERR|Port1 dev_configure = -1
>> |netdev_dpdk|WARN|Interface dpdk0 eth_dev setup error
>> Operation not permitted
>> |netdev_dpdk|ERR|Interface dpdk0(rxq:1 txq:5 lsc interrupt mode:false)
>> configure error: Operation not permitted
>> |dpif_netdev|ERR|Failed to set interface dpdk0 new configuration
>>
>> Add a check if there was any previous device reset interrupt events
>> when a device is added to OVS. If there was, perform the reset
>> before continuing with the rest of the configuration.
>>
>> netdev_dpdk_pending_reset[] already tracks device reset interrupt
>> events for all devices, so it can be reused to check if there is a
>> reset needed during configuration of newly added devices. By extending
>> it's usage, dev->reset_needed is no longer needed.
>>
> 
> Maybe add a Fixes: and I think we should backport this.
> 
>> Signed-off-by: Kevin Traynor 
>> ---
>>  lib/netdev-dpdk.c | 20 
>>  1 file changed, 12 insertions(+), 8 deletions(-)
>
> The patch looks good to me.
> Reviewed-by: David Marchand 

Thanks, Kevin and David!  I fixed the nits and applied the change.
Also backported to branch-3.3.

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v3] netdev-offload-dpdk: Support offload of set dscp action.

2024-06-28 Thread Ilya Maximets
On 6/20/24 09:35, Sunyang Wu via dev wrote:
> Add the "set dscp action" parsing function,
> so that the "set dscp action" can be offloaded.
> 
> Signed-off-by: Sunyang Wu 
> ---
>  lib/netdev-offload-dpdk.c | 19 ++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 

Hi, Sunyang Wu.  Thanks for the patch!
See some comments inline.

Best regards, Ilya Maximets.

> diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
> index 623005b1c..524942457 100644
> --- a/lib/netdev-offload-dpdk.c
> +++ b/lib/netdev-offload-dpdk.c
> @@ -791,6 +791,17 @@ dump_flow_action(struct ds *s, struct ds *s_extra,
>  ds_put_format(s, "port %"PRIu16" ", ntohs(set_tp->port));
>  }
>  ds_put_cstr(s, "/ ");
> +} else if (actions->type == RTE_FLOW_ACTION_TYPE_SET_IPV4_DSCP ||
> +   actions->type == RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP) {
> +const struct rte_flow_action_set_dscp *set_dscp = actions->conf;
> +char *dirstr = actions->type == RTE_FLOW_ACTION_TYPE_SET_IPV4_DSCP
> +   ? "set_ipv4_dscp " : "set_ipv6_dscp ";
> +
> +ds_put_cstr(s, dirstr);
> +if (set_dscp) {
> +ds_put_format(s, "dscp_value %d ", set_dscp->dscp);
> +}
> +ds_put_cstr(s, "/ ");
>  } else if (actions->type == RTE_FLOW_ACTION_TYPE_OF_PUSH_VLAN) {
>  const struct rte_flow_action_of_push_vlan *of_push_vlan =
>  actions->conf;
> @@ -1835,7 +1846,10 @@ add_set_flow_action__(struct flow_actions *actions,
>  if (is_all_zeros(mask, size)) {
>  return 0;
>  }
> -if (!is_all_ones(mask, size)) {
> +if (!is_all_ones(mask, size) &&
> +/* set dscp need patial mask */
> +attr != RTE_FLOW_ACTION_TYPE_SET_IPV4_DSCP &&
> +attr != RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP) {

This doesn't seem right.  We need to check that the mask contains
all the DSCP bits and that it doesn't have anything else inside.

For example, if we try to set ECN bits, we'll have them in a mask.
In this case is_all_zeros() check will fail and then this check
will allow us to proceed as well.  In the end, we'll end up setting
DSCP bits to all zeroes, or worse, to whatever was in the first
six bits of nw_tos.

>  VLOG_DBG_RL(&rl, "Partial mask is not supported");
>  return -1;
>  }
> @@ -1912,6 +1926,7 @@ parse_set_actions(struct flow_actions *actions,
>  add_set_flow_action(ipv4_src, RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC);
>  add_set_flow_action(ipv4_dst, RTE_FLOW_ACTION_TYPE_SET_IPV4_DST);
>  add_set_flow_action(ipv4_ttl, RTE_FLOW_ACTION_TYPE_SET_TTL);
> +add_set_flow_action(ipv4_tos, 
> RTE_FLOW_ACTION_TYPE_SET_IPV4_DSCP);z
>  
>  if (mask && !is_all_zeros(mask, sizeof *mask)) {
>  VLOG_DBG_RL(&rl, "Unsupported IPv4 set action");
> @@ -1924,6 +1939,8 @@ parse_set_actions(struct flow_actions *actions,
>  add_set_flow_action(ipv6_src, RTE_FLOW_ACTION_TYPE_SET_IPV6_SRC);
>  add_set_flow_action(ipv6_dst, RTE_FLOW_ACTION_TYPE_SET_IPV6_DST);
>  add_set_flow_action(ipv6_hlimit, RTE_FLOW_ACTION_TYPE_SET_TTL);
> +add_set_flow_action(ipv6_tclass,
> +RTE_FLOW_ACTION_TYPE_SET_IPV6_DSCP);
>  
>  if (mask && !is_all_zeros(mask, sizeof *mask)) {
>  VLOG_DBG_RL(&rl, "Unsupported IPv6 set action");

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2] Userspace: Software fallback for UDP encapsulated TCP segmentation.

2024-06-28 Thread Mike Pattrick
On Fri, Jun 28, 2024 at 4:50 AM David Marchand
 wrote:
>
> On Thu, Jun 27, 2024 at 4:04 PM David Marchand
>  wrote:
> > @@ -926,6 +926,17 @@ netdev_send(struct netdev *netdev, int qid,
> > struct dp_packet_batch *batch,
> >  return netdev_send_tso(netdev, qid, batch, 
> > concurrent_txq);
> >  }
> >  }
> > +} else if (!(netdev_flags & NETDEV_TX_OFFLOAD_OUTER_UDP_CKSUM)) {
> > +DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +if (!dp_packet_hwol_is_tso(packet) &&
> > +!dp_packet_hwol_is_tunnel_vxlan(packet) &&
> > +!dp_packet_hwol_is_tunnel_geneve(packet)) {
> > +continue;
>
> Erm, less buggy:

Thanks for the review David! These are some good ideas. I'll send in a
v3 with them included.


Cheers,
M

>
> +if (!dp_packet_hwol_is_tso(packet) ||
> +(!dp_packet_hwol_is_tunnel_vxlan(packet) &&
> + !dp_packet_hwol_is_tunnel_geneve(packet))) {
> +continue;
> +}
>
> > +}
> > +if (dp_packet_hwol_is_outer_udp_cksum(packet)) {
> > +return netdev_send_tso(netdev, qid, batch, 
> > concurrent_txq);
> > +}
> > +}
> >  }
> >  }
>
>
> --
> David Marchand
>

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net-next v6 05/10] net: openvswitch: add psample action

2024-06-28 Thread Simon Horman
On Fri, Jun 28, 2024 at 01:05:41PM +0200, Adrian Moreno wrote:
> Add support for a new action: psample.
> 
> This action accepts a u32 group id and a variable-length cookie and uses
> the psample multicast group to make the packet available for
> observability.
> 
> The maximum length of the user-defined cookie is set to 16, same as
> tc_cookie, to discourage using cookies that will not be offloadable.
> 
> Signed-off-by: Adrian Moreno 

...

> diff --git a/include/uapi/linux/openvswitch.h 
> b/include/uapi/linux/openvswitch.h
> index efc82c318fa2..07086759556b 100644
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -914,6 +914,31 @@ struct check_pkt_len_arg {
>  };
>  #endif
>  
> +#define OVS_PSAMPLE_COOKIE_MAX_SIZE 16
> +/**
> + * enum ovs_pample_attr - Attributes for %OVS_ACTION_ATTR_PSAMPLE

nit: s/ovs_pample_attr/ovs_psample_attr/

> + * action.
> + *
> + * @OVS_PSAMPLE_ATTR_GROUP: 32-bit number to identify the source of the
> + * sample.
> + * @OVS_PSAMPLE_ATTR_COOKIE: An optional variable-length binary cookie that
> + * contains user-defined metadata. The maximum length is
> + * OVS_PSAMPLE_COOKIE_MAX_SIZE bytes.
> + *
> + * Sends the packet to the psample multicast group with the specified group 
> and
> + * cookie. It is possible to combine this action with the
> + * %OVS_ACTION_ATTR_TRUNC action to limit the size of the sample.
> + */
> +enum ovs_psample_attr {
> + OVS_PSAMPLE_ATTR_GROUP = 1, /* u32 number. */
> + OVS_PSAMPLE_ATTR_COOKIE,/* Optional, user specified cookie. */
> +
> + /* private: */
> + __OVS_PSAMPLE_ATTR_MAX
> +};

...
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v3] python: ovsdb-idl: Add custom transaction operations.

2024-06-28 Thread Terry Wilson
It can be useful to be able to send raw transaction operations
through the Idl's connection. For example, to clean up MAC_Binding
entries for floating IPs without having to monitor the MAC_Binding
table which can be quite large.

Signed-off-by: Terry Wilson 
---
 NEWS |  2 ++
 python/ovs/db/idl.py | 21 -
 tests/ovsdb-idl.at   | 27 
 tests/test-ovsdb.py  | 73 
 4 files changed, 122 insertions(+), 1 deletion(-)

diff --git a/NEWS b/NEWS
index d05f2d0f8..fca983e5a 100644
--- a/NEWS
+++ b/NEWS
@@ -12,6 +12,8 @@ Post-v3.3.0
  * Link status changes are now handled via interrupt mode if the DPDK
driver supports it.  It is possible to revert to polling mode by setting
per interface 'options:dpdk-lsc-interrupt' to 'false'.
+   - Python:
+ * Add custom transaction support to the Idl via add_op().
 
 
 v3.3.0 - 16 Feb 2024
diff --git a/python/ovs/db/idl.py b/python/ovs/db/idl.py
index b6d5ed697..c1caab7b2 100644
--- a/python/ovs/db/idl.py
+++ b/python/ovs/db/idl.py
@@ -1708,6 +1708,8 @@ class Transaction(object):
 
 self._inserted_rows = {}  # Map from UUID to _InsertedRow
 
+self._operations = []
+
 def add_comment(self, comment):
 """Appends 'comment' to the comments that will be passed to the OVSDB
 server when this transaction is committed.  (The comment will be
@@ -1843,7 +1845,7 @@ class Transaction(object):
"rows": [rows]})
 
 # Add updates.
-any_updates = False
+any_updates = bool(self._operations)
 for row in self._txn_rows.values():
 if row._changes is None:
 if row._table.is_root:
@@ -1978,6 +1980,8 @@ class Transaction(object):
 operations.append({"op": "comment",
"comment": "\n".join(self._comments)})
 
+operations += self._operations
+
 # Dry run?
 if self.dry_run:
 operations.append({"op": "abort"})
@@ -1996,6 +2000,21 @@ class Transaction(object):
 self.__disassemble()
 return self._status
 
+def add_op(self, op):
+"""Add a raw OVSDB operation to the transaction
+
+This can be useful for re-using the existing Idl connection to take
+actions that are difficult or expensive to do with the Idl itself, e.g.
+bulk deleting rows from the server without downloading them into a
+local cache.
+
+All ops are applied after any other operations in the transaction
+
+:param op: An "op" for an OVSDB "transact" request (rfc 7047 Sec 5.2)
+:type op:   dict
+"""
+self._operations.append(op)
+
 def commit_block(self):
 """Attempts to commit this transaction, blocking until the commit
 either succeeds or fails.  Returns the final commit status, which may
diff --git a/tests/ovsdb-idl.at b/tests/ovsdb-idl.at
index b9dc0bdea..9070ea051 100644
--- a/tests/ovsdb-idl.at
+++ b/tests/ovsdb-idl.at
@@ -2863,6 +2863,33 @@ OVSDB_CHECK_IDL_PERS_UUID_INSERT([simple idl, persistent 
uuid insert],
   [['This UUID would duplicate a UUID already present within the table or 
deleted within the same transaction']])
 
 
+OVSDB_CHECK_IDL_PY([simple idl, python, add_op],
+  [],
+  [['insert 1, insert 2, insert 3, insert 1' \
+'add_op {"op": "delete", "table": "simple", "where": [["i", "==", 1]]}' \
+'add_op {"op": "insert", "table": "simple", "row": {"i": 2}}, delete 3' \
+'insert 2, add_op {"op": "update", "table": "simple", "row": {"i": 1}, 
"where": [["i", "==", 2]]}'
+  ]],
+  [[000: empty
+001: commit, status=success
+002: table simple: i=1 r=0 b=false s= u=<0> ia=[] ra=[] ba=[] sa=[] ua=[] 
uuid=<1>
+002: table simple: i=1 r=0 b=false s= u=<0> ia=[] ra=[] ba=[] sa=[] ua=[] 
uuid=<2>
+002: table simple: i=2 r=0 b=false s= u=<0> ia=[] ra=[] ba=[] sa=[] ua=[] 
uuid=<3>
+002: table simple: i=3 r=0 b=false s= u=<0> ia=[] ra=[] ba=[] sa=[] ua=[] 
uuid=<4>
+003: commit, status=success
+004: table simple: i=2 r=0 b=false s= u=<0> ia=[] ra=[] ba=[] sa=[] ua=[] 
uuid=<3>
+004: table simple: i=3 r=0 b=false s= u=<0> ia=[] ra=[] ba=[] sa=[] ua=[] 
uuid=<4>
+005: commit, status=success
+006: table simple: i=2 r=0 b=false s= u=<0> ia=[] ra=[] ba=[] sa=[] ua=[] 
uuid=<3>
+006: table simple: i=2 r=0 b=false s= u=<0> ia=[] ra=[] ba=[] sa=[] ua=[] 
uuid=<5>
+007: commit, status=success
+008: table simple: i=1 r=0 b=false s= u=<0> ia=[] ra=[] ba=[] sa=[] ua=[] 
uuid=<3>
+008: table simple: i=1 r=0 b=false s= u=<0> ia=[] ra=[] ba=[] sa=[] ua=[] 
uuid=<5>
+008: table simple: i=1 r=0 b=false s= u=<0> ia=[] ra=[] ba=[] sa=[] ua=[] 
uuid=<6>
+009: done
+]],[],sort)
+
+
 m4_define([OVSDB_CHECK_IDL_CHANGE_AWARE],
   [AT_SETUP([simple idl, database change aware, online conversion - $1])
AT_KEYWORDS([ovsdb server idl db_change_aware conversion $1])
diff --git a/tests/test-ovsdb.py b/tests/test-ovsdb.py
index 67a4

Re: [ovs-dev] [PATCH net-next v6 05/10] net: openvswitch: add psample action

2024-06-28 Thread Eelco Chaudron



On 28 Jun 2024, at 13:05, Adrian Moreno wrote:

> Add support for a new action: psample.
>
> This action accepts a u32 group id and a variable-length cookie and uses
> the psample multicast group to make the packet available for
> observability.
>
> The maximum length of the user-defined cookie is set to 16, same as
> tc_cookie, to discourage using cookies that will not be offloadable.
>
> Signed-off-by: Adrian Moreno 

I think this patch looks good. After some offline discussion on alignment with 
the userspace model, we decided to proceed with a psample() specific action.

With that in mind, and considering the additional changes, this patch looks 
good to me.

Acked-by: Eelco Chaudron echau...@redhat.com

Cheers,

Eelco

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net-next v6 05/10] net: openvswitch: add psample action

2024-06-28 Thread Aaron Conole
Adrian Moreno  writes:

> Add support for a new action: psample.
>
> This action accepts a u32 group id and a variable-length cookie and uses
> the psample multicast group to make the packet available for
> observability.
>
> The maximum length of the user-defined cookie is set to 16, same as
> tc_cookie, to discourage using cookies that will not be offloadable.
>
> Signed-off-by: Adrian Moreno 
> ---

I didn't thoroughly review this, but just wanted to comment that I like
the idea of a psample action here specific to the actual action that is
being performed on the packet - psample.  Much like we do for userspace
and other actions.

>  Documentation/netlink/specs/ovs_flow.yaml | 17 
>  include/uapi/linux/openvswitch.h  | 28 ++
>  net/openvswitch/Kconfig   |  1 +
>  net/openvswitch/actions.c | 47 +++
>  net/openvswitch/flow_netlink.c| 32 ++-
>  5 files changed, 124 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/netlink/specs/ovs_flow.yaml 
> b/Documentation/netlink/specs/ovs_flow.yaml
> index 4fdfc6b5cae9..46f5d1cd8a5f 100644
> --- a/Documentation/netlink/specs/ovs_flow.yaml
> +++ b/Documentation/netlink/specs/ovs_flow.yaml
> @@ -727,6 +727,12 @@ attribute-sets:
>  name: dec-ttl
>  type: nest
>  nested-attributes: dec-ttl-attrs
> +  -
> +name: psample
> +type: nest
> +nested-attributes: psample-attrs
> +doc: |
> +  Sends a packet sample to psample for external observation.
>-
>  name: tunnel-key-attrs
>  enum-name: ovs-tunnel-key-attr
> @@ -938,6 +944,17 @@ attribute-sets:
>-
>  name: gbp
>  type: u32
> +  -
> +name: psample-attrs
> +enum-name: ovs-psample-attr
> +name-prefix: ovs-psample-attr-
> +attributes:
> +  -
> +name: group
> +type: u32
> +  -
> +name: cookie
> +type: binary
>  
>  operations:
>name-prefix: ovs-flow-cmd-
> diff --git a/include/uapi/linux/openvswitch.h 
> b/include/uapi/linux/openvswitch.h
> index efc82c318fa2..07086759556b 100644
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -914,6 +914,31 @@ struct check_pkt_len_arg {
>  };
>  #endif
>  
> +#define OVS_PSAMPLE_COOKIE_MAX_SIZE 16
> +/**
> + * enum ovs_pample_attr - Attributes for %OVS_ACTION_ATTR_PSAMPLE
> + * action.
> + *
> + * @OVS_PSAMPLE_ATTR_GROUP: 32-bit number to identify the source of the
> + * sample.
> + * @OVS_PSAMPLE_ATTR_COOKIE: An optional variable-length binary cookie that
> + * contains user-defined metadata. The maximum length is
> + * OVS_PSAMPLE_COOKIE_MAX_SIZE bytes.
> + *
> + * Sends the packet to the psample multicast group with the specified group 
> and
> + * cookie. It is possible to combine this action with the
> + * %OVS_ACTION_ATTR_TRUNC action to limit the size of the sample.
> + */
> +enum ovs_psample_attr {
> + OVS_PSAMPLE_ATTR_GROUP = 1, /* u32 number. */
> + OVS_PSAMPLE_ATTR_COOKIE,/* Optional, user specified cookie. */
> +
> + /* private: */
> + __OVS_PSAMPLE_ATTR_MAX
> +};
> +
> +#define OVS_PSAMPLE_ATTR_MAX (__OVS_PSAMPLE_ATTR_MAX - 1)
> +
>  /**
>   * enum ovs_action_attr - Action types.
>   *
> @@ -966,6 +991,8 @@ struct check_pkt_len_arg {
>   * of l3 tunnel flag in the tun_flags field of OVS_ACTION_ATTR_ADD_MPLS
>   * argument.
>   * @OVS_ACTION_ATTR_DROP: Explicit drop action.
> + * @OVS_ACTION_ATTR_PSAMPLE: Send a sample of the packet to external 
> observers
> + * via psample.
>   *
>   * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not 
> all
>   * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
> @@ -1004,6 +1031,7 @@ enum ovs_action_attr {
>   OVS_ACTION_ATTR_ADD_MPLS, /* struct ovs_action_add_mpls. */
>   OVS_ACTION_ATTR_DEC_TTL,  /* Nested OVS_DEC_TTL_ATTR_*. */
>   OVS_ACTION_ATTR_DROP, /* u32 error code. */
> + OVS_ACTION_ATTR_PSAMPLE,  /* Nested OVS_PSAMPLE_ATTR_*. */
>  
>   __OVS_ACTION_ATTR_MAX,/* Nothing past this will be accepted
>  * from userspace. */
> diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
> index 29a7081858cd..2535f3f9f462 100644
> --- a/net/openvswitch/Kconfig
> +++ b/net/openvswitch/Kconfig
> @@ -10,6 +10,7 @@ config OPENVSWITCH
>  (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
>(!NF_NAT || NF_NAT) && \
>(!NETFILTER_CONNCOUNT || 
> NETFILTER_CONNCOUNT)))
> + depends on PSAMPLE || !PSAMPLE
>   select LIBCRC32C
>   select MPLS
>   select NET_MPLS_GSO
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 964225580824..a035b7e677dd 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -24,6 +24,11 @@
>  #include 

Re: [ovs-dev] [PATCH net-next v3 0/7] selftests: net: Switch pmtu.sh to use the internal ovs script.

2024-06-28 Thread Aaron Conole
Jakub Kicinski  writes:

> On Tue, 25 Jun 2024 13:22:38 -0400 Aaron Conole wrote:
>> Currently, if a user wants to run pmtu.sh and cover all the provided test
>> cases, they need to install the Open vSwitch userspace utilities.  This
>> dependency is difficult for users as well as CI environments, because the
>> userspace build and setup may require lots of support and devel packages
>> to be installed, system setup to be correct, and things like permissions
>> and selinux policies to be properly configured.
>
> Hi Aaron!
>
> I merged this yesterday (with slight alphabetical reshuffling of
> the config options). The pmtu.sh test is solid now, which is great!

:)  Thanks!  That's great to see.

> I also added the OvS tests themselves, and those are not passing, yet:
> https://netdev.bots.linux.dev/contest.html?test=openvswitch-sh
> Could you take a look and LMK if these are likely env issues or
> something bad in the test itself?

I saw that.  I was looking for a place in the nipa repository where I
could submit a small fix, because I noticed in the stdout:

  make -C tools/testing/selftests TARGETS="net/openvswitch"
  TEST_PROGS=openvvswitch.sh TEST_GEN_PROGS="" run_tests
  
and I think the TEST_PROGS=openvvswitch.sh is misspelled (but it seems
to not matter too much for the run_test target).

>From what I understand, there are two things causing it to be flaky.
First, the module detection is a bit flaky (and that's why it results is
some 'skip' reports).  Additionally, the connection oriented tests
include negative cases and those hit timeouts.  The default is to
declare failure after 45s.  That can be seen in:

  
https://netdev-3.bots.linux.dev/vmksft-net/results/659601/91-openvswitch-sh/stdout
  ...
  # timeout set to 45
  ...
  # TEST: nat_connect_v4[START]
  # Terminated
  # Terminated

This is showing that the timeout is too short.

I have patches ready for these issues, but I didn't know if you would
like me to submit config and settings files to go under net/openvswitch,
or if you would prefer to see the openvswitch.sh script, and
ovs-dpctl.py utilities move out of their net/openvswitch/ directory.  If
the latter, I can submit patches quickly with config and settings (and a
small change to the script itself) that addresses these.  If you'd
prefer the former (moving around the files), I'll need to spend some
additional time modifying pmtu and doing a larger test.  I don't have a
strong opinion on either approach.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net-next v6 00/10] net: openvswitch: Add sample multicasting.

2024-06-28 Thread Adrián Moreno
On Fri, Jun 28, 2024 at 01:05:36PM GMT, Adrian Moreno wrote:
> ** Background **
> Currently, OVS supports several packet sampling mechanisms (sFlow,
> per-bridge IPFIX, per-flow IPFIX). These end up being translated into a
> userspace action that needs to be handled by ovs-vswitchd's handler
> threads only to be forwarded to some third party application that
> will somehow process the sample and provide observability on the
> datapath.
>
> A particularly interesting use-case is controller-driven
> per-flow IPFIX sampling where the OpenFlow controller can add metadata
> to samples (via two 32bit integers) and this metadata is then available
> to the sample-collecting system for correlation.
>
> ** Problem **
> The fact that sampled traffic share netlink sockets and handler thread
> time with upcalls, apart from being a performance bottleneck in the
> sample extraction itself, can severely compromise the datapath,
> yielding this solution unfit for highly loaded production systems.
>
> Users are left with little options other than guessing what sampling
> rate will be OK for their traffic pattern and system load and dealing
> with the lost accuracy.
>
> Looking at available infrastructure, an obvious candidated would be
> to use psample. However, it's current state does not help with the
> use-case at stake because sampled packets do not contain user-defined
> metadata.
>
> ** Proposal **
> This series is an attempt to fix this situation by extending the
> existing psample infrastructure to carry a variable length
> user-defined cookie.
>
> The main existing user of psample is tc's act_sample. It is also
> extended to forward the action's cookie to psample.
>
> Finally, a new OVS action (OVS_SAMPLE_ATTR_EMIT_SAMPLE) is created.
> It accepts a group and an optional cookie and uses psample to
> multicast the packet and the metadata.
>
> --
> v5 -> v6:
> - Renamed emit_sample -> psample
> - Addressed unused variable and conditionally compilation of function.
>
> v4 -> v5:
> - Rebased.
> - Removed lefover enum value and wrapped some long lines in selftests.
>
> v3 -> v4:
> - Rebased.
> - Addressed Jakub's comment on private and unused nla attributes.
>
> v2 -> v3:
> - Addressed comments from Simon, Aaron and Ilya.
> - Dropped probability propagation in nested sample actions.
> - Dropped patch v2's 7/9 in favor of a userspace implementation and
> consume skb if emit_sample is the last action, same as we do with
> userspace.
> - Split ovs-dpctl.py features in independent patches.
>
> v1 -> v2:
> - Create a new action ("emit_sample") rather than reuse existing
>   "sample" one.
> - Add probability semantics to psample's sampling rate.
> - Store sampling probability in skb's cb area and use it in emit_sample.
> - Test combining "emit_sample" with "trunc"
> - Drop group_id filtering and tracepoint in psample.
>
> rfc_v2 -> v1:
> - Accommodate Ilya's comments.
> - Split OVS's attribute in two attributes and simplify internal
> handling of psample arguments.
> - Extend psample and tc with a user-defined cookie.
> - Add a tracepoint to psample to facilitate troubleshooting.
>
> rfc_v1 -> rfc_v2:
> - Use psample instead of a new OVS-only multicast group.
> - Extend psample and tc with a user-defined cookie.
>
>
> Adrian Moreno (10):
>   net: psample: add user cookie
>   net: sched: act_sample: add action cookie to sample
>   net: psample: skip packet copy if no listeners
>   net: psample: allow using rate as probability
>   net: openvswitch: add psample action
>   net: openvswitch: store sampling probability in cb.
>   selftests: openvswitch: add psample action
>   selftests: openvswitch: add userspace parsing
>   selftests: openvswitch: parse trunc action
>   selftests: openvswitch: add psample test
>
>  Documentation/netlink/specs/ovs_flow.yaml |  17 ++
>  include/net/psample.h |   5 +-
>  include/uapi/linux/openvswitch.h  |  31 +-
>  include/uapi/linux/psample.h  |  11 +-
>  net/openvswitch/Kconfig   |   1 +
>  net/openvswitch/actions.c |  65 -
>  net/openvswitch/datapath.h|   3 +
>  net/openvswitch/flow_netlink.c|  32 ++-
>  net/openvswitch/vport.c   |   1 +
>  net/psample/psample.c |  16 +-
>  net/sched/act_sample.c|  12 +
>  .../selftests/net/openvswitch/openvswitch.sh  | 115 +++-
>  .../selftests/net/openvswitch/ovs-dpctl.py| 272 +-
>  13 files changed, 565 insertions(+), 16 deletions(-)
>
> --
> 2.45.2
>

Patchwork says this patch is not applying on net-next. I'll wait for
some reviews and rebase+resubmit it later tonight or tomorrow.

Thanks.
Adrián

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [RFC v2 5/9] ofproto-dpif-xlate: Use emit_sample for local sample.

2024-06-28 Thread Adrián Moreno
On Wed, Jun 26, 2024 at 02:11:51PM GMT, Eelco Chaudron wrote:
> On 5 Jun 2024, at 22:23, Adrian Moreno wrote:
>
> > Use the newly added emit_sample to implement OpenFlow sample() actions
> > with local sampling configuration.
>
> See some comments below.
>
> Cheers,
>
> Eelco
>
> > Signed-off-by: Adrian Moreno 
> > ---
> >  ofproto/ofproto-dpif-lsample.c |  17 
> >  ofproto/ofproto-dpif-lsample.h |   6 ++
> >  ofproto/ofproto-dpif-xlate.c   | 163 -
> >  ofproto/ofproto-dpif-xlate.h   |   3 +-
> >  ofproto/ofproto-dpif.c |   2 +-
> >  5 files changed, 144 insertions(+), 47 deletions(-)
> >
> > diff --git a/ofproto/ofproto-dpif-lsample.c b/ofproto/ofproto-dpif-lsample.c
> > index 7bdafac19..2c0b5da89 100644
> > --- a/ofproto/ofproto-dpif-lsample.c
> > +++ b/ofproto/ofproto-dpif-lsample.c
> > @@ -139,6 +139,23 @@ dpif_lsample_set_options(struct dpif_lsample *lsample,
> >  return changed;
> >  }
> >
> > +/* Returns the group_id of the exporter with the given collector_set_id, 
> > if it
> > + * exists. */
>
> nit: The below will fit on one line
>
> /* Returns the exporter group_id for a given collector_set_id, if it exists. 
> */
>

Ack.

> > +bool
> > +dpif_lsample_get_group_id(struct dpif_lsample *ps, uint32_t 
> > collector_set_id,
> > +  uint32_t *group_id)
> > +{
> > +struct lsample_exporter_node *node;
> > +bool found = false;
> > +
> > +node = dpif_lsample_find_exporter_node(ps, collector_set_id);
> > +if (node) {
> > +found = true;
> > +*group_id = node->exporter.options.group_id;
> > +}
> > +return found;
> > +}
> > +
> >  struct dpif_lsample *
> >  dpif_lsample_create(void)
> >  {
> > diff --git a/ofproto/ofproto-dpif-lsample.h b/ofproto/ofproto-dpif-lsample.h
> > index c23ea8372..f13425575 100644
> > --- a/ofproto/ofproto-dpif-lsample.h
> > +++ b/ofproto/ofproto-dpif-lsample.h
> > @@ -19,6 +19,7 @@
> >
> >  #include 
> >  #include 
> > +#include 
>
> Maybe add in alphabetical order.
>

Ack.

> >  struct dpif_lsample;
> >  struct ofproto_lsample_options;
> > @@ -31,4 +32,9 @@ bool dpif_lsample_set_options(struct dpif_lsample *,
> >const struct ofproto_lsample_options *,
> >size_t n_opts);
> >
> > +bool dpif_lsample_get_group_id(struct dpif_lsample *,
> > +   uint32_t collector_set_id,
> > +   uint32_t *group_id);
> > +
> >  #endif /* OFPROTO_DPIF_LSAMPLE_H */
> > +
>
> Accedantely added a newline?
>

Ack.

> > diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> > index 7c4950895..5bd215d31 100644
> > --- a/ofproto/ofproto-dpif-xlate.c
> > +++ b/ofproto/ofproto-dpif-xlate.c
> > @@ -47,6 +47,7 @@
> >  #include "ofproto/ofproto-dpif-ipfix.h"
> >  #include "ofproto/ofproto-dpif-mirror.h"
> >  #include "ofproto/ofproto-dpif-monitor.h"
> > +#include "ofproto/ofproto-dpif-lsample.h"
> >  #include "ofproto/ofproto-dpif-sflow.h"
> >  #include "ofproto/ofproto-dpif-trace.h"
> >  #include "ofproto/ofproto-dpif-xlate-cache.h"
> > @@ -117,6 +118,7 @@ struct xbridge {
> >  struct dpif_sflow *sflow; /* SFlow handle, or null. */
> >  struct dpif_ipfix *ipfix; /* Ipfix handle, or null. */
> >  struct netflow *netflow;  /* Netflow handle, or null. */
> > +struct dpif_lsample *lsample; /* Local sample handle, or null. */
> >  struct stp *stp;  /* STP or null if disabled. */
> >  struct rstp *rstp;/* RSTP or null if disabled. */
> >
> > @@ -687,6 +689,7 @@ static void xlate_xbridge_set(struct xbridge *, struct 
> > dpif *,
> >const struct dpif_sflow *,
> >const struct dpif_ipfix *,
> >const struct netflow *,
> > +  const struct dpif_lsample *,
> >bool forward_bpdu, bool has_in_band,
> >const struct dpif_backer_support *,
> >const struct xbridge_addr *);
> > @@ -1070,6 +1073,7 @@ xlate_xbridge_set(struct xbridge *xbridge,
> >const struct dpif_sflow *sflow,
> >const struct dpif_ipfix *ipfix,
> >const struct netflow *netflow,
> > +  const struct dpif_lsample *lsample,
>
> nit: I would move above netflow, as you also do the actual init/unref before
>  netflow.

Ack.

> >bool forward_bpdu, bool has_in_band,
> >const struct dpif_backer_support *support,
> >const struct xbridge_addr *addr)
> > @@ -1099,6 +1103,11 @@ xlate_xbridge_set(struct xbridge *xbridge,
> >  xbridge->ipfix = dpif_ipfix_ref(ipfix);
> >  }
> >
> > +if (xbridge->lsample != lsample) {
> > +dpif_lsample_unref(xbridge->lsample);
> > +xbridge->lsample = dpif_lsample_re

Re: [ovs-dev] RFC OVN: fabric integration

2024-06-28 Thread Ilya Maximets
On 6/28/24 17:38, Dumitru Ceara wrote:
> On 6/28/24 15:05, Ilya Maximets wrote:
>> On 6/28/24 11:03, Ales Musil wrote:
>>> Hi Frode,
>>>
>>> looking forward to the RFC. AFAIU it means that the routes would be exposed 
>>> on
>>> LR, more specifically GW router. Would it make sense to allow this behavior 
>>> for
>>> provider networks (LS with localnet port)? In that case we could advertise
>>> chassis-local information from logical routers attached to BGP-enabled
>>> switches. E.g.: FIPs, LBs. It would cover the use case for distributed
>>> routers. To achieve that we should have BGP peers for each chassis that the 
>>> LS
>>> is local on.
>>
>> I haven't read the whole thing yet, but can we, please, stop adding routing 
>> features
>> to switches? :)  If someone wants routing, they should use a router, IMO.
>>
> 
> I'm fairly certain that there are precedents in "classic" network
> appliances: switches that can do a certain amount of routing (even run BGP).
> 
> In this case we could add a logical router but I'm not sure that
> simplifies things.
> 

"classic" network appliances are a subject for restrictions of a physical
material world.  It's just way easier and cheaper to acquire and install
a single physical box instead of N.  This is not a problem for a virtual
network.  AP+router+switch+modem combo boxes are also "classic" last mile
network appliances that we just call a "router".  It doesn't mean we should
implement one.

The distinction between OVN logical routers and switches is there for a
reason.  That is so you can look at the logical topology and understand
how it works more or less without diving into configuration of every single
part of it.  If switches do routing and routers do switching, what's the
point of differentiation?  It only brings more confusion.

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] RFC OVN: fabric integration

2024-06-28 Thread Dumitru Ceara
On 6/28/24 15:05, Ilya Maximets wrote:
> On 6/28/24 11:03, Ales Musil wrote:
>> Hi Frode,
>>
>> looking forward to the RFC. AFAIU it means that the routes would be exposed 
>> on
>> LR, more specifically GW router. Would it make sense to allow this behavior 
>> for
>> provider networks (LS with localnet port)? In that case we could advertise
>> chassis-local information from logical routers attached to BGP-enabled
>> switches. E.g.: FIPs, LBs. It would cover the use case for distributed
>> routers. To achieve that we should have BGP peers for each chassis that the 
>> LS
>> is local on.
> 
> I haven't read the whole thing yet, but can we, please, stop adding routing 
> features
> to switches? :)  If someone wants routing, they should use a router, IMO.
> 

I'm fairly certain that there are precedents in "classic" network
appliances: switches that can do a certain amount of routing (even run BGP).

In this case we could add a logical router but I'm not sure that
simplifies things.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net-next v3 0/7] selftests: net: Switch pmtu.sh to use the internal ovs script.

2024-06-28 Thread Jakub Kicinski
On Tue, 25 Jun 2024 13:22:38 -0400 Aaron Conole wrote:
> Currently, if a user wants to run pmtu.sh and cover all the provided test
> cases, they need to install the Open vSwitch userspace utilities.  This
> dependency is difficult for users as well as CI environments, because the
> userspace build and setup may require lots of support and devel packages
> to be installed, system setup to be correct, and things like permissions
> and selinux policies to be properly configured.

Hi Aaron!

I merged this yesterday (with slight alphabetical reshuffling of
the config options). The pmtu.sh test is solid now, which is great!

I also added the OvS tests themselves, and those are not passing, yet:
https://netdev.bots.linux.dev/contest.html?test=openvswitch-sh
Could you take a look and LMK if these are likely env issues or
something bad in the test itself?
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] vswitchd: Only lock pages that are faulted in.

2024-06-28 Thread Eelco Chaudron



On 14 Jun 2024, at 14:22, Ilya Maximets wrote:

> The main purpose of locking the memory is to ensure that OVS can keep
> doing what it did before in case of increased memory pressure, e.g.,
> during VM ingest / migration.  Fulfilling this requirement can be
> achieved without locking all the allocated memory, but only the pages
> already accessed in the past (faulted in).  Processing of the new
> traffic involves new memory allocations.  Latency on these operations
> can't be guaranteed by the locking.  The main difference would be
> the pre-faulting of the stack memory.  However, in order to revalidate
> or process upcalls on the same traffic, the same amount of stack is
> likely needed, so all the necessary memory will already be faulted in.
>
> Switch 'mlockall' to MCL_ONFAULT to avoid consuming unnecessarily
> large amounts of RAM on systems with high core counts.  For example,
> in a densely populated OVN cluster this saves about 650 MB of RAM per
> node on a system with 64 cores.  This equates to 320 GB of allocated
> but unused RAM in a 500 node cluster.
>
> This also makes OVS better suited by default for small systems with
> limited amount of memory.
>
> The MCL_ONFAULT flag was introduced in Linux kernel 4.4 and wasn't
> available at the time of '--mlockall' introduction, but we can use it
> now.  Falling back to an old way of locking in case we're running on
> an older kernel just in case.
>
> Only locking the faulted in pages also makes locking compatible with
> vhost post-copy live migration by default, because we'll no longer
> pre-fault all the guest's memory.  Post-copy relies on userfaultfd
> to work on shared huge pages, which is only available in 4.11+ kernels.
> So, technically, it should not be possible for MCL_ONFAULT to fail and
> the call without it to succeed.  But keeping the check just in case
> for now.
>
> Signed-off-by: Ilya Maximets 

This change looks good. I did some performance testing and I noticed not 
negative impacts.

Acked-by: Eelco Chaudron 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net-next v5 05/10] net: openvswitch: add emit_sample action

2024-06-28 Thread Aaron Conole
Adrián Moreno  writes:

> On Thu, Jun 27, 2024 at 09:30:08AM GMT, Aaron Conole wrote:
>> Ilya Maximets  writes:
>>
>> > On 6/27/24 12:15, Adrián Moreno wrote:
>> >> On Thu, Jun 27, 2024 at 11:31:41AM GMT, Eelco Chaudron wrote:
>> >>>
>> >>>
>> >>> On 27 Jun 2024, at 11:23, Ilya Maximets wrote:
>> >>>
>>  On 6/27/24 11:14, Eelco Chaudron wrote:
>> >
>> >
>> > On 27 Jun 2024, at 10:36, Ilya Maximets wrote:
>> >
>> >> On 6/27/24 09:52, Adrián Moreno wrote:
>> >>> On Thu, Jun 27, 2024 at 09:06:46AM GMT, Eelco Chaudron wrote:
>> 
>> 
>>  On 26 Jun 2024, at 22:34, Adrián Moreno wrote:
>> 
>> > On Wed, Jun 26, 2024 at 04:28:17PM GMT, Eelco Chaudron wrote:
>> >>
>> >>
>> >> On 25 Jun 2024, at 22:51, Adrian Moreno wrote:
>> >>
>> >>> Add support for a new action: emit_sample.
>> >>>
>> >>> This action accepts a u32 group id and a variable-length
>> >>> cookie and uses
>> >>> the psample multicast group to make the packet available for
>> >>> observability.
>> >>>
>> >>> The maximum length of the user-defined cookie is set to
>> >>> 16, same as
>> >>> tc_cookie, to discourage using cookies that will not be
>> >>> offloadable.
>> >>
>> >> I’ll add the same comment as I had in the user space part, and 
>> >> that
>> >> is that I feel from an OVS perspective this action should be 
>> >> called
>> >> emit_local() instead of emit_sample() to make it Datapath
>> >> independent.
>> >> Or quoting the earlier comment:
>> >>
>> >>
>> >> “I’ll start the discussion again on the naming. The name
>> >> "emit_sample()"
>> >> does not seem appropriate. This function's primary role
>> >> is to copy the
>> >> packet and send it to a local collector, which varies
>> >> depending on the
>> >> datapath. For the kernel datapath, this collector is
>> >> psample, while for
>> >> userspace, it will likely be some kind of probe. This
>> >> action is distinct
>> >> from the sample() action by design; it is a standalone
>> >> action that can
>> >> be combined with others.
>> >>
>> >> Furthermore, the action itself does not involve taking a sample; 
>> >> it
>> >> consistently pushes the packet to the local collector. Therefore, 
>> >> I
>> >> suggest renaming "emit_sample()" to "emit_local()". This
>> >> same goes for
>> >> all the derivative ATTR naming.”
>> >>
>> >
>> > This is a blurry semantic area.
>> > IMO, "sample" is the act of extracting (potentially a piece of)
>> > someting, in this case, a packet. It is common to only
>> > take some packets
>> > as samples, so this action usually comes with some kind of
>> > "rate", but
>> > even if the rate is 1, it's still sampling in this context.
>> >
>> > OTOH, OVS kernel design tries to be super-modular and define small
>> > combinable actions, so the rate or probability generation
>> > is done with
>> > another action which is (IMHO unfortunately) named "sample".
>> >
>> > With that interpretation of the term it would actually
>> > make more sense
>> > to rename "sample" to something like "random" (of course I'm not
>> > suggestion we do it). "sample" without any nested action
>> > that actually
>> > sends the packet somewhere is not sampling, it's just
>> > doing something or
>> > not based on a probability. Where as "emit_sample" is
>> > sampling even if
>> > it's not nested inside a "sample".
>> 
>>  You're assuming we are extracting a packet for sampling,
>>  but this function
>>  can be used for various other purposes. For instance, it
>>  could handle the
>>  packet outside of the OVS pipeline through an eBPF program
>>  (so we are not
>>  taking a sample, but continue packet processing outside of the OVS
>>  pipeline). Calling it emit_sampling() in such cases could be very
>>  confusing.
>> >>
>> >> We can't change the implementation of the action once it is
>> >> part of uAPI.
>> >> We have to document where users can find these packets and we
>> >> can't just
>> >> change the destination later.
>> >
>> > I'm not suggesting we change the uAPI implementation, but we
>> > could use the
>> > emit_xxx() action with an eBPF probe on the action to perform
>> > other tasks.
>> > This is just an example.
>> 
>>  Yeah, but as Adrian said below, you could do that with any action and
>>  this doesn't change the semantics of the actio

Re: [ovs-dev] [RFC v2 9/9] tests: Test local sampling.

2024-06-28 Thread Adrián Moreno
On Wed, Jun 26, 2024 at 02:15:02PM GMT, Eelco Chaudron wrote:
> On 5 Jun 2024, at 22:23, Adrian Moreno wrote:
>
> > Test simultaneous IPFIX and local sampling including slow-path.
>
> I guess Ilya's comments on this patch summarize most of my comments. I had
> one small additional question. See below.
>
> //Eelco
>
> > Signed-off-by: Adrian Moreno 
> > ---
> >  tests/system-common-macros.at |   4 ++
> >  tests/system-traffic.at   | 105 ++
> >  2 files changed, 109 insertions(+)
> >
> > diff --git a/tests/system-common-macros.at b/tests/system-common-macros.at
> > index 2a68cd664..22b8885e4 100644
> > --- a/tests/system-common-macros.at
> > +++ b/tests/system-common-macros.at
> > @@ -378,3 +378,7 @@ m4_define([OVS_CHECK_GITHUB_ACTION],
> >  # OVS_CHECK_DROP_ACTION()
> >  m4_define([OVS_CHECK_DROP_ACTION],
> >  [AT_SKIP_IF([! grep -q "Datapath supports explicit drop action" 
> > ovs-vswitchd.log])])
> > +
> > +# OVS_CHECK_EMIT_SAMPLE()
> > +m4_define([OVS_CHECK_EMIT_SAMPLE],
> > +[AT_SKIP_IF([! grep -q "Datapath supports emit_sample" 
> > ovs-vswitchd.log])])
> > diff --git a/tests/system-traffic.at b/tests/system-traffic.at
> > index bd7647cbe..babc56b56 100644
> > --- a/tests/system-traffic.at
> > +++ b/tests/system-traffic.at
> > @@ -8977,3 +8977,108 @@ OVS_WAIT_UNTIL([cat p2.pcap | grep -E "0x0050: 
> > * * *5002 *2000 *b85e *00
> >
> >  OVS_TRAFFIC_VSWITCHD_STOP
> >  AT_CLEANUP
> > +
> > +AT_SETUP([emit_sample])
> > +OVS_TRAFFIC_VSWITCHD_START()
> > +OVS_CHECK_EMIT_SAMPLE()
> > +
> > +ADD_NAMESPACES(at_ns0, at_ns1)
> > +NS_CHECK_EXEC([at_ns0], [sysctl -w net.ipv6.conf.all.disable_ipv6=1], [0], 
> > [ignore])
> > +NS_CHECK_EXEC([at_ns1], [sysctl -w net.ipv6.conf.all.disable_ipv6=1], [0], 
> > [ignore])
> > +
> > +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24", "e4:11:22:33:44:55")
> > +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24", "e4:11:22:33:44:66")
> > +
> > +NS_CHECK_EXEC([at_ns0], [arp -s 10.1.1.2 e4:11:22:33:44:66])
> > +NS_CHECK_EXEC([at_ns1], [arp -s 10.1.1.1 e4:11:22:33:44:55])
> > +
> > +
> > +AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg 
> > ofproto_dpif_upcall:dbg])
> > +
> > +AT_CHECK([ovs-vsctl -- --id=@br0 get Bridge br0 \
> > +-- --id=@ipfix create IPFIX targets=\"127.0.0.1:4739\" 
> > \
> > +-- create Flow_Sample_Collector_Set id=1 bridge=@br0 
> > ipfix=@ipfix, local_sample_group=10 \
> > +-- create Flow_Sample_Collector_Set id=2 bridge=@br0 
> > ipfix=@ipfix, local_sample_group=12],
> > + [0], [ignore])
> > +
> > +AT_DATA([flows.txt], [dnl
> > +in_port=ovs-p0,ip 
> > actions=sample(probability=65535,collector_set_id=1,obs_domain_id=1431655765,obs_point_id=1717986918),output(port=ovs-p1,max_len=100)
> > +in_port=ovs-p1,ip 
> > actions=sample(probability=65535,collector_set_id=2,obs_domain_id=2290649224,obs_point_id=2576980377),output(port=ovs-p0,max_len=100)
> > +])
> > +
> > +AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
> > +
> > +OVS_DAEMONIZE([ovstest test-psample > psample.out], [psample1.pid])
> > +
> > +NS_CHECK_EXEC([at_ns0], [ping -q -c 1 10.1.1.2 | FORMAT_PING], [0], [dnl
> > +1 packets transmitted, 1 received, 0% packet loss, time 0ms
> > +])
> > +
> > +m4_define([SAMPLE1], [group_id=0xa 
> > obs_domain=0x,obs_point=0x 
> > .*icmp.*nw_src=10.1.1.1,nw_dst=10.1.1.2])
> > +m4_define([SAMPLE2], [group_id=0xc 
> > obs_domain=0x,obs_point=0x 
> > .*icmp.*nw_src=10.1.1.2,nw_dst=10.1.1.1])
> > +AT_CHECK([grep -E 'SAMPLE1' psample.out >/dev/null])
> > +AT_CHECK([grep -E 'SAMPLE2' psample.out >/dev/null])
> > +
> > +AT_CHECK([ovs-ofctl dump-ipfix-flow br0 | sed 's/tx pkts=[[0-9]]*/tx 
> > pkts=24/' | sed 's/tx errs=[[0-9]]*/tx errs=0/'], [0], [dnl
>
> Why are we making tx pkts always 24, and are we waiving any potential tx 
> errors?
> Is this something you have seen not being consistent, if so any idea why?
>

Yes. By the time we request the statistics, IPFIX cache should contain
the flows but they might not have been sent yet. When they are sent they
will fail (because IPFIX target is localhost and it returns the socket
returns an error if the listening socket does not exist) so the number
of errors might vary.

I can look deeper into making it more consistent but it looked like a
quick workaround.

> > +NXST_IPFIX_FLOW reply (xid=0x2): 2 ids
> > +  id   1: flows=1, current flows=0, sampled pkts=1, ipv4 ok=1, ipv6 ok=0, 
> > tx pkts=24
> > +  pkts errs=0, ipv4 errs=0, ipv6 errs=0, tx errs=0
> > +  id   2: flows=1, current flows=0, sampled pkts=1, ipv4 ok=1, ipv6 ok=0, 
> > tx pkts=24
> > +  pkts errs=0, ipv4 errs=0, ipv6 errs=0, tx errs=0
> > +])
> > +
> > +AT_CHECK([ovs-appctl lsample/show br0], [0], [dnl
> > +local sample statistics for bridge "br0":
> > +- Collector Set ID: 1
> > +  Local Sample Group: 10
> > +  Total number of bytes: 98
> > +  Total number of packets: 1
> > +
> > +- Collector Set ID: 2
> > +  Local S

Re: [ovs-dev] [RFC v2 8/9] tests: Add test-psample testing utility.

2024-06-28 Thread Adrián Moreno
On Wed, Jun 26, 2024 at 02:14:27PM GMT, Eelco Chaudron wrote:
> On 5 Jun 2024, at 22:23, Adrian Moreno wrote:
>
> > This simple program reads from psample and prints the packets to stdout.
>
> Some comments below.
>
> //Eelco
>
> > Signed-off-by: Adrian Moreno 
> > ---
> >  include/linux/automake.mk |   1 +
> >  include/linux/psample.h   |  68 +
> >  tests/automake.mk |   3 +-
> >  tests/test-psample.c  | 282 ++
> >  4 files changed, 353 insertions(+), 1 deletion(-)
> >  create mode 100644 include/linux/psample.h
> >  create mode 100644 tests/test-psample.c
> >
> > diff --git a/include/linux/automake.mk b/include/linux/automake.mk
> > index cdae5eedc..ac306b53c 100644
> > --- a/include/linux/automake.mk
> > +++ b/include/linux/automake.mk
> > @@ -3,6 +3,7 @@ noinst_HEADERS += \
> > include/linux/netfilter/nf_conntrack_sctp.h \
> > include/linux/openvswitch.h \
> > include/linux/pkt_cls.h \
> > +   include/linux/psample.h \
> > include/linux/gen_stats.h \
> > include/linux/tc_act/tc_mpls.h \
> > include/linux/tc_act/tc_pedit.h \
> > diff --git a/include/linux/psample.h b/include/linux/psample.h
> > new file mode 100644
> > index 0..d5761b730
> > --- /dev/null
> > +++ b/include/linux/psample.h
> > @@ -0,0 +1,68 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef __LINUX_PSAMPLE_H
> > +#define __LINUX_PSAMPLE_H
> > +
> > +enum {
> > +   PSAMPLE_ATTR_IIFINDEX,
> > +   PSAMPLE_ATTR_OIFINDEX,
> > +   PSAMPLE_ATTR_ORIGSIZE,
> > +   PSAMPLE_ATTR_SAMPLE_GROUP,
> > +   PSAMPLE_ATTR_GROUP_SEQ,
> > +   PSAMPLE_ATTR_SAMPLE_RATE,
> > +   PSAMPLE_ATTR_DATA,
> > +   PSAMPLE_ATTR_GROUP_REFCOUNT,
> > +   PSAMPLE_ATTR_TUNNEL,
> > +
> > +   PSAMPLE_ATTR_PAD,
> > +   PSAMPLE_ATTR_OUT_TC,/* u16 */
> > +   PSAMPLE_ATTR_OUT_TC_OCC,/* u64, bytes */
> > +   PSAMPLE_ATTR_LATENCY,   /* u64, nanoseconds */
> > +   PSAMPLE_ATTR_TIMESTAMP, /* u64, nanoseconds */
> > +   PSAMPLE_ATTR_PROTO, /* u16 */
> > +   PSAMPLE_ATTR_USER_COOKIE,   /* binary, user provided data */
> > +   PSAMPLE_ATTR_SAMPLE_PROBABILITY,/* no argument, interpret rate in
> > +* PSAMPLE_ATTR_SAMPLE_RATE as a
> > +* probability scaled 0 - U32_MAX.
> > +*/
> > +
> > +   __PSAMPLE_ATTR_MAX
> > +};
> > +
> > +enum psample_command {
> > +   PSAMPLE_CMD_SAMPLE,
> > +   PSAMPLE_CMD_GET_GROUP,
> > +   PSAMPLE_CMD_NEW_GROUP,
> > +   PSAMPLE_CMD_DEL_GROUP,
> > +   PSAMPLE_CMD_SAMPLE_FILTER_SET,
> > +};
> > +
> > +enum psample_tunnel_key_attr {
> > +   PSAMPLE_TUNNEL_KEY_ATTR_ID, /* be64 Tunnel ID */
> > +   PSAMPLE_TUNNEL_KEY_ATTR_IPV4_SRC,   /* be32 src IP address. */
> > +   PSAMPLE_TUNNEL_KEY_ATTR_IPV4_DST,   /* be32 dst IP address. */
> > +   PSAMPLE_TUNNEL_KEY_ATTR_TOS,/* u8 Tunnel IP ToS. */
> > +   PSAMPLE_TUNNEL_KEY_ATTR_TTL,/* u8 Tunnel IP TTL. */
> > +   PSAMPLE_TUNNEL_KEY_ATTR_DONT_FRAGMENT,  /* No argument, set DF. */
> > +   PSAMPLE_TUNNEL_KEY_ATTR_CSUM,   /* No argument. CSUM 
> > packet. */
> > +   PSAMPLE_TUNNEL_KEY_ATTR_OAM,/* No argument. OAM frame.  
> > */
> > +   PSAMPLE_TUNNEL_KEY_ATTR_GENEVE_OPTS,/* Array of Geneve options. 
> > */
> > +   PSAMPLE_TUNNEL_KEY_ATTR_TP_SRC, /* be16 src Transport Port. 
> > */
> > +   PSAMPLE_TUNNEL_KEY_ATTR_TP_DST, /* be16 dst Transport Port. 
> > */
> > +   PSAMPLE_TUNNEL_KEY_ATTR_VXLAN_OPTS, /* Nested VXLAN opts* */
> > +   PSAMPLE_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
> > address. */
> > +   PSAMPLE_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
> > address. */
> > +   PSAMPLE_TUNNEL_KEY_ATTR_PAD,
> > +   PSAMPLE_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* struct erspan_metadata */
> > +   PSAMPLE_TUNNEL_KEY_ATTR_IPV4_INFO_BRIDGE,   /* No argument. 
> > IPV4_INFO_BRIDGE mode.*/
> > +   __PSAMPLE_TUNNEL_KEY_ATTR_MAX
> > +};
> > +
> > +/* Can be overridden at runtime by module option */
> > +#define PSAMPLE_ATTR_MAX (__PSAMPLE_ATTR_MAX - 1)
> > +
> > +#define PSAMPLE_NL_MCGRP_CONFIG_NAME "config"
> > +#define PSAMPLE_NL_MCGRP_SAMPLE_NAME "packets"
> > +#define PSAMPLE_GENL_NAME "psample"
> > +#define PSAMPLE_GENL_VERSION 1
> > +
> > +#endif
> > diff --git a/tests/automake.mk b/tests/automake.mk
> > index 04f48f2d8..edfc2cb33 100644
> > --- a/tests/automake.mk
> > +++ b/tests/automake.mk
> > @@ -499,7 +499,8 @@ endif
> >  if LINUX
> >  tests_ovstest_SOURCES += \
> > tests/test-netlink-conntrack.c \
> > -   tests/test-netlink-policy.c
> > +   tests/test-netlink-policy.c \
> > +   tests/test-psample.c
> >  endif
> >
> >  tests_ovstest_LDADD = lib/libopenvswitch.la
> > diff --git a/tests/test-psample.c b/tests/test-psample.c
> > new file mode 100644
> > index 0..f04d903

Re: [ovs-dev] [RFC v2 7/9] ofproto-dpif-lsample: Show stats via unixctl.

2024-06-28 Thread Adrián Moreno
On Wed, Jun 26, 2024 at 02:13:40PM GMT, Eelco Chaudron wrote:
> On 5 Jun 2024, at 22:23, Adrian Moreno wrote:
>
> > Add a command to dump statistics per exporter.
>
> Some comments below.
>
> //Eelco
>
> > Signed-off-by: Adrian Moreno 
> > ---
> >  NEWS   |   2 +
> >  ofproto/ofproto-dpif-lsample.c | 113 +
> >  ofproto/ofproto-dpif-lsample.h |   1 +
> >  ofproto/ofproto-dpif.c |   1 +
> >  4 files changed, 117 insertions(+)
> >
> > diff --git a/NEWS b/NEWS
> > index 1c05a7120..a443f9fe1 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -11,6 +11,8 @@ Post-v3.3.0
> >   allows samples to be emitted locally (instead of via IPFIX) in a
> >   datapath-specific manner via the new datapath action called 
> > "emit_sample".
> >   Linux kernel datapath is the first to support this feature.
> > + A new unixctl command 'lsample/show' shows packet and bytes statistics
> > + per local sample exporter.
> >
> >
> >  v3.3.0 - 16 Feb 2024
> > diff --git a/ofproto/ofproto-dpif-lsample.c b/ofproto/ofproto-dpif-lsample.c
> > index 0c71e354d..e31dabac4 100644
> > --- a/ofproto/ofproto-dpif-lsample.c
> > +++ b/ofproto/ofproto-dpif-lsample.c
> > @@ -21,7 +21,10 @@
> >  #include "dpif.h"
> >  #include "hash.h"
> >  #include "ofproto.h"
> > +#include "ofproto-dpif.h"
> > +#include "openvswitch/dynamic-string.h"
> >  #include "openvswitch/thread.h"
> > +#include "unixctl.h"
> >
> >  /* Dpif local sampling.
> >   *
> > @@ -218,3 +221,113 @@ dpif_lsample_unref(struct dpif_lsample *lsample)
> >  dpif_lsample_destroy(lsample);
> >  }
> >  }
> > +
> > +static int
> > +compare_exporter_list(const void *a_, const void *b_)
>
> We are not comparing a list here, so we should choose a more appropriate name,
> maybe something like;  compare_exporter_collector_set_id(), or if you want it
> shorter, maybe comp_exporter_collector_id(), use you fantasy (but not to
> much ;).
>
> > +{
> > +const struct lsample_exporter_node *a, *b;
> > +
> > +a = *(struct lsample_exporter_node **)a_;
> > +b = *(struct lsample_exporter_node **)b_;
>
> Fix space around casting.
>

Ack.

> > +
> > +if (a->exporter.options.collector_set_id >
> > +b->exporter.options.collector_set_id) {
> > +return 1;
> > +}
> > +if (a->exporter.options.collector_set_id <
> > +b->exporter.options.collector_set_id) {
> > +return -1;
> > +}
> > +return 0;
> > +}
> > +
> > +static void
> > +lsample_exporter_list(struct dpif_lsample *lsample,
> > +  struct lsample_exporter_node ***list,
> > +  size_t *num_exporters)
> > +{
> > +struct lsample_exporter_node *node;
> > +struct lsample_exporter_node **exporter_list;
>
> Reverse Christmas tree order.
>

Ack.


> > +size_t k = 0, n;
> > +
> > +n = cmap_count(&lsample->exporters);
> > +
> > +exporter_list = xcalloc(n, sizeof *exporter_list);
> > +
> > +CMAP_FOR_EACH (node, node, &lsample->exporters) {
> > +if (k >= n) {
> > +break;
> > +}
> > +exporter_list[k++] = node;
> > +}
> > +
> > +qsort(exporter_list, k, sizeof *exporter_list, compare_exporter_list);
> > +
> > +*list = exporter_list;
> > +*num_exporters = k;
> > +}
> > +
> > +static void
> > +lsample_unixctl_show(struct unixctl_conn *conn, int argc OVS_UNUSED,
> > + const char *argv[] OVS_UNUSED, void *aux OVS_UNUSED)
> > +{
> > +struct lsample_exporter_node **node_list = NULL;
> > +struct lsample_exporter_node *node;
> > +struct ds ds = DS_EMPTY_INITIALIZER;
>
> Reverse Christmas tree? Or even better move the *node definition inside the
> for loop below.
>

Ack.

> > +const struct ofproto_dpif *ofproto;
> > +size_t i, num;
> > +
> > +ofproto = ofproto_dpif_lookup_by_name(argv[1]);
> > +if (!ofproto) {
> > +unixctl_command_reply_error(conn, "no such bridge");
> > +return;
> > +}
> > +
> > +if (!ofproto->lsample) {
> > +unixctl_command_reply_error(conn, "no local sampling exporters "
> > +"configured");
>
> unixctl_command_reply_error(conn,
> "no local sampling exporters configured");
>
>
> > +return;
> > +}
> > +
> > +ds_put_format(&ds, "local sample statistics for bridge \"%s\":\n",
>
> Maybe capital L for Local?
>

Ack.

> > + argv[1]);
> > +
> > +lsample_exporter_list(ofproto->lsample, &node_list, &num);
> > +
> > +for (i = 0; i < num; i++) {
> > +uint64_t n_bytes;
> > +uint64_t n_packets;
>
> Reverse Christmas tree
>

Ack.

> > +
> > +node = node_list[i];
> > +
> > +atomic_read_relaxed(&node->exporter.n_packets, &n_packets);
> > +atomic_read_relaxed(&node->exporter.n_bytes, &n_bytes);
> > +
> > +if (i) {
> > +ds_put_cstr(&ds, "\n");
> > +}
> > +
> > +  

Re: [ovs-dev] [RFC v2 4/9] vswitchd: Add local sampling to vswitchd schema.

2024-06-28 Thread Adrián Moreno
On Thu, Jun 27, 2024 at 03:57:03PM GMT, Eelco Chaudron wrote:
>
>
> On 27 Jun 2024, at 13:37, Adrián Moreno wrote:
>
> > On Wed, Jun 26, 2024 at 02:11:00PM GMT, Eelco Chaudron wrote:
> >> On 5 Jun 2024, at 22:23, Adrian Moreno wrote:
> >>
> >>> Add as new column in the Flow_Sample_Collector_Set table named
> >>> "local_sample_group" which enables this feature.
> >>>
> >>> Signed-off-by: Adrian Moreno 
> >>> ---
> >>>  NEWS   |  4 ++
> >>>  vswitchd/bridge.c  | 78 +++---
> >>>  vswitchd/vswitch.ovsschema |  9 -
> >>>  vswitchd/vswitch.xml   | 39 +--
> >>>  4 files changed, 119 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/NEWS b/NEWS
> >>> index b92cec532..1c05a7120 100644
> >>> --- a/NEWS
> >>> +++ b/NEWS
> >>> @@ -7,6 +7,10 @@ Post-v3.3.0
> >>> - The primary development branch has been renamed from 'master' to 
> >>> 'main'.
> >>>   The OVS tree remains hosted on GitHub.
> >>>https://github.com/openvswitch/ovs.git
> >>> +   - Local sampling is introduced. It reuses the OpenFlow sample action 
> >>> and
> >>> + allows samples to be emitted locally (instead of via IPFIX) in a
> >>> + datapath-specific manner via the new datapath action called 
> >>> "emit_sample".
> >>> + Linux kernel datapath is the first to support this feature.
> >>>
> >>>
> >>>  v3.3.0 - 16 Feb 2024
> >>> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> >>> index 95a65fcdc..cd7dc6646 100644
> >>> --- a/vswitchd/bridge.c
> >>> +++ b/vswitchd/bridge.c
> >>> @@ -288,6 +288,7 @@ static void bridge_configure_mac_table(struct bridge 
> >>> *);
> >>>  static void bridge_configure_mcast_snooping(struct bridge *);
> >>>  static void bridge_configure_sflow(struct bridge *, int 
> >>> *sflow_bridge_number);
> >>>  static void bridge_configure_ipfix(struct bridge *);
> >>> +static void bridge_configure_lsample(struct bridge *);
> >>>  static void bridge_configure_spanning_tree(struct bridge *);
> >>>  static void bridge_configure_tables(struct bridge *);
> >>>  static void bridge_configure_dp_desc(struct bridge *);
> >>> @@ -989,6 +990,7 @@ bridge_reconfigure(const struct ovsrec_open_vswitch 
> >>> *ovs_cfg)
> >>>  bridge_configure_netflow(br);
> >>>  bridge_configure_sflow(br, &sflow_bridge_number);
> >>>  bridge_configure_ipfix(br);
> >>> +bridge_configure_lsample(br);
> >>>  bridge_configure_spanning_tree(br);
> >>>  bridge_configure_tables(br);
> >>>  bridge_configure_dp_desc(br);
> >>> @@ -1537,10 +1539,11 @@ ovsrec_ipfix_is_valid(const struct ovsrec_ipfix 
> >>> *ipfix)
> >>>  return ipfix && ipfix->n_targets > 0;
> >>>  }
> >>>
> >>> -/* Returns whether a Flow_Sample_Collector_Set row is valid. */
> >>> +/* Returns whether a Flow_Sample_Collector_Set row constains valid IPFIX
> >>
> >> constains -> 'contains a'
> >>
> >
> > Ack.
> >
> >>> + * configuration. */
> >>>  static bool
> >>> -ovsrec_fscs_is_valid(const struct ovsrec_flow_sample_collector_set *fscs,
> >>> - const struct bridge *br)
> >>> +ovsrec_fscs_is_valid_ipfix(const struct ovsrec_flow_sample_collector_set 
> >>> *fscs,
> >>> +   const struct bridge *br)
> >>>  {
> >>>  return ovsrec_ipfix_is_valid(fscs->ipfix) && fscs->bridge == br->cfg;
> >>>  }
> >>> @@ -1558,7 +1561,7 @@ bridge_configure_ipfix(struct bridge *br)
> >>>  const char *virtual_obs_id;
> >>>
> >>>  OVSREC_FLOW_SAMPLE_COLLECTOR_SET_FOR_EACH(fe_cfg, idl) {
> >>> -if (ovsrec_fscs_is_valid(fe_cfg, br)) {
> >>> +if (ovsrec_fscs_is_valid_ipfix(fe_cfg, br)) {
> >>>  n_fe_opts++;
> >>>  }
> >>>  }
> >>> @@ -1621,7 +1624,7 @@ bridge_configure_ipfix(struct bridge *br)
> >>>  fe_opts = xcalloc(n_fe_opts, sizeof *fe_opts);
> >>>  opts = fe_opts;
> >>>  OVSREC_FLOW_SAMPLE_COLLECTOR_SET_FOR_EACH(fe_cfg, idl) {
> >>> -if (ovsrec_fscs_is_valid(fe_cfg, br)) {
> >>> +if (ovsrec_fscs_is_valid_ipfix(fe_cfg, br)) {
> >>>  opts->collector_set_id = fe_cfg->id;
> >>>  sset_init(&opts->targets);
> >>>  sset_add_array(&opts->targets, fe_cfg->ipfix->targets,
> >>> @@ -1667,6 +1670,71 @@ bridge_configure_ipfix(struct bridge *br)
> >>>  }
> >>>  }
> >>>
> >>> +/* Returns whether a Flow_Sample_Collector_Set row contains valid local
> >>> + * sampling configuraiton. */
> >>
> >> ...row contains a valid local...
> >>++
> >>
> >> configuraiton -> configuration
> >
> > Ack.
> >
> >>
> >>> +static bool
> >>> +ovsrec_fscs_is_valid_local(const struct ovsrec_flow_sample_collector_set 
> >>> *fscs,
> >>> +   const struct bridge *br)
> >>> +{
> >>> +return fscs->local_sample_group && fscs->n_local_sample_group == 1 &&
> >>> +   fscs->bridge == br->cfg;
> >>> +}
> >>> +
> >>> +/* Set local sample configuration on 'br'. */
> >>> +sta

Re: [ovs-dev] [RFC v2 1/9] odp-util: Add support OVS_ACTION_ATTR_EMIT_SAMPLE.

2024-06-28 Thread Adrián Moreno
On Thu, Jun 27, 2024 at 03:42:38PM GMT, Eelco Chaudron wrote:
>
>
> On 27 Jun 2024, at 12:45, Adrián Moreno wrote:
>
> > On Wed, Jun 26, 2024 at 02:08:27PM GMT, Eelco Chaudron wrote:
> >> On 5 Jun 2024, at 22:23, Adrian Moreno wrote:
> >>
> >>> Add support for parsing and formatting the new action.
> >>>
> >>> Also, flag OVS_ACTION_ATTR_SAMPLE as requiring datapath assistance if it
> >>> contains a nested OVS_ACTION_ATTR_EMIT_SAMPLE. The reason is that the
> >>> sampling rate form the parent "sample" is made available to the nested
> >>> "emit_sample" by the kernel.
> >>
> >> Hi Adrian,
> >>
> >> Thanks for this series! This email kicks off my review of the series,
> >> see comments below and in the other patches.
> >>
> >> Cheers,
> >>
> >> Eelco
> >>
> >>> Signed-off-by: Adrian Moreno 
> >>> ---
> >>>  include/linux/openvswitch.h  |  25 +
> >>>  lib/dpif-netdev.c|   1 +
> >>>  lib/dpif.c   |   1 +
> >>>  lib/odp-execute.c|  25 -
> >>>  lib/odp-util.c   | 103 +++
> >>>  lib/odp-util.h   |   3 +
> >>>  ofproto/ofproto-dpif-ipfix.c |   1 +
> >>>  ofproto/ofproto-dpif-sflow.c |   1 +
> >>>  python/ovs/flow/odp.py   |   8 +++
> >>>  tests/odp.at |  16 ++
> >>>  10 files changed, 183 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
> >>> index d9fb991ef..b4e0647bd 100644
> >>> --- a/include/linux/openvswitch.h
> >>> +++ b/include/linux/openvswitch.h
> >>> @@ -992,6 +992,30 @@ struct check_pkt_len_arg {
> >>>  };
> >>>  #endif
> >>>
> >>> +#define OVS_EMIT_SAMPLE_COOKIE_MAX_SIZE 16
> >>> +/**
> >>> + * enum ovs_emit_sample_attr - Attributes for 
> >>> %OVS_ACTION_ATTR_EMIT_SAMPLE
> >>> + * action.
> >>> + *
> >>> + * @OVS_EMIT_SAMPLE_ATTR_GROUP: 32-bit number to identify the source of 
> >>> the
> >>> + * sample.
> >>> + * @OVS_EMIT_SAMPLE_ATTR_COOKIE: A variable-length binary cookie that 
> >>> contains
> >>> + * user-defined metadata. The maximum length is 16 bytes.
> >>> + *
> >>> + * Sends the packet to the psample multicast group with the specified 
> >>> group and
> >>> + * cookie. It is possible to combine this action with the
> >>> + * %OVS_ACTION_ATTR_TRUNC to limit the size of the packet being emitted.
> >>
> >> I'll start the discussion again on the naming. The name "emit_sample()"
> >> does not seem appropriate. This function's primary role is to copy the
> >> packet and send it to a local collector, which varies depending on the
> >> datapath. For the kernel datapath, this collector is psample, while for
> >> userspace, it will likely be some kind of probe. This action is distinct
> >> from the sample() action by design; it is a standalone action that can
> >> be combined with others.
> >>
> >> Furthermore, the action itself does not involve taking a sample; it
> >> consistently pushes the packet to the local collector. Therefore, I
> >> suggest renaming "emit_sample()" to "emit_local()". This same goes for
> >> all the derivative ATTR naming.
> >>
> >
> > Let's discuss this in the kernel ml.
> >
> >>> + */
> >>> +enum ovs_emit_sample_attr {
> >>> + OVS_EMIT_SAMPLE_ATTR_UNPSEC,
> >>> + OVS_EMIT_SAMPLE_ATTR_GROUP, /* u32 number. */
> >>> + OVS_EMIT_SAMPLE_ATTR_COOKIE,/* Optional, user specified cookie. */
> >>
> >> Fix comment alignment? Maybe also change the order to be alphabetical?
> >>
> >
> > What do you mean by comment alignment?
>
> Well you are using tabs to index which might result in it looking like this:
>
> + OVS_EMIT_SAMPLE_ATTR_GROUP, /* u32 number. */
> + OVS_EMIT_SAMPLE_ATTR_COOKIE,/* Optional, user specified cookie. */
>
> (Most) of the other comments use spaces after the definition.
>
> >>> + __OVS_EMIT_SAMPLE_ATTR_MAX
> >>> +};
> >>> +
> >>> +#define OVS_EMIT_SAMPLE_ATTR_MAX (__OVS_EMIT_SAMPLE_ATTR_MAX - 1)
> >>> +
> >>> +
> >>>  /**
> >>>   * enum ovs_action_attr - Action types.
> >>>   *
> >>> @@ -1087,6 +,7 @@ enum ovs_action_attr {
> >>>   OVS_ACTION_ATTR_ADD_MPLS, /* struct ovs_action_add_mpls. */
> >>>   OVS_ACTION_ATTR_DEC_TTL,  /* Nested OVS_DEC_TTL_ATTR_*. */
> >>>   OVS_ACTION_ATTR_DROP, /* u32 xlate_error. */
> >>> + OVS_ACTION_ATTR_EMIT_SAMPLE,  /* Nested OVS_EMIT_SAMPLE_ATTR_*. */
> >>>
> >>>  #ifndef __KERNEL__
> >>>   OVS_ACTION_ATTR_TUNNEL_PUSH,   /* struct ovs_action_push_tnl*/
> >>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> >>> index c7f9e1490..c1171890c 100644
> >>> --- a/lib/dpif-netdev.c
> >>> +++ b/lib/dpif-netdev.c
> >>> @@ -9519,6 +9519,7 @@ dp_execute_cb(void *aux_, struct dp_packet_batch 
> >>> *packets_,
> >>>  case OVS_ACTION_ATTR_DROP:
> >>>  case OVS_ACTION_ATTR_ADD_MPLS:
> >>>  case OVS_ACTION_ATTR_DEC_TTL:
> >>> +case OVS_ACTION_ATTR_EMIT_SAMPLE:
> >>>  case __OVS_ACTION_ATTR_MAX:
> >>>  OVS_NOT_REACHED();
> >>>  }
> >>> diff --git a/lib/dpif.c b/lib/dpif.c
> >>> index 23e

Re: [ovs-dev] RFC OVN: fabric integration

2024-06-28 Thread Ilya Maximets
On 6/28/24 11:03, Ales Musil wrote:
> Hi Frode,
> 
> looking forward to the RFC. AFAIU it means that the routes would be exposed on
> LR, more specifically GW router. Would it make sense to allow this behavior 
> for
> provider networks (LS with localnet port)? In that case we could advertise
> chassis-local information from logical routers attached to BGP-enabled
> switches. E.g.: FIPs, LBs. It would cover the use case for distributed
> routers. To achieve that we should have BGP peers for each chassis that the LS
> is local on.

I haven't read the whole thing yet, but can we, please, stop adding routing 
features
to switches? :)  If someone wants routing, they should use a router, IMO.

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH net-next v6 10/10] selftests: openvswitch: add psample test

2024-06-28 Thread Adrian Moreno
Add a test to verify sampling packets via psample works.

In order to do that, create a subcommand in ovs-dpctl.py to listen to
on the psample multicast group and print samples.

Signed-off-by: Adrian Moreno 
---
 .../selftests/net/openvswitch/openvswitch.sh  | 115 +-
 .../selftests/net/openvswitch/ovs-dpctl.py|  73 ++-
 2 files changed, 182 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/net/openvswitch/openvswitch.sh 
b/tools/testing/selftests/net/openvswitch/openvswitch.sh
index 15bca0708717..2ee770281a08 100755
--- a/tools/testing/selftests/net/openvswitch/openvswitch.sh
+++ b/tools/testing/selftests/net/openvswitch/openvswitch.sh
@@ -20,7 +20,8 @@ tests="
nat_related_v4  ip4-nat-related: ICMP related 
matches work with SNAT
netlink_checks  ovsnl: validate netlink attrs 
and settings
upcall_interfaces   ovs: test the upcall interfaces
-   drop_reason drop: test drop reasons are 
emitted"
+   drop_reason drop: test drop reasons are 
emitted
+   psample psample: Sampling packets with 
psample"
 
 info() {
 [ $VERBOSE = 0 ] || echo $*
@@ -102,12 +103,21 @@ ovs_netns_spawn_daemon() {
shift
netns=$1
shift
-   info "spawning cmd: $*"
-   ip netns exec $netns $*  >> $ovs_dir/stdout  2>> $ovs_dir/stderr &
+   if [ "$netns" == "_default" ]; then
+   $*  >> $ovs_dir/stdout  2>> $ovs_dir/stderr &
+   else
+   ip netns exec $netns $*  >> $ovs_dir/stdout  2>> 
$ovs_dir/stderr &
+   fi
pid=$!
ovs_sbx "$sbx" on_exit "kill -TERM $pid 2>/dev/null"
 }
 
+ovs_spawn_daemon() {
+   sbx=$1
+   shift
+   ovs_netns_spawn_daemon $sbx "_default" $*
+}
+
 ovs_add_netns_and_veths () {
info "Adding netns attached: sbx:$1 dp:$2 {$3, $4, $5}"
ovs_sbx "$1" ip netns add "$3" || return 1
@@ -170,6 +180,19 @@ ovs_drop_reason_count()
return `echo "$perf_output" | grep "$pattern" | wc -l`
 }
 
+ovs_test_flow_fails () {
+   ERR_MSG="Flow actions may not be safe on all matching packets"
+
+   PRE_TEST=$(dmesg | grep -c "${ERR_MSG}")
+   ovs_add_flow $@ &> /dev/null $@ && return 1
+   POST_TEST=$(dmesg | grep -c "${ERR_MSG}")
+
+   if [ "$PRE_TEST" == "$POST_TEST" ]; then
+   return 1
+   fi
+   return 0
+}
+
 usage() {
echo
echo "$0 [OPTIONS] [TEST]..."
@@ -184,6 +207,92 @@ usage() {
exit 1
 }
 
+
+# psample test
+# - use psample to observe packets
+test_psample() {
+   sbx_add "test_psample" || return $?
+
+   # Add a datapath with per-vport dispatching.
+   ovs_add_dp "test_psample" psample -V 2:1 || return 1
+
+   info "create namespaces"
+   ovs_add_netns_and_veths "test_psample" "psample" \
+   client c0 c1 172.31.110.10/24 -u || return 1
+   ovs_add_netns_and_veths "test_psample" "psample" \
+   server s0 s1 172.31.110.20/24 -u || return 1
+
+   # Check if psample actions can be configured.
+   ovs_add_flow "test_psample" psample \
+   'in_port(1),eth(),eth_type(0x0806),arp()' 'psample(group=1)'
+   if [ $? == 1 ]; then
+   info "no support for psample - skipping"
+   ovs_exit_sig
+   return $ksft_skip
+   fi
+
+   ovs_del_flows "test_psample" psample
+
+   # Test action verification.
+   OLDIFS=$IFS
+   IFS='*'
+   min_key='in_port(1),eth(),eth_type(0x0800),ipv4()'
+   for testcase in \
+   "cookie to 
large"*"psample(group=1,cookie=1615141312111009080706050403020100)" \
+   "no group with cookie"*"psample(cookie=abcd)" \
+   "no group"*"sample()";
+   do
+   set -- $testcase;
+   ovs_test_flow_fails "test_psample" psample $min_key $2
+   if [ $? == 1 ]; then
+   info "failed - $1"
+   return 1
+   fi
+   done
+   IFS=$OLDIFS
+
+   ovs_del_flows "test_psample" psample
+   # Allow ARP
+   ovs_add_flow "test_psample" psample \
+   'in_port(1),eth(),eth_type(0x0806),arp()' '2' || return 1
+   ovs_add_flow "test_psample" psample \
+   'in_port(2),eth(),eth_type(0x0806),arp()' '1' || return 1
+
+   # Sample first 14 bytes of all traffic.
+   ovs_add_flow "test_psample" psample \
+   "in_port(1),eth(),eth_type(0x0800),ipv4()" \
+"trunc(14),psample(group=1,cookie=c0ffee),2"
+
+   # Sample all traffic. In this case, use a sample() action with both
+   # psample and an upcall emulating simultaneous local sampling and
+   # sFlow / IPFIX.
+   nlpid=$(grep -E "listening on upcall packet handler" \
+$ovs_dir/s0.out | cut -d ":" -f 2 | tr -d ' ')
+
+   ovs_add_flow "tes

[ovs-dev] [PATCH net-next v6 09/10] selftests: openvswitch: parse trunc action

2024-06-28 Thread Adrian Moreno
The trunc action was supported decode-able but not parse-able. Add
support for parsing the action string.

Reviewed-by: Aaron Conole 
Signed-off-by: Adrian Moreno 
---
 .../testing/selftests/net/openvswitch/ovs-dpctl.py  | 13 +
 1 file changed, 13 insertions(+)

diff --git a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py 
b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
index fa73f82639fe..558d12b0d39d 100644
--- a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
+++ b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
@@ -817,6 +817,19 @@ class ovsactions(nla):
 self["attrs"].append(["OVS_ACTION_ATTR_USERSPACE", uact])
 parsed = True
 
+elif parse_starts_block(actstr, "trunc(", False):
+parencount += 1
+actstr, val = parse_extract_field(
+actstr,
+"trunc(",
+r"([0-9]+)",
+int,
+False,
+None,
+)
+self["attrs"].append(["OVS_ACTION_ATTR_TRUNC", val])
+parsed = True
+
 actstr = actstr[strspn(actstr, ", ") :]
 while parencount > 0:
 parencount -= 1
-- 
2.45.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH net-next v6 08/10] selftests: openvswitch: add userspace parsing

2024-06-28 Thread Adrian Moreno
The userspace action lacks parsing support plus it contains a bug in the
name of one of its attributes.

This patch makes userspace action work.

Reviewed-by: Aaron Conole 
Signed-off-by: Adrian Moreno 
---
 .../selftests/net/openvswitch/ovs-dpctl.py| 24 +--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py 
b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
index 4dc6ceca7e1e..fa73f82639fe 100644
--- a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
+++ b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
@@ -575,13 +575,27 @@ class ovsactions(nla):
 print_str += "userdata="
 for f in self.get_attr("OVS_USERSPACE_ATTR_USERDATA"):
 print_str += "%x." % f
-if self.get_attr("OVS_USERSPACE_ATTR_TUN_PORT") is not None:
+if self.get_attr("OVS_USERSPACE_ATTR_EGRESS_TUN_PORT") is not None:
 print_str += "egress_tun_port=%d" % self.get_attr(
-"OVS_USERSPACE_ATTR_TUN_PORT"
+"OVS_USERSPACE_ATTR_EGRESS_TUN_PORT"
 )
 print_str += ")"
 return print_str
 
+def parse(self, actstr):
+attrs_desc = (
+("pid", "OVS_USERSPACE_ATTR_PID", int),
+("userdata", "OVS_USERSPACE_ATTR_USERDATA",
+lambda x: list(bytearray.fromhex(x))),
+("egress_tun_port", "OVS_USERSPACE_ATTR_EGRESS_TUN_PORT", int)
+)
+
+attrs, actstr = parse_attrs(actstr, attrs_desc)
+for attr in attrs:
+self["attrs"].append(attr)
+
+return actstr
+
 def dpstr(self, more=False):
 print_str = ""
 
@@ -797,6 +811,12 @@ class ovsactions(nla):
 self["attrs"].append(["OVS_ACTION_ATTR_PSAMPLE", psampleact])
 parsed = True
 
+elif parse_starts_block(actstr, "userspace(", False):
+uact = self.userspace()
+actstr = uact.parse(actstr[len("userspace(") : ])
+self["attrs"].append(["OVS_ACTION_ATTR_USERSPACE", uact])
+parsed = True
+
 actstr = actstr[strspn(actstr, ", ") :]
 while parencount > 0:
 parencount -= 1
-- 
2.45.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH net-next v6 07/10] selftests: openvswitch: add psample action

2024-06-28 Thread Adrian Moreno
Add sample and psample action support to ovs-dpctl.py.

Refactor common attribute parsing logic into an external function.

Signed-off-by: Adrian Moreno 
---
 .../selftests/net/openvswitch/ovs-dpctl.py| 162 +-
 1 file changed, 161 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py 
b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
index 9f8dec2f6539..4dc6ceca7e1e 100644
--- a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
+++ b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
@@ -8,6 +8,7 @@ import argparse
 import errno
 import ipaddress
 import logging
+import math
 import multiprocessing
 import re
 import struct
@@ -58,6 +59,7 @@ OVS_FLOW_CMD_DEL = 2
 OVS_FLOW_CMD_GET = 3
 OVS_FLOW_CMD_SET = 4
 
+UINT32_MAX = 0x
 
 def macstr(mac):
 outstr = ":".join(["%02X" % i for i in mac])
@@ -267,6 +269,75 @@ def parse_extract_field(
 return str_skipped, data
 
 
+def parse_attrs(actstr, attr_desc):
+"""Parses the given action string and returns a list of netlink
+attributes based on a list of attribute descriptions.
+
+Each element in the attribute description list is a tuple such as:
+(name, attr_name, parse_func)
+where:
+name: is the string representing the attribute
+attr_name: is the name of the attribute as defined in the uAPI.
+parse_func: is a callable accepting a string and returning either
+a single object (the parsed attribute value) or a tuple of
+two values (the parsed attribute value and the remaining string)
+
+Returns a list of attributes and the remaining string.
+"""
+def parse_attr(actstr, key, func):
+actstr = actstr[len(key) :]
+
+if not func:
+return None, actstr
+
+delim = actstr[0]
+actstr = actstr[1:]
+
+if delim == "=":
+pos = strcspn(actstr, ",)")
+ret = func(actstr[:pos])
+else:
+ret = func(actstr)
+
+if isinstance(ret, tuple):
+(datum, actstr) = ret
+else:
+datum = ret
+actstr = actstr[strcspn(actstr, ",)"):]
+
+if delim == "(":
+if not actstr or actstr[0] != ")":
+raise ValueError("Action contains unbalanced parentheses")
+
+actstr = actstr[1:]
+
+actstr = actstr[strspn(actstr, ", ") :]
+
+return datum, actstr
+
+attrs = []
+attr_desc = list(attr_desc)
+while actstr and actstr[0] != ")" and attr_desc:
+found = False
+for i, (key, attr, func) in enumerate(attr_desc):
+if actstr.startswith(key):
+datum, actstr = parse_attr(actstr, key, func)
+attrs.append([attr, datum])
+found = True
+del attr_desc[i]
+
+if not found:
+raise ValueError("Unknown attribute: '%s'" % actstr)
+
+actstr = actstr[strspn(actstr, ", ") :]
+
+if actstr[0] != ")":
+raise ValueError("Action string contains extra garbage or has "
+ "unbalanced parenthesis: '%s'" % actstr)
+
+return attrs, actstr[1:]
+
+
 class ovs_dp_msg(genlmsg):
 # include the OVS version
 # We need a custom header rather than just being able to rely on
@@ -285,7 +356,7 @@ class ovsactions(nla):
 ("OVS_ACTION_ATTR_SET", "none"),
 ("OVS_ACTION_ATTR_PUSH_VLAN", "none"),
 ("OVS_ACTION_ATTR_POP_VLAN", "flag"),
-("OVS_ACTION_ATTR_SAMPLE", "none"),
+("OVS_ACTION_ATTR_SAMPLE", "sample"),
 ("OVS_ACTION_ATTR_RECIRC", "uint32"),
 ("OVS_ACTION_ATTR_HASH", "none"),
 ("OVS_ACTION_ATTR_PUSH_MPLS", "none"),
@@ -304,8 +375,85 @@ class ovsactions(nla):
 ("OVS_ACTION_ATTR_ADD_MPLS", "none"),
 ("OVS_ACTION_ATTR_DEC_TTL", "none"),
 ("OVS_ACTION_ATTR_DROP", "uint32"),
+("OVS_ACTION_ATTR_PSAMPLE", "psample"),
 )
 
+class psample(nla):
+nla_flags = NLA_F_NESTED
+
+nla_map = (
+("OVS_PSAMPLE_ATTR_UNSPEC", "none"),
+("OVS_PSAMPLE_ATTR_GROUP", "uint32"),
+("OVS_PSAMPLE_ATTR_COOKIE", "array(uint8)"),
+)
+
+def dpstr(self, more=False):
+args = "group=%d" % self.get_attr("OVS_PSAMPLE_ATTR_GROUP")
+
+cookie = self.get_attr("OVS_PSAMPLE_ATTR_COOKIE")
+if cookie:
+args += ",cookie(%s)" % \
+"".join(format(x, "02x") for x in cookie)
+
+return "psample(%s)" % args
+
+def parse(self, actstr):
+desc = (
+("group", "OVS_PSAMPLE_ATTR_GROUP", int),
+("cookie", "OVS_PSAMPLE_ATTR_COOKIE",
+lambda x: list(bytearray.fromhex(x)))
+)
+
+attrs, actstr = parse_attrs(actstr, desc)
+
+for attr in attrs:
+self["attrs"].append(attr)
+
+return acts

[ovs-dev] [PATCH net-next v6 05/10] net: openvswitch: add psample action

2024-06-28 Thread Adrian Moreno
Add support for a new action: psample.

This action accepts a u32 group id and a variable-length cookie and uses
the psample multicast group to make the packet available for
observability.

The maximum length of the user-defined cookie is set to 16, same as
tc_cookie, to discourage using cookies that will not be offloadable.

Signed-off-by: Adrian Moreno 
---
 Documentation/netlink/specs/ovs_flow.yaml | 17 
 include/uapi/linux/openvswitch.h  | 28 ++
 net/openvswitch/Kconfig   |  1 +
 net/openvswitch/actions.c | 47 +++
 net/openvswitch/flow_netlink.c| 32 ++-
 5 files changed, 124 insertions(+), 1 deletion(-)

diff --git a/Documentation/netlink/specs/ovs_flow.yaml 
b/Documentation/netlink/specs/ovs_flow.yaml
index 4fdfc6b5cae9..46f5d1cd8a5f 100644
--- a/Documentation/netlink/specs/ovs_flow.yaml
+++ b/Documentation/netlink/specs/ovs_flow.yaml
@@ -727,6 +727,12 @@ attribute-sets:
 name: dec-ttl
 type: nest
 nested-attributes: dec-ttl-attrs
+  -
+name: psample
+type: nest
+nested-attributes: psample-attrs
+doc: |
+  Sends a packet sample to psample for external observation.
   -
 name: tunnel-key-attrs
 enum-name: ovs-tunnel-key-attr
@@ -938,6 +944,17 @@ attribute-sets:
   -
 name: gbp
 type: u32
+  -
+name: psample-attrs
+enum-name: ovs-psample-attr
+name-prefix: ovs-psample-attr-
+attributes:
+  -
+name: group
+type: u32
+  -
+name: cookie
+type: binary
 
 operations:
   name-prefix: ovs-flow-cmd-
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index efc82c318fa2..07086759556b 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -914,6 +914,31 @@ struct check_pkt_len_arg {
 };
 #endif
 
+#define OVS_PSAMPLE_COOKIE_MAX_SIZE 16
+/**
+ * enum ovs_pample_attr - Attributes for %OVS_ACTION_ATTR_PSAMPLE
+ * action.
+ *
+ * @OVS_PSAMPLE_ATTR_GROUP: 32-bit number to identify the source of the
+ * sample.
+ * @OVS_PSAMPLE_ATTR_COOKIE: An optional variable-length binary cookie that
+ * contains user-defined metadata. The maximum length is
+ * OVS_PSAMPLE_COOKIE_MAX_SIZE bytes.
+ *
+ * Sends the packet to the psample multicast group with the specified group and
+ * cookie. It is possible to combine this action with the
+ * %OVS_ACTION_ATTR_TRUNC action to limit the size of the sample.
+ */
+enum ovs_psample_attr {
+   OVS_PSAMPLE_ATTR_GROUP = 1, /* u32 number. */
+   OVS_PSAMPLE_ATTR_COOKIE,/* Optional, user specified cookie. */
+
+   /* private: */
+   __OVS_PSAMPLE_ATTR_MAX
+};
+
+#define OVS_PSAMPLE_ATTR_MAX (__OVS_PSAMPLE_ATTR_MAX - 1)
+
 /**
  * enum ovs_action_attr - Action types.
  *
@@ -966,6 +991,8 @@ struct check_pkt_len_arg {
  * of l3 tunnel flag in the tun_flags field of OVS_ACTION_ATTR_ADD_MPLS
  * argument.
  * @OVS_ACTION_ATTR_DROP: Explicit drop action.
+ * @OVS_ACTION_ATTR_PSAMPLE: Send a sample of the packet to external observers
+ * via psample.
  *
  * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not all
  * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
@@ -1004,6 +1031,7 @@ enum ovs_action_attr {
OVS_ACTION_ATTR_ADD_MPLS, /* struct ovs_action_add_mpls. */
OVS_ACTION_ATTR_DEC_TTL,  /* Nested OVS_DEC_TTL_ATTR_*. */
OVS_ACTION_ATTR_DROP, /* u32 error code. */
+   OVS_ACTION_ATTR_PSAMPLE,  /* Nested OVS_PSAMPLE_ATTR_*. */
 
__OVS_ACTION_ATTR_MAX,/* Nothing past this will be accepted
   * from userspace. */
diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 29a7081858cd..2535f3f9f462 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -10,6 +10,7 @@ config OPENVSWITCH
   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
 (!NF_NAT || NF_NAT) && \
 (!NETFILTER_CONNCOUNT || 
NETFILTER_CONNCOUNT)))
+   depends on PSAMPLE || !PSAMPLE
select LIBCRC32C
select MPLS
select NET_MPLS_GSO
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 964225580824..a035b7e677dd 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -24,6 +24,11 @@
 #include 
 #include 
 #include 
+
+#if IS_ENABLED(CONFIG_PSAMPLE)
+#include 
+#endif
+
 #include 
 
 #include "datapath.h"
@@ -1299,6 +1304,39 @@ static int execute_dec_ttl(struct sk_buff *skb, struct 
sw_flow_key *key)
return 0;
 }
 
+#if IS_ENABLED(CONFIG_PSAMPLE)
+static void execute_psample(struct datapath *dp, struct sk_buff *skb,
+   const struct nlattr *attr)
+{
+   struct psample_group psample_group = {};
+   struct psample_metadata md = {};
+ 

[ovs-dev] [PATCH net-next v6 03/10] net: psample: skip packet copy if no listeners

2024-06-28 Thread Adrian Moreno
If nobody is listening on the multicast group, generating the sample,
which involves copying packet data, seems completely unnecessary.

Return fast in this case.

Acked-by: Eelco Chaudron 
Reviewed-by: Ido Schimmel 
Reviewed-by: Simon Horman 
Signed-off-by: Adrian Moreno 
---
 net/psample/psample.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/psample/psample.c b/net/psample/psample.c
index b37488f426bc..1c76f3e48dcd 100644
--- a/net/psample/psample.c
+++ b/net/psample/psample.c
@@ -376,6 +376,10 @@ void psample_sample_packet(struct psample_group *group, 
struct sk_buff *skb,
void *data;
int ret;
 
+   if (!genl_has_listeners(&psample_nl_family, group->net,
+   PSAMPLE_NL_MCGRP_SAMPLE))
+   return;
+
meta_len = (in_ifindex ? nla_total_size(sizeof(u16)) : 0) +
   (out_ifindex ? nla_total_size(sizeof(u16)) : 0) +
   (md->out_tc_valid ? nla_total_size(sizeof(u16)) : 0) +
-- 
2.45.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH net-next v6 06/10] net: openvswitch: store sampling probability in cb.

2024-06-28 Thread Adrian Moreno
When a packet sample is observed, the sampling rate that was used is
important to estimate the real frequency of such event.

Store the probability of the parent sample action in the skb's cb area
and use it in psample action to pass it down to psample module.

Acked-by: Eelco Chaudron 
Reviewed-by: Ilya Maximets 
Signed-off-by: Adrian Moreno 
---
 include/uapi/linux/openvswitch.h |  3 ++-
 net/openvswitch/actions.c| 20 +---
 net/openvswitch/datapath.h   |  3 +++
 net/openvswitch/vport.c  |  1 +
 4 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 07086759556b..8a0cf8060c37 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -649,7 +649,8 @@ enum ovs_flow_attr {
  * Actions are passed as nested attributes.
  *
  * Executes the specified actions with the given probability on a per-packet
- * basis.
+ * basis. Nested actions will be able to access the probability value of the
+ * parent @OVS_ACTION_ATTR_SAMPLE.
  */
 enum ovs_sample_attr {
OVS_SAMPLE_ATTR_UNSPEC,
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index a035b7e677dd..34af6bce4085 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -1048,12 +1048,15 @@ static int sample(struct datapath *dp, struct sk_buff 
*skb,
struct nlattr *sample_arg;
int rem = nla_len(attr);
const struct sample_arg *arg;
+   u32 init_probability;
bool clone_flow_key;
+   int err;
 
/* The first action is always 'OVS_SAMPLE_ATTR_ARG'. */
sample_arg = nla_data(attr);
arg = nla_data(sample_arg);
actions = nla_next(sample_arg, &rem);
+   init_probability = OVS_CB(skb)->probability;
 
if ((arg->probability != U32_MAX) &&
(!arg->probability || get_random_u32() > arg->probability)) {
@@ -1062,9 +1065,16 @@ static int sample(struct datapath *dp, struct sk_buff 
*skb,
return 0;
}
 
+   OVS_CB(skb)->probability = arg->probability;
+
clone_flow_key = !arg->exec;
-   return clone_execute(dp, skb, key, 0, actions, rem, last,
-clone_flow_key);
+   err = clone_execute(dp, skb, key, 0, actions, rem, last,
+   clone_flow_key);
+
+   if (!last)
+   OVS_CB(skb)->probability = init_probability;
+
+   return err;
 }
 
 /* When 'last' is true, clone() should always consume the 'skb'.
@@ -1311,6 +1321,7 @@ static void execute_psample(struct datapath *dp, struct 
sk_buff *skb,
struct psample_group psample_group = {};
struct psample_metadata md = {};
const struct nlattr *a;
+   u32 rate;
int rem;
 
nla_for_each_attr(a, nla_data(attr), nla_len(attr), rem) {
@@ -1329,8 +1340,11 @@ static void execute_psample(struct datapath *dp, struct 
sk_buff *skb,
psample_group.net = ovs_dp_get_net(dp);
md.in_ifindex = OVS_CB(skb)->input_vport->dev->ifindex;
md.trunc_size = skb->len - OVS_CB(skb)->cutlen;
+   md.rate_as_probability = 1;
+
+   rate = OVS_CB(skb)->probability ? OVS_CB(skb)->probability : U32_MAX;
 
-   psample_sample_packet(&psample_group, skb, 0, &md);
+   psample_sample_packet(&psample_group, skb, rate, &md);
 }
 #else
 static inline void execute_psample(struct datapath *dp, struct sk_buff *skb,
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 0cd29971a907..9ca6231ea647 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -115,12 +115,15 @@ struct datapath {
  * fragmented.
  * @acts_origlen: The netlink size of the flow actions applied to this skb.
  * @cutlen: The number of bytes from the packet end to be removed.
+ * @probability: The sampling probability that was applied to this skb; 0 means
+ * no sampling has occurred; U32_MAX means 100% probability.
  */
 struct ovs_skb_cb {
struct vport*input_vport;
u16 mru;
u16 acts_origlen;
u32 cutlen;
+   u32 probability;
 };
 #define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb)
 
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 972ae01a70f7..8732f6e51ae5 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -500,6 +500,7 @@ int ovs_vport_receive(struct vport *vport, struct sk_buff 
*skb,
OVS_CB(skb)->input_vport = vport;
OVS_CB(skb)->mru = 0;
OVS_CB(skb)->cutlen = 0;
+   OVS_CB(skb)->probability = 0;
if (unlikely(dev_net(skb->dev) != ovs_dp_get_net(vport->dp))) {
u32 mark;
 
-- 
2.45.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH net-next v6 04/10] net: psample: allow using rate as probability

2024-06-28 Thread Adrian Moreno
Although not explicitly documented in the psample module itself, the
definition of PSAMPLE_ATTR_SAMPLE_RATE seems inherited from act_sample.

Quoting tc-sample(8):
"RATE of 100 will lead to an average of one sampled packet out of every
100 observed."

With this semantics, the rates that we can express with an unsigned
32-bits number are very unevenly distributed and concentrated towards
"sampling few packets".
For example, we can express a probability of 2.32E-8% but we
cannot express anything between 100% and 50%.

For sampling applications that are capable of sampling a decent
amount of packets, this sampling rate semantics is not very useful.

Add a new flag to the uAPI that indicates that the sampling rate is
expressed in scaled probability, this is:
- 0 is 0% probability, no packets get sampled.
- U32_MAX is 100% probability, all packets get sampled.

Acked-by: Eelco Chaudron 
Reviewed-by: Ido Schimmel 
Signed-off-by: Adrian Moreno 
---
 include/net/psample.h|  3 ++-
 include/uapi/linux/psample.h | 10 +-
 net/psample/psample.c|  3 +++
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/net/psample.h b/include/net/psample.h
index 2ac71260a546..c52e9ebd88dd 100644
--- a/include/net/psample.h
+++ b/include/net/psample.h
@@ -24,7 +24,8 @@ struct psample_metadata {
u8 out_tc_valid:1,
   out_tc_occ_valid:1,
   latency_valid:1,
-  unused:5;
+  rate_as_probability:1,
+  unused:4;
const u8 *user_cookie;
u32 user_cookie_len;
 };
diff --git a/include/uapi/linux/psample.h b/include/uapi/linux/psample.h
index e80637e1d97b..b765f0e81f20 100644
--- a/include/uapi/linux/psample.h
+++ b/include/uapi/linux/psample.h
@@ -8,7 +8,11 @@ enum {
PSAMPLE_ATTR_ORIGSIZE,
PSAMPLE_ATTR_SAMPLE_GROUP,
PSAMPLE_ATTR_GROUP_SEQ,
-   PSAMPLE_ATTR_SAMPLE_RATE,
+   PSAMPLE_ATTR_SAMPLE_RATE,   /* u32, ratio between observed and
+* sampled packets or scaled probability
+* if PSAMPLE_ATTR_SAMPLE_PROBABILITY
+* is set.
+*/
PSAMPLE_ATTR_DATA,
PSAMPLE_ATTR_GROUP_REFCOUNT,
PSAMPLE_ATTR_TUNNEL,
@@ -20,6 +24,10 @@ enum {
PSAMPLE_ATTR_TIMESTAMP, /* u64, nanoseconds */
PSAMPLE_ATTR_PROTO, /* u16 */
PSAMPLE_ATTR_USER_COOKIE,   /* binary, user provided data */
+   PSAMPLE_ATTR_SAMPLE_PROBABILITY,/* no argument, interpret rate in
+* PSAMPLE_ATTR_SAMPLE_RATE as a
+* probability scaled 0 - U32_MAX.
+*/
 
__PSAMPLE_ATTR_MAX
 };
diff --git a/net/psample/psample.c b/net/psample/psample.c
index 1c76f3e48dcd..f48b5b9cd409 100644
--- a/net/psample/psample.c
+++ b/net/psample/psample.c
@@ -497,6 +497,9 @@ void psample_sample_packet(struct psample_group *group, 
struct sk_buff *skb,
md->user_cookie))
goto error;
 
+   if (md->rate_as_probability)
+   nla_put_flag(skb, PSAMPLE_ATTR_SAMPLE_PROBABILITY);
+
genlmsg_end(nl_skb, data);
genlmsg_multicast_netns(&psample_nl_family, group->net, nl_skb, 0,
PSAMPLE_NL_MCGRP_SAMPLE, GFP_ATOMIC);
-- 
2.45.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH net-next v6 01/10] net: psample: add user cookie

2024-06-28 Thread Adrian Moreno
Add a user cookie to the sample metadata so that sample emitters can
provide more contextual information to samples.

If present, send the user cookie in a new attribute:
PSAMPLE_ATTR_USER_COOKIE.

Acked-by: Eelco Chaudron 
Reviewed-by: Simon Horman 
Reviewed-by: Ido Schimmel 
Signed-off-by: Adrian Moreno 
---
 include/net/psample.h| 2 ++
 include/uapi/linux/psample.h | 1 +
 net/psample/psample.c| 9 -
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/net/psample.h b/include/net/psample.h
index 0509d2d6be67..2ac71260a546 100644
--- a/include/net/psample.h
+++ b/include/net/psample.h
@@ -25,6 +25,8 @@ struct psample_metadata {
   out_tc_occ_valid:1,
   latency_valid:1,
   unused:5;
+   const u8 *user_cookie;
+   u32 user_cookie_len;
 };
 
 struct psample_group *psample_group_get(struct net *net, u32 group_num);
diff --git a/include/uapi/linux/psample.h b/include/uapi/linux/psample.h
index e585db5bf2d2..e80637e1d97b 100644
--- a/include/uapi/linux/psample.h
+++ b/include/uapi/linux/psample.h
@@ -19,6 +19,7 @@ enum {
PSAMPLE_ATTR_LATENCY,   /* u64, nanoseconds */
PSAMPLE_ATTR_TIMESTAMP, /* u64, nanoseconds */
PSAMPLE_ATTR_PROTO, /* u16 */
+   PSAMPLE_ATTR_USER_COOKIE,   /* binary, user provided data */
 
__PSAMPLE_ATTR_MAX
 };
diff --git a/net/psample/psample.c b/net/psample/psample.c
index a5d9b8446f77..b37488f426bc 100644
--- a/net/psample/psample.c
+++ b/net/psample/psample.c
@@ -386,7 +386,9 @@ void psample_sample_packet(struct psample_group *group, 
struct sk_buff *skb,
   nla_total_size(sizeof(u32)) +/* group_num */
   nla_total_size(sizeof(u32)) +/* seq */
   nla_total_size_64bit(sizeof(u64)) +  /* timestamp */
-  nla_total_size(sizeof(u16)); /* protocol */
+  nla_total_size(sizeof(u16)) +/* protocol */
+  (md->user_cookie_len ?
+   nla_total_size(md->user_cookie_len) : 0); /* user cookie */
 
 #ifdef CONFIG_INET
tun_info = skb_tunnel_info(skb);
@@ -486,6 +488,11 @@ void psample_sample_packet(struct psample_group *group, 
struct sk_buff *skb,
}
 #endif
 
+   if (md->user_cookie && md->user_cookie_len &&
+   nla_put(nl_skb, PSAMPLE_ATTR_USER_COOKIE, md->user_cookie_len,
+   md->user_cookie))
+   goto error;
+
genlmsg_end(nl_skb, data);
genlmsg_multicast_netns(&psample_nl_family, group->net, nl_skb, 0,
PSAMPLE_NL_MCGRP_SAMPLE, GFP_ATOMIC);
-- 
2.45.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH net-next v6 02/10] net: sched: act_sample: add action cookie to sample

2024-06-28 Thread Adrian Moreno
If the action has a user_cookie, pass it along to the sample so it can
be easily identified.

Acked-by: Eelco Chaudron 
Reviewed-by: Ido Schimmel 
Signed-off-by: Adrian Moreno 
---
 net/sched/act_sample.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index a69b53d54039..2ceb4d141b71 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -167,7 +167,9 @@ TC_INDIRECT_SCOPE int tcf_sample_act(struct sk_buff *skb,
 {
struct tcf_sample *s = to_sample(a);
struct psample_group *psample_group;
+   u8 cookie_data[TC_COOKIE_MAX_SIZE];
struct psample_metadata md = {};
+   struct tc_cookie *user_cookie;
int retval;
 
tcf_lastuse_update(&s->tcf_tm);
@@ -189,6 +191,16 @@ TC_INDIRECT_SCOPE int tcf_sample_act(struct sk_buff *skb,
if (skb_at_tc_ingress(skb) && tcf_sample_dev_ok_push(skb->dev))
skb_push(skb, skb->mac_len);
 
+   rcu_read_lock();
+   user_cookie = rcu_dereference(a->user_cookie);
+   if (user_cookie) {
+   memcpy(cookie_data, user_cookie->data,
+  user_cookie->len);
+   md.user_cookie = cookie_data;
+   md.user_cookie_len = user_cookie->len;
+   }
+   rcu_read_unlock();
+
md.trunc_size = s->truncate ? s->trunc_size : skb->len;
psample_sample_packet(psample_group, skb, s->rate, &md);
 
-- 
2.45.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH net-next v6 00/10] net: openvswitch: Add sample multicasting.

2024-06-28 Thread Adrian Moreno
** Background **
Currently, OVS supports several packet sampling mechanisms (sFlow,
per-bridge IPFIX, per-flow IPFIX). These end up being translated into a
userspace action that needs to be handled by ovs-vswitchd's handler
threads only to be forwarded to some third party application that
will somehow process the sample and provide observability on the
datapath.

A particularly interesting use-case is controller-driven
per-flow IPFIX sampling where the OpenFlow controller can add metadata
to samples (via two 32bit integers) and this metadata is then available
to the sample-collecting system for correlation.

** Problem **
The fact that sampled traffic share netlink sockets and handler thread
time with upcalls, apart from being a performance bottleneck in the
sample extraction itself, can severely compromise the datapath,
yielding this solution unfit for highly loaded production systems.

Users are left with little options other than guessing what sampling
rate will be OK for their traffic pattern and system load and dealing
with the lost accuracy.

Looking at available infrastructure, an obvious candidated would be
to use psample. However, it's current state does not help with the
use-case at stake because sampled packets do not contain user-defined
metadata.

** Proposal **
This series is an attempt to fix this situation by extending the
existing psample infrastructure to carry a variable length
user-defined cookie.

The main existing user of psample is tc's act_sample. It is also
extended to forward the action's cookie to psample.

Finally, a new OVS action (OVS_SAMPLE_ATTR_EMIT_SAMPLE) is created.
It accepts a group and an optional cookie and uses psample to
multicast the packet and the metadata.

--
v5 -> v6:
- Renamed emit_sample -> psample
- Addressed unused variable and conditionally compilation of function.

v4 -> v5:
- Rebased.
- Removed lefover enum value and wrapped some long lines in selftests.

v3 -> v4:
- Rebased.
- Addressed Jakub's comment on private and unused nla attributes.

v2 -> v3:
- Addressed comments from Simon, Aaron and Ilya.
- Dropped probability propagation in nested sample actions.
- Dropped patch v2's 7/9 in favor of a userspace implementation and
consume skb if emit_sample is the last action, same as we do with
userspace.
- Split ovs-dpctl.py features in independent patches.

v1 -> v2:
- Create a new action ("emit_sample") rather than reuse existing
  "sample" one.
- Add probability semantics to psample's sampling rate.
- Store sampling probability in skb's cb area and use it in emit_sample.
- Test combining "emit_sample" with "trunc"
- Drop group_id filtering and tracepoint in psample.

rfc_v2 -> v1:
- Accommodate Ilya's comments.
- Split OVS's attribute in two attributes and simplify internal
handling of psample arguments.
- Extend psample and tc with a user-defined cookie.
- Add a tracepoint to psample to facilitate troubleshooting.

rfc_v1 -> rfc_v2:
- Use psample instead of a new OVS-only multicast group.
- Extend psample and tc with a user-defined cookie.


Adrian Moreno (10):
  net: psample: add user cookie
  net: sched: act_sample: add action cookie to sample
  net: psample: skip packet copy if no listeners
  net: psample: allow using rate as probability
  net: openvswitch: add psample action
  net: openvswitch: store sampling probability in cb.
  selftests: openvswitch: add psample action
  selftests: openvswitch: add userspace parsing
  selftests: openvswitch: parse trunc action
  selftests: openvswitch: add psample test

 Documentation/netlink/specs/ovs_flow.yaml |  17 ++
 include/net/psample.h |   5 +-
 include/uapi/linux/openvswitch.h  |  31 +-
 include/uapi/linux/psample.h  |  11 +-
 net/openvswitch/Kconfig   |   1 +
 net/openvswitch/actions.c |  65 -
 net/openvswitch/datapath.h|   3 +
 net/openvswitch/flow_netlink.c|  32 ++-
 net/openvswitch/vport.c   |   1 +
 net/psample/psample.c |  16 +-
 net/sched/act_sample.c|  12 +
 .../selftests/net/openvswitch/openvswitch.sh  | 115 +++-
 .../selftests/net/openvswitch/ovs-dpctl.py| 272 +-
 13 files changed, 565 insertions(+), 16 deletions(-)

-- 
2.45.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v6 0/2] Add support to disable VXLAN mode.

2024-06-28 Thread Vladislav Odintsov
Thank you all!

regards,
 
Vladislav Odintsov

-Original Message-
From: dev  on behalf of Dumitru Ceara 

Date: Friday, 28 June 2024 at 12:03
To: Vladislav Odintsov , "d...@openvswitch.org" 

Subject: Re: [ovs-dev] [PATCH ovn v6 0/2] Add support to disable VXLAN mode.

On 6/7/24 15:54, Vladislav Odintsov wrote:
> v6:
>   - Addressed Mark's review comments:
> 1. Removed global variable "vxlan_mode" change from "global" engine 
node.
> 2. Configuration knob "disable_vxlan_mode" was renamed to "vxlan_mode"
> v5:
>   - Addressed Ihar's review comments:
> 1. fixed errors after incorrect conflicts solving on rebase;
> 2. changed VXLAN mode naming to capitalized;
> 3. clarified VXLAN mode in ovn-architecture man page.
> v4:
>   - Addressed Dumitru's and Ihar's review comments;
>   - single patch was split into two:
> 1. function call replaced with a global variable `vxlan_mode`;
> 2. introduced `disable_vxlan_mode` configuration knob;
>   - rebased onto latest main branch.
> v3:
>   - Removed accidental ovs submodule change.
> v2:
>   - Added NEWS item.
> 
> Vladislav Odintsov (2):
>   northd: Make `vxlan_mode` a global variable.
>   northd: Add support for disabling vxlan mode.
> 

Thanks, Vladislav, Ihar and Ales!

I applied this series to main.

Best regards,
Dumitru

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2 2/2] tests: Skip memory error in DPDK tests.

2024-06-28 Thread Dumitru Ceara
On 6/28/24 11:16, David Marchand wrote:
> On Thu, Jun 27, 2024 at 10:38 AM Ales Musil  wrote:
>>
>> Add skip for memory error produced by DPDK on ARM with 2MB hugepages
>> which isn't harmful [0].
>>
>> [0] http://mails.dpdk.org/archives/dev/2024-June/296764.html
>>
>> Suggested-by: David Marchand 
>> Signed-off-by: Ales Musil 
> 
> I would mention ARM in the commit title (maybe it can be tweaked when
> applying), but otherwise the patch lgtm.

I changed it to:

tests: Skip memory error triggered on ARM in DPDK tests.

> Reviewed-by: David Marchand 
> 

Thanks, Ales and David!  Applied to main.

Regards,
Dumitru

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2 1/2] ci: Save some DPDK compilation time.

2024-06-28 Thread Dumitru Ceara
On 6/28/24 11:15, David Marchand wrote:
> On Thu, Jun 27, 2024 at 10:38 AM Ales Musil  wrote:
>>
>> Add more options that will shave some compilation time off DPDK by
>> skipping unused parts of the code.
>>
>> Suggested-by: David Marchand 
>> Signed-off-by: Ales Musil 
> 
> Reviewed-by: David Marchand 
> 

Thanks, Ales and David!  Applied to main.

Regards,
Dumitru


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] controller: Add random delay during fdb learning.

2024-06-28 Thread Naveen Yerramneni


> On 28 Jun 2024, at 2:45 PM, Dumitru Ceara  wrote:
> 
> !---|
>  CAUTION: External Email
> 
> |---!
> 
> On 6/28/24 11:02, Dumitru Ceara wrote:
> 
> [...]
> 
>> 
>> This needs to be "struct fdb *fdb".  I fixed it up and applied the patch
>> to main.
>> 
> 
> I also added you to the AUTHORS file, Naveen.  I'm not sure why you
> weren't there already.  I apologize for that, should be fixed now.

Thank you, Dumitru.

Thanks,
Naveen


> 
> Regards,
> Dumitru
> 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] ovsdb: raft: Don't forward more than one command to the leader.

2024-06-28 Thread Ilya Maximets
On 6/28/24 09:20, Dumitru Ceara wrote:
> On 6/27/24 00:02, Ilya Maximets wrote:
>> Every transaction has RAFT log prerequisites.  Even if transactions
>> are not related (because RAFT doesn't actually know what data it is
>> handling).  When leader writes a new record to a RAFT storage, it is
>> getting appended to the log right away and changes current 'eid',
>> i.e., changes prerequisites.  The leader will not try to write new
>> records until the current one is committed, because until then the
>> pre-check will be failing.
>>
>> However, that is different for the follower.  Followers do not add
>> records to the RAFT log until the leader sends an append request back.
>> So, if there are multiple transactions pending on a follower, it will
>> create a command for each of them and prerequisites will be set to the
>> same values.  All these commands will be sent to the leader, but only
>> one can succeed at a time, because accepting one command immediately
>> changes prerequisites and all other commands become non-applicable.
>> So, out of N commands, 1 will succeed and N - 1 will fail.  The cluster
>> failure is a transient failure, so the follower will re-process all the
>> failed transactions and send them again.  1 will succeed and N - 2 will
>> fail.  And so on, until there are no more transactions.  In the end,
>> instead of processing N transactions, the follower is performing
>> N * (N - 1) / 2 transaction processing iterations.  That is consuming
>> a huge amount of CPU resources completely unnecessarily.
>>
>> Since there is no real chance for multiple transactions from the same
>> follower to succeed, it's better to not send them in the first place.
>> This also eliminates prerequisite mismatch messages on a leader in
>> this particular case.
>>
> 
> Makes sense!
> 
>> In a test with 30 parallel shell threads executing 12K transactions
>> total with separate ovsdb-client calls through the same follower there
>> is about 60% performance improvement.  The test takes ~100 seconds to
>> complete without this change and ~40 seconds with this change applied.
>> The new time is very close to what it takes to execute the same test
>> through the cluster leader.
>>
> 
> Maybe a public link to the test (if possible) would be a nice reference
> to have in the future?

I'm not sure where to post it, so I'll just leave it here, and I can
add a link to this post into the commit message.

The script is adding 12K ports across 120 logical switches in a style
of ovn-kubernetes (sets up port security, some external IDs and adds
to an address set):

---
$ cat ../transaction-benchmark-parallel.sh
#set -x
set -o errexit

# Creating 120 logical switches
for i in $(seq 120); do
ovn-nbctl ls-add ip-10-10-177-$i.us-west-2.compute.internal
done
IFS=$'\r\n' GLOBIGNORE='*' \
command eval  'uuids=($(ovn-nbctl --bare --columns _uuid list 
logical_switch | sed "/^$/d"))'

for i in $(seq 120); do
echo "uuid #${i}: ${uuids[$(($i - 1))]}"
done

# Creating one addres set
as_name="a0123456789012345678"
as_uuid=$(ovn-nbctl create address_set name=${as_name})

# Adding ports
function add_400() {
rm -rf txn_results$1.txt
for i in $(seq $1 $(($1 + 399)) ); do
echo $i

port_name="node-density-----_node-density-$i"
namespace="node-density-----"
ls_name="ip-10-10-177-$((${i} % 120 + 1)).us-west-2.compute.internal"
uuid_name="row____"
mac=$(printf '0a:58:0a:9b:%02x:%02x' $((${i} >> 8)) $((${i} & 255)))
ip=$(printf "10.11.%d.%d\n" $((${i} / 255)) $((${i} % 255 + 1)))

time ovsdb-client transact unix:$(pwd)/sandbox/nb2.ovsdb 
"[\"OVN_Northbound\",
{
  \"uuid-name\":\"${uuid_name}\",
  \"row\":{
  \"name\":\"${port_name}\",
  \"options\":[\"map\",[[\"requested-chassis\",\"${ls_name}\"]]],
  \"addresses\":[\"set\",[\"${mac} ${ip}\"]],
  
\"external_ids\":[\"map\",[[\"pod\",\"true\"],[\"namespace\",\"${namespace}\"]]],
  \"port_security\":[\"set\",[\"${mac} ${ip}\"]]
},
  \"op\":\"insert\", \"table\":\"Logical_Switch_Port\"
},
{
  \"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${uuids[$(($i % 120))]}\"]]],
  
\"mutations\":[[\"ports\",\"insert\",[\"set\",[[\"named-uuid\",\"${uuid_name}\"],
  \"op\":\"mutate\", \"table\":\"Logical_Switch\"
},
{
  \"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${as_uuid}\"]]],
  \"mutations\":[[\"addresses\",\"insert\",[\"set\",[\"${ip}\",
  \"op\":\"mutate\", \"table\":\"Address_Set\"
}]" >> txn_results$1.txt
done
echo "done" >> txn_results$1.txt
}

rm -f txn_results*.txt

n=0
for i in $(seq 0 400 11600); do
n=$(($n + 1))
add_400 $i &
done

while [ $(grep 'done' txn_results*.txt | wc -l) != $n ]; do
sleep 1
done
---

It is not th

Re: [ovs-dev] OVN technical community meeting - June 24th

2024-06-28 Thread Dumitru Ceara
On 6/25/24 19:03, Frode Nordahl wrote:
> Hello,
> 
> On Tue, Jun 25, 2024 at 6:04 PM Dumitru Ceara  wrote:
>>
>> Hi,
>>
>> Thanks to everyone who attended the OVN meeting yesterday!
>>
>> The recording is available here:
>> https://www.youtube.com/watch?v=bonLhUj2oGg
>>
>> Transcripts:
>> https://drive.google.com/file/d/1ioVXcHiuD-xr5RQTh5OSGbpmhUtIDiO_/view?usp=sharing
>>
>> Notes:
>> https://docs.google.com/document/d/1dG4GwcYOSs4uArPGtOoaP5tH4KCto-GH_C3tIXSnZZ8/
> 
> Thank you for organizing, it was a good discussion! As promised I also
> sent out an e-mail detailing some of the points we spoke about, I
> suggest we continue the discussion there [0].
> 

Awesome, thanks!  I see Ales already shared some thoughts there.

>> The main discussion topic was BGP integration in OVN and mostly about
>> where to run the BGP control plane (e.g., FRR on the side vs an in-tree
>> BGP implementation) and how to get packets to it.  I'll try to list some
>> points here:
>>
>> Frode suggested a "L2 load balancer" feature to be added to OVN to
>> "sniff" BGP control packets reaching a logical router and send them out
>> (through a VIF?) to FRR running alongside OVN on the host.
>>
>> Ilya also provided an alternative of using a new type of logical router
>> port, similar to a loopback interface and bind BGP configuration to that
>> one.
>>
>> I had an off-list discussion with Tim Rozet (in CC) from the OpenShift
>> networking team in Red Hat and he shared an idea that might be
>> interesting because it kind of combines the two above.  That is:
>>
>> - configure a TCP load balancer on the logical router we want FRR to run
>> BGP for
>> - this load balancer will "balance" (DNAT) traffic to a single backend
>> IP which can be any tenant internal IP reserved for FRR to bind to
>> - this IP could be configured on a regular OVN VIF that's further
>> plugged into the namespace/container/VM where FRR runs
>>
>> Frode, would this be a good alternative for your use case?
> 
> Using a loopback address is indeed common in many configurations,
> however as I see it there are two requirements that conflict with
> that. One being the goal of minimizing configuration overhead through
> the use of IPv6 LLA ("BGP Unnumbered ''), which inherently are point
> to point, the other being support for BGP authentication, which is at
> odds with NATing the BGP connection. An indirect blocker for this
> would also be how popular BGP implementations do not allow setting a
> next hop for routes transmitted over a IPv6 LLA.
> 
> See more detail in [0].
> 

Ack.

>> In any case, if we end up having to add a new feature to explicitly
>> handle BGP configuration (the "L2 load balancer" discussed in the
>> meeting), I think my preference would be in the end to have an explicit
>> "bgp" configuration in the northbound database.  We already do that for
>> other features: DHCP, IGMP, multicast routing.
> 
> Ack, thank you for stating your preference here!
> 
>> Finally, I'd like to propose a new technical meeting instance in
>> approximately 5 weeks.  I'm looking at:
>>
>> Monday, July 29th, 3PM UTC.
>>
>> I'll wait a few days before sending out the invite in case people raise
>> objections.
> 
> Works for me!
> 

I went ahead and sent a new invite for:

Date/Time: Monday, July 29h, 3PM UTC
Link: https://meet.google.com/zns-gqsd-jdn
Agenda:
https://docs.google.com/document/d/1dG4GwcYOSs4uArPGtOoaP5tH4KCto-GH_C3tIXSnZZ8/

Best regards,
Dumitru

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] datapath-windows: Fix parsing of split buffers in OvsGetTcpHeader.

2024-06-28 Thread Wenying Dong via dev
Hi Alin, llya,

I have tried the patch, unfortunately, we hit BSOD after patching the
change. So I think it doesn't work.

Best regards,
Wenying


On Fri, Jun 7, 2024 at 8:58 AM Wenying Dong 
wrote:

> Sure, happy to help if we have a patch.
>
> Best,
> Wenying
>
> On Thu, Jun 6, 2024 at 7:19 PM Alin Serdean  wrote:
>
>> Hi Ilya,
>>
>> Thanks for the patch. Will try to get a setup and test it.
>>
>> In theory if we have a new analyze target of a newer visual studio it
>> should flag those types of issues.
>>
>> Will try to send a patch for the aforementioned target.
>>
>> Wenying can you help in testing the patch?
>>
>> Thank you,
>> Alin.
>> --
>> *From:* Ilya Maximets 
>> *Sent:* Thursday, June 6, 2024 12:52 PM
>> *To:* ovs-dev@openvswitch.org 
>> *Cc:* Alin Serdean ; Wenying Dong ;
>> Ilya Maximets 
>> *Subject:* [PATCH] datapath-windows: Fix parsing of split buffers in
>> OvsGetTcpHeader.
>>
>> NdisGetDataBuffer() is called without providing a buffer to copy packet
>> data in case it is not contiguous.  So, it fails in some scenarios
>> where the packet is handled by the general network stack before OVS
>> and headers become split in multiple buffers.
>>
>> Use existing helpers to retrieve the headers instead, they are using
>> OvsGetPacketBytes() which should be able to handle split data.
>>
>> It might be a bit slower than getting direct pointers that may be
>> provided by NdisGetDataBuffer(), but it's better to optimize commonly
>> used OvsGetPacketBytes() helper in the future instead of optimizing
>> every single caller separately.  And we're still copying the TCP
>> header anyway.
>>
>> Fixes: 9726a016d9d6 ("datapath-windows: Implement locking in conntrack
>> NAT.")
>> Reported-at: https://github.com/openvswitch/ovs-issues/issues/323
>> Signed-off-by: Ilya Maximets 
>> ---
>>
>> WARNING: I beleive this code is correct, but I did not test it with real
>>  traffic, I only verified that it compiles.  Should not be applied
>>  unless someone tests it in an actual Windows setup.
>>
>>  datapath-windows/ovsext/Conntrack.c | 45 ++---
>>  1 file changed, 21 insertions(+), 24 deletions(-)
>>
>> diff --git a/datapath-windows/ovsext/Conntrack.c
>> b/datapath-windows/ovsext/Conntrack.c
>> index 39ba5cc10..4649805dd 100644
>> --- a/datapath-windows/ovsext/Conntrack.c
>> +++ b/datapath-windows/ovsext/Conntrack.c
>> @@ -678,46 +678,43 @@ OvsGetTcpHeader(PNET_BUFFER_LIST nbl,
>>  VOID *storage,
>>  UINT32 *tcpPayloadLen)
>>  {
>> -IPHdr *ipHdr;
>> -IPv6Hdr *ipv6Hdr;
>> -TCPHdr *tcp;
>> +IPv6Hdr ipv6HdrStorage;
>> +IPHdr ipHdrStorage;
>> +const IPv6Hdr *ipv6Hdr;
>> +const IPHdr *ipHdr;
>> +const TCPHdr *tcp;
>>  VOID *dest = storage;
>>  uint16_t ipv6ExtLength = 0;
>>
>>  if (layers->isIPv6) {
>> -ipv6Hdr = NdisGetDataBuffer(NET_BUFFER_LIST_FIRST_NB(nbl),
>> -layers->l4Offset + sizeof(TCPHdr),
>> -NULL, 1, 0);
>> +ipv6Hdr = OvsGetPacketBytes(nbl, sizeof *ipv6Hdr,
>> +layers->l3Offset, &ipv6HdrStorage);
>>  if (ipv6Hdr == NULL) {
>>  return NULL;
>>  }
>>
>> -tcp = (TCPHdr *)((PCHAR)ipv6Hdr + layers->l4Offset);
>> -ipv6Hdr = (IPv6Hdr *)((PCHAR)ipv6Hdr + layers->l3Offset);
>> -if (tcp->doff * 4 >= sizeof *tcp) {
>> -NdisMoveMemory(dest, tcp, sizeof(TCPHdr));
>> -ipv6ExtLength = layers->l4Offset - layers->l3Offset -
>> sizeof(IPv6Hdr);
>> -*tcpPayloadLen = (ntohs(ipv6Hdr->payload_len) -
>> ipv6ExtLength - TCP_HDR_LEN(tcp));
>> -return storage;
>> +tcp = OvsGetTcp(nbl, layers->l4Offset, dest);
>> +if (tcp == NULL) {
>> +return NULL;
>>  }
>> +
>> +ipv6ExtLength = layers->l4Offset - layers->l3Offset -
>> sizeof(IPv6Hdr);
>> +*tcpPayloadLen = (ntohs(ipv6Hdr->payload_len) - ipv6ExtLength -
>> TCP_HDR_LEN(tcp));
>>  } else {
>> -ipHdr = NdisGetDataBuffer(NET_BUFFER_LIST_FIRST_NB(nbl),
>> -  layers->l4Offset + sizeof(TCPHdr),
>> -  NULL, 1 /*no align*/, 0);
>> +ipHdr = OvsGetIp(nbl, layers->l3Offset, &ipHdrStorage);
>>  if (ipHdr == NULL) {
>>  return NULL;
>>  }
>>
>> -ipHdr = (IPHdr *)((PCHAR)ipHdr + layers->l3Offset);
>> -tcp = (TCPHdr *)((PCHAR)ipHdr + ipHdr->ihl * 4);
>> -
>> -if (tcp->doff * 4 >= sizeof *tcp) {
>> -NdisMoveMemory(dest, tcp, sizeof(TCPHdr));
>> -*tcpPayloadLen = TCP_DATA_LENGTH(ipHdr, tcp);
>> -return storage;
>> +tcp = OvsGetTcp(nbl, layers->l4Offset, dest);
>> +if (tcp == NULL) {
>> +return NULL;
>>  }
>> +
>> +*tcpPayloadLen = TCP_DATA_LENGTH(ipHdr, tcp);
>>  }
>> -r

Re: [ovs-dev] [PATCH ovn v2 2/2] tests: Skip memory error in DPDK tests.

2024-06-28 Thread David Marchand
On Thu, Jun 27, 2024 at 10:38 AM Ales Musil  wrote:
>
> Add skip for memory error produced by DPDK on ARM with 2MB hugepages
> which isn't harmful [0].
>
> [0] http://mails.dpdk.org/archives/dev/2024-June/296764.html
>
> Suggested-by: David Marchand 
> Signed-off-by: Ales Musil 

I would mention ARM in the commit title (maybe it can be tweaked when
applying), but otherwise the patch lgtm.
Reviewed-by: David Marchand 


-- 
David Marchand

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2 1/2] ci: Save some DPDK compilation time.

2024-06-28 Thread David Marchand
On Thu, Jun 27, 2024 at 10:38 AM Ales Musil  wrote:
>
> Add more options that will shave some compilation time off DPDK by
> skipping unused parts of the code.
>
> Suggested-by: David Marchand 
> Signed-off-by: Ales Musil 

Reviewed-by: David Marchand 


-- 
David Marchand

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] controller: Add random delay during fdb learning.

2024-06-28 Thread Dumitru Ceara
On 6/28/24 11:02, Dumitru Ceara wrote:

[...]

> 
> This needs to be "struct fdb *fdb".  I fixed it up and applied the patch
> to main.
> 

I also added you to the AUTHORS file, Naveen.  I'm not sure why you
weren't there already.  I apologize for that, should be fixed now.

Regards,
Dumitru

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 ovn] Do not reply on unicast arps for IPv4 targets.

2024-06-28 Thread Dumitru Ceara
On 6/28/24 11:02, Dumitru Ceara wrote:
[...]

> 
> And applied the patch to main.
> 

I also added Vasyl to the AUTHORS list.

Regards,
Dumitru

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] RFC OVN: fabric integration

2024-06-28 Thread Ales Musil
On Tue, Jun 25, 2024 at 6:52 PM Frode Nordahl  wrote:

> Hello,
>
> We are increasingly seeing requests for integration between OVN
> powered CMSs/workloads and the fabric.
>
> As a side note, this is a very interesting topic to me personally, and
> I think there are opportunities in the long term for this class of
> software to potentially fill a void for more automated and SDN-like
> ways of managing the physical network, as previously closed physical
> switch hardware is increasingly opening up to programmatic extension
> and control.
>
> While very exciting, it will take a while, both in terms of evolving
> how networking teams are organized, in terms of the longevity of
> networking gear making entity wide refresh cycles very long, not to
> mention gathering agreement and momentum to build such a thing from
> the pieces we have.
>
>
> So to be pragmatic, we need to integrate with something that fabric
> network engineers are comfortable with, and already available on most
> networking hardware, be it closed or open, today.
>
> The most ubiquitous routing protocol, which has prevailed in modern
> layer 3 only data center designs [0], is BGP.
>
> Use cases:
> * Allow fabric to locate and direct traffic to reroutable resources
> such as IPv4/IPv6 prefixes, Floating IPs (FIPs) and Load Balancer
> VIPs.
>
> * Use the fabric as a load balancer, announcing the same service IP on
> multiple hosts (anycast).
>
> * Aggregate announcements from stacked CMSes (i.e. Kubernetes running
> on top of OpenStack).
>
>
> Requirements:
> * Data path must be hardware offloaded, i.e. the next hop address the
> peer resolves for announcements of OVN resources needs to be an LRP
> IP.
>
> * Minimize configuration overhead through the use of IPv6 LLAs for
> peering routing both IPv4 and IPv6 prefixes over a IPv6 BGP session
> [1] (aka. “BGP Unnumbered”).
>
> * Support ECMP out of the host, i.e. use L3 interfaces potentially
> connecting to two different ToRs, instead of bonds, avoiding the
> additional complexity of multi-chassis bonds.
>
> * Support BGP authentication [2][3], i.e. the source, destination
> address and ports in packet headers can not be changed.
>
> * Compatibility
>* Running a BGP protocol suite on the host is becoming a thing in
> its own right, and our users may have requirements of their own that
> influence their choice of implementation. We need to take this into
> account and choose integration methods that allow OVN to work with
> multiple protocol suite implementations.
>
>* While we have the power to change and fix issues in popular
> routing protocol suites, such as FRR, we need to be able to integrate
> with versions that exist on networking hardware out there today.
>
> Limitations that influence/dictate implementation choices:
> * Peering with IPv6 LLAs to meet the configuration overhead
> requirement makes the peering relationship point to point.
>
> * Popular BGP implementations, such as FRR which is used as routing
> protocol suite by many ToR open source NOSes, does not accept
> sending/receiving IPv6 LLA next hop with the route, so the BGP peer
> address will be used as next hop. (There are even mentions of 3rd
> party nexthop currently not being supported, but not sure if that is
> accurate [4]).
>
> * As mentioned above, BGP authentication requires IP headers to be
> unchanged for the BGP TCP packets going to/from the BGP speaker.
>
>
> Proposed implementation:
>
> We are in the process of preparing some RFC/PoC patches that at a high
> level will:
> * Manage a VRF in the system serving two purposes:
>* Leaking of route information from ovn-controller to the VRF
> routing table, which a routing protocol suite can redistribute subject
> to configuration.
>
>* Provide an IP endpoint that a VRF aware application, such as FRR,
> can bind to serving as a BGP speaker on behalf of a OVN LRP IP.
>
> * We will attach a OVN VIF to this VRF that has data path rules that:
>
>* Forward required traffic destined to the OVN LRP IP to the VRF.
>
>* Forward required traffic from the application bound to the VRF as
> if it originated from the OVN LRP IP.
>
>
> Hopefully we'll have something up on the list before the end of this
> week, which makes it real and easier to reason about for further
> discussion.
>
>
> Prior art:
>
> We recognize that there already exists a third party approach to this
> in the ovn-bgp-agent [5] governed by OpenStack, and our goal with this
> work is to provide a tighter integration that might cater generically
> for other CMSes and use cases.
>
>
> 0: https://datatracker.ietf.org/doc/html/rfc7938
> 1: https://datatracker.ietf.org/doc/html/rfc5549
> 2: https://datatracker.ietf.org/doc/html/rfc2385
> 3: https://datatracker.ietf.org/doc/html/rfc5925
> 4:
> https://github.com/FRRouting/frr/blob/cc3519f3e6eaa06f762e0d447202df32df66e129/bgpd/bgp_route.c#L2719
> 5: https://docs.openstack.org/ovn-bgp-agent/latest/
>
>
Hi Frode,

looking forward to the RFC. AFAIU 

Re: [ovs-dev] [PATCH ovn v6 0/2] Add support to disable VXLAN mode.

2024-06-28 Thread Dumitru Ceara
On 6/7/24 15:54, Vladislav Odintsov wrote:
> v6:
>   - Addressed Mark's review comments:
> 1. Removed global variable "vxlan_mode" change from "global" engine node.
> 2. Configuration knob "disable_vxlan_mode" was renamed to "vxlan_mode"
> v5:
>   - Addressed Ihar's review comments:
> 1. fixed errors after incorrect conflicts solving on rebase;
> 2. changed VXLAN mode naming to capitalized;
> 3. clarified VXLAN mode in ovn-architecture man page.
> v4:
>   - Addressed Dumitru's and Ihar's review comments;
>   - single patch was split into two:
> 1. function call replaced with a global variable `vxlan_mode`;
> 2. introduced `disable_vxlan_mode` configuration knob;
>   - rebased onto latest main branch.
> v3:
>   - Removed accidental ovs submodule change.
> v2:
>   - Added NEWS item.
> 
> Vladislav Odintsov (2):
>   northd: Make `vxlan_mode` a global variable.
>   northd: Add support for disabling vxlan mode.
> 

Thanks, Vladislav, Ihar and Ales!

I applied this series to main.

Best regards,
Dumitru

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] utilties: Allow ovn-detrace to run on ovs-ofctl dump-flows output.

2024-06-28 Thread Dumitru Ceara
On 6/21/24 16:29, Dumitru Ceara wrote:
> On 6/3/24 10:23, Ales Musil wrote:
>> The ovs-ofctl dump-flows output is slightly different from oproto/trace
>> cookie 0xXXX vs. cookie=0xXXX. Update the regex that it also matches
>> on the equals case. This allows us to run ovn-detrace against the
>> ovs-ofctl dump-flows output.
>>
>> Also provide simple, partially hardcoded test case for ovn-detrace.
>>
>> Signed-off-by: Ales Musil 
>> ---
> 
> Looks good to me, thanks!
> 
> Acked-by: Dumitru Ceara 
> 

I went ahead and applied this to main.  Thanks, Ales!

Regards,
Dumitru

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 ovn] Do not reply on unicast arps for IPv4 targets.

2024-06-28 Thread Dumitru Ceara
On 6/27/24 16:18, Numan Siddique wrote:
> On Fri, May 24, 2024 at 4:00 AM Vasyl Saienko  wrote:
>>
>> Reply only if target ethernet address is broadcast, if
>> address is specified explicitly do noting to let target
>> reply by itself. This technique allows to monitor target
>> aliveness with arping.
>>
>> Closes  #239
>>
>> Signed-off-by: Vasyl Saienko 
> 
> Sorry for the late reviews.
> 
> Acked-by: Numan Siddique 
> 
> Numan
> 
>> ---
>>  northd/northd.c | 11 +--
>>  northd/ovn-northd.8.xml |  7 ---
>>  tests/ovn-northd.at |  4 ++--
>>  3 files changed, 15 insertions(+), 7 deletions(-)
>>
>> diff --git a/northd/northd.c b/northd/northd.c
>> index 37f443e70..e80e1885d 100644
>> --- a/northd/northd.c
>> +++ b/northd/northd.c
>> @@ -8832,8 +8832,15 @@ build_lswitch_arp_nd_responder_known_ips(struct 
>> ovn_port *op,
>>  for (size_t i = 0; i < op->n_lsp_addrs; i++) {
>>  for (size_t j = 0; j < op->lsp_addrs[i].n_ipv4_addrs; j++) {
>>  ds_clear(match);
>> -ds_put_format(match, "arp.tpa == %s && arp.op == 1",
>> -op->lsp_addrs[i].ipv4_addrs[j].addr_s);
>> +/* NOTE(vsaienko): Do not reply on unicast ARPs, forward
>> + * them to the target to have ability to monitor target
>> + * aliveness via ARPs.
>> +*/

Thanks, Vasyl and Numan!  I changed this to:

/* Do not reply on unicast ARPs, forward them to the target
 * to have ability to monitor target liveness via unicast
 * ARP requests.
 */

And applied the patch to main.

Best regards,
Dumitru

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] controller: Add random delay during fdb learning.

2024-06-28 Thread Dumitru Ceara
On 6/28/24 09:20, Ales Musil wrote:
> On Wed, Jun 26, 2024 at 8:20 AM Naveen Yerramneni <
> naveen.yerramn...@nutanix.com> wrote:
> 
>> This change reduces the probability of conflicts when
>> multiple nodes tries to add the same FDB entry to SB at the
>> same time. When conflict occurs, OVN controller does full
>> recompute which is heavy weight on the scale setup.
>>
>> Signed-off-by: Naveen Yerramneni 
>> Suggested-by: Dumitru Ceara 
>> ---
>>  controller/mac-cache.c  |  4 +++-
>>  controller/mac-cache.h  |  4 +++-
>>  controller/ovn-controller.c |  2 +-
>>  controller/pinctrl.c| 28 
>>  tests/ovn.at|  6 +++---
>>  5 files changed, 30 insertions(+), 14 deletions(-)
>>
>> diff --git a/controller/mac-cache.c b/controller/mac-cache.c
>> index d8c4e2aed..de45d7a6a 100644
>> --- a/controller/mac-cache.c
>> +++ b/controller/mac-cache.c
>> @@ -245,7 +245,8 @@ mac_binding_lookup(struct ovsdb_idl_index
>> *sbrec_mac_binding_by_lport_ip,
>>
>>  /* FDB. */
>>  struct fdb *
>> -fdb_add(struct hmap *map, struct fdb_data fdb_data) {
>> +fdb_add(struct hmap *map, struct fdb_data fdb_data, long long timestamp)
>> +{
>>  struct fdb *fdb = fdb_find(map, &fdb_data);
>>
>>  if (!fdb) {
>> @@ -255,6 +256,7 @@ fdb_add(struct hmap *map, struct fdb_data fdb_data) {
>>  }
>>
>>  fdb->data = fdb_data;
>> +fdb->timestamp = timestamp;
>>
>>  return fdb;
>>  }
>> diff --git a/controller/mac-cache.h b/controller/mac-cache.h
>> index 3c78f9440..b94e7a1cf 100644
>> --- a/controller/mac-cache.h
>> +++ b/controller/mac-cache.h
>> @@ -83,6 +83,7 @@ struct fdb {
>>  struct fdb_data data;
>>  /* Reference to the SB FDB record. */
>>  const struct sbrec_fdb *sbrec_fdb;
>> +long long timestamp;
>>  };
>>
>>  struct bp_packet_data {
>> @@ -149,7 +150,8 @@ mac_binding_lookup(struct ovsdb_idl_index
>> *sbrec_mac_binding_by_lport_ip,
>> const char *logical_port, const char *ip);
>>
>>  /* FDB. */
>> -struct fdb *fdb_add(struct hmap *map, struct fdb_data fdb_data);
>> +struct fdb *fdb_add(struct hmap *map, struct fdb_data fdb_data,
>> +long long timestamp);
>>
>>  void fdb_remove(struct hmap *map, struct fdb *fdb);
>>
>> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
>> index 6874f99a3..e99bdb3ef 100644
>> --- a/controller/ovn-controller.c
>> +++ b/controller/ovn-controller.c
>> @@ -3322,7 +3322,7 @@ fdb_add_sb(struct mac_cache_data *data, const struct
>> sbrec_fdb *sfdb)
>>  return;
>>  }
>>
>> -struct fdb *fdb = fdb_add(&data->fdbs, fdb_data);
>> +struct fdb *fdb = fdb_add(&data->fdbs, fdb_data, 0);
>>
>>  fdb->sbrec_fdb = sfdb;
>>  }
>> diff --git a/controller/pinctrl.c b/controller/pinctrl.c
>> index f2e382a44..e3ded793b 100644
>> --- a/controller/pinctrl.c
>> +++ b/controller/pinctrl.c
>> @@ -4739,6 +4739,7 @@ pinctrl_destroy(void)
>>  /* Buffered "put_mac_binding" operation. */
>>
>>  #define MAX_MAC_BINDING_DELAY_MSEC 50
>> +#define MAX_FDB_DELAY_MSEC 50
>>  #define MAX_MAC_BINDINGS   1000
>>
>>  /* Contains "struct mac_binding"s. */
>> @@ -8961,21 +8962,30 @@ run_put_fdbs(struct ovsdb_idl_txn *ovnsb_idl_txn,
>>  return;
>>  }
>>
>> +long long now = time_msec();
>>  const struct fdb *fdb;

This needs to be "struct fdb *fdb".  I fixed it up and applied the patch
to main.

Thanks, Naveen and Ales!

Regards,
Dumitru

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2] Userspace: Software fallback for UDP encapsulated TCP segmentation.

2024-06-28 Thread David Marchand
On Thu, Jun 27, 2024 at 4:04 PM David Marchand
 wrote:
> @@ -926,6 +926,17 @@ netdev_send(struct netdev *netdev, int qid,
> struct dp_packet_batch *batch,
>  return netdev_send_tso(netdev, qid, batch, 
> concurrent_txq);
>  }
>  }
> +} else if (!(netdev_flags & NETDEV_TX_OFFLOAD_OUTER_UDP_CKSUM)) {
> +DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +if (!dp_packet_hwol_is_tso(packet) &&
> +!dp_packet_hwol_is_tunnel_vxlan(packet) &&
> +!dp_packet_hwol_is_tunnel_geneve(packet)) {
> +continue;

Erm, less buggy:

+if (!dp_packet_hwol_is_tso(packet) ||
+(!dp_packet_hwol_is_tunnel_vxlan(packet) &&
+ !dp_packet_hwol_is_tunnel_geneve(packet))) {
+continue;
+}

> +}
> +if (dp_packet_hwol_is_outer_udp_cksum(packet)) {
> +return netdev_send_tso(netdev, qid, batch, 
> concurrent_txq);
> +}
> +}
>  }
>  }


-- 
David Marchand

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v6 2/2] northd: Add support for disabling vxlan mode.

2024-06-28 Thread Ales Musil
On Fri, Jun 7, 2024 at 3:54 PM Vladislav Odintsov  wrote:

> Commit [1] introduced a "VXLAN mode" concept.  It brought a limitation
> for available tunnel IDs because of lack of space in VXLAN VNI.
> In VXLAN mode OVN is limited by 4095 datapaths (LRs or non-transit LSs)
> and 2047 logical ports per datapath.
>
> Prior to this patch VXLAN mode was enabled automatically if at least one
> chassis had encap of VXLAN type.  In scenarios where one want to use
> VXLAN only for HW VTEP (RAMP) switch, such limitation makes no sence.
>
> This patch adds support for explicit disabling of VXLAN mode via
> Northbound database.
>
> 1: https://github.com/ovn-org/ovn/commit/b07f1bc3d068
>
> Acked-By: Ihar Hrachyshka 
> Fixes: b07f1bc3d068 ("Add VXLAN support for non-VTEP datapath bindings")
> Signed-off-by: Vladislav Odintsov 
> ---
>  NEWS  |  4 
>  northd/en-global-config.c |  8 +++-
>  northd/northd.c   | 10 --
>  northd/northd.h   |  3 ++-
>  ovn-architecture.7.xml|  6 ++
>  ovn-nb.xml| 10 ++
>  tests/ovn-northd.at   | 29 +
>  7 files changed, 66 insertions(+), 4 deletions(-)
>
> diff --git a/NEWS b/NEWS
> index 3bdc55172..aa1669d9c 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -31,6 +31,10 @@ Post v24.03.0
>  has been renamed to "options:ic-route-denylist" in order to comply
> with
>  inclusive language guidelines. The previous name is still recognized
> to
>  aid with backwards compatibility.
> +  - Added new global config option NB_Global:options:vxlan_mode to support
> +ability to disable "VXLAN mode" to extend available tunnel IDs space
> for
> +datapaths from 4095 to 16711680.  For more details see man ovn-nb(5)
> for
> +mentioned option.
>
>  OVN v24.03.0 - 01 Mar 2024
>  --
> diff --git a/northd/en-global-config.c b/northd/en-global-config.c
> index df0f8e58c..784538a14 100644
> --- a/northd/en-global-config.c
> +++ b/northd/en-global-config.c
> @@ -117,7 +117,8 @@ en_global_config_run(struct engine_node *node , void
> *data)
>
>  char *max_tunid = xasprintf("%d",
>  get_ovn_max_dp_key_local(
> -is_vxlan_mode(sbrec_chassis_table)));
> +is_vxlan_mode(&nb->options,
> +  sbrec_chassis_table)));
>  smap_replace(options, "max_tunid", max_tunid);
>  free(max_tunid);
>
> @@ -534,6 +535,11 @@ check_nb_options_out_of_sync(const struct
> nbrec_nb_global *nb,
>  return true;
>  }
>
> +if (config_out_of_sync(&nb->options, &config_data->nb_options,
> +   "vxlan_mode", false)) {
> +return true;
> +}
> +
>  return false;
>  }
>
> diff --git a/northd/northd.c b/northd/northd.c
> index 6d118a19a..a4937b472 100644
> --- a/northd/northd.c
> +++ b/northd/northd.c
> @@ -886,8 +886,13 @@ join_datapaths(const struct
> nbrec_logical_switch_table *nbrec_ls_table,
>  }
>
>  bool
> -is_vxlan_mode(const struct sbrec_chassis_table *sbrec_chassis_table)
> +is_vxlan_mode(const struct smap *nb_options,
> +  const struct sbrec_chassis_table *sbrec_chassis_table)
>  {
> +if (!smap_get_bool(nb_options, "vxlan_mode", true)) {
> +return false;
> +}
> +
>  const struct sbrec_chassis *chassis;
>  SBREC_CHASSIS_TABLE_FOR_EACH (chassis, sbrec_chassis_table) {
>  for (int i = 0; i < chassis->n_encaps; i++) {
> @@ -17605,7 +17610,8 @@ ovnnb_db_run(struct northd_input *input_data,
>  use_common_zone = smap_get_bool(input_data->nb_options,
> "use_common_zone",
>  false);
>
> -vxlan_mode = is_vxlan_mode(input_data->sbrec_chassis_table);
> +vxlan_mode = is_vxlan_mode(input_data->nb_options,
> +   input_data->sbrec_chassis_table);
>
>  build_datapaths(ovnsb_txn,
>  input_data->nbrec_logical_switch_table,
> diff --git a/northd/northd.h b/northd/northd.h
> index 987f82954..2f2fdb673 100644
> --- a/northd/northd.h
> +++ b/northd/northd.h
> @@ -790,7 +790,8 @@ lr_has_multiple_gw_ports(const struct ovn_datapath *od)
>  }
>
>  bool
> -is_vxlan_mode(const struct sbrec_chassis_table *sbrec_chassis_table);
> +is_vxlan_mode(const struct smap *nb_options,
> +  const struct sbrec_chassis_table *sbrec_chassis_table);
>
>  uint32_t get_ovn_max_dp_key_local(bool _vxlan_mode);
>
> diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
> index e32d1a9f7..640944faf 100644
> --- a/ovn-architecture.7.xml
> +++ b/ovn-architecture.7.xml
> @@ -2920,4 +2920,10 @@
>  the future, gateways that do not support encapsulations with large
> amounts
>  of metadata may continue to have a reduced feature set.
>
> +  
> +VXLAN mode is recommended to be disabled if VXLAN encap
> at
> +hypervisors is needed only to support HW VTEP L2 Gateway
> f

Re: [ovs-dev] [PATCH ovn v6 1/2] northd: Make `vxlan_mode` a global variable.

2024-06-28 Thread Ales Musil
On Fri, Jun 7, 2024 at 3:54 PM Vladislav Odintsov  wrote:

> This simplifies code and subsequent commit to explicitely disable VXLAN
> mode is based on these changes.
>
> Also "VXLAN mode" term is introduced in ovn-architecture man page.
>
> Signed-off-by: Vladislav Odintsov 
> ---
>  northd/en-global-config.c |  3 +-
>  northd/northd.c   | 76 ---
>  northd/northd.h   |  5 ++-
>  ovn-architecture.7.xml| 10 +++---
>  4 files changed, 41 insertions(+), 53 deletions(-)
>
> diff --git a/northd/en-global-config.c b/northd/en-global-config.c
> index 28c78a12c..df0f8e58c 100644
> --- a/northd/en-global-config.c
> +++ b/northd/en-global-config.c
> @@ -116,7 +116,8 @@ en_global_config_run(struct engine_node *node , void
> *data)
>  }
>
>  char *max_tunid = xasprintf("%d",
> -get_ovn_max_dp_key_local(sbrec_chassis_table));
> +get_ovn_max_dp_key_local(
> +is_vxlan_mode(sbrec_chassis_table)));
>  smap_replace(options, "max_tunid", max_tunid);
>  free(max_tunid);
>
> diff --git a/northd/northd.c b/northd/northd.c
> index 9f81afccb..6d118a19a 100644
> --- a/northd/northd.c
> +++ b/northd/northd.c
> @@ -90,6 +90,10 @@ static bool use_ct_inv_match = true;
>   */
>  static bool default_acl_drop;
>
> +/* If this option is 'true' northd will use limited 24-bit space for
> datapath
> + * and ports tunnel key allocation (12 bits for each instead of default
> 16). */
> +static bool vxlan_mode;
> +
>  #define MAX_OVN_TAGS 4096
>
>
> @@ -881,7 +885,7 @@ join_datapaths(const struct nbrec_logical_switch_table
> *nbrec_ls_table,
>  }
>  }
>
> -static bool
> +bool
>  is_vxlan_mode(const struct sbrec_chassis_table *sbrec_chassis_table)
>  {
>  const struct sbrec_chassis *chassis;
> @@ -896,25 +900,22 @@ is_vxlan_mode(const struct sbrec_chassis_table
> *sbrec_chassis_table)
>  }
>
>  uint32_t
> -get_ovn_max_dp_key_local(const struct sbrec_chassis_table
> *sbrec_chassis_table)
> +get_ovn_max_dp_key_local(bool _vxlan_mode)
>  {
> -if (is_vxlan_mode(sbrec_chassis_table)) {
> -/* OVN_MAX_DP_GLOBAL_NUM doesn't apply for vxlan mode. */
> +if (_vxlan_mode) {
> +/* OVN_MAX_DP_GLOBAL_NUM doesn't apply for VXLAN mode. */
>  return OVN_MAX_DP_VXLAN_KEY;
>  }
>  return OVN_MAX_DP_KEY - OVN_MAX_DP_GLOBAL_NUM;
>  }
>
>  static void
> -ovn_datapath_allocate_key(const struct sbrec_chassis_table
> *sbrec_ch_table,
> -  struct hmap *datapaths, struct hmap *dp_tnlids,
> +ovn_datapath_allocate_key(struct hmap *datapaths, struct hmap *dp_tnlids,
>struct ovn_datapath *od, uint32_t *hint)
>  {
>  if (!od->tunnel_key) {
>  od->tunnel_key = ovn_allocate_tnlid(dp_tnlids, "datapath",
> -OVN_MIN_DP_KEY_LOCAL,
> -
> get_ovn_max_dp_key_local(sbrec_ch_table),
> -hint);
> +OVN_MIN_DP_KEY_LOCAL, get_ovn_max_dp_key_local(vxlan_mode),
> hint);
>  if (!od->tunnel_key) {
>  if (od->sb) {
>  sbrec_datapath_binding_delete(od->sb);
> @@ -927,7 +928,6 @@ ovn_datapath_allocate_key(const struct
> sbrec_chassis_table *sbrec_ch_table,
>
>  static void
>  ovn_datapath_assign_requested_tnl_id(
> -const struct sbrec_chassis_table *sbrec_chassis_table,
>  struct hmap *dp_tnlids, struct ovn_datapath *od)
>  {
>  const struct smap *other_config = (od->nbs
> @@ -936,8 +936,7 @@ ovn_datapath_assign_requested_tnl_id(
>  uint32_t tunnel_key = smap_get_int(other_config, "requested-tnl-key",
> 0);
>  if (tunnel_key) {
>  const char *interconn_ts = smap_get(other_config, "interconn-ts");
> -if (!interconn_ts && is_vxlan_mode(sbrec_chassis_table) &&
> -tunnel_key >= 1 << 12) {
> +if (!interconn_ts && vxlan_mode && tunnel_key >= 1 << 12) {
>  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
>  VLOG_WARN_RL(&rl, "Tunnel key %"PRIu32" for datapath %s is "
>   "incompatible with VXLAN", tunnel_key,
> @@ -985,7 +984,6 @@ build_datapaths(struct ovsdb_idl_txn *ovnsb_txn,
>  const struct nbrec_logical_switch_table *nbrec_ls_table,
>  const struct nbrec_logical_router_table *nbrec_lr_table,
>  const struct sbrec_datapath_binding_table *sbrec_dp_table,
> -const struct sbrec_chassis_table *sbrec_chassis_table,
>  struct ovn_datapaths *ls_datapaths,
>  struct ovn_datapaths *lr_datapaths,
>  struct ovs_list *lr_list)
> @@ -1000,12 +998,11 @@ build_datapaths(struct ovsdb_idl_txn *ovnsb_txn,
>  struct hmap dp_tnlids = HMAP_INITIALIZER(&dp_tnlids);
>  struct ovn_datapath *od;
>  LIST_FOR_EACH (od, list, &both) {
> -ovn_datapath_assign_requested_tnl_id(sbrec_chassis_table,
> &dp_tnlids,
> -  

Re: [ovs-dev] [PATCH ovn] controller: Add random delay during fdb learning.

2024-06-28 Thread Ales Musil
On Wed, Jun 26, 2024 at 8:20 AM Naveen Yerramneni <
naveen.yerramn...@nutanix.com> wrote:

> This change reduces the probability of conflicts when
> multiple nodes tries to add the same FDB entry to SB at the
> same time. When conflict occurs, OVN controller does full
> recompute which is heavy weight on the scale setup.
>
> Signed-off-by: Naveen Yerramneni 
> Suggested-by: Dumitru Ceara 
> ---
>  controller/mac-cache.c  |  4 +++-
>  controller/mac-cache.h  |  4 +++-
>  controller/ovn-controller.c |  2 +-
>  controller/pinctrl.c| 28 
>  tests/ovn.at|  6 +++---
>  5 files changed, 30 insertions(+), 14 deletions(-)
>
> diff --git a/controller/mac-cache.c b/controller/mac-cache.c
> index d8c4e2aed..de45d7a6a 100644
> --- a/controller/mac-cache.c
> +++ b/controller/mac-cache.c
> @@ -245,7 +245,8 @@ mac_binding_lookup(struct ovsdb_idl_index
> *sbrec_mac_binding_by_lport_ip,
>
>  /* FDB. */
>  struct fdb *
> -fdb_add(struct hmap *map, struct fdb_data fdb_data) {
> +fdb_add(struct hmap *map, struct fdb_data fdb_data, long long timestamp)
> +{
>  struct fdb *fdb = fdb_find(map, &fdb_data);
>
>  if (!fdb) {
> @@ -255,6 +256,7 @@ fdb_add(struct hmap *map, struct fdb_data fdb_data) {
>  }
>
>  fdb->data = fdb_data;
> +fdb->timestamp = timestamp;
>
>  return fdb;
>  }
> diff --git a/controller/mac-cache.h b/controller/mac-cache.h
> index 3c78f9440..b94e7a1cf 100644
> --- a/controller/mac-cache.h
> +++ b/controller/mac-cache.h
> @@ -83,6 +83,7 @@ struct fdb {
>  struct fdb_data data;
>  /* Reference to the SB FDB record. */
>  const struct sbrec_fdb *sbrec_fdb;
> +long long timestamp;
>  };
>
>  struct bp_packet_data {
> @@ -149,7 +150,8 @@ mac_binding_lookup(struct ovsdb_idl_index
> *sbrec_mac_binding_by_lport_ip,
> const char *logical_port, const char *ip);
>
>  /* FDB. */
> -struct fdb *fdb_add(struct hmap *map, struct fdb_data fdb_data);
> +struct fdb *fdb_add(struct hmap *map, struct fdb_data fdb_data,
> +long long timestamp);
>
>  void fdb_remove(struct hmap *map, struct fdb *fdb);
>
> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> index 6874f99a3..e99bdb3ef 100644
> --- a/controller/ovn-controller.c
> +++ b/controller/ovn-controller.c
> @@ -3322,7 +3322,7 @@ fdb_add_sb(struct mac_cache_data *data, const struct
> sbrec_fdb *sfdb)
>  return;
>  }
>
> -struct fdb *fdb = fdb_add(&data->fdbs, fdb_data);
> +struct fdb *fdb = fdb_add(&data->fdbs, fdb_data, 0);
>
>  fdb->sbrec_fdb = sfdb;
>  }
> diff --git a/controller/pinctrl.c b/controller/pinctrl.c
> index f2e382a44..e3ded793b 100644
> --- a/controller/pinctrl.c
> +++ b/controller/pinctrl.c
> @@ -4739,6 +4739,7 @@ pinctrl_destroy(void)
>  /* Buffered "put_mac_binding" operation. */
>
>  #define MAX_MAC_BINDING_DELAY_MSEC 50
> +#define MAX_FDB_DELAY_MSEC 50
>  #define MAX_MAC_BINDINGS   1000
>
>  /* Contains "struct mac_binding"s. */
> @@ -8961,21 +8962,30 @@ run_put_fdbs(struct ovsdb_idl_txn *ovnsb_idl_txn,
>  return;
>  }
>
> +long long now = time_msec();
>  const struct fdb *fdb;
> -HMAP_FOR_EACH (fdb, hmap_node, &put_fdbs) {
> -run_put_fdb(ovnsb_idl_txn, sbrec_fdb_by_dp_key_mac,
> -sbrec_port_binding_by_key,
> -sbrec_datapath_binding_by_key, fdb);
> +HMAP_FOR_EACH_SAFE (fdb, hmap_node, &put_fdbs) {
> +if (now >= fdb->timestamp) {
> +run_put_fdb(ovnsb_idl_txn, sbrec_fdb_by_dp_key_mac,
> +sbrec_port_binding_by_key,
> +sbrec_datapath_binding_by_key, fdb);
> +fdb_remove(&put_fdbs, fdb);
> +}
>  }
> -fdbs_clear(&put_fdbs);
>  }
>
>
>  static void
>  wait_put_fdbs(struct ovsdb_idl_txn *ovnsb_idl_txn)
> +OVS_REQUIRES(pinctrl_mutex)
>  {
> -if (ovnsb_idl_txn && !hmap_is_empty(&put_fdbs)) {
> -poll_immediate_wake();
> +if (!ovnsb_idl_txn) {
> +return;
> +}
> +
> +struct fdb *fdb;
> +HMAP_FOR_EACH (fdb, hmap_node, &put_fdbs) {
> +poll_timer_wait_until(fdb->timestamp);
>  }
>  }
>
> @@ -8995,6 +9005,8 @@ pinctrl_handle_put_fdb(const struct flow *md, const
> struct flow *headers)
>  .mac = headers->dl_src,
>  };
>
> -fdb_add(&put_fdbs, fdb_data);
> +uint32_t delay = random_range(MAX_FDB_DELAY_MSEC) + 1;
> +long long timestamp = time_msec() + delay;
> +fdb_add(&put_fdbs, fdb_data, timestamp);
>  notify_pinctrl_main();
>  }
> diff --git a/tests/ovn.at b/tests/ovn.at
> index af0eaeed3..7b199d3f5 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -34382,21 +34382,21 @@ AT_CHECK([ovn-nbctl --wait=hv sync])
>  # Learning is disabled, the table should be empty
>  send_packet 20
>  AT_CHECK([ovn-nbctl --wait=hv sync])
> -AT_CHECK([test $(ovn-sbctl list fdb | grep -c "00:00:00:00:10:20") = 0])
> +OVS_WAIT_UNTIL([test $(ovn-sbctl lis

Re: [ovs-dev] [PATCH] ovsdb: raft: Don't forward more than one command to the leader.

2024-06-28 Thread Dumitru Ceara
On 6/27/24 00:02, Ilya Maximets wrote:
> Every transaction has RAFT log prerequisites.  Even if transactions
> are not related (because RAFT doesn't actually know what data it is
> handling).  When leader writes a new record to a RAFT storage, it is
> getting appended to the log right away and changes current 'eid',
> i.e., changes prerequisites.  The leader will not try to write new
> records until the current one is committed, because until then the
> pre-check will be failing.
> 
> However, that is different for the follower.  Followers do not add
> records to the RAFT log until the leader sends an append request back.
> So, if there are multiple transactions pending on a follower, it will
> create a command for each of them and prerequisites will be set to the
> same values.  All these commands will be sent to the leader, but only
> one can succeed at a time, because accepting one command immediately
> changes prerequisites and all other commands become non-applicable.
> So, out of N commands, 1 will succeed and N - 1 will fail.  The cluster
> failure is a transient failure, so the follower will re-process all the
> failed transactions and send them again.  1 will succeed and N - 2 will
> fail.  And so on, until there are no more transactions.  In the end,
> instead of processing N transactions, the follower is performing
> N * (N - 1) / 2 transaction processing iterations.  That is consuming
> a huge amount of CPU resources completely unnecessarily.
> 
> Since there is no real chance for multiple transactions from the same
> follower to succeed, it's better to not send them in the first place.
> This also eliminates prerequisite mismatch messages on a leader in
> this particular case.
> 

Makes sense!

> In a test with 30 parallel shell threads executing 12K transactions
> total with separate ovsdb-client calls through the same follower there
> is about 60% performance improvement.  The test takes ~100 seconds to
> complete without this change and ~40 seconds with this change applied.
> The new time is very close to what it takes to execute the same test
> through the cluster leader.
> 

Maybe a public link to the test (if possible) would be a nice reference
to have in the future?

> Note: prerequisite failures on a leader are still possible, but mostly
> in a case of simultaneous transactions from different followers.  It's
> a normal thing for a distributed database due to its nature.
> 
> Signed-off-by: Ilya Maximets 
> ---
>  ovsdb/raft.c| 45 -
>  ovsdb/raft.h|  2 +-
>  ovsdb/storage.c |  9 +
>  ovsdb/storage.h |  5 -
>  ovsdb/transaction.c |  6 +-
>  5 files changed, 55 insertions(+), 12 deletions(-)
> 
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index ac3d37ac4..116924058 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -2307,12 +2307,55 @@ raft_get_eid(const struct raft *raft, uint64_t index)
>  return &raft->snap.eid;
>  }
>  
> -const struct uuid *
> +static const struct uuid *
>  raft_current_eid(const struct raft *raft)
>  {
>  return raft_get_eid(raft, raft->log_end - 1);
>  }
>  
> +bool
> +raft_precheck_prereq(const struct raft *raft, const struct uuid *prereq)
> +{
> +if (!uuid_equals(raft_current_eid(raft), prereq)) {
> +VLOG_DBG("%s: prerequisites (" UUID_FMT ") "
> + "do not match current eid (" UUID_FMT ")",
> + __func__, UUID_ARGS(prereq),
> + UUID_ARGS(raft_current_eid(raft)));
> +return false;
> +}
> +
> +/* Having incomplete commands on a follower means that the leader has
> + * these commands and they will change the prerequisites once added to
> + * the leader's log.
> + *
> + * There is a chance that all these commands will actually fail and the
> + * record with current prerequisites will in fact succeed, but, since
> + * these are our own commands, the chances are low.
> + *
> + * Incomplete commands on a leader will not change the leader's current
> + * 'eid' on commit as they are already part of the leader's log. */
> +if (raft->role != RAFT_LEADER && hmap_count(&raft->commands)) {
> +struct raft_command *cmd;
> +
> +HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {

Nit: I'd rewrite this as:

if (raft->role == RAFT_LEADER) {
return true;
}

HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
...
}

> +/* Skip commands that are already part of the log (have non-zero
> + * index) and ones that do not carry any data (have zero 'eid'),
> + * as they can't change prerequisites.
> + *
> + * Database will not re-run triggers unless the data changes or
> + * one of the data-carrying triggers completes.  So, pre-check 
> must
> + * not fail if there are no outstanding data-carrying commands. 
> */
> +if (!cmd->index && !uuid_is_zero(&cmd->eid)) {
> +