Re: [ovs-dev] [PATCH v1] dpif-netdev: Update meter timestamp when dp_netdev_run_meter been called.

2023-06-21 Thread Ilya Maximets
On 6/21/23 18:07, Ilya Maximets wrote:
> On 5/31/23 17:41, mit...@outlook.com wrote:
>> From: Lin Huang 
>>
>> Currently, a meter's timestamp 'now' is set from 'pmd->ctx.now', which is
>> updated by pmd_thread_ctx_time_update().
>>
>> That update happens before the processing of a new packet batch, in:
>> - dpif_netdev_execute()
>> - dp_netdev_process_rxq_port()
>>
>> There is a problem when a user wants to police the rate down to a low pps
>> with a meter.  For example, when the traffic is more than 3 Mpps and the
>> meter polices it to 1 Mpps, the real rate will be 3 Mpps, not 1 Mpps.
>>
>> The key point is that a meter's timestamp isn't updated in real time.
>> For example, PMD threads A and B poll at the same time (t1).
>> Thread A starts to run the meter to police traffic.  At the same time,
>> thread B blocks on 'ovs_mutex_lock(&meter->lock)', waiting for thread A
>> to release the meter lock.  This can take a long time, especially when
>> there are more than 10 PMD threads.
> 
> Shouldn't the infrequent time update cause more drops instead?

Also, having 10+ threads waiting on the same lock doesn't sound like
a particularly efficient setup.  If the locking here is indeed a problem,
maybe you can try the following patch in your scenario:

 
https://patchwork.ozlabs.org/project/openvswitch/patch/20230621223220.2073152-1-i.maxim...@ovn.org/

> 
> Best regards, Ilya Maximets.
> 
>>
>> After thread A releases the meter lock, thread B starts to compute the
>> time diff (the arithmetic is spelled out in the sketch below):
>> - long_delta_t = now - meter->used --> long_delta_t = t1 - t1 = 0.
>> - band->bucket += (uint64_t) delta_t * band->rate --> band->bucket += 0.
>> - band_exceeded_pkt = band->bucket / 1000; --> band_exceeded_pkt = 0.
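
To spell out that arithmetic, a minimal sketch using the names from the
quoted diff below (illustrative only; both threads saw the same cached t1):

    long_delta_t = now - meter->used;                /* t1 - t1 == 0 */
    delta_t = (uint32_t) long_delta_t;               /* 0 */
    band->bucket += (uint64_t) delta_t * band->rate; /* refills nothing */
    band_exceeded_pkt = band->bucket / 1000;         /* leftover only */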
>>
>> Fix this problem by updating the meter timestamp on every call.
>>
>> Tested-by: Zhang Yuhuang
>> Signed-off-by: Lin Huang 
>> ---
>>  lib/dpif-netdev.c | 12 
>>  1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 70b953ae6..dbb275cf8 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -7169,12 +7169,13 @@ dpif_netdev_meter_get_features(const struct dpif *dpif OVS_UNUSED,
>>   * that exceed a band are dropped in-place. */
>>  static void
>>  dp_netdev_run_meter(struct dp_netdev *dp, struct dp_packet_batch *packets_,
>> -uint32_t meter_id, long long int now)
>> +uint32_t meter_id)
>>  {
>>  struct dp_meter *meter;
>>  struct dp_meter_band *band;
>>  struct dp_packet *packet;
>>  long long int long_delta_t; /* msec */
>> +long long int now;
>>  uint32_t delta_t; /* msec */
>>  const size_t cnt = dp_packet_batch_size(packets_);
>>  uint32_t bytes, volume;
>> @@ -7197,8 +7198,12 @@ dp_netdev_run_meter(struct dp_netdev *dp, struct dp_packet_batch *packets_,
>>  memset(exceeded_rate, 0, cnt * sizeof *exceeded_rate);
>>  
>>  ovs_mutex_lock(&meter->lock);
>> +
>> +/* Update 'now'. */
>> +now = time_msec();
>> +
>>  /* All packets will hit the meter at the same time. */
>> -long_delta_t = now / 1000 - meter->used / 1000; /* msec */
>> +long_delta_t = now - meter->used; /* msec */
>>  
>>  if (long_delta_t < 0) {
>>  /* This condition means that we have several threads fighting for a
>> @@ -9170,8 +9175,7 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
>>  }
>>  
>>  case OVS_ACTION_ATTR_METER:
>> -dp_netdev_run_meter(pmd->dp, packets_, nl_attr_get_u32(a),
>> -pmd->ctx.now);
>> +dp_netdev_run_meter(pmd->dp, packets_, nl_attr_get_u32(a));
>>  break;
>>  
>>  case OVS_ACTION_ATTR_PUSH_VLAN:
> 



[ovs-dev] [PATCH] dpif-netdev: Lockless meters.

2023-06-21 Thread Ilya Maximets
Current implementation of meters in the userspace datapath takes
the meter lock for every packet batch.  If more than one thread
hits the flow with the same meter, they will lock each other.

Replace the critical section with atomic operations to avoid
interlocking.  Meters themselves are RCU-protected, so it's safe
to access them without holding a lock.

Implementation does the following:

 1. Tries to advance the 'used' timer of the meter with atomic
compare+exchange if it's smaller than 'now'.
 2. If the timer change succeeds, atomically update band buckets.
 3. Atomically update packet statistics for a meter.
 4. Go over buckets and try to atomically subtract the amount of
packets or bytes, recording the highest exceeded band.
 5. Atomically update band statistics and drop packets.
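
A minimal sketch of steps 1 and 2 above (illustrative, not the patch
itself; 'refill_buckets' is a hypothetical helper, and OVS's
atomic_compare_exchange_weak_relaxed() reloads 'used' on failure):

    uint64_t used;

    atomic_read_relaxed(&meter->used, &used);
    while (used < now) {
        if (atomic_compare_exchange_weak_relaxed(&meter->used, &used, now)) {
            /* This thread won the race and is the only one refilling
             * the band buckets for the elapsed (now - used) msec. */
            refill_buckets(meter, now - used);  /* hypothetical helper */
            break;
        }
        /* On failure 'used' now holds the fresh value; re-check it. */
    }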

Bucket manipulations are implemented with atomic compare+exchange
operations with extra checks, because bucket size should never
exceed the maximum and it should never go below zero.

Packet statistics may be momentarily inconsistent, i.e., number
of packets and the number of bytes may reflect different sets
of packets.  But it should be eventually consistent.  And the
difference at any given time should be just a few packets.

For the sake of reduced code complexity, the PKTPS meter tries to push
packets through the band one by one, even though they all have
the same weight.  This is also more fair if more than one thread
is passing packets through the same band at the same time.
Trying to predict the number of packets that can pass may also
cause extra atomic operations, reducing the performance.

This implementation shows similar performance to the previous one,
but should scale better with more threads hitting the same meter.

Signed-off-by: Ilya Maximets 
---

@Lin Huang, if you can try this change on your setup, that
would be great.

 NEWS  |   2 +
 lib/dpif-netdev.c | 250 +-
 2 files changed, 140 insertions(+), 112 deletions(-)

diff --git a/NEWS b/NEWS
index 66d5a4ea3..4a2b7dbca 100644
--- a/NEWS
+++ b/NEWS
@@ -37,6 +37,8 @@ Post-v3.1.0
- SRv6 Tunnel Protocol
  * Added support for userspace datapath (only).
- Userspace datapath:
+ * Implementation of OpenFlow meters is now lockless allowing for better
+   multi-thread scalability.
  * IP and L4 checksum offload support is now enabled by default for
interfaces that support it.  See the 'status' column in the 'interface'
table to check the status.
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index abe63412e..2fa556a62 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -212,21 +212,21 @@ static void dpcls_remove(struct dpcls *, struct dpcls_rule *);
 struct dp_meter_band {
 uint32_t rate;
 uint32_t burst_size;
-uint64_t bucket; /* In 1/1000 packets (for PKTPS), or in bits (for KBPS) */
-uint64_t packet_count;
-uint64_t byte_count;
+atomic_uint64_t bucket;  /* In 1/1000 packets for PKTPS,
+  * or in bits for KBPS. */
+atomic_uint64_t packet_count;
+atomic_uint64_t byte_count;
 };
 
 struct dp_meter {
 struct cmap_node node;
-struct ovs_mutex lock;
 uint32_t id;
 uint16_t flags;
 uint16_t n_bands;
 uint32_t max_delta_t;
-uint64_t used;
-uint64_t packet_count;
-uint64_t byte_count;
+atomic_uint64_t used;  /* Time of the last use in milliseconds. */
+atomic_uint64_t packet_count;
+atomic_uint64_t byte_count;
 struct dp_meter_band bands[];
 };
 
@@ -7165,22 +7165,56 @@ dpif_netdev_meter_get_features(const struct dpif *dpif OVS_UNUSED,
 features->max_color = 0;
 }
 
+/* Tries to atomically add 'n' to 'value' in terms of saturation arithmetic,
+ * i.e., if the result would be larger than 'max_value', stores 'max_value'
+ * instead. */
+static void
+atomic_sat_add(atomic_uint64_t *value, uint64_t n, uint64_t max_value)
+{
+uint64_t current, new_value;
+
+atomic_read_relaxed(value, &current);
+do {
+new_value = current + n;
+new_value = MIN(new_value, max_value);
+} while (!atomic_compare_exchange_weak_relaxed(value, &current,
+   new_value));
+}
+
+/* Tries to atomically subtract 'n' from 'value'.  Does not perform the
+ * operation and returns 'false' if the result would be less than 'min_value'.
+ * Otherwise, stores the result and returns 'true'. */
+static bool
+atomic_bound_sub(atomic_uint64_t *value, uint64_t n, uint64_t min_value)
+{
+uint64_t current;
+
+atomic_read_relaxed(value, &current);
+do {
+if (current < min_value + n) {
+return false;
+}
+} while (!atomic_compare_exchange_weak_relaxed(value, &current,
+   current - n));
+return true;
+}
+
 /* Applies the meter identified by 'meter_id' to 'packets_'.  Packets
  * that exceed a band are dropped in-place. */
 static void
 

[ovs-dev] [PATCH v3 1/2] userspace: Respect tso/gso segment size.

2023-06-21 Thread Mike Pattrick
From: Flavio Leitner 

Currently OVS will calculate the segment size based on the
MTU of the egress port. That usually happens to be correct
when the ports share the same MTU, but that is not always true.

Therefore, if the segment size is provided, then use it and
make sure oversized packets are dropped.
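
A minimal sketch of that fallback (illustrative, not the exact patch
logic; 'mtu', 'l3_len' and 'l4_len' are assumed inputs, while
dp_packet_get_tso_segsz() is introduced by this patch):

    uint16_t segsz = dp_packet_get_tso_segsz(pkt);
    if (!segsz) {
        /* Fall back to the egress-MTU-derived value, e.g.
         * 1500 - 20 (IPv4) - 20 (TCP) = 1460 bytes. */
        segsz = mtu - l3_len - l4_len;
    }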

Signed-off-by: Flavio Leitner 
Co-authored-by: Mike Pattrick 
Signed-off-by: Mike Pattrick 
---
v3: Corrected sparse error
---
 lib/dp-packet.c|  3 ++
 lib/dp-packet.h| 26 
 lib/netdev-dpdk.c  | 12 +++-
 lib/netdev-linux.c | 76 --
 4 files changed, 94 insertions(+), 23 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 27114a9a9..4d92a2e5b 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -34,6 +34,7 @@ dp_packet_init__(struct dp_packet *b, size_t allocated, enum dp_packet_source so
 pkt_metadata_init(&b->md, 0);
 dp_packet_reset_cutlen(b);
 dp_packet_reset_offload(b);
+dp_packet_set_tso_segsz(b, 0);
 /* Initialize implementation-specific fields of dp_packet. */
 dp_packet_init_specific(b);
 /* By default assume the packet type to be Ethernet. */
@@ -199,6 +200,8 @@ dp_packet_clone_with_headroom(const struct dp_packet *buffer, size_t headroom)
 *dp_packet_ol_flags_ptr(new_buffer) = *dp_packet_ol_flags_ptr(buffer);
 *dp_packet_ol_flags_ptr(new_buffer) &= DP_PACKET_OL_SUPPORTED_MASK;
 
+dp_packet_set_tso_segsz(new_buffer, dp_packet_get_tso_segsz(buffer));
+
 if (dp_packet_rss_valid(buffer)) {
 dp_packet_set_rss_hash(new_buffer, dp_packet_get_rss_hash(buffer));
 }
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 70ddf8aa4..6a924f3ff 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -126,6 +126,7 @@ struct dp_packet {
 uint32_t ol_flags;  /* Offloading flags. */
 uint32_t rss_hash;  /* Packet hash. */
 uint32_t flow_mark; /* Packet flow mark. */
+uint16_t tso_segsz; /* TCP TSO segment size. */
 #endif
 enum dp_packet_source source;  /* Source of memory allocated as 'base'. */
 
@@ -166,6 +167,9 @@ static inline void dp_packet_set_size(struct dp_packet *, uint32_t);
 static inline uint16_t dp_packet_get_allocated(const struct dp_packet *);
 static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t);
 
+static inline uint16_t dp_packet_get_tso_segsz(const struct dp_packet *);
+static inline void dp_packet_set_tso_segsz(struct dp_packet *, uint16_t);
+
 void *dp_packet_resize_l2(struct dp_packet *, int increment);
 void *dp_packet_resize_l2_5(struct dp_packet *, int increment);
 static inline void *dp_packet_eth(const struct dp_packet *);
@@ -644,6 +648,17 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
 b->mbuf.buf_len = s;
 }
 
+static inline uint16_t
+dp_packet_get_tso_segsz(const struct dp_packet *p)
+{
+return p->mbuf.tso_segsz;
+}
+
+static inline void
+dp_packet_set_tso_segsz(struct dp_packet *p, uint16_t s)
+{
+p->mbuf.tso_segsz = s;
+}
 #else /* DPDK_NETDEV */
 
 static inline void
@@ -700,6 +715,17 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
 b->allocated_ = s;
 }
 
+static inline uint16_t
+dp_packet_get_tso_segsz(const struct dp_packet *p)
+{
+return p->tso_segsz;
+}
+
+static inline void
+dp_packet_set_tso_segsz(struct dp_packet *p, uint16_t s)
+{
+p->tso_segsz = s;
+}
 #endif /* DPDK_NETDEV */
 
 static inline void
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 63dac689e..1c100c48e 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -2302,6 +2302,7 @@ netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
 
 if (mbuf->ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
 struct tcp_header *th = dp_packet_l4(pkt);
+int hdr_len;
 
 if (!th) {
 VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header"
@@ -2311,7 +2312,15 @@ netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
 
 mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4;
 mbuf->ol_flags |= RTE_MBUF_F_TX_TCP_CKSUM;
+hdr_len = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;
 mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len;
+if (OVS_UNLIKELY((hdr_len + mbuf->tso_segsz) > dev->max_packet_len)) {
+VLOG_WARN_RL(&rl, "%s: Oversized TSO packet. "
+ "hdr: %"PRIu32", gso: %"PRIu32", max len: %"PRIu32"",
+ dev->up.name, hdr_len, mbuf->tso_segsz,
+ dev->max_packet_len);
+return false;
+}
 
 if (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) {
 mbuf->ol_flags |= RTE_MBUF_F_TX_IP_CKSUM;
@@ -2586,7 +2595,8 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
 int cnt = 0;
 struct rte_mbuf *pkt;
 
-/* Filter oversized packets, unless are marked for TSO. */
+/* Filter oversized packets. The TSO packets are filtered out
+ * during the offloading preparation for performance reasons. */

[ovs-dev] [PATCH v3 2/2] userspace: Add Generic Segmentation Offloading.

2023-06-21 Thread Mike Pattrick
From: Flavio Leitner 

This provides a software implementation for the case where
the egress netdev doesn't support segmentation in hardware.

The challenge here is to guarantee packet ordering in the
original batch that may be full of TSO packets.  Each TSO
packet can go up to ~64kB, so with a segment size of 1440
bytes that means about 44 segments for each TSO packet.
Each batch holds 32 packets, so a full batch can amount to
about 1408 normal packets.

The segmentation code estimates the total number of segments
and then the total number of batches, allocates enough memory,
and finally does the work (see the sketch below).

Finally, each batch is sent to the netdev in order.
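
A sketch of that sizing estimate (illustrative; dp_packet_gso_nr_segs()
is from this patch and NETDEV_MAX_BURST is OVS's 32-packet batch size):

    /* Upper-bound the output before segmenting: one ~64kB TSO packet
     * with a 1440-byte segment size yields about 44 segments. */
    int n_segs = dp_packet_gso_nr_segs(packet);
    int n_batches = DIV_ROUND_UP(n_segs, NETDEV_MAX_BURST);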

Signed-off-by: Flavio Leitner 
Co-authored-by: Mike Pattrick 
Signed-off-by: Mike Pattrick 
---
 lib/automake.mk |   2 +
 lib/dp-packet-gso.c | 172 
 lib/dp-packet-gso.h |  24 +++
 lib/dp-packet.h |  11 +++
 lib/netdev-dpdk.c   |  46 
 lib/netdev-linux.c  |  58 ---
 lib/netdev.c| 120 ++-
 lib/packets.c   |   4 +-
 8 files changed, 314 insertions(+), 123 deletions(-)
 create mode 100644 lib/dp-packet-gso.c
 create mode 100644 lib/dp-packet-gso.h

diff --git a/lib/automake.mk b/lib/automake.mk
index e64ee76ce..49a92958d 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -118,6 +118,8 @@ lib_libopenvswitch_la_SOURCES = \
lib/dpctl.h \
lib/dp-packet.h \
lib/dp-packet.c \
+   lib/dp-packet-gso.c \
+   lib/dp-packet-gso.h \
lib/dpdk.h \
lib/dpif-netdev-extract-study.c \
lib/dpif-netdev-lookup.h \
diff --git a/lib/dp-packet-gso.c b/lib/dp-packet-gso.c
new file mode 100644
index 0..bc72e2f90
--- /dev/null
+++ b/lib/dp-packet-gso.c
@@ -0,0 +1,172 @@
+/*
+ * Copyright (c) 2021 Red Hat, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+#include 
+#include 
+
+#include "coverage.h"
+#include "dp-packet.h"
+#include "dp-packet-gso.h"
+#include "netdev-provider.h"
+#include "openvswitch/vlog.h"
+
+VLOG_DEFINE_THIS_MODULE(dp_packet_gso);
+
+COVERAGE_DEFINE(soft_seg_good);
+
+/* Returns a new packet that is a segment of packet 'p'.
+ *
+ * The new packet is initialized with 'hdr_len' bytes from the
+ * start of packet 'p' and then appended with 'data_len' bytes
+ * from the 'data' buffer.
+ *
+ * Note: The packet headers are not updated. */
+static struct dp_packet *
+dp_packet_gso_seg_new(const struct dp_packet *p, size_t hdr_len,
+  const char *data, size_t data_len)
+{
+struct dp_packet *seg = dp_packet_new_with_headroom(hdr_len + data_len,
+dp_packet_headroom(p));
+
+/* Append the original packet headers and then the payload. */
+dp_packet_put(seg, dp_packet_data(p), hdr_len);
+dp_packet_put(seg, data, data_len);
+
+/* The new segment should have the same offsets. */
+seg->l2_5_ofs = p->l2_5_ofs;
+seg->l3_ofs = p->l3_ofs;
+seg->l4_ofs = p->l4_ofs;
+
+/* The protocol headers remain the same, so preserve hash and mark. */
+*dp_packet_rss_ptr(seg) = dp_packet_get_rss_hash(p);
+*dp_packet_flow_mark_ptr(seg) = *dp_packet_flow_mark_ptr(p);
+
+/* The segment should inherit all the offloading flags from the
+ * original packet, except for the TCP segmentation, external
+ * buffer and indirect buffer flags. */
+*dp_packet_ol_flags_ptr(seg) = *dp_packet_ol_flags_ptr(p)
+& ~(DP_PACKET_OL_TX_TCP_SEG | DP_PACKET_OL_EXTERNAL
+| DP_PACKET_OL_INDIRECT);
+
+dp_packet_hwol_reset_tcp_seg(seg);
+
+return seg;
+}
+
+/* Returns the calculated number of TCP segments in packet 'p'. */
+int
+dp_packet_gso_nr_segs(struct dp_packet *p)
+{
+uint16_t segsz = dp_packet_get_tso_segsz(p);
+const char *data_tail;
+const char *data_pos;
+int n_segs;
+
+data_tail = (char *) dp_packet_tail(p) - dp_packet_l2_pad_size(p);
+data_pos = dp_packet_get_tcp_payload(p);
+n_segs = DIV_ROUND_UP(data_tail - data_pos, segsz);
+
+return n_segs;
+}
+
+/* Perform software segmentation on packet 'p'.
+ *
+ * The resulting segments are added to the array of preallocated
+ * batches in 'batches', starting at batch position 'batch_pos'. */
+void
+dp_packet_gso(struct dp_packet *p, struct dp_packet_batch *batches,
+  size_t *batch_pos)
+{
+struct tcp_header *tcp_hdr;
+struct 

[ovs-dev] [PATCH v12] netdev-dpdk: Add custom rx-steering configuration.

2023-06-21 Thread Robin Jarry
Some control protocols are used to maintain link status between
forwarding engines (e.g. LACP). When the system is not sized properly,
the PMD threads may not be able to process all incoming traffic from the
configured Rx queues. When a signaling packet of such protocols is
dropped, it can cause link flapping, worsening the situation.

Use the RTE flow API to redirect these protocols into a dedicated Rx
queue. The assumption is made that the ratio between control protocol
traffic and user data traffic is very low and thus this dedicated Rx
queue will never get full. Re-program the RSS redirection table to only
use the other Rx queues.

The additional Rx queue will be assigned a PMD core like any other Rx
queue. Polling that extra queue may introduce increased latency and
a slight performance penalty, with the benefit of preventing link flapping.

This feature must be enabled per port on specific protocols via the
rx-steering option. This option takes "rss" followed by a "+" separated
list of protocol names. It is only supported on ethernet ports. This
feature is experimental.

If the user has already configured multiple Rx queues on the port, an
additional one will be allocated for control packets. If the hardware
cannot satisfy the number of requested Rx queues, the last Rx queue will
be assigned to the control plane. If only one Rx queue is available, the
rx-steering feature will be disabled. If the hardware does not support
the RTE flow matchers/actions, the rx-steering feature will be
completely disabled on the port.

It cannot be enabled when other_config:hw-offload=true as it may
conflict with the offloaded RTE flows. Similarly, if hw-offload is
enabled, custom rx-steering will be forcibly disabled on all ports.

Example use:

 ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
   set interface phy0 type=dpdk options:dpdk-devargs=:ca:00.0 -- \
   set interface phy0 options:rx-steering=rss+lacp -- \
   set interface phy1 type=dpdk options:dpdk-devargs=:ca:00.1 -- \
   set interface phy1 options:rx-steering=rss+lacp

As a starting point, only one protocol is supported: LACP. Other
protocols can be added in the future. NIC compatibility should be
checked.

To validate that this works as intended, I used a traffic generator to
generate random traffic slightly above the machine capacity at line rate
on a two ports bond interface. OVS is configured to receive traffic on
two VLANs and pop/push them in a br-int bridge based on tags set on
patch ports.

   +--+
   | DUT  |
   |++|
   ||   br-int   || 
in_port=patch10,actions=mod_dl_src:$patch11,mod_dl_dst:$tgen1,output:patch11
   |||| 
in_port=patch11,actions=mod_dl_src:$patch10,mod_dl_dst:$tgen0,output:patch10
   || patch10patch11 ||
   |+---|---|+|
   ||   | |
   |+---|---|+|
   || patch00patch01 ||
   ||  tag:10tag:20  ||
   ||||
   ||   br-phy   || default flow, action=NORMAL
   ||||
   ||   bond0|| balance-slb, lacp=passive, lacp-time=fast
   ||phy0   phy1 ||
   |+--|-|---+|
   +---|-|+
   | |
   +---|-|+
   | port0  port1 | balance L3/L4, lacp=active, lacp-time=fast
   | lag  | mode trunk VLANs 10, 20
   |  |
   |switch|
   |  |
   |  vlan 10vlan 20  |  mode access
   |   port2  port3   |
   +-|--|-+
 |  |
   +-|--|-+
   |   tgen0  tgen1   |  Random traffic that is properly balanced
   |  |  across the bond ports in both directions.
   |  traffic generator   |
   +--+

Without rx-steering, the bond0 links are randomly switching to
"defaulted" when one of the LACP packets sent by the switch is dropped
because the RX queues are full and the PMD threads did not process them
fast enough. When that happens, all traffic must go through a single
link, causing the above-line-rate traffic to be dropped.

 ~# ovs-appctl lacp/show-stats bond0
  bond0 statistics 
 member: phy0:
   TX PDUs: 347246
   RX PDUs: 14865
   RX Bad PDUs: 0
   RX Marker Request PDUs: 0
   Link Expired: 168
   Link Defaulted: 0
   Carrier Status Changed: 0
 member: phy1:
   TX PDUs: 347245
   RX PDUs: 14919
   RX Bad PDUs: 0
   RX Marker Request PDUs: 0
   Link Expired: 147
   Link Defaulted: 1
   Carrier Status Changed: 0

When rx-steering is enabled, no LACP packet is dropped and the bond
links remain enabled at all times, maximizing the throughput. Neither
the "Link Expired" nor the "Link Defaulted" counters are incremented
anymore.

This feature may be considered "QoS". However, it does not work by
limiting the rate of traffic explicitly. It only guarantees that some
protocols have a lower chance of being dropped because the PMD cores
cannot keep 

[ovs-dev] [PATCH v2 2/2] userspace: Add Generic Segmentation Offloading.

2023-06-21 Thread Mike Pattrick
From: Flavio Leitner 

This provides a software implementation for the case where
the egress netdev doesn't support segmentation in hardware.

The challenge here is to guarantee packet ordering in the
original batch that may be full of TSO packets.  Each TSO
packet can go up to ~64kB, so with a segment size of 1440
bytes that means about 44 segments for each TSO packet.
Each batch holds 32 packets, so a full batch can amount to
about 1408 normal packets.

The segmentation code estimates the total number of segments
and then the total number of batches, allocates enough memory,
and finally does the work.

Finally, each batch is sent to the netdev in order.

Signed-off-by: Flavio Leitner 
Co-authored-by: Mike Pattrick 
Signed-off-by: Mike Pattrick 
---
 lib/automake.mk |   2 +
 lib/dp-packet-gso.c | 172 
 lib/dp-packet-gso.h |  24 +++
 lib/dp-packet.h |  11 +++
 lib/netdev-dpdk.c   |  46 
 lib/netdev-linux.c  |  58 ---
 lib/netdev.c| 120 ++-
 lib/packets.c   |   4 +-
 8 files changed, 314 insertions(+), 123 deletions(-)
 create mode 100644 lib/dp-packet-gso.c
 create mode 100644 lib/dp-packet-gso.h

diff --git a/lib/automake.mk b/lib/automake.mk
index e64ee76ce..49a92958d 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -118,6 +118,8 @@ lib_libopenvswitch_la_SOURCES = \
lib/dpctl.h \
lib/dp-packet.h \
lib/dp-packet.c \
+   lib/dp-packet-gso.c \
+   lib/dp-packet-gso.h \
lib/dpdk.h \
lib/dpif-netdev-extract-study.c \
lib/dpif-netdev-lookup.h \
diff --git a/lib/dp-packet-gso.c b/lib/dp-packet-gso.c
new file mode 100644
index 0..bc72e2f90
--- /dev/null
+++ b/lib/dp-packet-gso.c
@@ -0,0 +1,172 @@
+/*
+ * Copyright (c) 2021 Red Hat, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+#include 
+#include 
+
+#include "coverage.h"
+#include "dp-packet.h"
+#include "dp-packet-gso.h"
+#include "netdev-provider.h"
+#include "openvswitch/vlog.h"
+
+VLOG_DEFINE_THIS_MODULE(dp_packet_gso);
+
+COVERAGE_DEFINE(soft_seg_good);
+
+/* Returns a new packet that is a segment of packet 'p'.
+ *
+ * The new packet is initialized with 'hdr_len' bytes from the
+ * start of packet 'p' and then appended with 'data_len' bytes
+ * from the 'data' buffer.
+ *
+ * Note: The packet headers are not updated. */
+static struct dp_packet *
+dp_packet_gso_seg_new(const struct dp_packet *p, size_t hdr_len,
+  const char *data, size_t data_len)
+{
+struct dp_packet *seg = dp_packet_new_with_headroom(hdr_len + data_len,
+dp_packet_headroom(p));
+
+/* Append the original packet headers and then the payload. */
+dp_packet_put(seg, dp_packet_data(p), hdr_len);
+dp_packet_put(seg, data, data_len);
+
+/* The new segment should have the same offsets. */
+seg->l2_5_ofs = p->l2_5_ofs;
+seg->l3_ofs = p->l3_ofs;
+seg->l4_ofs = p->l4_ofs;
+
+/* The protocol headers remain the same, so preserve hash and mark. */
+*dp_packet_rss_ptr(seg) = dp_packet_get_rss_hash(p);
+*dp_packet_flow_mark_ptr(seg) = *dp_packet_flow_mark_ptr(p);
+
+/* The segment should inherit all the offloading flags from the
+ * original packet, except for the TCP segmentation, external
+ * buffer and indirect buffer flags. */
+*dp_packet_ol_flags_ptr(seg) = *dp_packet_ol_flags_ptr(p)
+& ~(DP_PACKET_OL_TX_TCP_SEG | DP_PACKET_OL_EXTERNAL
+| DP_PACKET_OL_INDIRECT);
+
+dp_packet_hwol_reset_tcp_seg(seg);
+
+return seg;
+}
+
+/* Returns the calculated number of TCP segments in packet 'p'. */
+int
+dp_packet_gso_nr_segs(struct dp_packet *p)
+{
+uint16_t segsz = dp_packet_get_tso_segsz(p);
+const char *data_tail;
+const char *data_pos;
+int n_segs;
+
+data_tail = (char *) dp_packet_tail(p) - dp_packet_l2_pad_size(p);
+data_pos = dp_packet_get_tcp_payload(p);
+n_segs = DIV_ROUND_UP(data_tail - data_pos, segsz);
+
+return n_segs;
+}
+
+/* Perform software segmentation on packet 'p'.
+ *
+ * The resulting segments are added to the array of preallocated
+ * batches in 'batches', starting at batch position 'batch_pos'. */
+void
+dp_packet_gso(struct dp_packet *p, struct dp_packet_batch *batches,
+  size_t *batch_pos)
+{
+struct tcp_header *tcp_hdr;
+struct 

[ovs-dev] [PATCH v2 1/2] userspace: Respect tso/gso segment size.

2023-06-21 Thread Mike Pattrick
From: Flavio Leitner 

Currently OVS will calculate the segment size based on the
MTU of the egress port. That usually happens to be correct
when the ports share the same MTU, but that is not always true.

Therefore, if the segment size is provided, then use it and
make sure oversized packets are dropped.

Signed-off-by: Flavio Leitner 
Co-authored-by: Mike Pattrick 
Signed-off-by: Mike Pattrick 
---
 lib/dp-packet.c|  3 ++
 lib/dp-packet.h| 26 
 lib/netdev-dpdk.c  | 12 +++-
 lib/netdev-linux.c | 76 --
 4 files changed, 94 insertions(+), 23 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 27114a9a9..4d92a2e5b 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -34,6 +34,7 @@ dp_packet_init__(struct dp_packet *b, size_t allocated, enum dp_packet_source so
 pkt_metadata_init(&b->md, 0);
 dp_packet_reset_cutlen(b);
 dp_packet_reset_offload(b);
+dp_packet_set_tso_segsz(b, 0);
 /* Initialize implementation-specific fields of dp_packet. */
 dp_packet_init_specific(b);
 /* By default assume the packet type to be Ethernet. */
@@ -199,6 +200,8 @@ dp_packet_clone_with_headroom(const struct dp_packet *buffer, size_t headroom)
 *dp_packet_ol_flags_ptr(new_buffer) = *dp_packet_ol_flags_ptr(buffer);
 *dp_packet_ol_flags_ptr(new_buffer) &= DP_PACKET_OL_SUPPORTED_MASK;
 
+dp_packet_set_tso_segsz(new_buffer, dp_packet_get_tso_segsz(buffer));
+
 if (dp_packet_rss_valid(buffer)) {
 dp_packet_set_rss_hash(new_buffer, dp_packet_get_rss_hash(buffer));
 }
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 70ddf8aa4..6a924f3ff 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -126,6 +126,7 @@ struct dp_packet {
 uint32_t ol_flags;  /* Offloading flags. */
 uint32_t rss_hash;  /* Packet hash. */
 uint32_t flow_mark; /* Packet flow mark. */
+uint16_t tso_segsz; /* TCP TSO segment size. */
 #endif
 enum dp_packet_source source;  /* Source of memory allocated as 'base'. */
 
@@ -166,6 +167,9 @@ static inline void dp_packet_set_size(struct dp_packet *, uint32_t);
 static inline uint16_t dp_packet_get_allocated(const struct dp_packet *);
 static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t);
 
+static inline uint16_t dp_packet_get_tso_segsz(const struct dp_packet *);
+static inline void dp_packet_set_tso_segsz(struct dp_packet *, uint16_t);
+
 void *dp_packet_resize_l2(struct dp_packet *, int increment);
 void *dp_packet_resize_l2_5(struct dp_packet *, int increment);
 static inline void *dp_packet_eth(const struct dp_packet *);
@@ -644,6 +648,17 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
 b->mbuf.buf_len = s;
 }
 
+static inline uint16_t
+dp_packet_get_tso_segsz(const struct dp_packet *p)
+{
+return p->mbuf.tso_segsz;
+}
+
+static inline void
+dp_packet_set_tso_segsz(struct dp_packet *p, uint16_t s)
+{
+p->mbuf.tso_segsz = s;
+}
 #else /* DPDK_NETDEV */
 
 static inline void
@@ -700,6 +715,17 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
 b->allocated_ = s;
 }
 
+static inline uint16_t
+dp_packet_get_tso_segsz(const struct dp_packet *p)
+{
+return p->tso_segsz;
+}
+
+static inline void
+dp_packet_set_tso_segsz(struct dp_packet *p, uint16_t s)
+{
+p->tso_segsz = s;
+}
 #endif /* DPDK_NETDEV */
 
 static inline void
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 63dac689e..1c100c48e 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -2302,6 +2302,7 @@ netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
 
 if (mbuf->ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
 struct tcp_header *th = dp_packet_l4(pkt);
+int hdr_len;
 
 if (!th) {
 VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header"
@@ -2311,7 +2312,15 @@ netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
 
 mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4;
 mbuf->ol_flags |= RTE_MBUF_F_TX_TCP_CKSUM;
+hdr_len = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;
 mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len;
+if (OVS_UNLIKELY((hdr_len + mbuf->tso_segsz) > dev->max_packet_len)) {
+VLOG_WARN_RL(&rl, "%s: Oversized TSO packet. "
+ "hdr: %"PRIu32", gso: %"PRIu32", max len: %"PRIu32"",
+ dev->up.name, hdr_len, mbuf->tso_segsz,
+ dev->max_packet_len);
+return false;
+}
 
 if (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) {
 mbuf->ol_flags |= RTE_MBUF_F_TX_IP_CKSUM;
@@ -2586,7 +2595,8 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
 int cnt = 0;
 struct rte_mbuf *pkt;
 
-/* Filter oversized packets, unless are marked for TSO. */
+/* Filter oversized packets. The TSO packets are filtered out
+ * during the offloading preparation for performance reasons. */

Re: [ovs-dev] [PATCH v1] bridge ovs-vsctl Bridge IPFIX enable_input_sampling, enable_output_sampling fix unexpected values

2023-06-21 Thread Sayali Naval (sanaval) via dev
A gentle reminder to take a look at this patch.

Thanks,
Sayali Naval

From: Sayali Naval (sanaval)
Sent: Wednesday, May 31, 2023 11:02 AM
To: d...@openvswitch.org 
Subject: [ovs-dev] [PATCH v1] bridge ovs-vsctl Bridge IPFIX
enable_input_sampling, enable_output_sampling fix unexpected values


As per the Open vSwitch Manual 
(http://www.openvswitch.org/support/dist-docs/ovs-vsctl.8.txt) the Bridge IPFIX 
parameters can be passed as follows:


ovs-vsctl -- set Bridge br0 ipfix=@i \
    -- --id=@i create IPFIX targets=\"192.168.0.34:4739\" \
       obs_domain_id=123 obs_point_id=456 cache_active_timeout=60 \
       cache_max_flows=13 \
       other_config:enable-input-sampling=false \
       other_config:enable-output-sampling=false


where the default values are:

  enable_input_sampling: true
  enable_output_sampling: true


But in the existing code
(https://github.com/openvswitch/ovs/blob/master/vswitchd/bridge.c#L1563-L1567),
these two parameters take on unexpected values in some scenarios:



be_opts.enable_input_sampling = !smap_get_bool(&be_cfg->other_config,
                                  "enable-input-sampling", false);

be_opts.enable_output_sampling = !smap_get_bool(&be_cfg->other_config,
                                  "enable-output-sampling", false);


Here, the function smap_get_bool is being used with a negation.


smap_get_bool is defined as below:

(https://github.com/openvswitch/ovs/blob/master/lib/smap.c#L220-L232)


/* Gets the value associated with 'key' in 'smap' and converts it to a boolean.
 * If 'key' is not in 'smap', or its value is neither "true" nor "false",
 * returns 'def'. */
bool
smap_get_bool(const struct smap *smap, const char *key, bool def)
{
    const char *value = smap_get_def(smap, key, "");

    if (def) {
        return strcasecmp("false", value) != 0;
    } else {
        return !strcasecmp("true", value);
    }
}


This returns the expected value for the default case (the above code
negates the "false" we get from smap_get_bool and returns "true") but
unexpected values when a sampling value is passed explicitly through the CLI.

For example, if we pass "true" for other_config:enable-input-sampling on
the CLI, the above code negates the "true" value we get from smap_get_bool
and returns "false".  The same is the case for enable_output_sampling, as
the table below summarizes.
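
    /* Behavior of the buggy expression, per the smap_get_bool definition
     * quoted above (illustrative summary):
     *
     *   other_config value | smap_get_bool(..., false) | !(...) used as result
     *   -------------------+---------------------------+---------------------
     *   (unset)            | false                     | true   (default OK)
     *   "true"             | true                      | false  (inverted!)
     *   "false"            | false                     | true   (inverted!)
     *
     * The fix passes the real default instead: smap_get_bool(..., true). */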


Signed-off-by: Sayali Naval 

---

[bridge-ipfix-fix 277b3baae] Fix values for enable_input_sampling &
enable_output_sampling

 Date: Thu May 25 10:32:43 2023 -0700

 1 file changed, 4 insertions(+), 4 deletions(-)


diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
index f5dc59ad0..b972d55d0 100644
--- a/vswitchd/bridge.c
+++ b/vswitchd/bridge.c
@@ -1560,11 +1560,11 @@ bridge_configure_ipfix(struct bridge *br)
     be_opts.enable_tunnel_sampling = smap_get_bool(&be_cfg->other_config,
                                       "enable-tunnel-sampling", true);
 
-    be_opts.enable_input_sampling = !smap_get_bool(&be_cfg->other_config,
-                                      "enable-input-sampling", false);
+    be_opts.enable_input_sampling = smap_get_bool(&be_cfg->other_config,
+                                      "enable-input-sampling", true);
 
-    be_opts.enable_output_sampling = !smap_get_bool(&be_cfg->other_config,
-                                      "enable-output-sampling", false);
+    be_opts.enable_output_sampling = smap_get_bool(&be_cfg->other_config,
+                                      "enable-output-sampling", true);
 
     virtual_obs_id = smap_get(&be_cfg->other_config, "virtual_obs_id");
     be_opts.virtual_obs_id = nullable_xstrdup(virtual_obs_id);
--





[ovs-dev] [PATCH 2/2] userspace: Add Generic Segmentation Offloading.

2023-06-21 Thread Mike Pattrick
From: Flavio Leitner 

This provides a software implementation for the case where
the egress netdev doesn't support segmentation in hardware.

The challenge here is to guarantee packet ordering in the
original batch that may be full of TSO packets.  Each TSO
packet can go up to ~64kB, so with a segment size of 1440
bytes that means about 44 segments for each TSO packet.
Each batch holds 32 packets, so a full batch can amount to
about 1408 normal packets.

The segmentation code estimates the total number of segments
and then the total number of batches, allocates enough memory,
and finally does the work.

Finally, each batch is sent to the netdev in order.

Signed-off-by: Flavio Leitner 
Co-authored-by: Mike Pattrick 
Signed-off-by: Mike Pattrick 
---
 lib/automake.mk |   2 +
 lib/dp-packet-gso.c | 172 
 lib/dp-packet-gso.h |  24 +++
 lib/dp-packet.c |   4 +-
 lib/dp-packet.h |  15 +++-
 lib/netdev-dpdk.c   |  38 +++---
 lib/netdev-linux.c  |  58 ---
 lib/netdev.c| 121 ++-
 lib/packets.c   |   4 +-
 9 files changed, 314 insertions(+), 124 deletions(-)
 create mode 100644 lib/dp-packet-gso.c
 create mode 100644 lib/dp-packet-gso.h

diff --git a/lib/automake.mk b/lib/automake.mk
index e64ee76ce..49a92958d 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -118,6 +118,8 @@ lib_libopenvswitch_la_SOURCES = \
lib/dpctl.h \
lib/dp-packet.h \
lib/dp-packet.c \
+   lib/dp-packet-gso.c \
+   lib/dp-packet-gso.h \
lib/dpdk.h \
lib/dpif-netdev-extract-study.c \
lib/dpif-netdev-lookup.h \
diff --git a/lib/dp-packet-gso.c b/lib/dp-packet-gso.c
new file mode 100644
index 0..bc72e2f90
--- /dev/null
+++ b/lib/dp-packet-gso.c
@@ -0,0 +1,172 @@
+/*
+ * Copyright (c) 2021 Red Hat, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+#include 
+#include 
+
+#include "coverage.h"
+#include "dp-packet.h"
+#include "dp-packet-gso.h"
+#include "netdev-provider.h"
+#include "openvswitch/vlog.h"
+
+VLOG_DEFINE_THIS_MODULE(dp_packet_gso);
+
+COVERAGE_DEFINE(soft_seg_good);
+
+/* Returns a new packet that is a segment of packet 'p'.
+ *
+ * The new packet is initialized with 'hdr_len' bytes from the
+ * start of packet 'p' and then appended with 'data_len' bytes
+ * from the 'data' buffer.
+ *
+ * Note: The packet headers are not updated. */
+static struct dp_packet *
+dp_packet_gso_seg_new(const struct dp_packet *p, size_t hdr_len,
+  const char *data, size_t data_len)
+{
+struct dp_packet *seg = dp_packet_new_with_headroom(hdr_len + data_len,
+dp_packet_headroom(p));
+
+/* Append the original packet headers and then the payload. */
+dp_packet_put(seg, dp_packet_data(p), hdr_len);
+dp_packet_put(seg, data, data_len);
+
+/* The new segment should have the same offsets. */
+seg->l2_5_ofs = p->l2_5_ofs;
+seg->l3_ofs = p->l3_ofs;
+seg->l4_ofs = p->l4_ofs;
+
+/* The protocol headers remain the same, so preserve hash and mark. */
+*dp_packet_rss_ptr(seg) = dp_packet_get_rss_hash(p);
+*dp_packet_flow_mark_ptr(seg) = *dp_packet_flow_mark_ptr(p);
+
+/* The segment should inherit all the offloading flags from the
+ * original packet, except for the TCP segmentation, external
+ * buffer and indirect buffer flags. */
+*dp_packet_ol_flags_ptr(seg) = *dp_packet_ol_flags_ptr(p)
+& ~(DP_PACKET_OL_TX_TCP_SEG | DP_PACKET_OL_EXTERNAL
+| DP_PACKET_OL_INDIRECT);
+
+dp_packet_hwol_reset_tcp_seg(seg);
+
+return seg;
+}
+
+/* Returns the calculated number of TCP segments in packet 'p'. */
+int
+dp_packet_gso_nr_segs(struct dp_packet *p)
+{
+uint16_t segsz = dp_packet_get_tso_segsz(p);
+const char *data_tail;
+const char *data_pos;
+int n_segs;
+
+data_tail = (char *) dp_packet_tail(p) - dp_packet_l2_pad_size(p);
+data_pos = dp_packet_get_tcp_payload(p);
+n_segs = DIV_ROUND_UP(data_tail - data_pos, segsz);
+
+return n_segs;
+}
+
+/* Perform software segmentation on packet 'p'.
+ *
+ * The resulting segments are added to the array of preallocated
+ * batches in 'batches', starting at batch position 'batch_pos'. */
+void
+dp_packet_gso(struct dp_packet *p, struct dp_packet_batch *batches,
+  size_t *batch_pos)
+{
+struct tcp_header 

[ovs-dev] [PATCH 1/2] userspace: Respect tso/gso segment size.

2023-06-21 Thread Mike Pattrick
From: Flavio Leitner 

Currently OVS will calculate the segment size based on the
MTU of the egress port. That usually happens to be correct
when the ports share the same MTU, but that is not always true.

Therefore, if the segment size is provided, then use it and
make sure oversized packets are dropped.

Signed-off-by: Flavio Leitner 
Co-authored-by: Mike Pattrick 
Signed-off-by: Mike Pattrick 
---
 lib/dp-packet.c|  1 +
 lib/dp-packet.h| 26 +
 lib/netdev-dpdk.c  | 13 +++--
 lib/netdev-linux.c | 73 +-
 4 files changed, 91 insertions(+), 22 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index dfedd0e9b..d1e916488 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -34,6 +34,7 @@ dp_packet_init__(struct dp_packet *b, size_t allocated, enum dp_packet_source so
 pkt_metadata_init(&b->md, 0);
 dp_packet_reset_cutlen(b);
 dp_packet_reset_offload(b);
+dp_packet_set_tso_segsz(b, 0);
 /* Initialize implementation-specific fields of dp_packet. */
 dp_packet_init_specific(b);
 /* By default assume the packet type to be Ethernet. */
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 70ddf8aa4..6a924f3ff 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -126,6 +126,7 @@ struct dp_packet {
 uint32_t ol_flags;  /* Offloading flags. */
 uint32_t rss_hash;  /* Packet hash. */
 uint32_t flow_mark; /* Packet flow mark. */
+uint16_t tso_segsz; /* TCP TSO segment size. */
 #endif
 enum dp_packet_source source;  /* Source of memory allocated as 'base'. */
 
@@ -166,6 +167,9 @@ static inline void dp_packet_set_size(struct dp_packet *, uint32_t);
 static inline uint16_t dp_packet_get_allocated(const struct dp_packet *);
 static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t);
 
+static inline uint16_t dp_packet_get_tso_segsz(const struct dp_packet *);
+static inline void dp_packet_set_tso_segsz(struct dp_packet *, uint16_t);
+
 void *dp_packet_resize_l2(struct dp_packet *, int increment);
 void *dp_packet_resize_l2_5(struct dp_packet *, int increment);
 static inline void *dp_packet_eth(const struct dp_packet *);
@@ -644,6 +648,17 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
 b->mbuf.buf_len = s;
 }
 
+static inline uint16_t
+dp_packet_get_tso_segsz(const struct dp_packet *p)
+{
+return p->mbuf.tso_segsz;
+}
+
+static inline void
+dp_packet_set_tso_segsz(struct dp_packet *p, uint16_t s)
+{
+p->mbuf.tso_segsz = s;
+}
 #else /* DPDK_NETDEV */
 
 static inline void
@@ -700,6 +715,17 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
 b->allocated_ = s;
 }
 
+static inline uint16_t
+dp_packet_get_tso_segsz(const struct dp_packet *p)
+{
+return p->tso_segsz;
+}
+
+static inline void
+dp_packet_set_tso_segsz(struct dp_packet *p, uint16_t s)
+{
+p->tso_segsz = s;
+}
 #endif /* DPDK_NETDEV */
 
 static inline void
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 46f577441..1959b260b 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -2312,6 +2312,7 @@ netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
 
 if (mbuf->ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
 struct tcp_header *th = dp_packet_l4(pkt);
+int hdr_len;
 
 if (!th) {
 VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header"
@@ -2321,7 +2322,14 @@ netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
 
 mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4;
 mbuf->ol_flags |= RTE_MBUF_F_TX_TCP_CKSUM;
-mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len;
+hdr_len = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len;
+if (OVS_UNLIKELY((hdr_len + mbuf->tso_segsz) > dev->max_packet_len)) {
+VLOG_WARN_RL(&rl, "%s: Oversized TSO packet. "
+ "hdr: %"PRIu32", gso: %"PRIu32", max len: %"PRIu32"",
+ dev->up.name, hdr_len, mbuf->tso_segsz,
+ dev->max_packet_len);
+return false;
+}
 
 if (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) {
 mbuf->ol_flags |= RTE_MBUF_F_TX_IP_CKSUM;
@@ -2596,7 +2604,8 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
 int cnt = 0;
 struct rte_mbuf *pkt;
 
-/* Filter oversized packets, unless are marked for TSO. */
+/* Filter oversized packets. The TSO packets are filtered out
+ * during the offloading preparation for performance reasons. */
 for (i = 0; i < pkt_cnt; i++) {
 pkt = pkts[i];
 if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len)
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 3dba2ef1f..3877586b8 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -536,7 +536,7 @@ static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
 static bool tap_supports_vnet_hdr = true;
 
 static int 

Re: [ovs-dev] [PATCH v1] dpif-netdev: Update meter timestamp when dp_netdev_run_meter been called.

2023-06-21 Thread Ilya Maximets
On 5/31/23 17:41, mit...@outlook.com wrote:
> From: Lin Huang 
> 
> Currently, a meter's timestamp 'now' is set from 'pmd->ctx.now', which is
> updated by pmd_thread_ctx_time_update().
> 
> That update happens before the processing of a new packet batch, in:
> - dpif_netdev_execute()
> - dp_netdev_process_rxq_port()
> 
> There is a problem when a user wants to police the rate down to a low pps
> with a meter.  For example, when the traffic is more than 3 Mpps and the
> meter polices it to 1 Mpps, the real rate will be 3 Mpps, not 1 Mpps.
> 
> The key point is that a meter's timestamp isn't updated in real time.
> For example, PMD threads A and B poll at the same time (t1).
> Thread A starts to run the meter to police traffic.  At the same time,
> thread B blocks on 'ovs_mutex_lock(&meter->lock)', waiting for thread A to
> release the meter lock.  This can take a long time, especially when there
> are more than 10 PMD threads.

Shouldn't the infrequent time update cause more drops instead?

Best regards, Ilya Maximets.

> 
> After thread A releases the meter lock, thread B starts to compute the
> time diff.
> - long_delta_t = now - meter->used --> long_delta_t = t1 - t1 = 0.
> - band->bucket += (uint64_t) delta_t * band->rate --> band->bucket += 0.
> - band_exceeded_pkt = band->bucket / 1000; --> band_exceeded_pkt = 0.
> 
> Fix this problem by updating the meter timestamp on every call.
> 
> Tested-by: Zhang Yuhuang
> Signed-off-by: Lin Huang 
> ---
>  lib/dpif-netdev.c | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 70b953ae6..dbb275cf8 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -7169,12 +7169,13 @@ dpif_netdev_meter_get_features(const struct dpif *dpif OVS_UNUSED,
>   * that exceed a band are dropped in-place. */
>  static void
>  dp_netdev_run_meter(struct dp_netdev *dp, struct dp_packet_batch *packets_,
> -uint32_t meter_id, long long int now)
> +uint32_t meter_id)
>  {
>  struct dp_meter *meter;
>  struct dp_meter_band *band;
>  struct dp_packet *packet;
>  long long int long_delta_t; /* msec */
> +long long int now;
>  uint32_t delta_t; /* msec */
>  const size_t cnt = dp_packet_batch_size(packets_);
>  uint32_t bytes, volume;
> @@ -7197,8 +7198,12 @@ dp_netdev_run_meter(struct dp_netdev *dp, struct dp_packet_batch *packets_,
>  memset(exceeded_rate, 0, cnt * sizeof *exceeded_rate);
>  
>  ovs_mutex_lock(&meter->lock);
> +
> +/* Update 'now'. */
> +now = time_msec();
> +
>  /* All packets will hit the meter at the same time. */
> -long_delta_t = now / 1000 - meter->used / 1000; /* msec */
> +long_delta_t = now - meter->used; /* msec */
>  
>  if (long_delta_t < 0) {
>  /* This condition means that we have several threads fighting for a
> @@ -9170,8 +9175,7 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
>  }
>  
>  case OVS_ACTION_ATTR_METER:
> -dp_netdev_run_meter(pmd->dp, packets_, nl_attr_get_u32(a),
> -pmd->ctx.now);
> +dp_netdev_run_meter(pmd->dp, packets_, nl_attr_get_u32(a));
>  break;
>  
>  case OVS_ACTION_ATTR_PUSH_VLAN:



Re: [ovs-dev] [PATCH v4] rhel: make the version, displayed to the user, customizable

2023-06-21 Thread Ilya Maximets
On 6/20/23 19:43, Timothy Redaelli wrote:
> Since on CentOS/RHEL the builds are based on stable branches and not on
> tags, for debugging purposes it's better to have the downstream version as
> the version, so it's easier to know which commits are included in a build.
> 
> This commit adds --with-version-suffix as a ./configure option in
> order to set an OVS version suffix that is shown to the user via
> 'ovs-vsctl -V' and, therefore, also in the database, in 'ovs-vsctl show',
> and in the other utilities.
> 
> --with-version-suffix is used in the Fedora/CentOS/RHEL spec file in order
> to keep the version aligned with the downstream one.
> 
> Signed-off-by: Timothy Redaelli 
> ---
> v1 -> v2: Use --with-version-suffix= and add version to other utilities
>   (as requested by Ilya).
> 
> v2 -> v3: Add versioning to python utilities and python library itself
>   (as suggested by Aaron)
> 
> v3 -> v4: Remove versioning to python library itself to avoid breaking
>   PEP440 (as requested by Ilya). Versioning is still used in
>   python utilities.

Thanks for the update.  There is at least one place in the Python code
missed in this patch: python/ovs/unixctl/server.py.  It's
using ovs.version.VERSION, which doesn't contain the suffix.

Another place is python/ovstest/args.py, but, I suppose, we can ignore
it for now, as it seems to be broken anyway.  It just prints the
'@VERSION@' verbatim...

Best regards, Ilya Maximets.


Re: [ovs-dev] [PATCH ovn v2] ovn-controller: Reduce size of the SB monitor condition.

2023-06-21 Thread Han Zhou
On Wed, Jun 21, 2023 at 2:48 AM Dumitru Ceara  wrote:
>
> We don't need to explicitly add port bindings that were already bound
> locally.  We implicitly get those because we monitor the datapaths
> they're attached to.
>
> When performing an ovn-heater 500-node density-heavy scale test [0], with
> conditional monitoring enabled, the unreasonably long poll intervals on
> the Southbound database (the ones that took more than one second) are
> measured as:
>
>   ------------------------------------------------------------
>    Count     Min       Max      Median     Mean   95 percentile
>   ------------------------------------------------------------
>     56.0   1010.0    2664.0    1455.5    1544.9       2163.0
>     77.0   1015.0    3460.0    1896.0    1858.2       2907.8
>     69.0   1010.0    3118.0    1618.0    1688.1       2582.4
>   ------------------------------------------------------------
>    202.0   1010.0    3460.0    1610.0    1713.3       2711.5
>
> Compared to the baseline results (also with conditional monitoring
> enabled):
>   ------------------------------------------------------------
>    Count     Min       Max      Median     Mean   95 percentile
>   ------------------------------------------------------------
>    141.0   1003.0   18338.0    1721.0    2625.4       7626.0
>    151.0   1019.0   80297.0    1808.0    3410.7       9089.0
>    165.0   1007.0   50736.0    2354.0    3067.7       7309.8
>   ------------------------------------------------------------
>    457.0   1003.0   80297.0    2131.0    3044.6       7799.6
>
> We see significant improvement on the server side.  This is explained
> by the fact that now the Southbound database server doesn't need to
> apply as many conditions as before when filtering individual monitor
> contents.
>
> Note: Sub-ports - OVN switch ports with parent_port set - have to be
> monitored unconditionally as we cannot efficiently determine their local
> datapaths without including all local OVS interfaces in the monitor.
>
> [0] https://github.com/ovn-org/ovn-heater/blob/main/test-scenarios/ocp-500-density-heavy.yml
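>
> (For context, a hedged illustration of why this shrinks the condition:
> instead of one clause per locally bound port, the Port_Binding monitor
> condition only needs per-datapath clauses of the OVSDB clause shape
> ["column", "function", value], e.g.
>
>     ["datapath", "==", ["uuid", "<local datapath uuid>"]]
>
> with the UUID a placeholder here, one clause per local datapath.)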
>
> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2139194
> Reported-by: Ilya Maximets 
> Signed-off-by: Dumitru Ceara 
> ---
> V2:
> - Addressed Han's comments:
>   - changed commit log: removed assumption about sub-ports.
>   - added ovn-nb documentation note about sub-ports monitoring.
>   - added ovn-controller documentation note about sub-ports monitoring.
>   - added TODO item.
> - Added NEWS item.
> ---
>  NEWS|  3 +++
>  TODO.rst|  6 ++
>  controller/binding.c|  2 +-
>  controller/binding.h|  2 +-
>  controller/ovn-controller.8.xml |  6 ++
>  controller/ovn-controller.c | 19 ---
>  ovn-nb.xml  |  6 +-
>  7 files changed, 38 insertions(+), 6 deletions(-)
>
> diff --git a/NEWS b/NEWS
> index 482970eeeb..8275877f99 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -7,6 +7,9 @@ Post v23.06.0
>  DHCPv4 "hostname" (12) option.
>  - Support to create/update MAC_Binding when GARP received from VTEP (RAMP)
>    switch on l3dgw port.
> +  - To allow optimizing ovn-controller's monitor conditions for the regular
> +    VIF case, ovn-controller now unconditionally monitors all sub-ports
> +    (ports with parent_port set).
>
>  OVN v23.06.0 - 01 Jun 2023
>  --
> diff --git a/TODO.rst b/TODO.rst
> index dff163e70b..b790a9fadf 100644
> --- a/TODO.rst
> +++ b/TODO.rst
> @@ -124,3 +124,9 @@ OVN To-do List
>  * Load Balancer templates
>
>    * Support combining the VIP IP and port into a single template variable.
> +
> +* ovn-controller conditional monitoring
> +
> +  * Improve sub-ports (with parent_port set) conditional monitoring; these
> +    are currently unconditionally monitored, even if ovn-monitor-all is
> +    set to false.
> diff --git a/controller/binding.c b/controller/binding.c
> index 486095ca79..359ad6d660 100644
> --- a/controller/binding.c
> +++ b/controller/binding.c
> @@ -785,7 +785,7 @@ local_binding_data_destroy(struct local_binding_data *lbinding_data)
>  }
>
>  const struct sbrec_port_binding *
> -local_binding_get_primary_pb(struct shash *local_bindings,
> +local_binding_get_primary_pb(const struct shash *local_bindings,
>   const char *pb_name)
>  {
>  struct local_binding *lbinding =
> diff --git a/controller/binding.h b/controller/binding.h
> index 0e57f02ee5..319cbd263c 100644
> --- a/controller/binding.h
> +++ b/controller/binding.h
> @@ -150,7 +150,7 @@ void local_binding_data_init(struct local_binding_data *);
>  void local_binding_data_destroy(struct local_binding_data *);
>
>  const struct sbrec_port_binding *local_binding_get_primary_pb(
> -struct shash *local_bindings, const char *pb_name);
> 

Re: [ovs-dev] Request adding long poll interval metrics to coverage metrics

2023-06-21 Thread Aaron Conole
Eelco Chaudron  writes:

> On 20 Jun 2023, at 16:57, Ilya Maximets wrote:
>
>> On 6/20/23 16:10, Aaron Conole wrote:
>>> Adrian Moreno  writes:
>>>
 On 6/19/23 10:36, Eelco Chaudron wrote:
> On 16 Jun 2023, at 19:19, Aaron Conole wrote:
>
>> Martin Kennelly  writes:
>>
>>> Hey ovs community,
>>>
>>> I am a developer working on ovn-kubernetes and I want to
>>> programmatically consume long poll information
>>> i.e:
>>> ovs|00211|timeval(handler25)|WARN|Unreasonably long 52388ms poll
>>> interval (752ms user, 209ms system)
>>>
>>> This is currently exposed via journal logs but it's not practical
>>> to consume it there programmatically and I was
>>> hoping you could add it to coverage metrics.
>>
>> I think it could be useful.  I do want to be careful about exposing
>> these kinds of data in a way that could be misinterpreted.  Already,
>> that log in particular gets misinterpreted quite a bit, and RH gets
>> customers claiming OVS is misbehaving when they've oversubscribed the
>> system.
> +1
>

 Maybe it's a good time to start documenting coverage counters?
>>>
>>> I agree - we should have at least some kind of documentation.  Actually,
>>> it would be really nice if we could do something during
>>> COVERAGE_DEFINE() that would be like:
>>>
>>>   COVERAGE_DEFINE(ctr, "description")
>>>
>>> and then we can generate documentation from the COVERAGE_DEFINE()s as
>>> well as querying for it with an ovs-appctl command.
>
> I think this might be the only way to keep the doc in sync, however,
> who is going to document the 320+ existing ones?

That becomes a bit of a retro-fit challenge I think.  Regardless, maybe
the effort would be worth it.

>>> That might be trying to be too fancy, though.
>>
>>
>> The problem with documenting coverage counters is that we can't
>> and shouldn't claim that these are in any way a stable API.
>> They are mainly for developers.  Code can change and there might
>> be no meaningful way preserve current counters or their names.
>> They can change in each release including minor ones without
>> prior notice.
>
> +1000, as even if the code directly near the counter does not change, the
> actual counter could still be impacted.  Which I have seen before.
>
>> Best regards, Ilya Maximets.
>>
>>>
>> Mechanically, it would be pretty simple to do something like:
>>
>> ---
>> diff --git a/lib/timeval.c b/lib/timeval.c
>> index 193c7bab17..00e5f2a74d 100644
>> --- a/lib/timeval.c
>> +++ b/lib/timeval.c
>> @@ -40,6 +40,7 @@
>>   #include "openvswitch/vlog.h"
>>
>>   VLOG_DEFINE_THIS_MODULE(timeval);
>> +COVERAGE_DEFINE(long_poll_interval);
>>
>>   #if !defined(HAVE_CLOCK_GETTIME)
>>   typedef unsigned int clockid_t;
>> @@ -645,6 +646,8 @@ log_poll_interval(long long int last_wakeup)
>>   struct rusage rusage;
>>
>>   if (!getrusage_thread()) {
>> +COVERAGE_INC(long_poll_interval);
>> +
>>   VLOG_WARN("Unreasonably long %lldms poll interval"
>> " (%lldms user, %lldms system)",
>> interval,
>> ---
>>
>> This would at least expose the coverage data via the coverage framework
>> and it can be queried via ovs-appctl.  Actually, the advantage here is
>> that the coverage counter can track some details about X/sec over the
>> last 5 seconds, minute, hour, in addition to the total, so we can see
>> whether the condition is ongoing.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] Request adding long poll interval metrics to coverage metrics

2023-06-21 Thread Aaron Conole
Ilya Maximets  writes:

> On 6/20/23 16:10, Aaron Conole wrote:
>> Adrian Moreno  writes:
>> 
>>> On 6/19/23 10:36, Eelco Chaudron wrote:
 On 16 Jun 2023, at 19:19, Aaron Conole wrote:

> Martin Kennelly  writes:
>
>> Hey ovs community,
>>
>> I am a developer working on ovn-kubernetes and I want to
>> programmatically consume long poll information
>> i.e:
>> ovs|00211|timeval(handler25)|WARN|Unreasonably long 52388ms poll
>> interval (752ms user, 209ms system)
>>
>> This is currently exposed via journal logs but it's not practical
>> to consume it there programmatically and I was
>> hoping you could add it to coverage metrics.
>
> I think it could be useful.  I do want to be careful about exposing
> these kinds of data in a way that could be misinterpreted.  Already,
> that log in particular gets misinterpreted quite a bit, and RH gets
> customers claiming OVS is misbehaving when they've oversubscribed the
> system.
 +1

>>>
>>> Maybe it's a good time to start documenting coverage counters?
>> 
>> I agree - we should have at least some kind of documentation.  Actually,
>> it would be really nice if we could do something during
>> COVERAGE_DEFINE() that would be like:
>> 
>>   COVERAGE_DEFINE(ctr, "description")
>> 
>> and then we can generate documentation from the COVERAGE_DEFINE()s as
>> well as querying for it with an ovs-appctl command.
>> 
>> That might be trying to be too fancy, though.
>
>
> The problem with documenting coverage counters is that we can't
> and shouldn't claim that these are in any way a stable API.
> They are mainly for developers.  Code can change and there might
> be no meaningful way to preserve current counters or their names.
> They can change in each release including minor ones without
> prior notice.

Agreed that these are developer-focused counters, and not proper KPIs.
Even with that, they can still give users some useful information to
take back to the developers.  For example, we can see how often this
particular event is happening and correlate it with system issues.  I
guess that is why I prefer the in-code documentation over some kind of
"traditional" man-page for these as well.

> Best regards, Ilya Maximets.
>
>> 
> Mechanically, it would be pretty simple to do something like:
>
> ---
> diff --git a/lib/timeval.c b/lib/timeval.c
> index 193c7bab17..00e5f2a74d 100644
> --- a/lib/timeval.c
> +++ b/lib/timeval.c
> @@ -40,6 +40,7 @@
>   #include "openvswitch/vlog.h"
>
>   VLOG_DEFINE_THIS_MODULE(timeval);
> +COVERAGE_DEFINE(long_poll_interval);
>
>   #if !defined(HAVE_CLOCK_GETTIME)
>   typedef unsigned int clockid_t;
> @@ -645,6 +646,8 @@ log_poll_interval(long long int last_wakeup)
>   struct rusage rusage;
>
>   if (!getrusage_thread()) {
> +COVERAGE_INC(long_poll_interval);
> +
>   VLOG_WARN("Unreasonably long %lldms poll interval"
> " (%lldms user, %lldms system)",
> interval,
> ---
>
> This would at least expose the coverage data via the coverage framework
> and it can be queried via ovs-appctl.  Actually, the advantage here is
> that the coverage counter can track some details about X/sec over the
> last 5 seconds, minute, hour, in addition to the total, so we can see
> whether the condition is ongoing.
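
The per-interval rates mentioned above can be modeled with a ring
buffer of periodic samples; this is a simplified sketch, not the
actual lib/coverage.c implementation:

---
#include <stdint.h>

#define SAMPLES 60                 /* Keep one minute of history. */

struct counter_rate {
    uint64_t samples[SAMPLES];     /* Ring buffer of cumulative counts. */
    unsigned head;                 /* Index of the newest sample. */
};

/* Call once per second with the current cumulative count. */
static void
counter_rate_sample(struct counter_rate *cr, uint64_t total)
{
    cr->head = (cr->head + 1) % SAMPLES;
    cr->samples[cr->head] = total;
}

/* Average events per second over the last 'secs' seconds
 * (secs must be smaller than SAMPLES). */
static double
counter_rate_get(const struct counter_rate *cr, unsigned secs)
{
    unsigned oldest = (cr->head + SAMPLES - secs) % SAMPLES;
    return (double) (cr->samples[cr->head] - cr->samples[oldest]) / secs;
}
---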

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] dpif: Add coverage counters for dpif_operate() failures.

2023-06-21 Thread Aaron Conole
Eelco Chaudron  writes:

> On 20 Jun 2023, at 16:17, Aaron Conole wrote:
>
>> Eelco Chaudron  writes:
>>
>>> Add additional error coverage counters for dpif operation failures.
>>> This could help to quickly identify netlink problems when communicating
>>> with the OVS kernel module.
>>>
>>> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2070630
>>> Signed-off-by: Eelco Chaudron 
>>> ---
>>
>> One thing that made this difficult to review is the re-ordering of the
>> COVERAGE_DEFINE()s.  Maybe it would be better to have a separate patch
>> that corrects the order to alphabetical, and then a patch that adds the
>> new counters for flow put/del/get/execute.  WDYT?
>
> Based on the patch size, I decided not to do this, but if others also
> feel it needs to be split up, I can do this.

Either way:

Acked-by: Aaron Conole 

> //Eelco
>
>
>>>  lib/dpif.c |   39 ++-
>>>  1 file changed, 26 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/lib/dpif.c b/lib/dpif.c
>>> index 3305401fe..b1cbf39c4 100644
>>> --- a/lib/dpif.c
>>> +++ b/lib/dpif.c
>>> @@ -55,18 +55,22 @@
>>>  VLOG_DEFINE_THIS_MODULE(dpif);
>>>
>>>  COVERAGE_DEFINE(dpif_destroy);
>>> -COVERAGE_DEFINE(dpif_port_add);
>>> -COVERAGE_DEFINE(dpif_port_del);
>>> +COVERAGE_DEFINE(dpif_execute);
>>> +COVERAGE_DEFINE(dpif_execute_error);
>>> +COVERAGE_DEFINE(dpif_execute_with_help);
>>> +COVERAGE_DEFINE(dpif_flow_del);
>>> +COVERAGE_DEFINE(dpif_flow_del_error);
>>>  COVERAGE_DEFINE(dpif_flow_flush);
>>>  COVERAGE_DEFINE(dpif_flow_get);
>>> +COVERAGE_DEFINE(dpif_flow_get_error);
>>>  COVERAGE_DEFINE(dpif_flow_put);
>>> -COVERAGE_DEFINE(dpif_flow_del);
>>> -COVERAGE_DEFINE(dpif_execute);
>>> -COVERAGE_DEFINE(dpif_purge);
>>> -COVERAGE_DEFINE(dpif_execute_with_help);
>>> -COVERAGE_DEFINE(dpif_meter_set);
>>> -COVERAGE_DEFINE(dpif_meter_get);
>>> +COVERAGE_DEFINE(dpif_flow_put_error);
>>>  COVERAGE_DEFINE(dpif_meter_del);
>>> +COVERAGE_DEFINE(dpif_meter_get);
>>> +COVERAGE_DEFINE(dpif_meter_set);
>>> +COVERAGE_DEFINE(dpif_port_add);
>>> +COVERAGE_DEFINE(dpif_port_del);
>>> +COVERAGE_DEFINE(dpif_purge);
>>>
>>>  static const struct dpif_class *base_dpif_classes[] = {
>>>  #if defined(__linux__) || defined(_WIN32)
>>> @@ -1381,8 +1385,11 @@ dpif_operate(struct dpif *dpif, struct dpif_op 
>>> **ops, size_t n_ops,
>>>
>>>  COVERAGE_INC(dpif_flow_put);
>>>  log_flow_put_message(dpif, &this_module, put, error);
>>> -if (error && put->stats) {
>>> -memset(put->stats, 0, sizeof *put->stats);
>>> +if (error) {
>>> +COVERAGE_INC(dpif_flow_put_error);
>>> +if (put->stats) {
>>> +memset(put->stats, 0, sizeof *put->stats);
>>> +}
>>>  }
>>>  break;
>>>  }
>>> @@ -1392,10 +1399,10 @@ dpif_operate(struct dpif *dpif, struct dpif_op 
>>> **ops, size_t n_ops,
>>>
>>>  COVERAGE_INC(dpif_flow_get);
>>>  if (error) {
>>> +COVERAGE_INC(dpif_flow_get_error);
>>>  memset(get->flow, 0, sizeof *get->flow);
>>>  }
>>>  log_flow_get_message(dpif, &this_module, get, error);
>>> -
>>>  break;
>>>  }
>>>
>>> @@ -1404,8 +1411,11 @@ dpif_operate(struct dpif *dpif, struct dpif_op 
>>> **ops, size_t n_ops,
>>>
>>>  COVERAGE_INC(dpif_flow_del);
>>>  log_flow_del_message(dpif, &this_module, del, error);
>>> -if (error && del->stats) {
>>> -memset(del->stats, 0, sizeof *del->stats);
>>> +if (error) {
>>> +COVERAGE_INC(dpif_flow_del_error);
>>> +if (del->stats) {
>>> +memset(del->stats, 0, sizeof *del->stats);
>>> +}
>>>  }
>>>  break;
>>>  }
>>> @@ -1414,6 +1424,9 @@ dpif_operate(struct dpif *dpif, struct dpif_op **ops, 
>>> size_t n_ops,
>>>  COVERAGE_INC(dpif_execute);
>>>  log_execute_message(dpif, &this_module, &op->execute,
>>>  false, error);
>>> +if (error) {
>>> +COVERAGE_INC(dpif_execute_error);
>>> +}
>>>  break;
>>>  }
>>>  }

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] MAINTAINERS: Add Eelco Chaudron

2023-06-21 Thread Ilya Maximets
On 6/21/23 15:00, Eelco Chaudron wrote:
> 
> 
> On 21 Jun 2023, at 14:50, Simon Horman wrote:
> 
>> Eelco Chaudron was elected by the Open vSwitch committers yesterday.
>> This formalises his status as an Open vSwitch committer.
>>
>> Welcome Eelco!
> 
> Not sure if I should ack ;)
> 
> Acked-by: Eelco Chaudron 
> 
>> Signed-off-by: Simon Horman 

Applied.

And Welcome!

Best regards, Ilya Maximets.

>> ---
>>  MAINTAINERS.rst | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/MAINTAINERS.rst b/MAINTAINERS.rst
>> index 85b8e6416584..2fc2517177e1 100644
>> --- a/MAINTAINERS.rst
>> +++ b/MAINTAINERS.rst
>> @@ -45,6 +45,8 @@ This is the current list of active Open vSwitch committers:
>>   - aserd...@ovn.org
>> * - Ansis Atteka
>>   - ansisatt...@gmail.com
>> +   * - Eelco Chaudron
>> + - echau...@redhat.com
>> * - Ian Stokes
>>   - isto...@ovn.org
>> * - Ilya Maximets
>> -- 
>> 2.30.2
> 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] MAINTAINERS: Add Eelco Chaudron

2023-06-21 Thread Eelco Chaudron



On 21 Jun 2023, at 14:50, Simon Horman wrote:

> Eelco Chaudron was elected by the Open vSwitch committers yesterday.
> This formalises his status as an Open vSwitch committer.
>
> Welcome Eelco!

Not sure if I should ack ;)

Acked-by: Eelco Chaudron 

> Signed-off-by: Simon Horman 
> ---
>  MAINTAINERS.rst | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/MAINTAINERS.rst b/MAINTAINERS.rst
> index 85b8e6416584..2fc2517177e1 100644
> --- a/MAINTAINERS.rst
> +++ b/MAINTAINERS.rst
> @@ -45,6 +45,8 @@ This is the current list of active Open vSwitch committers:
>   - aserd...@ovn.org
> * - Ansis Atteka
>   - ansisatt...@gmail.com
> +   * - Eelco Chaudron
> + - echau...@redhat.com
> * - Ian Stokes
>   - isto...@ovn.org
> * - Ilya Maximets
> -- 
> 2.30.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] MAINTAINERS: Add Eelco Chaudron

2023-06-21 Thread Alin Serdean
Welcome Eelco!
Acked-by: Alin Gabriel Serdean 

> 
> On 21 Jun 2023, at 14:50, Simon Horman  wrote:
> 
> Eelco Chaudron was elected by the Open vSwitch committers yesterday.
> This formalises his status as an Open vSwitch committer.
> 
> Welcome Eelco!
> 
> Signed-off-by: Simon Horman 
> ---
> MAINTAINERS.rst | 2 ++
> 1 file changed, 2 insertions(+)
> 
> diff --git a/MAINTAINERS.rst b/MAINTAINERS.rst
> index 85b8e6416584..2fc2517177e1 100644
> --- a/MAINTAINERS.rst
> +++ b/MAINTAINERS.rst
> @@ -45,6 +45,8 @@ This is the current list of active Open vSwitch committers:
>  - aserd...@ovn.org
>* - Ansis Atteka
>  - ansisatt...@gmail.com
> +   * - Eelco Chaudron
> + - echau...@redhat.com
>* - Ian Stokes
>  - isto...@ovn.org
>* - Ilya Maximets
> -- 
> 2.30.2
> 
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] MAINTAINERS: Add Eelco Chaudron

2023-06-21 Thread Simon Horman
Eelco Chaudron was elected by the Open vSwitch committers yesterday.
This formalises his status as an Open vSwitch committer.

Welcome Eelco!

Signed-off-by: Simon Horman 
---
 MAINTAINERS.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS.rst b/MAINTAINERS.rst
index 85b8e6416584..2fc2517177e1 100644
--- a/MAINTAINERS.rst
+++ b/MAINTAINERS.rst
@@ -45,6 +45,8 @@ This is the current list of active Open vSwitch committers:
  - aserd...@ovn.org
* - Ansis Atteka
  - ansisatt...@gmail.com
+   * - Eelco Chaudron
+ - echau...@redhat.com
* - Ian Stokes
  - isto...@ovn.org
* - Ilya Maximets
-- 
2.30.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] ovn-controller: Reduce size of the SB monitor condition.

2023-06-21 Thread Dumitru Ceara
On 6/20/23 21:01, Han Zhou wrote:
> On Tue, Jun 20, 2023 at 12:48 AM Dumitru Ceara  wrote:
>>
>> On 6/20/23 03:49, Han Zhou wrote:
>>> On Tue, Jun 6, 2023 at 7:57 AM Dumitru Ceara  wrote:

 We don't need to explicitly add port bindings that were already bound
 locally.  We implicitly get those because we monitor the datapaths
 they're attached to.

 When performing an ovn-heater 500-node density-heavy scale test [0], with
 conditional monitoring enabled, the unreasonably long poll intervals on
 the Southbound database (the ones that took more than one second) are
 measured as:

   -------------------------------------------------------------
    Count      Min       Max     Median     Mean  95 percentile
   -------------------------------------------------------------
     56.0   1010.0    2664.0    1455.5   1544.9         2163.0
     77.0   1015.0    3460.0    1896.0   1858.2         2907.8
     69.0   1010.0    3118.0    1618.0   1688.1         2582.4
   -------------------------------------------------------------
    202.0   1010.0    3460.0    1610.0   1713.3         2711.5

 Compared to the baseline results (also with conditional monitoring
 enabled):
   -------------------------------------------------------------
    Count      Min       Max     Median     Mean  95 percentile
   -------------------------------------------------------------
    141.0   1003.0   18338.0    1721.0   2625.4         7626.0
    151.0   1019.0   80297.0    1808.0   3410.7         9089.0
    165.0   1007.0   50736.0    2354.0   3067.7         7309.8
   -------------------------------------------------------------
    457.0   1003.0   80297.0    2131.0   3044.6         7799.6

 We see significant improvement on the server side.  This is explained
 by the fact that now the Southbound database server doesn't need to
 apply as many conditions as before when filtering individual monitor
 contents.

>>> Thanks Dumitru for the great improvement! This is very helpful for the high
>>> port-density environment.
>>> Just to make sure I understand the test result correctly, in [0], it shows
>>> 22500 pods and 500 nodes, so is it 45 pods per node?
>>>
>>
>> Yes, for density-heavy tests (load balancers are also configured) the
>> pod density is 45 per node.
>>
 Note: Sub-ports - OVN switch ports with parent_port set - have to be
 monitored unconditionally as we cannot efficiently determine their
> local
 datapaths without including all local OVS interfaces in the monitor.
 This, however, should not be a huge issue because the majority of ports
 are regular VIFs, not sub-ports.
>>>
>>> I am not sure if we can make such a conclusion. For the current ovn-k8s or
>>> environments similar to that, it shouldn't be a problem.
>>> However, for environments that model pods/containers as sub-ports of the VM
>>> VIFs, probably the majority of the ports would be sub-ports. This
>>> is what sub-ports are designed for, right?
>>
>> My impression was that this was one of the use cases for OpenStack and
>> that it's only one of the different ways of providing container
>> connectivity in a given deployment.  But I might be wrong.  I can remove
>> this sentence, it makes a lot of assumptions indeed.
>>
>>> So, I think this would be a significant change of data monitored for those
>>> environments. I'd suggest at least we should properly document the
>>> implication in the documents (such as ovn-monitor-all, and also the
>>> sub-port related parts). There may also be such users who prefer not
>>> monitoring all sub-ports (for efficiency of ovn-controller) sacrificing SB
>>> DB performance (probably they don't have very high port density so the
>>> conditional monitoring impact is not that big). I am not aware of any such
>>> users yet, but if they complain, we will have to provide a knob, if no
>>> better ideas.
>>>
>>
>> I agree, if really needed, we can easily add a knob.
>>
>> What do you think of the following incremental?  I can fold it in if it
>> looks good to you.
> 
> Thanks Dumitru. The below documentation looks good. In addition, I think we
> should add some notes in the ovn-nb.xml under the section <group title="Containers"> of Logical_Switch_Port, which is the place where the
> "sub-port" feature is described. Could you add it as well?
> 

To avoid an "incremental on top of another incremental" I posted v2.  I
also added a NEWS item as this is a user-visible change.

Please let me know what you think.

https://patchwork.ozlabs.org/project/ovn/patch/20230621094753.150072-1-dce...@redhat.com/

Thanks,
Dumitru

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn v2] ovn-controller: Reduce size of the SB monitor condition.

2023-06-21 Thread Dumitru Ceara
We don't need to explicitly add port bindings that were already bound
locally.  We implicitly get those because we monitor the datapaths
they're attached to.

When performing an ovn-heater 500-node density-heavy scale test [0], with
conditional monitoring enabled, the unreasonably long poll intervals on
the Southbound database (the ones that took more than one second) are
measured as:

  -------------------------------------------------------------
   Count      Min       Max     Median     Mean  95 percentile
  -------------------------------------------------------------
    56.0   1010.0    2664.0    1455.5   1544.9         2163.0
    77.0   1015.0    3460.0    1896.0   1858.2         2907.8
    69.0   1010.0    3118.0    1618.0   1688.1         2582.4
  -------------------------------------------------------------
   202.0   1010.0    3460.0    1610.0   1713.3         2711.5

Compared to the baseline results (also with conditional monitoring
enabled):
  -------------------------------------------------------------
   Count      Min       Max     Median     Mean  95 percentile
  -------------------------------------------------------------
   141.0   1003.0   18338.0    1721.0   2625.4         7626.0
   151.0   1019.0   80297.0    1808.0   3410.7         9089.0
   165.0   1007.0   50736.0    2354.0   3067.7         7309.8
  -------------------------------------------------------------
   457.0   1003.0   80297.0    2131.0   3044.6         7799.6

We see significant improvement on the server side.  This is explained
by the fact that now the Southbound database server doesn't need to
apply as many conditions as before when filtering individual monitor
contents.

Note: Sub-ports - OVN switch ports with parent_port set - have to be
monitored unconditionally as we cannot efficiently determine their local
datapaths without including all local OVS interfaces in the monitor.

[0] 
https://github.com/ovn-org/ovn-heater/blob/main/test-scenarios/ocp-500-density-heavy.yml
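
A hedged sketch of the resulting condition building; the helper names
below are modeled on the IDL-generated clause functions, and the exact
signatures are assumptions, not verbatim OVN code:

---
/* Build the Port_Binding monitor condition.  Locally bound ports need
 * no per-port clause: the per-datapath clauses already cover them.
 * Sub-ports cannot be matched by datapath cheaply, so a single clause
 * matches every port with parent_port set. */
static void
build_pb_condition(struct ovsdb_idl_condition *pb_cond,
                   const struct uuid *local_dp_uuids, size_t n_dps)
{
    for (size_t i = 0; i < n_dps; i++) {
        sbrec_port_binding_add_clause_datapath(pb_cond, OVSDB_F_EQ,
                                               &local_dp_uuids[i]);
    }
    /* parent_port != empty selects all sub-ports, unconditionally. */
    sbrec_port_binding_add_clause_parent_port(pb_cond, OVSDB_F_NE, NULL);
}
---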

Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2139194
Reported-by: Ilya Maximets 
Signed-off-by: Dumitru Ceara 
---
V2:
- Addressed Han's comments:
  - changed commit log: removed assumption about sub-ports.
  - added ovn-nb documentation note about sub-ports monitoring.
  - added ovn-controller documentation note about sub-ports monitoring.
  - added TODO item.
- Added NEWS item.
---
 NEWS|  3 +++
 TODO.rst|  6 ++
 controller/binding.c|  2 +-
 controller/binding.h|  2 +-
 controller/ovn-controller.8.xml |  6 ++
 controller/ovn-controller.c | 19 ---
 ovn-nb.xml  |  6 +-
 7 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/NEWS b/NEWS
index 482970eeeb..8275877f99 100644
--- a/NEWS
+++ b/NEWS
@@ -7,6 +7,9 @@ Post v23.06.0
 DHCPv4 "hostname" (12) option.
   - Support to create/update MAC_Binding when GARP received from VTEP (RAMP)
 switch on l3dgw port.
+  - To allow optimizing ovn-controller's monitor conditions for the regular
+VIF case, ovn-controller now unconditionally monitors all sub-ports
+(ports with parent_port set).
 
 OVN v23.06.0 - 01 Jun 2023
 --
diff --git a/TODO.rst b/TODO.rst
index dff163e70b..b790a9fadf 100644
--- a/TODO.rst
+++ b/TODO.rst
@@ -124,3 +124,9 @@ OVN To-do List
 * Load Balancer templates
 
   * Support combining the VIP IP and port into a single template variable.
+
+* ovn-controller conditional monitoring
+
+  * Improve sub-ports (with parent_port set) conditional monitoring; these
+are currently unconditionally monitored, even if ovn-monitor-all is
+set to false.
diff --git a/controller/binding.c b/controller/binding.c
index 486095ca79..359ad6d660 100644
--- a/controller/binding.c
+++ b/controller/binding.c
@@ -785,7 +785,7 @@ local_binding_data_destroy(struct local_binding_data 
*lbinding_data)
 }
 
 const struct sbrec_port_binding *
-local_binding_get_primary_pb(struct shash *local_bindings,
+local_binding_get_primary_pb(const struct shash *local_bindings,
  const char *pb_name)
 {
 struct local_binding *lbinding =
diff --git a/controller/binding.h b/controller/binding.h
index 0e57f02ee5..319cbd263c 100644
--- a/controller/binding.h
+++ b/controller/binding.h
@@ -150,7 +150,7 @@ void local_binding_data_init(struct local_binding_data *);
 void local_binding_data_destroy(struct local_binding_data *);
 
 const struct sbrec_port_binding *local_binding_get_primary_pb(
-struct shash *local_bindings, const char *pb_name);
+const struct shash *local_bindings, const char *pb_name);
 ofp_port_t local_binding_get_lport_ofport(const struct shash *local_bindings,
   const char *pb_name);
 
diff --git a/controller/ovn-controller.8.xml 

[ovs-dev] [PATCH v2 6/6] pmd.at: Add per pmd max sleep unit tests.

2023-06-21 Thread Kevin Traynor
Add unit tests for new per pmd options of pmd-sleep-max.

Signed-off-by: Kevin Traynor 
---
 tests/pmd.at | 152 +++
 1 file changed, 152 insertions(+)

diff --git a/tests/pmd.at b/tests/pmd.at
index 6040d558c..31b832e3e 100644
--- a/tests/pmd.at
+++ b/tests/pmd.at
@@ -77,4 +77,18 @@ m4_define([CHECK_DP_SLEEP_MAX], [
 ])
 
+dnl CHECK_PMD_SLEEP_MAX([core_id], [numa_id], [max_sleep], [+line])
+dnl
+dnl Checks max sleep time of each pmd with core_id.
+dnl Checking starts from line number 'line' in ovs-vswitchd.log .
+m4_define([CHECK_PMD_SLEEP_MAX], [
+PATTERN="PMD thread core   *[$1] NUMA  *[$2]: Max sleep request set to  *[$3] usecs."
+line_st=$4
+if [[ -z "$line_st" ]]
+then
+line_st="+0"
+fi
+OVS_WAIT_UNTIL([tail -n $line_st ovs-vswitchd.log | grep "$PATTERN"])
+])
+
 m4_define([SED_NUMA_CORE_PATTERN], ["s/\(numa_id \)[[0-9]]*\( core_id 
\)[[0-9]]*:/\1\2:/"])
 m4_define([DUMMY_NUMA], [--dummy-numa="0,0,0,0"])
@@ -1339,2 +1353,140 @@ PMD load based sleeps are enabled by default.
 OVS_VSWITCHD_STOP
 AT_CLEANUP
+
+AT_SETUP([PMD - per pmd sleep])
+OVS_VSWITCHD_START([add-port br0 p0 -- set Interface p0 type=dummy-pmd 
options:n_rxq=8 options:numa_id=1], [], [], [--dummy-numa 0,0,0,1,1,8,8])
+
+dnl Check system default
+CHECK_DP_SLEEP_MAX([0], [disabled], [])
+
+AT_CHECK([ovs-appctl dpif-netdev/pmd-sleep-show], [0], [dnl
+PMD max sleep request is 0 usecs by default.
+PMD load based sleeps are disabled by default.
+PMD thread core   0 NUMA  0: Max sleep request set to    0 usecs.
+PMD thread core   3 NUMA  1: Max sleep request set to    0 usecs.
+PMD thread core   5 NUMA  8: Max sleep request set to    0 usecs.
+])
+
+dnl only a dp default
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-sleep-max=200])
+CHECK_DP_SLEEP_MAX([200], [enabled], [+$LINENUM])
+CHECK_PMD_SLEEP_MAX([0], [0], [200], [+$LINENUM])
+CHECK_PMD_SLEEP_MAX([3], [1], [200], [+$LINENUM])
+CHECK_PMD_SLEEP_MAX([5], [8], [200], [+$LINENUM])
+
+AT_CHECK([ovs-appctl dpif-netdev/pmd-sleep-show], [0], [dnl
+PMD max sleep request is 200 usecs by default.
+PMD load based sleeps are enabled by default.
+PMD thread core   0 NUMA  0: Max sleep request set to  200 usecs.
+PMD thread core   3 NUMA  1: Max sleep request set to  200 usecs.
+PMD thread core   5 NUMA  8: Max sleep request set to  200 usecs.
+])
+
+dnl only per pmd
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set Open_vSwitch . 
other_config:pmd-sleep-max=3:300,0:100,5:400])
+CHECK_DP_SLEEP_MAX([0], [disabled], [+$LINENUM])
+CHECK_PMD_SLEEP_MAX([0], [0], [100], [+$LINENUM])
+CHECK_PMD_SLEEP_MAX([3], [1], [300], [+$LINENUM])
+CHECK_PMD_SLEEP_MAX([5], [8], [400], [+$LINENUM])
+
+
+AT_CHECK([ovs-appctl dpif-netdev/pmd-sleep-show], [0], [dnl
+PMD max sleep request is 0 usecs by default.
+PMD load based sleeps are disabled by default.
+PMD thread core   0 NUMA  0: Max sleep request set to  100 usecs.
+PMD thread core   3 NUMA  1: Max sleep request set to  300 usecs.
+PMD thread core   5 NUMA  8: Max sleep request set to  400 usecs.
+])
+
+dnl mix of not used default and per-pmd
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set Open_vSwitch . 
other_config:pmd-sleep-max=50,3:300,0:100,5:200])
+CHECK_DP_SLEEP_MAX([50], [enabled], [+$LINENUM])
+CHECK_PMD_SLEEP_MAX([0], [0], [100], [+$LINENUM])
+CHECK_PMD_SLEEP_MAX([3], [1], [300], [+$LINENUM])
+CHECK_PMD_SLEEP_MAX([5], [8], [200], [+$LINENUM])
+
+AT_CHECK([ovs-appctl dpif-netdev/pmd-sleep-show], [0], [dnl
+PMD max sleep request is 50 usecs by default.
+PMD load based sleeps are enabled by default.
+PMD thread core   0 NUMA  0: Max sleep request set to  100 usecs.
+PMD thread core   3 NUMA  1: Max sleep request set to  300 usecs.
+PMD thread core   5 NUMA  8: Max sleep request set to  200 usecs.
+])
+
+dnl remove a per-pmd entry and use default
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-sleep-max=50,3:300])
+CHECK_DP_SLEEP_MAX([50], [enabled], [+$LINENUM])
+CHECK_PMD_SLEEP_MAX([0], [0], [50], [+$LINENUM])
+CHECK_PMD_SLEEP_MAX([3], [1], [300], [+$LINENUM])
+CHECK_PMD_SLEEP_MAX([5], [8], [50], [+$LINENUM])
+
+AT_CHECK([ovs-appctl dpif-netdev/pmd-sleep-show], [0], [dnl
+PMD max sleep request is 50 usecs by default.
+PMD load based sleeps are enabled by default.
+PMD thread core   0 NUMA  0: Max sleep request set to   50 usecs.
+PMD thread core   3 NUMA  1: Max sleep request set to  300 usecs.
+PMD thread core   5 NUMA  8: Max sleep request set to   50 usecs.
+])
+
+dnl mix and change values
+
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-sleep-max=3:400,200])
+CHECK_DP_SLEEP_MAX([200], [enabled], [+$LINENUM])
+CHECK_PMD_SLEEP_MAX([0], [0], [200], [+$LINENUM])
+CHECK_PMD_SLEEP_MAX([3], [1], [400], [+$LINENUM])
+CHECK_PMD_SLEEP_MAX([5], [8], [200], [+$LINENUM])
+
+
+AT_CHECK([ovs-appctl dpif-netdev/pmd-sleep-show], [0], [dnl
+PMD max sleep request is 200 usecs by default.
+PMD load 

[ovs-dev] [PATCH v2 5/6] dpif-netdev: Add per pmd sleep config.

2023-06-21 Thread Kevin Traynor
Extend 'pmd-sleep-max' so that individual PMD thread cores
may have a specified max sleep request value.

Any PMD thread core without a value will use the datapath default
(no sleep request) or datapath global value set by the user.

To set PMD thread cores 8 and 9 to never request a load based sleep
and all other PMD thread cores to be able to request a max sleep of
50 usecs:

$ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=50,8:0,9:0

To set PMD thread cores 10 and 11 to request a max sleep of 100 usecs
and all other PMD thread cores to never request a sleep:

$ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=10:100,11:100

'pmd-sleep-show' can be used to dump the global and individual PMD thread
core max sleep request values.

Signed-off-by: Kevin Traynor 
---
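As background for the diff below: a bare number in the list is the
datapath default, while 'core:usecs' pairs override it per PMD.  A
hedged sketch of such parsing (names and error handling are
illustrative; the real parser is in lib/dpif-netdev.c):

---
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct pmd_sleep_cfg {
    unsigned core_id;       /* PMD thread core. */
    uint64_t max_sleep;     /* Max sleep request in usecs. */
};

/* Parse e.g. "50,8:0,9:0": default 50 usecs, cores 8 and 9 never sleep. */
static size_t
parse_pmd_sleep_list(const char *arg, uint64_t *default_sleep,
                     struct pmd_sleep_cfg *cfgs, size_t max_cfgs)
{
    char *copy = strdup(arg);
    char *save = NULL;
    size_t n = 0;

    *default_sleep = 0;
    for (char *tok = strtok_r(copy, ",", &save); tok;
         tok = strtok_r(NULL, ",", &save)) {
        char *colon = strchr(tok, ':');
        if (!colon) {
            *default_sleep = strtoull(tok, NULL, 10);       /* "50"  */
        } else if (n < max_cfgs) {
            *colon = '\0';                                  /* "8:0" */
            cfgs[n].core_id = strtoul(tok, NULL, 10);
            cfgs[n].max_sleep = strtoull(colon + 1, NULL, 10);
            n++;
        }
    }
    free(copy);
    return n;
}
---
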
 Documentation/topics/dpdk/pmd.rst |  23 +++
 lib/dpif-netdev-private-thread.h  |   3 +
 lib/dpif-netdev.c | 225 --
 tests/pmd.at  |  32 ++---
 4 files changed, 252 insertions(+), 31 deletions(-)

diff --git a/Documentation/topics/dpdk/pmd.rst 
b/Documentation/topics/dpdk/pmd.rst
index 40e6b7843..eafcbc504 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -375,4 +375,27 @@ system configuration (e.g. enabling processor C-states) 
and workloads.
 rate.
 
+Max sleep request values can be set for individual PMDs using key:value pairs.
+Any PMD that has been assigned a specified value will use that. Any PMD that
+does not have a specified value will use the current global default.
+
+Specified values for individual PMDs can be added or removed at any time.
+
+For example, to set PMD thread cores 8 and 9 to never request a load based
+sleep and all other PMD cores to be able to request a max sleep of 50 usecs::
+
+$ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=50,8:0,9:0
+
+The max sleep request for each PMD can be checked in the logs or with::
+
+$ ovs-appctl dpif-netdev/pmd-sleep-show
+PMD max sleep request is 50 usecs by default.
+PMD load based sleeps are enabled by default.
+PMD thread core   8 NUMA  0: Max sleep request set to    0 usecs.
+PMD thread core   9 NUMA  1: Max sleep request set to    0 usecs.
+PMD thread core  10 NUMA  0: Max sleep request set to   50 usecs.
+PMD thread core  11 NUMA  1: Max sleep request set to   50 usecs.
+PMD thread core  12 NUMA  0: Max sleep request set to   50 usecs.
+PMD thread core  13 NUMA  1: Max sleep request set to   50 usecs.
+
 .. _ovs-vswitchd(8):
 http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
diff --git a/lib/dpif-netdev-private-thread.h b/lib/dpif-netdev-private-thread.h
index 1ec3cd794..5c72ce5d9 100644
--- a/lib/dpif-netdev-private-thread.h
+++ b/lib/dpif-netdev-private-thread.h
@@ -181,4 +181,7 @@ struct dp_netdev_pmd_thread {
 bool isolated;
 
+/* Max sleep request. UINT64_MAX indicates dp default should be used.*/
+atomic_uint64_t max_sleep;
+
 /* Queue id used by this pmd thread to send packets on all netdevs if
  * XPS disabled for this netdev. All static_tx_qid's are unique and less
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 88d25f1da..d9ee53ff5 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -180,4 +180,9 @@ static struct odp_support dp_netdev_support = {
 #define PMD_SLEEP_INC_US 1
 
+struct pmd_sleep {
+unsigned core_id;
+uint64_t max_sleep;
+};
+
 struct dpcls {
 struct cmap_node node;  /* Within dp_netdev_pmd_thread.classifiers */
@@ -290,4 +295,6 @@ struct dp_netdev {
 /* Max load based sleep request. */
 atomic_uint64_t pmd_max_sleep;
+/* Max load based sleep request user string. */
+char *max_sleep_list;
 /* Enable the SMC cache from ovsdb config */
 atomic_bool smc_enable_db;
@@ -1013,9 +1020,15 @@ pmd_max_sleep_show(struct ds *reply, struct 
dp_netdev_pmd_thread *pmd,
 {
 if (pmd->core_id != NON_PMD_CORE_ID) {
+uint64_t pmd_max_sleep;
+
+atomic_read_relaxed(&pmd->max_sleep, &pmd_max_sleep);
 ds_put_format(reply,
   "PMD thread core %3u NUMA %2d: "
   "Max sleep request set to",
   pmd->core_id, pmd->numa_id);
-ds_put_format(reply, " %4"PRIu64" usecs.", default_max_sleep);
+ds_put_format(reply, " %4"PRIu64" usecs.",
+  pmd_max_sleep == UINT64_MAX
+  ? default_max_sleep
+  : pmd_max_sleep);
 ds_put_cstr(reply, "\n");
 }
@@ -1528,7 +1541,8 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, 
const char *argv[],
 atomic_read_relaxed(&dp->pmd_max_sleep, &default_max_sleep);
 ds_put_format(&reply, "PMD max sleep request is %"PRIu64" "
-  "usecs.", default_max_sleep);
+  "usecs by default.", default_max_sleep);
 ds_put_cstr(&reply, "\n");
-

[ovs-dev] [PATCH v2 4/6] dpif-netdev: Remove pmd-sleep-max experimental tag.

2023-06-21 Thread Kevin Traynor
Signed-off-by: Kevin Traynor 
---
 Documentation/topics/dpdk/pmd.rst | 4 ++--
 NEWS  | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/topics/dpdk/pmd.rst 
b/Documentation/topics/dpdk/pmd.rst
index bedd42194..40e6b7843 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -325,6 +325,6 @@ reassignment due to PMD Auto Load Balance. For example, 
this could be set
 (in min) such that a reassignment is triggered at most every few hours.
 
-PMD load based sleeping (Experimental)
---
+PMD load based sleeping
+---
 
 PMD threads constantly poll Rx queues which are assigned to them. In order to
diff --git a/NEWS b/NEWS
index f8850d8a0..323d05c45 100644
--- a/NEWS
+++ b/NEWS
@@ -44,4 +44,5 @@ Post-v3.1.0
  * Add 'ovs-appctl dpif-netdev/pmd-sleep-show' command to get the
max sleep request setting of PMD thread cores.
+ * Remove experimental tag from PMD load based sleeping.
 
 
-- 
2.40.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 3/6] dpif-netdev: Add pmd-sleep-show command.

2023-06-21 Thread Kevin Traynor
The max requested sleep time and status for a PMD thread
are logged at start up or when changed, but it can be
convenient to have a command to dump this information
explicitly.

It is envisaged that this command will be expanded as
future additions are made.

Signed-off-by: Kevin Traynor 
---
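For orientation, the new command registration reduces to roughly the
following fragment; the usage string and argument bounds here are
assumptions (the sleep_aux declaration is visible in the hunk below):

---
static enum pmd_info_type sleep_aux = PMD_INFO_MAX_SLEEP_SHOW;

unixctl_command_register("dpif-netdev/pmd-sleep-show", "[dp]",
                         0, 1, dpif_netdev_pmd_info,
                         (void *) &sleep_aux);
---
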
 Documentation/topics/dpdk/pmd.rst |  4 
 NEWS  |  2 ++
 lib/dpif-netdev.c | 40 +++
 tests/pmd.at  | 29 ++
 4 files changed, 71 insertions(+), 4 deletions(-)

diff --git a/Documentation/topics/dpdk/pmd.rst 
b/Documentation/topics/dpdk/pmd.rst
index b261e9254..bedd42194 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -349,4 +349,8 @@ time not processing packets will be determined by the sleep 
and processor
 wake-up times and should be tested with each system configuration.
 
+The max sleep time that a PMD thread may request can be shown with::
+
+$ ovs-appctl dpif-netdev/pmd-sleep-show
+
 Sleep time statistics for 10 secs can be seen with::
 
diff --git a/NEWS b/NEWS
index f445377ad..f8850d8a0 100644
--- a/NEWS
+++ b/NEWS
@@ -42,4 +42,6 @@ Post-v3.1.0
table to check the status.
  * Rename 'pmd-maxsleep' other_config to 'pmd-sleep-max'.
+ * Add 'ovs-appctl dpif-netdev/pmd-sleep-show' command to get the
+   max sleep request setting of PMD thread cores.
 
 
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 5b53717e0..88d25f1da 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -702,4 +702,5 @@ enum pmd_info_type {
 PMD_INFO_SHOW_RXQ,/* Show poll lists of pmd threads. */
 PMD_INFO_PERF_SHOW,   /* Show pmd performance details. */
+PMD_INFO_MAX_SLEEP_SHOW,   /* Show max sleep performance details. */
 };
 
@@ -1007,4 +1008,18 @@ sorted_poll_thread_list(struct dp_netdev *dp,
 }
 
+static void
+pmd_max_sleep_show(struct ds *reply, struct dp_netdev_pmd_thread *pmd,
+   uint64_t default_max_sleep)
+{
+if (pmd->core_id != NON_PMD_CORE_ID) {
+ds_put_format(reply,
+  "PMD thread core %3u NUMA %2d: "
+  "Max sleep request set to",
+  pmd->core_id, pmd->numa_id);
+ds_put_format(reply, " %4"PRIu64" usecs.", default_max_sleep);
+ds_put_cstr(reply, "\n");
+}
+}
+
 static void
 dpif_netdev_subtable_lookup_get(struct unixctl_conn *conn, int argc OVS_UNUSED,
@@ -1442,5 +1457,6 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, 
const char *argv[],
 unsigned long long max_secs = (PMD_INTERVAL_LEN * PMD_INTERVAL_MAX)
   / INTERVAL_USEC_TO_SEC;
-bool first_show_rxq = true;
+bool first_pmd = true;
+uint64_t default_max_sleep = 0;
 
 ovs_mutex_lock(&dp_netdev_mutex);
@@ -1490,5 +1506,5 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, 
const char *argv[],
 }
 if (type == PMD_INFO_SHOW_RXQ) {
-if (first_show_rxq) {
+if (first_pmd) {
 if (!secs || secs > max_secs) {
 secs = max_secs;
@@ -1499,5 +1515,5 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, 
const char *argv[],
 ds_put_format(, "Displaying last %u seconds "
   "pmd usage %%\n", secs);
-first_show_rxq = false;
+first_pmd = false;
 }
 pmd_info_show_rxq(&reply, pmd, secs);
@@ -1508,4 +1524,16 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int 
argc, const char *argv[],
 } else if (type == PMD_INFO_PERF_SHOW) {
 pmd_info_show_perf(&reply, pmd, (struct pmd_perf_params *)aux);
+} else if (type == PMD_INFO_MAX_SLEEP_SHOW) {
+if (first_pmd) {
+atomic_read_relaxed(&dp->pmd_max_sleep, &default_max_sleep);
+ds_put_format(&reply, "PMD max sleep request is %"PRIu64" "
+  "usecs.", default_max_sleep);
+ds_put_cstr(&reply, "\n");
+ds_put_format(&reply, "PMD load based sleeps are %s.",
+  default_max_sleep ? "enabled" : "disabled");
+ds_put_cstr(&reply, "\n");
+first_pmd = false;
+}
+pmd_max_sleep_show(&reply, pmd, default_max_sleep);
 }
 }
@@ -1608,5 +1636,6 @@ dpif_netdev_init(void)
 static enum pmd_info_type show_aux = PMD_INFO_SHOW_STATS,
   clear_aux = PMD_INFO_CLEAR_STATS,
-  poll_aux = PMD_INFO_SHOW_RXQ;
+  poll_aux = PMD_INFO_SHOW_RXQ,
+  sleep_aux = PMD_INFO_MAX_SLEEP_SHOW;
 
 unixctl_command_register("dpif-netdev/pmd-stats-show", "[-pmd core] [dp]",
@@ -1620,4 +1649,7 @@ dpif_netdev_init(void)
  0, 5, dpif_netdev_pmd_info,
  (void *)&poll_aux);
+

[ovs-dev] [PATCH v2 1/6] dpif-netdev: Rename pmd-maxsleep config option.

2023-06-21 Thread Kevin Traynor
other_config:pmd-maxsleep is a config option to allow
PMD thread cores to sleep under low or no load conditions.

Rename it to 'pmd-sleep-max' to allow a more structured
name and so that additional options or commands can follow
the 'pmd-sleep-xyz' pattern.

Signed-off-by: Kevin Traynor 
---
 Documentation/topics/dpdk/pmd.rst |  2 +-
 NEWS  |  1 +
 lib/dpif-netdev.c |  2 +-
 tests/pmd.at  | 12 ++--
 vswitchd/vswitch.xml  |  2 +-
 5 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/Documentation/topics/dpdk/pmd.rst 
b/Documentation/topics/dpdk/pmd.rst
index e70986d16..b261e9254 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -335,5 +335,5 @@ This can be enabled by setting the max requested sleep time 
(in microseconds)
 for a PMD thread::
 
-$ ovs-vsctl set open_vswitch . other_config:pmd-maxsleep=50
+$ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=50
 
 With a non-zero max value a PMD may request to sleep by an incrementing amount
diff --git a/NEWS b/NEWS
index 66d5a4ea3..f445377ad 100644
--- a/NEWS
+++ b/NEWS
@@ -41,4 +41,5 @@ Post-v3.1.0
interfaces that support it.  See the 'status' column in the 'interface'
table to check the status.
+ * Rename 'pmd-maxsleep' other_config to 'pmd-sleep-max'.
 
 
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index abe63412e..5b53717e0 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -4983,5 +4983,5 @@ dpif_netdev_set_config(struct dpif *dpif, const struct 
smap *other_config)
 set_pmd_auto_lb(dp, autolb_state, log_autolb);
 
-pmd_max_sleep = smap_get_ullong(other_config, "pmd-maxsleep", 0);
+pmd_max_sleep = smap_get_ullong(other_config, "pmd-sleep-max", 0);
 pmd_max_sleep = MIN(PMD_RCU_QUIESCE_INTERVAL, pmd_max_sleep);
 atomic_read_relaxed(&dp->pmd_max_sleep, &cur_pmd_max_sleep);
diff --git a/tests/pmd.at b/tests/pmd.at
index 48f3d432d..374ad7217 100644
--- a/tests/pmd.at
+++ b/tests/pmd.at
@@ -1266,5 +1266,5 @@ OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD load 
based sleeps are disabled
 dnl Check low value max sleep
 get_log_next_line_num
-AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="1"])
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="1"])
 OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep 
request is 1 usecs."])
 OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based 
sleeps are enabled."])
@@ -1272,5 +1272,5 @@ OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep 
"PMD load based sleeps
 dnl Check high value max sleep
 get_log_next_line_num
-AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="10000"])
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="10000"])
 OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep 
request is 10000 usecs."])
 OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based 
sleeps are enabled."])
@@ -1278,5 +1278,5 @@ OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep 
"PMD load based sleeps
 dnl Check setting max sleep to zero
 get_log_next_line_num
-AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="0"])
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="0"])
 OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep 
request is 0 usecs."])
 OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based 
sleeps are disabled."])
@@ -1284,5 +1284,5 @@ OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep 
"PMD load based sleeps
 dnl Check above high value max sleep
 get_log_next_line_num
-AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="10001"])
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="10001"])
 OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep 
request is 10000 usecs."])
 OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based 
sleeps are enabled."])
@@ -1290,10 +1290,10 @@ OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | 
grep "PMD load based sleeps
 dnl Check rounding
 get_log_next_line_num
-AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="490"])
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="490"])
 OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep 
request is 490 usecs."])
 OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based 
sleeps are enabled."])
 dnl Check rounding
 get_log_next_line_num
-AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="499"])
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="499"])
 OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep 
request is 499 usecs."])
 OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based 
sleeps are enabled."])
diff --git 

[ovs-dev] [PATCH v2 2/6] pmd.at: Add macro for checking pmd sleep max time and state.

2023-06-21 Thread Kevin Traynor
This is just cosmetic. There is no change to the tests.

Signed-off-by: Kevin Traynor 
---
 tests/pmd.at | 39 ---
 1 file changed, 24 insertions(+), 15 deletions(-)

diff --git a/tests/pmd.at b/tests/pmd.at
index 374ad7217..4dd775bd3 100644
--- a/tests/pmd.at
+++ b/tests/pmd.at
@@ -61,4 +61,20 @@ m4_define([CHECK_PMD_THREADS_CREATED], [
 ])
 
+dnl CHECK_DP_SLEEP_MAX([max_sleep], [enabled], [+line])
+dnl
+dnl Checks correct pmd load based sleep is set for the datapath.
+dnl Checking starts from line number 'line' in ovs-vswitchd.log .
+m4_define([CHECK_DP_SLEEP_MAX], [
+SLEEP_TIME="PMD max sleep request is $1 usecs."
+SLEEP_STATE="PMD load based sleeps are $2."
+line_st=$3
+if [[ -z "$line_st" ]]
+then
+line_st="+0"
+fi
+OVS_WAIT_UNTIL([tail -n $line_st ovs-vswitchd.log | grep "$SLEEP_TIME"])
+OVS_WAIT_UNTIL([tail -n $line_st ovs-vswitchd.log | grep "$SLEEP_STATE"])
+])
+
 m4_define([SED_NUMA_CORE_PATTERN], ["s/\(numa_id \)[[0-9]]*\( core_id 
\)[[0-9]]*:/\1\2:/"])
 m4_define([DUMMY_NUMA], [--dummy-numa="0,0,0,0"])
@@ -1256,46 +1272,39 @@ OVS_VSWITCHD_STOP
 AT_CLEANUP
 
-dnl Check default state
 AT_SETUP([PMD - pmd sleep])
 OVS_VSWITCHD_START
 
 dnl Check default
-OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD max sleep request is 0 
usecs."])
-OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD load based sleeps are 
disabled."])
+CHECK_DP_SLEEP_MAX([0], [disabled], [])
 
 dnl Check low value max sleep
 get_log_next_line_num
 AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="1"])
-OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep 
request is 1 usecs."])
-OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based 
sleeps are enabled."])
+CHECK_DP_SLEEP_MAX([1], [enabled], [+$LINENUM])
 
 dnl Check high value max sleep
 get_log_next_line_num
 AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="10000"])
-OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep 
request is 10000 usecs."])
-OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based 
sleeps are enabled."])
+CHECK_DP_SLEEP_MAX([10000], [enabled], [+$LINENUM])
 
 dnl Check setting max sleep to zero
 get_log_next_line_num
 AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="0"])
-OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep 
request is 0 usecs."])
-OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based 
sleeps are disabled."])
+CHECK_DP_SLEEP_MAX([0], [disabled], [+$LINENUM])
 
 dnl Check above high value max sleep
 get_log_next_line_num
 AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="10001"])
-OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep 
request is 10000 usecs."])
-OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based 
sleeps are enabled."])
+CHECK_DP_SLEEP_MAX([10000], [enabled], [+$LINENUM])
 
 dnl Check rounding
 get_log_next_line_num
 AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="490"])
-OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep 
request is 490 usecs."])
-OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based 
sleeps are enabled."])
+CHECK_DP_SLEEP_MAX([490], [enabled], [+$LINENUM])
+
 dnl Check rounding
 get_log_next_line_num
 AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="499"])
-OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep 
request is 499 usecs."])
-OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based 
sleeps are enabled."])
+CHECK_DP_SLEEP_MAX([499], [enabled], [+$LINENUM])
 
 OVS_VSWITCHD_STOP
-- 
2.40.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 0/6] PMD load based sleep updates and per-pmd config.

2023-06-21 Thread Kevin Traynor
Patches 1-4 are about polishing the existing functionality
and preparing for new functionality later in the series.

Patch 1 renames 'pmd-maxsleep' to 'pmd-sleep-max'. I think
this is allowed as it is still experimental, but we can keep
'pmd-maxsleep' as well, if people think it's a problem.
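
If keeping 'pmd-maxsleep' were wanted, a fallback read would be cheap.
A hedged fragment (smap_get_ullong() is the existing helper; the
sentinel logic is only illustrative):

---
/* Prefer the new key; fall back to the deprecated name.  UINT64_MAX
 * serves as an "unset" sentinel here, illustrative only. */
uint64_t pmd_max_sleep = smap_get_ullong(other_config, "pmd-sleep-max",
                                         UINT64_MAX);
if (pmd_max_sleep == UINT64_MAX) {
    pmd_max_sleep = smap_get_ullong(other_config, "pmd-maxsleep", 0);
}
---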

Patches 5-6 add new per-pmd config option functionality and tests.
These can be considered separate from patches 1-4 but as there
are dependencies, it made sense to add all the patches to this
patch set.

v2
- 2/6 fixed UT macro
- 3/6 rebased NEWS conflict
- 3/6 moved pmd_max_sleep_show() location
- 6/6 removed incorrect check

GHA: https://github.com/kevintraynor/ovs/actions/runs/5325578666

Kevin Traynor (6):
  dpif-netdev: Rename pmd-maxsleep config option.
  pmd.at: Add macro for checking pmd sleep max time and state.
  dpif-netdev: Add pmd-sleep-show command.
  dpif-netdev: Remove pmd-sleep-max experimental tag.
  dpif-netdev: Add per pmd sleep config.
  pmd.at: Add per pmd max sleep unit tests.

 Documentation/topics/dpdk/pmd.rst |  33 +++-
 NEWS  |   4 +
 lib/dpif-netdev-private-thread.h  |   3 +
 lib/dpif-netdev.c | 259 --
 tests/pmd.at  | 232 +++---
 vswitchd/vswitch.xml  |   2 +-
 6 files changed, 492 insertions(+), 41 deletions(-)

-- 
2.40.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 3/6] dpif-netdev: Add pmd-sleep-show command.

2023-06-21 Thread Kevin Traynor

On 20/06/2023 18:02, Kevin Traynor wrote:

On 14/06/2023 16:49, David Marchand wrote:

On Wed, Jun 14, 2023 at 3:37 PM Kevin Traynor  wrote:


The max requested sleep time and status for a PMD thread
are logged at start up or when changed, but it can be
convenient to have a command to dump this information
explicitly.

It is envisaged that this command will be expanded as
future additions are made.

Signed-off-by: Kevin Traynor 
---
   Documentation/topics/dpdk/pmd.rst |  4 
   NEWS  |  2 ++
   lib/dpif-netdev.c | 40 +++
   tests/pmd.at  | 29 ++
   4 files changed, 71 insertions(+), 4 deletions(-)

diff --git a/Documentation/topics/dpdk/pmd.rst 
b/Documentation/topics/dpdk/pmd.rst
index b261e9254..bedd42194 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -349,4 +349,8 @@ time not processing packets will be determined by the sleep 
and processor
   wake-up times and should be tested with each system configuration.

+The max sleep time that a PMD thread may request can be shown with::
+
+$ ovs-appctl dpif-netdev/pmd-sleep-show
+
   Sleep time statistics for 10 secs can be seen with::

diff --git a/NEWS b/NEWS
index 1ccc6f948..32b9e8167 100644
--- a/NEWS
+++ b/NEWS
@@ -39,4 +39,6 @@ Post-v3.1.0
  - Userspace datapath:
* Rename 'pmd-maxsleep' other_config to 'pmd-sleep-max'.
+ * Add 'ovs-appctl dpif-netdev/pmd-sleep-show' command to get the
+   max sleep request setting of PMD thread cores.


diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 94c8ca943..dadf17b70 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -702,4 +702,5 @@ enum pmd_info_type {
   PMD_INFO_SHOW_RXQ,/* Show poll lists of pmd threads. */
   PMD_INFO_PERF_SHOW,   /* Show pmd performance details. */
+PMD_INFO_MAX_SLEEP_SHOW,   /* Show max sleep performance details. */
   };

@@ -884,4 +885,18 @@ sorted_poll_list(struct dp_netdev_pmd_thread *pmd, struct 
rxq_poll **list,
   }

+static void
+pmd_max_sleep_show(struct ds *reply, struct dp_netdev_pmd_thread *pmd,
+   uint64_t default_max_sleep)
+{
+if (pmd->core_id != NON_PMD_CORE_ID) {
+ds_put_format(reply,
+  "PMD thread core %3u NUMA %2d: "
+  "Max sleep request set to",
+  pmd->core_id, pmd->numa_id);
+ds_put_format(reply, " %4"PRIu64" usecs.", default_max_sleep);
+ds_put_cstr(reply, "\n");
+}
+}
+


Nit: I would move this new function later, and let pmd_info_show_rxq()
and sorted_poll_list() close to each other.



Will add to v2. I placed it after the other dpif_netdev_pmd_info()
related functions.




   static void
   pmd_info_show_rxq(struct ds *reply, struct dp_netdev_pmd_thread *pmd,
@@ -1442,5 +1457,6 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, 
const char *argv[],
   unsigned long long max_secs = (PMD_INTERVAL_LEN * PMD_INTERVAL_MAX)
 / INTERVAL_USEC_TO_SEC;
-bool first_show_rxq = true;
+bool first_pmd = true;
+uint64_t default_max_sleep = 0;


You can move this new variable in the only block that cares about it.



Yes, I can do that if I remove the initialization. It will work now, but
not being able to initialize feels like it makes the code a bit more
brittle, as that nuance could be missed if a default initialization were
ever needed in future (I don't have something in mind).

What do you think? I can go with your preference.



Discussed with David offline. Working in practice, but it's not fully
safe to rely on default_max_sleep keeping the read value across loop
iterations if we move the declaration. So will stick with the current
declaration.
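
To illustrate the scoping concern with a self-contained analogue (a
simplified stand-in, not the real dpif_netdev_pmd_info() loop):

---
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t dp_pmd_max_sleep = 200; /* Stand-in for dp->pmd_max_sleep. */
    uint64_t default_max_sleep = 0;  /* Must outlive the whole loop. */
    bool first_pmd = true;

    for (int pmd = 0; pmd < 3; pmd++) {
        if (first_pmd) {
            default_max_sleep = dp_pmd_max_sleep;   /* Read once. */
            first_pmd = false;
        }
        /* Later iterations rely on default_max_sleep holding the value
         * read in the first one; declared (and re-initialized) inside
         * the loop body, it would be reset to 0 on every iteration and
         * only the first pmd would print the real value. */
        printf("pmd %d: max sleep %" PRIu64 " usecs\n",
               pmd, default_max_sleep);
    }
    return 0;
}
---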






   ovs_mutex_lock(&dp_netdev_mutex);
@@ -1490,5 +1506,5 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, 
const char *argv[],
   }
   if (type == PMD_INFO_SHOW_RXQ) {
-if (first_show_rxq) {
+if (first_pmd) {
   if (!secs || secs > max_secs) {
   secs = max_secs;
@@ -1499,5 +1515,5 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, 
const char *argv[],
   ds_put_format(, "Displaying last %u seconds "
 "pmd usage %%\n", secs);
-first_show_rxq = false;
+first_pmd = false;
   }
   pmd_info_show_rxq(&reply, pmd, secs);
@@ -1508,4 +1524,16 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int 
argc, const char *argv[],
   } else if (type == PMD_INFO_PERF_SHOW) {
   pmd_info_show_perf(&reply, pmd, (struct pmd_perf_params *)aux);
+} else if (type == PMD_INFO_MAX_SLEEP_SHOW) {
+if (first_pmd) {
+atomic_read_relaxed(&dp->pmd_max_sleep, &default_max_sleep);
+ds_put_format(&reply, "PMD max sleep request is %"PRIu64" "
+

Re: [ovs-dev] [PATCH ovn 0/2] Update stale TODO items.

2023-06-21 Thread Dumitru Ceara
On 6/20/23 21:24, Han Zhou wrote:
> On Tue, Jun 20, 2023 at 2:19 AM Dumitru Ceara  wrote:
>>
>> Patch 1/2 removes the ones that are not applicable anymore.
>> Patch 2/2 addresses a TODO related to SB.Port_Group.name backwards
> compatibility.
>>
>> Dumitru Ceara (2):
>>   TODO.rst: Remove no longer applicable items.
>>   expr.c: Remove backwards compatibility lookup in parse_port_group().
>>
>>  TODO.rst   |  8 
>>  lib/expr.c | 12 
>>  2 files changed, 20 deletions(-)
>>
>> ___
>> dev mailing list
>> d...@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> 
> LGTM for both patches. Thanks Dumitru!
> 
> Acked-by: Han Zhou 
> 

Thanks!  I applied these to the main branch.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] Request adding long poll interval metrics to coverage metrics

2023-06-21 Thread Eelco Chaudron



On 20 Jun 2023, at 16:57, Ilya Maximets wrote:

> On 6/20/23 16:10, Aaron Conole wrote:
>> Adrian Moreno  writes:
>>
>>> On 6/19/23 10:36, Eelco Chaudron wrote:
 On 16 Jun 2023, at 19:19, Aaron Conole wrote:

> Martin Kennelly  writes:
>
>> Hey ovs community,
>>
>> I am a developer working on ovn-kubernetes and I want to
>> programmatically consume long poll information
>> i.e:
>> ovs|00211|timeval(handler25)|WARN|Unreasonably long 52388ms poll
>> interval (752ms user, 209ms system)
>>
>> This is currently exposed via journal logs but it's not practical
>> to consume it there programmatically and I was
>> hoping you could add it to coverage metrics.
>
> I think it could be useful.  I do want to be careful about exposing
> these kinds of data in a way that could be misinterpreted.  Already,
> that log in particular gets misinterpreted quite a bit, and RH gets
> customers claiming OVS is misbehaving when they've oversubscribed the
> system.
 +1

>>>
>>> Maybe it's a good time to start documenting coverage counters?
>>
>> I agree - we should have at least some kind of documentation.  Actually,
>> it would be really nice if we could do something during
>> COVERAGE_DEFINE() that would be like:
>>
>>   COVERAGE_DEFINE(ctr, "description")
>>
>> and then we can generate documentation from the COVERAGE_DEFINE()s as
>> well as querying for it with an ovs-appctl command.

I think this might be the only way to keep the doc in sync; however, who is
going to document the 320+ existing ones?

>> That might be trying to be too fancy, though.
>
>
> The problem with documenting coverage counters is that we can't
> and shouldn't claim that these are in any way a stable API.
> They are mainly for developers.  Code can change and there might
> be no meaningful way to preserve current counters or their names.
> They can change in each release including minor ones without
> prior notice.

+1000 as even if the direct code near the counter does not change, the actual
counter could still be impacted, which I have seen before.

> Best regards, Ilya Maximets.
>
>>
> Mechanically, it would be pretty simple to do something like:
>
> ---
> diff --git a/lib/timeval.c b/lib/timeval.c
> index 193c7bab17..00e5f2a74d 100644
> --- a/lib/timeval.c
> +++ b/lib/timeval.c
> @@ -40,6 +40,7 @@
>   #include "openvswitch/vlog.h"
>
>   VLOG_DEFINE_THIS_MODULE(timeval);
> +COVERAGE_DEFINE(long_poll_interval);
>
>   #if !defined(HAVE_CLOCK_GETTIME)
>   typedef unsigned int clockid_t;
> @@ -645,6 +646,8 @@ log_poll_interval(long long int last_wakeup)
>   struct rusage rusage;
>
>   if (!getrusage_thread()) {
> +COVERAGE_INC(long_poll_interval);
> +
>   VLOG_WARN("Unreasonably long %lldms poll interval"
> " (%lldms user, %lldms system)",
> interval,
> ---
>
> This would at least expose the coverage data via the coverage framework
> and it can be queried via ovs-appctl.  Actually, the advantage here is
> that the coverage counter can track some details about X/sec over the
> last 5 seconds, minute, hour, in addition to the total, so we can see
> whether the condition is ongoing.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev