Re: [ovs-dev] [PATCH v1] dpif-netdev: Update meter timestamp when dp_netdev_run_meter been called.
On 6/21/23 18:07, Ilya Maximets wrote: > On 5/31/23 17:41, mit...@outlook.com wrote: >> From: Lin Huang >> >> Currently, a meter's timestamp 'now' is set by 'pmd->ctx.now' which updated >> by pmd_thread_ctx_time_update(). >> >> Before processing of the new packet batch: >> - dpif_netdev_execute() >> - dp_netdev_process_rxq_port() >> >> There is a problem when user want to police the rate to a low pps by meter. >> For example, When the traffic is more than 3Mpps, policing the traffic to >> 1M pps, the real rate will be 3Mpps not 1Mpps. >> >> The key point is that a meter's timestamp isn't update in real time. >> For example, PMD thread A and B are polled at the same time (t1). >> Thread A starts to run meter to police traffic. At the same time, thread B >> pause at 'ovs_mutex_lock(>lock)' to wait thread A release the meter >> lock. This elapsed a lot of time, especially we have more than 10 PMD >> threads. > > Shouldn't the infrequent time update cause more drops instead? Also, having 10+ threads waiting on the same lock doesn't sound like a particularly efficient setup. If the locking here indeed a problem, maybe you can try the following patch in your scenario: https://patchwork.ozlabs.org/project/openvswitch/patch/20230621223220.2073152-1-i.maxim...@ovn.org/ ? > > Best regards, Ilya Maximets. > >> >> After thread A release the meter lock, thread B start to count the time diff. >> - long_delta_t = now - meter->used' --> long_delta_t = t1 - t1 = 0. >> - band->bucket += (uint64_t) delta_t * band->rate --> band->bucket = 0. >> - band_exceeded_pkt = band->bucket / 1000; --> band_exceeded_pkt = 0. >> >> Fix this problem by update the meter timestamp every time. >> >> Test-by: Zhang Yuhuang >> Signed-off-by: Lin Huang >> --- >> lib/dpif-netdev.c | 12 >> 1 file changed, 8 insertions(+), 4 deletions(-) >> >> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c >> index 70b953ae6..dbb275cf8 100644 >> --- a/lib/dpif-netdev.c >> +++ b/lib/dpif-netdev.c >> @@ -7169,12 +7169,13 @@ dpif_netdev_meter_get_features(const struct dpif * >> dpif OVS_UNUSED, >> * that exceed a band are dropped in-place. */ >> static void >> dp_netdev_run_meter(struct dp_netdev *dp, struct dp_packet_batch *packets_, >> -uint32_t meter_id, long long int now) >> +uint32_t meter_id) >> { >> struct dp_meter *meter; >> struct dp_meter_band *band; >> struct dp_packet *packet; >> long long int long_delta_t; /* msec */ >> +long long int now; >> uint32_t delta_t; /* msec */ >> const size_t cnt = dp_packet_batch_size(packets_); >> uint32_t bytes, volume; >> @@ -7197,8 +7198,12 @@ dp_netdev_run_meter(struct dp_netdev *dp, struct >> dp_packet_batch *packets_, >> memset(exceeded_rate, 0, cnt * sizeof *exceeded_rate); >> >> ovs_mutex_lock(>lock); >> + >> +/* Update now */ >> +now = time_msec(); >> + >> /* All packets will hit the meter at the same time. */ >> -long_delta_t = now / 1000 - meter->used / 1000; /* msec */ >> +long_delta_t = now - meter->used; /* msec */ >> >> if (long_delta_t < 0) { >> /* This condition means that we have several threads fighting for a >> @@ -9170,8 +9175,7 @@ dp_execute_cb(void *aux_, struct dp_packet_batch >> *packets_, >> } >> >> case OVS_ACTION_ATTR_METER: >> -dp_netdev_run_meter(pmd->dp, packets_, nl_attr_get_u32(a), >> -pmd->ctx.now); >> +dp_netdev_run_meter(pmd->dp, packets_, nl_attr_get_u32(a)); >> break; >> >> case OVS_ACTION_ATTR_PUSH_VLAN: > ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH] dpif-netdev: Lockless meters.
Current implementation of meters in the userspace datapath takes the meter lock for every packet batch. If more than one thread hits the flow with the same meter, they will lock each other. Replace the critical section with atomic operations to avoid interlocking. Meters themselves are RCU-protected, so it's safe to access them without holding a lock. Implementation does the following: 1. Tries to advance the 'used' timer of the meter with atomic compare+exchange if it's smaller than 'now'. 2. If the timer change succeeds, atomically update band buckets. 3. Atomically update packet statistics for a meter. 4. Go over buckets and try to atomically subtract the amount of packets or bytes, recording the highest exceeded band. 5. Atomically update band statistics and drop packets. Bucket manipulations are implemented with atomic compare+exchange operations with extra checks, because bucket size should never exceed the maximum and it should never go below zero. Packet statistics may be momentarily inconsistent, i.e., number of packets and the number of bytes may reflect different sets of packets. But it should be eventually consistent. And the difference at any given time should be in just few packets. For the sake of reduced code complexity PKTPS meter tries to push packets through the band one by one, even though they all have the same weight. This is also more fair if more than one thread is passing packets through the same band at the same time. Trying to predict the number of packets that can pass may also cause extra atomic operations reducing the performance. This implementation shows similar performance to the previous one, but should scale better with more threads hiting the same meter. Signed-off-by: Ilya Maximets --- @Lin Huang, if you can try this change on your setup, that would be great. NEWS | 2 + lib/dpif-netdev.c | 250 +- 2 files changed, 140 insertions(+), 112 deletions(-) diff --git a/NEWS b/NEWS index 66d5a4ea3..4a2b7dbca 100644 --- a/NEWS +++ b/NEWS @@ -37,6 +37,8 @@ Post-v3.1.0 - SRv6 Tunnel Protocol * Added support for userspace datapath (only). - Userspace datapath: + * Implementation of OpenFlow meters is now lockless allowing for better + multi-thread scalability. * IP and L4 checksum offload support is now enabled by default for interfaces that support it. See the 'status' column in the 'interface' table to check the status. diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index abe63412e..2fa556a62 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -212,21 +212,21 @@ static void dpcls_remove(struct dpcls *, struct dpcls_rule *); struct dp_meter_band { uint32_t rate; uint32_t burst_size; -uint64_t bucket; /* In 1/1000 packets (for PKTPS), or in bits (for KBPS) */ -uint64_t packet_count; -uint64_t byte_count; +atomic_uint64_t bucket; /* In 1/1000 packets for PKTPS, + * or in bits for KBPS. */ +atomic_uint64_t packet_count; +atomic_uint64_t byte_count; }; struct dp_meter { struct cmap_node node; -struct ovs_mutex lock; uint32_t id; uint16_t flags; uint16_t n_bands; uint32_t max_delta_t; -uint64_t used; -uint64_t packet_count; -uint64_t byte_count; +atomic_uint64_t used; /* Time of a last use in milliseconds. */ +atomic_uint64_t packet_count; +atomic_uint64_t byte_count; struct dp_meter_band bands[]; }; @@ -7165,22 +7165,56 @@ dpif_netdev_meter_get_features(const struct dpif * dpif OVS_UNUSED, features->max_color = 0; } +/* Tries to atomically add 'n' to 'value' in terms of saturation arithmetic, + * i.e., if the result will be larger than 'max_value', will store 'max_value' + * instead. */ +static void +atomic_sat_add(atomic_uint64_t *value, uint64_t n, uint64_t max_value) +{ +uint64_t current, new_value; + +atomic_read_relaxed(value, ); +do { +new_value = current + n; +new_value = MIN(new_value, max_value); +} while (!atomic_compare_exchange_weak_relaxed(value, , + new_value)); +} + +/* Tries to atomically subtract 'n' from 'value'. Does not perform the + * operation and returns 'false' if the result will be less than 'min_value'. + * Otherwise, stores the result and returns 'true'. */ +static bool +atomic_bound_sub(atomic_uint64_t *value, uint64_t n, uint64_t min_value) +{ +uint64_t current; + +atomic_read_relaxed(value, ); +do { +if (current < min_value + n) { +return false; +} +} while (!atomic_compare_exchange_weak_relaxed(value, , + current - n)); +return true; +} + /* Applies the meter identified by 'meter_id' to 'packets_'. Packets * that exceed a band are dropped in-place. */ static void
[ovs-dev] [PATCH v3 1/2] userspace: Respect tso/gso segment size.
From: Flavio Leitner Currently OVS will calculate the segment size based on the MTU of the egress port. That usually happens to be correct when the ports share the same MTU, but that is not always true. Therefore, if the segment size is provided, then use that and make sure the over sized packets are dropped. Signed-off-by: Flavio Leitner Co-authored-by: Mike Pattrick Signed-off-by: Mike Pattrick --- v3: Corrected sparse error --- lib/dp-packet.c| 3 ++ lib/dp-packet.h| 26 lib/netdev-dpdk.c | 12 +++- lib/netdev-linux.c | 76 -- 4 files changed, 94 insertions(+), 23 deletions(-) diff --git a/lib/dp-packet.c b/lib/dp-packet.c index 27114a9a9..4d92a2e5b 100644 --- a/lib/dp-packet.c +++ b/lib/dp-packet.c @@ -34,6 +34,7 @@ dp_packet_init__(struct dp_packet *b, size_t allocated, enum dp_packet_source so pkt_metadata_init(>md, 0); dp_packet_reset_cutlen(b); dp_packet_reset_offload(b); +dp_packet_set_tso_segsz(b, 0); /* Initialize implementation-specific fields of dp_packet. */ dp_packet_init_specific(b); /* By default assume the packet type to be Ethernet. */ @@ -199,6 +200,8 @@ dp_packet_clone_with_headroom(const struct dp_packet *buffer, size_t headroom) *dp_packet_ol_flags_ptr(new_buffer) = *dp_packet_ol_flags_ptr(buffer); *dp_packet_ol_flags_ptr(new_buffer) &= DP_PACKET_OL_SUPPORTED_MASK; +dp_packet_set_tso_segsz(new_buffer, dp_packet_get_tso_segsz(buffer)); + if (dp_packet_rss_valid(buffer)) { dp_packet_set_rss_hash(new_buffer, dp_packet_get_rss_hash(buffer)); } diff --git a/lib/dp-packet.h b/lib/dp-packet.h index 70ddf8aa4..6a924f3ff 100644 --- a/lib/dp-packet.h +++ b/lib/dp-packet.h @@ -126,6 +126,7 @@ struct dp_packet { uint32_t ol_flags; /* Offloading flags. */ uint32_t rss_hash; /* Packet hash. */ uint32_t flow_mark; /* Packet flow mark. */ +uint16_t tso_segsz; /* TCP TSO segment size. */ #endif enum dp_packet_source source; /* Source of memory allocated as 'base'. */ @@ -166,6 +167,9 @@ static inline void dp_packet_set_size(struct dp_packet *, uint32_t); static inline uint16_t dp_packet_get_allocated(const struct dp_packet *); static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t); +static inline uint16_t dp_packet_get_tso_segsz(const struct dp_packet *); +static inline void dp_packet_set_tso_segsz(struct dp_packet *, uint16_t); + void *dp_packet_resize_l2(struct dp_packet *, int increment); void *dp_packet_resize_l2_5(struct dp_packet *, int increment); static inline void *dp_packet_eth(const struct dp_packet *); @@ -644,6 +648,17 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s) b->mbuf.buf_len = s; } +static inline uint16_t +dp_packet_get_tso_segsz(const struct dp_packet *p) +{ +return p->mbuf.tso_segsz; +} + +static inline void +dp_packet_set_tso_segsz(struct dp_packet *p, uint16_t s) +{ +p->mbuf.tso_segsz = s; +} #else /* DPDK_NETDEV */ static inline void @@ -700,6 +715,17 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s) b->allocated_ = s; } +static inline uint16_t +dp_packet_get_tso_segsz(const struct dp_packet *p) +{ +return p->tso_segsz; +} + +static inline void +dp_packet_set_tso_segsz(struct dp_packet *p, uint16_t s) +{ +p->tso_segsz = s; +} #endif /* DPDK_NETDEV */ static inline void diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index 63dac689e..1c100c48e 100644 --- a/lib/netdev-dpdk.c +++ b/lib/netdev-dpdk.c @@ -2302,6 +2302,7 @@ netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf) if (mbuf->ol_flags & RTE_MBUF_F_TX_TCP_SEG) { struct tcp_header *th = dp_packet_l4(pkt); +int hdr_len; if (!th) { VLOG_WARN_RL(, "%s: TCP Segmentation without L4 header" @@ -2311,7 +2312,15 @@ netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf) mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4; mbuf->ol_flags |= RTE_MBUF_F_TX_TCP_CKSUM; +hdr_len = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len; mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len; +if (OVS_UNLIKELY((hdr_len + mbuf->tso_segsz) > dev->max_packet_len)) { +VLOG_WARN_RL(, "%s: Oversized TSO packet. " + "hdr: %"PRIu32", gso: %"PRIu32", max len: %"PRIu32"", + dev->up.name, hdr_len, mbuf->tso_segsz, + dev->max_packet_len); +return false; +} if (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) { mbuf->ol_flags |= RTE_MBUF_F_TX_IP_CKSUM; @@ -2586,7 +2595,8 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts, int cnt = 0; struct rte_mbuf *pkt; -/* Filter oversized packets, unless are marked for TSO. */ +/* Filter oversized packets. The TSO packets are
[ovs-dev] [PATCH v3 2/2] userspace: Add Generic Segmentation Offloading.
From: Flavio Leitner This provides a software implementation in the case the egress netdev doesn't support segmentation in hardware. The challenge here is to guarantee packet ordering in the original batch that may be full of TSO packets. Each TSO packet can go up to ~64kB, so with segment size of 1440 that means about 44 packets for each TSO. Each batch has 32 packets, so the total batch amounts to 1408 normal packets. The segmentation estimates the total number of packets and then the total number of batches. Then allocate enough memory and finally do the work. Finally each batch is sent in order to the netdev. Signed-off-by: Flavio Leitner Co-authored-by: Mike Pattrick Signed-off-by: Mike Pattrick --- lib/automake.mk | 2 + lib/dp-packet-gso.c | 172 lib/dp-packet-gso.h | 24 +++ lib/dp-packet.h | 11 +++ lib/netdev-dpdk.c | 46 lib/netdev-linux.c | 58 --- lib/netdev.c| 120 ++- lib/packets.c | 4 +- 8 files changed, 314 insertions(+), 123 deletions(-) create mode 100644 lib/dp-packet-gso.c create mode 100644 lib/dp-packet-gso.h diff --git a/lib/automake.mk b/lib/automake.mk index e64ee76ce..49a92958d 100644 --- a/lib/automake.mk +++ b/lib/automake.mk @@ -118,6 +118,8 @@ lib_libopenvswitch_la_SOURCES = \ lib/dpctl.h \ lib/dp-packet.h \ lib/dp-packet.c \ + lib/dp-packet-gso.c \ + lib/dp-packet-gso.h \ lib/dpdk.h \ lib/dpif-netdev-extract-study.c \ lib/dpif-netdev-lookup.h \ diff --git a/lib/dp-packet-gso.c b/lib/dp-packet-gso.c new file mode 100644 index 0..bc72e2f90 --- /dev/null +++ b/lib/dp-packet-gso.c @@ -0,0 +1,172 @@ +/* + * Copyright (c) 2021 Red Hat, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include +#include +#include + +#include "coverage.h" +#include "dp-packet.h" +#include "dp-packet-gso.h" +#include "netdev-provider.h" +#include "openvswitch/vlog.h" + +VLOG_DEFINE_THIS_MODULE(dp_packet_gso); + +COVERAGE_DEFINE(soft_seg_good); + +/* Retuns a new packet that is a segment of packet 'p'. + * + * The new packet is initialized with 'hdr_len' bytes from the + * start of packet 'p' and then appended with 'data_len' bytes + * from the 'data' buffer. + * + * Note: The packet headers are not updated. */ +static struct dp_packet * +dp_packet_gso_seg_new(const struct dp_packet *p, size_t hdr_len, + const char *data, size_t data_len) +{ +struct dp_packet *seg = dp_packet_new_with_headroom(hdr_len + data_len, +dp_packet_headroom(p)); + +/* Append the original packet headers and then the payload. */ +dp_packet_put(seg, dp_packet_data(p), hdr_len); +dp_packet_put(seg, data, data_len); + +/* The new segment should have the same offsets. */ +seg->l2_5_ofs = p->l2_5_ofs; +seg->l3_ofs = p->l3_ofs; +seg->l4_ofs = p->l4_ofs; + +/* The protocol headers remain the same, so preserve hash and mark. */ +*dp_packet_rss_ptr(seg) = dp_packet_get_rss_hash(p); +*dp_packet_flow_mark_ptr(seg) = *dp_packet_flow_mark_ptr(p); + +/* The segment should inherit all the offloading flags from the + * original packet, except for the TCP segmentation, external + * buffer and indirect buffer flags. */ +*dp_packet_ol_flags_ptr(seg) = *dp_packet_ol_flags_ptr(p) +& ~(DP_PACKET_OL_TX_TCP_SEG | DP_PACKET_OL_EXTERNAL +| DP_PACKET_OL_INDIRECT); + +dp_packet_hwol_reset_tcp_seg(seg); + +return seg; +} + +/* Returns the calculated number of TCP segments in packet 'p'. */ +int +dp_packet_gso_nr_segs(struct dp_packet *p) +{ +uint16_t segsz = dp_packet_get_tso_segsz(p); +const char *data_tail; +const char *data_pos; +int n_segs; + +data_pos = dp_packet_get_tcp_payload(p); +data_tail = (char *) dp_packet_tail(p) - dp_packet_l2_pad_size(p); +data_pos = dp_packet_get_tcp_payload(p); +n_segs = DIV_ROUND_UP((data_tail - data_pos), segsz); + +return n_segs; + +} + +/* Perform software segmentation on packet 'p'. + * + * Returns all the segments added to the array of preallocated + * batches in 'batches' starting at batch position 'batch_pos'. */ +void +dp_packet_gso(struct dp_packet *p, struct dp_packet_batch *batches, + size_t *batch_pos) +{ +struct tcp_header *tcp_hdr; +struct
[ovs-dev] [PATCH v12] netdev-dpdk: Add custom rx-steering configuration.
Some control protocols are used to maintain link status between forwarding engines (e.g. LACP). When the system is not sized properly, the PMD threads may not be able to process all incoming traffic from the configured Rx queues. When a signaling packet of such protocols is dropped, it can cause link flapping, worsening the situation. Use the RTE flow API to redirect these protocols into a dedicated Rx queue. The assumption is made that the ratio between control protocol traffic and user data traffic is very low and thus this dedicated Rx queue will never get full. Re-program the RSS redirection table to only use the other Rx queues. The additional Rx queue will be assigned a PMD core like any other Rx queue. Polling that extra queue may introduce increased latency and a slight performance penalty at the benefit of preventing link flapping. This feature must be enabled per port on specific protocols via the rx-steering option. This option takes "rss" followed by a "+" separated list of protocol names. It is only supported on ethernet ports. This feature is experimental. If the user has already configured multiple Rx queues on the port, an additional one will be allocated for control packets. If the hardware cannot satisfy the number of requested Rx queues, the last Rx queue will be assigned for control plane. If only one Rx queue is available, the rx-steering feature will be disabled. If the hardware does not support the RTE flow matchers/actions, the rx-steering feature will be completely disabled on the port. It cannot be enabled when other_config:hw-offload=true as it may conflict with the offloaded RTE flows. Similarly, if hw-offload is enabled, custom rx-steering will be forcibly disabled on all ports. Example use: ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \ set interface phy0 type=dpdk options:dpdk-devargs=:ca:00.0 -- \ set interface phy0 options:rx-steering=rss+lacp -- \ set interface phy1 type=dpdk options:dpdk-devargs=:ca:00.1 -- \ set interface phy1 options:rx-steering=rss+lacp As a starting point, only one protocol is supported: LACP. Other protocols can be added in the future. NIC compatibility should be checked. To validate that this works as intended, I used a traffic generator to generate random traffic slightly above the machine capacity at line rate on a two ports bond interface. OVS is configured to receive traffic on two VLANs and pop/push them in a br-int bridge based on tags set on patch ports. +--+ | DUT | |++| || br-int || in_port=patch10,actions=mod_dl_src:$patch11,mod_dl_dst:$tgen1,output:patch11 |||| in_port=patch11,actions=mod_dl_src:$patch10,mod_dl_dst:$tgen0,output:patch10 || patch10patch11 || |+---|---|+| || | | |+---|---|+| || patch00patch01 || || tag:10tag:20 || |||| || br-phy || default flow, action=NORMAL |||| || bond0|| balance-slb, lacp=passive, lacp-time=fast ||phy0 phy1 || |+--|-|---+| +---|-|+ | | +---|-|+ | port0 port1 | balance L3/L4, lacp=active, lacp-time=fast | lag | mode trunk VLANs 10, 20 | | |switch| | | | vlan 10vlan 20 | mode access | port2 port3 | +-|--|-+ | | +-|--|-+ | tgen0 tgen1 | Random traffic that is properly balanced | | across the bond ports in both directions. | traffic generator | +--+ Without rx-steering, the bond0 links are randomly switching to "defaulted" when one of the LACP packets sent by the switch is dropped because the RX queues are full and the PMD threads did not process them fast enough. When that happens, all traffic must go through a single link which causes above line rate traffic to be dropped. ~# ovs-appctl lacp/show-stats bond0 bond0 statistics member: phy0: TX PDUs: 347246 RX PDUs: 14865 RX Bad PDUs: 0 RX Marker Request PDUs: 0 Link Expired: 168 Link Defaulted: 0 Carrier Status Changed: 0 member: phy1: TX PDUs: 347245 RX PDUs: 14919 RX Bad PDUs: 0 RX Marker Request PDUs: 0 Link Expired: 147 Link Defaulted: 1 Carrier Status Changed: 0 When rx-steering is enabled, no LACP packet is dropped and the bond links remain enabled at all times, maximizing the throughput. Neither the "Link Expired" nor the "Link Defaulted" counters are incremented anymore. This feature may be considered as "QoS". However, it does not work by limiting the rate of traffic explicitly. It only guarantees that some protocols have a lower chance of being dropped because the PMD cores cannot keep
[ovs-dev] [PATCH v2 2/2] userspace: Add Generic Segmentation Offloading.
From: Flavio Leitner This provides a software implementation in the case the egress netdev doesn't support segmentation in hardware. The challenge here is to guarantee packet ordering in the original batch that may be full of TSO packets. Each TSO packet can go up to ~64kB, so with segment size of 1440 that means about 44 packets for each TSO. Each batch has 32 packets, so the total batch amounts to 1408 normal packets. The segmentation estimates the total number of packets and then the total number of batches. Then allocate enough memory and finally do the work. Finally each batch is sent in order to the netdev. Signed-off-by: Flavio Leitner Co-authored-by: Mike Pattrick Signed-off-by: Mike Pattrick --- lib/automake.mk | 2 + lib/dp-packet-gso.c | 172 lib/dp-packet-gso.h | 24 +++ lib/dp-packet.h | 11 +++ lib/netdev-dpdk.c | 46 lib/netdev-linux.c | 58 --- lib/netdev.c| 120 ++- lib/packets.c | 4 +- 8 files changed, 314 insertions(+), 123 deletions(-) create mode 100644 lib/dp-packet-gso.c create mode 100644 lib/dp-packet-gso.h diff --git a/lib/automake.mk b/lib/automake.mk index e64ee76ce..49a92958d 100644 --- a/lib/automake.mk +++ b/lib/automake.mk @@ -118,6 +118,8 @@ lib_libopenvswitch_la_SOURCES = \ lib/dpctl.h \ lib/dp-packet.h \ lib/dp-packet.c \ + lib/dp-packet-gso.c \ + lib/dp-packet-gso.h \ lib/dpdk.h \ lib/dpif-netdev-extract-study.c \ lib/dpif-netdev-lookup.h \ diff --git a/lib/dp-packet-gso.c b/lib/dp-packet-gso.c new file mode 100644 index 0..bc72e2f90 --- /dev/null +++ b/lib/dp-packet-gso.c @@ -0,0 +1,172 @@ +/* + * Copyright (c) 2021 Red Hat, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include +#include +#include + +#include "coverage.h" +#include "dp-packet.h" +#include "dp-packet-gso.h" +#include "netdev-provider.h" +#include "openvswitch/vlog.h" + +VLOG_DEFINE_THIS_MODULE(dp_packet_gso); + +COVERAGE_DEFINE(soft_seg_good); + +/* Retuns a new packet that is a segment of packet 'p'. + * + * The new packet is initialized with 'hdr_len' bytes from the + * start of packet 'p' and then appended with 'data_len' bytes + * from the 'data' buffer. + * + * Note: The packet headers are not updated. */ +static struct dp_packet * +dp_packet_gso_seg_new(const struct dp_packet *p, size_t hdr_len, + const char *data, size_t data_len) +{ +struct dp_packet *seg = dp_packet_new_with_headroom(hdr_len + data_len, +dp_packet_headroom(p)); + +/* Append the original packet headers and then the payload. */ +dp_packet_put(seg, dp_packet_data(p), hdr_len); +dp_packet_put(seg, data, data_len); + +/* The new segment should have the same offsets. */ +seg->l2_5_ofs = p->l2_5_ofs; +seg->l3_ofs = p->l3_ofs; +seg->l4_ofs = p->l4_ofs; + +/* The protocol headers remain the same, so preserve hash and mark. */ +*dp_packet_rss_ptr(seg) = dp_packet_get_rss_hash(p); +*dp_packet_flow_mark_ptr(seg) = *dp_packet_flow_mark_ptr(p); + +/* The segment should inherit all the offloading flags from the + * original packet, except for the TCP segmentation, external + * buffer and indirect buffer flags. */ +*dp_packet_ol_flags_ptr(seg) = *dp_packet_ol_flags_ptr(p) +& ~(DP_PACKET_OL_TX_TCP_SEG | DP_PACKET_OL_EXTERNAL +| DP_PACKET_OL_INDIRECT); + +dp_packet_hwol_reset_tcp_seg(seg); + +return seg; +} + +/* Returns the calculated number of TCP segments in packet 'p'. */ +int +dp_packet_gso_nr_segs(struct dp_packet *p) +{ +uint16_t segsz = dp_packet_get_tso_segsz(p); +const char *data_tail; +const char *data_pos; +int n_segs; + +data_pos = dp_packet_get_tcp_payload(p); +data_tail = (char *) dp_packet_tail(p) - dp_packet_l2_pad_size(p); +data_pos = dp_packet_get_tcp_payload(p); +n_segs = DIV_ROUND_UP((data_tail - data_pos), segsz); + +return n_segs; + +} + +/* Perform software segmentation on packet 'p'. + * + * Returns all the segments added to the array of preallocated + * batches in 'batches' starting at batch position 'batch_pos'. */ +void +dp_packet_gso(struct dp_packet *p, struct dp_packet_batch *batches, + size_t *batch_pos) +{ +struct tcp_header *tcp_hdr; +struct
[ovs-dev] [PATCH v2 1/2] userspace: Respect tso/gso segment size.
From: Flavio Leitner Currently OVS will calculate the segment size based on the MTU of the egress port. That usually happens to be correct when the ports share the same MTU, but that is not always true. Therefore, if the segment size is provided, then use that and make sure the over sized packets are dropped. Signed-off-by: Flavio Leitner Co-authored-by: Mike Pattrick Signed-off-by: Mike Pattrick --- lib/dp-packet.c| 3 ++ lib/dp-packet.h| 26 lib/netdev-dpdk.c | 12 +++- lib/netdev-linux.c | 76 -- 4 files changed, 94 insertions(+), 23 deletions(-) diff --git a/lib/dp-packet.c b/lib/dp-packet.c index 27114a9a9..4d92a2e5b 100644 --- a/lib/dp-packet.c +++ b/lib/dp-packet.c @@ -34,6 +34,7 @@ dp_packet_init__(struct dp_packet *b, size_t allocated, enum dp_packet_source so pkt_metadata_init(>md, 0); dp_packet_reset_cutlen(b); dp_packet_reset_offload(b); +dp_packet_set_tso_segsz(b, 0); /* Initialize implementation-specific fields of dp_packet. */ dp_packet_init_specific(b); /* By default assume the packet type to be Ethernet. */ @@ -199,6 +200,8 @@ dp_packet_clone_with_headroom(const struct dp_packet *buffer, size_t headroom) *dp_packet_ol_flags_ptr(new_buffer) = *dp_packet_ol_flags_ptr(buffer); *dp_packet_ol_flags_ptr(new_buffer) &= DP_PACKET_OL_SUPPORTED_MASK; +dp_packet_set_tso_segsz(new_buffer, dp_packet_get_tso_segsz(buffer)); + if (dp_packet_rss_valid(buffer)) { dp_packet_set_rss_hash(new_buffer, dp_packet_get_rss_hash(buffer)); } diff --git a/lib/dp-packet.h b/lib/dp-packet.h index 70ddf8aa4..6a924f3ff 100644 --- a/lib/dp-packet.h +++ b/lib/dp-packet.h @@ -126,6 +126,7 @@ struct dp_packet { uint32_t ol_flags; /* Offloading flags. */ uint32_t rss_hash; /* Packet hash. */ uint32_t flow_mark; /* Packet flow mark. */ +uint16_t tso_segsz; /* TCP TSO segment size. */ #endif enum dp_packet_source source; /* Source of memory allocated as 'base'. */ @@ -166,6 +167,9 @@ static inline void dp_packet_set_size(struct dp_packet *, uint32_t); static inline uint16_t dp_packet_get_allocated(const struct dp_packet *); static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t); +static inline uint16_t dp_packet_get_tso_segsz(const struct dp_packet *); +static inline void dp_packet_set_tso_segsz(struct dp_packet *, uint16_t); + void *dp_packet_resize_l2(struct dp_packet *, int increment); void *dp_packet_resize_l2_5(struct dp_packet *, int increment); static inline void *dp_packet_eth(const struct dp_packet *); @@ -644,6 +648,17 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s) b->mbuf.buf_len = s; } +static inline uint16_t +dp_packet_get_tso_segsz(const struct dp_packet *p) +{ +return p->mbuf.tso_segsz; +} + +static inline void +dp_packet_set_tso_segsz(struct dp_packet *p, uint16_t s) +{ +p->mbuf.tso_segsz = s; +} #else /* DPDK_NETDEV */ static inline void @@ -700,6 +715,17 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s) b->allocated_ = s; } +static inline uint16_t +dp_packet_get_tso_segsz(const struct dp_packet *p) +{ +return p->tso_segsz; +} + +static inline void +dp_packet_set_tso_segsz(struct dp_packet *p, uint16_t s) +{ +p->tso_segsz = s; +} #endif /* DPDK_NETDEV */ static inline void diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index 63dac689e..1c100c48e 100644 --- a/lib/netdev-dpdk.c +++ b/lib/netdev-dpdk.c @@ -2302,6 +2302,7 @@ netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf) if (mbuf->ol_flags & RTE_MBUF_F_TX_TCP_SEG) { struct tcp_header *th = dp_packet_l4(pkt); +int hdr_len; if (!th) { VLOG_WARN_RL(, "%s: TCP Segmentation without L4 header" @@ -2311,7 +2312,15 @@ netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf) mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4; mbuf->ol_flags |= RTE_MBUF_F_TX_TCP_CKSUM; +hdr_len = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len; mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len; +if (OVS_UNLIKELY((hdr_len + mbuf->tso_segsz) > dev->max_packet_len)) { +VLOG_WARN_RL(, "%s: Oversized TSO packet. " + "hdr: %"PRIu32", gso: %"PRIu32", max len: %"PRIu32"", + dev->up.name, hdr_len, mbuf->tso_segsz, + dev->max_packet_len); +return false; +} if (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) { mbuf->ol_flags |= RTE_MBUF_F_TX_IP_CKSUM; @@ -2586,7 +2595,8 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts, int cnt = 0; struct rte_mbuf *pkt; -/* Filter oversized packets, unless are marked for TSO. */ +/* Filter oversized packets. The TSO packets are filtered out + * during the
Re: [ovs-dev] [PATCH v1] bridge ovs-vsctl Bridge IPFIX enable_input_sampling, enable_ouput_sampling fix unexpected values
A gentle reminder to take a look at this patch. Thanks, Sayali Naval From: Sayali Naval (sanaval) Sent: Wednesday, May 31, 2023 11:02 AM To: d...@openvswitch.org Subject: [ovs-dev] [PATCH v1] bridge ovs-vsctl Bridge IPFIX enable_input_sampling, enable_ouput_sampling fix unexpected values As per the Open vSwitch Manual (http://www.openvswitch.org/support/dist-docs/ovs-vsctl.8.txt) the Bridge IPFIX parameters can be passed as follows: ovs-vsctl -- set Bridge br0 ipfix=@i \ -- --id=@i create IPFIX targets=\"192.168.0.34:4739\" obs_do‐ main_id=123 obs_point_id=456 cache_active_timeout=60 cache_max_flows=13 \ other_config:enable-input-sampling=false \ other_config:enable-output-sampling=false where the default values are: enable_input_sampling: true enable_output_sampling: true But in the existing code https://github.com/openvswitch/ovs/blob/master/vswitchd/bridge.c#L1563-L1567, these 2 parameters take up unexpected values in some scenarios. be_opts.enable_input_sampling = !smap_get_bool(_cfg->other_config, "enable-input-sampling", false); be_opts.enable_output_sampling = !smap_get_bool(_cfg->other_config, "enable-output-sampling", false); Here, the function smap_get_bool is being used with a negation. smap_get_bool is defined as below: (https://github.com/openvswitch/ovs/blob/master/lib/smap.c#L220-L232) /* Gets the value associated with 'key' in 'smap' and converts it to a boolean. * If 'key' is not in 'smap', or its value is neither "true" nor "false", * returns 'def'. */ bool smap_get_bool(const struct smap *smap, const char *key, bool def) { const char *value = smap_get_def(smap, key, ""); if (def) { return strcasecmp("false", value) != 0; } else { return !strcasecmp("true", value); } } This returns expected values for the default case (since the above code will negate “false” we get from smap_get bool function and return the value “true”) but unexpected values for the case where the sampling value is passed through the CLI. For example, if we pass "true" for other_config:enable-input-sampling in the CLI, the above code will negate the “true” value we get from the smap_bool function and return the value “false”. Same would be the case for enable_output_sampling. Signed-off-by: Sayali Naval --- [bridge-ipfix-fix 277b3baae] Fix values for enable_input_sampling & enable_ouput_sampling Date: Thu May 25 10:32:43 2023 -0700 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c index f5dc59ad0..b972d55d0 100644 --- a/vswitchd/bridge.c +++ b/vswitchd/bridge.c @@ -1560,11 +1560,11 @@ bridge_configure_ipfix(struct bridge *br) be_opts.enable_tunnel_sampling = smap_get_bool(_cfg->other_config, "enable-tunnel-sampling", true); -be_opts.enable_input_sampling = !smap_get_bool(_cfg->other_config, - "enable-input-sampling", false); +be_opts.enable_input_sampling = smap_get_bool(_cfg->other_config, + "enable-input-sampling", true); -be_opts.enable_output_sampling = !smap_get_bool(_cfg->other_config, - "enable-output-sampling", false); +be_opts.enable_output_sampling = smap_get_bool(_cfg->other_config, + "enable-output-sampling", true); virtual_obs_id = smap_get(_cfg->other_config, "virtual_obs_id"); be_opts.virtual_obs_id = nullable_xstrdup(virtual_obs_id); -- ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH 2/2] userspace: Add Generic Segmentation Offloading.
From: Flavio Leitner This provides a software implementation in the case the egress netdev doesn't support segmentation in hardware. The challenge here is to guarantee packet ordering in the original batch that may be full of TSO packets. Each TSO packet can go up to ~64kB, so with segment size of 1440 that means about 44 packets for each TSO. Each batch has 32 packets, so the total batch amounts to 1408 normal packets. The segmentation estimates the total number of packets and then the total number of batches. Then allocate enough memory and finally do the work. Finally each batch is sent in order to the netdev. Signed-off-by: Flavio Leitner Co-authored-by: Mike Pattrick Signed-off-by: Mike Pattrick --- lib/automake.mk | 2 + lib/dp-packet-gso.c | 172 lib/dp-packet-gso.h | 24 +++ lib/dp-packet.c | 4 +- lib/dp-packet.h | 15 +++- lib/netdev-dpdk.c | 38 +++--- lib/netdev-linux.c | 58 --- lib/netdev.c| 121 ++- lib/packets.c | 4 +- 9 files changed, 314 insertions(+), 124 deletions(-) create mode 100644 lib/dp-packet-gso.c create mode 100644 lib/dp-packet-gso.h diff --git a/lib/automake.mk b/lib/automake.mk index e64ee76ce..49a92958d 100644 --- a/lib/automake.mk +++ b/lib/automake.mk @@ -118,6 +118,8 @@ lib_libopenvswitch_la_SOURCES = \ lib/dpctl.h \ lib/dp-packet.h \ lib/dp-packet.c \ + lib/dp-packet-gso.c \ + lib/dp-packet-gso.h \ lib/dpdk.h \ lib/dpif-netdev-extract-study.c \ lib/dpif-netdev-lookup.h \ diff --git a/lib/dp-packet-gso.c b/lib/dp-packet-gso.c new file mode 100644 index 0..bc72e2f90 --- /dev/null +++ b/lib/dp-packet-gso.c @@ -0,0 +1,172 @@ +/* + * Copyright (c) 2021 Red Hat, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include +#include +#include + +#include "coverage.h" +#include "dp-packet.h" +#include "dp-packet-gso.h" +#include "netdev-provider.h" +#include "openvswitch/vlog.h" + +VLOG_DEFINE_THIS_MODULE(dp_packet_gso); + +COVERAGE_DEFINE(soft_seg_good); + +/* Retuns a new packet that is a segment of packet 'p'. + * + * The new packet is initialized with 'hdr_len' bytes from the + * start of packet 'p' and then appended with 'data_len' bytes + * from the 'data' buffer. + * + * Note: The packet headers are not updated. */ +static struct dp_packet * +dp_packet_gso_seg_new(const struct dp_packet *p, size_t hdr_len, + const char *data, size_t data_len) +{ +struct dp_packet *seg = dp_packet_new_with_headroom(hdr_len + data_len, +dp_packet_headroom(p)); + +/* Append the original packet headers and then the payload. */ +dp_packet_put(seg, dp_packet_data(p), hdr_len); +dp_packet_put(seg, data, data_len); + +/* The new segment should have the same offsets. */ +seg->l2_5_ofs = p->l2_5_ofs; +seg->l3_ofs = p->l3_ofs; +seg->l4_ofs = p->l4_ofs; + +/* The protocol headers remain the same, so preserve hash and mark. */ +*dp_packet_rss_ptr(seg) = dp_packet_get_rss_hash(p); +*dp_packet_flow_mark_ptr(seg) = *dp_packet_flow_mark_ptr(p); + +/* The segment should inherit all the offloading flags from the + * original packet, except for the TCP segmentation, external + * buffer and indirect buffer flags. */ +*dp_packet_ol_flags_ptr(seg) = *dp_packet_ol_flags_ptr(p) +& ~(DP_PACKET_OL_TX_TCP_SEG | DP_PACKET_OL_EXTERNAL +| DP_PACKET_OL_INDIRECT); + +dp_packet_hwol_reset_tcp_seg(seg); + +return seg; +} + +/* Returns the calculated number of TCP segments in packet 'p'. */ +int +dp_packet_gso_nr_segs(struct dp_packet *p) +{ +uint16_t segsz = dp_packet_get_tso_segsz(p); +const char *data_tail; +const char *data_pos; +int n_segs; + +data_pos = dp_packet_get_tcp_payload(p); +data_tail = (char *) dp_packet_tail(p) - dp_packet_l2_pad_size(p); +data_pos = dp_packet_get_tcp_payload(p); +n_segs = DIV_ROUND_UP((data_tail - data_pos), segsz); + +return n_segs; + +} + +/* Perform software segmentation on packet 'p'. + * + * Returns all the segments added to the array of preallocated + * batches in 'batches' starting at batch position 'batch_pos'. */ +void +dp_packet_gso(struct dp_packet *p, struct dp_packet_batch *batches, + size_t *batch_pos) +{ +struct tcp_header
[ovs-dev] [PATCH 1/2] userspace: Respect tso/gso segment size.
From: Flavio Leitner Currently OVS will calculate the segment size based on the MTU of the egress port. That usually happens to be correct when the ports share the same MTU, but that is not always true. Therefore, if the segment size is provided, then use that and make sure the over sized packets are dropped. Signed-off-by: Flavio Leitner Co-authored-by: Mike Pattrick Signed-off-by: Mike Pattrick --- lib/dp-packet.c| 1 + lib/dp-packet.h| 26 + lib/netdev-dpdk.c | 13 +++-- lib/netdev-linux.c | 73 +- 4 files changed, 91 insertions(+), 22 deletions(-) diff --git a/lib/dp-packet.c b/lib/dp-packet.c index dfedd0e9b..d1e916488 100644 --- a/lib/dp-packet.c +++ b/lib/dp-packet.c @@ -34,6 +34,7 @@ dp_packet_init__(struct dp_packet *b, size_t allocated, enum dp_packet_source so pkt_metadata_init(>md, 0); dp_packet_reset_cutlen(b); dp_packet_reset_offload(b); +dp_packet_set_tso_segsz(p, 0); /* Initialize implementation-specific fields of dp_packet. */ dp_packet_init_specific(b); /* By default assume the packet type to be Ethernet. */ diff --git a/lib/dp-packet.h b/lib/dp-packet.h index 70ddf8aa4..6a924f3ff 100644 --- a/lib/dp-packet.h +++ b/lib/dp-packet.h @@ -126,6 +126,7 @@ struct dp_packet { uint32_t ol_flags; /* Offloading flags. */ uint32_t rss_hash; /* Packet hash. */ uint32_t flow_mark; /* Packet flow mark. */ +uint16_t tso_segsz; /* TCP TSO segment size. */ #endif enum dp_packet_source source; /* Source of memory allocated as 'base'. */ @@ -166,6 +167,9 @@ static inline void dp_packet_set_size(struct dp_packet *, uint32_t); static inline uint16_t dp_packet_get_allocated(const struct dp_packet *); static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t); +static inline uint16_t dp_packet_get_tso_segsz(const struct dp_packet *); +static inline void dp_packet_set_tso_segsz(struct dp_packet *, uint16_t); + void *dp_packet_resize_l2(struct dp_packet *, int increment); void *dp_packet_resize_l2_5(struct dp_packet *, int increment); static inline void *dp_packet_eth(const struct dp_packet *); @@ -644,6 +648,17 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s) b->mbuf.buf_len = s; } +static inline uint16_t +dp_packet_get_tso_segsz(const struct dp_packet *p) +{ +return p->mbuf.tso_segsz; +} + +static inline void +dp_packet_set_tso_segsz(struct dp_packet *p, uint16_t s) +{ +p->mbuf.tso_segsz = s; +} #else /* DPDK_NETDEV */ static inline void @@ -700,6 +715,17 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s) b->allocated_ = s; } +static inline uint16_t +dp_packet_get_tso_segsz(const struct dp_packet *p) +{ +return p->tso_segsz; +} + +static inline void +dp_packet_set_tso_segsz(struct dp_packet *p, uint16_t s) +{ +p->tso_segsz = s; +} #endif /* DPDK_NETDEV */ static inline void diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index 46f577441..1959b260b 100644 --- a/lib/netdev-dpdk.c +++ b/lib/netdev-dpdk.c @@ -2312,6 +2312,7 @@ netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf) if (mbuf->ol_flags & RTE_MBUF_F_TX_TCP_SEG) { struct tcp_header *th = dp_packet_l4(pkt); +int hdr_len; if (!th) { VLOG_WARN_RL(, "%s: TCP Segmentation without L4 header" @@ -2321,7 +2322,14 @@ netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf) mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4; mbuf->ol_flags |= RTE_MBUF_F_TX_TCP_CKSUM; -mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len; +hdr_len = mbuf->l2_len + mbuf->l3_len + mbuf->l4_len; +if (OVS_UNLIKELY((hdr_len + mbuf->tso_segsz) > dev->max_packet_len)) { +VLOG_WARN_RL(, "%s: Oversized TSO packet. " + "hdr: %"PRIu32", gso: %"PRIu32", max len: %"PRIu32"", + dev->up.name, hdr_len, mbuf->tso_segsz, + dev->max_packet_len); +return false; +} if (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) { mbuf->ol_flags |= RTE_MBUF_F_TX_IP_CKSUM; @@ -2596,7 +2604,8 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts, int cnt = 0; struct rte_mbuf *pkt; -/* Filter oversized packets, unless are marked for TSO. */ +/* Filter oversized packets. The TSO packets are filtered out + * during the offloading preparation for performance reasons. */ for (i = 0; i < pkt_cnt; i++) { pkt = pkts[i]; if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len) diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c index 3dba2ef1f..3877586b8 100644 --- a/lib/netdev-linux.c +++ b/lib/netdev-linux.c @@ -536,7 +536,7 @@ static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0); static bool tap_supports_vnet_hdr = true; static int
Re: [ovs-dev] [PATCH v1] dpif-netdev: Update meter timestamp when dp_netdev_run_meter been called.
On 5/31/23 17:41, mit...@outlook.com wrote: > From: Lin Huang > > Currently, a meter's timestamp 'now' is set by 'pmd->ctx.now' which updated > by pmd_thread_ctx_time_update(). > > Before processing of the new packet batch: > - dpif_netdev_execute() > - dp_netdev_process_rxq_port() > > There is a problem when user want to police the rate to a low pps by meter. > For example, When the traffic is more than 3Mpps, policing the traffic to > 1M pps, the real rate will be 3Mpps not 1Mpps. > > The key point is that a meter's timestamp isn't update in real time. > For example, PMD thread A and B are polled at the same time (t1). > Thread A starts to run meter to police traffic. At the same time, thread B > pause at 'ovs_mutex_lock(>lock)' to wait thread A release the meter > lock. This elapsed a lot of time, especially we have more than 10 PMD threads. Shouldn't the infrequent time update cause more drops instead? Best regards, Ilya Maximets. > > After thread A release the meter lock, thread B start to count the time diff. > - long_delta_t = now - meter->used' --> long_delta_t = t1 - t1 = 0. > - band->bucket += (uint64_t) delta_t * band->rate --> band->bucket = 0. > - band_exceeded_pkt = band->bucket / 1000; --> band_exceeded_pkt = 0. > > Fix this problem by update the meter timestamp every time. > > Test-by: Zhang Yuhuang > Signed-off-by: Lin Huang > --- > lib/dpif-netdev.c | 12 > 1 file changed, 8 insertions(+), 4 deletions(-) > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c > index 70b953ae6..dbb275cf8 100644 > --- a/lib/dpif-netdev.c > +++ b/lib/dpif-netdev.c > @@ -7169,12 +7169,13 @@ dpif_netdev_meter_get_features(const struct dpif * > dpif OVS_UNUSED, > * that exceed a band are dropped in-place. */ > static void > dp_netdev_run_meter(struct dp_netdev *dp, struct dp_packet_batch *packets_, > -uint32_t meter_id, long long int now) > +uint32_t meter_id) > { > struct dp_meter *meter; > struct dp_meter_band *band; > struct dp_packet *packet; > long long int long_delta_t; /* msec */ > +long long int now; > uint32_t delta_t; /* msec */ > const size_t cnt = dp_packet_batch_size(packets_); > uint32_t bytes, volume; > @@ -7197,8 +7198,12 @@ dp_netdev_run_meter(struct dp_netdev *dp, struct > dp_packet_batch *packets_, > memset(exceeded_rate, 0, cnt * sizeof *exceeded_rate); > > ovs_mutex_lock(>lock); > + > +/* Update now */ > +now = time_msec(); > + > /* All packets will hit the meter at the same time. */ > -long_delta_t = now / 1000 - meter->used / 1000; /* msec */ > +long_delta_t = now - meter->used; /* msec */ > > if (long_delta_t < 0) { > /* This condition means that we have several threads fighting for a > @@ -9170,8 +9175,7 @@ dp_execute_cb(void *aux_, struct dp_packet_batch > *packets_, > } > > case OVS_ACTION_ATTR_METER: > -dp_netdev_run_meter(pmd->dp, packets_, nl_attr_get_u32(a), > -pmd->ctx.now); > +dp_netdev_run_meter(pmd->dp, packets_, nl_attr_get_u32(a)); > break; > > case OVS_ACTION_ATTR_PUSH_VLAN: ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v4] rhel: make the version, displayed to the user, customizable
On 6/20/23 19:43, Timothy Redaelli wrote: > Since on CentOS/RHEL the builds are based on stable branches and not on > tags for debugging purpose it's better to have the downstream version as > version so it's easier to know which commits are included in a build. > > This commit adds --with-version-suffix as ./configure option in > order to set an OVS version suffix that should be shown to the user via > ovs-vsctl -V and, so, also on database, on ovs-vsctl show and the other > utilities. > > --with-version-suffix is used in Fedora/CentOS/RHEL spec file in order to have > the version be aligned with the downstream one. > > Signed-off-by: Timothy Redaelli > --- > v1 -> v2: Use --with-version-suffix= and add version to other utilies > (as requested by Ilya). > > v2 -> v3: Add versioning to python utilities and python library itself > (as suggested by Aaron) > > v3 -> v4: Remove versioning to python library itself to avoid breaking > PEP440 (as requested by Ilya). Versioning is still used in > python utilities. Thanks for the update. There is at least one place in a python code missed in this patch. It's the python/ovs/unixctl/server.py. It's using ovs.version.VERSION that doesn't contain the suffix. Another place is python/ovstest/args.py, but, I suppose, we can ignore it for now, as it seems to be broken anyway. It just prints the '@VERSION@' verbatim... Best regards, Ilya Maximets. ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH ovn v2] ovn-controller: Reduce size of the SB monitor condition.
On Wed, Jun 21, 2023 at 2:48 AM Dumitru Ceara wrote: > > We don't need to explicitly add port bindings that were already bound > locally. We implicitly get those because we monitor the datapaths > they're attached to. > > When performing an ovn-heater 500-node density-heavy scale test [0], with > conditional monitoring enabled, the unreasonably long poll intervals on > the Southbound database (the ones that took more than one second) are > measured as: > > - >Count MinMax Median Mean 95 percentile > - > 56.0 1010.0 2664.0 1455.5 1544.9 2163.0 > 77.0 1015.0 3460.0 1896.0 1858.2 2907.8 > 69.0 1010.0 3118.0 1618.0 1688.1 2582.4 > - >202.0 1010.0 3460.0 1610.0 1713.3 2711.5 > > Compared to the baseline results (also with conditional monitoring > enabled): > - >Count MinMax Median Mean 95 percentile > - >141.0 1003.018338.0 1721.0 2625.4 7626.0 >151.0 1019.080297.0 1808.0 3410.7 9089.0 >165.0 1007.050736.0 2354.0 3067.7 7309.8 > - >457.0 1003.080297.0 2131.0 3044.6 7799.6 > > We see significant improvement on the server side. This is explained > by the fact that now the Southbound database server doesn't need to > apply as many conditions as before when filtering individual monitor > contents. > > Note: Sub-ports - OVN switch ports with parent_port set - have to be > monitored unconditionally as we cannot efficiently determine their local > datapaths without including all local OVS interfaces in the monitor. > > [0] https://github.com/ovn-org/ovn-heater/blob/main/test-scenarios/ocp-500-density-heavy.yml > > Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2139194 > Reported-by: Ilya Maximets > Signed-off-by: Dumitru Ceara > --- > V2: > - Addressed Han's comments: > - changed commit log: removed assumption about sub-ports. > - added ovn-nb documentation note about sub-ports monitoring. > - added ovn-controller documentation note about sub-ports monitoring. > - added TODO item. > - Added NEWS item. > --- > NEWS| 3 +++ > TODO.rst| 6 ++ > controller/binding.c| 2 +- > controller/binding.h| 2 +- > controller/ovn-controller.8.xml | 6 ++ > controller/ovn-controller.c | 19 --- > ovn-nb.xml | 6 +- > 7 files changed, 38 insertions(+), 6 deletions(-) > > diff --git a/NEWS b/NEWS > index 482970eeeb..8275877f99 100644 > --- a/NEWS > +++ b/NEWS > @@ -7,6 +7,9 @@ Post v23.06.0 > DHCPv4 "hostname" (12) option. >- Support to create/update MAC_Binding when GARP received from VTEP (RAMP) > switch on l3dgw port. > + - To allow optimizing ovn-controller's monitor conditions for the regular > +VIF case, ovn-controller now unconditionally monitors all sub-ports > +(ports with parent_port set). > > OVN v23.06.0 - 01 Jun 2023 > -- > diff --git a/TODO.rst b/TODO.rst > index dff163e70b..b790a9fadf 100644 > --- a/TODO.rst > +++ b/TODO.rst > @@ -124,3 +124,9 @@ OVN To-do List > * Load Balancer templates > >* Support combining the VIP IP and port into a single template variable. > + > +* ovn-controller conditional monitoring > + > + * Improve sub-ports (with parent_port set) conditional monitoring; these > +are currently unconditionally monitored, even if ovn-monitor-all is > +set to false. > diff --git a/controller/binding.c b/controller/binding.c > index 486095ca79..359ad6d660 100644 > --- a/controller/binding.c > +++ b/controller/binding.c > @@ -785,7 +785,7 @@ local_binding_data_destroy(struct local_binding_data *lbinding_data) > } > > const struct sbrec_port_binding * > -local_binding_get_primary_pb(struct shash *local_bindings, > +local_binding_get_primary_pb(const struct shash *local_bindings, > const char *pb_name) > { > struct local_binding *lbinding = > diff --git a/controller/binding.h b/controller/binding.h > index 0e57f02ee5..319cbd263c 100644 > --- a/controller/binding.h > +++ b/controller/binding.h > @@ -150,7 +150,7 @@ void local_binding_data_init(struct local_binding_data *); > void local_binding_data_destroy(struct local_binding_data *); > > const struct sbrec_port_binding *local_binding_get_primary_pb( > -struct shash *local_bindings, const char *pb_name); >
Re: [ovs-dev] Request adding long poll interval metrics to coverage metrics
Eelco Chaudron writes: > On 20 Jun 2023, at 16:57, Ilya Maximets wrote: > >> On 6/20/23 16:10, Aaron Conole wrote: >>> Adrian Moreno writes: >>> On 6/19/23 10:36, Eelco Chaudron wrote: > On 16 Jun 2023, at 19:19, Aaron Conole wrote: > >> Martin Kennelly writes: >> >>> Hey ovs community, >>> >>> I am a developer working on ovn-kubernetes and I want to >>> programmatically consume long poll information >>> i.e: >>> ovs|00211|timeval(handler25)|WARN|Unreasonably long 52388ms poll >>> interval (752ms user, 209ms system) >>> >>> This is currently exposed via journal logs but it's not practical >>> to consume it there programmatically and I was >>> hoping you could add it to coverage metrics. >> >> I think it could be useful. I do want to be careful about exposing >> these kinds of data in a way that could be misinterpreted. Already, >> that log in particular gets misinterpreted quite a bit, and RH gets >> customers claiming OVS is misbehaving when they've oversubscribed the >> system. > +1 > Maybe it's a good time to start documenting coverage counters? >>> >>> I agree - we should have at least some kind of documentation. Actually, >>> it would be really nice if we could do something during >>> COVERAGE_DEFINE() that would be like: >>> >>> COVERAGE_DEFINE(ctr, "description") >>> >>> and then we can generate documentation from the COVERAGE_DEFINE()s as >>> well as querying for it with an ovs-appctl command. > > I think this might be the only way to keep the doc in sync, however, > who is going to document the 320+ existing ones? That becomes a bit of a retro-fit challenge I think. Regardless, maybe the effort would be worth it. >>> That might be trying to be too fancy, though. >> >> >> The problem with documenting coverage counters is that we can't >> and shouldn't claim that these are in any way a stable API. >> They are mainly for developers. Code can change and there might >> be no meaningful way preserve current counters or their names. >> They can change in each release including minor ones without >> prior notice. > > +1000 as even if the direct code near the counter does not changes, it > could still be impacted the actual counter. Which I have seen before. > >> Best regards, Ilya Maximets. >> >>> >> Mechanically, it would be pretty simple to do something like: >> >> --- >> diff --git a/lib/timeval.c b/lib/timeval.c >> index 193c7bab17..00e5f2a74d 100644 >> --- a/lib/timeval.c >> +++ b/lib/timeval.c >> @@ -40,6 +40,7 @@ >> #include "openvswitch/vlog.h" >> >> VLOG_DEFINE_THIS_MODULE(timeval); >> +COVERAGE_DEFINE(long_poll_interval); >> >> #if !defined(HAVE_CLOCK_GETTIME) >> typedef unsigned int clockid_t; >> @@ -645,6 +646,8 @@ log_poll_interval(long long int last_wakeup) >> struct rusage rusage; >> >> if (!getrusage_thread()) { >> +COVERAGE_INC(long_poll_interval); >> + >> VLOG_WARN("Unreasonably long %lldms poll interval" >> " (%lldms user, %lldms system)", >> interval, >> --- >> >> This would at least expose the coverage data via the coverage framework >> and it can be queried via ovs-appctl. Actually, the advantage here is >> that the coverage counter can track some details about X/sec over the >> last 5 seconds, minute, hour, in addition to the total, so we can see >> whether the condition is ongoing. > ___ > dev mailing list > d...@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev > >>> >>> ___ >>> dev mailing list >>> d...@openvswitch.org >>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>> >> >> ___ >> dev mailing list >> d...@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] Request adding long poll interval metrics to coverage metrics
Ilya Maximets writes: > On 6/20/23 16:10, Aaron Conole wrote: >> Adrian Moreno writes: >> >>> On 6/19/23 10:36, Eelco Chaudron wrote: On 16 Jun 2023, at 19:19, Aaron Conole wrote: > Martin Kennelly writes: > >> Hey ovs community, >> >> I am a developer working on ovn-kubernetes and I want to >> programmatically consume long poll information >> i.e: >> ovs|00211|timeval(handler25)|WARN|Unreasonably long 52388ms poll >> interval (752ms user, 209ms system) >> >> This is currently exposed via journal logs but it's not practical >> to consume it there programmatically and I was >> hoping you could add it to coverage metrics. > > I think it could be useful. I do want to be careful about exposing > these kinds of data in a way that could be misinterpreted. Already, > that log in particular gets misinterpreted quite a bit, and RH gets > customers claiming OVS is misbehaving when they've oversubscribed the > system. +1 >>> >>> Maybe it's a good time to start documenting coverage counters? >> >> I agree - we should have at least some kind of documentation. Actually, >> it would be really nice if we could do something during >> COVERAGE_DEFINE() that would be like: >> >> COVERAGE_DEFINE(ctr, "description") >> >> and then we can generate documentation from the COVERAGE_DEFINE()s as >> well as querying for it with an ovs-appctl command. >> >> That might be trying to be too fancy, though. > > > The problem with documenting coverage counters is that we can't > and shouldn't claim that these are in any way a stable API. > They are mainly for developers. Code can change and there might > be no meaningful way preserve current counters or their names. > They can change in each release including minor ones without > prior notice. Agreed that these are developer focused counters, and not proper KPIs. Even with that, there can be some information for others to take some useful information back to the developers with. For example, we can see how often this particular event is happening to correlate with system issues. I guess that is why I prefer the in-code documentation to doing some kind of "traditional" man-page for these as well. > Best regards, Ilya Maximets. > >> > Mechanically, it would be pretty simple to do something like: > > --- > diff --git a/lib/timeval.c b/lib/timeval.c > index 193c7bab17..00e5f2a74d 100644 > --- a/lib/timeval.c > +++ b/lib/timeval.c > @@ -40,6 +40,7 @@ > #include "openvswitch/vlog.h" > > VLOG_DEFINE_THIS_MODULE(timeval); > +COVERAGE_DEFINE(long_poll_interval); > > #if !defined(HAVE_CLOCK_GETTIME) > typedef unsigned int clockid_t; > @@ -645,6 +646,8 @@ log_poll_interval(long long int last_wakeup) > struct rusage rusage; > > if (!getrusage_thread()) { > +COVERAGE_INC(long_poll_interval); > + > VLOG_WARN("Unreasonably long %lldms poll interval" > " (%lldms user, %lldms system)", > interval, > --- > > This would at least expose the coverage data via the coverage framework > and it can be queried via ovs-appctl. Actually, the advantage here is > that the coverage counter can track some details about X/sec over the > last 5 seconds, minute, hour, in addition to the total, so we can see > whether the condition is ongoing. ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev >> >> ___ >> dev mailing list >> d...@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >> ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH] dpif: Add coverage counters for dpif_operate() failures.
Eelco Chaudron writes: > On 20 Jun 2023, at 16:17, Aaron Conole wrote: > >> Eelco Chaudron writes: >> >>> Add additional error coverage counters for dpif operation failures. >>> This could help to quickly identify netlink problems when communicating >>> with the OVS kernel module. >>> >>> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2070630 >>> Signed-off-by: Eelco Chaudron >>> --- >> >> One thing that made this difficult to review is the re-ordering of the >> COVERAGE_DEFINE()s. Maybe it would be better to have a separate patch >> that corrects the order to alphabetical, and then a patch that adds the >> new counters for flow put/del/get/execute. WDYT? > > Based on the patch size, I decided not to do this, but if others also > feel it needs to be split up, I can do this. Either way: Acked-by: Aaron Conole > //Eelco > > >>> lib/dpif.c | 39 ++- >>> 1 file changed, 26 insertions(+), 13 deletions(-) >>> >>> diff --git a/lib/dpif.c b/lib/dpif.c >>> index 3305401fe..b1cbf39c4 100644 >>> --- a/lib/dpif.c >>> +++ b/lib/dpif.c >>> @@ -55,18 +55,22 @@ >>> VLOG_DEFINE_THIS_MODULE(dpif); >>> >>> COVERAGE_DEFINE(dpif_destroy); >>> -COVERAGE_DEFINE(dpif_port_add); >>> -COVERAGE_DEFINE(dpif_port_del); >>> +COVERAGE_DEFINE(dpif_execute); >>> +COVERAGE_DEFINE(dpif_execute_error); >>> +COVERAGE_DEFINE(dpif_execute_with_help); >>> +COVERAGE_DEFINE(dpif_flow_del); >>> +COVERAGE_DEFINE(dpif_flow_del_error); >>> COVERAGE_DEFINE(dpif_flow_flush); >>> COVERAGE_DEFINE(dpif_flow_get); >>> +COVERAGE_DEFINE(dpif_flow_get_error); >>> COVERAGE_DEFINE(dpif_flow_put); >>> -COVERAGE_DEFINE(dpif_flow_del); >>> -COVERAGE_DEFINE(dpif_execute); >>> -COVERAGE_DEFINE(dpif_purge); >>> -COVERAGE_DEFINE(dpif_execute_with_help); >>> -COVERAGE_DEFINE(dpif_meter_set); >>> -COVERAGE_DEFINE(dpif_meter_get); >>> +COVERAGE_DEFINE(dpif_flow_put_error); >>> COVERAGE_DEFINE(dpif_meter_del); >>> +COVERAGE_DEFINE(dpif_meter_get); >>> +COVERAGE_DEFINE(dpif_meter_set); >>> +COVERAGE_DEFINE(dpif_port_add); >>> +COVERAGE_DEFINE(dpif_port_del); >>> +COVERAGE_DEFINE(dpif_purge); >>> >>> static const struct dpif_class *base_dpif_classes[] = { >>> #if defined(__linux__) || defined(_WIN32) >>> @@ -1381,8 +1385,11 @@ dpif_operate(struct dpif *dpif, struct dpif_op >>> **ops, size_t n_ops, >>> >>> COVERAGE_INC(dpif_flow_put); >>> log_flow_put_message(dpif, _module, put, error); >>> -if (error && put->stats) { >>> -memset(put->stats, 0, sizeof *put->stats); >>> +if (error) { >>> +COVERAGE_INC(dpif_flow_put_error); >>> +if (put->stats) { >>> +memset(put->stats, 0, sizeof *put->stats); >>> +} >>> } >>> break; >>> } >>> @@ -1392,10 +1399,10 @@ dpif_operate(struct dpif *dpif, struct dpif_op >>> **ops, size_t n_ops, >>> >>> COVERAGE_INC(dpif_flow_get); >>> if (error) { >>> +COVERAGE_INC(dpif_flow_get_error); >>> memset(get->flow, 0, sizeof *get->flow); >>> } >>> log_flow_get_message(dpif, _module, get, error); >>> - >>> break; >>> } >>> >>> @@ -1404,8 +1411,11 @@ dpif_operate(struct dpif *dpif, struct dpif_op >>> **ops, size_t n_ops, >>> >>> COVERAGE_INC(dpif_flow_del); >>> log_flow_del_message(dpif, _module, del, error); >>> -if (error && del->stats) { >>> -memset(del->stats, 0, sizeof *del->stats); >>> +if (error) { >>> +COVERAGE_INC(dpif_flow_del_error); >>> +if (del->stats) { >>> +memset(del->stats, 0, sizeof *del->stats); >>> +} >>> } >>> break; >>> } >>> @@ -1414,6 +1424,9 @@ dpif_operate(struct dpif *dpif, struct dpif_op **ops, >>> size_t n_ops, >>> COVERAGE_INC(dpif_execute); >>> log_execute_message(dpif, _module, >execute, >>> false, error); >>> +if (error) { >>> +COVERAGE_INC(dpif_execute_error); >>> +} >>> break; >>> } >>> } ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH] MAINTAINERS: Add Eelco Chaudron
On 6/21/23 15:00, Eelco Chaudron wrote: > > > On 21 Jun 2023, at 14:50, Simon Horman wrote: > >> Eelco Chaudron was elected by the Open vSwitch committers yesterday. >> This formalises his status as an Open vSwitch committer. >> >> Welcome Eelco! > > Not sire if I should ack ;) > > Acked-by: Eelco Chaudron > >> Signed-off-by: Simon Horman Applied. And Welcome! Best regards, Ilya Maximets. >> --- >> MAINTAINERS.rst | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/MAINTAINERS.rst b/MAINTAINERS.rst >> index 85b8e6416584..2fc2517177e1 100644 >> --- a/MAINTAINERS.rst >> +++ b/MAINTAINERS.rst >> @@ -45,6 +45,8 @@ This is the current list of active Open vSwitch committers: >> - aserd...@ovn.org >> * - Ansis Atteka >> - ansisatt...@gmail.com >> + * - Eelco Chaudron >> + - echau...@redhat.com >> * - Ian Stokes >> - isto...@ovn.org >> * - Ilya Maximets >> -- >> 2.30.2 > ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH] MAINTAINERS: Add Eelco Chaudron
On 21 Jun 2023, at 14:50, Simon Horman wrote: > Eelco Chaudron was elected by the Open vSwitch committers yesterday. > This formalises his status as an Open vSwitch committer. > > Welcome Eelco! Not sire if I should ack ;) Acked-by: Eelco Chaudron > Signed-off-by: Simon Horman > --- > MAINTAINERS.rst | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/MAINTAINERS.rst b/MAINTAINERS.rst > index 85b8e6416584..2fc2517177e1 100644 > --- a/MAINTAINERS.rst > +++ b/MAINTAINERS.rst > @@ -45,6 +45,8 @@ This is the current list of active Open vSwitch committers: > - aserd...@ovn.org > * - Ansis Atteka > - ansisatt...@gmail.com > + * - Eelco Chaudron > + - echau...@redhat.com > * - Ian Stokes > - isto...@ovn.org > * - Ilya Maximets > -- > 2.30.2 ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH] MAINTAINERS: Add Eelco Chaudron
Welcome Eelco! Acked-by: Alin Gabriel Serdean > > On 21 Jun 2023, at 14:50, Simon Horman wrote: > > Eelco Chaudron was elected by the Open vSwitch committers yesterday. > This formalises his status as an Open vSwitch committer. > > Welcome Eelco! > > Signed-off-by: Simon Horman > --- > MAINTAINERS.rst | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/MAINTAINERS.rst b/MAINTAINERS.rst > index 85b8e6416584..2fc2517177e1 100644 > --- a/MAINTAINERS.rst > +++ b/MAINTAINERS.rst > @@ -45,6 +45,8 @@ This is the current list of active Open vSwitch committers: > - aserd...@ovn.org >* - Ansis Atteka > - ansisatt...@gmail.com > + * - Eelco Chaudron > + - echau...@redhat.com >* - Ian Stokes > - isto...@ovn.org >* - Ilya Maximets > -- > 2.30.2 > > ___ > dev mailing list > d...@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH] MAINTAINERS: Add Eelco Chaudron
Eelco Chaudron was elected by the Open vSwitch committers yesterday. This formalises his status as an Open vSwitch committer. Welcome Eelco! Signed-off-by: Simon Horman --- MAINTAINERS.rst | 2 ++ 1 file changed, 2 insertions(+) diff --git a/MAINTAINERS.rst b/MAINTAINERS.rst index 85b8e6416584..2fc2517177e1 100644 --- a/MAINTAINERS.rst +++ b/MAINTAINERS.rst @@ -45,6 +45,8 @@ This is the current list of active Open vSwitch committers: - aserd...@ovn.org * - Ansis Atteka - ansisatt...@gmail.com + * - Eelco Chaudron + - echau...@redhat.com * - Ian Stokes - isto...@ovn.org * - Ilya Maximets -- 2.30.2 ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH ovn] ovn-controller: Reduce size of the SB monitor condition.
On 6/20/23 21:01, Han Zhou wrote: > On Tue, Jun 20, 2023 at 12:48 AM Dumitru Ceara wrote: >> >> On 6/20/23 03:49, Han Zhou wrote: >>> On Tue, Jun 6, 2023 at 7:57 AM Dumitru Ceara wrote: We don't need to explicitly add port bindings that were already bound locally. We implicitly get those because we monitor the datapaths they're attached to. When performing an ovn-heater 500-node density-heavy scale test [0], > with conditional monitoring enabled, the unreasonably long poll intervals on the Southbound database (the ones that took more than one second) are measured as: - Count MinMax Median Mean 95 percentile - 56.0 1010.0 2664.0 1455.5 1544.9 2163.0 77.0 1015.0 3460.0 1896.0 1858.2 2907.8 69.0 1010.0 3118.0 1618.0 1688.1 2582.4 - 202.0 1010.0 3460.0 1610.0 1713.3 2711.5 Compared to the baseline results (also with conditional monitoring enabled): - Count MinMax Median Mean 95 percentile - 141.0 1003.018338.0 1721.0 2625.4 7626.0 151.0 1019.080297.0 1808.0 3410.7 9089.0 165.0 1007.050736.0 2354.0 3067.7 7309.8 - 457.0 1003.080297.0 2131.0 3044.6 7799.6 We see significant improvement on the server side. This is explained by the fact that now the Southbound database server doesn't need to apply as many conditions as before when filtering individual monitor contents. >>> Thanks Dumitru for the great improvement! This is very helpful for the > high >>> port-density environment. >>> Just to make sure I understand the test result correctly, in [0], it > shows >>> 22500 pods and 500 nodes, so is it 45 pods per node? >>> >> >> Yes, for density-heavy tests (load balancers are also configured) the >> pod density is 45 per node. >> Note: Sub-ports - OVN switch ports with parent_port set - have to be monitored unconditionally as we cannot efficiently determine their > local datapaths without including all local OVS interfaces in the monitor. This, however, should not be a huge issue because the majority of ports are regular VIFs, not sub-ports. >>> >>> I am not sure if we can make such a conclusion. For the current ovn-k8s > or >>> environments similar to that, it shouldn't be a problem. >>> However, for environments that model pods/containers as sub-ports of > the VM >>> VIFs, probably most of the majority of the ports would be sub-ports. > This >>> is what sub-ports are designed for, right? >> >> My impression was that this was one of the use cases for OpenStack and >> that it's only one of the different ways of providing container >> connectivity in a given deployment. But I might be wrong. I can remove >> this sentence, it makes a lot of assumptions indeed. >> >>> So, I think this would be a significant change of data monitored for > those >>> environments. I'd suggest at least we should properly document the >>> implication in the documents (such as ovn-monitor-all, and also the >>> sub-port related parts). There may also be such users who prefer not >>> monitoring all sub-ports (for efficiency of ovn-controller) sacrificing > SB >>> DB performance (probably they don't have very high port density so the >>> conditional monitoring impact is not that big). I am not aware of any > such >>> users yet, but if they complain, we will have to provide a knob, if no >>> better ideas. >>> >> >> I agree, if really needed, we can easily add a knob. >> >> What do you think of the following incremental? I can fold it in if it >> looks good to you. > > Thanks Dumitru. The below documentation looks good. In addition, I think we > should add some notes in the ovn-nb.xml under the section title="Containers"> of Logical_Switch_Port, which is the place where the > "sub-port" feature is described. Could you add it as well? > To avoid an "incremental on top of another incremental" I posted v2. I also added a NEWS item as this is a user-visible change. Please let me know what you think. https://patchwork.ozlabs.org/project/ovn/patch/20230621094753.150072-1-dce...@redhat.com/ Thanks, Dumitru ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH ovn v2] ovn-controller: Reduce size of the SB monitor condition.
We don't need to explicitly add port bindings that were already bound locally. We implicitly get those because we monitor the datapaths they're attached to. When performing an ovn-heater 500-node density-heavy scale test [0], with conditional monitoring enabled, the unreasonably long poll intervals on the Southbound database (the ones that took more than one second) are measured as: - Count MinMax Median Mean 95 percentile - 56.0 1010.0 2664.0 1455.5 1544.9 2163.0 77.0 1015.0 3460.0 1896.0 1858.2 2907.8 69.0 1010.0 3118.0 1618.0 1688.1 2582.4 - 202.0 1010.0 3460.0 1610.0 1713.3 2711.5 Compared to the baseline results (also with conditional monitoring enabled): - Count MinMax Median Mean 95 percentile - 141.0 1003.018338.0 1721.0 2625.4 7626.0 151.0 1019.080297.0 1808.0 3410.7 9089.0 165.0 1007.050736.0 2354.0 3067.7 7309.8 - 457.0 1003.080297.0 2131.0 3044.6 7799.6 We see significant improvement on the server side. This is explained by the fact that now the Southbound database server doesn't need to apply as many conditions as before when filtering individual monitor contents. Note: Sub-ports - OVN switch ports with parent_port set - have to be monitored unconditionally as we cannot efficiently determine their local datapaths without including all local OVS interfaces in the monitor. [0] https://github.com/ovn-org/ovn-heater/blob/main/test-scenarios/ocp-500-density-heavy.yml Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2139194 Reported-by: Ilya Maximets Signed-off-by: Dumitru Ceara --- V2: - Addressed Han's comments: - changed commit log: removed assumption about sub-ports. - added ovn-nb documentation note about sub-ports monitoring. - added ovn-controller documentation note about sub-ports monitoring. - added TODO item. - Added NEWS item. --- NEWS| 3 +++ TODO.rst| 6 ++ controller/binding.c| 2 +- controller/binding.h| 2 +- controller/ovn-controller.8.xml | 6 ++ controller/ovn-controller.c | 19 --- ovn-nb.xml | 6 +- 7 files changed, 38 insertions(+), 6 deletions(-) diff --git a/NEWS b/NEWS index 482970eeeb..8275877f99 100644 --- a/NEWS +++ b/NEWS @@ -7,6 +7,9 @@ Post v23.06.0 DHCPv4 "hostname" (12) option. - Support to create/update MAC_Binding when GARP received from VTEP (RAMP) switch on l3dgw port. + - To allow optimizing ovn-controller's monitor conditions for the regular +VIF case, ovn-controller now unconditionally monitors all sub-ports +(ports with parent_port set). OVN v23.06.0 - 01 Jun 2023 -- diff --git a/TODO.rst b/TODO.rst index dff163e70b..b790a9fadf 100644 --- a/TODO.rst +++ b/TODO.rst @@ -124,3 +124,9 @@ OVN To-do List * Load Balancer templates * Support combining the VIP IP and port into a single template variable. + +* ovn-controller conditional monitoring + + * Improve sub-ports (with parent_port set) conditional monitoring; these +are currently unconditionally monitored, even if ovn-monitor-all is +set to false. diff --git a/controller/binding.c b/controller/binding.c index 486095ca79..359ad6d660 100644 --- a/controller/binding.c +++ b/controller/binding.c @@ -785,7 +785,7 @@ local_binding_data_destroy(struct local_binding_data *lbinding_data) } const struct sbrec_port_binding * -local_binding_get_primary_pb(struct shash *local_bindings, +local_binding_get_primary_pb(const struct shash *local_bindings, const char *pb_name) { struct local_binding *lbinding = diff --git a/controller/binding.h b/controller/binding.h index 0e57f02ee5..319cbd263c 100644 --- a/controller/binding.h +++ b/controller/binding.h @@ -150,7 +150,7 @@ void local_binding_data_init(struct local_binding_data *); void local_binding_data_destroy(struct local_binding_data *); const struct sbrec_port_binding *local_binding_get_primary_pb( -struct shash *local_bindings, const char *pb_name); +const struct shash *local_bindings, const char *pb_name); ofp_port_t local_binding_get_lport_ofport(const struct shash *local_bindings, const char *pb_name); diff --git a/controller/ovn-controller.8.xml
[ovs-dev] [PATCH v2 6/6] pmd.at: Add per pmd max sleep unit tests.
Add unit tests for new per pmd options of pmd-sleep-max. Signed-off-by: Kevin Traynor --- tests/pmd.at | 152 +++ 1 file changed, 152 insertions(+) diff --git a/tests/pmd.at b/tests/pmd.at index 6040d558c..31b832e3e 100644 --- a/tests/pmd.at +++ b/tests/pmd.at @@ -77,4 +77,18 @@ m4_define([CHECK_DP_SLEEP_MAX], [ ]) +dnl CHECK_PMD_SLEEP_MAX([core_id], [numa_id], [max_sleep], [+line]) +dnl +dnl Checks max sleep time of each pmd with core_id. +dnl Checking starts from line number 'line' in ovs-vswithd.log . +m4_define([CHECK_PMD_SLEEP_MAX], [ +PATTERN="PMD thread core *[$1] NUMA *[$2]: Max sleep request set to *[$3] usecs." +line_st=$4 +if [[ -z "$line_st" ]] +then +line_st="+0" +fi +OVS_WAIT_UNTIL([tail -n $line_st ovs-vswitchd.log | grep "$PATTERN"]) +]) + m4_define([SED_NUMA_CORE_PATTERN], ["s/\(numa_id \)[[0-9]]*\( core_id \)[[0-9]]*:/\1\2:/"]) m4_define([DUMMY_NUMA], [--dummy-numa="0,0,0,0"]) @@ -1339,2 +1353,140 @@ PMD load based sleeps are enabled by default. OVS_VSWITCHD_STOP AT_CLEANUP + +AT_SETUP([PMD - per pmd sleep]) +OVS_VSWITCHD_START([add-port br0 p0 -- set Interface p0 type=dummy-pmd options:n_rxq=8 options:numa_id=1], [], [], [--dummy-numa 0,0,0,1,1,8,8]) + +dnl Check system default +CHECK_DP_SLEEP_MAX([0], [disabled], []) + +AT_CHECK([ovs-appctl dpif-netdev/pmd-sleep-show], [0], [dnl +PMD max sleep request is 0 usecs by default. +PMD load based sleeps are disabled by default. +PMD thread core 0 NUMA 0: Max sleep request set to0 usecs. +PMD thread core 3 NUMA 1: Max sleep request set to0 usecs. +PMD thread core 5 NUMA 8: Max sleep request set to0 usecs. +]) + +dnl only a dp default +get_log_next_line_num +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-sleep-max=200]) +CHECK_DP_SLEEP_MAX([200], [enabled], [+$LINENUM]) +CHECK_PMD_SLEEP_MAX([0], [0], [200], [+$LINENUM]) +CHECK_PMD_SLEEP_MAX([3], [1], [200], [+$LINENUM]) +CHECK_PMD_SLEEP_MAX([5], [8], [200], [+$LINENUM]) + +AT_CHECK([ovs-appctl dpif-netdev/pmd-sleep-show], [0], [dnl +PMD max sleep request is 200 usecs by default. +PMD load based sleeps are enabled by default. +PMD thread core 0 NUMA 0: Max sleep request set to 200 usecs. +PMD thread core 3 NUMA 1: Max sleep request set to 200 usecs. +PMD thread core 5 NUMA 8: Max sleep request set to 200 usecs. +]) + +dnl only per pmd +get_log_next_line_num +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-sleep-max=3:300,0:100,5:400]) +CHECK_DP_SLEEP_MAX([0], [disabled], [+$LINENUM]) +CHECK_PMD_SLEEP_MAX([0], [0], [100], [+$LINENUM]) +CHECK_PMD_SLEEP_MAX([3], [1], [300], [+$LINENUM]) +CHECK_PMD_SLEEP_MAX([5], [8], [400], [+$LINENUM]) + + +AT_CHECK([ovs-appctl dpif-netdev/pmd-sleep-show], [0], [dnl +PMD max sleep request is 0 usecs by default. +PMD load based sleeps are disabled by default. +PMD thread core 0 NUMA 0: Max sleep request set to 100 usecs. +PMD thread core 3 NUMA 1: Max sleep request set to 300 usecs. +PMD thread core 5 NUMA 8: Max sleep request set to 400 usecs. +]) + +dnl mix of not used default and per-pmd +get_log_next_line_num +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-sleep-max=50,3:300,0:100,5:200]) +CHECK_DP_SLEEP_MAX([50], [enabled], [+$LINENUM]) +CHECK_PMD_SLEEP_MAX([0], [0], [100], [+$LINENUM]) +CHECK_PMD_SLEEP_MAX([3], [1], [300], [+$LINENUM]) +CHECK_PMD_SLEEP_MAX([5], [8], [200], [+$LINENUM]) + +AT_CHECK([ovs-appctl dpif-netdev/pmd-sleep-show], [0], [dnl +PMD max sleep request is 50 usecs by default. +PMD load based sleeps are enabled by default. +PMD thread core 0 NUMA 0: Max sleep request set to 100 usecs. +PMD thread core 3 NUMA 1: Max sleep request set to 300 usecs. +PMD thread core 5 NUMA 8: Max sleep request set to 200 usecs. +]) + +dnl remove a per-pmd entry and use default +get_log_next_line_num +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-sleep-max=50,3:300]) +CHECK_DP_SLEEP_MAX([50], [enabled], [+$LINENUM]) +CHECK_PMD_SLEEP_MAX([0], [0], [50], [+$LINENUM]) +CHECK_PMD_SLEEP_MAX([3], [1], [300], [+$LINENUM]) +CHECK_PMD_SLEEP_MAX([5], [8], [50], [+$LINENUM]) + +AT_CHECK([ovs-appctl dpif-netdev/pmd-sleep-show], [0], [dnl +PMD max sleep request is 50 usecs by default. +PMD load based sleeps are enabled by default. +PMD thread core 0 NUMA 0: Max sleep request set to 50 usecs. +PMD thread core 3 NUMA 1: Max sleep request set to 300 usecs. +PMD thread core 5 NUMA 8: Max sleep request set to 50 usecs. +]) + +dnl mix and change values + +get_log_next_line_num +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-sleep-max=3:400,200]) +CHECK_DP_SLEEP_MAX([200], [enabled], [+$LINENUM]) +CHECK_PMD_SLEEP_MAX([0], [0], [200], [+$LINENUM]) +CHECK_PMD_SLEEP_MAX([3], [1], [400], [+$LINENUM]) +CHECK_PMD_SLEEP_MAX([5], [8], [200], [+$LINENUM]) + + +AT_CHECK([ovs-appctl dpif-netdev/pmd-sleep-show], [0], [dnl +PMD max sleep request is 200 usecs by default. +PMD load
[ovs-dev] [PATCH v2 5/6] dpif-netdev: Add per pmd sleep config.
Extend 'pmd-sleep-max' so that individual PMD thread cores may have a specified max sleep request value. Any PMD thread core without a value will use the datapath default (no sleep request) or datapath global value set by the user. To set PMD thread cores 8 and 9 to never request a load based sleep and all other PMD thread cores to be able to request a max sleep of 50 usecs: $ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=50,8:0,9:0 To set PMD thread cores 10 and 11 to request a max sleep of 100 usecs and all other PMD thread cores to never request a sleep: $ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=10:100,11:100 'pmd-sleep-show' can be used to dump the global and individual PMD thread core max sleep request values. Signed-off-by: Kevin Traynor --- Documentation/topics/dpdk/pmd.rst | 23 +++ lib/dpif-netdev-private-thread.h | 3 + lib/dpif-netdev.c | 225 -- tests/pmd.at | 32 ++--- 4 files changed, 252 insertions(+), 31 deletions(-) diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst index 40e6b7843..eafcbc504 100644 --- a/Documentation/topics/dpdk/pmd.rst +++ b/Documentation/topics/dpdk/pmd.rst @@ -375,4 +375,27 @@ system configuration (e.g. enabling processor C-states) and workloads. rate. +Max sleep request values can be set for individual PMDs using key:value pairs. +Any PMD that has been assigned a specified value will use that. Any PMD that +does not have a specified value will use the current global default. + +Specified values for individual PMDs can be added or removed at any time. + +For example, to set PMD thread cores 8 and 9 to never request a load based +sleep and all others PMD cores to be able to request a max sleep of 50 usecs:: + +$ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=50,8:0,9:0 + +The max sleep request for each PMD can be checked in the logs or with:: + +$ ovs-appctl dpif-netdev/pmd-sleep-show +PMD max sleep request is 50 usecs by default. +PMD load based sleeps are enabled by default. +PMD thread core 8 NUMA 0: Max sleep request set to0 usecs. +PMD thread core 9 NUMA 1: Max sleep request set to0 usecs. +PMD thread core 10 NUMA 0: Max sleep request set to 50 usecs. +PMD thread core 11 NUMA 1: Max sleep request set to 50 usecs. +PMD thread core 12 NUMA 0: Max sleep request set to 50 usecs. +PMD thread core 13 NUMA 1: Max sleep request set to 50 usecs. + .. _ovs-vswitchd(8): http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html diff --git a/lib/dpif-netdev-private-thread.h b/lib/dpif-netdev-private-thread.h index 1ec3cd794..5c72ce5d9 100644 --- a/lib/dpif-netdev-private-thread.h +++ b/lib/dpif-netdev-private-thread.h @@ -181,4 +181,7 @@ struct dp_netdev_pmd_thread { bool isolated; +/* Max sleep request. UINT64_MAX indicates dp default should be used.*/ +atomic_uint64_t max_sleep; + /* Queue id used by this pmd thread to send packets on all netdevs if * XPS disabled for this netdev. All static_tx_qid's are unique and less diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 88d25f1da..d9ee53ff5 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -180,4 +180,9 @@ static struct odp_support dp_netdev_support = { #define PMD_SLEEP_INC_US 1 +struct pmd_sleep { +unsigned core_id; +uint64_t max_sleep; +}; + struct dpcls { struct cmap_node node; /* Within dp_netdev_pmd_thread.classifiers */ @@ -290,4 +295,6 @@ struct dp_netdev { /* Max load based sleep request. */ atomic_uint64_t pmd_max_sleep; +/* Max load based sleep request user string. */ +char *max_sleep_list; /* Enable the SMC cache from ovsdb config */ atomic_bool smc_enable_db; @@ -1013,9 +1020,15 @@ pmd_max_sleep_show(struct ds *reply, struct dp_netdev_pmd_thread *pmd, { if (pmd->core_id != NON_PMD_CORE_ID) { +uint64_t pmd_max_sleep; + +atomic_read_relaxed(>max_sleep, _max_sleep); ds_put_format(reply, "PMD thread core %3u NUMA %2d: " "Max sleep request set to", pmd->core_id, pmd->numa_id); -ds_put_format(reply, " %4"PRIu64" usecs.", default_max_sleep); +ds_put_format(reply, " %4"PRIu64" usecs.", + pmd_max_sleep == UINT64_MAX + ? default_max_sleep + : pmd_max_sleep); ds_put_cstr(reply, "\n"); } @@ -1528,7 +1541,8 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[], atomic_read_relaxed(>pmd_max_sleep, _max_sleep); ds_put_format(, "PMD max sleep request is %"PRIu64" " - "usecs.", default_max_sleep); + "usecs by default.", default_max_sleep); ds_put_cstr(, "\n"); -
[ovs-dev] [PATCH v2 4/6] dpif-netdev: Remove pmd-sleep-max experimental tag.
Signed-off-by: Kevin Traynor --- Documentation/topics/dpdk/pmd.rst | 4 ++-- NEWS | 1 + 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst index bedd42194..40e6b7843 100644 --- a/Documentation/topics/dpdk/pmd.rst +++ b/Documentation/topics/dpdk/pmd.rst @@ -325,6 +325,6 @@ reassignment due to PMD Auto Load Balance. For example, this could be set (in min) such that a reassignment is triggered at most every few hours. -PMD load based sleeping (Experimental) --- +PMD load based sleeping +--- PMD threads constantly poll Rx queues which are assigned to them. In order to diff --git a/NEWS b/NEWS index f8850d8a0..323d05c45 100644 --- a/NEWS +++ b/NEWS @@ -44,4 +44,5 @@ Post-v3.1.0 * Add 'ovs-appctl dpif-netdev/pmd-sleep-show' command to get the max sleep request setting of PMD thread cores. + * Remove experimental tag from PMD load based sleeping. -- 2.40.1 ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH v2 3/6] dpif-netdev: Add pmd-sleep-show command.
Max requested sleep time and status for a PMD thread is logged at start up or when changed, but it can be convenient to have a command to dump this information explicitly. It is envisaged that this will be expanded when future additions are added. Signed-off-by: Kevin Traynor --- Documentation/topics/dpdk/pmd.rst | 4 NEWS | 2 ++ lib/dpif-netdev.c | 40 +++ tests/pmd.at | 29 ++ 4 files changed, 71 insertions(+), 4 deletions(-) diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst index b261e9254..bedd42194 100644 --- a/Documentation/topics/dpdk/pmd.rst +++ b/Documentation/topics/dpdk/pmd.rst @@ -349,4 +349,8 @@ time not processing packets will be determined by the sleep and processor wake-up times and should be tested with each system configuration. +The max sleep time that a PMD thread may request can be shown with:: + +$ ovs-appctl dpif-netdev/pmd-sleep-show + Sleep time statistics for 10 secs can be seen with:: diff --git a/NEWS b/NEWS index f445377ad..f8850d8a0 100644 --- a/NEWS +++ b/NEWS @@ -42,4 +42,6 @@ Post-v3.1.0 table to check the status. * Rename 'pmd-maxsleep' other_config to 'pmd-sleep-max'. + * Add 'ovs-appctl dpif-netdev/pmd-sleep-show' command to get the + max sleep request setting of PMD thread cores. diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 5b53717e0..88d25f1da 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -702,4 +702,5 @@ enum pmd_info_type { PMD_INFO_SHOW_RXQ,/* Show poll lists of pmd threads. */ PMD_INFO_PERF_SHOW, /* Show pmd performance details. */ +PMD_INFO_MAX_SLEEP_SHOW, /* Show max sleep performance details. */ }; @@ -1007,4 +1008,18 @@ sorted_poll_thread_list(struct dp_netdev *dp, } +static void +pmd_max_sleep_show(struct ds *reply, struct dp_netdev_pmd_thread *pmd, + uint64_t default_max_sleep) +{ +if (pmd->core_id != NON_PMD_CORE_ID) { +ds_put_format(reply, + "PMD thread core %3u NUMA %2d: " + "Max sleep request set to", + pmd->core_id, pmd->numa_id); +ds_put_format(reply, " %4"PRIu64" usecs.", default_max_sleep); +ds_put_cstr(reply, "\n"); +} +} + static void dpif_netdev_subtable_lookup_get(struct unixctl_conn *conn, int argc OVS_UNUSED, @@ -1442,5 +1457,6 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[], unsigned long long max_secs = (PMD_INTERVAL_LEN * PMD_INTERVAL_MAX) / INTERVAL_USEC_TO_SEC; -bool first_show_rxq = true; +bool first_pmd = true; +uint64_t default_max_sleep = 0; ovs_mutex_lock(_netdev_mutex); @@ -1490,5 +1506,5 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[], } if (type == PMD_INFO_SHOW_RXQ) { -if (first_show_rxq) { +if (first_pmd) { if (!secs || secs > max_secs) { secs = max_secs; @@ -1499,5 +1515,5 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[], ds_put_format(, "Displaying last %u seconds " "pmd usage %%\n", secs); -first_show_rxq = false; +first_pmd = false; } pmd_info_show_rxq(, pmd, secs); @@ -1508,4 +1524,16 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[], } else if (type == PMD_INFO_PERF_SHOW) { pmd_info_show_perf(, pmd, (struct pmd_perf_params *)aux); +} else if (type == PMD_INFO_MAX_SLEEP_SHOW) { +if (first_pmd) { +atomic_read_relaxed(>pmd_max_sleep, _max_sleep); +ds_put_format(, "PMD max sleep request is %"PRIu64" " + "usecs.", default_max_sleep); +ds_put_cstr(, "\n"); +ds_put_format(, "PMD load based sleeps are %s.", + default_max_sleep ? "enabled" : "disabled"); +ds_put_cstr(, "\n"); +first_pmd = false; +} +pmd_max_sleep_show(, pmd, default_max_sleep); } } @@ -1608,5 +1636,6 @@ dpif_netdev_init(void) static enum pmd_info_type show_aux = PMD_INFO_SHOW_STATS, clear_aux = PMD_INFO_CLEAR_STATS, - poll_aux = PMD_INFO_SHOW_RXQ; + poll_aux = PMD_INFO_SHOW_RXQ, + sleep_aux = PMD_INFO_MAX_SLEEP_SHOW; unixctl_command_register("dpif-netdev/pmd-stats-show", "[-pmd core] [dp]", @@ -1620,4 +1649,7 @@ dpif_netdev_init(void) 0, 5, dpif_netdev_pmd_info, (void *)_aux); +
[ovs-dev] [PATCH v2 1/6] dpif-netdev: Rename pmd-maxsleep config option.
other_config:pmd-maxsleep is a config option to allow PMD thread cores to sleep under low or no load conditions. Rename it to 'pmd-sleep-max' to allow a more structured name and so that additional options or command can follow the 'pmd-sleep-xyz' pattern. Signed-off-by: Kevin Traynor --- Documentation/topics/dpdk/pmd.rst | 2 +- NEWS | 1 + lib/dpif-netdev.c | 2 +- tests/pmd.at | 12 ++-- vswitchd/vswitch.xml | 2 +- 5 files changed, 10 insertions(+), 9 deletions(-) diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst index e70986d16..b261e9254 100644 --- a/Documentation/topics/dpdk/pmd.rst +++ b/Documentation/topics/dpdk/pmd.rst @@ -335,5 +335,5 @@ This can be enabled by setting the max requested sleep time (in microseconds) for a PMD thread:: -$ ovs-vsctl set open_vswitch . other_config:pmd-maxsleep=50 +$ ovs-vsctl set open_vswitch . other_config:pmd-sleep-max=50 With a non-zero max value a PMD may request to sleep by an incrementing amount diff --git a/NEWS b/NEWS index 66d5a4ea3..f445377ad 100644 --- a/NEWS +++ b/NEWS @@ -41,4 +41,5 @@ Post-v3.1.0 interfaces that support it. See the 'status' column in the 'interface' table to check the status. + * Rename 'pmd-maxsleep' other_config to 'pmd-sleep-max'. diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index abe63412e..5b53717e0 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -4983,5 +4983,5 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config) set_pmd_auto_lb(dp, autolb_state, log_autolb); -pmd_max_sleep = smap_get_ullong(other_config, "pmd-maxsleep", 0); +pmd_max_sleep = smap_get_ullong(other_config, "pmd-sleep-max", 0); pmd_max_sleep = MIN(PMD_RCU_QUIESCE_INTERVAL, pmd_max_sleep); atomic_read_relaxed(>pmd_max_sleep, _pmd_max_sleep); diff --git a/tests/pmd.at b/tests/pmd.at index 48f3d432d..374ad7217 100644 --- a/tests/pmd.at +++ b/tests/pmd.at @@ -1266,5 +1266,5 @@ OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD load based sleeps are disabled dnl Check low value max sleep get_log_next_line_num -AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="1"]) +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="1"]) OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 1 usecs."]) OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."]) @@ -1272,5 +1272,5 @@ OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps dnl Check high value max sleep get_log_next_line_num -AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="1"]) +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="1"]) OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 1 usecs."]) OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."]) @@ -1278,5 +1278,5 @@ OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps dnl Check setting max sleep to zero get_log_next_line_num -AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="0"]) +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="0"]) OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 0 usecs."]) OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are disabled."]) @@ -1284,5 +1284,5 @@ OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps dnl Check above high value max sleep get_log_next_line_num -AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="10001"]) +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="10001"]) OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 1 usecs."]) OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."]) @@ -1290,10 +1290,10 @@ OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps dnl Check rounding get_log_next_line_num -AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="490"]) +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="490"]) OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 490 usecs."]) OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."]) dnl Check rounding get_log_next_line_num -AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="499"]) +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="499"]) OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 499 usecs."]) OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."]) diff --git
[ovs-dev] [PATCH v2 2/6] pmd.at: Add macro for checking pmd sleep max time and state.
This is just cosmetic. There is no change to the tests. Signed-off-by: Kevin Traynor --- tests/pmd.at | 39 --- 1 file changed, 24 insertions(+), 15 deletions(-) diff --git a/tests/pmd.at b/tests/pmd.at index 374ad7217..4dd775bd3 100644 --- a/tests/pmd.at +++ b/tests/pmd.at @@ -61,4 +61,20 @@ m4_define([CHECK_PMD_THREADS_CREATED], [ ]) +dnl CHECK_DP_SLEEP_MAX([max_sleep], [enabled], [+line]) +dnl +dnl Checks correct pmd load based sleep is set for the datapath. +dnl Checking starts from line number 'line' in ovs-vswithd.log . +m4_define([CHECK_DP_SLEEP_MAX], [ +SLEEP_TIME="PMD max sleep request is $1 usecs." +SLEEP_STATE="PMD load based sleeps are $2." +line_st=$3 +if [[ -z "$line_st" ]] +then +line_st="+0" +fi +OVS_WAIT_UNTIL([tail -n $line_st ovs-vswitchd.log | grep "$SLEEP_TIME"]) +OVS_WAIT_UNTIL([tail -n $line_st ovs-vswitchd.log | grep "$SLEEP_STATE"]) +]) + m4_define([SED_NUMA_CORE_PATTERN], ["s/\(numa_id \)[[0-9]]*\( core_id \)[[0-9]]*:/\1\2:/"]) m4_define([DUMMY_NUMA], [--dummy-numa="0,0,0,0"]) @@ -1256,46 +1272,39 @@ OVS_VSWITCHD_STOP AT_CLEANUP -dnl Check default state AT_SETUP([PMD - pmd sleep]) OVS_VSWITCHD_START dnl Check default -OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD max sleep request is 0 usecs."]) -OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD load based sleeps are disabled."]) +CHECK_DP_SLEEP_MAX([0], [disabled], []) dnl Check low value max sleep get_log_next_line_num AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="1"]) -OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 1 usecs."]) -OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."]) +CHECK_DP_SLEEP_MAX([1], [enabled], [+$LINENUM]) dnl Check high value max sleep get_log_next_line_num AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="1"]) -OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 1 usecs."]) -OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."]) +CHECK_DP_SLEEP_MAX([1], [enabled], [+$LINENUM]) dnl Check setting max sleep to zero get_log_next_line_num AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="0"]) -OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 0 usecs."]) -OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are disabled."]) +CHECK_DP_SLEEP_MAX([0], [disabled], [+$LINENUM]) dnl Check above high value max sleep get_log_next_line_num AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="10001"]) -OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 1 usecs."]) -OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."]) +CHECK_DP_SLEEP_MAX([1], [enabled], [+$LINENUM]) dnl Check rounding get_log_next_line_num AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="490"]) -OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 490 usecs."]) -OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."]) +CHECK_DP_SLEEP_MAX([490], [enabled], [+$LINENUM]) + dnl Check rounding get_log_next_line_num AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-sleep-max="499"]) -OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 499 usecs."]) -OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."]) +CHECK_DP_SLEEP_MAX([499], [enabled], [+$LINENUM]) OVS_VSWITCHD_STOP -- 2.40.1 ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH v2 0/6] PMD load based sleep updates and per-pmd config.
Patches 1-4 are about polishing the existing functionality and preparing for new functionality later in the series. Patch 1 renames 'pmd-maxsleep' to 'pmd-sleep-max'. I think this is allowed as it is still experimental, but we can keep 'pmd-maxsleep' as well, if people think it's a problem. Patches 5-6 adds a new per pmd config option functionality and tests. These can be considered separate from patches 1-4 but as there are dependencies, it made sense to add all the patches to this patch set. v2 - 2/6 fixed UT macro - 3/6 rebased NEWS confict - 3/6 moved pmd_max_sleep_show() location - 6/6 removed incorrect check GHA: https://github.com/kevintraynor/ovs/actions/runs/5325578666 Kevin Traynor (6): dpif-netdev: Rename pmd-maxsleep config option. pmd.at: Add macro for checking pmd sleep max time and state. dpif-netdev: Add pmd-sleep-show command. dpif-netdev: Remove pmd-sleep-max experimental tag. dpif-netdev: Add per pmd sleep config. pmd.at: Add per pmd max sleep unit tests. Documentation/topics/dpdk/pmd.rst | 33 +++- NEWS | 4 + lib/dpif-netdev-private-thread.h | 3 + lib/dpif-netdev.c | 259 -- tests/pmd.at | 232 +++--- vswitchd/vswitch.xml | 2 +- 6 files changed, 492 insertions(+), 41 deletions(-) -- 2.40.1 ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH 3/6] dpif-netdev: Add pmd-sleep-show command.
On 20/06/2023 18:02, Kevin Traynor wrote: On 14/06/2023 16:49, David Marchand wrote: On Wed, Jun 14, 2023 at 3:37 PM Kevin Traynor wrote: Max requested sleep time and status for a PMD thread is logged at start up or when changed, but it can be convenient to have a command to dump this information explicitly. It is envisaged that this will be expanded when future additions are added. Signed-off-by: Kevin Traynor --- Documentation/topics/dpdk/pmd.rst | 4 NEWS | 2 ++ lib/dpif-netdev.c | 40 +++ tests/pmd.at | 29 ++ 4 files changed, 71 insertions(+), 4 deletions(-) diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst index b261e9254..bedd42194 100644 --- a/Documentation/topics/dpdk/pmd.rst +++ b/Documentation/topics/dpdk/pmd.rst @@ -349,4 +349,8 @@ time not processing packets will be determined by the sleep and processor wake-up times and should be tested with each system configuration. +The max sleep time that a PMD thread may request can be shown with:: + +$ ovs-appctl dpif-netdev/pmd-sleep-show + Sleep time statistics for 10 secs can be seen with:: diff --git a/NEWS b/NEWS index 1ccc6f948..32b9e8167 100644 --- a/NEWS +++ b/NEWS @@ -39,4 +39,6 @@ Post-v3.1.0 - Userspace datapath: * Rename 'pmd-maxsleep' other_config to 'pmd-sleep-max'. + * Add 'ovs-appctl dpif-netdev/pmd-sleep-show' command to get the + max sleep request setting of PMD thread cores. diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 94c8ca943..dadf17b70 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -702,4 +702,5 @@ enum pmd_info_type { PMD_INFO_SHOW_RXQ,/* Show poll lists of pmd threads. */ PMD_INFO_PERF_SHOW, /* Show pmd performance details. */ +PMD_INFO_MAX_SLEEP_SHOW, /* Show max sleep performance details. */ }; @@ -884,4 +885,18 @@ sorted_poll_list(struct dp_netdev_pmd_thread *pmd, struct rxq_poll **list, } +static void +pmd_max_sleep_show(struct ds *reply, struct dp_netdev_pmd_thread *pmd, + uint64_t default_max_sleep) +{ +if (pmd->core_id != NON_PMD_CORE_ID) { +ds_put_format(reply, + "PMD thread core %3u NUMA %2d: " + "Max sleep request set to", + pmd->core_id, pmd->numa_id); +ds_put_format(reply, " %4"PRIu64" usecs.", default_max_sleep); +ds_put_cstr(reply, "\n"); +} +} + Nit: I would move this new function later, and let pmd_info_show_rxq() and sorted_poll_list() close to each other. Will add to v2. I placed it after the other dpif_netdev_pmd_info() related functions. static void pmd_info_show_rxq(struct ds *reply, struct dp_netdev_pmd_thread *pmd, @@ -1442,5 +1457,6 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[], unsigned long long max_secs = (PMD_INTERVAL_LEN * PMD_INTERVAL_MAX) / INTERVAL_USEC_TO_SEC; -bool first_show_rxq = true; +bool first_pmd = true; +uint64_t default_max_sleep = 0; You can move this new variable in the only block that cares about it. Yes, I can do that if I remove the initialization. It will work now, but not being able to initialize feels like it makes the code a bit more brittle as that nuance could be missed if a default initialization was ever was needed in future (I don't have something in mind). What do you think? I can go with your preference. Discussed with David offline. Working in practice, but it's not fully safe to rely that default_max_sleep will keep the read value after the loop if we move the declaration. So will stick with current declaration. ovs_mutex_lock(_netdev_mutex); @@ -1490,5 +1506,5 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[], } if (type == PMD_INFO_SHOW_RXQ) { -if (first_show_rxq) { +if (first_pmd) { if (!secs || secs > max_secs) { secs = max_secs; @@ -1499,5 +1515,5 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[], ds_put_format(, "Displaying last %u seconds " "pmd usage %%\n", secs); -first_show_rxq = false; +first_pmd = false; } pmd_info_show_rxq(, pmd, secs); @@ -1508,4 +1524,16 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[], } else if (type == PMD_INFO_PERF_SHOW) { pmd_info_show_perf(, pmd, (struct pmd_perf_params *)aux); +} else if (type == PMD_INFO_MAX_SLEEP_SHOW) { +if (first_pmd) { +atomic_read_relaxed(>pmd_max_sleep, _max_sleep); +ds_put_format(, "PMD max sleep request is %"PRIu64" " +
Re: [ovs-dev] [PATCH ovn 0/2] Update stale TODO items.
On 6/20/23 21:24, Han Zhou wrote: > On Tue, Jun 20, 2023 at 2:19 AM Dumitru Ceara wrote: >> >> Patch 1/2 removes the ones that are not applicable anymore. >> Patch 2/2 addresses a TODO related to SB.Port_Group.name backwards > compatibility. >> >> Dumitru Ceara (2): >> TODO.rst: Remove no longer applicable items. >> expr.c: Remove backwards compatibility lookup in parse_port_group(). >> >> TODO.rst | 8 >> lib/expr.c | 12 >> 2 files changed, 20 deletions(-) >> >> ___ >> dev mailing list >> d...@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > > LGTM for both patches. Thanks Dumitru! > > Acked-by: Han Zhou > Thanks! I applied these to the main branch. ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] Request adding long poll interval metrics to coverage metrics
On 20 Jun 2023, at 16:57, Ilya Maximets wrote: > On 6/20/23 16:10, Aaron Conole wrote: >> Adrian Moreno writes: >> >>> On 6/19/23 10:36, Eelco Chaudron wrote: On 16 Jun 2023, at 19:19, Aaron Conole wrote: > Martin Kennelly writes: > >> Hey ovs community, >> >> I am a developer working on ovn-kubernetes and I want to >> programmatically consume long poll information >> i.e: >> ovs|00211|timeval(handler25)|WARN|Unreasonably long 52388ms poll >> interval (752ms user, 209ms system) >> >> This is currently exposed via journal logs but it's not practical >> to consume it there programmatically and I was >> hoping you could add it to coverage metrics. > > I think it could be useful. I do want to be careful about exposing > these kinds of data in a way that could be misinterpreted. Already, > that log in particular gets misinterpreted quite a bit, and RH gets > customers claiming OVS is misbehaving when they've oversubscribed the > system. +1 >>> >>> Maybe it's a good time to start documenting coverage counters? >> >> I agree - we should have at least some kind of documentation. Actually, >> it would be really nice if we could do something during >> COVERAGE_DEFINE() that would be like: >> >> COVERAGE_DEFINE(ctr, "description") >> >> and then we can generate documentation from the COVERAGE_DEFINE()s as >> well as querying for it with an ovs-appctl command. I think this might be the only way to keep the doc in sync, however, who is going to document the 320+ existing ones? >> That might be trying to be too fancy, though. > > > The problem with documenting coverage counters is that we can't > and shouldn't claim that these are in any way a stable API. > They are mainly for developers. Code can change and there might > be no meaningful way preserve current counters or their names. > They can change in each release including minor ones without > prior notice. +1000 as even if the direct code near the counter does not changes, it could still be impacted the actual counter. Which I have seen before. > Best regards, Ilya Maximets. > >> > Mechanically, it would be pretty simple to do something like: > > --- > diff --git a/lib/timeval.c b/lib/timeval.c > index 193c7bab17..00e5f2a74d 100644 > --- a/lib/timeval.c > +++ b/lib/timeval.c > @@ -40,6 +40,7 @@ > #include "openvswitch/vlog.h" > > VLOG_DEFINE_THIS_MODULE(timeval); > +COVERAGE_DEFINE(long_poll_interval); > > #if !defined(HAVE_CLOCK_GETTIME) > typedef unsigned int clockid_t; > @@ -645,6 +646,8 @@ log_poll_interval(long long int last_wakeup) > struct rusage rusage; > > if (!getrusage_thread()) { > +COVERAGE_INC(long_poll_interval); > + > VLOG_WARN("Unreasonably long %lldms poll interval" > " (%lldms user, %lldms system)", > interval, > --- > > This would at least expose the coverage data via the coverage framework > and it can be queried via ovs-appctl. Actually, the advantage here is > that the coverage counter can track some details about X/sec over the > last 5 seconds, minute, hour, in addition to the total, so we can see > whether the condition is ongoing. ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev >> >> ___ >> dev mailing list >> d...@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >> > > ___ > dev mailing list > d...@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev