On 10/07/2018 13:40, Ilya Maximets wrote: > On 10.07.2018 15:30, Eelco Chaudron wrote: >> >> >> On 10 Jul 2018, at 13:06, Tiago Lam wrote: >> >>> From: Mark Kavanagh <mark.b.kavan...@intel.com> >>> >>> Currently, jumbo frame support for OvS-DPDK is implemented by >>> increasing the size of mbufs within a mempool, such that each mbuf >>> within the pool is large enough to contain an entire jumbo frame of >>> a user-defined size. Typically, for each user-defined MTU, >>> 'requested_mtu', a new mempool is created, containing mbufs of size >>> ~requested_mtu. >>> >>> With the multi-segment approach, a port uses a single mempool, >>> (containing standard/default-sized mbufs of ~2k bytes), irrespective >>> of the user-requested MTU value. To accommodate jumbo frames, mbufs >>> are chained together, where each mbuf in the chain stores a portion of >>> the jumbo frame. Each mbuf in the chain is termed a segment, hence the >>> name. >>> >>> == Enabling multi-segment mbufs == >>> Multi-segment and single-segment mbufs are mutually exclusive, and the >>> user must decide on which approach to adopt on init. The introduction >>> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This >>> is a global boolean value, which determines how jumbo frames are >>> represented across all DPDK ports. In the absence of a user-supplied >>> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment >>> mbufs must be explicitly enabled / single-segment mbufs remain the >>> default. >>> >>> Setting the field is identical to setting existing DPDK-specific OVSDB >>> fields: >>> >>> ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true >>> ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10 >>> ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0 >>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true >>> >>> Co-authored-by: Tiago Lam <tiago....@intel.com> >>> >>> Signed-off-by: Mark Kavanagh <mark.b.kavan...@intel.com> >>> Signed-off-by: Tiago Lam <tiago....@intel.com> >>> Acked-by: Eelco Chaudron <echau...@redhat.com> >>> --- >>> Documentation/topics/dpdk/jumbo-frames.rst | 51 +++++++++++++++++ >>> NEWS | 2 + >>> lib/dpdk.c | 8 +++ >>> lib/netdev-dpdk.c | 92 >>> +++++++++++++++++++++++++----- >>> lib/netdev-dpdk.h | 2 + >>> vswitchd/vswitch.xml | 23 ++++++++ >>> 6 files changed, 164 insertions(+), 14 deletions(-) >>> >>> diff --git a/Documentation/topics/dpdk/jumbo-frames.rst >>> b/Documentation/topics/dpdk/jumbo-frames.rst >>> index 00360b4..d1fb111 100644 >>> --- a/Documentation/topics/dpdk/jumbo-frames.rst >>> +++ b/Documentation/topics/dpdk/jumbo-frames.rst >>> @@ -71,3 +71,54 @@ Jumbo frame support has been validated against 9728B >>> frames, which is the >>> largest frame size supported by Fortville NIC using the DPDK i40e driver, >>> but >>> larger frames and other DPDK NIC drivers may be supported. These cases are >>> common for use cases involving East-West traffic only. >>> + >>> +------------------- >>> +Multi-segment mbufs >>> +------------------- >>> + >>> +Instead of increasing the size of mbufs within a mempool, such that each >>> mbuf >>> +within the pool is large enough to contain an entire jumbo frame of a >>> +user-defined size, mbufs can be chained together instead. In this approach >>> each >>> +mbuf in the chain stores a portion of the jumbo frame, by default ~2K >>> bytes, >>> +irrespective of the user-requested MTU value. Since each mbuf in the chain >>> is >>> +termed a segment, this approach is named "multi-segment mbufs". 
>>> + >>> +This approach may bring more flexibility in use cases where the maximum >>> packet >>> +length may be hard to guess. For example, in cases where packets originate >>> from >>> +sources marked for offload (such as TSO), each packet may be larger than >>> the >>> +MTU, and as such, when forwarding it to a DPDK port a single mbuf may not >>> be >>> +enough to hold all of the packet's data. >>> + >>> +Multi-segment and single-segment mbufs are mutually exclusive, and the user >>> +must decide on which approach to adopt on initialisation. If multi-segment >>> +mbufs are to be enabled, this can be done with the following command:: >>> + >>> + $ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true >>> + >>> +Single-segment mbufs still remain the default when using OvS-DPDK, and the >>> +above option `dpdk-multi-seg-mbufs` must be explicitly set to `true` if >>> +multi-segment mbufs are to be used. >>> + >>> +~~~~~~~~~~~~~~~~~ >>> +Performance notes >>> +~~~~~~~~~~~~~~~~~ >>> + >>> +When using multi-segment mbufs some PMDs may not support vectorized Tx >>> +functions, due to their non-contiguous nature. As a result this can hit >>> +performance for smaller packet sizes. For example, on a setup sending 64B >>> +packets at line rate, a decrease of ~20% has been observed. The performance >>> +impact stops being noticeable for larger packet sizes, although the exact >>> size >>> +will vary between PMDs and the architecture in use. >>> + >>> +Tests performed with the i40e PMD driver only showed this limitation for >>> 64B >>> +packets, and the same rate was observed when comparing multi-segment mbufs >>> and >>> +single-segment mbufs for 128B packets. In other words, the 20% drop in >>> +performance was not observed for packets >= 128B during this test case. >>> + >>> +Because of this, the use of multi-segment mbufs is not advised with smaller >>> +packet sizes, such as 64B. >>> + >>> +Also, note that using multi-segment mbufs won't improve memory usage. For a >>> +packet of 9000B, for example, which would be stored on a single mbuf when >>> using >>> +the single-segment approach, 5 mbufs (9000/2048) of 2048B would be needed >>> to >>> +store the same data using the multi-segment mbufs approach. >>> diff --git a/NEWS b/NEWS >>> index 92e9b92..c7facac 100644 >>> --- a/NEWS >>> +++ b/NEWS >>> @@ -38,6 +38,7 @@ Post-v2.9.0 >>> * Allow init to fail and record DPDK status/version in OVS database. >>> * Add experimental flow hardware offload support >>> * Support both shared and per port mempools for DPDK devices. >>> + * Add support for multi-segment mbufs. >>> - Userspace datapath: >>> * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single >>> PMD >>> * Detailed PMD performance metrics available with new command >>> @@ -115,6 +116,7 @@ v2.9.0 - 19 Feb 2018 >>> * New appctl command 'dpif-netdev/pmd-rxq-rebalance' to rebalance rxq >>> to >>> pmd assignments. >>> * Add rxq utilization of pmd to appctl 'dpif-netdev/pmd-rxq-show'. >>> + * Add support for vHost dequeue zero copy (experimental) >> >> Guess this was added on error? > > I guess, Ian accidentally removed this line by commit > c3c722d2c7ee ("Documentation: document ovs-dpdk flow offload"). > > But, sure, this change should not be in this patch. >
Thanks for tracking it, Ilya (and for finding it, Eelco). I was wrapping my head around it myself as this was still present in v3. >> >>> - Userspace datapath: >>> * Output packet batching support. >>> - vswitchd: >>> diff --git a/lib/dpdk.c b/lib/dpdk.c >>> index 0ee3e19..ac89fd8 100644 >>> --- a/lib/dpdk.c >>> +++ b/lib/dpdk.c >>> @@ -497,6 +497,14 @@ dpdk_init__(const struct smap *ovs_other_config) >>> >>> /* Finally, register the dpdk classes */ >>> netdev_dpdk_register(); >>> + >>> + bool multi_seg_mbufs_enable = smap_get_bool(ovs_other_config, >>> + "dpdk-multi-seg-mbufs", false); >>> + if (multi_seg_mbufs_enable) { >>> + VLOG_INFO("DPDK multi-segment mbufs enabled\n"); >>> + netdev_dpdk_multi_segment_mbufs_enable(); >>> + } >>> + >>> return true; >>> } >>> >>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c >>> index 81dd1ed..6fc95d6 100644 >>> --- a/lib/netdev-dpdk.c >>> +++ b/lib/netdev-dpdk.c >>> @@ -70,6 +70,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM}; >>> >>> VLOG_DEFINE_THIS_MODULE(netdev_dpdk); >>> static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); >>> +static bool dpdk_multi_segment_mbufs = false; >>> >>> #define DPDK_PORT_WATCHDOG_INTERVAL 5 >>> >>> @@ -521,6 +522,18 @@ is_dpdk_class(const struct netdev_class *class) >>> || class->destruct == netdev_dpdk_vhost_destruct; >>> } >>> >>> +bool >>> +netdev_dpdk_is_multi_segment_mbufs_enabled(void) >>> +{ >>> + return dpdk_multi_segment_mbufs == true; >>> +} >>> + >>> +void >>> +netdev_dpdk_multi_segment_mbufs_enable(void) >>> +{ >>> + dpdk_multi_segment_mbufs = true; >>> +} >>> + >>> /* DPDK NIC drivers allocate RX buffers at a particular granularity, >>> typically >>> * aligned at 1k or less. If a declared mbuf size is not a multiple of this >>> * value, insufficient buffers are allocated to accomodate the packet in >>> its >>> @@ -628,14 +641,19 @@ dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex) >>> } >>> } >>> >>> -/* Calculating the required number of mbufs differs depending on the >>> - * mempool model being used. Check if per port memory is in use before >>> - * calculating. >>> - */ >>> +/* Calculating the required number of mbufs differs depending on the >>> mempool >>> + * model (per port vs shared mempools) being used. >>> + * In case multi-segment mbufs are being used, the number of mbufs is also >>> + * increased when using the per port model, to account for the multiple >>> mbufs >>> + * needed to hold each packet's data. This doesn't happen for the shared >>> port >>> + * model since the number of mbufs assigned is should already be sufficient >>> + * (MAX_NB_MBUF) */ >>> static uint32_t >>> -dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp) >>> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, uint32_t mbuf_size, >>> + bool per_port_mp) >>> { >>> uint32_t n_mbufs; >>> + uint16_t max_frame_len = 0; >>> >>> if (!per_port_mp) { >>> /* Shared memory are being used. >>> @@ -662,6 +680,22 @@ dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, >>> bool per_port_mp) >>> + dev->requested_n_txq * dev->requested_txq_size >>> + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * >>> NETDEV_MAX_BURST >>> + MIN_NB_MBUF; >>> + >>> + /* If multi-segment mbufs are used, we also increase the number of >>> + * mbufs used. This is done by calculating how many mbufs are >>> needed to >>> + * hold the data on a single packet of MTU size. 
For example, for a >>> + * received packet of 9000B, 5 mbufs (9000 / 2048) are needed to >>> hold >>> + * the data - 4 more than with single-mbufs (as mbufs' size is >>> extended >>> + * to hold all data) */ >>> + max_frame_len = MTU_TO_MAX_FRAME_LEN(mtu); >>> + if (dpdk_multi_segment_mbufs && mbuf_size < max_frame_len) { >>> + uint16_t nb_segs = max_frame_len / mbuf_size; >>> + if (max_frame_len % mbuf_size) { >>> + nb_segs += 1; >>> + } >>> + >>> + n_mbufs *= nb_segs; >>> + } >>> } >>> >>> return n_mbufs; >>> @@ -688,7 +722,7 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, >>> uint32_t mbuf_size, >>> dmp->mtu = mtu; >>> dmp->refcount = 1; >>> >>> - n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp); >>> + n_mbufs = dpdk_calculate_mbufs(dev, mtu, mbuf_size, per_port_mp); >>> >>> do { >>> /* Full DPDK memory pool name must be unique and cannot be >>> @@ -843,16 +877,22 @@ dpdk_mp_put(struct dpdk_mp *dmp) >>> ovs_mutex_unlock(&dpdk_mp_mutex); >>> } >>> >>> -/* Depending on the memory model being used this function tries to >>> - * identify and reuse an existing mempool or tries to allocate a new >>> - * mempool on requested_socket_id with mbuf size corresponding to the >>> - * requested_mtu. On success, a new configuration will be applied. >>> - * On error, device will be left unchanged. */ >>> +/* Depending on the memory model (per port vs shared mempools) being used >>> this >>> + * function tries to identify and reuse an existing mempool or tries to >>> + * allocate a new mempool on requested_socket_id, with mbuf size set to the >>> + * following, depending if 'dpdk_multi_segment_mbufs' is enabled or not: >>> + * - if 'true', then the mempool contains standard-sized mbufs that are >>> chained >>> + * together to accommodate packets of size 'requested_mtu'. >>> + * - if 'false', then the members of the allocated mempool are >>> + * non-standard-sized mbufs. Each mbuf in the mempool is large enough to >>> + * fully accomdate packets of size 'requested_mtu'. >>> + * On success, a new configuration will be applied. On error, device will >>> be >>> + * left unchanged. */ >>> static int >>> netdev_dpdk_mempool_configure(struct netdev_dpdk *dev) >>> OVS_REQUIRES(dev->mutex) >>> { >>> - uint32_t buf_size = dpdk_buf_size(dev->requested_mtu); >>> + uint16_t buf_size = 0; >>> struct dpdk_mp *dmp; >>> int ret = 0; >>> bool per_port_mp = dpdk_per_port_memory(); >>> @@ -865,6 +905,16 @@ netdev_dpdk_mempool_configure(struct netdev_dpdk *dev) >>> return ret; >>> } >>> >>> + dpdk_mp_sweep(); >> >> Is this sweep needed, as its part of dpdk_mp_get() >> A blatant leftover from the rebase. Thanks, Eelco. I'll wait for Ian's reply in patch 01/15 and take this into account for v5. Tiago. _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
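
The memory-usage point made in the documentation above (a 9000B packet needing 5 standard mbufs) comes down to a ceiling division of the frame length by the per-segment data room. Below is a minimal, standalone C sketch of that arithmetic only; the 2048B data room and the helper name segments_needed() are assumptions made for illustration, not values or code taken from the patch.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Number of standard-sized mbuf segments needed to hold 'frame_len' bytes,
 * i.e. a ceiling division of the frame length by the per-segment data room. */
static uint32_t
segments_needed(uint32_t frame_len, uint32_t seg_data_room)
{
    return (frame_len + seg_data_room - 1) / seg_data_room;
}

int
main(void)
{
    const uint32_t seg_data_room = 2048;   /* assumed ~2K data room per mbuf */
    const uint32_t frame_lens[] = { 64, 1518, 9000, 9728 };

    for (size_t i = 0; i < sizeof frame_lens / sizeof frame_lens[0]; i++) {
        /* 9000B -> 5 segments, matching the 9000/2048 example in the docs. */
        printf("%" PRIu32 "B frame -> %" PRIu32 " segment(s)\n",
               frame_lens[i],
               segments_needed(frame_lens[i], seg_data_room));
    }
    return 0;
}

This also illustrates why memory usage does not improve with multi-segment mbufs: the trailing segment consumes a whole 2048B mbuf even when it only holds the last few hundred bytes of the frame.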
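
On the consumer side, the chaining itself is plain DPDK: segments are linked through the mbuf's next pointer, each segment's data_len covers its own slice of the frame, and the head mbuf's pkt_len covers the whole chain. The sketch below shows how such a chain would typically be flattened into a contiguous buffer; it is generic rte_mbuf usage given for illustration only, not code from this patch, and the function name is made up.

#include <string.h>
#include <rte_mbuf.h>

/* Copy a (possibly multi-segment) frame into a flat buffer.  Returns the
 * number of bytes copied, or 0 if 'buf' cannot hold the whole frame. */
static uint32_t
flatten_mbuf_chain(const struct rte_mbuf *head, uint8_t *buf, uint32_t buf_len)
{
    const struct rte_mbuf *seg;
    uint32_t copied = 0;

    if (rte_pktmbuf_pkt_len(head) > buf_len) {
        return 0;
    }

    for (seg = head; seg != NULL; seg = seg->next) {
        /* Each segment holds a contiguous slice of the frame's data. */
        memcpy(buf + copied, rte_pktmbuf_mtod(seg, const uint8_t *),
               rte_pktmbuf_data_len(seg));
        copied += rte_pktmbuf_data_len(seg);
    }

    return copied;   /* equals rte_pktmbuf_pkt_len(head) */
}

Walking a 9000B frame this way touches five separate mbufs rather than one, which is the non-contiguous access pattern behind the vectorized-Tx limitation mentioned in the performance notes above.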