On 08.08.2016 18:50, Mark Kavanagh wrote:
> Add support for Jumbo Frames to DPDK-enabled port types,
> using single-segment-mbufs.
> 
> Using this approach, the amount of memory allocated to each mbuf
> to store frame data is increased to a value greater than 1518B
> (typical Ethernet maximum frame length). The increased space
> available in the mbuf means that an entire Jumbo Frame of a specific
> size can be carried in a single mbuf, as opposed to partitioning
> it across multiple mbuf segments.
> 
> The amount of space allocated to each mbuf to hold frame data is
> defined dynamically by the user with ovs-vsctl, via the 'mtu_request'
> parameter.
> 
> Signed-off-by: Mark Kavanagh <mark.b.kavan...@intel.com>
> [diproiet...@vmware.com rebased]
> Signed-off-by: Daniele Di Proietto <diproiet...@vmware.com>
> ---
> 
> v4:
>     - restore error reporting in *_reconfigure functions (for
>       non-mtu-configuration based errors)
>     - remove 'goto' in the event of dpdk_mp_configure failure
>     - remove superfluous error variables
> 
>  v3:
>     - replace netdev_dpdk.last_mtu with local variable
>     - add comment for dpdk_mp_configure
> 
>  v2:
>      - rebase to HEAD of master
>      - fall back to previous 'good' MTU if reconfigure fails
>      - introduce new field 'last_mtu' in struct netdev_dpdk to facilitate
>        fall-back
>      - rename 'mtu_request' to 'requested_mtu' in struct netdev_dpdk
>      - remove rebasing artifact in INSTALL.DPDK-Advanced.md
>      - remove superflous variable in dpdk_mp_configure
>      - fix minor coding style infraction
> 
> 
>  INSTALL.DPDK-ADVANCED.md |  58 ++++++++++++++++-
>  INSTALL.DPDK.md          |   1 -
>  NEWS                     |   1 +
>  lib/netdev-dpdk.c        | 162 
> ++++++++++++++++++++++++++++++++++++++++-------
>  4 files changed, 194 insertions(+), 28 deletions(-)
> 
> diff --git a/INSTALL.DPDK-ADVANCED.md b/INSTALL.DPDK-ADVANCED.md
> index 0ab43d4..5e758ce 100755
> --- a/INSTALL.DPDK-ADVANCED.md
> +++ b/INSTALL.DPDK-ADVANCED.md
> @@ -1,5 +1,5 @@
>  OVS DPDK ADVANCED INSTALL GUIDE
> -=================================
> +===============================
>  
>  ## Contents
>  
> @@ -12,7 +12,8 @@ OVS DPDK ADVANCED INSTALL GUIDE
>  7. [QOS](#qos)
>  8. [Rate Limiting](#rl)
>  9. [Flow Control](#fc)
> -10. [Vsperf](#vsperf)
> +10. [Jumbo Frames](#jumbo)
> +11. [Vsperf](#vsperf)
>  
>  ## <a name="overview"></a> 1. Overview
>  
> @@ -862,7 +863,58 @@ respective parameter. To disable the flow control at tx 
> side,
>  
>  `ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false`
>  
> -## <a name="vsperf"></a> 10. Vsperf
> +## <a name="jumbo"></a> 10. Jumbo Frames
> +
> +By default, DPDK ports are configured with standard Ethernet MTU (1500B). To
> +enable Jumbo Frames support for a DPDK port, change the Interface's 
> `mtu_request`
> +attribute to a sufficiently large value.
> +
> +e.g. Add a DPDK Phy port with MTU of 9000:
> +
> +`ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set 
> Interface dpdk0 mtu_request=9000`
> +
> +e.g. Change the MTU of an existing port to 6200:
> +
> +`ovs-vsctl set Interface dpdk0 mtu_request=6200`
> +
> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are
> +increased, such that a full Jumbo Frame of a specific size may be 
> accommodated
> +within a single mbuf segment.
> +
> +Jumbo frame support has been validated against 9728B frames (largest frame 
> size
> +supported by Fortville NIC), using the DPDK `i40e` driver, but larger frames
> +(particularly in use cases involving East-West traffic only), and other DPDK 
> NIC
> +drivers may be supported.
> +
> +### 9.1 vHost Ports and Jumbo Frames
> +
> +Some additional configuration is needed to take advantage of jumbo frames 
> with
> +vhost ports:
> +
> +    1. `mergeable buffers` must be enabled for vHost ports, as demonstrated 
> in
> +        the QEMU command line snippet below:
> +
> +        ```
> +        '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
> +        '-device 
> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
> +        ```
> +
> +    2. Where virtio devices are bound to the Linux kernel driver in a guest
> +       environment (i.e. interfaces are not bound to an in-guest DPDK 
> driver),
> +       the MTU of those logical network interfaces must also be increased to 
> a
> +       sufficiently large value. This avoids segmentation of Jumbo Frames
> +       received in the guest. Note that 'MTU' refers to the length of the IP
> +       packet only, and not that of the entire frame.
> +
> +       To calculate the exact MTU of a standard IPv4 frame, subtract the L2
> +       header and CRC lengths (i.e. 18B) from the max supported frame size.
> +       So, to set the MTU for a 9018B Jumbo Frame:
> +
> +       ```
> +       ifconfig eth1 mtu 9000
> +       ```
> +
> +## <a name="vsperf"></a> 11. Vsperf
>  
>  Vsperf project goal is to develop vSwitch test framework that can be used to
>  validate the suitability of different vSwitch implementations in a Telco 
> deployment
> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
> index 253d022..a810ac8 100644
> --- a/INSTALL.DPDK.md
> +++ b/INSTALL.DPDK.md
> @@ -590,7 +590,6 @@ can be found in [Vhost Walkthrough].
>  
>  ## <a name="ovslimits"></a> 6. Limitations
>  
> -  - Supports MTU size 1500, MTU setting for DPDK netdevs will be in future 
> OVS release.
>    - Currently DPDK ports does not use HW offload functionality.
>    - Network Interface Firmware requirements:
>      Each release of DPDK is validated against a specific firmware version for
> diff --git a/NEWS b/NEWS
> index ce10982..53c816b 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -69,6 +69,7 @@ Post-v2.5.0
>       * Basic connection tracking for the userspace datapath (no ALG,
>         fragmentation or NAT support yet)
>       * Support for DPDK 16.07
> +     * Jumbo frame support
>     - Increase number of registers to 16.
>     - ovs-benchmark: This utility has been removed due to lack of use and
>       bitrot.
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 60db568..98ce3d0 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -81,6 +81,7 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 
> 20);
>                                      + sizeof(struct dp_packet)    \
>                                      + RTE_PKTMBUF_HEADROOM)
>  #define NETDEV_DPDK_MBUF_ALIGN      1024
> +#define NETDEV_DPDK_MAX_PKT_LEN     9728
>  
>  /* Max and min number of packets in the mempool.  OVS tries to allocate a
>   * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
> @@ -337,6 +338,7 @@ struct netdev_dpdk {
>  
>      struct dpdk_mp *dpdk_mp;
>      int mtu;
> +    int requested_mtu;

I think this should be placed near to other 'requested_*' variables somewhere
under the "The following properties cannot be changed when a device is 
running...".

>      int socket_id;
>      int buf_size;
>      struct netdev_stats stats;
> @@ -477,10 +479,19 @@ dpdk_mp_get(int socket_id, int mtu) 
> OVS_REQUIRES(dpdk_mutex)
>      dmp->mtu = mtu;
>      dmp->refcount = 1;
>      mbp_priv.mbuf_data_room_size = MBUF_SIZE(mtu) - sizeof(struct dp_packet);
> -    mbp_priv.mbuf_priv_size = sizeof (struct dp_packet) -
> -                              sizeof (struct rte_mbuf);
> +    mbp_priv.mbuf_priv_size = sizeof (struct dp_packet)
> +                              - sizeof (struct rte_mbuf);
> +    /* XXX: this is a really rough method of provisioning memory.
> +     * It's impossible to determine what the exact memory requirements are 
> when
> +     * the number of ports and rxqs that utilize a particular mempool can 
> change
> +     * dynamically at runtime. For the moment, use this rough heurisitic.
> +     */
> +    if (mtu >= ETHER_MTU) {
> +        mp_size = MAX_NB_MBUF;
> +    } else {
> +        mp_size = MIN_NB_MBUF;
> +    }
>  
> -    mp_size = MAX_NB_MBUF;
>      do {
>          if (snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d_%d_%u",
>                       dmp->mtu, dmp->socket_id, mp_size) < 0) {
> @@ -520,6 +531,33 @@ dpdk_mp_put(struct dpdk_mp *dmp) OVS_REQUIRES(dpdk_mutex)
>      }
>  }
>  
> +/* Retrieve and configure a dpdk_mp that can accommodate dev's MTU.
> + * On success, decrement the refcount to dev's previous dpdk_mp (if 
> applicable)
> + * and return 0.
> + * If insufficient memory is available to create dev's dpdk_mp, return 
> ENOMEM.
> + */
> +static int
> +dpdk_mp_configure(struct netdev_dpdk *dev)
> +    OVS_REQUIRES(dpdk_mutex)
> +    OVS_REQUIRES(dev->mutex)
> +{
> +    uint32_t buf_size = dpdk_buf_size(dev->mtu);
> +    struct dpdk_mp *mp;
> +
> +    mp = dpdk_mp_get(dev->socket_id, FRAME_LEN_TO_MTU(buf_size));
> +    if (!mp) {
> +        VLOG_ERR("Insufficient memory to create memory pool for netdev %s\n",
> +                 dev->up.name);

Socket id would be nice in this error report.

> +        return ENOMEM;
> +    } else {
> +        dpdk_mp_put(dev->dpdk_mp);
> +        dev->dpdk_mp = mp;
> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> +    }
> +
> +    return 0;
> +}
> +
>  static void
>  check_link_status(struct netdev_dpdk *dev)
>  {
> @@ -571,7 +609,15 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int 
> n_rxq, int n_txq)
>  {
>      int diag = 0;
>      int i;
> +    struct rte_eth_conf conf = port_conf;
>  
> +    if (dev->mtu > ETHER_MTU) {
> +        conf.rxmode.jumbo_frame = 1;
> +        conf.rxmode.max_rx_pkt_len = dev->max_packet_len;
> +    } else {
> +        conf.rxmode.jumbo_frame = 0;
> +        conf.rxmode.max_rx_pkt_len = 0;
> +    }
>      /* A device may report more queues than it makes available (this has
>       * been observed for Intel xl710, which reserves some of them for
>       * SRIOV):  rte_eth_*_queue_setup will fail if a queue is not
> @@ -582,8 +628,10 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int 
> n_rxq, int n_txq)
>              VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
>          }
>  
> -        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf);
> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
>          if (diag) {
> +            VLOG_WARN("Interface %s eth_dev setup error %s\n",
> +                      dev->up.name, rte_strerror(-diag));
>              break;
>          }
>  
> @@ -736,7 +784,6 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int 
> port_no,
>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>      int sid;
>      int err = 0;
> -    uint32_t buf_size;
>  
>      ovs_mutex_init(&dev->mutex);
>      ovs_mutex_lock(&dev->mutex);
> @@ -757,15 +804,13 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int 
> port_no,
>      dev->port_id = port_no;
>      dev->type = type;
>      dev->flags = 0;
> -    dev->mtu = ETHER_MTU;
> +    dev->requested_mtu = dev->mtu = ETHER_MTU;
>      dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>      ovsrcu_index_init(&dev->vid, -1);
>      dev->vhost_reconfigured = false;
>  
> -    buf_size = dpdk_buf_size(dev->mtu);
> -    dev->dpdk_mp = dpdk_mp_get(dev->socket_id, FRAME_LEN_TO_MTU(buf_size));
> -    if (!dev->dpdk_mp) {
> -        err = ENOMEM;
> +    err = dpdk_mp_configure(dev);
> +    if (err) {
>          goto unlock;
>      }
>  
> @@ -991,6 +1036,7 @@ netdev_dpdk_get_config(const struct netdev *netdev, 
> struct smap *args)
>      smap_add_format(args, "configured_rx_queues", "%d", netdev->n_rxq);
>      smap_add_format(args, "requested_tx_queues", "%d", dev->requested_n_txq);
>      smap_add_format(args, "configured_tx_queues", "%d", netdev->n_txq);
> +    smap_add_format(args, "mtu", "%d", dev->mtu);
>      ovs_mutex_unlock(&dev->mutex);
>  
>      return 0;
> @@ -1366,6 +1412,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
>      struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
>      unsigned int total_pkts = cnt;
>      unsigned int qos_pkts = cnt;
> +    unsigned int mtu_dropped = 0;
>      int retries = 0;
>  
>      qid = dev->tx_q[qid % netdev->n_txq].map;
> @@ -1387,25 +1434,41 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int 
> qid,
>      do {
>          int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
>          unsigned int tx_pkts;
> +        unsigned int try_tx_pkts = cnt;
>  
> +        for (unsigned int i = 0; i < cnt; i++) {
> +            if (cur_pkts[i]->pkt_len > dev->max_packet_len) {
> +                try_tx_pkts = i;
> +                break;
> +            }
> +        }
> +        if (!try_tx_pkts) {
> +            cur_pkts++;
> +            mtu_dropped++;
> +            cnt--;
> +            continue;
> +        }
>          tx_pkts = rte_vhost_enqueue_burst(netdev_dpdk_get_vid(dev),
> -                                          vhost_qid, cur_pkts, cnt);
> +                                          vhost_qid, cur_pkts, try_tx_pkts);
>          if (OVS_LIKELY(tx_pkts)) {
>              /* Packets have been sent.*/
>              cnt -= tx_pkts;
>              /* Prepare for possible retry.*/
>              cur_pkts = &cur_pkts[tx_pkts];
> +            if (tx_pkts != try_tx_pkts) {
> +                retries++;
> +            }
>          } else {
>              /* No packets sent - do not retry.*/
>              break;
>          }
> -    } while (cnt && (retries++ < VHOST_ENQ_RETRY_NUM));
> +    } while (cnt && (retries <= VHOST_ENQ_RETRY_NUM));
>  
>      rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>  
>      rte_spinlock_lock(&dev->stats_lock);
> -    cnt += qos_pkts;
> -    netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts, cnt);
> +    netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts,
> +                                         cnt + mtu_dropped + qos_pkts);
>      rte_spinlock_unlock(&dev->stats_lock);
>  
>  out:
> @@ -1639,6 +1702,26 @@ netdev_dpdk_get_mtu(const struct netdev *netdev, int 
> *mtup)
>  }
>  
>  static int
> +netdev_dpdk_set_mtu(struct netdev *netdev, int mtu)
> +{
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +
> +    if (MTU_TO_FRAME_LEN(mtu) > NETDEV_DPDK_MAX_PKT_LEN) {

Should we check for 'mtu < ETHER_MIN_MTU' here? I think it'll be nice
to check it here and pass only sane values to device. Otherwise
configuration will fail and device will be deleted.

> +        VLOG_WARN("Unsupported MTU (%d)\n", mtu);

I prefer to mention name of the device in this message.

> +        return EINVAL;
> +    }
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    if (dev->requested_mtu != mtu) {
> +        dev->requested_mtu = mtu;
> +        netdev_request_reconfigure(netdev);
> +    }
> +    ovs_mutex_unlock(&dev->mutex);
> +
> +    return 0;
> +}
> +
> +static int
>  netdev_dpdk_get_carrier(const struct netdev *netdev, bool *carrier);
>  
>  static int
> @@ -2803,7 +2886,8 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>      ovs_mutex_lock(&dev->mutex);
>  
>      if (netdev->n_txq == dev->requested_n_txq
> -        && netdev->n_rxq == dev->requested_n_rxq) {
> +        && netdev->n_rxq == dev->requested_n_rxq
> +        && dev->mtu == dev->requested_mtu) {
>          /* Reconfiguration is unnecessary */
>  
>          goto out;
> @@ -2811,6 +2895,18 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>  
>      rte_eth_dev_stop(dev->port_id);
>  
> +    if (dev->mtu != dev->requested_mtu) {
> +        int prev_mtu = dev->mtu;
> +
> +        dev->mtu = dev->requested_mtu;
> +        err = dpdk_mp_configure(dev);
> +        if (err) {
> +            /* Revert to previous configuration; don't flag this as an error 
> */
> +            dev->mtu = prev_mtu;
> +            err = 0;
> +        }

How about to make this like in 'netdev_dpdk_vhost_user_reconfigure()' :

        if (dpdk_mp_configure(dev)) {
            /* Failed. Revert to previous configuration */
            dev->mtu = prev_mtu;
        }
?

> +    }
> +
>      netdev->n_txq = dev->requested_n_txq;
>      netdev->n_rxq = dev->requested_n_rxq;
>  
> @@ -2818,6 +2914,8 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>      err = dpdk_eth_dev_init(dev);
>      netdev_dpdk_alloc_txq(dev, netdev->n_txq);
>  
> +    netdev_change_seq_changed(netdev);
> +
>  out:
>  
>      ovs_mutex_unlock(&dev->mutex);
> @@ -2830,7 +2928,6 @@ static int
>  netdev_dpdk_vhost_user_reconfigure(struct netdev *netdev)
>  {
>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> -    int err = 0;
>  
>      ovs_mutex_lock(&dpdk_mutex);
>      ovs_mutex_lock(&dev->mutex);
> @@ -2845,14 +2942,19 @@ netdev_dpdk_vhost_user_reconfigure(struct netdev 
> *netdev)
>  
>      netdev_dpdk_remap_txqs(dev);
>  
> -    if (dev->requested_socket_id != dev->socket_id) {
> +    if (dev->requested_socket_id != dev->socket_id
> +        || dev->requested_mtu != dev->mtu) {
> +        int prev_mtu = dev->mtu;
> +
>          dev->socket_id = dev->requested_socket_id;
> -        /* Change mempool to new NUMA Node */
> -        dpdk_mp_put(dev->dpdk_mp);
> -        dev->dpdk_mp = dpdk_mp_get(dev->socket_id, dev->mtu);
> -        if (!dev->dpdk_mp) {
> -            err = ENOMEM;
> +        dev->mtu = dev->requested_mtu;
> +
> +        /* Change mempool to new NUMA Node and to new MTU.
> +         * In the event of error, restore previous MTU. */
> +        if (dpdk_mp_configure(dev)) {

'dev->socket_id' should be reverted too in this case.

Maybe it'll be better to add 2 arguments ('mtu' and 'socket_id') to
function 'dpdk_mp_configure()' and update real 'mtu' and 'socket_id'
only on success? Like this:

        if (!dpdk_mp_configure(dev, dev->requested_mtu,
                                    dev->requested_socket_id)) {
            dev->socket_id = dev->requested_socket_id;
            dev->mtu = dev->requested_mtu;
        }

Another option is to rename 'dpdk_mp_configure' to
'netdev_dpdk_reconfigure_mempool' and use 'requested_{mtu, socket_id}'
inside and update real ones on success. Like this (not tested):

/* Tries to allocate new mempool on requested_socket_id with
 * mbuf size corresponding to requested_mtu.
 * On success new configuration will be applied.
 * On error, device will be left unchanged. */
static int
netdev_dpdk_reconfigure_mempool(struct netdev_dpdk *dev)
    OVS_REQUIRES(dpdk_mutex)
    OVS_REQUIRES(dev->mutex)
{
    uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
    struct dpdk_mp *mp;

    mp = dpdk_mp_get(dev->requested_socket_id, FRAME_LEN_TO_MTU(buf_size));
    if (!mp) {
        VLOG_ERR("Insufficient memory to create memory pool for netdev %s"
                 " on socket %d\n", dev->up.name, dev->requested_socket_id);
        return ENOMEM;
    } else {
        dpdk_mp_put(dev->dpdk_mp);
        dev->dpdk_mp = mp;
        dev->mtu = dev->requested_mtu;
        dev->socket_id = dev->requested_socket_id;
        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
    }

    return 0;
}

netdev_dpdk_vhost_user_reconfigure:

    if (dev->requested_socket_id != dev->socket_id
        || dev->requested_mtu != dev->mtu) {
        netdev_dpdk_reconfigure_mempool(dev);
    }

Same in other places.

I prefer option with 'netdev_dpdk_reconfigure_mempool()'.

What do you think?

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Reply via email to