Hi Daniele. Thanks for posting this. I have almost same patch in my local branch.
I didn't test this with physical DPDK NICs yet, but I have few high level comments: 1. Do you thought about renaming of 'mtu_request' inside netdev-dpdk to 'requested_mtu'? I think, this would be more clear and consistent with other configurable parameters (n_rxq, n_txq, ...). 2. I'd prefer not to fail reconfiguration if there is no enough memory for new mempool. I think, it'll be common situation when we are requesting more memory than we have. Failure leads to destruction of the port and inability to reconnect to vhost-user port after re-creation if vhost is in server mode. We can just keep old mempool and inform user via VLOG_ERR. 3. Minor issues inline. What do you think? Best regards, Ilya Maximets. On 30.07.2016 04:22, Daniele Di Proietto wrote: > From: Mark Kavanagh <mark.b.kavan...@intel.com> > > Add support for Jumbo Frames to DPDK-enabled port types, > using single-segment-mbufs. > > Using this approach, the amount of memory allocated to each mbuf > to store frame data is increased to a value greater than 1518B > (typical Ethernet maximum frame length). The increased space > available in the mbuf means that an entire Jumbo Frame of a specific > size can be carried in a single mbuf, as opposed to partitioning > it across multiple mbuf segments. > > The amount of space allocated to each mbuf to hold frame data is > defined dynamically by the user with ovs-vsctl, via the 'mtu_request' > parameter. > > Signed-off-by: Mark Kavanagh <mark.b.kavan...@intel.com> > [diproiet...@vmware.com rebased] > Signed-off-by: Daniele Di Proietto <diproiet...@vmware.com> > --- > INSTALL.DPDK-ADVANCED.md | 59 +++++++++++++++++- > INSTALL.DPDK.md | 1 - > NEWS | 1 + > lib/netdev-dpdk.c | 151 > +++++++++++++++++++++++++++++++++++++++-------- > 4 files changed, 185 insertions(+), 27 deletions(-) > > diff --git a/INSTALL.DPDK-ADVANCED.md b/INSTALL.DPDK-ADVANCED.md > index 191e69e..5cd64bf 100755 > --- a/INSTALL.DPDK-ADVANCED.md > +++ b/INSTALL.DPDK-ADVANCED.md > @@ -1,5 +1,5 @@ > OVS DPDK ADVANCED INSTALL GUIDE > -================================= > +=============================== > > ## Contents > > @@ -12,7 +12,8 @@ OVS DPDK ADVANCED INSTALL GUIDE > 7. [QOS](#qos) > 8. [Rate Limiting](#rl) > 9. [Flow Control](#fc) > -10. [Vsperf](#vsperf) > +10. [Jumbo Frames](#jumbo) > +11. [Vsperf](#vsperf) > > ## <a name="overview"></a> 1. Overview > > @@ -862,7 +863,59 @@ respective parameter. To disable the flow control at tx > side, > > `ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false` > > -## <a name="vsperf"></a> 10. Vsperf > +## <a name="jumbo"></a> 10. Jumbo Frames > + > +By default, DPDK ports are configured with standard Ethernet MTU (1500B). To > +enable Jumbo Frames support for a DPDK port, change the Interface's > `mtu_request` > +attribute to a sufficiently large value. > + > +e.g. Add a DPDK Phy port with MTU of 9000: > + > +`ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set > Interface dpdk0 mtu_request=9000` > + > +e.g. Change the MTU of an existing port to 6200: > + > +`ovs-vsctl set Interface dpdk0 mtu_request=6200` > + > +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are > +increased, such that a full Jumbo Frame of a specific size may be > accommodated > +within a single mbuf segment. > + > +Jumbo frame support has been validated against 9728B frames (largest frame > size > +supported by Fortville NIC), using the DPDK `i40e` driver, but larger frames > +(particularly in use cases involving East-West traffic only), and other DPDK > NIC > +drivers may be supported. > + > +### 9.1 vHost Ports and Jumbo Frames > + > +Some additional configuration is needed to take advantage of jumbo frames > with > +vhost ports: > + > + 1. `mergeable buffers` must be enabled for vHost ports, as demonstrated > in > + the QEMU command line snippet below: > + > + ``` > + '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \' > + '-device > virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on' > + ``` > + > + 2. Where virtio devices are bound to the Linux kernel driver in a guest > + environment (i.e. interfaces are not bound to an in-guest DPDK > driver), > + the MTU of those logical network interfaces must also be increased to > a > + sufficiently large value. This avoids segmentation of Jumbo Frames > + received in the guest. Note that 'MTU' refers to the length of the IP > + packet only, and not that of the entire frame. > + > + To calculate the exact MTU of a standard IPv4 frame, subtract the L2 > + header and CRC lengths (i.e. 18B) from the max supported frame size. > + So, to set the MTU for a 9018B Jumbo Frame: > + > + ``` > + ifconfig eth1 mtu 9000 > + ``` > +>>>>>>> 5ec921d... netdev-dpdk: add support for Jumbo Frames Looks like rebasing artefact. > + > +## <a name="vsperf"></a> 11. Vsperf > > Vsperf project goal is to develop vSwitch test framework that can be used to > validate the suitability of different vSwitch implementations in a Telco > deployment > diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md > index 7609aa7..25c79de 100644 > --- a/INSTALL.DPDK.md > +++ b/INSTALL.DPDK.md > @@ -590,7 +590,6 @@ can be found in [Vhost Walkthrough]. > > ## <a name="ovslimits"></a> 6. Limitations > > - - Supports MTU size 1500, MTU setting for DPDK netdevs will be in future > OVS release. > - Currently DPDK ports does not use HW offload functionality. > - Network Interface Firmware requirements: > Each release of DPDK is validated against a specific firmware version for > diff --git a/NEWS b/NEWS > index 0ff5616..c004e5f 100644 > --- a/NEWS > +++ b/NEWS > @@ -68,6 +68,7 @@ Post-v2.5.0 > is enabled in DPDK. > * Basic connection tracking for the userspace datapath (no ALG, > fragmentation or NAT support yet) > + * Jumbo frame support > - Increase number of registers to 16. > - ovs-benchmark: This utility has been removed due to lack of use and > bitrot. > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c > index 0b6e410..68639ae 100644 > --- a/lib/netdev-dpdk.c > +++ b/lib/netdev-dpdk.c > @@ -82,6 +82,7 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, > 20); > + sizeof(struct dp_packet) \ > + RTE_PKTMBUF_HEADROOM) > #define NETDEV_DPDK_MBUF_ALIGN 1024 > +#define NETDEV_DPDK_MAX_PKT_LEN 9728 > > /* Max and min number of packets in the mempool. OVS tries to allocate a > * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have > @@ -336,6 +337,7 @@ struct netdev_dpdk { > struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex); > > struct dpdk_mp *dpdk_mp; > + int mtu_request; > int mtu; > int socket_id; > int buf_size; > @@ -474,10 +476,19 @@ dpdk_mp_get(int socket_id, int mtu) > OVS_REQUIRES(dpdk_mutex) > dmp->mtu = mtu; > dmp->refcount = 1; > mbp_priv.mbuf_data_room_size = MBUF_SIZE(mtu) - sizeof(struct dp_packet); > - mbp_priv.mbuf_priv_size = sizeof (struct dp_packet) - > - sizeof (struct rte_mbuf); > + mbp_priv.mbuf_priv_size = sizeof (struct dp_packet) > + - sizeof (struct rte_mbuf); > + /* XXX: this is a really rough method of provisioning memory. > + * It's impossible to determine what the exact memory requirements are > when > + * the number of ports and rxqs that utilize a particular mempool can > change > + * dynamically at runtime. For the moment, use this rough heurisitic. > + */ > + if (mtu >= ETHER_MTU) { > + mp_size = MAX_NB_MBUF; > + } else { > + mp_size = MIN_NB_MBUF; > + } > > - mp_size = MAX_NB_MBUF; > do { > if (snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d_%d_%u", > dmp->mtu, dmp->socket_id, mp_size) < 0) { > @@ -522,6 +533,32 @@ dpdk_mp_put(struct dpdk_mp *dmp) > #endif > } > > +static int > +dpdk_mp_configure(struct netdev_dpdk *dev) > + OVS_REQUIRES(dpdk_mutex) > + OVS_REQUIRES(dev->mutex) > +{ > + uint32_t buf_size = dpdk_buf_size(dev->mtu); > + struct dpdk_mp *mp_old, *mp; > + > + mp_old = dev->dpdk_mp; Do we need this variable? We can just put old mempool before assigning the new one. > + > + mp = dpdk_mp_get(dev->socket_id, FRAME_LEN_TO_MTU(buf_size)); > + if (!mp) { > + VLOG_ERR("Insufficient memory to create memory pool for netdev %s\n", > + dev->up.name); +1 space character. > + return ENOMEM; > + } > + > + dev->dpdk_mp = mp; > + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); > + > + dpdk_mp_put(mp_old); > + > + return 0; > +} > + > + > static void > check_link_status(struct netdev_dpdk *dev) > { > @@ -573,7 +610,15 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int > n_rxq, int n_txq) > { > int diag = 0; > int i; > + struct rte_eth_conf conf = port_conf; > > + if (dev->mtu > ETHER_MTU) { > + conf.rxmode.jumbo_frame = 1; > + conf.rxmode.max_rx_pkt_len = dev->max_packet_len; > + } else { > + conf.rxmode.jumbo_frame = 0; > + conf.rxmode.max_rx_pkt_len = 0; I know, it was implemented this way in the original patch, but I'm not sure that all DPDK drivers will handle zero value of 'max_rx_pkt_len' in a right way. > + } > /* A device may report more queues than it makes available (this has > * been observed for Intel xl710, which reserves some of them for > * SRIOV): rte_eth_*_queue_setup will fail if a queue is not > @@ -584,8 +629,10 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int > n_rxq, int n_txq) > VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq); > } > > - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf); > + diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf); > if (diag) { > + VLOG_WARN("Interface %s eth_dev setup error %s\n", > + dev->up.name, rte_strerror(-diag)); > break; > } > > @@ -738,7 +785,6 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int > port_no, > struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > int sid; > int err = 0; > - uint32_t buf_size; > > ovs_mutex_init(&dev->mutex); > ovs_mutex_lock(&dev->mutex); > @@ -759,13 +805,11 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int > port_no, > dev->port_id = port_no; > dev->type = type; > dev->flags = 0; > - dev->mtu = ETHER_MTU; > + dev->mtu_request = dev->mtu = ETHER_MTU; > dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); > > - buf_size = dpdk_buf_size(dev->mtu); > - dev->dpdk_mp = dpdk_mp_get(dev->socket_id, FRAME_LEN_TO_MTU(buf_size)); > - if (!dev->dpdk_mp) { > - err = ENOMEM; > + err = dpdk_mp_configure(dev); > + if (err) { > goto unlock; > } > > @@ -986,6 +1030,7 @@ netdev_dpdk_get_config(const struct netdev *netdev, > struct smap *args) > smap_add_format(args, "configured_rx_queues", "%d", netdev->n_rxq); > smap_add_format(args, "requested_tx_queues", "%d", dev->requested_n_txq); > smap_add_format(args, "configured_tx_queues", "%d", netdev->n_txq); > + smap_add_format(args, "mtu", "%d", dev->mtu); > ovs_mutex_unlock(&dev->mutex); > > return 0; > @@ -1362,6 +1407,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid, > struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts; > unsigned int total_pkts = cnt; > unsigned int qos_pkts = cnt; > + unsigned int mtu_dropped = 0; > int retries = 0; > > qid = dev->tx_q[qid % netdev->n_txq].map; > @@ -1383,25 +1429,41 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int > qid, > do { > int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ; > unsigned int tx_pkts; > + unsigned int try_tx_pkts = cnt; > > + for (unsigned int i = 0; i < cnt; i++) { > + if (cur_pkts[i]->pkt_len > dev->max_packet_len) { > + try_tx_pkts = i; > + break; > + } > + } > + if (!try_tx_pkts) { > + cur_pkts++; > + mtu_dropped++; > + cnt--; > + continue; > + } > tx_pkts = rte_vhost_enqueue_burst(virtio_dev, vhost_qid, > - cur_pkts, cnt); > + cur_pkts, try_tx_pkts); > if (OVS_LIKELY(tx_pkts)) { > /* Packets have been sent.*/ > cnt -= tx_pkts; > /* Prepare for possible retry.*/ > cur_pkts = &cur_pkts[tx_pkts]; > + if (tx_pkts != try_tx_pkts) { > + retries++; > + } > } else { > /* No packets sent - do not retry.*/ > break; > } > - } while (cnt && (retries++ < VHOST_ENQ_RETRY_NUM)); > + } while (cnt && (retries <= VHOST_ENQ_RETRY_NUM)); > > rte_spinlock_unlock(&dev->tx_q[qid].tx_lock); > > rte_spinlock_lock(&dev->stats_lock); > - cnt += qos_pkts; > - netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts, cnt); > + netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts, > + cnt + mtu_dropped + qos_pkts); > rte_spinlock_unlock(&dev->stats_lock); > > out: > @@ -1635,6 +1697,26 @@ netdev_dpdk_get_mtu(const struct netdev *netdev, int > *mtup) > } > > static int > +netdev_dpdk_set_mtu(struct netdev *netdev, int mtu) > +{ > + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > + > + if (MTU_TO_FRAME_LEN(mtu) > NETDEV_DPDK_MAX_PKT_LEN) { > + VLOG_WARN("Unsupported MTU (%d)\n", mtu); > + return EINVAL; > + } > + > + ovs_mutex_lock(&dev->mutex); > + if (dev->mtu_request != mtu) { > + dev->mtu_request = mtu; > + netdev_request_reconfigure(netdev); > + } > + ovs_mutex_unlock(&dev->mutex); > + > + return 0; > +} > + > +static int > netdev_dpdk_get_carrier(const struct netdev *netdev, bool *carrier); > > static int > @@ -2787,7 +2869,8 @@ netdev_dpdk_reconfigure(struct netdev *netdev) > ovs_mutex_lock(&dev->mutex); > > if (netdev->n_txq == dev->requested_n_txq > - && netdev->n_rxq == dev->requested_n_rxq) { > + && netdev->n_rxq == dev->requested_n_rxq > + && dev->mtu == dev->mtu_request) { > /* Reconfiguration is unnecessary */ > > goto out; > @@ -2795,6 +2878,14 @@ netdev_dpdk_reconfigure(struct netdev *netdev) > > rte_eth_dev_stop(dev->port_id); > > + if (dev->mtu != dev->mtu_request) { > + dev->mtu = dev->mtu_request; > + err = dpdk_mp_configure(dev); > + if (err) { > + goto out; > + } > + } > + > netdev->n_txq = dev->requested_n_txq; > netdev->n_rxq = dev->requested_n_rxq; > > @@ -2802,6 +2893,8 @@ netdev_dpdk_reconfigure(struct netdev *netdev) > err = dpdk_eth_dev_init(dev); > netdev_dpdk_alloc_txq(dev, netdev->n_txq); > > + netdev_change_seq_changed(netdev); > + > out: > > ovs_mutex_unlock(&dev->mutex); > @@ -2830,20 +2923,23 @@ netdev_dpdk_vhost_user_reconfigure(struct netdev > *netdev) > > netdev_dpdk_remap_txqs(dev); > > - if (dev->requested_socket_id != dev->socket_id) { > + if (dev->requested_socket_id != dev->socket_id > + || dev->mtu_request != dev->mtu) { > dev->socket_id = dev->requested_socket_id; > - /* Change mempool to new NUMA Node */ > - dpdk_mp_put(dev->dpdk_mp); > - dev->dpdk_mp = dpdk_mp_get(dev->socket_id, dev->mtu); > - if (!dev->dpdk_mp) { > - err = ENOMEM; > + dev->mtu = dev->mtu_request; > + /* Change mempool to new NUMA Node and to new MTU. */ > + err = dpdk_mp_configure(dev); > + if (err) { > + goto out; > } > + netdev_change_seq_changed(netdev); > } > > if (virtio_dev) { > virtio_dev->flags |= VIRTIO_DEV_RUNNING; > } > > +out: > ovs_mutex_unlock(&dev->mutex); > ovs_mutex_unlock(&dpdk_mutex); > > @@ -2854,6 +2950,7 @@ static int > netdev_dpdk_vhost_cuse_reconfigure(struct netdev *netdev) > { > struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); > + int err = 0; > > ovs_mutex_lock(&dpdk_mutex); > ovs_mutex_lock(&dev->mutex); > @@ -2861,10 +2958,18 @@ netdev_dpdk_vhost_cuse_reconfigure(struct netdev > *netdev) > netdev->n_txq = dev->requested_n_txq; > netdev->n_rxq = 1; > > + if (dev->mtu_request != dev->mtu) { > + /* Change mempool to new MTU. */ > + err = dpdk_mp_configure(dev); > + if (!err) { > + netdev_change_seq_changed(netdev); > + } > + } > + > ovs_mutex_unlock(&dev->mutex); > ovs_mutex_unlock(&dpdk_mutex); > > - return 0; > + return err; > } > > #define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, \ > @@ -2898,7 +3003,7 @@ netdev_dpdk_vhost_cuse_reconfigure(struct netdev > *netdev) > netdev_dpdk_set_etheraddr, \ > netdev_dpdk_get_etheraddr, \ > netdev_dpdk_get_mtu, \ > - NULL, /* set_mtu */ \ > + netdev_dpdk_set_mtu, \ > netdev_dpdk_get_ifindex, \ > GET_CARRIER, \ > netdev_dpdk_get_carrier_resets, \ > _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev