On 14.12.2017 19:26, Kevin Traynor wrote:
> On 12/14/2017 03:30 PM, Ilya Maximets wrote:
>> Hello Ciara,
>> Thanks for the patches.
>>
>> I did not review the code, but I have a few general concerns about the
>> number of tx descriptors in HW NICs inline.
>>
>> Best regards, Ilya Maximets.
>>
>> On 08.12.2017 18:26, Ciara Loftus wrote:
>>> Enabled per port like so:
>>> ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
>>>
>>> The feature is disabled by default and can only be enabled/disabled when
>>> a vHost port is down.
>>>
>>> When packets from a vHost device with zero copy enabled are destined for
>>> a 'dpdk' port, the number of tx descriptors on that 'dpdk' port must be
>>> set to a smaller value. 128 is recommended. This can be achieved like
>>> so:
>>>
>>> ovs-vsctl set Interface dpdkport options:n_txq_desc=128
>>>
>>> Signed-off-by: Ciara Loftus <ciara.lof...@intel.com>
>>> ---
>>> v5:
>>> * Rebase
>>> * Update docs with warning of potential packet loss
>>> * Update docs with info on NIC & Virtio descriptor values and working
>>>   combinations
>>>
>>>  Documentation/howto/dpdk.rst             | 33 ++++++++++++
>>>  Documentation/topics/dpdk/vhost-user.rst | 61 ++++++++++++++++++++++
>>>  NEWS                                     |  2 +
>>>  lib/netdev-dpdk.c                        | 89 +++++++++++++++++++++++++++++++-
>>>  vswitchd/vswitch.xml                     | 11 ++++
>>>  5 files changed, 195 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
>>> index d123819..3e1b8f8 100644
>>> --- a/Documentation/howto/dpdk.rst
>>> +++ b/Documentation/howto/dpdk.rst
>>> @@ -709,3 +709,36 @@ devices to bridge ``br0``. Once complete, follow the below steps:
>>>  Check traffic on multiple queues::
>>>
>>>      $ cat /proc/interrupts | grep virtio
>>> +
>>> +PHY-VM-PHY (vHost Dequeue Zero Copy)
>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> +
>>> +vHost dequeue zero copy functionality can be validated using the
>>> +PHY-VM-PHY configuration. To begin, follow the steps described in
>>> +:ref:`dpdk-phy-phy` to create and initialize the database, start
>>> +ovs-vswitchd and add ``dpdk``-type and ``dpdkvhostuser``-type devices
>>> +and flows to bridge ``br0``. Once complete, follow the below steps:
>>> +
>>> +1. Enable dequeue zero copy on the vHost devices.
>>> +
>>> +   $ ovs-vsctl set Interface dpdkvhostuser0 options:dq-zero-copy=true
>>> +   $ ovs-vsctl set Interface dpdkvhostuser1 options:dq-zero-copy=true
>>> +
>>> +The following log should be observed for each device:
>>> +
>>> +   netdev_dpdk|INFO|Zero copy enabled for vHost socket <name>
>>> +
>>> +2. Reduce the number of txq descriptors on the phy ports.
>>> +
>>> +   $ ovs-vsctl set Interface phy0 options:n_txq_desc=128
>>> +   $ ovs-vsctl set Interface phy1 options:n_txq_desc=128
>>> +
>>> +3. Proceed with the test by launching the VM and configuring guest
>>> +forwarding, be it via the vHost loopback method or kernel forwarding
>>> +method, and sending traffic. The following log should be observed for
>>> +each device as it becomes active during VM boot:
>>> +
>>> +   VHOST_CONFIG: dequeue zero copy is enabled
>>> +
>>> +It is essential that step 1 is performed before booting the VM, otherwise
>>> +the feature will not be enabled.
>>> diff --git a/Documentation/topics/dpdk/vhost-user.rst b/Documentation/topics/dpdk/vhost-user.rst
>>> index 8b1c671..8587370 100644
>>> --- a/Documentation/topics/dpdk/vhost-user.rst
>>> +++ b/Documentation/topics/dpdk/vhost-user.rst
>>> @@ -441,3 +441,64 @@ Sample XML
>>>  </domain>
>>>
>>>  .. _QEMU documentation: http://git.qemu-project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-user.txt;h=7890d7169;hb=HEAD
>>> +
>>> +vhost-user Dequeue Zero Copy
>>> +-------------------------------------
>>> +
>>> +Normally when dequeuing a packet from a vHost User device, a memcpy operation
>>> +must be used to copy that packet from guest address space to host address
>>> +space. This memcpy can be removed by enabling dequeue zero-copy like so:
>>> +
>>> +   $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
>>> +
>>> +With this feature enabled, a reference (pointer) to the packet is passed to
>>> +the host, instead of a copy of the packet. Removing this memcpy can give a
>>> +performance improvement for some use cases, for example switching large packets
>>> +between different VMs. However, additional packet loss may be observed.
>>> +
>>> +Note that the feature is disabled by default and must be explicitly enabled
>>> +by using the command above.
>>> +
>>> +The feature cannot be enabled when the device is active (i.e. VM booted). If
>>> +you wish to enable the feature after the VM has booted, you must shut down
>>> +the VM and bring it back up.
>>> +
>>> +The same logic applies for disabling the feature - it must be disabled when
>>> +the device is inactive, for example before VM boot. To disable the feature:
>>> +
>>> +   $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=false
>>> +
>>> +The feature is available to both dpdkvhostuser and dpdkvhostuserclient port
>>> +types.
>>> +
>>> +A limitation exists whereby if packets from a vHost port with dq-zero-copy=true
>>> +are destined for a 'dpdk' type port, the number of tx descriptors (n_txq_desc)
>>> +for that port must be reduced to a smaller number, 128 being the recommended
>>> +value. This can be achieved by issuing the following command:
>>> +
>>> +   $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
>>> +
>>> +More information on the n_txq_desc option can be found in the "DPDK Physical
>>> +Port Queue Sizes" section of the `intro/install/dpdk.rst` guide.
>>> +
>>> +This limitation is due to how the zero copy functionality is
>>> +implemented. The vHost device's 'tx used vring', a virtio structure used for
>>> +tracking used, i.e. sent, descriptors, will only be updated when the NIC frees
>>> +the corresponding mbuf. If we don't free the mbufs frequently enough, that
>>> +vring will be starved and packets will no longer be processed. One way to
>>> +ensure we don't encounter this scenario is to configure n_txq_desc to a small
>>> +enough number such that the 'mbuf free threshold' for the NIC will be hit more
>>> +often and thus free mbufs more frequently. The value of 128 is suggested, but
>>> +values of 64 and 256 have been tested and verified to work too, with differing
>>> +performance characteristics. A value of 512 can be used too, if the virtio
>>> +queue size in the guest is increased to 1024 (available to configure in QEMU
>>> +versions v2.10 and greater). This value can be set like so:
>>> +
>>> +   $ qemu-system-x86_64 ... -chardev socket,id=char1,path=<sockpath>,server
>>> +     -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
>>> +     -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,
>>> +     tx_queue_size=1024
>>
>> What if we have 2, 4, 10 HW NICs and a few different destinations for
>> traffic from VM?
>> What if we have 2, 4, 10 HW NICs added to balanced OVS bonding and just a
>> few different packet flows?
>> What if we have HW NICs with a small number of TX queues (like VFs) and XPS
>> is working?
>> We may actually end up stuck with a non-working virtual interface in all of
>> the above cases.
>>
>> IMHO, this feature should be marked as experimental until we have a proper
>> solution for the stuck mbufs in HW rings.
>>
>> Thoughts?
>>
>
> How about using rte_eth_tx_done_cleanup()?
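
For context, rte_eth_tx_done_cleanup() is the generic DPDK ethdev call that asks
a PMD to free mbufs that have already been transmitted from a TX queue; with
dequeue zero copy that would hand the corresponding guest buffers back to the
'tx used vring'. A minimal sketch of how such a call is used follows - the
helper name and its placement are illustrative only, not part of this patch or
of OVS, and the call only works where the PMD supports it:

    #include <errno.h>
    #include <rte_ethdev.h>

    /* Illustrative helper: ask one TX queue of a physical port to release
     * mbufs that have already been transmitted. */
    static void
    tx_ring_try_drain(uint16_t port_id, uint16_t queue_id)
    {
        /* free_cnt == 0 asks the driver to free as many done mbufs as it can. */
        int ret = rte_eth_tx_done_cleanup(port_id, queue_id, 0);

        if (ret == -ENOTSUP) {
            /* This PMD does not implement the cleanup callback. */
            return;
        }
        /* A non-negative return is success (the number of mbufs freed);
         * other negative values indicate a bad port or queue id. */
    }
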
This API is currently implemented only by the igb driver, so it will not be
very useful.

> I guess a user would not
> want to combine this feature with sw tx batching though.
>
> On a related note, I would like to ask Ciara/Jan the use case when this
> feature would likely be enabled. The RFC2544 test is very opaque but it
> seems there is at least some increased packet loss, so would it be just
> when the user knows it is vm-vm and large packets? (In which case you
> could argue, packets would not get stuck in HW NIC anyway)
>
> Kevin.
>
>>> +
>>> +Further information can be found in the
>>> +`DPDK documentation
>>> +<http://dpdk.readthedocs.io/en/v17.05/prog_guide/vhost_lib.html>`__
>>> diff --git a/NEWS b/NEWS
>>> index d45904e..50630f8 100644
>>> --- a/NEWS
>>> +++ b/NEWS
>>> @@ -18,6 +18,8 @@ Post-v2.8.0
>>>     - DPDK:
>>>       * Add support for DPDK v17.11
>>>       * Add support for vHost IOMMU
>>> +     * Optional dequeue zero copy feature for vHost ports can be enabled per
>>> +       port via the boolean 'dq-zero-copy' option.
>>>
>>>  v2.8.0 - 31 Aug 2017
>>>  --------------------
>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>> index 879019e..03cc6cb 100644
>>> --- a/lib/netdev-dpdk.c
>>> +++ b/lib/netdev-dpdk.c
>>> @@ -361,6 +361,9 @@ struct netdev_dpdk {
>>>      /* True if vHost device is 'up' and has been reconfigured at least once */
>>>      bool vhost_reconfigured;
>>>      /* 3 pad bytes here. */
>>> +
>>> +    /* True if dq-zero-copy feature has successfully been enabled */
>>> +    bool dq_zc_enabled;
>>>  );
>>>
>>>  PADDED_MEMBERS(CACHE_LINE_SIZE,
>>> @@ -871,6 +874,7 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
>>>      dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>      ovsrcu_index_init(&dev->vid, -1);
>>>      dev->vhost_reconfigured = false;
>>> +    dev->dq_zc_enabled = false;
>>>      dev->attached = false;
>>>
>>>      ovsrcu_init(&dev->qos_conf, NULL);
>>> @@ -1383,6 +1387,29 @@ netdev_dpdk_ring_set_config(struct netdev *netdev, const struct smap *args,
>>>      return 0;
>>>  }
>>>
>>> +static void
>>> +dpdk_vhost_set_config_helper(struct netdev_dpdk *dev,
>>> +                             const struct smap *args)
>>> +{
>>> +    bool needs_reconfigure = false;
>>> +    bool zc_requested = smap_get_bool(args, "dq-zero-copy", false);
>>> +
>>> +    if (zc_requested &&
>>> +        !(dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
>>> +        dev->vhost_driver_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
>>> +        needs_reconfigure = true;
>>> +    } else if (!zc_requested &&
>>> +               (dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
>>> +        dev->vhost_driver_flags &= ~RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
>>> +        needs_reconfigure = true;
>>> +    }
>>> +
>>> +    /* Only try to change ZC mode when device is down */
>>> +    if (needs_reconfigure && (netdev_dpdk_get_vid(dev) == -1)) {
>>> +        netdev_request_reconfigure(&dev->up);
>>> +    }
>>> +}
>>> +
>>>  static int
>>>  netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
>>>                                      const struct smap *args,
>>> @@ -1399,6 +1426,23 @@ netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
>>>              netdev_request_reconfigure(netdev);
>>>          }
>>>      }
>>> +
>>> +    dpdk_vhost_set_config_helper(dev, args);
>>> +
>>> +    ovs_mutex_unlock(&dev->mutex);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static int
>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev,
>>> +                             const struct smap *args,
>>> +                             char **errp OVS_UNUSED)
>>> +{
>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>> +
>>> +    ovs_mutex_lock(&dev->mutex);
>>> +    dpdk_vhost_set_config_helper(dev, args);
>>>      ovs_mutex_unlock(&dev->mutex);
>>>
>>>      return 0;
>>> @@ -2723,6 +2767,46 @@ netdev_dpdk_txq_map_clear(struct netdev_dpdk *dev)
>>>      }
>>>  }
>>>
>>> +static void
>>> +vhost_change_zero_copy_mode(struct netdev_dpdk *dev, bool client_mode,
>>> +                            bool enable)
>>> +{
>>> +    int err = rte_vhost_driver_unregister(dev->vhost_id);
>>> +
>>> +    if (err) {
>>> +        VLOG_ERR("Error unregistering vHost socket %s; can't change zero copy "
>>> +                 "mode", dev->vhost_id);
>>> +    } else {
>>> +        err = dpdk_setup_vhost_device(dev, client_mode);
>>> +        if (err) {
>>> +            VLOG_ERR("Error changing zero copy mode for vHost socket %s",
>>> +                     dev->vhost_id);
>>> +        } else if (enable) {
>>> +            dev->dq_zc_enabled = true;
>>> +            VLOG_INFO("Zero copy enabled for vHost socket %s", dev->vhost_id);
>>> +        } else {
>>> +            dev->dq_zc_enabled = false;
>>> +            VLOG_INFO("Zero copy disabled for vHost socket %s", dev->vhost_id);
>>> +        }
>>> +    }
>>> +}
>>> +
>>> +static void
>>> +vhost_check_zero_copy_status(struct netdev_dpdk *dev)
>>> +{
>>> +    bool mode = dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT;
>>> +
>>> +    if ((dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)
>>> +        && !dev->dq_zc_enabled) {
>>> +        /* ZC disabled but requested to be enabled, enable it. */
>>> +        vhost_change_zero_copy_mode(dev, mode, true);
>>> +    } else if (!(dev->vhost_driver_flags &
>>> +               RTE_VHOST_USER_DEQUEUE_ZERO_COPY) && dev->dq_zc_enabled) {
>>> +        /* ZC enabled but requested to be disabled, disable it. */
>>> +        vhost_change_zero_copy_mode(dev, mode, false);
>>> +    }
>>> +}
>>> +
>>>  /*
>>>   * Remove a virtio-net device from the specific vhost port. Use dev->remove
>>>   * flag to stop any more packets from being sent or received to/from a VM and
>>> @@ -2768,6 +2852,7 @@ destroy_device(int vid)
>>>           */
>>>          ovsrcu_quiesce_start();
>>>          VLOG_INFO("vHost Device '%s' has been removed", ifname);
>>> +        netdev_request_reconfigure(&dev->up);
>>>      } else {
>>>          VLOG_INFO("vHost Device '%s' not found", ifname);
>>>      }
>>> @@ -3259,6 +3344,8 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>>>              /* Carrier status may need updating. */
>>>              netdev_change_seq_changed(&dev->up);
>>>          }
>>> +    } else {
>>> +        vhost_check_zero_copy_status(dev);
>>>      }
>>>
>>>      return 0;
>>> @@ -3421,7 +3508,7 @@ static const struct netdev_class dpdk_vhost_class =
>>>      NULL,
>>>      netdev_dpdk_vhost_construct,
>>>      netdev_dpdk_vhost_destruct,
>>> -    NULL,
>>> +    netdev_dpdk_vhost_set_config,
>>>      NULL,
>>>      netdev_dpdk_vhost_send,
>>>      netdev_dpdk_vhost_get_carrier,
>>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>>> index 4c317d0..e8409b4 100644
>>> --- a/vswitchd/vswitch.xml
>>> +++ b/vswitchd/vswitch.xml
>>> @@ -2669,6 +2669,17 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>>>        </p>
>>>      </column>
>>>
>>> +    <column name="options" key="dq-zero-copy"
>>> +            type='{"type": "boolean"}'>
>>> +      <p>
>>> +        The value specifies whether or not to enable dequeue zero copy on
>>> +        the given interface.
>>> +        The port must be in an inactive state in order to enable or disable
>>> +        this feature.
>>> +        Only supported by dpdkvhostuserclient and dpdkvhostuser interfaces.
>>> +      </p>
>>> +    </column>
>>> +
>>>      <column name="options" key="n_rxq_desc"
>>>              type='{"type": "integer", "minInteger": 1, "maxInteger": 4096}'>
>>>        <p>
>>>
>
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
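
One way to read the descriptor values quoted in the patch documentation above -
this pairing rule is an inference from those numbers, not something the authors
state explicitly - is that the physical port's TX ring should be no larger than
the guest's virtio TX ring, so the NIC driver's mbuf free threshold is crossed
and zero-copy buffers are returned to the 'tx used vring' before the guest runs
out of descriptors:

    guest virtio TX ring  256 (default)                           -> n_txq_desc 64, 128 or 256 verified
    guest virtio TX ring 1024 (QEMU >= 2.10, tx_queue_size=1024)  -> n_txq_desc 512 usable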