[dpdk-dev] [RFC] Accelerating Data Movement for DPDK vHost with DMA Engines

Fu, Patrick Fri, 17 Apr 2020 00:27:13 -0700

Background
====================================
The DPDK vhost library implements a user-space VirtIO net backend that allows
host applications to communicate directly with VirtIO front-ends in VMs and
containers. However, every vhost enqueue/dequeue operation requires copying
packet buffers between guest and host memory. The overhead of copying large
amounts of data makes the vhost backend the I/O bottleneck. DMA engines,
including un-core DMA accelerators such as Crystal Beach DMA (CBDMA) and Data
Streaming Accelerator (DSA), as well as general-purpose DMA on discrete cards,
are extremely efficient at moving data within system memory. We therefore
propose a set of asynchronous DMA data movement APIs in the vhost library for
DMA acceleration. Offloading packet copies in the vhost data-path from the CPU
to the DMA engine not only accelerates data transfers, but also saves precious
CPU core resources.


New API Overview
====================================
The proposed APIs in the vhost library support various DMA engines to
accelerate data transfers in the data-path. For higher performance, DMA
engines work asynchronously: DMA data transfers and CPU computations execute
in parallel. The proposed API consists of a control path API and a data path
API. The control path API includes the registration API and the DMA operation
callbacks; the data path API includes the asynchronous API. To remove the
dependency on vendor-specific DMA engines, the DMA operation callbacks provide
a generic abstraction of DMA data transfers. To support asynchronous DMA data
movement, the new async API provides asynchronous ring operation semantics in
the data-path. To enable or disable DMA acceleration for a virtqueue, users
call the registration API to register or unregister DMA callback
implementations with the vhost library and to bind DMA channels to
virtqueues. The DMA channels used by virtqueues are provided by DPDK
applications and are backed by virtual or physical DMA devices.
The proposed APIs consist of 3 subsets:
1. DMA Registration APIs
2. DMA Operation Callbacks
3. Async Data APIs

DMA Registration APIs
==================================== 
DMA acceleration is enabled on a per-queue basis. DPDK applications need to
explicitly decide whether a virtqueue needs DMA acceleration and which DMA
channel to use. In addition, a DMA channel is dedicated to a single virtqueue;
it cannot be bound to multiple virtqueues at the same time. To enable DMA
acceleration for a virtqueue, a DPDK application first implements the DMA
operation callbacks for a specific DMA type (e.g. CBDMA), then registers the
callbacks with the vhost library and binds a DMA channel to the virtqueue, and
finally uses the new async API to perform data-path operations on the
virtqueue.
The definitions of the registration API are shown below:
int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
                                struct rte_vhost_async_channel_ops *ops);

int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);

"rte_vhost_async_channel_register" registers the implemented DMA operation
callbacks with the vhost library and binds a DMA channel to a virtqueue. DPDK
applications must implement the corresponding DMA operation callbacks for each
DMA engine they use. To enable DMA acceleration for a virtqueue, DPDK
applications need to explicitly call "rte_vhost_async_channel_register" for
that virtqueue. The "ops" parameter points to the callback implementations.
"rte_vhost_async_channel_unregister" unregisters the DMA operation callbacks
and unbinds the DMA channel from the virtqueue. If a virtqueue is not bound to
a DMA channel, it uses the SW data-path without DMA acceleration.
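
The binding rules above can be sketched with a toy model. This is not the
vhost library implementation; it only illustrates one callback slot per
virtqueue and the fallback to the SW data-path when nothing is bound. All
"toy_" names, the fixed queue count, the omission of "vid" (a single device is
assumed), and the choice to reject a second registration on an already bound
virtqueue are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for this sketch only; not the vhost library. */
#define TOY_MAX_QUEUES 8

struct toy_async_ops {
        int (*transfer_data)(void *descs, uint16_t count);
        int (*check_completed_copies)(void *status, uint16_t max_packets);
};

/* One slot per virtqueue; NULL means "no DMA channel bound". */
static struct toy_async_ops *toy_bound_ops[TOY_MAX_QUEUES];

/* Bind callbacks (and thus a DMA channel) to a virtqueue. */
static int toy_channel_register(uint16_t queue_id, struct toy_async_ops *ops)
{
        if (queue_id >= TOY_MAX_QUEUES || ops == NULL)
                return -1;
        if (toy_bound_ops[queue_id] != NULL)
                return -1;      /* assumed policy: unregister first */
        toy_bound_ops[queue_id] = ops;
        return 0;
}

/* Unbind the callbacks; the virtqueue falls back to the SW data-path. */
static int toy_channel_unregister(uint16_t queue_id)
{
        if (queue_id >= TOY_MAX_QUEUES || toy_bound_ops[queue_id] == NULL)
                return -1;
        toy_bound_ops[queue_id] = NULL;
        return 0;
}

/* Data-path decision: 1 = DMA path, 0 = SW (CPU copy) path. */
static int toy_uses_dma(uint16_t queue_id)
{
        return queue_id < TOY_MAX_QUEUES && toy_bound_ops[queue_id] != NULL;
}
```

In this model a virtqueue reports the SW path until a successful
"toy_channel_register" call, and again after "toy_channel_unregister".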

DMA Operation Callbacks
==================================== 
The definitions of the DMA operation callbacks are shown below:
struct iovec {  /** this is kernel uapi structure */
        void *iov_base; /** buffer address */
        size_t iov_len; /** buffer length */
};

struct iov_iter {       
        size_t iov_offset;
        size_t count;           /** total bytes of a packet */
        struct iovec *iov;      /** array of data buffers */
        unsigned long nr_segs;  /** number of iovec structures */
        uintptr_t usr_data;     /** app specific memory handler*/
};

struct dma_trans_desc {
        struct iov_iter *src; /** source memory iov_iter*/
        struct iov_iter *dst; /** destination memory iov_iter*/
};

struct dma_trans_status {
        uintptr_t src_usr_data; /** trans completed memory handler*/
        uintptr_t dst_usr_data; /** trans completed memory handler*/
};

struct rte_vhost_async_channel_ops {
        /** Instruct a DMA channel to perform copies for a batch of packets */
        int (*transfer_data)(struct dma_trans_desc *descs,
                             uint16_t count);

        /** Check copy-completed packets from a DMA channel */
        int (*check_completed_copies)(struct dma_trans_status *usr_data,
                                      uint16_t max_packets);
};

The first callback, "transfer_data", submits a batch of packet copies to a DMA
channel. As a packet's source or destination buffer can be a vector of buffers
or a single data stream, we use "struct dma_trans_desc" to describe the source
and destination buffers of a packet. Copying a packet means moving data from
the source iov_iter structure to the destination iov_iter structure. "count"
is the number of packets to copy.
The second callback, "check_completed_copies", queries the completion status
of the DMA channel. A "usr_data" member is embedded in the "iov_iter"
structure and serves as a unique identifier of the memory region described by
that "iov_iter". As the source/destination buffer can be scatter-gather, the
DMA channel may perform its copies out of order. When the DMA channel has
completed all copies of an iov_iter, "check_completed_copies" should return
the associated "usr_data" through the "dma_trans_status" structure.
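
To make the callback contract concrete, here is a minimal sketch that backs
both callbacks with CPU memcpy instead of a real DMA channel. It reuses the
structures defined above and takes "struct iovec" from <sys/uio.h>. Several
points are assumptions that the RFC does not specify: "iov_offset" is treated
as a byte offset into the first segment, "transfer_data" returns the number of
packets accepted, "check_completed_copies" returns the number of status
entries written, and the "sw_" names and ring depth are hypothetical. Inputs
are assumed to be well formed (offsets within the first segment, enough
destination space).

```c
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>    /* struct iovec */

/* Structures as defined in the RFC text above. */
struct iov_iter {
        size_t iov_offset;
        size_t count;
        struct iovec *iov;
        unsigned long nr_segs;
        uintptr_t usr_data;
};
struct dma_trans_desc { struct iov_iter *src; struct iov_iter *dst; };
struct dma_trans_status { uintptr_t src_usr_data; uintptr_t dst_usr_data; };
struct rte_vhost_async_channel_ops {
        int (*transfer_data)(struct dma_trans_desc *descs, uint16_t count);
        int (*check_completed_copies)(struct dma_trans_status *usr_data,
                                      uint16_t max_packets);
};

#define SW_RING_SIZE 64 /* hypothetical completion ring depth */
static struct dma_trans_status sw_done[SW_RING_SIZE];
static unsigned int sw_head, sw_tail;   /* head: next write, tail: next read */

/* Copy src->count bytes, walking both scatter-gather lists in step. */
static void sw_copy_iter(const struct iov_iter *src, const struct iov_iter *dst)
{
        unsigned long si = 0, di = 0;
        size_t soff = src->iov_offset, doff = dst->iov_offset;
        size_t left = src->count;

        while (left > 0 && si < src->nr_segs && di < dst->nr_segs) {
                size_t savail = src->iov[si].iov_len - soff;
                size_t davail = dst->iov[di].iov_len - doff;
                size_t n = savail < davail ? savail : davail;

                if (n > left)
                        n = left;
                memcpy((char *)dst->iov[di].iov_base + doff,
                       (const char *)src->iov[si].iov_base + soff, n);
                left -= n;
                soff += n;
                doff += n;
                if (soff == src->iov[si].iov_len) { si++; soff = 0; }
                if (doff == dst->iov[di].iov_len) { di++; doff = 0; }
        }
}

/* Copy each packet immediately and queue its usr_data as completed. */
static int sw_transfer_data(struct dma_trans_desc *descs, uint16_t count)
{
        uint16_t i;

        for (i = 0; i < count; i++) {
                struct dma_trans_status *st;

                if (sw_head - sw_tail == SW_RING_SIZE)
                        break;  /* completion ring full: accept fewer */
                sw_copy_iter(descs[i].src, descs[i].dst);
                st = &sw_done[sw_head++ % SW_RING_SIZE];
                st->src_usr_data = descs[i].src->usr_data;
                st->dst_usr_data = descs[i].dst->usr_data;
        }
        return i;
}

/* Hand back up to max_packets completed identifiers without blocking. */
static int sw_check_completed_copies(struct dma_trans_status *usr_data,
                                     uint16_t max_packets)
{
        uint16_t n = 0;

        while (n < max_packets && sw_tail != sw_head)
                usr_data[n++] = sw_done[sw_tail++ % SW_RING_SIZE];
        return n;
}

static struct rte_vhost_async_channel_ops sw_ops = {
        .transfer_data = sw_transfer_data,
        .check_completed_copies = sw_check_completed_copies,
};
```

A real DMA-backed implementation would instead post one hardware descriptor
per segment pair in "transfer_data" and report a packet in
"check_completed_copies" only after all of its segment copies have finished.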

Async Data APIs
==================================== 
The definitions of the new enqueue APIs are shown below:
uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
                                struct rte_mbuf **pkts, uint16_t count);

uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
                                struct rte_mbuf **pkts, uint16_t count);

"rte_vhost_submit_enqueue_burst" enqueues a batch of packets to a virtqueue
and gives ownership of the enqueued packets to the vhost library. DPDK
applications cannot reuse the enqueued packets until they get the ownership
back. For a virtqueue with DMA acceleration enabled by
"rte_vhost_async_channel_register", "rte_vhost_submit_enqueue_burst" uses the
bound DMA channel to perform packet copies; moreover, the function is
non-blocking: it only submits packet copies to the DMA channel and does not
wait for their completion. For a virtqueue without DMA acceleration,
"rte_vhost_submit_enqueue_burst" uses the SW data-path, where the CPU performs
the packet copies. Note that DPDK applications cannot directly reuse packet
buffers enqueued by "rte_vhost_submit_enqueue_burst", even if it uses the SW
data-path.

"rte_vhost_poll_enqueue_completed" returns ownership of the packets whose
copies have all completed, either by the DMA channel or by the CPU. It is a
non-blocking function and does not wait for DMA copies to complete. After
getting back ownership of packets enqueued by
"rte_vhost_submit_enqueue_burst", DPDK applications can further process the
packet buffers, e.g. free the pktmbufs.
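
The ownership hand-off between the two calls can be illustrated with a toy
model. It does not call the vhost library: "struct toy_pkt" stands in for
"struct rte_mbuf", "toy_dma_progress" stands in for the DMA engine finishing
copies in the background, and "vid"/"queue_id" are omitted. The "toy_" names,
the ring depth, and the in-order return of completed packets are assumptions
of this sketch.

```c
#include <stdint.h>

#define TOY_INFLIGHT_MAX 32     /* hypothetical in-flight ring depth */

struct toy_pkt { int id; };     /* stand-in for struct rte_mbuf */

static struct toy_pkt *toy_inflight[TOY_INFLIGHT_MAX];
/* Monotonic counters: submitted >= copied >= returned. */
static unsigned int toy_submitted, toy_copied, toy_returned;

/* Non-blocking submit: the model takes ownership of accepted packets. */
static uint16_t toy_submit_enqueue_burst(struct toy_pkt **pkts, uint16_t count)
{
        uint16_t n = 0;

        while (n < count && toy_submitted - toy_returned < TOY_INFLIGHT_MAX)
                toy_inflight[toy_submitted++ % TOY_INFLIGHT_MAX] = pkts[n++];
        return n;
}

/* Stand-in for the DMA engine completing up to n pending copies. */
static void toy_dma_progress(unsigned int n)
{
        while (n-- > 0 && toy_copied != toy_submitted)
                toy_copied++;
}

/* Non-blocking poll: hands back only packets whose copies are done. */
static uint16_t toy_poll_enqueue_completed(struct toy_pkt **pkts, uint16_t count)
{
        uint16_t n = 0;

        while (n < count && toy_returned != toy_copied)
                pkts[n++] = toy_inflight[toy_returned++ % TOY_INFLIGHT_MAX];
        return n;
}
```

An application following this pattern frees or reuses a packet only after the
poll call has returned it; a poll issued right after submit may legitimately
return zero packets.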

Sample Work Flow
==================================== 
Some DMA engines, like CBDMA, need to use physical addresses and do not support
I/O page faults. In addition, some guests may want to prevent their memory from
being swapped out. For these cases, we can pin guest memory by setting a new
flag "RTE_VHOST_USER_DMA_COPY" in rte_vhost_driver_register(). Here is an
example of how to use CBDMA to accelerate the vhost enqueue operation:
Step1: Implement DMA operation callbacks for CBDMA via IOAT PMD
Step2: call rte_vhost_driver_register with flag "RTE_VHOST_USER_DMA_COPY" (pin 
guest memory)
Step3: call rte_vhost_async_channel_register to register DMA channel
Step4: call rte_vhost_submit_enqueue_burst to enqueue packets
Step5: call rte_vhost_poll_enqueue_completed to get back the ownership of the
packets whose copies are completed
Step6: call rte_pktmbuf_free to free packet mbuf

Signed-off-by: Patrick Fu <[email protected]>
Signed-off-by: Jiayu Hu <[email protected]>
