Background
====================================
The DPDK vhost library implements a user-space VirtIO net backend that allows
host applications to communicate directly with VirtIO front-ends in VMs and
containers. However, every vhost enqueue/dequeue operation requires copying
packet buffers between guest and host memory. The overhead of copying large
amounts of data makes the vhost backend the I/O bottleneck. DMA engines,
including un-core DMA accelerators, like Crystal Beach DMA (CBDMA) and Data
Streaming Accelerator (DSA), and general-purpose DMA engines on discrete
cards, are extremely efficient at moving data within system memory. Therefore,
we propose a set of asynchronous DMA data movement APIs in the vhost library
for DMA acceleration. Offloading packet copies in the vhost data-path from the
CPU to the DMA engine not only accelerates data transfers, but also saves
precious CPU core resources.
New API Overview
====================================
The proposed APIs in the vhost library support various DMA engines to
accelerate data transfers in the data-path. For higher performance, DMA
engines work asynchronously: DMA data transfers and CPU computations execute
in parallel. The proposed API consists of a control path API and a data path
API. The control path API includes the registration API and the DMA operation
callbacks, and the data path API includes the async API. To remove the
dependency on vendor-specific DMA engines, the DMA operation callbacks provide
generic DMA data transfer abstractions. To support asynchronous DMA data
movement, the new async API provides asynchronous ring operation semantics in
the data-path. To enable/disable DMA acceleration for virtqueues, users call
the registration API to register/unregister DMA callback implementations with
the vhost library and to bind DMA channels to virtqueues. The DMA channels
used by virtqueues are provided by DPDK applications and are backed by virtual
or physical DMA devices.
The proposed APIs consist of 3 sub-sets:
1. DMA Registration APIs
2. DMA Operation Callbacks
3. Async Data APIs
DMA Registration APIs
====================================
DMA acceleration is enabled on a per-queue basis. DPDK applications need to
explicitly decide whether a virtqueue needs DMA acceleration and which DMA
channel to use. In addition, a DMA channel is dedicated to one virtqueue; it
cannot be bound to multiple virtqueues at the same time. To enable DMA
acceleration for a virtqueue, DPDK applications first implement the DMA
operation callbacks for a specific DMA type (e.g. CBDMA), then register the
callbacks with the vhost library and bind a DMA channel to the virtqueue, and
finally use the new async API to perform data-path operations on the
virtqueue.
The definitions of the registration API are shown below:
int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
                struct rte_vhost_async_channel_ops *ops);
int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
The "rte_vhost_async_channel_register" function registers the implemented DMA
operation callbacks with the vhost library and binds a DMA channel to a
virtqueue. DPDK applications must implement the corresponding DMA operation
callbacks for each DMA engine they use. To enable DMA acceleration for a
virtqueue, DPDK applications need to explicitly call
"rte_vhost_async_channel_register" for that virtqueue. The "ops" parameter
points to the implementation of the callbacks.
The "rte_vhost_async_channel_unregister" function unregisters the DMA
operation callbacks and unbinds the DMA channel from the virtqueue. If a
virtqueue is not bound to a DMA channel, it uses the SW data-path without DMA
acceleration.
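The one-channel-per-virtqueue rule described above could be tracked on the
application side with a small binding table. The following is only a minimal
sketch and is not part of the proposed API: "chan_table", "chan_bind",
"chan_unbind", "chan_lookup" and "MAX_CHANNELS" are hypothetical names, and
the sketch assumes that a virtqueue uses at most one channel. An application
would call "rte_vhost_async_channel_register" only after "chan_bind" succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHANNELS 8  /* hypothetical limit, chosen only for this sketch */

/* One entry per DMA channel; a channel serves at most one virtqueue. */
struct chan_binding {
    bool     in_use;
    int      vid;       /* vhost device ID */
    uint16_t queue_id;  /* virtqueue index within the device */
};

struct chan_table {
    struct chan_binding chan[MAX_CHANNELS];
};

/* Return the channel bound to (vid, queue_id), or -1 if there is none. */
static int chan_lookup(const struct chan_table *t, int vid, uint16_t queue_id)
{
    assert(t != NULL);
    for (int i = 0; i < MAX_CHANNELS; i++) {
        if (t->chan[i].in_use && t->chan[i].vid == vid &&
            t->chan[i].queue_id == queue_id)
            return i;
    }
    return -1;
}

/*
 * Bind channel_id to (vid, queue_id). Returns 0 on success, or -1 if the
 * channel is already bound or the virtqueue already has a channel.
 */
static int chan_bind(struct chan_table *t, int channel_id, int vid,
                     uint16_t queue_id)
{
    assert(t != NULL);
    if (channel_id < 0 || channel_id >= MAX_CHANNELS)
        return -1;
    if (t->chan[channel_id].in_use)
        return -1;
    if (chan_lookup(t, vid, queue_id) >= 0)
        return -1;
    t->chan[channel_id].in_use = true;
    t->chan[channel_id].vid = vid;
    t->chan[channel_id].queue_id = queue_id;
    return 0;
}

/* Release the channel bound to (vid, queue_id); -1 if none was bound. */
static int chan_unbind(struct chan_table *t, int vid, uint16_t queue_id)
{
    int i = chan_lookup(t, vid, queue_id);

    if (i < 0)
        return -1;
    t->chan[i].in_use = false;
    return 0;
}
```

With such a table, a failed "chan_bind" tells the application to leave the
virtqueue on the SW data-path instead of sharing a channel.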
DMA Operation Callbacks
====================================
The definitions of the DMA operation callbacks are shown below:
struct iovec { /** this is kernel uapi structure */
        void *iov_base; /** buffer address */
        size_t iov_len; /** buffer length */
};
struct iov_iter {
        size_t iov_offset;
        size_t count; /** total bytes of a packet */
        struct iovec *iov; /** array of data buffers */
        unsigned long nr_segs; /** number of iovec structures */
        uintptr_t usr_data; /** app specific memory handler */
};
struct dma_trans_desc {
        struct iov_iter *src; /** source memory iov_iter */
        struct iov_iter *dst; /** destination memory iov_iter */
};
struct dma_trans_status {
        uintptr_t src_usr_data; /** trans completed memory handler */
        uintptr_t dst_usr_data; /** trans completed memory handler */
};
struct rte_vhost_async_channel_ops {
        /** Instruct a DMA channel to perform copies for a batch of packets */
        int (*transfer_data)(struct dma_trans_desc *descs,
                        uint16_t count);
        /** Check copy-completed packets from a DMA channel */
        int (*check_completed_copies)(struct dma_trans_status *usr_data,
                        uint16_t max_packets);
};
The first callback, "transfer_data", submits a batch of packet copies to a
DMA channel. As a packet's source or destination buffer can be a vector of
buffers or a single data stream, we use "struct dma_trans_desc" to describe
the source and destination buffers of a packet. Copying a packet means moving
data from the source iov_iter structure to the destination iov_iter structure.
The "count" parameter is the number of packets to copy.
The second callback, "check_completed_copies", queries the completion status
of the DMA channel. A "usr_data" member is embedded in the "iov_iter"
structure and serves as a unique identifier of the memory region described by
that "iov_iter". As the source/destination buffer can be scatter-gather, the
DMA channel may perform its copies out of order. When all copies of an
iov_iter have been completed by the DMA channel, "check_completed_copies"
should return the associated "usr_data" through the "dma_trans_status"
structure.
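To make the callback contract more concrete, the two callbacks could be
implemented by a software stand-in that copies with the CPU. This is only a
sketch for illustration, not a real DMA driver. The "sw_" names and
"SW_RING_SIZE" are hypothetical. The sketch also assumes that "iov_offset" is
a byte offset into the first segment, that "transfer_data" returns the number
of packets accepted, and that "check_completed_copies" returns the number of
status entries written; the text above does not define these details.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>  /* struct iovec */

/* Structures as given in the definitions above. */
struct iov_iter {
    size_t iov_offset;
    size_t count;           /* total bytes of a packet */
    struct iovec *iov;      /* array of data buffers */
    unsigned long nr_segs;  /* number of iovec structures */
    uintptr_t usr_data;     /* app specific memory handler */
};
struct dma_trans_desc { struct iov_iter *src; struct iov_iter *dst; };
struct dma_trans_status { uintptr_t src_usr_data; uintptr_t dst_usr_data; };
struct rte_vhost_async_channel_ops {
    int (*transfer_data)(struct dma_trans_desc *descs, uint16_t count);
    int (*check_completed_copies)(struct dma_trans_status *usr_data,
                                  uint16_t max_packets);
};

#define SW_RING_SIZE 64  /* hypothetical completion ring size (power of 2) */

static struct dma_trans_status sw_done[SW_RING_SIZE];
static unsigned int sw_head, sw_tail;  /* next to report / next free slot */

/* Copy src->count bytes from the source segments to the destination ones. */
static void sw_copy_iter(const struct iov_iter *src,
                         const struct iov_iter *dst)
{
    unsigned long si = 0, di = 0;
    size_t soff = src->iov_offset, doff = dst->iov_offset;
    size_t left = src->count;

    while (left > 0 && si < src->nr_segs && di < dst->nr_segs) {
        size_t savail = src->iov[si].iov_len - soff;
        size_t davail = dst->iov[di].iov_len - doff;
        size_t n = savail < davail ? savail : davail;

        if (n > left)
            n = left;
        memcpy((char *)dst->iov[di].iov_base + doff,
               (const char *)src->iov[si].iov_base + soff, n);
        soff += n; doff += n; left -= n;
        if (soff == src->iov[si].iov_len) { si++; soff = 0; }
        if (doff == dst->iov[di].iov_len) { di++; doff = 0; }
    }
}

/* "Submit" copies: here they run at once and are queued as completed. */
static int sw_transfer_data(struct dma_trans_desc *descs, uint16_t count)
{
    uint16_t i;

    assert(descs != NULL);
    for (i = 0; i < count; i++) {
        if (sw_tail - sw_head == SW_RING_SIZE)
            break;  /* completion ring is full */
        sw_copy_iter(descs[i].src, descs[i].dst);
        sw_done[sw_tail % SW_RING_SIZE].src_usr_data = descs[i].src->usr_data;
        sw_done[sw_tail % SW_RING_SIZE].dst_usr_data = descs[i].dst->usr_data;
        sw_tail++;
    }
    return i;
}

/* Report up to max_packets finished packets by their usr_data handles. */
static int sw_check_completed_copies(struct dma_trans_status *usr_data,
                                     uint16_t max_packets)
{
    uint16_t n = 0;

    while (n < max_packets && sw_head != sw_tail)
        usr_data[n++] = sw_done[sw_head++ % SW_RING_SIZE];
    return n;
}

static const struct rte_vhost_async_channel_ops sw_ops = {
    .transfer_data = sw_transfer_data,
    .check_completed_copies = sw_check_completed_copies,
};
```

A real DMA backend would queue hardware descriptors in "transfer_data" and
report a packet in "check_completed_copies" only after every segment copy of
that packet has finished.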
Async Data APIs
====================================
The definitions of the new enqueue API are shown below:
uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
                struct rte_mbuf **pkts, uint16_t count);
uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
                struct rte_mbuf **pkts, uint16_t count);
The "rte_vhost_submit_enqueue_burst" function enqueues a batch of packets to a
virtqueue and hands ownership of the enqueued packets to the vhost library.
DPDK applications cannot reuse the enqueued packets until they get the
ownership back. For a virtqueue with DMA acceleration enabled by
"rte_vhost_async_channel_register", "rte_vhost_submit_enqueue_burst" uses the
bound DMA channel to perform packet copies. The function is non-blocking: it
only submits packet copies to the DMA channel and does not wait for their
completion. For a virtqueue without DMA acceleration,
"rte_vhost_submit_enqueue_burst" uses the SW data-path, where the CPU performs
the packet copies. It is worth noting that DPDK applications cannot directly
reuse packet buffers enqueued by "rte_vhost_submit_enqueue_burst", even if it
uses the SW data-path.
The "rte_vhost_poll_enqueue_completed" function returns ownership of the
packets whose copies have all completed, either by the DMA channel or by the
CPU. It is a non-blocking function and does not wait for DMA copies to
complete. After getting back the ownership of packets enqueued by
"rte_vhost_submit_enqueue_burst", DPDK applications can further process the
packet buffers, e.g. free the pktmbufs.
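The ownership hand-off described above can be illustrated with a toy model
that does not use DPDK at all. Everything in it is hypothetical: "toy_pkt"
stands in for "struct rte_mbuf", "toy_submit_enqueue_burst" and
"toy_poll_enqueue_completed" only mimic the shape of the proposed functions,
and "toy_complete_copies" simulates the DMA channel finishing copies. The
model also assumes that ownership is returned in submission order, which the
text above does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RING 32  /* hypothetical in-flight capacity (power of 2) */

struct toy_pkt {
    int owned_by_app;  /* 1: application may reuse or free the packet */
};

struct toy_vq {
    struct toy_pkt *inflight[TOY_RING];
    unsigned int head, tail;  /* oldest in-flight slot / next free slot */
    unsigned int copied;      /* in-flight packets whose copies finished */
};

/* Hand packets to the queue; the application loses ownership of them. */
static uint16_t toy_submit_enqueue_burst(struct toy_vq *vq,
                                         struct toy_pkt **pkts,
                                         uint16_t count)
{
    uint16_t i;

    assert(vq != NULL && pkts != NULL);
    for (i = 0; i < count && vq->tail - vq->head < TOY_RING; i++) {
        pkts[i]->owned_by_app = 0;
        vq->inflight[vq->tail++ % TOY_RING] = pkts[i];
    }
    return i;  /* number of packets accepted, without waiting for copies */
}

/* Simulate the DMA channel (or the CPU) finishing n more packet copies. */
static void toy_complete_copies(struct toy_vq *vq, unsigned int n)
{
    unsigned int pending = vq->tail - vq->head - vq->copied;

    vq->copied += n < pending ? n : pending;
}

/* Return ownership of up to count packets whose copies have completed. */
static uint16_t toy_poll_enqueue_completed(struct toy_vq *vq,
                                           struct toy_pkt **pkts,
                                           uint16_t count)
{
    uint16_t n = 0;

    assert(vq != NULL && pkts != NULL);
    while (n < count && vq->copied > 0) {
        struct toy_pkt *p = vq->inflight[vq->head++ % TOY_RING];

        p->owned_by_app = 1;
        pkts[n++] = p;
        vq->copied--;
    }
    return n;  /* may be 0: the call never blocks on pending copies */
}
```

In this model, an application would free or reuse a packet only after it
appears in the array filled by "toy_poll_enqueue_completed".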
Sample Work Flow
====================================
Some DMA engines, like CBDMA, need to use physical addresses and do not support
I/O page faults. In addition, some guests may want to prevent their memory from
being swapped out. For these cases, guest memory can be pinned by setting a new
flag, "RTE_VHOST_USER_DMA_COPY", in rte_vhost_driver_register(). Here is an
example of how to use CBDMA to accelerate the vhost enqueue operation:
Step1: Implement DMA operation callbacks for CBDMA via the IOAT PMD
Step2: Call rte_vhost_driver_register with the flag "RTE_VHOST_USER_DMA_COPY"
(pin guest memory)
Step3: Call rte_vhost_async_channel_register to register the DMA channel
Step4: Call rte_vhost_submit_enqueue_burst to enqueue packets
Step5: Call rte_vhost_poll_enqueue_completed to get back the ownership of the
packets whose copies are completed
Step6: Call rte_pktmbuf_free to free the packet mbufs
Signed-off-by: Patrick Fu <[email protected]>
Signed-off-by: Jiayu Hu <[email protected]>