On 2017-01-28 05:33, John Fastabend wrote:
This adds ndo ops for upper layer objects to request direct DMA from the network interface into memory "slots". The slots must be DMA'able memory given by a page/offset/size vector in a packet_ring_buffer structure. The PF_PACKET socket interface can use these ndo ops to do zerocopy RX from the network device into memory-mapped userspace memory. For this to work, drivers encode the correct descriptor blocks and headers so that existing PF_PACKET applications work without any modification. This only supports the V2 header format for now, and works by mapping a ring of the network device to these slots.

Originally I used the V3 header format, but it complicates the driver a bit. V3 added bulk polling via socket calls and timers used in the polling interface to return every n milliseconds. Currently I don't see any way to support this in hardware, because we can't know whether or not the hardware is in the middle of a DMA operation on a slot. So when a timer fires, I don't know how to advance the descriptor ring while leaving empty descriptors, the way the software ring works. The easiest (best?) route is to simply not support this. It might be worth creating a new v4 header that is simple for drivers to support with direct DMA ops. I can imagine using the xdp_buff structure as a header, for example. Thoughts?

The ndo operations and the new socket option PACKET_RX_DIRECT work by giving a queue_index to run the direct DMA operations over. Once setsockopt returns successfully, the indicated queue is mapped directly to the requesting application and can not be used for other purposes. Also, any kernel layers such as tc will be bypassed and need to be implemented in the hardware via some other mechanism, such as tc offload or other offload interfaces. Users steer traffic to the selected queue using flow director, the tc offload infrastructure, or macvlan offload.

The new socket option added to PF_PACKET is called PACKET_RX_DIRECT.
It takes a single unsigned int value specifying the queue index:

    setsockopt(sock, SOL_PACKET, PACKET_RX_DIRECT,
               &queue_index, sizeof(queue_index));

Implementing busy_poll support will allow userspace to kick the driver's receive routine if needed. This work is TBD.

To test this I hacked a hardcoded test into the tool psock_tpacket in the kernel selftests directory here:

    ./tools/testing/selftests/net/psock_tpacket.c

Running this tool opens a socket and listens for packets over the PACKET_RX_DIRECT enabled socket. Obviously it needs to be reworked to enable all the older tests and not hardcode my interface before it actually gets released.

In general this is a rough patch to explore the interface and put something concrete up for debate. The patch does not handle all the error cases correctly and needs to be cleaned up.

Known limitations (TBD):

 (1) Users are required to match the number of rx ring slots configured with ethtool to the number requested by the setsockopt PF_PACKET layout. In the future we could possibly do this automatically.

 (2) Users need to configure flow director or setup_tc to steer traffic to the correct queues. I don't believe this needs to be changed; it seems to be a good mechanism for driving directed DMA.

 (3) Not supporting timestamps or priv space yet; pushing a v4 packet header would resolve this nicely.

 (4) Only RX is supported so far. TX already supports a direct DMA interface, but it uses skbs, which is really not needed. In the TX_RING case we can optimize this path as well. To support the TX case we can use a similar "slots" mechanism and a kick operation. The kick could be a busy_poll-like operation, but on the TX side. The flow would be: user space loads up n slots with packets, kicks the tx busy-poll bit, the driver sends the packets, and finally, when xmit is complete, the driver clears the header bits to give the slots back. When qdisc bypass is set today we already bypass the entire stack, so there is no particular reason to use skbs in this case.
Using an xdp_buff as a v4 packet header would also allow us to consolidate driver code.

To be done:

 (1) More testing and performance analysis
 (2) Busy polling sockets
 (3) Implement v4 xdp_buff headers for analysis
 (4) Performance testing :/ hopefully it looks good.

Signed-off-by: John Fastabend <john.r.fastab...@intel.com>

I like this idea, and we should generalize the API so that RX zerocopy is not specific to the packet socket. Then we can use it for e.g. macvtap (pass-through mode). But instead of the fixed headers, the ndo ops should support refill from non-fixed memory locations in userspace (per packet, or per batch of packets) to satisfy the requirements of virtqueues.

Thanks
[...]