Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
2018-04-11 20:43 GMT+02:00 Alexei Starovoitov:
> On 4/11/18 5:17 AM, Björn Töpel wrote:
>>
>> In the current RFC you are required to create both an Rx and Tx
>> queue to bind the socket, which is just weird for your "Rx on one
>> device, Tx to another" scenario. I'll fix that in the next RFC.
>
> I would defer on adding new features until the key functionality
> lands. imo it's in good shape and I would submit it without RFC tag
> as soon as net-next reopens.

Yes, makes sense. We're doing some ptr_ring-like vs head/tail
measurements, and depending on the result we'll send out a proper
patch when net-next is open again.

What tree should we target -- bpf-next or net-next?

Thanks!
Björn
Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
On 4/11/18 5:17 AM, Björn Töpel wrote:
> In the current RFC you are required to create both an Rx and Tx
> queue to bind the socket, which is just weird for your "Rx on one
> device, Tx to another" scenario. I'll fix that in the next RFC.

I would defer on adding new features until the key functionality
lands. imo it's in good shape and I would submit it without RFC tag
as soon as net-next reopens.
Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
2018-04-10 16:14 GMT+02:00 William Tu:
> On Mon, Apr 9, 2018 at 11:47 PM, Björn Töpel wrote:
[...]
>>
>> So you've setup two identical UMEMs? Then you can just forward the
>> incoming Rx descriptor to the other netdev's Tx queue. Note, that you
>> only need to copy the descriptor, not the actual frame data.
>>
>
> Thanks!
> I will give it a try, I guess you're saying I can do below:
>
> int sfd1; // for device1
> int sfd2; // for device2
> ...
> // create 2 umem
> umem1 = calloc(1, sizeof(*umem));
> umem2 = calloc(1, sizeof(*umem));
>
> // allocate 1 shared buffer, 1 xdp_umem_reg
> posix_memalign(&bufs, ...)
> mr.addr = (__u64)bufs; // shared for umem1,2
> ...
>
> // umem reg the same mr
> setsockopt(sfd1, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr))
> setsockopt(sfd2, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr))
>
> // setup fill, completion, mmap for sfd1 and sfd2
> ...
>
> Since both device can put frame data in 'bufs', I only need to copy
> the descs between 2 umem1 and umem2. Am I understand correct?
>

Yup, spot on! umem1 and umem2 have the same layout/index "address
space", so you can just forward the descriptors and never touch the
data.

In the current RFC you are required to create both an Rx and Tx
queue to bind the socket, which is just weird for your "Rx on one
device, Tx to another" scenario. I'll fix that in the next RFC.

Björn

> Regards,
> William
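[Editor's sketch] The point about a shared layout/index "address space" can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct model_umem, struct model_desc, model_desc_data() and model_forward() are hypothetical names invented for this illustration, not the RFC's kernel structures or API. It shows that when two UMEMs describe the same buffer with the same frame size, forwarding a packet only requires copying the small descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a UMEM: a buffer split into equally sized frames.
 * This is NOT the kernel's representation. */
struct model_umem {
    uint8_t *frames;      /* base address of the packet buffer area */
    uint32_t frame_size;  /* size of every frame in bytes */
    uint32_t nframes;     /* number of frames in the area */
};

/* Hypothetical model of a descriptor: it names a frame, it holds no data. */
struct model_desc {
    uint32_t frame_id;    /* which frame holds the packet */
    uint32_t len;         /* packet length in bytes */
    uint16_t offset;      /* start of the packet within the frame */
};

/* Resolve a descriptor to the packet bytes it refers to in a given UMEM. */
static uint8_t *model_desc_data(const struct model_umem *umem,
                                const struct model_desc *desc)
{
    assert(desc->frame_id < umem->nframes);
    assert((uint32_t)desc->offset + desc->len <= umem->frame_size);
    return umem->frames + (size_t)desc->frame_id * umem->frame_size
           + desc->offset;
}

/* Forward an Rx descriptor towards another device's Tx side.
 * Only the descriptor is copied; the frame data is never touched. */
static struct model_desc model_forward(const struct model_desc *rx_desc)
{
    return *rx_desc;
}
```

If umem1 and umem2 are both initialized with the same buffer, frame size and frame count, a descriptor taken from the first resolves to the same bytes when looked up through the second, which is the property the forwarding scheme relies on. Frame ownership between user space and the kernel still has to be tracked by the application, as noted elsewhere in this thread.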
Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
On Mon, Apr 9, 2018 at 11:47 PM, Björn Töpel wrote: > 2018-04-09 23:51 GMT+02:00 William Tu : >> On Tue, Mar 27, 2018 at 9:59 AM, Björn Töpel wrote: >>> From: Björn Töpel >>> >>> This RFC introduces a new address family called AF_XDP that is >>> optimized for high performance packet processing and, in upcoming >>> patch sets, zero-copy semantics. In this v2 version, we have removed >>> all zero-copy related code in order to make it smaller, simpler and >>> hopefully more review friendly. This RFC only supports copy-mode for >>> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX >>> using the XDP_DRV path. Zero-copy support requires XDP and driver >>> changes that Jesper Dangaard Brouer is working on. Some of his work is >>> already on the mailing list for review. We will publish our zero-copy >>> support for RX and TX on top of his patch sets at a later point in >>> time. >>> >>> An AF_XDP socket (XSK) is created with the normal socket() >>> syscall. Associated with each XSK are two queues: the RX queue and the >>> TX queue. A socket can receive packets on the RX queue and it can send >>> packets on the TX queue. These queues are registered and sized with >>> the setsockopts XDP_RX_QUEUE and XDP_TX_QUEUE, respectively. It is >>> mandatory to have at least one of these queues for each socket. In >>> contrast to AF_PACKET V2/V3 these descriptor queues are separated from >>> packet buffers. An RX or TX descriptor points to a data buffer in a >>> memory area called a UMEM. RX and TX can share the same UMEM so that a >>> packet does not have to be copied between RX and TX. Moreover, if a >>> packet needs to be kept for a while due to a possible retransmit, the >>> descriptor that points to that packet can be changed to point to >>> another and reused right away. This again avoids copying data. >>> >>> This new dedicated packet buffer area is called a UMEM. It consists of >>> a number of equally size frames and each frame has a unique frame >>> id. 
A descriptor in one of the queues references a frame by >>> referencing its frame id. The user space allocates memory for this >>> UMEM using whatever means it feels is most appropriate (malloc, mmap, >>> huge pages, etc). This memory area is then registered with the kernel >>> using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues: >>> the FILL queue and the COMPLETION queue. The fill queue is used by the >>> application to send down frame ids for the kernel to fill in with RX >>> packet data. References to these frames will then appear in the RX >>> queue of the XSK once they have been received. The completion queue, >>> on the other hand, contains frame ids that the kernel has transmitted >>> completely and can now be used again by user space, for either TX or >>> RX. Thus, the frame ids appearing in the completion queue are ids that >>> were previously transmitted using the TX queue. In summary, the RX and >>> FILL queues are used for the RX path and the TX and COMPLETION queues >>> are used for the TX path. >>> >> Can we register a UMEM to multiple device's queue? >> > > No, one UMEM, one netdev queue in this RFC. That being said, there's > nothing stopping a user from creating an additional UMEM, say UMEM', > pointing to the same memory as UMEM, but bound to another > netdev/queue. Note that the user space application has to make sure > that the buffer handling is sane (user/kernel frame ownership). > > We used to allow to share UMEM between unrelated sockets, but after > the introduction of the UMEM queues (fill/completion) that's no the > case any more. For the zero-copy scenario, having to manage multiple > DMA mappings per UMEM was a bit of a mess, so we went for the simpler > (current) solution with one UMEM per netdev/queue. > >> So far the l2fwd sample code is sending/receiving from the same >> queue. I'm thinking about forwarding packets from one device to another. 
>> Now I'm copying packets from one device's RX desc to another device's TX
>> completion queue. But this introduces one extra copy.
>>
>
> So you've setup two identical UMEMs? Then you can just forward the
> incoming Rx descriptor to the other netdev's Tx queue. Note, that you
> only need to copy the descriptor, not the actual frame data.
>

Thanks!
I will give it a try. I guess you're saying I can do the following:

int sfd1; // for device1
int sfd2; // for device2
...
// create 2 umem
umem1 = calloc(1, sizeof(*umem));
umem2 = calloc(1, sizeof(*umem));

// allocate 1 shared buffer, 1 xdp_umem_reg
posix_memalign(&bufs, ...)
mr.addr = (__u64)bufs; // shared for umem1,2
...

// umem reg the same mr
setsockopt(sfd1, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr))
setsockopt(sfd2, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr))

// setup fill, completion, mmap for sfd1 and sfd2
...

Since both devices can put frame data in 'bufs', I only need to copy
the descs between umem1 and umem2. Do I understand correctly?

Regards,
William
Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
2018-04-09 23:51 GMT+02:00 William Tu : > On Tue, Mar 27, 2018 at 9:59 AM, Björn Töpel wrote: >> From: Björn Töpel >> >> This RFC introduces a new address family called AF_XDP that is >> optimized for high performance packet processing and, in upcoming >> patch sets, zero-copy semantics. In this v2 version, we have removed >> all zero-copy related code in order to make it smaller, simpler and >> hopefully more review friendly. This RFC only supports copy-mode for >> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX >> using the XDP_DRV path. Zero-copy support requires XDP and driver >> changes that Jesper Dangaard Brouer is working on. Some of his work is >> already on the mailing list for review. We will publish our zero-copy >> support for RX and TX on top of his patch sets at a later point in >> time. >> >> An AF_XDP socket (XSK) is created with the normal socket() >> syscall. Associated with each XSK are two queues: the RX queue and the >> TX queue. A socket can receive packets on the RX queue and it can send >> packets on the TX queue. These queues are registered and sized with >> the setsockopts XDP_RX_QUEUE and XDP_TX_QUEUE, respectively. It is >> mandatory to have at least one of these queues for each socket. In >> contrast to AF_PACKET V2/V3 these descriptor queues are separated from >> packet buffers. An RX or TX descriptor points to a data buffer in a >> memory area called a UMEM. RX and TX can share the same UMEM so that a >> packet does not have to be copied between RX and TX. Moreover, if a >> packet needs to be kept for a while due to a possible retransmit, the >> descriptor that points to that packet can be changed to point to >> another and reused right away. This again avoids copying data. >> >> This new dedicated packet buffer area is called a UMEM. It consists of >> a number of equally size frames and each frame has a unique frame >> id. A descriptor in one of the queues references a frame by >> referencing its frame id. 
The user space allocates memory for this
>> UMEM using whatever means it feels is most appropriate (malloc, mmap,
>> huge pages, etc). This memory area is then registered with the kernel
>> using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
>> the FILL queue and the COMPLETION queue. The fill queue is used by the
>> application to send down frame ids for the kernel to fill in with RX
>> packet data. References to these frames will then appear in the RX
>> queue of the XSK once they have been received. The completion queue,
>> on the other hand, contains frame ids that the kernel has transmitted
>> completely and can now be used again by user space, for either TX or
>> RX. Thus, the frame ids appearing in the completion queue are ids that
>> were previously transmitted using the TX queue. In summary, the RX and
>> FILL queues are used for the RX path and the TX and COMPLETION queues
>> are used for the TX path.
>>
> Can we register a UMEM to multiple device's queue?
>

No, one UMEM, one netdev queue in this RFC. That being said, there's
nothing stopping a user from creating an additional UMEM, say UMEM',
pointing to the same memory as UMEM, but bound to another
netdev/queue. Note that the user space application has to make sure
that the buffer handling is sane (user/kernel frame ownership).

We used to allow sharing a UMEM between unrelated sockets, but after
the introduction of the UMEM queues (fill/completion) that's not the
case any more. For the zero-copy scenario, having to manage multiple
DMA mappings per UMEM was a bit of a mess, so we went for the simpler
(current) solution with one UMEM per netdev/queue.

> So far the l2fwd sample code is sending/receiving from the same
> queue. I'm thinking about forwarding packets from one device to another.
> Now I'm copying packets from one device's RX desc to another device's TX
> completion queue. But this introduces one extra copy.
>

So you've set up two identical UMEMs?
Then you can just forward the incoming Rx descriptor to the other
netdev's Tx queue. Note, that you only need to copy the descriptor,
not the actual frame data.

> One way I can do is to call bpf_redirect helper function, but sometimes
> I still need to process the packet in userspace.
>
> I like this work!
> Thanks a lot.

Happy to hear that, and thanks a bunch for trying it out. Keep that
feedback coming!

Björn

> William
Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
On Tue, Mar 27, 2018 at 9:59 AM, Björn Töpel wrote: > From: Björn Töpel > > This RFC introduces a new address family called AF_XDP that is > optimized for high performance packet processing and, in upcoming > patch sets, zero-copy semantics. In this v2 version, we have removed > all zero-copy related code in order to make it smaller, simpler and > hopefully more review friendly. This RFC only supports copy-mode for > the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX > using the XDP_DRV path. Zero-copy support requires XDP and driver > changes that Jesper Dangaard Brouer is working on. Some of his work is > already on the mailing list for review. We will publish our zero-copy > support for RX and TX on top of his patch sets at a later point in > time. > > An AF_XDP socket (XSK) is created with the normal socket() > syscall. Associated with each XSK are two queues: the RX queue and the > TX queue. A socket can receive packets on the RX queue and it can send > packets on the TX queue. These queues are registered and sized with > the setsockopts XDP_RX_QUEUE and XDP_TX_QUEUE, respectively. It is > mandatory to have at least one of these queues for each socket. In > contrast to AF_PACKET V2/V3 these descriptor queues are separated from > packet buffers. An RX or TX descriptor points to a data buffer in a > memory area called a UMEM. RX and TX can share the same UMEM so that a > packet does not have to be copied between RX and TX. Moreover, if a > packet needs to be kept for a while due to a possible retransmit, the > descriptor that points to that packet can be changed to point to > another and reused right away. This again avoids copying data. > > This new dedicated packet buffer area is called a UMEM. It consists of > a number of equally size frames and each frame has a unique frame > id. A descriptor in one of the queues references a frame by > referencing its frame id. 
The user space allocates memory for this
> UMEM using whatever means it feels is most appropriate (malloc, mmap,
> huge pages, etc). This memory area is then registered with the kernel
> using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
> the FILL queue and the COMPLETION queue. The fill queue is used by the
> application to send down frame ids for the kernel to fill in with RX
> packet data. References to these frames will then appear in the RX
> queue of the XSK once they have been received. The completion queue,
> on the other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
>

Can we register a UMEM to multiple devices' queues?

So far the l2fwd sample code is sending/receiving from the same
queue. I'm thinking about forwarding packets from one device to another.
Now I'm copying packets from one device's RX desc to another device's TX
completion queue. But this introduces one extra copy.

One way to do this is to call the bpf_redirect helper function, but
sometimes I still need to process the packet in userspace.

I like this work!
Thanks a lot.
William
Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
On Thu, 29 Mar 2018 08:16:23 +0200 Björn Töpel wrote:

> 2018-03-28 23:18 GMT+02:00 Eric Leblond:
> > Hello,
> >
> > On Tue, 2018-03-27 at 18:59 +0200, Björn Töpel wrote:
> >> From: Björn Töpel
> >>
> >> optimized for high performance packet processing and, in upcoming
> >> patch sets, zero-copy semantics. In this v2 version, we have removed
> >> all zero-copy related code in order to make it smaller, simpler and
> >> hopefully more review friendly. This RFC only supports copy-mode for
> >> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for
> >> RX
> >
> > ...
> >>
> >> How is then packets distributed between these two XSK? We have
> >> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
> >> full). The user-space application can place an XSK at an arbitrary
> >> place in this map. The XDP program can then redirect a packet to a
> >> specific index in this map and at this point XDP validates that the
> >> XSK in that map was indeed bound to that device and queue number. If
> >> not, the packet is dropped. If the map is empty at that index, the
> >> packet is also dropped. This also means that it is currently
> >> mandatory
> >> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
> >> to get any traffic to user space through the XSK.
> >
> > If I get it correctly, this feature will have to be used to bound
> > multiple sockets to a single queue and the eBPF filter will be
> > responsible of the load balancing. Am I correct ?
> >
>
> Exactly! The XDP program executing for a certain Rx queue will
> distribute the packets to the socket(s) in the xskmap.

It is important to understand that we (want/need to) maintain a Single
Producer Single Consumer (SPSC) scenario here, for performance reasons.
This _is_ maintained in this patchset AFAIK. But as the API user, you
have to understand that the responsibility of aligning this is yours!
If you don't the frames are dropped (silently).
The BPF programmer MUST select the correct XSKMAP index, such that
ctx->rx_queue_index matches the queue_id registered in the xdp_sock
(and the bound ifindex also matches).

Björn, Magnus and I have discussed other API options, e.g. one where
the XSKMAP index _is_ the rx_queue_index, and the BPF programmer is not
allowed to select another index. We settled on the API in the patchset,
where the BPF programmer gets more freedom, and can select an invalid
index, which causes packets to be dropped.

An advantage of this API is that we allow one RX-queue to multiplex
into many xdp_sock's (all bound to this same RX-queue and ifindex).
This still maintains a Single Producer, as the RX-queue just has a
Single Producer relationship with each xdp_sock.

I imagine that Suricata/Eric wants to capture all the RX-queues on the
net_device. For this to happen, he needs to create an xdp_sock per
RX-queue, and have a side-bpf-map that assists in the XSKMAP lookup, or
simply populate the XSKMAP to correspond to the rx_queue_index.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
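[Editor's sketch] The drop rules described in this message can be modelled in plain C. This is a sketch of the described behaviour only, not the kernel's XSKMAP code: struct model_xsk, struct model_xskmap and model_redirect() are hypothetical names, and the fixed map size is an assumption made for the example. A packet is delivered only if the chosen slot is populated and the socket in it was bound to the same ifindex and queue the packet arrived on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_XSKMAP_SIZE 8u  /* arbitrary size chosen for this model */

/* Hypothetical stand-in for a bound socket: the device and queue it was
 * bound to with bind(). */
struct model_xsk {
    int ifindex;
    uint32_t queue_id;
};

/* Hypothetical stand-in for the XSKMAP: an array of socket references,
 * where NULL marks an empty slot. */
struct model_xskmap {
    const struct model_xsk *slot[MODEL_XSKMAP_SIZE];
};

enum model_verdict { MODEL_DROP = 0, MODEL_DELIVER = 1 };

/* Model of redirecting a packet, received on (rx_ifindex, rx_queue_index),
 * to the socket stored at `index` in the map. */
static enum model_verdict model_redirect(const struct model_xskmap *map,
                                         uint32_t index,
                                         int rx_ifindex,
                                         uint32_t rx_queue_index)
{
    const struct model_xsk *xs;

    if (index >= MODEL_XSKMAP_SIZE)
        return MODEL_DROP;          /* index outside the map */

    xs = map->slot[index];
    if (xs == NULL)
        return MODEL_DROP;          /* map is empty at that index */

    if (xs->ifindex != rx_ifindex || xs->queue_id != rx_queue_index)
        return MODEL_DROP;          /* socket bound to another device/queue */

    return MODEL_DELIVER;
}
```

Populating slot i with the socket bound to RX-queue i, and redirecting with the receive queue index as the map index, corresponds to the "simply populate the XSKMAP to correspond to the rx_queue_index" option; any other index selection needs a side lookup to avoid silent drops.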
Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
2018-03-28 23:18 GMT+02:00 Eric Leblond:
> Hello,
>
> On Tue, 2018-03-27 at 18:59 +0200, Björn Töpel wrote:
>> From: Björn Töpel
>>
>> optimized for high performance packet processing and, in upcoming
>> patch sets, zero-copy semantics. In this v2 version, we have removed
>> all zero-copy related code in order to make it smaller, simpler and
>> hopefully more review friendly. This RFC only supports copy-mode for
>> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for
>> RX
>
> ...
>>
>> How is then packets distributed between these two XSK? We have
>> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
>> full). The user-space application can place an XSK at an arbitrary
>> place in this map. The XDP program can then redirect a packet to a
>> specific index in this map and at this point XDP validates that the
>> XSK in that map was indeed bound to that device and queue number. If
>> not, the packet is dropped. If the map is empty at that index, the
>> packet is also dropped. This also means that it is currently
>> mandatory
>> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
>> to get any traffic to user space through the XSK.
>
> If I get it correctly, this feature will have to be used to bound
> multiple sockets to a single queue and the eBPF filter will be
> responsible of the load balancing. Am I correct ?
>

Exactly! The XDP program executing for a certain Rx queue will
distribute the packets to the socket(s) in the xskmap.

>> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If
>> the
>> driver does not have support for XDP, or XDP_SKB is explicitly chosen
> ...
>
> Thanks a lot for this work, I'm gonna try to implement this in
> Suricata.
>

Thanks for trying it out! All input is very much appreciated
(clunkiness of API, crashes...)!

Björn

> Best regards,
> --
> Eric Leblond
Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
Hello,

On Tue, 2018-03-27 at 18:59 +0200, Björn Töpel wrote:
> From: Björn Töpel
>
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for
> RX

...
>
> How is then packets distributed between these two XSK? We have
> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
> full). The user-space application can place an XSK at an arbitrary
> place in this map. The XDP program can then redirect a packet to a
> specific index in this map and at this point XDP validates that the
> XSK in that map was indeed bound to that device and queue number. If
> not, the packet is dropped. If the map is empty at that index, the
> packet is also dropped. This also means that it is currently
> mandatory
> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
> to get any traffic to user space through the XSK.

If I get it correctly, this feature will have to be used to bind
multiple sockets to a single queue, and the eBPF filter will be
responsible for the load balancing. Am I correct?

> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If
> the
> driver does not have support for XDP, or XDP_SKB is explicitly chosen
...

Thanks a lot for this work, I'm gonna try to implement this in
Suricata.

Best regards,
--
Eric Leblond
[RFC PATCH v2 00/14] Introducing AF_XDP support
From: Björn Töpel

This RFC introduces a new address family called AF_XDP that is
optimized for high performance packet processing and, in upcoming
patch sets, zero-copy semantics. In this v2 version, we have removed
all zero-copy related code in order to make it smaller, simpler and
hopefully more review friendly. This RFC only supports copy-mode for
the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
using the XDP_DRV path. Zero-copy support requires XDP and driver
changes that Jesper Dangaard Brouer is working on. Some of his work is
already on the mailing list for review. We will publish our zero-copy
support for RX and TX on top of his patch sets at a later point in
time.

An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two queues: the RX queue and the
TX queue. A socket can receive packets on the RX queue and it can send
packets on the TX queue. These queues are registered and sized with
the setsockopts XDP_RX_QUEUE and XDP_TX_QUEUE, respectively. It is
mandatory to have at least one of these queues for each socket. In
contrast to AF_PACKET V2/V3 these descriptor queues are separated from
packet buffers. An RX or TX descriptor points to a data buffer in a
memory area called a UMEM. RX and TX can share the same UMEM so that a
packet does not have to be copied between RX and TX. Moreover, if a
packet needs to be kept for a while due to a possible retransmit, the
descriptor that points to that packet can be changed to point to
another and reused right away. This again avoids copying data.

This new dedicated packet buffer area is called a UMEM. It consists of
a number of equally sized frames and each frame has a unique frame
id. A descriptor in one of the queues references a frame by
referencing its frame id. The user space allocates memory for this
UMEM using whatever means it feels is most appropriate (malloc, mmap,
huge pages, etc).
This memory area is then registered with the kernel
using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
the FILL queue and the COMPLETION queue. The fill queue is used by the
application to send down frame ids for the kernel to fill in with RX
packet data. References to these frames will then appear in the RX
queue of the XSK once they have been received. The completion queue,
on the other hand, contains frame ids that the kernel has transmitted
completely and can now be used again by user space, for either TX or
RX. Thus, the frame ids appearing in the completion queue are ids that
were previously transmitted using the TX queue. In summary, the RX and
FILL queues are used for the RX path and the TX and COMPLETION queues
are used for the TX path.

The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow. Note that in this RFC, all
packet data is copied out to user-space.

A new feature in this RFC is that the UMEM can be shared between
processes, if desired. If a process wants to do this, it simply skips
the registration of the UMEM and its corresponding two queues, sets a
flag in the bind call and submits the XSK of the process it would like
to share UMEM with as well as its own newly created XSK socket. The
new process will then receive frame id references in its own RX queue
that point to this shared UMEM. Note that since the queue structures
are single-consumer / single-producer (for performance reasons), the
new process has to create its own socket with associated RX and TX
queues, since it cannot share this with the other process. This is
also the reason that there is only one set of FILL and COMPLETION
queues per UMEM. It is the responsibility of a single process to
handle the UMEM. If multiple-producer / multiple-consumer queues are
implemented in the future, this requirement could be relaxed.
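[Editor's sketch] The path of a frame id through the four queues can be modelled with a small single-producer / single-consumer ring. This is a single-threaded illustration under stated assumptions: struct model_ring and the model_ring_*() helpers are hypothetical names, the ring size is arbitrary, and the memory barriers that a real ring shared between kernel and user space would need are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_RING_SIZE 4u  /* power of two, so masking wraps the index */

/* Hypothetical SPSC ring of frame ids. One side only advances `producer`,
 * the other side only advances `consumer`. */
struct model_ring {
    uint32_t producer;
    uint32_t consumer;
    uint32_t ids[MODEL_RING_SIZE];
};

/* Producer side: enqueue a frame id, or fail if the ring is full. */
static bool model_ring_push(struct model_ring *r, uint32_t frame_id)
{
    if (r->producer - r->consumer == MODEL_RING_SIZE)
        return false;
    r->ids[r->producer & (MODEL_RING_SIZE - 1)] = frame_id;
    r->producer++;
    return true;
}

/* Consumer side: dequeue a frame id, or fail if the ring is empty. */
static bool model_ring_pop(struct model_ring *r, uint32_t *frame_id)
{
    if (r->producer == r->consumer)
        return false;
    *frame_id = r->ids[r->consumer & (MODEL_RING_SIZE - 1)];
    r->consumer++;
    return true;
}

/* Move one frame id from src to dst, e.g. FILL -> RX or TX -> COMPLETION.
 * The destination is checked first so that no frame id is ever lost. */
static bool model_ring_move(struct model_ring *src, struct model_ring *dst)
{
    uint32_t id;

    if (dst->producer - dst->consumer == MODEL_RING_SIZE)
        return false;
    if (!model_ring_pop(src, &id))
        return false;
    return model_ring_push(dst, id);
}
```

In this model the application pushes a free frame id onto the FILL ring, the "kernel" moves it to the RX ring once a packet has landed in that frame, the application moves it to the TX ring to send it, and the "kernel" moves it to the COMPLETION ring when transmission is done, at which point the application pops it and may reuse the frame for either RX or TX.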
How are packets then distributed between these two XSKs? We have
introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
full). The user-space application can place an XSK at an arbitrary
place in this map. The XDP program can then redirect a packet to a
specific index in this map and at this point XDP validates that the
XSK in that map was indeed bound to that device and queue number. If
not, the packet is dropped. If the map is empty at that index, the
packet is also dropped. This also means that it is currently mandatory
to have an XDP program loaded (and one XSK in the XSKMAP) to be able
to get any traffic to user space through the XSK.

AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or XDP_SKB is explicitly chosen
when loading the XDP program, XDP_SKB mode is employed. It uses SKBs
together with the generic XDP support and copies out the data to user
space; it is a fallback mode that works for any network device. On the
other hand, if the driver has support for XDP, it will be used by the
AF_XDP code to provide