Re: [lng-odp] continuous memory allocation for drivers
On Fri, 11 Nov 2016 11:13:27 +0100 Francois Ozog wrote: On 11 November 2016 at 10:10, Brian Brooks wrote: On 11/10 18:52:49, Christophe Milard wrote: Hi, My hope was that packet segments would all be smaller than one page (either normal pages or huge pages).

When is this the case? With a 4096 byte page, a couple of 1518 byte Ethernet packets can fit. A 9038 byte Jumbo won't fit.

[FF] When you allocate a queue with 256 packets for Intel, Virtio, or Mellanox cards, you need a small area of 256 descriptors that fits in a page. Then drivers allocate 256 buffers in contiguous memory. This leads to 512K of buffers. They may allocate this zone per packet, though. But for high-performance cards such as Chelsio and Netcope this is a strict requirement, because the obtained memory zone is managed by hardware: packet allocation is not controlled by software. You give the zone to hardware, which places packets the way it wants in the zone. The hardware informs the software where the packets are by updating the ring. VFIO does not change the requirement for a large contiguous area.

PCI Express has a limitation of 36M DMA transactions per second, which is lower than the 60 Mpps required for 40 Gbps and much lower than the 148 Mpps required for 100 Gbps. The only way to achieve line rate is to fit more than one packet in a DMA transaction. That's what Chelsio, Netcope and others are doing. This requires hardware-controlled memory allocation, and large memory blocks supplied to the hardware. As we move forward, I expect all cards to adopt a similar scheme and escape the "Intel" model of I/O.

Now if we look at performance, the cost of managing virt_to_phys() for each packet, even in the kernel, rules out scattered allocations. You amortize the cost by getting the physical address of the 256-buffer zone once, and using offsets from it to get the physical address of an individual packet. If you try to do that per packet in userland, then just use the Linux networking stack, it will be faster ;-)

I second Francois here.
H/W requires physically contiguous memory at least for DMA queues. A queue can well exceed the size of a page or even a huge page in some cases. E.g. ThunderX has a single DMA queue of size 512K, and it may even have a 4M queue, which is beyond a 2M huge page on ARM when a 4K page is the default. 64K pages are preferable and safer, as the huge page is then 512M, but that may be too much for some clients. vfio-pci only partially solves that problem, for 2 reasons:

1. In the guest environment there is no vfio-pci, or at least there is no extra mapping that could be done. From the VM perspective, all DMA memory should be contiguous with respect to the Intermediate Physical Address (under ARM nomenclature).

2. IOMMU mappings are not free. In synthetic benchmarks there is no performance difference due to the IOMMU, but if the system uses the IOMMU extensively, e.g. due to many VMs, it may well prove otherwise: an IOTLB miss has a cost similar to a TLB miss. Ideally, a system integrator should have the choice of whether to use it.

Or is it to ease the memory manager by having a logical array of objects laid out in virtual memory space, where, depending on the number of objects and the size of each object, a few are bound to span 2 pages which might not be adjacent in physical memory? Or is it when the view of a packet is a bunch of packet segments which may be of varying sizes and possibly scattered across memory, and the packet needs to go out on the wire? Are 2M, 16M, 1G page sizes used?

To guarantee physical memory contiguity, which is needed by some drivers (read: non-vfio drivers for PCI). [FF] The Linux kernel uses a special allocator for that; huge pages are not the unit. As said above, some hardware requires large contiguous blocks, and vfio or the IOMMU does not remove the requirement.
If the IOMMU gives an I/O device the same virtual addressing as the CPU by sharing page tables, would an I/O device or IOMMU ever have limitations on the number of pages supported, or other performance limitations during the VA->PA translation? [FF] no information on that

Does the IOMMU remap interrupts from the I/O device when the VM migrates cores? What happens with no IRQ remapping: does the core get the IRQ and then have to inter-processor-interrupt the core where the VM is now running? [FF] I hope not.

Are non-vfio drivers for PCI needing contiguous physical memory the design target? [FF] not related to VFIO, but related to HW requirements.

Francois Ozog's experience (with DPDK) shows that this hope will fail in some cases: not all platforms support the required huge page size. And it would be nice to be able to run even in the absence of huge pages. I am therefore planning to expand drvshm to include a flag requesting contiguous physical memory. But sadly, from user space, this is nothing we can guarantee... So when this flag is set, the allocator will allocate until physical memory "happens to be contiguous". This is a bit like the DPDK approach (trial & error), which I dislike,
[lng-odp] clarification of pktout checksum offload feature
Guys, I was going to implement checksum offload for the OFP project based on the Monarch checksum offload capability, and I found out that there is no example of using that API. Also, the documentation seems to leave some room for various interpretations, so I would like to clarify that and post a patch to the documentation, too. This is an excerpt from pktio.h from Monarch LTS:

/**
 * Packet output configuration options bit field
 *
 * Packet output configuration options listed in a bit field structure. Packet
 * output checksum insertion may be enabled or disabled. When it is enabled,
 * implementation will calculate and insert checksum into every outgoing packet
 * by default. Application may use a packet metadata flag to disable checksum
 * insertion per packet bases. For correct operation, packet metadata must
 * provide valid offsets for the appropriate protocols. For example, UDP
 * checksum calculation needs both L3 and L4 offsets (to access IP and UDP
 * headers). When application (e.g. a switch) does not modify L3/L4 data and
 * thus checksum does not need to be updated, output checksum insertion should
 * be disabled for optimal performance.
 */

From my contact with various NICs, including the Octeon PKO and the VNIC from ThunderX, the offloading H/W needs at least:

For L4 offload:
- L4 packet type: TCP/UDP/SCTP
- L4 header offset
- L3 header offset
- L3 type may or may not be required, but it is good to define it for consistency

For L3 checksum:
- L3 packet type: IPv4
- L3 header offset

There is also a second thing: how to disable checksum calculation per packet? If a packet has no type in its metadata, then obviously the checksum will not be computed. I think that would be the recommended method for now, even if the ODP community plans to extend the odp_packet API in the future to cover that case. Maybe it is implicit that packet types should be set along with header offsets, but it is good to state that clearly and provide some usage example, e.g. in examples/generator.
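To make the intended flow concrete, here is an untested sketch of what such a generator/doc example might show. The packet-metadata calls (odp_packet_l3_offset_set(), odp_packet_has_udp_set(), etc.) are in the public ODP API; the exact names of the pktout configuration bits are my assumption and should be verified against the Monarch pktio.h, so treat this as a fragment, not a patch:

```c
/* At pktio configuration time: enable checksum insertion globally.
 * Bit names below are assumed -- check against Monarch pktio.h. */
odp_pktio_config_t config;

odp_pktio_config_init(&config);
config.pktout.bit.ipv4_chksum = 1; /* insert IPv4 header checksum */
config.pktout.bit.udp_chksum  = 1; /* insert UDP checksum */
odp_pktio_config(pktio, &config);

/* Per packet, before sending: give the implementation the types and
 * offsets the H/W needs, per the list above. */
odp_packet_l3_offset_set(pkt, l3_offset); /* start of IPv4 header */
odp_packet_l4_offset_set(pkt, l4_offset); /* start of UDP header  */
odp_packet_has_ipv4_set(pkt, 1);          /* L3 type: IPv4 */
odp_packet_has_udp_set(pkt, 1);           /* L4 type: UDP  */

/* To skip checksum insertion for one packet, leave the type flags
 * unset -- the "no type in metadata" method discussed above. */
```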
I can send a patch for both doc and generator but I would like to make sure we are on the same page. Regards Maciej
Re: [lng-odp] [API-NEXT PATCH] api-next: pktio: add odp_pktio_send_complete() definition
Zoltan, I am currently working on the ThunderX port, so I can offer an insight into one of the implementations. ThunderX has a more server-like network adapter, as opposed to Octeon or QorIQ, so buffer management is done in software. I think the problem with pool starvation mostly affects those kinds of platforms, and any mismanagement here may have dire costs. On Thunder the buffers are handled the same way as in DPDK, so transmitted buffers have to be reclaimed after the hardware has finished processing them. The best and most efficient way to free the buffers is to do it while transmitting others on the same queue, i.e. in odp_pktio_send or in the enqueue operation. There are several reasons behind this:

1. The TX ring is accessed anyway, so this minimizes cache misses.

2. TX ring H/W registers are accessed while transmitting packets, so information about ring occupancy is already extracted by software. This minimizes the overhead of H/W register access, which may be quite significant even on an internal PCI-E bus.

3. Any other scheme, e.g. doing it in the mempool or in RX as suggested previously, incurs the overheads from points 1 and 2 plus another overhead caused by synchronizing access to the ring:
- accessing the TX ring from the mempool must be thread-safe; the mempool may be invoked from another context than ring transmission
- accessing the transmission ring from the receive operation leads to a similar thread-safety issue, where RX and TX, being independent operations from the H/W perspective, must be additionally synchronized with respect to each other

Summarizing, any high-performance implementation must live with the fact that some buffers will be kept in the TX ring for a while, and choose the mempool size accordingly. This is true at least for Thunder and any other similar server adapters. On the other hand, the issue may be non-existent in specialized network processors, but then there is no need for extra API or extra software tricks anyway.
The memory pressure may come not only from the TX ring, but from the RX ring as well, when flooded with packets. That leads to the same challenge, but reversed, i.e. the receive function greedily allocates packets to feed the H/W with as many free buffers as possible, and there is currently no way to limit that. That is why, from the Thunder perspective, a practical solution is:
- explicitly stating the depth of the engine (both RX and TX), by either API or some parameter, and letting the implementer choose how to deal with the problem
- adding a note that transmission functions are responsible for buffer cleanup, to let the application choose the best strategy

This is by all means not a silver bullet, but it gives the user the tools to deal with the problem and at the same time does not impose unnecessary overhead on certain implementations. Cheers Maciej

2015-05-29 18:03 GMT+02:00 Zoltan Kiss zoltan.k...@linaro.org: Hi, On 29/05/15 16:58, Jerin Jacob wrote: I agree. Is it possible to dedicate core 0/any core in the ODP-DPDK implementation to do the housekeeping job? If we are planning the ODP-DPDK implementation as just a wrapper over the DPDK API, then there will not be any value added over using the ODP API. At least from my experience, we have changed our SDK a lot to fit into the ODP model. IMO that kind of effort will be required for a useful ODP-DPDK port. It would be good to have some input from other implementations as well: when do you release the sent packets in the Cavium implementation? ___ lng-odp mailing list lng-odp@lists.linaro.org https://lists.linaro.org/mailman/listinfo/lng-odp