Re: [lng-odp] continuous memory allocation for drivers

2016-11-17 Thread Maciej Czekaj

On Fri, 11 Nov 2016 11:13:27 +0100 Francois Ozog wrote:

On 11 November 2016 at 10:10, Brian Brooks wrote:


On 11/10 18:52:49, Christophe Milard wrote:

Hi,

My hope was that packet segments would all be smaller than one page
(either normal pages or huge pages)

When is this the case? With a 4096-byte page, a couple of 1518-byte Ethernet
packets can fit, but a 9038-byte jumbo frame won't.


[FF] When you allocate a queue with 256 packets for Intel, Virtio or
Mellanox cards, you need a small area of 256 descriptors that fits in a
page. Then the driver allocates 256 buffers in contiguous memory, which
amounts to 512K of buffers. They may allocate this zone per packet, though.
But for high performance cards such as Chelsio and Netcope this is a strict
requirement, because the obtained memory zone is managed by HW: packet
allocation is not controlled by software. You give the zone to hardware,
which places packets the way it wants in the zone. HW informs the SW where
the packets are by updating the ring. VFIO does not change the requirement
of a large contiguous area.
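
To make the sizes concrete, here is a minimal sketch of that layout; the
16-byte descriptor and 2 KB buffer sizes are my assumptions, not taken from
any particular card:

    #include <stdint.h>

    #define RING_SIZE 256
    #define BUF_SIZE  2048

    struct rx_desc {
        uint64_t buf_phys;   /* physical address handed to the NIC */
        uint64_t flags;
    };

    /* 256 * 16 B = 4 KB of descriptors: fits in one page. */
    static struct rx_desc ring[RING_SIZE];

    /* 256 * 2 KB = 512 KB of packet buffers, which the driver wants
     * carved out of one physically contiguous zone. */
    static uint8_t buffers[RING_SIZE * BUF_SIZE];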
PCI Express has a limit of 36M DMA transactions per second, which is lower
than the 60 Mpps required for 40Gbps and much lower than the 148 Mpps
required for 100Gbps. The only way to achieve line rate is to fit more than
one packet in a DMA transaction. That's what Chelsio, Netcope and others
are doing. This requires HW-controlled memory allocation, which in turn
requires large memory blocks to be supplied to the HW.
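(For reference, those rates are the usual minimum-frame arithmetic: a
64-byte frame plus 20 bytes of preamble and inter-frame gap occupies
84 * 8 = 672 bits on the wire, so 40 Gbit/s / 672 ≈ 59.5 Mpps and
100 Gbit/s / 672 ≈ 148.8 Mpps.)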
As we move forward, I expect all cards to adopt a similar scheme and escape
the "Intel" model of IO.
Now if we look at performance, the cost of calling virt_to_phys() for each
packet, even in the kernel, rules out scattered allocations. You amortize
the cost by getting the physical address of the 256-buffer zone once, and
using offsets from that to get the physical address of an individual
packet. If you try to do per-packet translation in userland, then just use
the Linux networking stack, it will be faster ;-)
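
A minimal sketch of that amortization, assuming the zone's physical base
was resolved once at init time (the names are mine):

    #include <stdint.h>

    struct buf_zone {
        void    *virt_base;  /* zone start, mapped into the process  */
        uint64_t phys_base;  /* resolved once, e.g. at pool creation */
    };

    /* Per-packet virt-to-phys becomes a subtraction and an addition
     * instead of a page-table lookup per packet. Valid only while the
     * whole zone is physically contiguous. */
    static inline uint64_t pkt_phys(const struct buf_zone *z, const void *pkt)
    {
        return z->phys_base + ((uintptr_t)pkt - (uintptr_t)z->virt_base);
    }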




I second Francois here. H/W requires physically contiguous memory at
least for DMA queues.
A queue well exceeds the size of a page, or even of a huge page in some
cases. E.g. ThunderX has a single DMA queue of 512K, and it may even have
a 4M queue, which is beyond a 2M huge page on ARM when the 4K page is the
default.
64K pages are preferable and safer, as the huge page is then 512M, but
that may be too much for some clients.


vfio-pci only partially solves that problem, for two reasons:

1. In the guest environment there is no vfio-pci, or at least there is
no extra mapping that could be done.
   From the VM perspective, all DMA memory should be contiguous with
respect to the Intermediate Physical Address (in ARM nomenclature).


2. IOMMU mappings are not free. In synthetic benchmarks there is no
performance difference due to the IOMMU,
   but if the system is using the IOMMU extensively, e.g. due to many VMs,
it may well prove otherwise.
   An IOTLB miss has a similar cost to a TLB miss. Ideally, a system
integrator should have the choice of whether to use it or not.





Or is it to ease the memory manager by having a logical array of objects
laid out in virtual memory space, where, depending on the number of objects
and the size of each object, a few are bound to span 2 pages which might
not be adjacent in physical memory?

Or is it when the view of a packet is a bunch of packet segments which
may be of varying sizes, possibly scattered across memory, and the packet
needs to go out on the wire?

Are 2M, 16M, or 1G page sizes used?


to guarantee physical memory
contiguity, which is needed by some drivers (read: non-vfio drivers for
PCI).

[FF] The Linux kernel uses a special allocator for that; huge pages are not

the unit. As said above, some HW requires large contiguous blocks, and vfio
or the IOMMU does not remove that requirement.
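
Presumably that refers to the kernel DMA API; a driver asking for one
contiguous zone does roughly the following (kernel code, not ODP; the
512 KB size is just an example):

    #include <linux/dma-mapping.h>

    /* Returns a block that is both virtually and physically contiguous
     * and reports its device address in *phys, suitable for a HW-owned
     * ring or buffer zone. */
    static void *alloc_hw_zone(struct device *dev, dma_addr_t *phys)
    {
            return dma_alloc_coherent(dev, 512 * 1024, phys, GFP_KERNEL);
    }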



If the IOMMU enables an IO device to use the same virtual addressing as the
CPU by sharing page tables, would an IO device or IOMMU ever have
limitations on the number of pages supported, or other performance
limitations during the VA->PA translation?

[FF] no information on that



Does the IOMMU remap interrupts from the IO device when the VM migrates
cores? What happens when there is no irq remapping? Does a core get the
irq and have to send an inter-processor interrupt to the core where the VM
is now running?

[FF] I hope not.



Are non-vfio PCI drivers that need contiguous physical memory the
design target?


[FF] not related to VFIO but related to HW requirements.



Francois Ozog's experience (with DPDK) shows that this hope will fail
in some cases: not all platforms support the required huge page size.
And it would be nice to be able to run even in the absence of huge
pages.

I am therefore planning to expand drvshm to include a flag requesting
contiguous physical memory. But sadly, from user space, this is
nothing we can guarantee... So when this flag is set, the allocator
will allocate until physical memory "happens to be contiguous".
This is a bit like the DPDK approach (trial & error), which I dislike,
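
For what it's worth, the "happens to be contiguous" check itself can be
done from user space via /proc/self/pagemap. A sketch (Linux-specific; note
that since kernel 4.0 reading the PFN field requires CAP_SYS_ADMIN, and the
pages must be touched first so they are present):

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Returns 1 if [virt, virt + len) is physically contiguous,
     * 0 if a gap is found, -1 on error. */
    static int phys_contig(void *virt, size_t len, size_t page_size)
    {
        int fd = open("/proc/self/pagemap", O_RDONLY);
        uint64_t entry, pfn, prev_pfn = 0;
        size_t off;

        if (fd < 0)
            return -1;

        for (off = 0; off < len; off += page_size) {
            uint64_t vpage = ((uintptr_t)virt + off) / page_size;

            if (pread(fd, &entry, 8, vpage * 8) != 8 ||
                !(entry & (1ULL << 63))) {    /* bit 63: page present */
                close(fd);
                return -1;
            }
            pfn = entry & ((1ULL << 55) - 1); /* bits 0-54: PFN */
            if (off != 0 && pfn != prev_pfn + 1) {
                close(fd);
                return 0;                     /* physical gap found */
            }
            prev_pfn = pfn;
        }
        close(fd);
        return 1;
    }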

[lng-odp] clarification of pktout checksum offload feature

2016-10-13 Thread Maciej Czekaj


Guys,

I was going to implement checksum offload for the OFP project based on
the Monarch checksum offload capability, and I found out that there is no
example of using that API. Also, the documentation seems to leave some room
for various interpretations, so I would like to clarify that and post a
patch to the documentation, too.



This is an excerpt from pktio.h from Monarch LTS:


/**
 * Packet output configuration options bit field
 *
 * Packet output configuration options listed in a bit field structure. Packet
 * output checksum insertion may be enabled or disabled. When it is enabled,
 * implementation will calculate and insert checksum into every outgoing packet
 * by default. Application may use a packet metadata flag to disable checksum
 * insertion per packet bases. For correct operation, packet metadata must
 * provide valid offsets for the appropriate protocols. For example, UDP
 * checksum calculation needs both L3 and L4 offsets (to access IP and UDP
 * headers). When application (e.g. a switch) does not modify L3/L4 data and
 * thus checksum does not need to be updated, output checksum insertion should
 * be disabled for optimal performance.
 */



From my contact with various NICs, including the Octeon PKO and the VNIC
from ThunderX, the offloading H/W needs at least:

For L4 offload:
 - L4 packet type: TCP/UDP/SCTP
 - L4 header offset
 - L3 header offset
 - L3 type may or may not be required, but it is good to define it for
consistency

For L3 checksum:
 - L3 packet type: IPv4
 - L3 header offset

There is also a second thing: how to disable checksum calculation
per packet?
If a packet has no type in its metadata, then obviously the checksum will
not be computed. I think that would be the recommended method for now,
even if the ODP community plans to extend the odp_packet API in the future
to cover that case.


Maybe it is implicit that packet types should be set along with header
offsets, but it is good to state that clearly and provide some usage
example, e.g. in examples/generator. I can send a patch for both the doc
and the generator, but I would like to make sure we are on the same page;
the sketch below shows what I have in mind.
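
A minimal sketch, assuming Monarch's odp_pktio_config() and the packet
metadata calls; the offsets shown are for an untagged IPv4/UDP frame and
are illustration only:

    #include <odp_api.h>

    /* Enable checksum insertion on the pktio (once, before start). */
    static int enable_csum_offload(odp_pktio_t pktio)
    {
        odp_pktio_config_t config;

        odp_pktio_config_init(&config);
        config.pktout.bit.ipv4_chksum = 1;
        config.pktout.bit.udp_chksum  = 1;
        return odp_pktio_config(pktio, &config);
    }

    /* Per packet: the metadata the H/W needs to locate the headers. */
    static void set_csum_meta(odp_packet_t pkt)
    {
        odp_packet_l3_offset_set(pkt, 14);      /* after Ethernet header */
        odp_packet_l4_offset_set(pkt, 14 + 20); /* after 20 B IPv4 header */
        odp_packet_has_ipv4_set(pkt, 1);
        odp_packet_has_udp_set(pkt, 1);
        /* Leaving the type flags unset would skip checksum insertion
         * for this packet, per the interpretation above. */
    }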



Regards
Maciej





Re: [lng-odp] [API-NEXT PATCH] api-next: pktio: add odp_pktio_send_complete() definition

2015-06-02 Thread Maciej Czekaj
Zoltan,

I am currently working on the ThunderX port, so I can offer an insight into
one of the implementations.

ThunderX has a more server-like network adapter, as opposed to Octeon or
QorIQ, so buffer management is done in software.
I think the problem with pool starvation mostly affects those kinds of
platforms, and any mismanagement here may have dire costs.

On Thunder the buffers are handled the same way as in DPDK, so transmitted
buffers have to be reclaimed after the hardware has finished processing
them. The best and most efficient way to free the buffers is to do it while
transmitting others on the same queue, i.e. in odp_pktio_send or in the
enqueue operation. There are several reasons behind this (see the sketch
after the list):

 1. The TX ring is accessed anyway, so this minimizes cache misses.

 2. TX ring H/W registers are accessed while transmitting packets, so
information about ring occupancy is already extracted by the software.
 This minimizes the overhead of H/W register access, which may be quite
significant even on an internal PCI-E bus.

 3. Any other scheme, e.g. doing it in the mempool or in RX as suggested
previously, adds the overhead from points 1 and 2 plus the overhead of
synchronizing access to the ring:

 - accessing the TX ring from the mempool must be thread-safe, as the
mempool may be invoked from another context than ring transmission
 - accessing the transmission ring from the receive operation leads to a
similar thread-safety issue, where RX and TX, being independent operations
from the H/W perspective, must be additionally synchronized with respect to
each other
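
A sketch of such reclaim-on-transmit; the ring layout and the register
helper are hypothetical, not ThunderX specifics:

    #include <odp_api.h>

    struct tx_ring {
        odp_packet_t *pkt;   /* packet stored per descriptor slot    */
        uint32_t sw_tail;    /* oldest slot not yet reclaimed        */
        uint32_t mask;       /* ring size - 1 (size is a power of 2) */
    };

    uint32_t read_hw_head(struct tx_ring *ring); /* one register read */

    /* Called from the send path before posting new packets: free every
     * buffer the H/W has already finished with, in one ring pass. */
    static void tx_reclaim(struct tx_ring *ring)
    {
        uint32_t hw_head = read_hw_head(ring);

        while (ring->sw_tail != hw_head) {
            odp_packet_free(ring->pkt[ring->sw_tail]);
            ring->sw_tail = (ring->sw_tail + 1) & ring->mask;
        }
    }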

Summarizing, any high-performance implementation must live with the fact
that some buffers will be kept in the TX ring for a while, and must choose
the mempool size accordingly. This is true at least for Thunder and any
other similar server adapter. On the other hand, the issue may be
non-existent in specialized network processors, but then there is no need
for extra API or extra software tricks anyway.

The memory pressure may come not only from the TX ring but from the RX ring
as well, when it is flooded with packets. That leads to the same challenge,
but reversed: the receive function greedily allocates packets to feed the
H/W with as many free buffers as possible, and there is currently no way to
limit that (see the refill sketch below).
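
Roughly like this; again the ring bookkeeping is hypothetical, and only
odp_packet_alloc/ODP_PACKET_INVALID are real ODP names:

    #include <odp_api.h>

    struct rx_ring {
        uint32_t free_slots; /* descriptors with no buffer posted */
        uint32_t buf_len;    /* buffer length handed to the H/W   */
    };

    void post_buffer_to_hw(struct rx_ring *ring, odp_packet_t pkt);

    /* Greedy refill: allocates until the ring is full or the pool runs
     * dry; nothing bounds how much of the pool RX may consume. */
    static void rx_refill(struct rx_ring *ring, odp_pool_t pool)
    {
        while (ring->free_slots > 0) {
            odp_packet_t pkt = odp_packet_alloc(pool, ring->buf_len);

            if (pkt == ODP_PACKET_INVALID)
                break;  /* pool exhausted: RX now starves other users */
            post_buffer_to_hw(ring, pkt);
            ring->free_slots--;
        }
    }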

That is why, from the Thunder perspective, a practical solution is:
 - explicitly stating the depth of the engine (both RX and TX), by either
API or some parameter, and letting the implementer choose how to deal with
the problem
 - adding a note that transmission functions are responsible for buffer
cleanup, to let the application choose the best strategy

This is by no means a silver bullet, but it gives the user the tools to
deal with the problem and at the same time does not impose unnecessary
overhead on certain implementations.


Cheers
Maciej

2015-05-29 18:03 GMT+02:00 Zoltan Kiss zoltan.k...@linaro.org:

 Hi,

 On 29/05/15 16:58, Jerin Jacob wrote:

 I agree. Is it possible to dedicate core 0/any core in the ODP-DPDK
 implementation to do the housekeeping job? If we are planning the
 ODP-DPDK implementation as just a wrapper over the DPDK API, then there
 will not be any value added by using the ODP API. At least from my
 experience, we have changed our SDK a lot to fit into the ODP model.
 IMO that kind of effort will be required for a useful ODP-DPDK port.


 It would be good to have some input from other implementations as well:
 when do you release the sent packets in the Cavium implementation?
