On 7/26/2022 3:59 PM, Dmitry Kozlyuk wrote:
Hi Don,

2022-07-26 14:33 (UTC-0400), Don Wallwork:
This proposal describes a method for translating any huge page
address from virtual to physical or vice versa using simple
addition or subtraction of a single fixed value. This allows
devices to efficiently access arbitrary huge page memory, even
stack data when worker stacks are in huge pages.
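To make the idea concrete, here is a minimal sketch of what such a
single-offset translation could look like; the helper names and the
offset variable are illustrative assumptions, not existing DPDK APIs:

#include <stdint.h>

/* Fixed VA-to-PA delta, computed once at init (assumption). */
static intptr_t hp_va_pa_offset;

/* PA = VA + offset */
static inline uintptr_t
hp_virt2phys(const void *va)
{
	return (uintptr_t)va + hp_va_pa_offset;
}

/* VA = PA - offset */
static inline void *
hp_phys2virt(uintptr_t pa)
{
	return (void *)(pa - hp_va_pa_offset);
}
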
What is the use case and how much is the benefit?

Several examples where this could help include:

1. A device could return flow lookup results containing the physical
address of a matching entry that needs to be translated to a virtual
address (see the sketch below).

2. Hardware can perform offloads on dynamically allocated heap
memory objects and would need the PA to avoid requiring an IOMMU.

3. It may be useful to prepare data such as descriptors in stack
variables, then pass the PA to hardware, which can DMA directly
from stack memory.

4. The CPU instruction set provides memory operations such as
prefetch, atomics and ALU operations that work on virtual
addresses, with no requirement for software to supply physical
addresses. A device may be able to provide a more optimized
implementation of such operations, but if it were given virtual
addresses it would incur the performance degradation associated
with using a hardware IOMMU. Having the ability to offload such
operations without modifying data structures to store an IOVA for
every virtual address is desirable.

All of these cases can run at packet rate and are not operating on
mbuf data. These would all benefit from efficient address translation
in the same way that mbufs already do. Unlike mbuf translation
that only covers VA to PA, this translation can perform both VA to PA
and PA to VA with equal efficiency.
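
As an illustration of case 1 above, a driver could turn the physical
address returned by the device back into a pointer using the
hp_phys2virt() helper from the earlier sketch; struct flow_entry and
the function name here are hypothetical:

struct flow_entry {
	uint32_t flow_id;
	uint64_t hit_count;
};

/* Device reports the PA of the matching entry; convert and use it. */
static void
handle_lookup_result(uintptr_t result_pa)
{
	struct flow_entry *e = hp_phys2virt(result_pa);

	e->hit_count++;
}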


When drivers need to process a large number of memory blocks,
these are typically packets in the form of mbufs,
which already have IOVA attached, so there is no translation.
Does translation of mbuf VA to PA with the proposed method
show significant improvement over reading mbuf->iova?

This proposal does not relate to mbufs.  As you say, there is
already an efficient VA to PA mechanism in place for those.


When drivers need to process a few IOVA-contiguous memory blocks,
they can calculate VA-to-PA offsets in advance,
amortizing translation cost.
Hugepage stack falls within this category.

As the examples above hopefully show, there are cases where
it is not practical or desirable to precalculate the offsets.


When legacy memory mode is used, it is possible to map a single
virtual memory region large enough to cover all huge pages. During
legacy hugepage init, each hugepage is mapped into that region.
Legacy mode is called "legacy" with an intent to be deprecated :)

Understood.  For our initial implementation, we were okay with
that limitation given that supporting it in legacy mode was simpler.

There is initial allocation (-m) and --socket-limit in dynamic mode.
When initial allocation is equal to the socket limit,
it should be the same behavior as in legacy mode:
the number of hugepages mapped is constant and cannot grow,
so the feature seems applicable as well.

It seems feasible to implement this feature in non-legacy mode as
well. The approach would be similar: reserve a region of virtual
address space large enough to cover all huge pages before they are
allocated.  As huge pages are allocated, they are mapped into the
appropriate location within that virtual address space.
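
A minimal sketch of that reserve-then-place approach, using plain
mmap(); hugetlbfs handling and error checking are omitted, and the
function names are only illustrative:

#include <stddef.h>
#include <sys/mman.h>

/* Reserve VA space only; no physical backing is committed yet. */
static void *
reserve_region(size_t total_sz)
{
	return mmap(NULL, total_sz, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* Map one hugepage file descriptor at its slot in the reservation. */
static void *
place_hugepage(void *region, size_t offset, int hp_fd, size_t hp_sz)
{
	return mmap((char *)region + offset, hp_sz,
		    PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_FIXED, hp_fd, 0);
}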


Once all pages have been mapped, any unused holes in that memory
region are unmapped.
Who tracks these holes and prevents translation from their VA?

Since the holes are unmapped, references to locations in unused
regions will result in seg faults.

Why the holes appear?

Memory layout for different NUMA nodes may cause holes.  Also,
there is no guarantee that all huge pages are physically contiguous.


This feature is applicable when rte_eal_iova_mode() == RTE_IOVA_PA
One can say it always works for RTE_IOVA_VA with VA-to-PA offset of 0.

This is true, but it requires the use of a hardware IOMMU, which
degrades performance.


and could be enabled either by default when the legacy memory EAL
option is given, or a new EAL option could be added to specifically
enable this feature.

It may be desirable to set a capability bit when this feature is
enabled to allow drivers to behave differently depending on the
state of that flag.
The feature requires, in IOVA-as-PA mode:
1) that the hugepage mapping is static (legacy mode or "-m" == "--socket-limit");
2) that EAL has succeeded in mapping all hugepages in one PA-contiguous block.

It does not require huge pages to be physically contiguous.
Theoretically, mapping a giant VA region could fail, but
we have not seen this in practice, even when running on x86_64
servers with multiple NUMA nodes, many cores and huge pages
that span TBs of physical address space.

As userspace code, DPDK cannot guarantee 2).
Because this mode breaks nothing and just makes translation more efficient,
DPDK can always try to implement it and then report whether it has succeeded.
Applications and drivers can decide what to do by querying this API.

Yes, providing an API to check this capability would
definitely work.
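
For illustration only, such a query might look roughly like the
following; neither the function name nor the return convention exists
in DPDK today:

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical API: returns 0 and fills *offset when the single-offset
 * translation is active, or a negative value otherwise.
 */
int rte_mem_single_offset_get(intptr_t *offset);

/* Example driver-side check (illustrative only). */
static bool
use_fast_translation(void)
{
	intptr_t off;

	return rte_mem_single_offset_get(&off) == 0;
}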

Thanks for all the good feedback.

-Don
