On Thu, Feb 3, 2022 at 7:13 PM Dmitry Kozlyuk <dkozl...@nvidia.com> wrote:
>
> This patchset is a new design and implementation of [1].
>
> # Problem Statement
>
> Large allocations that involve mapping new hugepages are slow.
> This is problematic, for example, in the following use case.
> A single-process application allocates ~1TB of mempools at startup.
> Sometimes the app needs to restart as quickly as possible.
> Allocating the hugepages anew takes as long as 15 seconds,
> while the new process could just pick up all the memory
> left by the old one (reinitializing the contents as needed).
>
> Almost all of the time mmap(2) spends in the kernel
> goes to clearing the memory, i.e. filling it with zeros.
> This is done when a file in hugetlbfs is mapped
> for the first time system-wide, i.e. when a hugepage is committed,
> to prevent data leaks from the previous users of the same hugepage.
> For example, mapping 32 GB from a new file may take 2.16 seconds,
> while mapping the same pages again takes only 0.3 ms.
> Security put aside, e.g. when the environment is controlled,
> this effort is wasted for the memory intended for DMA,
> because its content will be overwritten anyway.
>
> Linux EAL explicitly removes hugetlbfs files at initialization
> and before mapping to force the kernel to clear the memory.
> This allows the memory allocator to clean memory only on freeing.
>
> # Solution
>
> Add a new mode allowing EAL to remap existing hugepage files.
> While it is intended to make restarts faster in the first place,
> it makes any startup faster except the cold one
> (with no existing files).
>
> It is the administrator who accepts security risks
> implied by reusing hugepages.
> The new mode is opt-in, and a warning is logged when it is enabled.
>
> The feature is Linux-only as it is related
> to mapping hugepages from files which only Linux does.
> It is inherently incompatible with --in-memory,
> for --huge-unlink see below.
>
> There is formally no breakage of API contract,
> but there is a behavior change in the new mode:
> rte_malloc*() and rte_memzone_reserve*() may return dirty memory
> (previously they were returning clean memory from free heap elements).
> Their contract has always explicitly allowed this,
> but still there may be users relying on the traditional behavior.
> Such users will need to fix their code to use the new mode.
>
> # Implementation
>
> ## User Interface
>
> There is already a --huge-unlink switch in the same area
> that removes hugepage files before mapping them.
> It cannot be combined with the new mode,
> because the point is to keep hugepage files for fast future restarts.
> Extend --huge-unlink option to represent only valid combinations:
>
> * --huge-unlink=existing OR no option (for compatibility):
>   unlink files at initialization
>   and before opening them as a precaution.
>
> * --huge-unlink=always OR just --huge-unlink (for compatibility):
>   same as above + unlink created files before mapping.
>
> * --huge-unlink=never:
>   the new mode: do not unlink hugepage files, reuse them.
>
> This option was always Linux-only, but it is kept as common
> in case there are users who expect it to be a no-op on other systems.
> (Adding a separate --huge-reuse option was also considered,
> but there is no obvious benefit and more combinations to test.)
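The combinations above map to EAL command lines like the following (the application name is a placeholder; the option values are as listed above):

```
# remove files at init and before opening them (also the default)
./app --huge-unlink=existing
# additionally unlink created files before mapping
./app --huge-unlink=always      # same as plain --huge-unlink
# new mode: keep existing hugepage files and reuse them
./app --huge-unlink=never
```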
>
> ## EAL
>
> If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
> so that the memory allocator may clear the memory if need be.
> See the description of patch 5/6 for details on how this is done
> in different memory mapping modes.
>
> The memory manager tracks whether an element is clean or dirty.
> If rte_zmalloc*() allocates from a dirty element,
> the memory is cleared before handing it to the user.
> On freeing, the allocator joins adjacent free elements,
> but in the new mode it may not be feasible to clear the free memory
> if the joint element is dirty (contains dirty parts).
> In any case, memory will be cleared only once,
> either on freeing or on allocation.
> See patch 3/6 for details.
> Patch 2/6 adds a benchmark to see how time is distributed
> between allocation and freeing in different modes.
>
> Besides clearing memory, each mmap() call takes some time.
> For example, 1024 calls for 1 TB may take ~300 ms.
> The time of one call mapping N hugepages is O(N),
> because inside the kernel hugepages are allocated one by one.
> Syscall overhead is negligible even for one page.
> Hence, it does not make sense to reduce the number of mmap() calls,
> which would essentially move the loop over pages into the kernel.
>
> [1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozl...@nvidia.com/
>

I fixed some checkpatch warnings, updated MAINTAINERS for the added
test and kept ERR level for a log message when creating files in
get_seg_fd().

Thanks again for enhancing the documentation, Dmitry.
And thanks to testers and reviewers.

Series applied.


-- 
David Marchand
