Hey,
This series attempts to minimize 'struct page' overhead by
pursuing an approach similar to Muchun Song's series "Free some vmemmap
pages of hugetlb page"[0], but applied to devmap/ZONE_DEVICE.
[0] https://lore.kernel.org/linux-mm/20210308102807.59745-1-songmuc...@bytedance.com/
The link above describes it quite nicely, but the idea is to reuse tail
page vmemmap areas, particularly the area which only describes tail pages.
So a vmemmap page describes 64 struct pages, and the first page for a given
ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
vmemmap page would contain only tail pages, and that's what gets reused across
the rest of the subsection/section. The bigger the page size, the bigger the
savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
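The arithmetic behind those two numbers can be sketched as follows. This is only an illustrative calculation, not code from the series: the helper names are hypothetical, and it assumes 4K base pages and a 64-byte struct page (which is what makes a vmemmap page hold 64 struct pages).

```c
#include <assert.h>

/* Assumed for illustration: 4K base pages and a 64-byte struct page. */
#define BASE_PAGE_SIZE      4096UL
#define STRUCT_PAGE_SIZE    64UL
#define STRUCTS_PER_VMEMMAP (BASE_PAGE_SIZE / STRUCT_PAGE_SIZE) /* 64 */

/* vmemmap pages needed to describe one compound page when nothing is reused. */
static unsigned long vmemmap_pages_total(unsigned long compound_size)
{
	unsigned long nr_struct_pages;

	assert(compound_size % BASE_PAGE_SIZE == 0);
	nr_struct_pages = compound_size / BASE_PAGE_SIZE;
	return (nr_struct_pages + STRUCTS_PER_VMEMMAP - 1) / STRUCTS_PER_VMEMMAP;
}

/*
 * With tail reuse, two vmemmap pages keep their own backing memory: the
 * one holding the head page and the first tail-only page. Every further
 * vmemmap page reuses that tail-only page and is counted as saved.
 */
static unsigned long vmemmap_pages_saved(unsigned long compound_size)
{
	unsigned long total = vmemmap_pages_total(compound_size);

	return total > 2 ? total - 2 : 0;
}
```

Calling vmemmap_pages_saved() with 2M (2UL << 20) and 1G (1UL << 30) gives the 6 and 4094 pages quoted above.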
This series also takes one step further on 1GB pages and *also* reuses PMD pages
which only contain tail pages, which allows us to keep parity with the current
hugepage based memmap. This further lets us more than halve the overhead with
1GB pages (40M -> 16M per TB).
In terms of savings, per 1TB of memory, the struct page cost would go down
with compound pagemap:
* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total
memory)
* with 1G pages we lose 16MB instead of 16G (0.0015% instead of 1.5% of total
memory)
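As a rough cross-check of the 16G and 4G figures in the list above, here is a small sketch of the per-TB arithmetic. The helper names are hypothetical, and it assumes 4K base pages and a 64-byte struct page. It counts only the vmemmap data pages and does not attempt to model the PMD-level reuse applied to 1G pages, so it is only meant for the flat and 2M cases.

```c
#include <assert.h>

/* Assumed for illustration: 4K base pages and a 64-byte struct page. */
#define BASE_PAGE_SIZE   4096ULL
#define STRUCT_PAGE_SIZE 64ULL
#define ONE_TB           (1ULL << 40)

/* Flat memmap: one struct page for every base page of memory. */
static unsigned long long memmap_bytes_flat(unsigned long long mem_bytes)
{
	assert(mem_bytes % BASE_PAGE_SIZE == 0);
	return mem_bytes / BASE_PAGE_SIZE * STRUCT_PAGE_SIZE;
}

/*
 * Compound memmap with tail reuse: each compound page keeps two vmemmap
 * pages (the head page's one plus one shared tail-only page).
 * Page-table pages are deliberately not counted in this sketch.
 */
static unsigned long long memmap_bytes_compound(unsigned long long mem_bytes,
						unsigned long long compound_size)
{
	assert(compound_size != 0 && mem_bytes % compound_size == 0);
	return mem_bytes / compound_size * 2 * BASE_PAGE_SIZE;
}
```

For 1TB, memmap_bytes_flat() yields 16G and memmap_bytes_compound() with a 2M compound size yields 4G, i.e. roughly 1.5% and 0.39% of total memory.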
Along the way I've extended it past 'struct page' overhead, *trying* to address a
few performance issues we knew about for pmem, specifically on
{pin,get}_user_pages_fast with device-dax vmas, which are really
slow even for the fast variants. THP is great on -fast variants, but all except
hugetlbfs perform rather poorly on non-fast gup. I have, however, deferred the
__get_user_pages() improvements to a follow-up series I have stashed, as they are
orthogonal to device-dax (THP suffers from the same syndrome).
So to summarize what the series does:
Patch 1: Prepare hwpoisoning to work with dax compound pages.
Patches 2-4: Have memmap_init_zone_device() initialize its metadata as compound
pages. We split the current utility function prep_compound_page() into head
and tail helpers and use those two where appropriate to take advantage of caches
being warm after __init_single_page(). Since the RFC, this also lets us further
speed up init time from 190ms down to 80ms.
Patches 5-10: Much like Muchun's series, we reuse PTE (and PMD) tail page vmemmap
areas across a given page size (namely @align, as it is referred to by the
remaining memremap/dax code) and enable memremap to initialize the ZONE_DEVICE
pages as compound pages of a given @align order. The main difference, though, is
that contrary to the hugetlbfs series, there's no vmemmap for the area, because we
are populating it as opposed to remapping it. IOW there is no freeing of pages of
already initialized vmemmap as in the hugetlbfs case, which simplifies the
logic (besides not being arch-specific). After these patches, there's a quite
visible improvement in region bootstrap of pmem memmap, given that we initialize
fewer struct pages depending on the page size with DRAM backed struct pages.
altmap sees no difference in bootstrap.
NVDIMM namespace bootstrap improves from ~268-358 ms to ~78-100/<1ms on
128G NVDIMMs with 2M and 1G respectively.
Patch 11: Optimize page refcount grabbing given that we
are working with compound pages, i.e. we do 1 increment to the head
page for a given set of N subpages as opposed to N individual writes.
{get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
improves considerably with DRAM stored struct pages. It also *greatly*
improves pinning with altmap. Results with gup_test:
                                               before      after
(16G get_user_pages_fast 2M page size)         ~59 ms  ->  ~6.1 ms
(16G pin_user_pages_fast 2M page size)         ~87 ms  ->  ~6.2 ms
(16G get_user_pages_fast altmap 2M page size)  ~494 ms ->  ~9 ms
(16G pin_user_pages_fast altmap 2M page size)  ~494 ms ->  ~10 ms
altmap performance gets especially interesting when pinning a pmem dimm:
                                                before      after
(128G get_user_pages_fast 2M page size)         ~492 ms ->  ~49 ms
(128G pin_user_pages_fast 2M page size)         ~493 ms ->  ~50 ms
(128G get_user_pages_fast altmap 2M page size)  ~3.91 s ->  ~70 ms
(128G pin_user_pages_fast altmap 2M page size)  ~3.97 s ->  ~74 ms
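The refcount batching described for Patch 11 can be illustrated with a minimal userspace sketch. This is not the kernel implementation: struct demo_page and both helpers are hypothetical stand-ins, and C11 atomics are used in place of the kernel's page refcount primitives. It only shows why one add of N on the head page replaces N separate atomic writes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the refcount is modelled. */
struct demo_page {
	atomic_int refcount;
};

/* Per-subpage grab: one atomic write for each of the n pages. */
static size_t grab_each(struct demo_page *pages, size_t n)
{
	size_t writes = 0;

	for (size_t i = 0; i < n; i++) {
		atomic_fetch_add(&pages[i].refcount, 1);
		writes++;
	}
	return writes;
}

/* Batched grab: a single atomic add of n references on the head page. */
static size_t grab_compound(struct demo_page *head, int n)
{
	assert(n > 0);
	atomic_fetch_add(&head->refcount, n);
	return 1;
}
```

For a 2M compound page made of 512 base pages, grab_each() performs 512 atomic writes spread over 512 struct pages, while grab_compound() performs one write to a single cache line.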
The unpinning improvement patches are in mmotm/linux-next, so they have been
removed from this series.
I have deferred the __get_user_pages() patch to outside this series
(https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.mart...@oracle.com/),
as I found a simpler way to address it that is also applicable to
THP. I will submit that as a follow-up to this series.
Patches apply on top of linux-next tag next-20210325 (commit b4f20b70784a).
Comments and suggestions very much appreciated!
Changelog,
RFC -> v1:
(New patches 1-3, 5-8 but the diffstat is that