This is an attempt to use NVDIMM/PMEM as volatile NUMA memory that's transparent to normal applications and virtual machines.
The code is still in active development. It's provided for early design review. Key functionalities: 1) create and describe PMEM NUMA node for NVDIMM memory 2) dumb /proc/PID/idle_pages interface, for user space driven hot page accounting 3) passive kernel cold page migration in page reclaim path 4) improved move_pages() for active user space hot/cold page migration (1) is foundation for transparent usage of NVDIMM for normal apps and virtual machines. (2-4) enable auto placing hot pages in DRAM for better performance. A user space migration daemon is being built based on this kernel patchset to make the full vertical solution. Base kernel is v4.20 . The patches are not suitable for upstreaming in near future -- some are quick hacks, some others need more works. However they are complete enough to demo the necessary kernel changes for the proposed app&VM transparent NVDIMM volatile use model. The interfaces are far from finalized. They kind of illustrate what would be necessary for creating a user space driven solution. The exact forms will ask for more thoughts and inputs. We may adopt HMAT based solution for NUMA node related interface when they are ready. The /proc/PID/idle_pages interface is standalone but non-trivial. Before upstreaming some day, it's expected to take long time to collect various real use cases and feedbacks, so as to refine and stabilize the format. Create PMEM numa node [PATCH 01/21] e820: cheat PMEM as DRAM Mark numa node as DRAM/PMEM [PATCH 02/21] acpi/numa: memorize NUMA node type from SRAT table [PATCH 03/21] x86/numa_emulation: fix fake NUMA in uniform case [PATCH 04/21] x86/numa_emulation: pass numa node type to fake nodes [PATCH 05/21] mmzone: new pgdat flags for DRAM and PMEM [PATCH 06/21] x86,numa: update numa node type [PATCH 07/21] mm: export node type {pmem|dram} under /sys/bus/node Point neighbor DRAM/PMEM to each other [PATCH 08/21] mm: introduce and export pgdat peer_node [PATCH 09/21] mm: avoid duplicate peer target node Standalone zonelist for DRAM and PMEM nodes [PATCH 10/21] mm: build separate zonelist for PMEM and DRAM node Keep page table pages in DRAM [PATCH 11/21] kvm: allocate page table pages from DRAM [PATCH 12/21] x86/pgtable: allocate page table pages from DRAM /proc/PID/idle_pages interface for virtual machine and normal tasks [PATCH 13/21] x86/pgtable: dont check PMD accessed bit [PATCH 14/21] kvm: register in mm_struct [PATCH 15/21] ept-idle: EPT walk for virtual machine [PATCH 16/21] mm-idle: mm_walk for normal task [PATCH 17/21] proc: introduce /proc/PID/idle_pages [PATCH 18/21] kvm-ept-idle: enable module Mark hot pages [PATCH 19/21] mm/migrate.c: add move_pages(MPOL_MF_SW_YOUNG) flag Kernel DRAM=>PMEM migration [PATCH 20/21] mm/vmscan.c: migrate anon DRAM pages to PMEM node [PATCH 21/21] mm/vmscan.c: shrink anon list if can migrate to PMEM arch/x86/include/asm/numa.h | 2 arch/x86/include/asm/pgalloc.h | 10 arch/x86/include/asm/pgtable.h | 3 arch/x86/kernel/e820.c | 3 arch/x86/kvm/Kconfig | 11 arch/x86/kvm/Makefile | 4 arch/x86/kvm/ept_idle.c | 841 +++++++++++++++++++++++++++++++ arch/x86/kvm/ept_idle.h | 116 ++++ arch/x86/kvm/mmu.c | 12 arch/x86/mm/numa.c | 3 arch/x86/mm/numa_emulation.c | 30 + arch/x86/mm/pgtable.c | 22 drivers/acpi/numa.c | 5 drivers/base/node.c | 21 fs/proc/base.c | 2 fs/proc/internal.h | 1 fs/proc/task_mmu.c | 54 + include/linux/mm_types.h | 11 include/linux/mmzone.h | 38 + mm/mempolicy.c | 14 mm/migrate.c | 13 mm/page_alloc.c | 77 ++ mm/pagewalk.c | 1 mm/vmscan.c | 38 + virt/kvm/kvm_main.c | 3 25 files changed, 1306 insertions(+), 29 deletions(-) V1 patches: https://lkml.org/lkml/2018/9/2/13 Regards, Fengguang