Re: [PATCH v19 5/8] mm: introduce memfd_secret system call to create "secret" memory areas
On 18.05.21 12:31, Michal Hocko wrote: On Tue 18-05-21 12:06:42, David Hildenbrand wrote: On 18.05.21 11:59, Michal Hocko wrote: On Sun 16-05-21 10:29:24, Mike Rapoport wrote: On Fri, May 14, 2021 at 11:25:43AM +0200, David Hildenbrand wrote: [...] + if (!page) + return VM_FAULT_OOM; + + err = set_direct_map_invalid_noflush(page, 1); + if (err) { + put_page(page); + return vmf_error(err); Would we want to translate that to a proper VM_FAULT_..., which would most probably be VM_FAULT_OOM when we fail to allocate a pagetable? That's what vmf_error does, it translates -ESOMETHING to VM_FAULT_XYZ. I haven't read through the rest but this has just caught my attention. Is it really reasonable to trigger the oom killer when you cannot invalidate the direct mapping? From a quick look at the code it is quite unlikely to see ENOMEM from that path (it allocates small pages) but this can become quite subtle over time. Shouldn't this simply SIGBUS if it cannot manipulate the direct mapping regardless of the underlying reason for that? OTOH, it means our kernel zones are depleted, so we'd better reclaim somehow ... Killing a userspace seems to be just a bad way around that. Although I have to say openly that I am not a great fan of VM_FAULT_OOM in general. It is usually the wrong way to handle the failure because it happens outside of the allocation context so you lose all the details (e.g. allocation constraints, numa policy etc.). Also whenever there is ENOMEM then the allocation itself has already made sure that all the reclaim attempts have already been depleted. Just consider an allocation with GFP_NOWAIT/NO_RETRY or similar failing and propagating ENOMEM up the call stack. Turning that into the OOM killer sounds like a bad idea to me. But that is a more general topic. I have tried to bring this up in the past but there was not much interest in fixing it as it was not a pressing problem...
I'm certainly interested; it would mean that we actually want to try recovering from VM_FAULT_OOM in various cases, and as you state, we might have to supply more information to make that work reliably. Having that said, I guess what we have here is just the same as when our process fails to allocate a generic page table in __handle_mm_fault(), when we fail p4d_alloc() and friends ... -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v19 6/8] PM: hibernate: disable when there are active secretmem users
On 18.05.21 12:24, Mark Rutland wrote: On Thu, May 13, 2021 at 09:47:32PM +0300, Mike Rapoport wrote: From: Mike Rapoport It is unsafe to allow saving of secretmem areas to the hibernation snapshot as they would be visible after the resume and this essentially will defeat the purpose of secret memory mappings. Prevent hibernation whenever there are active secret memory users. Have we thought about how this is going to work in practice, e.g. on mobile systems? It seems to me that there are a variety of common applications which might want to use this which people don't expect to inhibit hibernate (e.g. authentication agents, web browsers). Are we happy to say that any userspace application can incidentally inhibit hibernate? It's worth noting that secretmem has to be explicitly enabled by the admin to even work. -- Thanks, David / dhildenb
Re: [PATCH v19 5/8] mm: introduce memfd_secret system call to create "secret" memory areas
On 18.05.21 11:59, Michal Hocko wrote: On Sun 16-05-21 10:29:24, Mike Rapoport wrote: On Fri, May 14, 2021 at 11:25:43AM +0200, David Hildenbrand wrote: [...] + if (!page) + return VM_FAULT_OOM; + + err = set_direct_map_invalid_noflush(page, 1); + if (err) { + put_page(page); + return vmf_error(err); Would we want to translate that to a proper VM_FAULT_..., which would most probably be VM_FAULT_OOM when we fail to allocate a pagetable? That's what vmf_error does, it translates -ESOMETHING to VM_FAULT_XYZ. I haven't read through the rest but this has just caught my attention. Is it really reasonable to trigger the oom killer when you cannot invalidate the direct mapping? From a quick look at the code it is quite unlikely to see ENOMEM from that path (it allocates small pages) but this can become quite subtle over time. Shouldn't this simply SIGBUS if it cannot manipulate the direct mapping regardless of the underlying reason for that? OTOH, it means our kernel zones are depleted, so we'd better reclaim somehow ... -- Thanks, David / dhildenb
Re: [PATCH v19 8/8] secretmem: test: add basic selftest for memfd_secret(2)
On 13.05.21 20:47, Mike Rapoport wrote: From: Mike Rapoport The test verifies that a file descriptor created with memfd_secret does not allow read/write operations, that secret memory mappings respect RLIMIT_MEMLOCK and that remote accesses with process_vm_read() and ptrace() to the secret memory fail. [...] @@ -0,0 +1,296 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright IBM Corporation, 2020 2021 ? -- Thanks, David / dhildenb
Re: [PATCH v19 7/8] arch, mm: wire up memfd_secret system call where relevant
On 13.05.21 20:47, Mike Rapoport wrote: From: Mike Rapoport Wire up memfd_secret system call on architectures that define ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86. Signed-off-by: Mike Rapoport Acked-by: Palmer Dabbelt Acked-by: Arnd Bergmann Acked-by: Catalin Marinas Cc: Alexander Viro Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Christopher Lameter Cc: Dan Williams Cc: Dave Hansen Cc: David Hildenbrand Cc: Elena Reshetova Cc: Hagen Paul Pfeifer Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Bottomley Cc: "Kirill A. Shutemov" Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael Kerrisk Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Rick Edgecombe Cc: Roman Gushchin Cc: Shakeel Butt Cc: Shuah Khan Cc: Thomas Gleixner Cc: Tycho Andersen Cc: Will Deacon --- arch/arm64/include/uapi/asm/unistd.h | 1 + arch/riscv/include/asm/unistd.h| 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + include/linux/syscalls.h | 1 + include/uapi/asm-generic/unistd.h | 7 ++- scripts/checksyscalls.sh | 4 7 files changed, 15 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/uapi/asm/unistd.h b/arch/arm64/include/uapi/asm/unistd.h index f83a70e07df8..ce2ee8f1e361 100644 --- a/arch/arm64/include/uapi/asm/unistd.h +++ b/arch/arm64/include/uapi/asm/unistd.h @@ -20,5 +20,6 @@ #define __ARCH_WANT_SET_GET_RLIMIT #define __ARCH_WANT_TIME32_SYSCALLS #define __ARCH_WANT_SYS_CLONE3 +#define __ARCH_WANT_MEMFD_SECRET #include diff --git a/arch/riscv/include/asm/unistd.h b/arch/riscv/include/asm/unistd.h index 977ee6181dab..6c316093a1e5 100644 --- a/arch/riscv/include/asm/unistd.h +++ b/arch/riscv/include/asm/unistd.h @@ -9,6 +9,7 @@ */ #define __ARCH_WANT_SYS_CLONE +#define __ARCH_WANT_MEMFD_SECRET #include diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 28a1423ce32e..e44519020a43 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ 
-451,3 +451,4 @@ 444 i386landlock_create_ruleset sys_landlock_create_ruleset 445 i386landlock_add_rule sys_landlock_add_rule 446 i386landlock_restrict_self sys_landlock_restrict_self +447i386memfd_secretsys_memfd_secret diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index ecd551b08d05..a06f16106f24 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -368,6 +368,7 @@ 444 common landlock_create_ruleset sys_landlock_create_ruleset 445 common landlock_add_rule sys_landlock_add_rule 446 common landlock_restrict_self sys_landlock_restrict_self +447common memfd_secretsys_memfd_secret # # Due to a historical design error, certain syscalls are numbered differently diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 050511e8f1f8..1a1b5d724497 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -1050,6 +1050,7 @@ asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr _ asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type, const void __user *rule_attr, __u32 flags); asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags); +asmlinkage long sys_memfd_secret(unsigned int flags); /* * Architecture-specific system calls diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 6de5a7fc066b..28b388368cf6 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -873,8 +873,13 @@ __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule) #define __NR_landlock_restrict_self 446 __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self) +#ifdef __ARCH_WANT_MEMFD_SECRET +#define __NR_memfd_secret 447 +__SYSCALL(__NR_memfd_secret, sys_memfd_secret) +#endif + #undef __NR_syscalls -#define __NR_syscalls 447 +#define __NR_syscalls 448 /* * 32 bit systems traditionally used different diff --git 
a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh index a18b47695f55..b7609958ee36 100755 --- a/scripts/checksyscalls.sh +++ b/scripts/checksyscalls.sh @@ -40,6 +40,10 @@ cat << EOF #define __IGNORE_setrlimit/* setrlimit */ #endif +#ifndef __ARCH_WANT_MEMFD_SECRET +#define __IGNORE_memfd_secret +#endif + /* Missing flags argument */ #define __IGNORE_renameat /* renameat2 */ Acked-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v19 6/8] PM: hibernate: disable when there are active secretmem users
On 13.05.21 20:47, Mike Rapoport wrote: From: Mike Rapoport It is unsafe to allow saving of secretmem areas to the hibernation snapshot as they would be visible after the resume and this essentially will defeat the purpose of secret memory mappings. Prevent hibernation whenever there are active secret memory users. Signed-off-by: Mike Rapoport Cc: Alexander Viro Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christopher Lameter Cc: Dan Williams Cc: Dave Hansen Cc: David Hildenbrand Cc: Elena Reshetova Cc: Hagen Paul Pfeifer Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Bottomley Cc: "Kirill A. Shutemov" Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael Kerrisk Cc: Palmer Dabbelt Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Rick Edgecombe Cc: Roman Gushchin Cc: Shakeel Butt Cc: Shuah Khan Cc: Thomas Gleixner Cc: Tycho Andersen Cc: Will Deacon --- include/linux/secretmem.h | 6 ++ kernel/power/hibernate.c | 5 - mm/secretmem.c| 15 +++ 3 files changed, 25 insertions(+), 1 deletion(-) diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h index e617b4afcc62..21c3771e6a56 100644 --- a/include/linux/secretmem.h +++ b/include/linux/secretmem.h @@ -30,6 +30,7 @@ static inline bool page_is_secretmem(struct page *page) } bool vma_is_secretmem(struct vm_area_struct *vma); +bool secretmem_active(void); #else @@ -43,6 +44,11 @@ static inline bool page_is_secretmem(struct page *page) return false; } +static inline bool secretmem_active(void) +{ + return false; +} + #endif /* CONFIG_SECRETMEM */ #endif /* _LINUX_SECRETMEM_H */ diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c index da0b41914177..559acef3fddb 100644 --- a/kernel/power/hibernate.c +++ b/kernel/power/hibernate.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include "power.h" @@ -81,7 +82,9 @@ void hibernate_release(void) bool hibernation_available(void) { - return nohibernate == 0 && 
!security_locked_down(LOCKDOWN_HIBERNATION); + return nohibernate == 0 && + !security_locked_down(LOCKDOWN_HIBERNATION) && + !secretmem_active(); } /** diff --git a/mm/secretmem.c b/mm/secretmem.c index 1ae50089adf1..7c2499e4de22 100644 --- a/mm/secretmem.c +++ b/mm/secretmem.c @@ -40,6 +40,13 @@ module_param_named(enable, secretmem_enable, bool, 0400); MODULE_PARM_DESC(secretmem_enable, "Enable secretmem and memfd_secret(2) system call"); +static atomic_t secretmem_users; + +bool secretmem_active(void) +{ + return !!atomic_read(&secretmem_users); +} + static vm_fault_t secretmem_fault(struct vm_fault *vmf) { struct address_space *mapping = vmf->vma->vm_file->f_mapping; @@ -94,6 +101,12 @@ static const struct vm_operations_struct secretmem_vm_ops = { .fault = secretmem_fault, }; +static int secretmem_release(struct inode *inode, struct file *file) +{ + atomic_dec(&secretmem_users); + return 0; +} + static int secretmem_mmap(struct file *file, struct vm_area_struct *vma) { unsigned long len = vma->vm_end - vma->vm_start; @@ -116,6 +129,7 @@ bool vma_is_secretmem(struct vm_area_struct *vma) } static const struct file_operations secretmem_fops = { + .release= secretmem_release, .mmap = secretmem_mmap, }; @@ -202,6 +216,7 @@ SYSCALL_DEFINE1(memfd_secret, unsigned int, flags) file->f_flags |= O_LARGEFILE; fd_install(fd, file); + atomic_inc(&secretmem_users); return fd; err_put_fd: It looks a bit racy, but I guess we don't really care about these corner cases. Acked-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v19 5/8] mm: introduce memfd_secret system call to create "secret" memory areas
#ifdef CONFIG_IA64 # include @@ -64,6 +65,9 @@ static inline int valid_mmap_phys_addr_range(unsigned long pfn, size_t size) #ifdef CONFIG_STRICT_DEVMEM static inline int page_is_allowed(unsigned long pfn) { + if (pfn_valid(pfn) && page_is_secretmem(pfn_to_page(pfn))) + return 0; + 1. The memmap might be garbage. You should use pfn_to_online_page() instead. page = pfn_to_online_page(pfn); if (page && page_is_secretmem(page)) return 0; 2. What about !CONFIG_STRICT_DEVMEM? 3. Someone could map physical memory before a secretmem page gets allocated and read the content after it got allocated and gets used. If someone would gain root privileges and would wait for the target application to (re)start, that could be problematic. I do wonder if enforcing CONFIG_STRICT_DEVMEM would be cleaner. devmem_is_allowed() should disallow access to any system ram, and thereby, any possible secretmem pages, avoiding this check completely. [...] diff --git a/mm/secretmem.c b/mm/secretmem.c new file mode 100644 index ..1ae50089adf1 --- /dev/null +++ b/mm/secretmem.c @@ -0,0 +1,239 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright IBM Corporation, 2021 + * + * Author: Mike Rapoport + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include + +#include "internal.h" + +#undef pr_fmt +#define pr_fmt(fmt) "secretmem: " fmt + +/* + * Define mode and flag masks to allow validation of the system call + * parameters. 
+ */ +#define SECRETMEM_MODE_MASK(0x0) +#define SECRETMEM_FLAGS_MASK SECRETMEM_MODE_MASK + +static bool secretmem_enable __ro_after_init; +module_param_named(enable, secretmem_enable, bool, 0400); +MODULE_PARM_DESC(secretmem_enable, +"Enable secretmem and memfd_secret(2) system call"); + +static vm_fault_t secretmem_fault(struct vm_fault *vmf) +{ + struct address_space *mapping = vmf->vma->vm_file->f_mapping; + struct inode *inode = file_inode(vmf->vma->vm_file); + pgoff_t offset = vmf->pgoff; + gfp_t gfp = vmf->gfp_mask; + unsigned long addr; + struct page *page; + int err; + + if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode)) + return vmf_error(-EINVAL); + +retry: + page = find_lock_page(mapping, offset); + if (!page) { + page = alloc_page(gfp | __GFP_ZERO); We'll end up here with gfp == GFP_HIGHUSER (via the mapping below), correct? + if (!page) + return VM_FAULT_OOM; + + err = set_direct_map_invalid_noflush(page, 1); + if (err) { + put_page(page); + return vmf_error(err); Would we want to translate that to a proper VM_FAULT_..., which would most probably be VM_FAULT_OOM when we fail to allocate a pagetable? + } + + __SetPageUptodate(page); + err = add_to_page_cache_lru(page, mapping, offset, gfp); + if (unlikely(err)) { + put_page(page); + /* +* If a split of large page was required, it +* already happened when we marked the page invalid +* which guarantees that this call won't fail +*/ + set_direct_map_default_noflush(page, 1); + if (err == -EEXIST) + goto retry; + + return vmf_error(err); + } + + addr = (unsigned long)page_address(page); + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); Hmm, to me it feels like something like that belongs into the set_direct_map_invalid_*() calls? Otherwise it's just very easy to mess up ... I'm certainly not a filesystem guy. Nothing else jumped at me. 
To me, the overall approach makes sense and I consider it an improved mlock() mechanism for storing secrets, although I'd love to have some more information in the log regarding access via root, namely that there are still fancy ways to read secretmem memory once root via 1. warm reboot attacks especially in VMs (e.g., modifying the cmdline) 2. kexec-style reboot attacks (e.g., modifying the cmdline) 3. kdump attacks 4. kdb most probably 5. "letting the process read the memory for us" via Kees if that still applies 6. ... most probably something else Just to make people aware that there are still some things to be sorted out when we fully want to protect against privilege escalations. (maybe this information is buried in the cover letter already, where it usually gets lost) -- Thanks, David / dhildenb
Re: [PATCH v19 5/8] mm: introduce memfd_secret system call to create "secret" memory areas
On 13.05.21 20:47, Mike Rapoport wrote: From: Mike Rapoport Introduce "memfd_secret" system call with the ability to create memory areas visible only in the context of the owning process and not mapped not only to other processes but in the kernel page tables as well. The secretmem feature is off by default and the user must explicitly enable it at the boot time. Once secretmem is enabled, the user will be able to create a file descriptor using the memfd_secret() system call. The memory areas created by mmap() calls from this file descriptor will be unmapped from the kernel direct map and they will be only mapped in the page table of the processes that have access to the file descriptor. The file descriptor based memory has several advantages over the "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). File descriptor approach allows explict and controlled sharing of the memory s/explict/explicit/ areas, it allows to seal the operations. Besides, file descriptor based memory paves the way for VMMs to remove the secret memory range from the userpace hipervisor process, for instance QEMU. Andy Lutomirski says: s/userpace hipervisor/userspace hypervisor/ "Getting fd-backed memory into a guest will take some possibly major work in the kernel, but getting vma-backed memory into a guest without mapping it in the host user address space seems much, much worse." memfd_secret() is made a dedicated system call rather than an extention to s/extention/extension/ memfd_create() because it's purpose is to allow the user to create more secure memory mappings rather than to simply allow file based access to the memory. Nowadays a new system call cost is negligible while it is way simpler for userspace to deal with a clear-cut system calls than with a multiplexer or an overloaded syscall. Moreover, the initial implementation of memfd_secret() is completely distinct from memfd_create() so there is no much sense in overloading memfd_create() to begin with. 
If there will be a need for code sharing between these implementation it can be easily achieved without a need to adjust user visible APIs. The secret memory remains accessible in the process context using uaccess primitives, but it is not exposed to the kernel otherwise; secret memory areas are removed from the direct map and functions in the follow_page()/get_user_page() family will refuse to return a page that belongs to the secret memory area. Once there will be a use case that will require exposing secretmem to the kernel it will be an opt-in request in the system call flags so that user would have to decide what data can be exposed to the kernel. Maybe spell out an example: like page migration. Removing of the pages from the direct map may cause its fragmentation on architectures that use large pages to map the physical memory which affects the system performance. However, the original Kconfig text for CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736 ("x86: add gbpages switches")) and the recent report [1] showed that "... although 1G mappings are a good default choice, there is no compelling evidence that it must be the only choice". Hence, it is sufficient to have secretmem disabled by default with the ability of a system administrator to enable it at boot time. Maybe add a link to the Intel performance evaluation. Pages in the secretmem regions are unevictable and unmovable to avoid accidental exposure of the sensitive data via swap or during page migration. Since the secretmem mappings are locked in memory they cannot exceed RLIMIT_MEMLOCK. Since these mappings are already locked independently from mlock(), an attempt to mlock()/munlock() secretmem range would fail and mlockall()/munlockall() will ignore secretmem mappings. Maybe add something like "similar to pages pinned by VFIO". 
However, unlike mlock()ed memory, secretmem currently behaves more like long-term GUP: secretmem mappings are unmovable mappings directly consumed by user space. With default limits, there is no excessive use of secretmem and it poses no real problem in combination with ZONE_MOVABLE/CMA, but in the future this should be addressed to allow balanced use of large amounts of secretmem along with ZONE_MOVABLE/CMA. A page that was a part of the secret memory area is cleared when it is freed to ensure the data is not exposed to the next user of that page. You could skip that with init_on_free (and eventually also with init_on_alloc) set to avoid double clearing. The following example demonstrates creation of a secret mapping (error handling is omitted): fd = memfd_secret(0); ftruncate(fd, MAP_SIZE); ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884e...@linux.intel.com/ [my mail client messed up the remainder of the mail for whatever reason]
Re: [PATCH v19 3/8] set_memory: allow set_direct_map_*_noflush() for multiple pages
On 13.05.21 20:47, Mike Rapoport wrote: From: Mike Rapoport The underlying implementations of set_direct_map_invalid_noflush() and set_direct_map_default_noflush() allow updating multiple contiguous pages at once. Add numpages parameter to set_direct_map_*_noflush() to expose this ability with these APIs. [...] Finally doing some in-depth review, sorry for not having a detailed look earlier. -int set_direct_map_invalid_noflush(struct page *page) +int set_direct_map_invalid_noflush(struct page *page, int numpages) { struct page_change_data data = { .set_mask = __pgprot(0), .clear_mask = __pgprot(PTE_VALID), }; + unsigned long size = PAGE_SIZE * numpages; Nit: I'd have made this const and added an early exit for !numpages. But whatever you prefer. if (!debug_pagealloc_enabled() && !rodata_full) return 0; return apply_to_page_range(&init_mm, (unsigned long)page_address(page), - PAGE_SIZE, change_page_range, &data); + size, change_page_range, &data); } -int set_direct_map_default_noflush(struct page *page) +int set_direct_map_default_noflush(struct page *page, int numpages) { struct page_change_data data = { .set_mask = __pgprot(PTE_VALID | PTE_WRITE), .clear_mask = __pgprot(PTE_RDONLY), }; + unsigned long size = PAGE_SIZE * numpages; Nit: ditto if (!debug_pagealloc_enabled() && !rodata_full) return 0; return apply_to_page_range(&init_mm, (unsigned long)page_address(page), - PAGE_SIZE, change_page_range, &data); + size, change_page_range, &data); } [...]
extern int kernel_set_to_readonly; diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 156cd235659f..15a55d6e9cec 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -2192,14 +2192,14 @@ static int __set_pages_np(struct page *page, int numpages) return __change_page_attr_set_clr(&cpa, 0); } -int set_direct_map_invalid_noflush(struct page *page) +int set_direct_map_invalid_noflush(struct page *page, int numpages) { - return __set_pages_np(page, 1); + return __set_pages_np(page, numpages); } -int set_direct_map_default_noflush(struct page *page) +int set_direct_map_default_noflush(struct page *page, int numpages) { - return __set_pages_p(page, 1); + return __set_pages_p(page, numpages); } So, what happens if we succeeded setting set_direct_map_invalid_noflush() for some pages but fail when having to split a large mapping? Did I miss something or would the current code not undo what it partially did? Or do we simply not care? I guess to handle this cleanly we would either have to catch all error cases first (esp. splitting large mappings) before actually performing the set to invalid, or have some recovery code in place if possible. AFAIK, your patch #5 right now only calls it with 1 page; do we need this change at all? Feels like a leftover from older versions to me where we could have had more than a single page. -- Thanks, David / dhildenb
Re: [PATCH v19 2/8] riscv/Kconfig: make direct map manipulation options depend on MMU
On 13.05.21 20:47, Mike Rapoport wrote: From: Mike Rapoport ARCH_HAS_SET_DIRECT_MAP and ARCH_HAS_SET_MEMORY configuration options have no meaning when CONFIG_MMU is disabled and there is no point to enable them for the nommu case. Add an explicit dependency on MMU for these options. Signed-off-by: Mike Rapoport Reported-by: kernel test robot --- arch/riscv/Kconfig | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index a8ad8eb76120..c426e7d20907 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -26,8 +26,8 @@ config RISCV select ARCH_HAS_KCOV select ARCH_HAS_MMIOWB select ARCH_HAS_PTE_SPECIAL - select ARCH_HAS_SET_DIRECT_MAP - select ARCH_HAS_SET_MEMORY + select ARCH_HAS_SET_DIRECT_MAP if MMU + select ARCH_HAS_SET_MEMORY if MMU select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL select ARCH_HAS_STRICT_MODULE_RWX if MMU && !XIP_KERNEL select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v19 1/8] mmap: make mlock_future_check() global
On 13.05.21 20:47, Mike Rapoport wrote: From: Mike Rapoport It will be used by the upcoming secret memory implementation. Signed-off-by: Mike Rapoport Cc: Alexander Viro Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christopher Lameter Cc: Dan Williams Cc: Dave Hansen Cc: David Hildenbrand Cc: Elena Reshetova Cc: Hagen Paul Pfeifer Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Bottomley Cc: "Kirill A. Shutemov" Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael Kerrisk Cc: Palmer Dabbelt Cc: Palmer Dabbelt Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Rick Edgecombe Cc: Roman Gushchin Cc: Shakeel Butt Cc: Shuah Khan Cc: Thomas Gleixner Cc: Tycho Andersen Cc: Will Deacon --- mm/internal.h | 3 +++ mm/mmap.c | 5 ++--- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 54bd0dc2c23c..46eb82eaa195 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -373,6 +373,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma) extern void mlock_vma_page(struct page *page); extern unsigned int munlock_vma_page(struct page *page); +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, + unsigned long len); + /* * Clear the page's PageMlocked(). This can be useful in a situation where * we want to unconditionally remove a page from the pagecache -- e.g., diff --git a/mm/mmap.c b/mm/mmap.c index 0584e540246e..81f5595a8490 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1352,9 +1352,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint) return hint; } -static inline int mlock_future_check(struct mm_struct *mm, -unsigned long flags, -unsigned long len) +int mlock_future_check(struct mm_struct *mm, unsigned long flags, + unsigned long len) { unsigned long locked, lock_limit; Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v18 0/9] mm: introduce memfd_secret system call to create "secret" memory areas
On 07.05.21 01:16, Nick Kossifidis wrote: On 2021-05-06 20:05, James Bottomley wrote: On Thu, 2021-05-06 at 18:45 +0200, David Hildenbrand wrote: Also, there is a way to still read that memory when root by 1. Having kdump active (which would often be the case, but maybe not to dump user pages) 2. Triggering a kernel crash (easy via proc as root) 3. Waiting for the reboot after kdump() created the dump and then reading the content from disk. Anything that can leave physical memory intact but boot to a kernel where the missing direct map entry is restored could theoretically extract the secret. However, it's not exactly going to be a stealthy extraction ... Or, as an attacker, load a custom kexec() kernel and read memory from the new environment. Of course, the latter two are advanced mechanisms, but they are possible when root. We might be able to mitigate, for example, by zeroing out secretmem pages before booting into the kexec kernel, if we care :) I think we could handle it by marking the region, yes, and a zero on shutdown might be useful ... it would prevent all warm reboot type attacks. I had similar concerns about recovering secrets with kdump, and considered cleaning up keyrings before jumping to the new kernel. The problem is we can't provide guarantees in that case; once the kernel has crashed and we are on our way to run crashkernel, we can't be sure we can reliably zero-out anything, the more code we add to that path the
Now, during 2) it can happen that a) We crash in our clearing code (e.g., something is seriously messed up) and fail to start the kdump kernel. That's actually good, instead of leaking data we fail hard. b) We don't find all secretmem pages, for example, because process page tables are messed up or something messed up our memmap (if we'd use that to identify secretmem pages via a PFN walker somehow) But for the simple cases (e.g., malicious root tries to crash the kernel via /proc/sysrq-trigger) both a) and b) wouldn't apply. Obviously, if an admin would want to mitigate right now, he would want to disable kdump completely, meaning any attempt to load a crashkernel would fail and cannot be enabled again for that kernel (also not via cmdline an attacker could modify to reboot into a system with the option for a crashkernel). Disabling kdump in the kernel when secretmem pages are allocated is one approach, although sub-optimal. more risky it gets. However during reboot/normal kexec() we should do some cleanup, it makes sense and secretmem can indeed be useful in that case. Regarding loading custom kexec() kernels, we mitigate this with the kexec file-based API where we can verify the signature of the loaded kimage (assuming the system runs a kernel provided by a trusted 3rd party and we 've maintained a chain of trust since booting). For example in VMs (like QEMU), we often don't clear physical memory during a reboot. So if an attacker manages to load a kernel that you can trick into reading random physical memory areas, we can leak secretmem data I think. And there might be ways to achieve that just using the cmdline, not necessarily loading a different kernel. For example if you limit the kernel footprint ("mem=256M") and disable strict_iomem_checks ("strict_iomem_checks=relaxed") you can just extract that memory via /dev/mem if I am not wrong. 
So as an attacker, modify the (grub) cmdline to "mem=256M strict_iomem_checks=relaxed", reboot, and read all memory via /dev/mem. Or load a signed kexec kernel with that cmdline and boot into it. Interesting problem :) -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v18 0/9] mm: introduce memfd_secret system call to create "secret" memory areas
Is this intended to protect keys/etc after the attacker has gained the ability to run arbitrary kernel-mode code? If so, that seems optimistic, doesn't it? Not exactly: there are many types of kernel attack, but mostly the attacker either manages to effect a privilege escalation to root or gets the ability to run a ROP gadget. The object of this code is to be completely secure against root trying to extract the secret (somewhat similar to the lockdown idea), thus defeating privilege escalation, and to provide "sufficient" protection against ROP gadgets. What stops "root" from mapping /dev/mem and reading that memory? /dev/mem uses the direct map for the copy at least for read/write, so it gets a fault in the same way root trying to use ptrace does. I think we've protected mmap, but Mike would know that better than I. I'm more concerned about the mmap case -> remap_pfn_range(). Anybody going via the VMA shouldn't see the struct page, at least when vm_normal_page() is properly used; so you cannot detect secretmem memory mapped via /dev/mem reliably. At least that's my theory :) [...] Also, there is a way to still read that memory when root by 1. Having kdump active (which would often be the case, but maybe not to dump user pages) 2. Triggering a kernel crash (easy via proc as root) 3. Waiting for the reboot after kdump created the dump and then reading the content from disk. Anything that can leave physical memory intact but boot to a kernel where the missing direct map entry is restored could theoretically extract the secret. However, it's not exactly going to be a stealthy extraction ... Or, as an attacker, load a custom kexec() kernel and read memory from the new environment. Of course, the latter two are advanced mechanisms, but they are possible when root. 
We might be able to mitigate, for example, by zeroing out secretmem pages before booting into the kexec kernel, if we care :) I think we could handle it by marking the region, yes, and a zero on shutdown might be useful ... it would prevent all warm reboot type attacks. Right. But I guess when you're actually root, you can just write a kernel module to extract the information you need (unless we have signed modules, so it could be harder/impossible). -- Thanks, David / dhildenb
Re: [PATCH v18 0/9] mm: introduce memfd_secret system call to create "secret" memory areas
On 06.05.21 17:26, James Bottomley wrote: On Wed, 2021-05-05 at 12:08 -0700, Andrew Morton wrote: On Wed, 3 Mar 2021 18:22:00 +0200 Mike Rapoport wrote: This is an implementation of "secret" mappings backed by a file descriptor. The file descriptor backing secret memory mappings is created using a dedicated memfd_secret system call. The desired protection mode for the memory is configured using the flags parameter of the system call. The mmap() of the file descriptor created with memfd_secret() will create a "secret" memory mapping. The pages in that mapping will be marked as not present in the direct map and will be present only in the page table of the owning mm. Although normally Linux userspace mappings are protected from other users, such secret mappings are useful for environments where a hostile tenant is trying to trick the kernel into giving them access to other tenants' mappings. I continue to struggle with this and I don't recall seeing much enthusiasm from others. Perhaps we're all missing the value point and some additional selling is needed. Am I correct in understanding that the overall direction here is to protect keys (and perhaps other things) from kernel bugs? That if the kernel was bug-free then there would be no need for this feature? If so, that's a bit sad. But realistic I guess. Secret memory really serves several purposes. The "increase the level of difficulty of secret exfiltration" you describe. And, as you say, if the kernel were bug free this wouldn't be necessary. But also: 1. Memory safety for user space code. Once the secret memory is allocated, the user can't accidentally pass it into the kernel to be transmitted somewhere. That's an interesting point I didn't realize so far. 2. It also serves as a basis for context protection of virtual machines, but other groups are working on this aspect, and it is broadly similar to the secret exfiltration from the kernel problem. 
I was wondering if this also helps against CPU microcode issues like spectre and friends. Is this intended to protect keys/etc after the attacker has gained the ability to run arbitrary kernel-mode code? If so, that seems optimistic, doesn't it? Not exactly: there are many types of kernel attack, but mostly the attacker either manages to effect a privilege escalation to root or gets the ability to run a ROP gadget. The object of this code is to be completely secure against root trying to extract the secret (somewhat similar to the lockdown idea), thus defeating privilege escalation, and to provide "sufficient" protection against ROP gadgets. What stops "root" from mapping /dev/mem and reading that memory? IOW, would we want to enforce "CONFIG_STRICT_DEVMEM" with CONFIG_SECRETMEM? Also, there is a way to still read that memory when root by 1. Having kdump active (which would often be the case, but maybe not to dump user pages) 2. Triggering a kernel crash (easy via proc as root) 3. Waiting for the reboot after kdump created the dump and then reading the content from disk. Or, as an attacker, load a custom kexec() kernel and read memory from the new environment. Of course, the latter two are advanced mechanisms, but they are possible when root. We might be able to mitigate, for example, by zeroing out secretmem pages before booting into the kexec kernel, if we care :) -- Thanks, David / dhildenb
Re: [PATCH v3 1/2] secretmem/gup: don't check if page is secretmem without reference
On 20.04.21 17:00, Mike Rapoport wrote: From: Mike Rapoport The check in gup_pte_range() whether a page belongs to a secretmem mapping is performed before grabbing the page reference. To avoid a potential race, move the check after try_grab_compound_head(). Signed-off-by: Mike Rapoport --- mm/gup.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index c3a17b189064..6515f82b0f32 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2080,13 +2080,15 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); - if (page_is_secretmem(page)) - goto pte_unmap; - head = try_grab_compound_head(page, 1, flags); if (!head) goto pte_unmap; + if (unlikely(page_is_secretmem(page))) { + put_compound_head(head, 1, flags); + goto pte_unmap; + } + if (unlikely(pte_val(pte) != pte_val(*ptep))) { put_compound_head(head, 1, flags); goto pte_unmap; Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v2 1/2] secretmem/gup: don't check if page is secretmem without reference
On 20.04.21 15:16, Mike Rapoport wrote: From: Mike Rapoport The check in gup_pte_range() whether a page belongs to a secretmem mapping is performed before grabbing the page reference. To avoid a potential race, move the check after try_grab_compound_head(). Signed-off-by: Mike Rapoport --- mm/gup.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index c3a17b189064..4b58c016e949 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2080,13 +2080,13 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); - if (page_is_secretmem(page)) - goto pte_unmap; - head = try_grab_compound_head(page, 1, flags); if (!head) goto pte_unmap; + if (page_is_secretmem(page)) + goto pte_unmap; + Looking at the hunk below, I wonder if you're missing a put_compound_head(). (also, I'd do if unlikely(page_is_secretmem()) but that's a different discussion) if (unlikely(pte_val(pte) != pte_val(*ptep))) { put_compound_head(head, 1, flags); goto pte_unmap; -- Thanks, David / dhildenb
Re: [PATCH] secretmem: optimize page_is_secretmem()
On 19.04.21 12:14, Mike Rapoport wrote: On Mon, Apr 19, 2021 at 11:40:56AM +0200, David Hildenbrand wrote: On 19.04.21 11:38, David Hildenbrand wrote: On 19.04.21 11:36, Mike Rapoport wrote: On Mon, Apr 19, 2021 at 11:15:02AM +0200, David Hildenbrand wrote: On 19.04.21 10:42, Mike Rapoport wrote: From: Mike Rapoport Kernel test robot reported a -4.2% regression of will-it-scale.per_thread_ops due to commit "mm: introduce memfd_secret system call to create "secret" memory areas". The perf profile of the test indicated that the regression is caused by page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range): 27.76 +2.5 30.23 perf-profile.children.cycles-pp.gup_pgd_range 0.00 +3.2 3.19 ± 2% perf-profile.children.cycles-pp.page_mapping 0.00 +3.7 3.66 ± 2% perf-profile.children.cycles-pp.page_is_secretmem Further analysis showed that the slowdown happens because neither page_is_secretmem() nor page_mapping() is inline and, moreover, multiple page flags checks in page_mapping() involve calling compound_head() several times for the same page. Make page_is_secretmem() inline and replace page_mapping() with page flag checks that do not imply page-to-head conversion. Reported-by: kernel test robot Signed-off-by: Mike Rapoport --- @Andrew, The patch is vs v5.12-rc7-mmots-2021-04-15-16-28, I'd appreciate if it would be added as a fixup to the memfd_secret series. 
include/linux/secretmem.h | 26 +- mm/secretmem.c | 12 +--- 2 files changed, 26 insertions(+), 12 deletions(-) diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h index 907a6734059c..b842b38cbeb1 100644 --- a/include/linux/secretmem.h +++ b/include/linux/secretmem.h @@ -4,8 +4,32 @@ #ifdef CONFIG_SECRETMEM +extern const struct address_space_operations secretmem_aops; + +static inline bool page_is_secretmem(struct page *page) +{ + struct address_space *mapping; + + /* +* Using page_mapping() is quite slow because of the actual call +* instruction and repeated compound_head(page) inside the +* page_mapping() function. +* We know that secretmem pages are not compound and LRU so we can +* save a couple of cycles here. +*/ + if (PageCompound(page) || !PageLRU(page)) + return false; I'd assume secretmem pages are rare in basically every setup out there. So maybe throwing in a couple of likely()/unlikely() might make sense. I'd say we could do unlikely(page_is_secretmem()) at call sites. Here I can hardly estimate which pages are going to be checked. + + mapping = (struct address_space *) + ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS); + Not sure if open-coding page_mapping is really a good idea here -- or even necessary after the fast path above is in place. Anyhow, just my 2 cents. Well, most of the -4.2% of the performance regression kbuild reported was due to repeated compound_head(page) in page_mapping(). So the whole point of this patch is to avoid calling page_mapping(). I would have thought the fast path "(PageCompound(page) || !PageLRU(page))" would already avoid calling page_mapping() in many cases. (and I do wonder if a generic page_mapping() optimization would make sense instead) Not sure. 
Replacing page_mapping() with page_file_mapping() at the call sites in fs/ and mm/ increased the defconfig image by nearly 2k, and page_file_mapping() is way simpler than page_mapping():

add/remove: 1/0 grow/shrink: 35/0 up/down: 1960/0 (1960)
Function                         old     new   delta
shrink_page_list                3414    3670    +256
__set_page_dirty_nobuffers       242     349    +107
check_move_unevictable_pages     904     987     +83
move_to_new_page                 591     671     +80
shrink_active_list               912     970     +58
move_pages_to_lru                911     965     +54
migrate_pages                   2500    2554     +54
shmem_swapin_page               1145    1197     +52
shmem_undo_range                1669    1719     +50
__test_set_page_writeback        620     670     +50
__set_page_dirty_buffers         187     237     +50
__pagevec_lru_add                757     807     +50
__munlock_pagevec               1155    1205     +50
__dump_page                     1101    1151     +50
__cancel_dirty_page              182     232     +50
__remove_mapping                 461     510     +49
rmap_walk_file                   402     449     +47
isolate_movable_page             240     287     +47
test_clear_page_writeback        668     714     +46
page_cache_pipe_buf_try
Re: [PATCH] secretmem: optimize page_is_secretmem()
On 19.04.21 11:38, David Hildenbrand wrote: On 19.04.21 11:36, Mike Rapoport wrote: On Mon, Apr 19, 2021 at 11:15:02AM +0200, David Hildenbrand wrote: On 19.04.21 10:42, Mike Rapoport wrote: From: Mike Rapoport Kernel test robot reported a -4.2% regression of will-it-scale.per_thread_ops due to commit "mm: introduce memfd_secret system call to create "secret" memory areas". The perf profile of the test indicated that the regression is caused by page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range): 27.76 +2.5 30.23 perf-profile.children.cycles-pp.gup_pgd_range 0.00 +3.2 3.19 ± 2% perf-profile.children.cycles-pp.page_mapping 0.00 +3.7 3.66 ± 2% perf-profile.children.cycles-pp.page_is_secretmem Further analysis showed that the slowdown happens because neither page_is_secretmem() nor page_mapping() is inline and, moreover, multiple page flags checks in page_mapping() involve calling compound_head() several times for the same page. Make page_is_secretmem() inline and replace page_mapping() with page flag checks that do not imply page-to-head conversion. Reported-by: kernel test robot Signed-off-by: Mike Rapoport --- @Andrew, The patch is vs v5.12-rc7-mmots-2021-04-15-16-28, I'd appreciate if it would be added as a fixup to the memfd_secret series. include/linux/secretmem.h | 26 +- mm/secretmem.c | 12 +--- 2 files changed, 26 insertions(+), 12 deletions(-) diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h index 907a6734059c..b842b38cbeb1 100644 --- a/include/linux/secretmem.h +++ b/include/linux/secretmem.h @@ -4,8 +4,32 @@ #ifdef CONFIG_SECRETMEM +extern const struct address_space_operations secretmem_aops; + +static inline bool page_is_secretmem(struct page *page) +{ + struct address_space *mapping; + + /* +* Using page_mapping() is quite slow because of the actual call +* instruction and repeated compound_head(page) inside the +* page_mapping() function. 
+* We know that secretmem pages are not compound and LRU so we can +* save a couple of cycles here. +*/ + if (PageCompound(page) || !PageLRU(page)) + return false; I'd assume secretmem pages are rare in basically every setup out there. So maybe throwing in a couple of likely()/unlikely() might make sense. I'd say we could do unlikely(page_is_secretmem()) at call sites. Here I can hardly estimate which pages are going to be checked. + + mapping = (struct address_space *) + ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS); + Not sure if open-coding page_mapping is really a good idea here -- or even necessary after the fast path above is in place. Anyhow, just my 2 cents. Well, most of the -4.2% of the performance regression kbuild reported was due to repeated compound_head(page) in page_mapping(). So the whole point of this patch is to avoid calling page_mapping(). I would have thought the fast path "(PageCompound(page) || !PageLRU(page))" would already avoid calling page_mapping() in many cases. (and I do wonder if a generic page_mapping() optimization would make sense instead) Willy can most probably give the best advise here :) -- Thanks, David / dhildenb
Re: [PATCH] secretmem: optimize page_is_secretmem()
On 19.04.21 11:36, Mike Rapoport wrote: On Mon, Apr 19, 2021 at 11:15:02AM +0200, David Hildenbrand wrote: On 19.04.21 10:42, Mike Rapoport wrote: From: Mike Rapoport Kernel test robot reported a -4.2% regression of will-it-scale.per_thread_ops due to commit "mm: introduce memfd_secret system call to create "secret" memory areas". The perf profile of the test indicated that the regression is caused by page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range): 27.76 +2.5 30.23 perf-profile.children.cycles-pp.gup_pgd_range 0.00 +3.2 3.19 ± 2% perf-profile.children.cycles-pp.page_mapping 0.00 +3.7 3.66 ± 2% perf-profile.children.cycles-pp.page_is_secretmem Further analysis showed that the slowdown happens because neither page_is_secretmem() nor page_mapping() is inline and, moreover, multiple page flags checks in page_mapping() involve calling compound_head() several times for the same page. Make page_is_secretmem() inline and replace page_mapping() with page flag checks that do not imply page-to-head conversion. Reported-by: kernel test robot Signed-off-by: Mike Rapoport --- @Andrew, The patch is vs v5.12-rc7-mmots-2021-04-15-16-28, I'd appreciate if it would be added as a fixup to the memfd_secret series. include/linux/secretmem.h | 26 +- mm/secretmem.c | 12 +--- 2 files changed, 26 insertions(+), 12 deletions(-) diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h index 907a6734059c..b842b38cbeb1 100644 --- a/include/linux/secretmem.h +++ b/include/linux/secretmem.h @@ -4,8 +4,32 @@ #ifdef CONFIG_SECRETMEM +extern const struct address_space_operations secretmem_aops; + +static inline bool page_is_secretmem(struct page *page) +{ + struct address_space *mapping; + + /* +* Using page_mapping() is quite slow because of the actual call +* instruction and repeated compound_head(page) inside the +* page_mapping() function. +* We know that secretmem pages are not compound and LRU so we can +* save a couple of cycles here. 
+*/ + if (PageCompound(page) || !PageLRU(page)) + return false; I'd assume secretmem pages are rare in basically every setup out there. So maybe throwing in a couple of likely()/unlikely() might make sense. I'd say we could do unlikely(page_is_secretmem()) at call sites. Here I can hardly estimate which pages are going to be checked. + + mapping = (struct address_space *) + ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS); + Not sure if open-coding page_mapping is really a good idea here -- or even necessary after the fast path above is in place. Anyhow, just my 2 cents. Well, most of the -4.2% of the performance regression kbuild reported was due to repeated compound_head(page) in page_mapping(). So the whole point of this patch is to avoid calling page_mapping(). I would have thought the fast path "(PageCompound(page) || !PageLRU(page))" would already avoid calling page_mapping() in many cases. -- Thanks, David / dhildenb
Re: [PATCH] secretmem: optimize page_is_secretmem()
On 19.04.21 10:42, Mike Rapoport wrote: From: Mike Rapoport Kernel test robot reported a -4.2% regression of will-it-scale.per_thread_ops due to commit "mm: introduce memfd_secret system call to create "secret" memory areas". The perf profile of the test indicated that the regression is caused by page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range): 27.76 +2.5 30.23 perf-profile.children.cycles-pp.gup_pgd_range 0.00 +3.2 3.19 ± 2% perf-profile.children.cycles-pp.page_mapping 0.00 +3.7 3.66 ± 2% perf-profile.children.cycles-pp.page_is_secretmem Further analysis showed that the slowdown happens because neither page_is_secretmem() nor page_mapping() is inline and, moreover, multiple page flags checks in page_mapping() involve calling compound_head() several times for the same page. Make page_is_secretmem() inline and replace page_mapping() with page flag checks that do not imply page-to-head conversion. Reported-by: kernel test robot Signed-off-by: Mike Rapoport --- @Andrew, The patch is vs v5.12-rc7-mmots-2021-04-15-16-28, I'd appreciate if it would be added as a fixup to the memfd_secret series. include/linux/secretmem.h | 26 +- mm/secretmem.c | 12 +--- 2 files changed, 26 insertions(+), 12 deletions(-) diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h index 907a6734059c..b842b38cbeb1 100644 --- a/include/linux/secretmem.h +++ b/include/linux/secretmem.h @@ -4,8 +4,32 @@ #ifdef CONFIG_SECRETMEM +extern const struct address_space_operations secretmem_aops; + +static inline bool page_is_secretmem(struct page *page) +{ + struct address_space *mapping; + + /* +* Using page_mapping() is quite slow because of the actual call +* instruction and repeated compound_head(page) inside the +* page_mapping() function. +* We know that secretmem pages are not compound and LRU so we can +* save a couple of cycles here. 
+*/ + if (PageCompound(page) || !PageLRU(page)) + return false; I'd assume secretmem pages are rare in basically every setup out there. So maybe throwing in a couple of likely()/unlikely() might make sense. + + mapping = (struct address_space *) + ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS); + Not sure if open-coding page_mapping is really a good idea here -- or even necessary after the fast path above is in place. Anyhow, just my 2 cents. The idea of the patch makes sense to me. -- Thanks, David / dhildenb
Re: [PATCH] memfd_secret: use unsigned int rather than long as syscall flags type
On 31.03.21 16:23, Mike Rapoport wrote: From: Mike Rapoport Yuri Norov says: If parameter size is the same for native and compat ABIs, we may wire a syscall made by compat client to native handler. This is true for unsigned int, but not true for unsigned long or pointer. That's why I suggest using unsigned int and so avoid creating compat entry point. Use unsigned int as the type of the flags parameter in memfd_secret() system call. Signed-off-by: Mike Rapoport --- @Andrew, The patch is vs v5.12-rc5-mmots-2021-03-30-23, I'd appreciate if it would be added as a fixup to the memfd_secret series. include/linux/syscalls.h | 2 +- mm/secretmem.c| 2 +- tools/testing/selftests/vm/memfd_secret.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 49c93c906893..1a1b5d724497 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -1050,7 +1050,7 @@ asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr _ asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type, const void __user *rule_attr, __u32 flags); asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags); -asmlinkage long sys_memfd_secret(unsigned long flags); +asmlinkage long sys_memfd_secret(unsigned int flags); /* * Architecture-specific system calls diff --git a/mm/secretmem.c b/mm/secretmem.c index f2ae3f32a193..3b1ba3991964 100644 --- a/mm/secretmem.c +++ b/mm/secretmem.c @@ -199,7 +199,7 @@ static struct file *secretmem_file_create(unsigned long flags) return file; } -SYSCALL_DEFINE1(memfd_secret, unsigned long, flags) +SYSCALL_DEFINE1(memfd_secret, unsigned int, flags) { struct file *file; int fd, err; diff --git a/tools/testing/selftests/vm/memfd_secret.c b/tools/testing/selftests/vm/memfd_secret.c index c878c2b841fc..2462f52e9c96 100644 --- a/tools/testing/selftests/vm/memfd_secret.c +++ b/tools/testing/selftests/vm/memfd_secret.c @@ -38,7 +38,7 @@ 
static unsigned long page_size; static unsigned long mlock_limit_cur; static unsigned long mlock_limit_max; -static int memfd_secret(unsigned long flags) +static int memfd_secret(unsigned int flags) { return syscall(__NR_memfd_secret, flags); } LGTM -- Thanks, David / dhildenb
Re: [PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove
On 18.03.21 05:08, Dan Williams wrote: Summary: A dax_dev can be unbound from its driver at any time. Unbind can not fail. The driver-core will always trigger ->remove() and the result from ->remove() is ignored. After ->remove() the driver-core proceeds to tear down context. The filesystem-dax implementation can leave pfns mapped after ->remove() if it is triggered while the filesystem is mounted. Security and data integrity are forfeit if the dax_dev is repurposed for another security domain (new filesystem or changed device modes), or if the dax_dev is physically replaced. CXL is a hotplug bus that makes physical replacement of a dax_dev a real-world prospect. All dax_dev pfns must be unmapped at remove. Detect the "remove while mounted" case and trigger memory_failure() over the entire dax_dev range. Details: The get_user_pages_fast() path expects all synchronization to be handled by the pattern of checking for pte presence, taking a page reference, and then validating that the pte was stable over that event. The gup-fast path for devmap / DAX pages additionally attempts to take/hold a live reference against the hosting pgmap over the page pin. The rationale for the pgmap reference is to synchronize against a dax-device unbind / ->remove() event, but that is unnecessary if pte invalidation is guaranteed in the ->remove() path. Global dax-device pte invalidation *does* happen when the device is in raw "device-dax" mode where there is a single shared inode to unmap at remove, but the filesystem-dax path has a large number of actively mapped inodes unknown to the driver at ->remove() time. So, that unmap does not happen today for filesystem-dax. 
However, as Jason points out, that unmap / invalidation *needs* to happen not only to cleanup get_user_pages_fast() semantics, but in a future (see CXL) where dax_dev ->remove() is correlated with actual physical removal / replacement the implications of allowing a physical pfn to be exchanged without tearing down old mappings are severe (security and data-integrity). What is not in this patch set is coordination with the dax_kmem driver to trigger memory_failure() when the dax_dev is onlined as "System RAM". The remove_memory() API was built with the assumption that platform firmware negotiates all removal requests and the OS has a chance to say "no". This is why dax_kmem today simply leaks request_region() to burn that physical address space for any other usage until the next reboot on a manual unbind event if the memory can't be offlined. However a future to make sure that remove_memory() succeeds after memory_failure() of the same range seems a better semantic than permanently burning physical address space. I'd have similar requirements for virtio-mem on forced driver unloading (although less of an issue, because in contrast to cxl, it doesn't really happen in sane environments). However, I'm afraid there are absolutely no guarantees that you can actually offline+rip out memory exposed to the buddy, at least in the general case. I guess it might be possible to some degree if memory was onlined to ZONE_MOVABLE (there are still no guarantees, though), however, bets are completely off with ZONE_NORMAL. Just imagine the memmap of one dax device being allocated from another dax device. You cannot possibly invalidate via memory_failure() and rip out the memory without crashing the whole system afterwards. -- Thanks, David / dhildenb
Re: [RFC 0/2] virtio-pmem: Asynchronous flush
On 12.03.21 07:02, Dan Williams wrote: On Thu, Mar 11, 2021 at 8:21 PM Pankaj Gupta wrote: Hi David, Jeff reported a preflush ordering issue with the existing implementation of virtio pmem preflush. Dan suggested[1] to implement asynchronous flush for virtio pmem using a work queue, as done in md/RAID. This patch series intends to solve the preflush ordering issue and also makes the flush asynchronous from the submitting thread's POV. Submitting this patch series for feedback; it is a WIP. I have done basic testing and am currently doing more testing. Pankaj Gupta (2): pmem: make nvdimm_flush asynchronous virtio_pmem: Async virtio-pmem flush drivers/nvdimm/nd_virtio.c | 66 ++-- drivers/nvdimm/pmem.c | 15 drivers/nvdimm/region_devs.c | 3 +- drivers/nvdimm/virtio_pmem.c | 9 + drivers/nvdimm/virtio_pmem.h | 12 +++ 5 files changed, 78 insertions(+), 27 deletions(-) [1] https://marc.info/?l=linux-kernel&m=157446316409937&w=2 Just wondering, was there any follow up of this or are we still waiting for feedback? :) Thank you for bringing this up. My apologies I could not follow up on this. I have another version in my local tree but could not post it as I was not sure if I solved the problem correctly. I will clean it up and post for feedback as soon as I can. P.S: Due to serious personal/family health issues I am not able to devote much time on this with other professional commitments. I feel bad that I have this unfinished task. Just in last one year things have not been stable for me & my family and still not getting :( No worries Pankaj. Take care of yourself and your family. The community can handle this for you. I'm open to coaching somebody through what's involved to get this fix landed. Absolutely, no need to worry for now - take care of yourself and your loved ones! I was merely stumbling over this series while cleaning up my inbox, wondering if this is still stuck waiting for review/feedback. No need to rush anything or be stressed. 
In case I have time to look into this in the future, I'd coordinate in this thread (especially, asking for feedback again so I know where this series stands)! -- Thanks, David / dhildenb
Re: [RFC 0/2] virtio-pmem: Asynchronous flush
On 20.04.20 15:19, Pankaj Gupta wrote: Jeff reported a preflush ordering issue with the existing implementation of virtio pmem preflush. Dan suggested[1] to implement asynchronous flush for virtio pmem using a work queue, as done in md/RAID. This patch series intends to solve the preflush ordering issue and also makes the flush asynchronous from the submitting thread's POV. Submitting this patch series for feedback; it is a WIP. I have done basic testing and am currently doing more testing. Pankaj Gupta (2): pmem: make nvdimm_flush asynchronous virtio_pmem: Async virtio-pmem flush drivers/nvdimm/nd_virtio.c | 66 ++-- drivers/nvdimm/pmem.c | 15 drivers/nvdimm/region_devs.c | 3 +- drivers/nvdimm/virtio_pmem.c | 9 + drivers/nvdimm/virtio_pmem.h | 12 +++ 5 files changed, 78 insertions(+), 27 deletions(-) [1] https://marc.info/?l=linux-kernel&m=157446316409937&w=2 Just wondering, was there any follow up of this or are we still waiting for feedback? :) -- Thanks, David / dhildenb
Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
On 22.02.21 10:38, David Hildenbrand wrote: On 17.02.21 17:19, James Bottomley wrote: On Tue, 2021-02-16 at 18:16 +0100, David Hildenbrand wrote: [...] The discussion regarding migratability only really popped up because this is a user-visible thing and not being able to migrate can be a real problem (fragmentation, ZONE_MOVABLE, ...). I think the biggest use will potentially come from hardware acceleration. If it becomes simple to add say encryption to a secret page with no cost, then no flag needed. However, if we only have a limited number of keys so once we run out no more encrypted memory then it becomes a costly resource and users might want a choice of being backed by encryption or not. Right. But wouldn't HW support with configurable keys etc. need more syscall parameters (meaning, even memfd_secret() as it is would not be sufficient?). I suspect the simplistic flag approach might not be sufficient. I might be wrong because I have no clue about MKTME and friends. The theory I was operating under is key management is automatic and hidden, but key scarcity can't be, so if you flag requesting hardware backing then you either get success (the kernel found a key) or failure (the kernel is out of keys). If we actually want to specify the key then we need an extra argument and we *must* have a new system call. Anyhow, I still think extending memfd_create() might just be good enough - at least for now. I really think this is the wrong approach for a user space ABI. If we think we'll ever need to move to a separate syscall, we should begin with one. The pain of trying to shift userspace from memfd_create to a new syscall would be enormous. It's not impossible (see clone3) but it's a pain we should avoid if we know it's coming. Sorry for the late reply, there is just too much going on :) *If* we ever realize we need to pass more parameters we can easily have a new syscall for that purpose. *Then*, we know how that syscall will look like. 
Right now, it's just pure speculation. Until then, going with memfd_create() works just fine IMHO. The worst thing that could happen is that we might not be able to create all fancy secretmem flavors in the future via memfd_create() but only via a different, highly specialized syscall. I don't see a real problem with that. Adding to that, I'll give up arguing now as I have more important things to do. It has been questioned by various people why we need a dedicated syscall and at least for me, without a satisfying answer. Worst thing is that we end up with a syscall that could have been avoided, for example, because 1. We add existing/future memfd_create() flags to memfd_secret() as well when we need them (sealing, hugetlb, ...). 2. We decide in the future to still add MFD_SECRET support to memfd_create(). So be it. -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
On 17.02.21 17:19, James Bottomley wrote: On Tue, 2021-02-16 at 18:16 +0100, David Hildenbrand wrote: [...] The discussion regarding migratability only really popped up because this is a user-visible thing and not being able to migrate can be a real problem (fragmentation, ZONE_MOVABLE, ...). I think the biggest use will potentially come from hardware acceleration. If it becomes simple to add say encryption to a secret page with no cost, then no flag needed. However, if we only have a limited number of keys so once we run out no more encrypted memory then it becomes a costly resource and users might want a choice of being backed by encryption or not. Right. But wouldn't HW support with configurable keys etc. need more syscall parameters (meaning, even memfd_secret() as it is would not be sufficient?). I suspect the simplistic flag approach might not be sufficient. I might be wrong because I have no clue about MKTME and friends. The theory I was operating under is key management is automatic and hidden, but key scarcity can't be, so if you flag requesting hardware backing then you either get success (the kernel found a key) or failure (the kernel is out of keys). If we actually want to specify the key then we need an extra argument and we *must* have a new system call. Anyhow, I still think extending memfd_create() might just be good enough - at least for now. I really think this is the wrong approach for a user space ABI. If we think we'll ever need to move to a separate syscall, we should begin with one. The pain of trying to shift userspace from memfd_create to a new syscall would be enormous. It's not impossible (see clone3) but it's a pain we should avoid if we know it's coming. Sorry for the late reply, there is just too much going on :) *If* we ever realize we need to pass more parameters we can easily have a new syscall for that purpose. *Then*, we know what that syscall will look like. Right now, it's just pure speculation. 
Until then, going with memfd_create() works just fine IMHO. The worst thing that could happen is that we might not be able to create all fancy secretmem flavors in the future via memfd_create() but only via a different, highly specialized syscall. I don't see a real problem with that. -- Thanks, David / dhildenb
Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
For the other parts, the question is what we actually want to let user space configure. Being able to specify "Very secure" "maximum secure" "average secure" all doesn't really make sense to me. Well, it doesn't to me either unless the user feels a cost/benefit, so if max costs $100 per invocation and average costs nothing, most people would choose average unless they had a very good reason not to. In your migratable model, if we had separate limits for non-migratable and migratable, with non-migratable being set low to prevent exhaustion, max secure becomes a highly scarce resource, whereas average secure is abundant, then having the choice might make sense. I hope that we can find a way to handle the migration part internally. Especially, because Mike wants the default to be "as secure as possible", so if there is a flag, it would have to be an opt-out flag. I guess as long as we don't temporarily map it into the "owned" location in the direct map shared by all VCPUs we are in a good position. But this needs more thought, of course. The discussion regarding migratability only really popped up because this is a user-visible thing and not being able to migrate can be a real problem (fragmentation, ZONE_MOVABLE, ...). I think the biggest use will potentially come from hardware acceleration. If it becomes simple to add say encryption to a secret page with no cost, then no flag needed. However, if we only have a limited number of keys so once we run out no more encrypted memory then it becomes a costly resource and users might want a choice of being backed by encryption or not. Right. But wouldn't HW support with configurable keys etc. need more syscall parameters (meaning, even memfd_secret() as it is would not be sufficient?). I suspect the simplistic flag approach might not be sufficient. I might be wrong because I have no clue about MKTME and friends. Anyhow, I still think extending memfd_create() might just be good enough - at least for now. 
Things like HW support might have requirements we don't even know yet and that we cannot even model in memfd_secret() right now. -- Thanks, David / dhildenb
Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
On 16.02.21 17:25, James Bottomley wrote: On Mon, 2021-02-15 at 20:20 +0100, Michal Hocko wrote: [...] What kind of flags are we talking about and why would that be a problem with memfd_create interface? Could you be more specific please? You mean what were the ioctl flags in the patch series linked above? They were SECRETMEM_EXCLUSIVE and SECRETMEM_UNCACHED in patch 3/5. OK I see. How many potential modes are we talking about? A few or potentially many? Well I initially thought there were two (uncached or not) until you came up with the migratable or non-migratable, which affects the security properties. But now there's also potential for hardware backing, like mktme, described by flags as well. I suppose you could also use RDT to restrict which cache the data goes into: say L1 but not L2, to lessen the impact of fully uncached (although the big thrust of uncached was to blunt hyperthread side channels). So there is potential for quite a large expansion even though I'd be willing to bet that a lot of the modes people have thought about turn out not to be very effective in the field. Thanks for the insight. I remember that even the "uncached" part was effectively nacked by x86 maintainers (I might be wrong). For the other parts, the question is what we actually want to let user space configure. Being able to specify "Very secure" "maximum secure" "average secure" all doesn't really make sense to me. The discussion regarding migratability only really popped up because this is a user-visible thing and not being able to migrate can be a real problem (fragmentation, ZONE_MOVABLE, ...). -- Thanks, David / dhildenb
Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
> On 14.02.2021 at 10:20, Mike Rapoport wrote: > > On Fri, Feb 12, 2021 at 10:18:19AM +0100, David Hildenbrand wrote: >>> On 12.02.21 00:09, Mike Rapoport wrote: >>> On Thu, Feb 11, 2021 at 01:07:10PM +0100, David Hildenbrand wrote: >>>> On 11.02.21 12:27, Mike Rapoport wrote: >>>>> On Thu, Feb 11, 2021 at 10:01:32AM +0100, David Hildenbrand wrote: >>>> >>>> So let's talk about the main user-visible differences to other memfd files >>>> (especially, other purely virtual files like hugetlbfs). With secretmem: >>>> >>>> - File content can only be read/written via memory mappings. >>>> - File content cannot be swapped out. >>>> >>>> I think there are still valid ways to modify file content using syscalls: >>>> e.g., fallocate(PUNCH_HOLE). Things like truncate also seem to work just >>>> fine. >>> These work perfectly with any file, so maybe we should have added >>> memfd_create as a flag to open(2) back then and now the secretmem file >>> descriptors? >> >> I think open() vs memfd_create() makes sense: for open, the path specifies >> main properties (tmpfs, hugetlbfs, filesystem). On memfd, there is no such >> path and the "type" has to be specified differently. >> >> Also, open() might open existing files - memfd always creates new files. > > Yes, but still open() returns a handle to a file and memfd_create() returns > a handle to a file. The differences may be well hidden by e.g. O_MEMORY and > then features unique to memfd files will have their set of O_SOMETHING > flags. > Let's agree to disagree. > It's the same logic that says "we already have an interface that's close > enough and it's fine to add a bunch of new flags there". No, not quite. But let's agree to disagree. > > And here we come to the question "what are the differences that justify a > new system call?" and the answer to this is very subjective. And as such we > can continue bikeshedding forever. 
I think this fits into the existing memfd_create() syscall just fine, and I heard no compelling argument why it shouldn't. That's all I can say.
Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
On 12.02.21 00:09, Mike Rapoport wrote: On Thu, Feb 11, 2021 at 01:07:10PM +0100, David Hildenbrand wrote: On 11.02.21 12:27, Mike Rapoport wrote: On Thu, Feb 11, 2021 at 10:01:32AM +0100, David Hildenbrand wrote: So let's talk about the main user-visible differences to other memfd files (especially, other purely virtual files like hugetlbfs). With secretmem: - File content can only be read/written via memory mappings. - File content cannot be swapped out. I think there are still valid ways to modify file content using syscalls: e.g., fallocate(PUNCH_HOLE). Things like truncate also seem to work just fine. These work perfectly with any file, so maybe we should have added memfd_create as a flag to open(2) back then and now the secretmem file descriptors? I think open() vs memfd_create() makes sense: for open, the path specifies main properties (tmpfs, hugetlbfs, filesystem). On memfd, there is no such path and the "type" has to be specified differently. Also, open() might open existing files - memfd always creates new files. AFAIKS, we would need MFD_SECRET and disallow MFD_ALLOW_SEALING and MFD_HUGETLB. So here we start to multiplex. Yes. And as Michal said, maybe we can support combinations in the future. Isn't there a general agreement that syscall multiplexing is not a good thing? Looking at mmap(), madvise(), fallocate(), I think multiplexing is just fine and flags can be mutually exclusive - as long as we're not squashing completely unrelated things into a single system call. As one example: we don't have mmap_private() vs. mmap_shared() vs. mmap_shared_validate(). E.g., MAP_SYNC is only available for MAP_SHARED_VALIDATE. memfd_create already has flags validation that does not look very nice. I assume you're talking about the hugetlb size specifications, right? It's not nice but fairly compact. Adding there only MFD_SECRET will make it a bit less nice, but when we'll grow new functionality into secretmem that will become horrible. What do you have in mind? 
A couple of MFD_SECRET_* flags that only work with MFD_SECRET won't hurt IMHO. Just like we allow MFD_HUGE_* only with MFD_HUGETLB. Thanks, David / dhildenb
Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
On 11.02.21 12:27, Mike Rapoport wrote: On Thu, Feb 11, 2021 at 10:01:32AM +0100, David Hildenbrand wrote: On 11.02.21 09:39, Michal Hocko wrote: On Thu 11-02-21 09:13:19, Mike Rapoport wrote: On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote: On Tue 09-02-21 11:09:38, Mike Rapoport wrote: [...] Citing my older email: I've hesitated whether to continue to use new flags to memfd_create() or to add a new system call and I've decided to use a new system call after I've started to look into man pages update. There would have been two completely independent descriptions and I think it would have been very confusing. Could you elaborate? Unmapping from the kernel address space can work both for sealed or hugetlb memfds, no? Those features are completely orthogonal AFAICS. With a dedicated syscall you will need to introduce this functionality on top if that is required. Have you considered that? I mean hugetlb pages are used to back guest memory very often. Is this something that will be a secret memory usecase? Please be really specific when giving arguments to back a new syscall decision. Isn't "syscalls have completely independent description" specific enough? No, it's not as you can see from questions I've had above. More on that below. We are talking about API here, not the implementation details whether secretmem supports large pages or not. The purpose of memfd_create() is to create a file-like access to memory. The purpose of memfd_secret() is to create a way to access memory hidden from the kernel. I don't think overloading memfd_create() with the secretmem flags because they happen to return a file descriptor will be better for users, but rather will be more confusing. This is quite a subjective conclusion. I could very well argue that it would be much better to have a single syscall to get a fd backed memory with specific requirements (sealing, unmapping from the kernel address space). Neither of us would be clearly right or wrong. 
A more important point is a future extensibility and usability, though. So let's just think of few usecases I have outlined above. Is it unrealistic to expect that secret memory should be sealable? What about hugetlb? Because if the answer is no then a new API is a clear win as the combination of flags would never work and then we would just suffer from the syscall multiplexing without much gain. On the other hand if combination of the functionality is to be expected then you will have to jam it into memfd_create and copy the interface likely causing more confusion. See what I mean? I by no means do not insist one way or the other but from what I have seen so far I have a feeling that the interface hasn't been thought through enough. Sure you have landed with fd based approach and that seems fair. But how to get that fd seems to still have some gaps IMHO. I agree with Michal. This has been raised by different people already, including on LWN (https://lwn.net/Articles/835342/). I can follow Mike's reasoning (man page), and I am also fine if there is a valid reason. However, IMHO the basic description seems to match quite good: memfd_create() creates an anonymous file and returns a file descriptor that refers to it. The file behaves like a regular file, and so can be modified, truncated, memory-mapped, and so on. However, unlike a regular file, it lives in RAM and has a volatile backing storage. Once all references to the file are dropped, it is automatically released. Anonymous memory is used for all backing pages of the file. Therefore, files created by memfd_create() have the same semantics as other anonymous memory allocations such as those allocated using mmap(2) with the MAP_ANONYMOUS flag. Even despite my laziness and huge amount of copy-paste you can spot the differences (this is a very old version, update is due): memfd_secret() creates an anonymous file and returns a file descriptor that refers to it. 
The file can only be memory-mapped; the memory in such mapping will have stronger protection than usual memory mapped files, and so it can be used to store application secrets. Unlike a regular file, a file created with memfd_secret() lives in RAM and has a volatile backing storage. Once all references to the file are dropped, it is automatically released. The initial size of the file is set to 0. Following the call, the file size should be set using ftruncate(2). The memory areas obtained with mmap(2) from the file descriptor are exclusive to the owning context. These areas are removed from the kernel page tables and only the page table of the process holding the file descriptor maps the corresponding physical memory. So let's talk about the main user-visible differences to other memfd files.
Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
On 11.02.21 10:38, Michal Hocko wrote: On Thu 11-02-21 10:01:32, David Hildenbrand wrote: [...] AFAIKS, we would need MFD_SECRET and disallow MFD_ALLOW_SEALING and MFD_HUGETLB. Yes for an initial version. But I do expect a request to support both features is just a matter of time. In addition, we could add MFD_SECRET_NEVER_MAP, which could disallow any kind of temporary mappings (e.g., for migration). TBC. I believe this is the mode Mike wants to have by default. A more relaxed one would be an opt-in. MFD_SECRET_RELAXED which would allow temporary mappings in the kernel for content copying (e.g. for migration). --- Some random thoughts regarding files. What is the page size of secretmem memory? Sometimes we use huge pages, sometimes we fall back to 4k pages. So I assume huge pages in general? Unless there is an explicit request for hugetlb I would say the page size is not really important like for any other fds. Huge pages can be used transparently. What are semantics of MADV()/FALLOCATE() etc on such files? I would expect the same semantic as regular shmem (memfd_create) except the memory doesn't have _any_ backing storage which makes it unevictable. So the reclaim related madv won't work but there shouldn't be any real reason why e.g. MADV_DONTNEED, WILLNEED, DONT_FORK and others don't work. Another thought regarding "doesn't have _any_ backing storage" What are the right semantics when it comes to memory accounting/commit? As secretmem does not have a) any backing storage b) cannot go to swap The MAP_NORESERVE vs. !MAP_NORESERVE handling gets a little unclear. Why "reserve swap space" if the allocations cannot ever go to swap? Sure, we want to "reserve physical memory", but in contrast to other users that can go to swap. Of course, this is only relevant for MAP_PRIVATE secretmem mappings. Other MAP_SHARED assumes there is no need for reserving swap space as it can just go to the backing storage. 
(yeah, tmpfs/shmem is weird in that regard as well, but again, it's a bit different) -- Thanks, David / dhildenb
Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
Some random thoughts regarding files. What is the page size of secretmem memory? Sometimes we use huge pages, sometimes we fall back to 4k pages. So I assume huge pages in general? Unless there is an explicit request for hugetlb I would say the page size is not really important like for any other fds. Huge pages can be used transparently. If everything is currently allocated/mapped on PTE granularity, then yes I agree. I remember previous versions used to "pool 2MB pages", which might have been problematic (thus, my concerns regarding mmap() etc.). If that part is now gone, good! What are semantics of MADV()/FALLOCATE() etc on such files? I would expect the same semantic as regular shmem (memfd_create) except the memory doesn't have _any_ backing storage which makes it unevictable. So the reclaim related madv won't work but there shouldn't be any real reason why e.g. MADV_DONTNEED, WILLNEED, DONT_FORK and others don't work. Agreed if we don't have hugepage semantics. Is userfaultfd() properly fenced? Or does it even work (doubt)? How does it behave if I mmap(FIXED) something in between? In which granularity can I do that (->page-size?)? Again, nothing really exceptional here. This is a mapping like any other from address space manipulation POV. Agreed with the PTE mapping approach. What are other granularity restrictions (->page size)? Don't want to open a big discussion here, just some random thoughts. Maybe it has all been already figured out and most of the answers above are "Fails with -EINVAL". I think that the behavior should be really in sync with shmem semantic as much as possible. Most operations should simply work with an additional direct map manipulation. There is no real reason to be special. Some functionality might be missing, e.g. hugetlb support but that has been traditionally added on top of shmem interface so nothing really new here. Agreed! 
-- Thanks, David / dhildenb
Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
On 11.02.21 09:39, Michal Hocko wrote: On Thu 11-02-21 09:13:19, Mike Rapoport wrote: On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote: On Tue 09-02-21 11:09:38, Mike Rapoport wrote: [...] Citing my older email: I've hesitated whether to continue to use new flags to memfd_create() or to add a new system call and I've decided to use a new system call after I've started to look into man pages update. There would have been two completely independent descriptions and I think it would have been very confusing. Could you elaborate? Unmapping from the kernel address space can work both for sealed or hugetlb memfds, no? Those features are completely orthogonal AFAICS. With a dedicated syscall you will need to introduce this functionality on top if that is required. Have you considered that? I mean hugetlb pages are used to back guest memory very often. Is this something that will be a secret memory usecase? Please be really specific when giving arguments to back a new syscall decision. Isn't "syscalls have completely independent description" specific enough? No, it's not as you can see from questions I've had above. More on that below. We are talking about API here, not the implementation details whether secretmem supports large pages or not. The purpose of memfd_create() is to create a file-like access to memory. The purpose of memfd_secret() is to create a way to access memory hidden from the kernel. I don't think overloading memfd_create() with the secretmem flags because they happen to return a file descriptor will be better for users, but rather will be more confusing. This is quite a subjective conclusion. I could very well argue that it would be much better to have a single syscall to get a fd backed memory with specific requirements (sealing, unmapping from the kernel address space). Neither of us would be clearly right or wrong. A more important point is a future extensibility and usability, though. 
So let's just think of few usecases I have outlined above. Is it unrealistic to expect that secret memory should be sealable? What about hugetlb? Because if the answer is no then a new API is a clear win as the combination of flags would never work and then we would just suffer from the syscall multiplexing without much gain. On the other hand if combination of the functionality is to be expected then you will have to jam it into memfd_create and copy the interface likely causing more confusion. See what I mean? I by no means do not insist one way or the other but from what I have seen so far I have a feeling that the interface hasn't been thought through enough. Sure you have landed with fd based approach and that seems fair. But how to get that fd seems to still have some gaps IMHO. I agree with Michal. This has been raised by different people already, including on LWN (https://lwn.net/Articles/835342/). I can follow Mike's reasoning (man page), and I am also fine if there is a valid reason. However, IMHO the basic description seems to match quite good: memfd_create() creates an anonymous file and returns a file descriptor that refers to it. The file behaves like a regular file, and so can be modified, truncated, memory-mapped, and so on. However, unlike a regular file, it lives in RAM and has a volatile backing storage. Once all references to the file are dropped, it is automatically released. Anonymous memory is used for all backing pages of the file. Therefore, files created by memfd_create() have the same semantics as other anonymous memory allocations such as those allocated using mmap(2) with the MAP_ANONYMOUS flag. AFAIKS, we would need MFD_SECRET and disallow MFD_ALLOW_SEALING and MFD_HUGETLB. In addition, we could add MFD_SECRET_NEVER_MAP, which could disallow any kind of temporary mappings (e.g., for migration). TBC. --- Some random thoughts regarding files. What is the page size of secretmem memory? 
Sometimes we use huge pages, sometimes we fall back to 4k pages. So I assume huge pages in general? What are semantics of MADV()/FALLOCATE() etc on such files? I assume PUNCH_HOLE fails in a nice way? Does it work? Does mremap()/mremap(FIXED) work/is it blocked? Does mprotect() fail in a nice way? Is userfaultfd() properly fenced? Or does it even work (doubt)? How does it behave if I mmap(FIXED) something in between? In which granularity can I do that (->page-size?)? What are other granularity restrictions (->page size)? Don't want to open a big discussion here, just some random thoughts. Maybe it has all been already figured out and most of the answers above are "Fails with -EINVAL". -- Thanks, David / dhildenb
Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
On 09.02.21 14:25, Michal Hocko wrote: On Tue 09-02-21 11:23:35, David Hildenbrand wrote: [...] I am constantly trying to fight for making more stuff MOVABLE instead of going into the other direction (e.g., because it's easier to implement, which feels like the wrong direction). Maybe I am the only person that really cares about ZONE_MOVABLE these days :) I can't stop such new stuff from popping up, so at least I want it to be documented. MOVABLE zone is certainly an important thing to keep working. And there is still quite a lot of work on the way. But as I've said this is more of an outlier than a norm. On the other hand movable zone is kind of a hard requirement for a lot of applications and it is to be expected that many features will be less than 100% compatible. Some usecases even impossible. That's why I am arguing that we should have a central document where the movable zone is documented with all the potential problems we have encountered over time and explicitly state which features are fully/partially incompatible. I'll send a mail during the next weeks to gather current restrictions to document them (and include my brain dump). We might see more excessive use of ZONE_MOVABLE in the future and as history told us, of CMA as well. We really should start documenting/caring. @Mike, it would be sufficient for me if one of your patches at least mentions the situation in the description like "Please note that secretmem currently behaves much more like long-term GUP instead of mlocked memory; secretmem is unmovable memory directly consumed/controlled by user space. secretmem cannot be placed onto ZONE_MOVABLE/CMA. As long as there is no excessive use of secretmem (e.g., maximum of 16 MiB for selected processes) in combination with ZONE_MOVABLE/CMA, this is barely a real issue. However, it is something to keep in mind when a significant amount of system RAM might be used for secretmem. 
In the future, we might support migration of secretmem and make it look much more like mlocked memory instead." Just a suggestion. -- Thanks, David / dhildenb
Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
On 09.02.21 11:23, David Hildenbrand wrote: A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE. As I've said it is quite easy to land at the similar situation even with tmpfs/MAP_ANON|MAP_SHARED on swapless system. Neither of the two is really uncommon. It would be even worse that those would be allowed to consume both CMA/ZONE_MOVABLE. IIRC, tmpfs/MAP_ANON|MAP_SHARED memory a) Is movable, can land in ZONE_MOVABLE/CMA b) Can be limited by sizing tmpfs appropriately AFAIK, what you describe is a problem with memory overcommit, not with zone imbalances (below). Or what am I missing? It can be problem for both. If you have just too much of shm (do not forget about MAP_SHARED|MAP_ANON which is much harder to size from an admin POV) then migrateability doesn't really help because you need a free memory to migrate. Without reclaimability this can easily become a problem. That is why I am saying this is not really a new problem. Swapless systems are not all that uncommon. I get your point, it's similar but still different. "no memory in the system" vs. "plenty of unusable free memory available in the system". In many setups, memory for user space applications can go to ZONE_MOVABLE just fine. ZONE_NORMAL etc. can be used for supporting user space memory (e.g., page tables) and other kernel stuff. Like, have 4GB of ZONE_MOVABLE with 2GB of ZONE_NORMAL. Have an application (database) that allocates 4GB of memory. Works just fine. The zone ratio ends up being a problem for example with many processes (-> many page tables). Not being able to put user space memory into the movable zone is a special case. And we are introducing yet another special case here (besides vfio, rdma, unmigratable huge pages like gigantic pages). With plenty of secretmem, looking at /proc/meminfo Total vs. Free can be a big lie of how your system behaves. One has to be very careful when relying on CMA or movable zones. 
This is definitely worth a comment in the kernel command line parameter documentation. But this is not a new problem. I see the following thing worth documenting: Assume you have a system with 2GB of ZONE_NORMAL/ZONE_DMA and 4GB of ZONE_MOVABLE/CMA. Assume you make use of 1.5GB of secretmem. Your system might run into OOM any time although you still have plenty of memory on ZONE_MOVABLE (and even swap!), simply because you are making excessive use of unmovable allocations (for user space!) in an environment where you should not make excessive use of unmovable allocations (e.g., where should page tables go?). yes, you are right of course and I am not really disputing this. But I would argue that 2:1 Movable/Normal is already something to expect problems with. "Lowmem" allocations can easily trigger OOM even without secret mem in the picture. All it takes is allocating a lot of GFP_KERNEL or even GFP_{HIGH}USER. Really, it is CMA/MOVABLE that are the elephant in the room and one has to be really careful when relying on them. Right, it's all about what the setup actually needs. Sure, there are cases where you need significantly more GFP_KERNEL/GFP_{HIGH}USER such that a 2:1 ratio is not feasible. But I claim that these are corner cases. Secretmem gives user space the option to allocate a lot of GFP_{HIGH}USER memory. If I am not wrong, "ulimit -a" tells me that each application on F33 can allocate 16 GiB (!) of secretmem. Got to learn to do my math. It's 16 MiB - so as a default it's less dangerous than I thought! -- Thanks, David / dhildenb
Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE. As I've said it is quite easy to land in a similar situation even with tmpfs/MAP_ANON|MAP_SHARED on a swapless system. Neither of the two is really uncommon. It would be even worse if those were allowed to consume both CMA/ZONE_MOVABLE. IIRC, tmpfs/MAP_ANON|MAP_SHARED memory a) Is movable, can land in ZONE_MOVABLE/CMA b) Can be limited by sizing tmpfs appropriately AFAIK, what you describe is a problem with memory overcommit, not with zone imbalances (below). Or what am I missing? It can be a problem for both. If you have just too much shm (do not forget about MAP_SHARED|MAP_ANON which is much harder to size from an admin POV) then migratability doesn't really help because you need free memory to migrate to. Without reclaimability this can easily become a problem. That is why I am saying this is not really a new problem. Swapless systems are not all that uncommon. I get your point, it's similar but still different. "no memory in the system" vs. "plenty of unusable free memory available in the system". In many setups, memory for user space applications can go to ZONE_MOVABLE just fine. ZONE_NORMAL etc. can be used for supporting user space memory (e.g., page tables) and other kernel stuff. Like, have 4GB of ZONE_MOVABLE with 2GB of ZONE_NORMAL. Have an application (database) that allocates 4GB of memory. Works just fine. The zone ratio ends up being a problem for example with many processes (-> many page tables). Not being able to put user space memory into the movable zone is a special case. And we are introducing yet another special case here (besides vfio, rdma, unmigratable huge pages like gigantic pages). With plenty of secretmem, looking at /proc/meminfo Total vs. Free can be a big lie about how your system behaves. One has to be very careful when relying on CMA or movable zones. This is definitely worth a comment in the kernel command line parameter documentation. 
But this is not a new problem. I see the following thing worth documenting: Assume you have a system with 2GB of ZONE_NORMAL/ZONE_DMA and 4GB of ZONE_MOVABLE/CMA. Assume you make use of 1.5GB of secretmem. Your system might run into OOM at any time although you still have plenty of memory on ZONE_MOVABLE (and even swap!), simply because you are making excessive use of unmovable allocations (for user space!) in an environment where you should not make excessive use of unmovable allocations (e.g., where should page tables go?). yes, you are right of course and I am not really disputing this. But I would argue that with a 2:1 Movable/Normal ratio one should already expect problems. "Lowmem" allocations can easily trigger OOM even without secretmem in the picture. All it takes is allocating a lot of GFP_KERNEL or even GFP_{HIGH}USER. Really, it is CMA/MOVABLE that are the elephant in the room and one has to be really careful when relying on them. Right, it's all about what the setup actually needs. Sure, there are cases where you need significantly more GFP_KERNEL/GFP_{HIGH}USER such that a 2:1 ratio is not feasible. But I claim that these are corner cases. Secretmem gives user space the option to allocate a lot of GFP_{HIGH}USER memory. If I am not wrong, "ulimit -a" tells me that each application on F33 can allocate 16 GiB (!) of secretmem. What other ways do you know of where random user space can do something similar? I'd be curious what other scenarios there are where user space can easily allocate a lot of unmovable memory. The existing controls (mlock limit) don't really match the current semantics of that memory. I repeat it once again: secretmem *currently* resembles long-term pinned memory, not mlocked memory. Well, if we had proper user space pinning accounting then I would agree that there is a better model to use. But we don't. And previous attempts to achieve that have failed. So I am afraid that we do not have much choice left other than using mlock as a model. 
Yes, I agree. Things will change when implementing migration support for secretmem pages. Until then, the semantics are different and this should be spelled out. For long-term pinnings this is kind of obvious, still we're now documenting it because it's dangerous to not be aware of. Secretmem behaves exactly the same and I think this is worth spelling out: secretmem has the potential of being used much more often than the fairly special vfio/rdma/ ... yeah I do agree that pinning is a problem for movable/CMA but most people simply do not care about those. Movable is the thing for hotplug and a really weird fragmentation avoidance IIRC, and CMA is mostly there to handle crap HW (plus handling gigantic pages dynamically). If those are to be used along with secretmem or longterm GUP then they will constantly bump into corner cases. Do not take me wrong, we should be looking at those problems, we should even document them but I do not see this as anything new. We
Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
On 09.02.21 09:59, Michal Hocko wrote: On Mon 08-02-21 22:38:03, David Hildenbrand wrote: On 08.02.2021 22:13, Mike Rapoport wrote: On Mon, Feb 08, 2021 at 10:27:18AM +0100, David Hildenbrand wrote: On 08.02.21 09:49, Mike Rapoport wrote: Some questions (and a request to document the answers) as we now allow unmovable allocations all over the place and I don't see a single comment regarding that in the cover letter: 1. How will the issue of plenty of unmovable allocations for user space be tackled in the future? 2. How has this issue been documented? E.g., interaction with ZONE_MOVABLE and CMA, alloc_contig_range()/alloc_contig_pages()? Secretmem sets the mappings gfp mask to GFP_HIGHUSER, so it does not allocate movable pages in the first place. That is not the point. Secretmem cannot go on CMA / ZONE_MOVABLE memory and behaves like long-term pinnings in that sense. This is a real issue when using a lot of secretmem. A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE. As I've said it is quite easy to land in a similar situation even with tmpfs/MAP_ANON|MAP_SHARED on a swapless system. Neither of the two is really uncommon. It would be even worse if those were allowed to consume both CMA/ZONE_MOVABLE. IIRC, tmpfs/MAP_ANON|MAP_SHARED memory a) Is movable, can land in ZONE_MOVABLE/CMA b) Can be limited by sizing tmpfs appropriately AFAIK, what you describe is a problem with memory overcommit, not with zone imbalances (below). Or what am I missing? One has to be very careful when relying on CMA or movable zones. This is definitely worth a comment in the kernel command line parameter documentation. But this is not a new problem. I see the following thing worth documenting: Assume you have a system with 2GB of ZONE_NORMAL/ZONE_DMA and 4GB of ZONE_MOVABLE/CMA. Assume you make use of 1.5GB of secretmem. 
Your system might run into OOM at any time although you still have plenty of memory on ZONE_MOVABLE (and even swap!), simply because you are making excessive use of unmovable allocations (for user space!) in an environment where you should not make excessive use of unmovable allocations (e.g., where should page tables go?). The existing controls (mlock limit) don't really match the current semantics of that memory. I repeat it once again: secretmem *currently* resembles long-term pinned memory, not mlocked memory. Things will change when implementing migration support for secretmem pages. Until then, the semantics are different and this should be spelled out. For long-term pinnings this is kind of obvious, still we're now documenting it because it's dangerous to not be aware of. Secretmem behaves exactly the same and I think this is worth spelling out: secretmem has the potential of being used much more often than fairly special vfio/rdma/ ... Looking at a cover letter that doesn't even mention the issue of unmovable allocations makes me think that we are either trying to ignore the problem or are not aware of the problem. -- Thanks, David / dhildenb
Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
> On 08.02.2021 22:13, Mike Rapoport wrote: > > On Mon, Feb 08, 2021 at 10:27:18AM +0100, David Hildenbrand wrote: >> On 08.02.21 09:49, Mike Rapoport wrote: >> >> Some questions (and a request to document the answers) as we now allow >> unmovable allocations all over the place and I don't see a single comment >> regarding that in the cover letter: >> >> 1. How will the issue of plenty of unmovable allocations for user space be >> tackled in the future? >> >> 2. How has this issue been documented? E.g., interaction with ZONE_MOVABLE >> and CMA, alloc_contig_range()/alloc_contig_pages()? > > Secretmem sets the mappings gfp mask to GFP_HIGHUSER, so it does not > allocate movable pages in the first place. That is not the point. Secretmem cannot go on CMA / ZONE_MOVABLE memory and behaves like long-term pinnings in that sense. This is a real issue when using a lot of secretmem. Please have a look at what Pavel documents regarding long term pinnings and ZONE_MOVABLE in his patches currently on the list. 
Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
On 08.02.21 13:17, Michal Hocko wrote: On Mon 08-02-21 12:26:31, David Hildenbrand wrote: [...] My F33 system happily hibernates to disk, even with an application that succeeded in doing an mlockall(). And it somewhat makes sense. Even my freshly-booted, idle F33 has $ cat /proc/meminfo | grep lock Mlocked:4860 kB So, refusing to hibernate when there is mlocked memory would essentially prohibit any modern Linux distro from ever hibernating. My system seems to be completely fine without mlocked memory. It would be interesting to see who mlocks memory on your system and check whether the expected mlock semantic really works for those. This should be documented at least. I checked some other installations (Ubuntu, RHEL), and they also show no sign of mlocked memory. My notebook (F33) and desktop (F33) both have mlocked memory. Either related to F33 or due to some software (e.g., kerberos). -- Thanks, David / dhildenb
Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
On 08.02.21 12:14, David Hildenbrand wrote: On 08.02.21 12:13, David Hildenbrand wrote: On 08.02.21 11:57, Michal Hocko wrote: On Mon 08-02-21 11:53:58, David Hildenbrand wrote: On 08.02.21 11:51, Michal Hocko wrote: On Mon 08-02-21 11:32:11, David Hildenbrand wrote: On 08.02.21 11:18, Michal Hocko wrote: On Mon 08-02-21 10:49:18, Mike Rapoport wrote: From: Mike Rapoport It is unsafe to allow saving of secretmem areas to the hibernation snapshot as they would be visible after the resume and this would essentially defeat the purpose of secret memory mappings. Prevent hibernation whenever there are active secret memory users. Does this feature need any special handling? As it is effectively unevictable memory it should behave the same as other mlocked or ramfs memory, which should already disable hibernation as it cannot be swapped out, no? Why should unevictable memory not go to swap when hibernating? We're merely dumping all of our system RAM (including any unmovable allocations) to swap storage and the system is essentially completely halted. My understanding is that mlock is never really made visible via swap storage. "Using swap storage for hibernation" and "swapping at runtime" are two different things. I might be wrong, though. Well, mlock is certainly used to keep sensitive information, not only to protect from major/minor faults. I think you're right in theory, the man page mentions "Cryptographic security software often handles critical bytes like passwords or secret keys as data structures" ... however, I am not aware of any such swap handling and wasn't able to spot it quickly. Let me take a closer look. s/swap/hibernate/ My F33 system happily hibernates to disk, even with an application that succeeded in doing an mlockall(). And it somewhat makes sense. Even my freshly-booted, idle F33 has $ cat /proc/meminfo | grep lock Mlocked:4860 kB So, refusing to hibernate when there is mlocked memory would essentially prohibit any modern Linux distro from ever hibernating. 
-- Thanks, David / dhildenb 
Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
On 08.02.21 12:13, David Hildenbrand wrote: On 08.02.21 11:57, Michal Hocko wrote: On Mon 08-02-21 11:53:58, David Hildenbrand wrote: On 08.02.21 11:51, Michal Hocko wrote: On Mon 08-02-21 11:32:11, David Hildenbrand wrote: On 08.02.21 11:18, Michal Hocko wrote: On Mon 08-02-21 10:49:18, Mike Rapoport wrote: From: Mike Rapoport It is unsafe to allow saving of secretmem areas to the hibernation snapshot as they would be visible after the resume and this would essentially defeat the purpose of secret memory mappings. Prevent hibernation whenever there are active secret memory users. Does this feature need any special handling? As it is effectively unevictable memory it should behave the same as other mlocked or ramfs memory, which should already disable hibernation as it cannot be swapped out, no? Why should unevictable memory not go to swap when hibernating? We're merely dumping all of our system RAM (including any unmovable allocations) to swap storage and the system is essentially completely halted. My understanding is that mlock is never really made visible via swap storage. "Using swap storage for hibernation" and "swapping at runtime" are two different things. I might be wrong, though. Well, mlock is certainly used to keep sensitive information, not only to protect from major/minor faults. I think you're right in theory, the man page mentions "Cryptographic security software often handles critical bytes like passwords or secret keys as data structures" ... however, I am not aware of any such swap handling and wasn't able to spot it quickly. Let me take a closer look. s/swap/hibernate/ -- Thanks, David / dhildenb
Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
On 08.02.21 11:57, Michal Hocko wrote: On Mon 08-02-21 11:53:58, David Hildenbrand wrote: On 08.02.21 11:51, Michal Hocko wrote: On Mon 08-02-21 11:32:11, David Hildenbrand wrote: On 08.02.21 11:18, Michal Hocko wrote: On Mon 08-02-21 10:49:18, Mike Rapoport wrote: From: Mike Rapoport It is unsafe to allow saving of secretmem areas to the hibernation snapshot as they would be visible after the resume and this would essentially defeat the purpose of secret memory mappings. Prevent hibernation whenever there are active secret memory users. Does this feature need any special handling? As it is effectively unevictable memory it should behave the same as other mlocked or ramfs memory, which should already disable hibernation as it cannot be swapped out, no? Why should unevictable memory not go to swap when hibernating? We're merely dumping all of our system RAM (including any unmovable allocations) to swap storage and the system is essentially completely halted. My understanding is that mlock is never really made visible via swap storage. "Using swap storage for hibernation" and "swapping at runtime" are two different things. I might be wrong, though. Well, mlock is certainly used to keep sensitive information, not only to protect from major/minor faults. I think you're right in theory, the man page mentions "Cryptographic security software often handles critical bytes like passwords or secret keys as data structures" ... however, I am not aware of any such swap handling and wasn't able to spot it quickly. Let me take a closer look. -- Thanks, David / dhildenb
Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
On 08.02.21 11:51, Michal Hocko wrote: On Mon 08-02-21 11:32:11, David Hildenbrand wrote: On 08.02.21 11:18, Michal Hocko wrote: On Mon 08-02-21 10:49:18, Mike Rapoport wrote: From: Mike Rapoport It is unsafe to allow saving of secretmem areas to the hibernation snapshot as they would be visible after the resume and this would essentially defeat the purpose of secret memory mappings. Prevent hibernation whenever there are active secret memory users. Does this feature need any special handling? As it is effectively unevictable memory it should behave the same as other mlocked or ramfs memory, which should already disable hibernation as it cannot be swapped out, no? Why should unevictable memory not go to swap when hibernating? We're merely dumping all of our system RAM (including any unmovable allocations) to swap storage and the system is essentially completely halted. My understanding is that mlock is never really made visible via swap storage. "Using swap storage for hibernation" and "swapping at runtime" are two different things. I might be wrong, though. -- Thanks, David / dhildenb
Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
On 08.02.21 11:18, Michal Hocko wrote: On Mon 08-02-21 10:49:18, Mike Rapoport wrote: From: Mike Rapoport It is unsafe to allow saving of secretmem areas to the hibernation snapshot as they would be visible after the resume and this would essentially defeat the purpose of secret memory mappings. Prevent hibernation whenever there are active secret memory users. Does this feature need any special handling? As it is effectively unevictable memory it should behave the same as other mlocked or ramfs memory, which should already disable hibernation as it cannot be swapped out, no? Why should unevictable memory not go to swap when hibernating? We're merely dumping all of our system RAM (including any unmovable allocations) to swap storage and the system is essentially completely halted. -- Thanks, David / dhildenb
Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
On 08.02.21 09:49, Mike Rapoport wrote: From: Mike Rapoport Hi, @Andrew, this is based on v5.11-rc5-mmotm-2021-01-27-23-30, with secretmem and related patches dropped from there, I can rebase whatever way you prefer. This is an implementation of "secret" mappings backed by a file descriptor. The file descriptor backing secret memory mappings is created using a dedicated memfd_secret system call. The desired protection mode for the memory is configured using the flags parameter of the system call. The mmap() of the file descriptor created with memfd_secret() will create a "secret" memory mapping. The pages in that mapping will be marked as not present in the direct map and will be present only in the page table of the owning mm. Although normally Linux userspace mappings are protected from other users, such secret mappings are useful for environments where a hostile tenant is trying to trick the kernel into giving them access to other tenants' mappings. Additionally, in the future the secret mappings may be used as a means to protect guest memory in a virtual machine host. For a demonstration of secret memory usage we've created a userspace library https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git that does two things: the first is to act as a preloader for openssl to redirect all the OPENSSL_malloc calls to secret memory, meaning any secret keys get automatically protected this way, and the other thing it does is expose the API to the user who needs it. We anticipate that a lot of the use cases would be like the openssl one: many toolkits that deal with secret keys already have special handling for the memory to try to give them greater protection, so this would simply be pluggable into the toolkits without any need for user application modification. Hiding secret memory mappings behind an anonymous file allows usage of the page cache for tracking pages allocated for the "secret" mappings as well as using address_space_operations for e.g. 
page migration callbacks. The anonymous file may also be used implicitly, like hugetlb files, to implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm ABIs in the future. Removing pages from the direct map may cause fragmentation on architectures that use large pages to map the physical memory, which affects the system performance. However, the original Kconfig text for CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736 ("x86: add gbpages switches")) and the recent report [1] showed that "... although 1G mappings are a good default choice, there is no compelling evidence that it must be the only choice". Hence, it is sufficient to have secretmem disabled by default with the ability of a system administrator to enable it at boot time. In addition, there is also a long term goal to improve management of the direct map. Some questions (and a request to document the answers) as we now allow unmovable allocations all over the place and I don't see a single comment regarding that in the cover letter: 1. How will the issue of plenty of unmovable allocations for user space be tackled in the future? 2. How has this issue been documented? E.g., interaction with ZONE_MOVABLE and CMA, alloc_contig_range()/alloc_contig_pages()? 3. What are the plans to support migration in the future and which interface changes will be required? (Michal mentioned some good points to make this configurable via the interface, we should plan ahead and document) Thanks! -- Thanks, David / dhildenb
Re: [PATCH] mm/pmem: Avoid inserting hugepage PTE entry with fsdax if hugepage support is disabled
On 05.02.21 03:39, Aneesh Kumar K.V wrote: Differentiate between hardware not supporting hugepages and the user disabling THP via 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'. For the devdax namespace, the kernel handles the above via the supported_alignment attribute and fails to initialize the namespace if the namespace align value is not supported on the platform. For the fsdax namespace, the kernel will continue to initialize the namespace. This can result in the kernel creating a huge pte entry even though the hardware doesn't support the same. We do want hugepage support with pmem even if the end-user disabled THP via the sysfs file (/sys/kernel/mm/transparent_hugepage/enabled). Hence differentiate between hardware/firmware lacking support vs a user-controlled disable of THP and prevent a huge fault if the hardware lacks hugepage support. Signed-off-by: Aneesh Kumar K.V --- include/linux/huge_mm.h | 15 +-- mm/huge_memory.c | 6 +- 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6a19f35f836b..ba973efcd369 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -78,6 +78,7 @@ static inline vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn,
 }
 
 enum transparent_hugepage_flag {
+	TRANSPARENT_HUGEPAGE_NEVER_DAX,
 	TRANSPARENT_HUGEPAGE_FLAG,
 	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
@@ -123,6 +124,13 @@ extern unsigned long transparent_hugepage_flags;
  */
 static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
+	/*
+	 * If the hardware/firmware marked hugepage support disabled.
+	 */
+	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_NEVER_DAX))
+		return false;
+
 	if (vma->vm_flags & VM_NOHUGEPAGE)
 		return false;
 
@@ -134,12 +142,7 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
 		return true;
 
-	/*
-	 * For dax vmas, try to always use hugepage mappings. If the kernel does
-	 * not support hugepages, fsdax mappings will fallback to PAGE_SIZE
-	 * mappings, and device-dax namespaces, that try to guarantee a given
-	 * mapping size, will fail to enable
-	 */
+
 	if (vma_is_dax(vma))
 		return true;
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9237976abe72..d698b7e27447 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -386,7 +386,11 @@ static int __init hugepage_init(void)
 	struct kobject *hugepage_kobj;
 
 	if (!has_transparent_hugepage()) {
-		transparent_hugepage_flags = 0;
+		/*
+		 * Hardware doesn't support hugepages, hence disable
+		 * DAX PMD support.
+		 */
+		transparent_hugepage_flags = 1 << TRANSPARENT_HUGEPAGE_NEVER_DAX;
 		return -EINVAL;
 	}

Looks sane to me from my limited understanding of that code :) -- Thanks, David / dhildenb
Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation
On 29.01.21 09:51, Michal Hocko wrote: On Fri 29-01-21 09:21:28, Mike Rapoport wrote: On Thu, Jan 28, 2021 at 02:01:06PM +0100, Michal Hocko wrote: On Thu 28-01-21 11:22:59, Mike Rapoport wrote: And hugetlb pools may also be depleted by anybody by calling mmap(MAP_HUGETLB) and there is no limiting knob for this, while secretmem has RLIMIT_MEMLOCK. Yes it can fail. But it would fail at mmap time when the reservation fails. Not during #PF time, which can be at any time. It may fail at #PF time as well: hugetlb_fault() hugetlb_no_page() ... alloc_huge_page() alloc_gigantic_page() cma_alloc() -ENOMEM; I would have to double check. From what I remember the cma allocator is an optimization to increase the chances to allocate hugetlb pages when overcommitting, because pages should normally be pre-allocated in the pool and reserved at mmap time. But even if a hugetlb page is not pre-allocated then this will get propagated as SIGBUS unless that has changed. It's an optimization to allocate gigantic pages dynamically later (so not using memblock during boot). Not just for overcommit, but for any kind of allocation. The actual allocation from cma should happen when setting nr_pages: nr_hugepages_store_common()->set_max_huge_pages()->alloc_pool_huge_page()...->alloc_gigantic_page() The path described above seems to be trying to overcommit gigantic pages, something that can be expected to SIGBUS. Reservations are handled via the pre-allocated pool. -- Thanks, David / dhildenb
Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation
On 02.02.21 15:32, Michal Hocko wrote: On Tue 02-02-21 15:26:20, David Hildenbrand wrote: On 02.02.21 15:22, Michal Hocko wrote: On Tue 02-02-21 15:12:21, David Hildenbrand wrote: [...] I think secretmem behaves much more like longterm GUP right now ("unmigratable", "lifetime controlled by user space", "cannot go on CMA/ZONE_MOVABLE"). I'd either want to reasonably well control/limit it or make it behave more like mlocked pages. I thought I had already asked but I must have forgotten. Is there any actual reason why the memory is not movable? Timing attacks? I think the reason is simple: no direct map, no copying of memory. This is an implementation detail though and not something terribly hard to add on top later on. I was more worried there would be a really fundamental reason why this is not possible. E.g. security implications. I don't remember all the details. Let's see what Mike thinks regarding migration (e.g., security concerns). -- Thanks, David / dhildenb
Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation
On 02.02.21 15:22, Michal Hocko wrote: On Tue 02-02-21 15:12:21, David Hildenbrand wrote: [...] I think secretmem behaves much more like longterm GUP right now ("unmigratable", "lifetime controlled by user space", "cannot go on CMA/ZONE_MOVABLE"). I'd either want to reasonably well control/limit it or make it behave more like mlocked pages. I thought I have already asked but I must have forgotten. Is there any actual reason why the memory is not movable? Timing attacks? I think the reason is simple: no direct map, no copying of memory. As I mentioned, we would have to temporarily map in order to copy. Mapping it somewhere else (like kmap), outside of the direct map might reduce possible attacks. -- Thanks, David / dhildenb
Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation
On 02.02.21 14:32, Michal Hocko wrote: On Tue 02-02-21 14:14:09, David Hildenbrand wrote: [...] As already expressed, I dislike allowing user space to consume an unlimited number of unmovable/unmigratable allocations. We already have that in some cases with huge pages (when the arch does not support migration) - but there we can at least manage the consumption using the whole max/reserved/free/... infrastructure. In addition, adding arch support for migration shouldn't be too complicated. Well, mlock is not too different here either. Hugepages are arguably an easier model because they require explicit pre-configuration by an admin. Mlock doesn't have anything like that. Please also note that while mlocked pages are migratable by default, this is not the case in general because they can be configured to disallow migration to prevent minor page faults, as some workloads require that (e.g. RT). Yeah, however that is a very special case. In most cases mlock() simply prevents swapping; you still have movable pages you can place anywhere you like (including on ZONE_MOVABLE). Another example is ramdisk or even tmpfs (with swap storage depleted or not configured). Both are a PITA from the OOM POV but they are manageable if people are careful. Right, but again, special cases - e.g., tmpfs explicitly has to be resized. If secretmem behaves along those existing models then we know what to expect at least. I think secretmem behaves much more like longterm GUP right now ("unmigratable", "lifetime controlled by user space", "cannot go on CMA/ZONE_MOVABLE"). I'd either want to reasonably well control/limit it or make it behave more like mlocked pages. -- Thanks, David / dhildenb
Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation
On 02.02.21 13:48, Mike Rapoport wrote: On Tue, Feb 02, 2021 at 10:35:05AM +0100, Michal Hocko wrote: On Mon 01-02-21 08:56:19, James Bottomley wrote: I have also proposed potential ways out of this. Either the pool is not fixed sized and you make it regular unevictable memory (if direct map fragmentation is not considered a major problem) I think that the direct map fragmentation is not a major problem, and the data we have confirms it, so I'd be more than happy to entirely drop the pool, allocate memory page by page and remove each page from the direct map. Still, we cannot prove a negative and it could happen that there is a workload that would suffer a lot from the direct map fragmentation, so having a pool of large pages upfront is better than trying to fix it afterwards. As we get more confidence that the direct map fragmentation is not the issue it is commonly believed to be, we may remove the pool altogether. I think that using PMD_ORDER allocations for the pool with a fallback to order 0 will do the job, but unfortunately I doubt we'll reach a consensus about this because dogmatic beliefs are hard to shake... A more restrictive possibility is to still use plain PMD_ORDER allocations to fill the pool, without relying on CMA. In this case there will be no global secretmem specific pool to exhaust, but then it's possible to drain high order free blocks in a system, so CMA has an advantage of limiting secretmem pools to a certain amount of memory with a somewhat higher probability for high order allocations to succeed. I am not really concerned about fragmenting/breaking up the direct map as long as the feature has to be explicitly enabled (similar to fragmenting the vmemmap). As already expressed, I dislike allowing user space to consume an unlimited number of unmovable/unmigratable allocations. 
We already have that in some cases with huge pages (when the arch does not support migration) - but there we can at least manage the consumption using the whole max/reserved/free/... infrastructure. In addition, adding arch support for migration shouldn't be too complicated. The idea of using CMA is quite good IMHO, because there we can locally limit the direct map fragmentation and don't have to bother about migration at all. We own the area, so we can place as many unmovable allocations on it as we can fit. But it sounds like we would also need some kind of reservation mechanism in either scenario (CMA vs. no CMA). If we don't want to go full-circle on max/reserved/free/..., allowing for migration of secretmem pages would make sense. Then, these pages become "less special". Map source, copy, unmap destination. The security implications are the ugly part. I wonder if we could temporarily map somewhere else, so avoiding touching the direct map during migration. -- Thanks, David / dhildenb
Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation
On 26.01.21 12:46, Michal Hocko wrote: On Thu 21-01-21 14:27:19, Mike Rapoport wrote: From: Mike Rapoport Removing a PAGE_SIZE page from the direct map every time such a page is allocated for a secret memory mapping will cause severe fragmentation of the direct map. This fragmentation can be reduced by using PMD-size pages as a pool for small pages for secret memory mappings. Add a gen_pool per secretmem inode and lazily populate this pool with PMD-size pages. As pages allocated by secretmem become unmovable, use CMA to back large page caches so that the page allocator won't be surprised by a failing attempt to migrate these pages. The CMA area used by secretmem is controlled by the "secretmem=" kernel parameter. This allows explicit control over the memory available for secretmem and provides an upper hard limit for secretmem consumption. OK, so I have finally had a closer look at this and it is really not acceptable. I have already mentioned that in a response to another patch, but any task is able to deprive other tasks of access to secret memory and cause the OOM killer, which wouldn't really ever recover and could potentially panic the system. Now you could be less drastic and only make SIGBUS on fault but that would still be quite terrible. There is a very good reason why hugetlb implements its non-trivial reservation system to avoid exactly these problems. So unless I am really misreading the code Nacked-by: Michal Hocko That doesn't mean I reject the whole idea. There are some details to sort out as mentioned elsewhere, but you cannot really depend on a pre-allocated pool which can fail at fault time like that. So, to do it similarly to hugetlbfs (e.g., with CMA), there would have to be a mechanism to actually try pre-reserving (e.g., from the CMA area), at which point in time the pages would get moved to the secretmem pool, and a mechanism for mmap() etc. to "reserve" from the secretmem pool, such that there are guarantees at fault time? 
What we have right now feels like some kind of overcommit (read: as with overcommitted huge pages, we might get SIGBUS at fault time). TBH, the SIGBUS thingy doesn't sound terrible to me - if this behavior is to be expected right now by applications using it and they can handle it - no guarantees. I fully agree that some kind of reservation/guarantee mechanism would be preferable.
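A userspace caricature of the pool described above — one PMD-size block lazily carved into page-size chunks — might look like this. All names are invented for illustration, the bump allocator is a deliberate simplification, and none of the real machinery (per-inode gen_pool, CMA backing, direct-map removal, reservation) is modeled:

```c
#include <stdlib.h>

#define SMALL_SZ 4096u            /* PAGE_SIZE stand-in */
#define LARGE_SZ (2u << 20)       /* PMD-size stand-in (2 MiB) */
#define NCHUNKS  (LARGE_SZ / SMALL_SZ)

/* One lazily populated large block, handed out in small chunks. */
struct pool {
	unsigned char *base;      /* the "PMD-size page" backing the pool */
	unsigned int next;        /* naive bump allocator; no free list */
};

static void *pool_alloc(struct pool *p)
{
	if (!p->base) {           /* lazy population on first demand */
		p->base = aligned_alloc(SMALL_SZ, LARGE_SZ);
		if (!p->base)
			return NULL;
	}
	if (p->next == NCHUNKS)   /* exhausted: real code would grab another block */
		return NULL;
	return p->base + (size_t)p->next++ * SMALL_SZ;
}
```

The point of the structure is the one the changelog makes: only the first allocation pays for (and fragments around) a large block; subsequent small allocations are served from it.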
Re: [PATCH v16 06/11] mm: introduce memfd_secret system call to create "secret" memory areas
On 26.01.21 10:49, Michal Hocko wrote: > On Tue 26-01-21 11:20:11, Mike Rapoport wrote: >> On Tue, Jan 26, 2021 at 10:00:13AM +0100, Michal Hocko wrote: >>> On Tue 26-01-21 10:33:11, Mike Rapoport wrote: On Tue, Jan 26, 2021 at 08:16:14AM +0100, Michal Hocko wrote: > On Mon 25-01-21 23:36:18, Mike Rapoport wrote: >> On Mon, Jan 25, 2021 at 06:01:22PM +0100, Michal Hocko wrote: >>> On Thu 21-01-21 14:27:18, Mike Rapoport wrote: From: Mike Rapoport Introduce a "memfd_secret" system call with the ability to create memory areas visible only in the context of the owning process and mapped neither into other processes nor into the kernel page tables. The user will create a file descriptor using the memfd_secret() system call. The memory areas created by mmap() calls from this file descriptor will be unmapped from the kernel direct map and they will only be mapped in the page table of the owning mm. The secret memory remains accessible in the process context using uaccess primitives, but it is not accessible using direct/linear map addresses. Functions in the follow_page()/get_user_page() family will refuse to return a page that belongs to the secret memory area. A page that was a part of the secret memory area is cleared when it is freed. The following example demonstrates creation of a secret mapping (error handling is omitted): fd = memfd_secret(0); ftruncate(fd, MAP_SIZE); ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); >>> >>> I do not see any access control or permission model for this feature. >>> Is this feature generally safe to anybody? >> >> The mappings obey the memlock limit. Besides, this feature should be enabled >> explicitly at boot with the kernel parameter that sets the >> maximal >> memory size secretmem can consume. > > Why is such a model sufficient and future proof? I mean even when it has > to be enabled by an admin it is still an all-or-nothing approach. 
Mlock > limit is not really useful because it is per mm rather than per user. > > Is there any reason why this is allowed for non-privileged processes? > Maybe this has been discussed in the past but is there any reason why > this cannot be done by a special device which will allow providing at > least some permission policy? Why should this not be allowed for non-privileged processes? This behaves similarly to mlocked memory, so I don't see a reason why secretmem should have a different permissions model. >>> >>> Because apart from the reclaim aspect it fragments the direct mapping >>> IIUC. That might have an impact on all others, right? >> >> It does fragment the direct map, but first it only splits 1G pages to 2M >> pages and as was discussed several times already it's not that clear which >> page size in the direct map is the best and this is very much workload >> dependent. > > I do appreciate this has been discussed but this changelog is not > specific on any of that reasoning and I am pretty sure nobody will > remember the details in a few years. Also some numbers would be > appropriate. > >> These are the results of the benchmarks I've run with the default direct >> mapping covered with 1G pages, with disabled 1G pages using "nogbpages" in >> the kernel command line and with the entire direct map forced to use 4K >> pages using a simple patch to arch/x86/mm/init.c. >> >> https://docs.google.com/spreadsheets/d/1tdD-cu8e93vnfGsTFxZ5YdaEfs2E1GELlvWNOGkJV2U/edit?usp=sharing > > A good start for the data I am asking above. I assume you've seen the benchmark results provided by Xing Zhengjun https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884e...@linux.intel.com/ -- Thanks, David / dhildenb
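The snippet quoted in the changelog omits error handling; a rough userspace sketch with the obvious checks added might look like the following. The hard-coded syscall number is an x86-64 assumption, and the function simply reports failure on kernels where memfd_secret() is absent or not enabled:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_memfd_secret
#define __NR_memfd_secret 447     /* x86-64 number; arch-specific assumption */
#endif

/* Returns 0 on success, -1 on any failure -- including kernels where
 * memfd_secret() is unavailable (the syscall then fails with ENOSYS). */
static int secretmem_demo(size_t len)
{
	int fd = (int)syscall(__NR_memfd_secret, 0);
	if (fd < 0)
		return -1;

	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return -1;
	}

	/* Charged against RLIMIT_MEMLOCK; removed from the kernel direct map. */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}

	memcpy(p, "secret", 6);   /* first touch goes through the fault path */
	int ok = (memcmp(p, "secret", 6) == 0);

	munmap(p, len);
	close(fd);
	return ok ? 0 : -1;
}
```

On a kernel with secretmem enabled the round-trip succeeds; elsewhere the function degrades to a clean failure instead of crashing.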
Re: [PATCH v3 5/6] mm: Fix memory_failure() handling of dax-namespace metadata
On 13.01.21 08:35, Dan Williams wrote: > Given 'struct dev_pagemap' spans both data pages and metadata pages be > careful to consult the altmap if present to delineate metadata. In fact > the pfn_first() helper already identifies the first valid data pfn, so > export that helper for other code paths via pgmap_pfn_valid(). > > Other usages of get_dev_pagemap() are not a concern because those are > operating on known data pfns having been looked up by get_user_pages(). > I.e. metadata pfns are never user mapped. > > Cc: Naoya Horiguchi > Cc: Andrew Morton > Reported-by: David Hildenbrand > Signed-off-by: Dan Williams > --- > include/linux/memremap.h | 6 ++ > mm/memory-failure.c | 6 ++ > mm/memremap.c | 15 +++ > 3 files changed, 27 insertions(+) > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > index 79c49e7f5c30..f5b464daeeca 100644 > --- a/include/linux/memremap.h > +++ b/include/linux/memremap.h > @@ -137,6 +137,7 @@ void *devm_memremap_pages(struct device *dev, struct > dev_pagemap *pgmap); > void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap); > struct dev_pagemap *get_dev_pagemap(unsigned long pfn, > struct dev_pagemap *pgmap); > +bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn); > > unsigned long vmem_altmap_offset(struct vmem_altmap *altmap); > void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns); > @@ -165,6 +166,11 @@ static inline struct dev_pagemap > *get_dev_pagemap(unsigned long pfn, > return NULL; > } > > +static inline bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long > pfn) > +{ > + return false; > +} > + > static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap) > { > return 0; > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 78b173c7190c..541569cb4a99 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1308,6 +1308,12 @@ static int memory_failure_dev_pagemap(unsigned long > pfn, int flags, >*/ > put_page(page); > 
> + /* device metadata space is not recoverable */ > + if (!pgmap_pfn_valid(pgmap, pfn)) { > + rc = -ENXIO; > + goto out; > + } > + > /* >* Prevent the inode from being freed while we are interrogating >* the address_space, typically this would be handled by > diff --git a/mm/memremap.c b/mm/memremap.c > index 16b2fb482da1..2455bac89506 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -80,6 +80,21 @@ static unsigned long pfn_first(struct dev_pagemap *pgmap, > int range_id) > return pfn + vmem_altmap_offset(pgmap_altmap(pgmap)); > } > > +bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn) > +{ > + int i; > + > + for (i = 0; i < pgmap->nr_range; i++) { > + struct range *range = &pgmap->ranges[i]; > + > + if (pfn >= PHYS_PFN(range->start) && > + pfn <= PHYS_PFN(range->end)) > + return pfn >= pfn_first(pgmap, i); > + } > + > + return false; > +} > + > static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id) > { > const struct range *range = &pgmap->ranges[range_id]; > LGTM Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v3 2/6] mm: Teach pfn_to_online_page() to consider subsection validity
On 13.01.21 08:35, Dan Williams wrote: > pfn_section_valid() determines pfn validity on subsection granularity > where pfn_valid() may be limited to coarse section granularity. > Explicitly validate subsections after pfn_valid() succeeds. > > Fixes: b13bc35193d9 ("mm/hotplug: invalid PFNs from pfn_to_online_page()") > Cc: Qian Cai > Cc: Michal Hocko > Cc: Oscar Salvador > Reported-by: David Hildenbrand > Signed-off-by: Dan Williams > --- > mm/memory_hotplug.c | 24 > 1 file changed, 20 insertions(+), 4 deletions(-) > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 55a69d4396e7..9f37f8a68da4 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -308,11 +308,27 @@ static int check_hotplug_memory_addressable(unsigned > long pfn, > struct page *pfn_to_online_page(unsigned long pfn) > { > unsigned long nr = pfn_to_section_nr(pfn); > + struct mem_section *ms; > + > + if (nr >= NR_MEM_SECTIONS) > + return NULL; > + > + ms = __nr_to_section(nr); > + if (!online_section(ms)) > + return NULL; > + > + /* > + * Save some code text when online_section() + > + * pfn_section_valid() are sufficient. > + */ > + if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID)) > + if (!pfn_valid(pfn)) > + return NULL; Nit: if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) && !pfn_valid(pfn)) Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v2 4/5] mm: Fix page reference leak in soft_offline_page()
On 12.01.21 10:34, Dan Williams wrote: > The conversion to move pfn_to_online_page() internal to > soft_offline_page() missed that the get_user_pages() reference needs to > be dropped when pfn_to_online_page() fails. > > When soft_offline_page() is handed a pfn_valid() && > !pfn_to_online_page() pfn the kernel hangs at dax-device shutdown due to > a leaked reference. > > Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn") > Cc: Andrew Morton > Cc: Naoya Horiguchi > Cc: David Hildenbrand > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: > Signed-off-by: Dan Williams > --- > mm/memory-failure.c | 20 > 1 file changed, 16 insertions(+), 4 deletions(-) > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 5a38e9eade94..78b173c7190c 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1885,6 +1885,12 @@ static int soft_offline_free_page(struct page *page) > return rc; > } > > +static void put_ref_page(struct page *page) > +{ > + if (page) > + put_page(page); > +} > + > /** > * soft_offline_page - Soft offline a page. > * @pfn: pfn to soft-offline > @@ -1910,20 +1916,26 @@ static int soft_offline_free_page(struct page *page) > int soft_offline_page(unsigned long pfn, int flags) > { > int ret; > - struct page *page; > bool try_again = true; > + struct page *page, *ref_page = NULL; > + > + WARN_ON_ONCE(!pfn_valid(pfn) && (flags & MF_COUNT_INCREASED)); > > if (!pfn_valid(pfn)) > return -ENXIO; > + if (flags & MF_COUNT_INCREASED) > + ref_page = pfn_to_page(pfn); > + > /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). 
*/ > page = pfn_to_online_page(pfn); > - if (!page) > + if (!page) { > + put_ref_page(ref_page); > return -EIO; > + } > > if (PageHWPoison(page)) { > pr_info("%s: %#lx page already poisoned\n", __func__, pfn); > - if (flags & MF_COUNT_INCREASED) > - put_page(page); > + put_ref_page(ref_page); > return 0; > } Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v2 3/5] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions
On 12.01.21 10:34, Dan Williams wrote: > While pfn_to_online_page() is able to determine pfn_valid() at > subsection granularity it is not able to reliably determine if a given > pfn is also online if the section mixes ZONE_{NORMAL,MOVABLE} with > ZONE_DEVICE. This means that pfn_to_online_page() may return invalid > @page objects. For example with a memory map like: > > 1-1fbff : System RAM > 14200-143002e16 : Kernel code > 14320-143713fff : Kernel rodata > 14380-143b15b7f : Kernel data > 144227000-144ff : Kernel bss > 1fc00-2fbff : Persistent Memory (legacy) > 1fc00-2fbff : namespace0.0 > > This command: > > echo 0x1fc00 > /sys/devices/system/memory/soft_offline_page > > ...succeeds when it should fail. When it succeeds it touches > an uninitialized page and may crash or cause other damage (see > dissolve_free_huge_page()). > > While the memory map above is contrived via the memmap=ss!nn kernel > command line option, the collision happens in practice on shipping > platforms. The memory controller resources that decode spans of > physical address space are a limited resource. One technique > platform-firmware uses to conserve those resources is to share a decoder > across 2 devices to keep the address range contiguous. Unfortunately the > unit of operation of a decoder is 64MiB while the Linux section size is > 128MiB. This results in situations where, without subsection hotplug, > memory mappings with different lifetimes collide into one object that > can only express one lifetime. > > Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a > section that mixes ZONE_DEVICE pfns with other online pfns. With > SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall > back to a slow-path check for ZONE_DEVICE pfns in an online section. In > the fast path online_section() for a full ZONE_DEVICE section returns > false. 
> > Because the collision case is rare, and for simplicity, the > SECTION_TAINT_ZONE_DEVICE flag is never cleared once set. > > Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") > Cc: Andrew Morton > Reported-by: Michal Hocko > Reported-by: David Hildenbrand > Signed-off-by: Dan Williams > --- > include/linux/mmzone.h | 22 +++--- > mm/memory_hotplug.c| 38 ++ > 2 files changed, 53 insertions(+), 7 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index b593316bff3d..0b5c44f730b4 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -1273,13 +1273,14 @@ extern size_t mem_section_usage_size(void); > * which results in PFN_SECTION_SHIFT equal 6. > * To sum it up, at least 6 bits are available. > */ > -#define SECTION_MARKED_PRESENT (1UL<<0) > -#define SECTION_HAS_MEM_MAP (1UL<<1) > -#define SECTION_IS_ONLINE(1UL<<2) > -#define SECTION_IS_EARLY (1UL<<3) > -#define SECTION_MAP_LAST_BIT (1UL<<4) > -#define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1)) > -#define SECTION_NID_SHIFT3 > +#define SECTION_MARKED_PRESENT (1UL<<0) > +#define SECTION_HAS_MEM_MAP (1UL<<1) > +#define SECTION_IS_ONLINE(1UL<<2) > +#define SECTION_IS_EARLY (1UL<<3) > +#define SECTION_TAINT_ZONE_DEVICE(1UL<<4) > +#define SECTION_MAP_LAST_BIT (1UL<<5) > +#define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1)) > +#define SECTION_NID_SHIFT3 > > static inline struct page *__section_mem_map_addr(struct mem_section > *section) > { > @@ -1318,6 +1319,13 @@ static inline int online_section(struct mem_section > *section) > return (section && (section->section_mem_map & SECTION_IS_ONLINE)); > } > > +static inline int online_device_section(struct mem_section *section) > +{ > + unsigned long flags = SECTION_IS_ONLINE | SECTION_TAINT_ZONE_DEVICE; > + > + return section && ((section->section_mem_map & flags) == flags); > +} > + > static inline int online_section_nr(unsigned long nr) > { > return online_section(__nr_to_section(nr)); > diff --git 
a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index a845b3979bc0..b2ccb84c3082 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -308,6 +308,7 @@ static int check_hotplug_memory_addressable(unsigned long > pfn, > struct page *pfn_to_online_page(unsigned long pfn) > { > unsigned long nr = pfn_to_section_nr(pfn); > + struct dev_pagemap *pgmap; > struct mem_section *ms; >
Re: [PATCH v2 2/5] mm: Teach pfn_to_online_page() to consider subsection validity
On 12.01.21 10:34, Dan Williams wrote: > pfn_section_valid() determines pfn validity on subsection granularity. > > pfn_valid_within() internally uses pfn_section_valid(), but gates it > with early_section() to preserve the traditional behavior of pfn_valid() > before subsection support was added. > > pfn_to_online_page() wants the explicit precision that pfn_valid() does > not offer, so use pfn_section_valid() directly. Since > pfn_to_online_page() already open codes the validity of the section > number vs NR_MEM_SECTIONS, there's not much value to using > pfn_valid_within(), just use pfn_section_valid(). This loses the > valid_section() check that pfn_valid_within() was performing, but that > was already redundant with the online check. > > Fixes: b13bc35193d9 ("mm/hotplug: invalid PFNs from pfn_to_online_page()") > Cc: Qian Cai > Cc: Michal Hocko > Reported-by: David Hildenbrand > --- > mm/memory_hotplug.c | 16 > 1 file changed, 12 insertions(+), 4 deletions(-) > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 55a69d4396e7..a845b3979bc0 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -308,11 +308,19 @@ static int check_hotplug_memory_addressable(unsigned > long pfn, > struct page *pfn_to_online_page(unsigned long pfn) > { > unsigned long nr = pfn_to_section_nr(pfn); > + struct mem_section *ms; > + > + if (nr >= NR_MEM_SECTIONS) > + return NULL; > + > + ms = __nr_to_section(nr); > + if (!online_section(ms)) > + return NULL; > + > + if (!pfn_section_valid(ms, pfn)) > + return NULL; That's not sufficient for alternative implementations of pfn_valid(). You still need some kind of pfn_valid(pfn) for alternative versions of pfn_valid(). Consider arm64 memory holes in the memmap. See their current (yet to be fixed/reworked) pfn_valid() implementation. 
(pfn_valid_within() is implicitly active on arm64) Actually, I think we should add something like the following, to make this clearer (pfn_valid_within() is confusing) #ifdef CONFIG_HAVE_ARCH_PFN_VALID /* We might have to check for holes inside the memmap. */ if (!pfn_valid(pfn)) return NULL; #endif > > - if (nr < NR_MEM_SECTIONS && online_section_nr(nr) && > - pfn_valid_within(pfn)) > - return pfn_to_page(pfn); > - return NULL; > + return pfn_to_page(pfn); > } > EXPORT_SYMBOL_GPL(pfn_to_online_page); > > -- Thanks, David / dhildenb
Re: [PATCH v2 1/5] mm: Move pfn_to_online_page() out of line
On 12.01.21 10:34, Dan Williams wrote: > pfn_to_online_page() is already too large to be a macro or an inline > function. In anticipation of further logic changes / growth, move it out > of line. > > No functional change, just code movement. > > Cc: David Hildenbrand > Reported-by: Michal Hocko > Signed-off-by: Dan Williams > --- > include/linux/memory_hotplug.h | 17 + > mm/memory_hotplug.c| 16 > 2 files changed, 17 insertions(+), 16 deletions(-) > > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h > index 15acce5ab106..3d99de0db2dd 100644 > --- a/include/linux/memory_hotplug.h > +++ b/include/linux/memory_hotplug.h > @@ -16,22 +16,7 @@ struct resource; > struct vmem_altmap; > > #ifdef CONFIG_MEMORY_HOTPLUG > -/* > - * Return page for the valid pfn only if the page is online. All pfn > - * walkers which rely on the fully initialized page->flags and others > - * should use this rather than pfn_valid && pfn_to_page > - */ > -#define pfn_to_online_page(pfn) \ > -({ \ > - struct page *___page = NULL; \ > - unsigned long ___pfn = pfn;\ > - unsigned long ___nr = pfn_to_section_nr(___pfn); \ > -\ > - if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \ > - pfn_valid_within(___pfn)) \ > - ___page = pfn_to_page(___pfn); \ > - ___page; \ > -}) > +struct page *pfn_to_online_page(unsigned long pfn); > > /* > * Types for free bootmem stored in page->lru.next. These have to be in > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index f9d57b9be8c7..55a69d4396e7 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -300,6 +300,22 @@ static int check_hotplug_memory_addressable(unsigned > long pfn, > return 0; > } > > +/* > + * Return page for the valid pfn only if the page is online. 
All pfn > + * walkers which rely on the fully initialized page->flags and others > + * should use this rather than pfn_valid && pfn_to_page > + */ > +struct page *pfn_to_online_page(unsigned long pfn) > +{ > + unsigned long nr = pfn_to_section_nr(pfn); > + > + if (nr < NR_MEM_SECTIONS && online_section_nr(nr) && > + pfn_valid_within(pfn)) > + return pfn_to_page(pfn); > + return NULL; > +} > +EXPORT_SYMBOL_GPL(pfn_to_online_page); > + > /* > * Reasonably generic function for adding memory. It is > * expected that archs that support memory hotplug will > Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On 08.12.20 18:28, Joao Martins wrote: > Hey, > > This small series attempts at minimizing 'struct page' overhead by > pursuing a similar approach as Muchun Song's series "Free some vmemmap > pages of hugetlb page"[0] but applied to devmap/ZONE_DEVICE. > > [0] > https://lore.kernel.org/linux-mm/20201130151838.11208-1-songmuc...@bytedance.com/ > > The link above describes it quite nicely, but the idea is to reuse tail > page vmemmap areas, in particular the area which only describes tail pages. > So a vmemmap page describes 64 struct pages, and the first page for a given > ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second > vmemmap page would contain only tail pages, and that's what gets reused across > the rest of the subsection/section. The bigger the page size, the bigger the > savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap > pages). > > In terms of savings, per 1Tb of memory, the struct page cost would go down > with compound pagemap: > > * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total > memory) > * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total > memory) > That's the dream :) > Along the way I've extended it past 'struct page' overhead *trying* to > address a > few performance issues we knew about for pmem, specifically on the > {pin,get}_user_pages* function family with device-dax vmas which are really > slow even of the fast variants. THP is great on -fast variants but all except > hugetlbfs perform rather poorly on non-fast gup. > > So to summarize what the series does: > > Patches 1-5: Much like Muchun series, we reuse tail page areas across a given > page size (namely @align was referred by remaining memremap/dax code) and > enabling of memremap to initialize the ZONE_DEVICE pages as compound pages or > a > given @align order. 
The main difference though, is that contrary to the > hugetlbfs > series, there's no vmemmap for the area, because we are onlining it. Yeah, I'd argue that this case is a lot easier to handle. When the buddy is involved, things get more complicated. -- Thanks, David / dhildenb
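The savings quoted in the cover letter can be cross-checked with back-of-the-envelope arithmetic (assuming 4 KiB base pages, a 64-byte struct page, and that 2 vmemmap pages are kept per compound page — 2 of 8 for 2M, 2 of 4096 for 1G, matching the "save 6" / "save 4094" figures above):

```c
#include <stdint.h>

#define KiB 1024ULL
#define MiB (1024 * KiB)
#define GiB (1024 * MiB)
#define TiB (1024 * GiB)

#define PAGE_SZ     (4 * KiB)
#define STRUCT_PAGE 64ULL         /* sizeof(struct page) on x86-64 */

/* Baseline vmemmap cost: one struct page per base page. */
static uint64_t vmemmap_baseline(uint64_t span)
{
	return span / PAGE_SZ * STRUCT_PAGE;
}

/* Compound pagemap cost: per compound page of size `align`, only `kept`
 * vmemmap pages survive; the remaining tail-describing pages are reused. */
static uint64_t vmemmap_compound(uint64_t span, uint64_t align, uint64_t kept)
{
	return span / align * kept * PAGE_SZ;
}
```

Per 1 TiB this gives 16 GiB baseline, 4 GiB with 2 MiB compound pages, and 8 MiB with 1 GiB compound pages — the numbers in the cover letter.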
Re: [PATCH v12 04/10] set_memory: allow querying whether set_direct_map_*() is actually enabled
On 25.11.20 13:11, Mike Rapoport wrote: > On Wed, Nov 25, 2020 at 12:22:52PM +0100, David Hildenbrand wrote: >>> #include >>> >>> #endif /* __ASM_CACHEFLUSH_H */ >>> diff --git a/arch/arm64/include/asm/set_memory.h >>> b/arch/arm64/include/asm/set_memory.h >>> new file mode 100644 >>> index ..ecb6b0f449ab >>> --- /dev/null >>> +++ b/arch/arm64/include/asm/set_memory.h >>> @@ -0,0 +1,17 @@ >>> +/* SPDX-License-Identifier: GPL-2.0-only */ >>> + >>> +#ifndef _ASM_ARM64_SET_MEMORY_H >>> +#define _ASM_ARM64_SET_MEMORY_H >>> + >>> +#include >>> + >>> +bool can_set_direct_map(void); >>> +#define can_set_direct_map can_set_direct_map >> >> Well, that looks weird. >> [...] > > We have a lot of those e.g. in linux/pgtable.h > >>> } >>> +#else /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */ >>> +/* >>> + * Some architectures, e.g. ARM64 can disable direct map modifications at >>> + * boot time. Let them override this query. >>> + */ >>> +#ifndef can_set_direct_map >>> +static inline bool can_set_direct_map(void) >>> +{ >>> + return true; >>> +} >> >> I think we prefer __weak functions for something like that, avoids the >> ifdefery. > > I'd prefer this for two reasons: first, static inline can be optimized > away and second, there is no really good place to put a generic > implementation. Fair enough Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v12 04/10] set_memory: allow querying whether set_direct_map_*() is actually enabled
> #include > > #endif /* __ASM_CACHEFLUSH_H */ > diff --git a/arch/arm64/include/asm/set_memory.h > b/arch/arm64/include/asm/set_memory.h > new file mode 100644 > index ..ecb6b0f449ab > --- /dev/null > +++ b/arch/arm64/include/asm/set_memory.h > @@ -0,0 +1,17 @@ > +/* SPDX-License-Identifier: GPL-2.0-only */ > + > +#ifndef _ASM_ARM64_SET_MEMORY_H > +#define _ASM_ARM64_SET_MEMORY_H > + > +#include > + > +bool can_set_direct_map(void); > +#define can_set_direct_map can_set_direct_map Well, that looks weird. [...] > } > +#else /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */ > +/* > + * Some architectures, e.g. ARM64 can disable direct map modifications at > + * boot time. Let them override this query. > + */ > +#ifndef can_set_direct_map > +static inline bool can_set_direct_map(void) > +{ > + return true; > +} I think we prefer __weak functions for something like that, avoids the ifdefery. Apart from that, LGTM. -- Thanks, David / dhildenb
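The `#define can_set_direct_map can_set_direct_map` idiom discussed here lets an arch header claim the name so the generic static-inline fallback compiles out. Reduced to a standalone sketch (all names invented for illustration; compile with the macro defined to simulate an arch override):

```c
#include <stdbool.h>

/*
 * "Arch" header: an architecture that can disable direct-map changes at
 * boot supplies its own query and claims the name via #define.
 * (Build with -DARCH_HAS_OWN_QUERY to simulate such an architecture.)
 */
#ifdef ARCH_HAS_OWN_QUERY
static bool arch_query(void) { return false; }
#define can_set_direct_map arch_query
#endif

/*
 * Generic header: defines the fallback only when no arch override exists.
 * Unlike a __weak function, the static inline can be optimized away and
 * needs no single home in a generic .c file -- Mike's two reasons above.
 */
#ifndef can_set_direct_map
static inline bool can_set_direct_map(void)
{
	return true;
}
#endif
```

Without the arch define, callers get the always-true fallback; with it, the same name resolves to the arch's implementation at no runtime cost.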
Re: [PATCH v8 2/9] mmap: make mlock_future_check() global
On 15.11.20 09:26, Mike Rapoport wrote: On Thu, Nov 12, 2020 at 09:15:18PM +0100, David Hildenbrand wrote: On 12.11.2020 at 20:08, Mike Rapoport wrote: On Thu, Nov 12, 2020 at 05:22:00PM +0100, David Hildenbrand wrote: On 10.11.20 19:06, Mike Rapoport wrote: On Tue, Nov 10, 2020 at 06:17:26PM +0100, David Hildenbrand wrote: On 10.11.20 16:14, Mike Rapoport wrote: From: Mike Rapoport It will be used by the upcoming secret memory implementation. Signed-off-by: Mike Rapoport --- mm/internal.h | 3 +++ mm/mmap.c | 5 ++--- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index c43ccdddb0f6..ae146a260b14 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -348,6 +348,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma) extern void mlock_vma_page(struct page *page); extern unsigned int munlock_vma_page(struct page *page); +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, + unsigned long len); + /* * Clear the page's PageMlocked(). This can be useful in a situation where * we want to unconditionally remove a page from the pagecache -- e.g., diff --git a/mm/mmap.c b/mm/mmap.c index 61f72b09d990..c481f088bd50 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1348,9 +1348,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint) return hint; } -static inline int mlock_future_check(struct mm_struct *mm, - unsigned long flags, - unsigned long len) +int mlock_future_check(struct mm_struct *mm, unsigned long flags, + unsigned long len) { unsigned long locked, lock_limit; So, an interesting question is if you actually want to charge secretmem pages against mlock now, or if you want a dedicated secretmem cgroup controller instead? Well, with the current implementation there are three limits an administrator can use to control secretmem limits: mlock, memcg and a kernel parameter. 
The kernel parameter puts a global upper limit on secretmem usage, memcg accounts all secretmem allocations, including the unused memory in the large page cache, and mlock allows a per-task limit for secretmem mappings, well, like mlock does. I didn't consider a dedicated cgroup, as it seems we already have enough existing knobs and a new one would be unnecessary. To me it feels like the mlock() limit is a wrong fit for secretmem. But maybe there are other cases of using the mlock() limit without actually doing mlock() that I am not aware of (most probably :) )? Secretmem does not explicitly call mlock() but it does what mlock() does and a bit more. Citing mlock(2): mlock(), mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory from being paged to the swap area. So, based on the fact that secretmem pages are not swappable, I think that RLIMIT_MEMLOCK is appropriate here. The page explicitly lists mlock() system calls. Well, it's the mlock() man page, isn't it? ;-) ;) My thinking was that since secretmem does what mlock() does wrt swappability, it should at least obey the same limit, i.e. RLIMIT_MEMLOCK. Right, but at least currently, it behaves like any other CMA allocation (IIRC they are all unmovable and, therefore, not swappable). In the future, if pages would be movable (but not swappable), I guess it might make more sense. I assume we never ever want to swap secretmem. "man getrlimit" states for RLIMIT_MEMLOCK: "This is the maximum number of bytes of memory that may be locked into RAM. [...] This limit affects mlock(2), mlockall(2), and the mmap(2) MAP_LOCKED operation. Since Linux 2.6.9, it also affects the shmctl(2) SHM_LOCK operation [...]" So that place has to be updated as well I guess? Otherwise this might come as a surprise for users. E.g., we also don't account for gigantic pages - which might be allocated from CMA and are not swappable. Do you mean gigantic pages in hugetlbfs? 
Yes. It seems to me that hugetlbfs accounting is a completely different story. I'd say it is right now comparable to secretmem - which is why I thought similar accounting would make sense. -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v8 2/9] mmap: make mlock_future_check() global
> Am 12.11.2020 um 20:08 schrieb Mike Rapoport : > > On Thu, Nov 12, 2020 at 05:22:00PM +0100, David Hildenbrand wrote: >>> On 10.11.20 19:06, Mike Rapoport wrote: >>> On Tue, Nov 10, 2020 at 06:17:26PM +0100, David Hildenbrand wrote: >>>> On 10.11.20 16:14, Mike Rapoport wrote: >>>>> From: Mike Rapoport >>>>> >>>>> It will be used by the upcoming secret memory implementation. >>>>> >>>>> Signed-off-by: Mike Rapoport >>>>> --- >>>>> mm/internal.h | 3 +++ >>>>> mm/mmap.c | 5 ++--- >>>>> 2 files changed, 5 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/mm/internal.h b/mm/internal.h >>>>> index c43ccdddb0f6..ae146a260b14 100644 >>>>> --- a/mm/internal.h >>>>> +++ b/mm/internal.h >>>>> @@ -348,6 +348,9 @@ static inline void munlock_vma_pages_all(struct >>>>> vm_area_struct *vma) >>>>> extern void mlock_vma_page(struct page *page); >>>>> extern unsigned int munlock_vma_page(struct page *page); >>>>> +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, >>>>> + unsigned long len); >>>>> + >>>>> /* >>>>>* Clear the page's PageMlocked(). This can be useful in a situation >>>>> where >>>>>* we want to unconditionally remove a page from the pagecache -- e.g., >>>>> diff --git a/mm/mmap.c b/mm/mmap.c >>>>> index 61f72b09d990..c481f088bd50 100644 >>>>> --- a/mm/mmap.c >>>>> +++ b/mm/mmap.c >>>>> @@ -1348,9 +1348,8 @@ static inline unsigned long >>>>> round_hint_to_min(unsigned long hint) >>>>> return hint; >>>>> } >>>>> -static inline int mlock_future_check(struct mm_struct *mm, >>>>> - unsigned long flags, >>>>> - unsigned long len) >>>>> +int mlock_future_check(struct mm_struct *mm, unsigned long flags, >>>>> + unsigned long len) >>>>> { >>>>> unsigned long locked, lock_limit; >>>>> >>>> >>>> So, an interesting question is if you actually want to charge secretmem >>>> pages against mlock now, or if you want a dedicated secretmem cgroup >>>> controller instead? 
>>> >>> Well, with the current implementation there are three limits an >>> administrator can use to control secretmem limits: mlock, memcg and >>> kernel parameter. >>> >>> The kernel parameter puts a global upper limit for secretmem usage, >>> memcg accounts all secretmem allocations, including the unused memory in >>> large pages caching and mlock allows per task limit for secretmem >>> mappings, well, like mlock does. >>> >>> I didn't consider a dedicated cgroup, as it seems we already have enough >>> existing knobs and a new one would be unnecessary. >> >> To me it feels like the mlock() limit is a wrong fit for secretmem. But >> maybe there are other cases of using the mlock() limit without actually >> doing mlock() that I am not aware of (most probably :) )? > > Secretmem does not explicitly calls to mlock() but it does what mlock() > does and a bit more. Citing mlock(2): > > mlock(), mlock2(), and mlockall() lock part or all of the calling > process's virtual address space into RAM, preventing that memory from > being paged to the swap area. > > So, based on that secretmem pages are not swappable, I think that > RLIMIT_MEMLOCK is appropriate here. > The page explicitly lists mlock() system calls. E.g., we also don‘t account for gigantic pages - which might be allocated from CMA and are not swappable. >> I mean, my concern is not earth shattering, this can be reworked later. As I >> said, it just feels wrong. >> >> -- >> Thanks, >> >> David / dhildenb >> > > -- > Sincerely yours, > Mike. > ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v8 2/9] mmap: make mlock_future_check() global
On 10.11.20 19:06, Mike Rapoport wrote: On Tue, Nov 10, 2020 at 06:17:26PM +0100, David Hildenbrand wrote: On 10.11.20 16:14, Mike Rapoport wrote: From: Mike Rapoport It will be used by the upcoming secret memory implementation. Signed-off-by: Mike Rapoport --- mm/internal.h | 3 +++ mm/mmap.c | 5 ++--- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index c43ccdddb0f6..ae146a260b14 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -348,6 +348,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma) extern void mlock_vma_page(struct page *page); extern unsigned int munlock_vma_page(struct page *page); +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, + unsigned long len); + /* * Clear the page's PageMlocked(). This can be useful in a situation where * we want to unconditionally remove a page from the pagecache -- e.g., diff --git a/mm/mmap.c b/mm/mmap.c index 61f72b09d990..c481f088bd50 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1348,9 +1348,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint) return hint; } -static inline int mlock_future_check(struct mm_struct *mm, -unsigned long flags, -unsigned long len) +int mlock_future_check(struct mm_struct *mm, unsigned long flags, + unsigned long len) { unsigned long locked, lock_limit; So, an interesting question is if you actually want to charge secretmem pages against mlock now, or if you want a dedicated secretmem cgroup controller instead? Well, with the current implementation there are three limits an administrator can use to control secretmem limits: mlock, memcg and kernel parameter. The kernel parameter puts a global upper limit for secretmem usage, memcg accounts all secretmem allocations, including the unused memory in large pages caching and mlock allows per task limit for secretmem mappings, well, like mlock does. 
I didn't consider a dedicated cgroup, as it seems we already have enough existing knobs and a new one would be unnecessary. To me it feels like the mlock() limit is a wrong fit for secretmem. But maybe there are other cases of using the mlock() limit without actually doing mlock() that I am not aware of (most probably :) )? I mean, my concern is not earth shattering, this can be reworked later. As I said, it just feels wrong. -- Thanks, David / dhildenb
Re: [PATCH v8 2/9] mmap: make mlock_future_check() global
On 10.11.20 16:14, Mike Rapoport wrote: From: Mike Rapoport It will be used by the upcoming secret memory implementation. Signed-off-by: Mike Rapoport --- mm/internal.h | 3 +++ mm/mmap.c | 5 ++--- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index c43ccdddb0f6..ae146a260b14 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -348,6 +348,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma) extern void mlock_vma_page(struct page *page); extern unsigned int munlock_vma_page(struct page *page); +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, + unsigned long len); + /* * Clear the page's PageMlocked(). This can be useful in a situation where * we want to unconditionally remove a page from the pagecache -- e.g., diff --git a/mm/mmap.c b/mm/mmap.c index 61f72b09d990..c481f088bd50 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1348,9 +1348,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint) return hint; } -static inline int mlock_future_check(struct mm_struct *mm, -unsigned long flags, -unsigned long len) +int mlock_future_check(struct mm_struct *mm, unsigned long flags, + unsigned long len) { unsigned long locked, lock_limit; So, an interesting question is if you actually want to charge secretmem pages against mlock now, or if you want a dedicated secretmem cgroup controller instead? -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v6 0/6] mm: introduce memfd_secret system call to create "secret" memory areas
On 03.11.20 10:52, Mike Rapoport wrote: On Mon, Nov 02, 2020 at 06:51:09PM +0100, David Hildenbrand wrote: Assume you have a system with quite some ZONE_MOVABLE memory (esp. in virtualized environments), eating up a significant amount of !ZONE_MOVABLE memory dynamically at runtime can lead to non-obvious issues. It looks like you have plenty of free memory, but the kernel might still OOM when trying to do kernel allocations e.g., for pagetables. With CMA we at least know what we're dealing with - it behaves like ZONE_MOVABLE except for the owner that can place unmovable pages there. We can use it to compute statically the amount of ZONE_MOVABLE memory we can have in the system without doing harm to the system. Why would you say that secretmem allocates from !ZONE_MOVABLE? If we put boot time reservations aside, the memory allocation for secretmem follows the same rules as the memory allocations for any file descriptor. That means we allocate memory with GFP_HIGHUSER_MOVABLE. Oh, okay - I missed that! I had the impression that pages are unmovable and allocating from ZONE_MOVABLE would be a violation of that? After the allocation the memory indeed becomes unmovable but it's not like we are eating memory from other zones here. ... and here you have your problem. That's a no-no. We only allow it in very special cases where it can't be avoided - e.g., vfio having to pin guest memory when passing through memory to VMs. Hotplug memory, online it to ZONE_MOVABLE. Allocate secretmem. Try to unplug the memory again -> endless loop in offline_pages(). Or have a CMA area that gets used with GFP_HIGHUSER_MOVABLE. Allocate secretmem. The owner of the area tries to allocate memory - always fails. Purpose of CMA destroyed. Ideally, we would want to support page migration/compaction and allow for allocation from ZONE_MOVABLE as well. Would involve temporarily mapping, copying, unmapping. Sounds feasible, but not sure which roadblocks we would find on the way. 
We can support migration/compaction with temporary mapping. The first roadblock I've hit there was that migration allocates a 4K destination page and if we use it in a secret map we are back to scrambling the direct map into 4K pieces. It still sounds feasible but not as trivial :) That sounds like the proper way for me to do it then. Although migration of secretmem pages sounds feasible now, there may be other issues I didn't see because I'm not very familiar with migration/compaction code. Migration of PMDs might also be feasible - and it would be even cleaner. But I agree that that might require more work and starting with something simpler (!movable) is the right way to move forward. -- Thanks, David / dhildenb
Re: [PATCH v6 0/6] mm: introduce memfd_secret system call to create "secret" memory areas
Assume you have a system with quite some ZONE_MOVABLE memory (esp. in virtualized environments), eating up a significant amount of !ZONE_MOVABLE memory dynamically at runtime can lead to non-obvious issues. It looks like you have plenty of free memory, but the kernel might still OOM when trying to do kernel allocations e.g., for pagetables. With CMA we at least know what we're dealing with - it behaves like ZONE_MOVABLE except for the owner that can place unmovable pages there. We can use it to compute statically the amount of ZONE_MOVABLE memory we can have in the system without doing harm to the system. Why would you say that secretmem allocates from !ZONE_MOVABLE? If we put boot time reservations aside, the memory allocation for secretmem follows the same rules as the memory allocations for any file descriptor. That means we allocate memory with GFP_HIGHUSER_MOVABLE. Oh, okay - I missed that! I had the impression that pages are unmovable and allocating from ZONE_MOVABLE would be a violation of that? After the allocation the memory indeed becomes unmovable but it's not like we are eating memory from other zones here. ... and here you have your problem. That's a no-no. We only allow it in very special cases where it can't be avoided - e.g., vfio having to pin guest memory when passing through memory to VMs. Hotplug memory, online it to ZONE_MOVABLE. Allocate secretmem. Try to unplug the memory again -> endless loop in offline_pages(). Or have a CMA area that gets used with GFP_HIGHUSER_MOVABLE. Allocate secretmem. The owner of the area tries to allocate memory - always fails. Purpose of CMA destroyed. Ideally, we would want to support page migration/compaction and allow for allocation from ZONE_MOVABLE as well. Would involve temporarily mapping, copying, unmapping. Sounds feasible, but not sure which roadblocks we would find on the way. We can support migration/compaction with temporary mapping. 
The first roadblock I've hit there was that migration allocates a 4K destination page and if we use it in a secret map we are back to scrambling the direct map into 4K pieces. It still sounds feasible but not as trivial :) That sounds like the proper way for me to do it then. But again, there is nothing in the current form of secretmem that prevents allocation from ZONE_MOVABLE. Oh, there is something: that the pages are not movable. -- Thanks, David / dhildenb
Re: [PATCH v6 0/6] mm: introduce memfd_secret system call to create "secret" memory areas
On 02.11.20 10:11, David Hildenbrand wrote: On 24.09.20 15:28, Mike Rapoport wrote: From: Mike Rapoport Hi, This is an implementation of "secret" mappings backed by a file descriptor. I've dropped the boot time reservation patch for now as it is not strictly required for the basic usage and can be easily added later either with or without CMA. Hi Mike, I'd like to stress again that I'd prefer *any* secretmem allocations going via CMA as long as these pages are unmovable. The user can allocate a non-significant amount of unmovable allocations only fenced lol, "non-neglectable" or "significant". Guess I need another coffee :) -- Thanks, David / dhildenb
Re: [PATCH v6 0/6] mm: introduce memfd_secret system call to create "secret" memory areas
On 24.09.20 15:28, Mike Rapoport wrote: From: Mike Rapoport Hi, This is an implementation of "secret" mappings backed by a file descriptor. I've dropped the boot time reservation patch for now as it is not strictly required for the basic usage and can be easily added later either with or without CMA. Hi Mike, I'd like to stress again that I'd prefer *any* secretmem allocations going via CMA as long as these pages are unmovable. The user can allocate a non-significant amount of unmovable allocations only fenced by the mlock limit, which behave very different to mlocked pages - they are not movable for page compaction/migration. Assume you have a system with quite some ZONE_MOVABLE memory (esp. in virtualized environments), eating up a significant amount of !ZONE_MOVABLE memory dynamically at runtime can lead to non-obvious issues. It looks like you have plenty of free memory, but the kernel might still OOM when trying to do kernel allocations e.g., for pagetables. With CMA we at least know what we're dealing with - it behaves like ZONE_MOVABLE except for the owner that can place unmovable pages there. We can use it to compute statically the amount of ZONE_MOVABLE memory we can have in the system without doing harm to the system. Ideally, we would want to support page migration/compaction and allow for allocation from ZONE_MOVABLE as well. Would involve temporarily mapping, copying, unmapping. Sounds feasible, but not sure which roadblocks we would find on the way. [...] The file descriptor backing secret memory mappings is created using a dedicated memfd_secret system call The desired protection mode for the memory is configured using flags parameter of the system call. The mmap() of the file descriptor created with memfd_secret() will create a "secret" memory mapping. The pages in that mapping will be marked as not present in the direct map and will have desired protection bits set in the user page table. 
For instance, the current implementation allows uncached mappings. Although normally Linux userspace mappings are protected from other users, such secret mappings are useful for environments where a hostile tenant is trying to trick the kernel into giving them access to other tenants' mappings. Additionally, the secret mappings may be used as a means to protect guest memory in a virtual machine host. For demonstration of secret memory usage we've created a userspace library [1] that does two things: the first is to act as a preloader for openssl to redirect all the OPENSSL_malloc calls to secret memory meaning any secret keys get automatically protected this way and the other thing it does is expose the API to the user who needs it. We anticipate that a lot of the use cases would be like the openssl one: many toolkits that deal with secret keys already have special handling for the memory to try to give them greater protection, so this would simply be pluggable into the toolkits without any need for user application modification. I've hesitated whether to continue to use new flags to memfd_create() or to add a new system call and I've decided to use a new system call after I've started to look into man pages update. There would have been two completely independent descriptions and I think it would have been very confusing. This was also raised on lwn.net by "dullfire" [1]. I do wonder if it would be the right place as well. [1] https://lwn.net/Articles/835342/#Comments Hiding secret memory mappings behind an anonymous file allows (ab)use of the page cache for tracking pages allocated for the "secret" mappings as well as using address_space_operations for e.g. page migration callbacks. The anonymous file may also be used implicitly, like hugetlb files, to implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm ABIs in the future.
As the fragmentation of the direct map was one of the major concerns raised during the previous postings, I've added an amortizing cache of PMD-size pages to each file descriptor that is used as an allocation pool for the secret memory areas. -- Thanks, David / dhildenb
Re: [PATCH] x86/mm: Fix phys_to_target_node() export
On 30.10.20 03:29, Dan Williams wrote: The core-mm has a default __weak implementation of phys_to_target_node() when the architecture does not override it. That symbol is exported for modules. However, while the export in mm/memory_hotplug.c exported the symbol in the configuration cases of: CONFIG_NUMA_KEEP_MEMINFO=y CONFIG_MEMORY_HOTPLUG=y ...and: CONFIG_NUMA_KEEP_MEMINFO=n CONFIG_MEMORY_HOTPLUG=y ...it failed to export the symbol in the case of: CONFIG_NUMA_KEEP_MEMINFO=y CONFIG_MEMORY_HOTPLUG=n Always export the symbol from the CONFIG_NUMA_KEEP_MEMINFO section of arch/x86/mm/numa.c, and teach mm/memory_hotplug.c to optionally export in case arch/x86/mm/numa.c has already performed the export. The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid now that the symbol is properly exported in all combinations of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG. Note that in the CONFIG_NUMA=n case no export is needed since their is a dummy static inline implementation of phys_to_target_node() in that case. Reported-by: Randy Dunlap Reported-by: Thomas Gleixner Reported-by: kernel test robot Fixes: a035b6bf863e ("mm/memory_hotplug: introduce default phys_to_target_node() implementation") Cc: Joao Martins Cc: Andrew Morton Cc: x...@kernel.org Cc: Vishal Verma Signed-off-by: Dan Williams --- arch/x86/mm/numa.c |1 + drivers/dax/Kconfig |1 - mm/memory_hotplug.c |5 + 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 44148691d78b..e025947f19e0 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -938,6 +938,7 @@ int phys_to_target_node(phys_addr_t start) return meminfo_to_nid(&numa_reserved_meminfo, start); } +EXPORT_SYMBOL_GPL(phys_to_target_node); int memory_add_physaddr_to_nid(u64 start) { diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig index 567428e10b7b..d2834c2cfa10 100644 --- a/drivers/dax/Kconfig +++ b/drivers/dax/Kconfig @@ -50,7 +50,6 @@ config DEV_DAX_HMEM Say M if unsure. 
config DEV_DAX_HMEM_DEVICES - depends on NUMA_KEEP_MEMINFO # for phys_to_target_node() depends on DEV_DAX_HMEM && DAX=y def_bool y diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index b44d4c7ba73b..ed326b489674 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -365,9 +365,14 @@ int __weak phys_to_target_node(u64 start) start); return 0; } + +/* If the arch did not export a strong symbol, export the weak one. */ +#ifndef CONFIG_NUMA_KEEP_MEMINFO EXPORT_SYMBOL_GPL(phys_to_target_node); #endif +#endif + /* find the smallest valid pfn in the range [start_pfn, end_pfn) */ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone, unsigned long start_pfn, Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v7 1/7] mm: add definition of PMD_PAGE_ORDER
On 26.10.20 09:37, Mike Rapoport wrote: > From: Mike Rapoport > > The definition of PMD_PAGE_ORDER denoting the number of base pages in the > second-level leaf page is already used by DAX and may be handy in other > cases as well. > > Several architectures already have definition of PMD_ORDER as the size of > second level page table, so to avoid conflict with these definitions use > PMD_PAGE_ORDER name and update DAX respectively. > > Signed-off-by: Mike Rapoport Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v7 3/7] set_memory: allow set_direct_map_*_noflush() for multiple pages
On 26.10.20 20:01, Edgecombe, Rick P wrote: > On Mon, 2020-10-26 at 10:37 +0200, Mike Rapoport wrote: >> +++ b/arch/x86/mm/pat/set_memory.c >> @@ -2184,14 +2184,14 @@ static int __set_pages_np(struct page *page, >> int numpages) >> return __change_page_attr_set_clr(&cpa, 0); >> } >> >> -int set_direct_map_invalid_noflush(struct page *page) >> +int set_direct_map_invalid_noflush(struct page *page, int numpages) >> { >> - return __set_pages_np(page, 1); >> + return __set_pages_np(page, numpages); >> } >> >> -int set_direct_map_default_noflush(struct page *page) >> +int set_direct_map_default_noflush(struct page *page, int numpages) >> { >> - return __set_pages_p(page, 1); >> + return __set_pages_p(page, numpages); >> } > > Somewhat related to your other series, this could result in large NP > pages and trip up hibernate. > It feels somewhat desirable to disable hibernation once secretmem is enabled, right? Otherwise you'll be writing out your secrets to swap, where they will remain even after booting up again ... Skipping secretmem pages when hibernating is the wrong approach I guess ... -- Thanks, David / dhildenb
Re: [PATCH 1/2] device-dax/kmem: Fix resource release
On 15.10.20 02:42, Dan Williams wrote: > The conversion to request_mem_region() is broken because it assumes that > the range is marked busy prior to release. However, due to the way that > the kmem driver manipulates the IORESOURCE_BUSY flag (clears it to > let {add,remove}_memory() handle busy) it requires a manual > release_resource() to perform cleanup. > > Given that the actual 'struct resource *' needs to be recalled, not just > the range, add that tracking to the kmem driver-data. > > Reported-by: David Hildenbrand > Fixes: 0513bd5bb114 ("device-dax/kmem: replace release_resource() with > release_mem_region()") > Cc: Vishal Verma > Cc: Dave Hansen > Cc: Pavel Tatashin > Cc: Brice Goglin > Cc: Dave Jiang > Cc: Ira Weiny > Cc: Jia He > Cc: Joao Martins > Cc: Jonathan Cameron > Signed-off-by: Dan Williams > --- > drivers/dax/kmem.c | 48 ++-- > 1 file changed, 34 insertions(+), 14 deletions(-) > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > index 6c933f2b604e..af04b6d1d263 100644 > --- a/drivers/dax/kmem.c > +++ b/drivers/dax/kmem.c > @@ -35,11 +35,17 @@ static int dax_kmem_range(struct dev_dax *dev_dax, int i, > struct range *r) > return 0; > } > > +struct dax_kmem_data { > + const char *res_name; > + struct resource *res[]; > +}; > + > static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > { > struct device *dev = &dev_dax->dev; > + struct dax_kmem_data *data; > + int rc = -ENOMEM; > int i, mapped = 0; > - char *res_name; > int numa_node; > > /* > @@ -55,14 +61,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > return -EINVAL; > } > > - res_name = kstrdup(dev_name(dev), GFP_KERNEL); > - if (!res_name) > + data = kzalloc(sizeof(*data) + sizeof(struct resource *) * > dev_dax->nr_range, GFP_KERNEL); > + if (!data) > return -ENOMEM; > > + data->res_name = kstrdup(dev_name(dev), GFP_KERNEL); > + if (!data->res_name) > + goto err_res_name; > + > for (i = 0; i < dev_dax->nr_range; i++) { > struct resource *res; > struct range range; > - int 
rc; > > rc = dax_kmem_range(dev_dax, i, &range); > if (rc) { > @@ -72,7 +81,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > } > > /* Region is permanently reserved if hotremove fails. */ > - res = request_mem_region(range.start, range_len(&range), > res_name); > + res = request_mem_region(range.start, range_len(&range), > data->res_name); > if (!res) { > dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve > region\n", > i, range.start, range.end); > @@ -82,9 +91,10 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) >*/ > if (mapped) > continue; > - kfree(res_name); > - return -EBUSY; > + rc = -EBUSY; > + goto err_request_mem; > } > + data->res[i] = res; > > /* >* Set flags appropriate for System RAM. Leave ..._BUSY clear > @@ -104,18 +114,25 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > if (rc) { > dev_warn(dev, "mapping%d: %#llx-%#llx memory add > failed\n", > i, range.start, range.end); > - release_mem_region(range.start, range_len(&range)); > + release_resource(res); > + kfree(res); > + data->res[i] = NULL; > if (mapped) > continue; > - kfree(res_name); > - return rc; > + goto err_request_mem; > } > mapped++; > } > > - dev_set_drvdata(dev, res_name); > + dev_set_drvdata(dev, data); > > return 0; > + > +err_request_mem: > + kfree(data->res_name); > +err_res_name: > + kfree(data); > + return rc; > } > > #ifdef CONFIG_MEMORY_HOTREMOVE > @@ -123,7 +140,7 @@ static int dev_dax_kmem_remove(struct dev_dax *dev_dax) > { > int i, success = 0; > struct device *dev = &dev_dax->dev;
Re: [PATCH v6 03/11] device-dax/kmem: move resource tracking to drvdata
On 06.10.20 08:55, Dan Williams wrote: > Towards removing the mode specific @dax_kmem_res attribute from the > generic 'struct dev_dax', and preparing for multi-range support, move > resource tracking to driver data. The memory for the resource name > needs to have its own lifetime separate from the device bind lifetime > for cases where the driver is unbound, but the kmem range could not be > unplugged from the page allocator. > > The resource reservation also needs to be released manually via > release_resource() given the awkward manipulation of the > IORESOURCE_BUSY flag. > > Cc: David Hildenbrand > Cc: Vishal Verma > Cc: Dave Hansen > Cc: Pavel Tatashin > Cc: Brice Goglin > Cc: Dave Jiang > Cc: David Hildenbrand > Cc: Ira Weiny > Cc: Jia He > Cc: Joao Martins > Cc: Jonathan Cameron > Signed-off-by: Dan Williams Reviewed-by: David Hildenbrand Thanks! -- Thanks, David / dhildenb
Re: [PATCH v5 01/17] device-dax: make pgmap optional for instance creation
On 01.10.20 18:54, Dan Williams wrote: > On Thu, Oct 1, 2020 at 1:41 AM David Hildenbrand wrote: >> >> On 25.09.20 21:11, Dan Williams wrote: >>> The passed in dev_pagemap is only required in the pmem case as the >>> libnvdimm core may have reserved a vmem_altmap for dev_memremap_pages() to >>> place the memmap in pmem directly. In the hmem case there is no agent >>> reserving an altmap so it can all be handled by a core internal default. >>> >>> Pass the resource range via a new @range property of 'struct >>> dev_dax_data'. >>> >>> Link: >>> https://lkml.kernel.org/r/159643099958.4062302.10379230791041872886.st...@dwillia2-desk3.amr.corp.intel.com >>> Cc: David Hildenbrand >>> Cc: Vishal Verma >>> Cc: Dave Hansen >>> Cc: Pavel Tatashin >>> Cc: Brice Goglin >>> Cc: Dave Jiang >>> Cc: David Hildenbrand >>> Cc: Ira Weiny >>> Cc: Jia He >>> Cc: Joao Martins >>> Cc: Jonathan Cameron >>> Signed-off-by: Dan Williams >>> --- >>> drivers/dax/bus.c | 29 +++-- >>> drivers/dax/bus.h |2 ++ >>> drivers/dax/dax-private.h |9 - >>> drivers/dax/device.c | 28 +++- >>> drivers/dax/hmem/hmem.c|8 >>> drivers/dax/kmem.c | 12 ++-- >>> drivers/dax/pmem/core.c|4 >>> tools/testing/nvdimm/dax-dev.c |8 >>> 8 files changed, 62 insertions(+), 38 deletions(-) >>> >>> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c >>> index dffa4655e128..96bd64ba95a5 100644 >>> --- a/drivers/dax/bus.c >>> +++ b/drivers/dax/bus.c >>> @@ -271,7 +271,7 @@ static ssize_t size_show(struct device *dev, >>> struct device_attribute *attr, char *buf) >>> { >>> struct dev_dax *dev_dax = to_dev_dax(dev); >>> - unsigned long long size = resource_size(&dev_dax->region->res); >>> + unsigned long long size = range_len(&dev_dax->range); >>> >>> return sprintf(buf, "%llu\n", size); >>> } >>> @@ -293,19 +293,12 @@ static ssize_t target_node_show(struct device *dev, >>> } >>> static DEVICE_ATTR_RO(target_node); >>> >>> -static unsigned long long dev_dax_resource(struct dev_dax *dev_dax) >>> -{ >>> - struct dax_region 
*dax_region = dev_dax->region; >>> - >>> - return dax_region->res.start; >>> -} >>> - >>> static ssize_t resource_show(struct device *dev, >>> struct device_attribute *attr, char *buf) >>> { >>> struct dev_dax *dev_dax = to_dev_dax(dev); >>> >>> - return sprintf(buf, "%#llx\n", dev_dax_resource(dev_dax)); >>> + return sprintf(buf, "%#llx\n", dev_dax->range.start); >>> } >>> static DEVICE_ATTR(resource, 0400, resource_show, NULL); >>> >>> @@ -376,6 +369,7 @@ static void dev_dax_release(struct device *dev) >>> >>> dax_region_put(dax_region); >>> put_dax(dax_dev); >>> + kfree(dev_dax->pgmap); >>> kfree(dev_dax); >>> } >>> >>> @@ -412,7 +406,12 @@ struct dev_dax *devm_create_dev_dax(struct >>> dev_dax_data *data) >>> if (!dev_dax) >>> return ERR_PTR(-ENOMEM); >>> >>> - memcpy(&dev_dax->pgmap, data->pgmap, sizeof(struct dev_pagemap)); >>> + if (data->pgmap) { >>> + dev_dax->pgmap = kmemdup(data->pgmap, >>> + sizeof(struct dev_pagemap), GFP_KERNEL); >>> + if (!dev_dax->pgmap) >>> + goto err_pgmap; >>> + } >>> >>> /* >>>* No 'host' or dax_operations since there is no access to this >>> @@ -421,18 +420,19 @@ struct dev_dax *devm_create_dev_dax(struct >>> dev_dax_data *data) >>> dax_dev = alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC); >>> if (IS_ERR(dax_dev)) { >>> rc = PTR_ERR(dax_dev); >>> - goto err; >>> + goto err_alloc_dax; >>> } >>> >>> /* a device_dax instance is dead while the driver is not attached */ >>> kill_dax(dax_dev); >>> >>> - /*
Re: [PATCH v5 01/17] device-dax: make pgmap optional for instance creation
On 25.09.20 21:11, Dan Williams wrote: > The passed in dev_pagemap is only required in the pmem case as the > libnvdimm core may have reserved a vmem_altmap for dev_memremap_pages() to > place the memmap in pmem directly. In the hmem case there is no agent > reserving an altmap so it can all be handled by a core internal default. > > Pass the resource range via a new @range property of 'struct > dev_dax_data'. > > Link: > https://lkml.kernel.org/r/159643099958.4062302.10379230791041872886.st...@dwillia2-desk3.amr.corp.intel.com > Cc: David Hildenbrand > Cc: Vishal Verma > Cc: Dave Hansen > Cc: Pavel Tatashin > Cc: Brice Goglin > Cc: Dave Jiang > Cc: David Hildenbrand > Cc: Ira Weiny > Cc: Jia He > Cc: Joao Martins > Cc: Jonathan Cameron > Signed-off-by: Dan Williams > --- > drivers/dax/bus.c | 29 +++-- > drivers/dax/bus.h |2 ++ > drivers/dax/dax-private.h |9 - > drivers/dax/device.c | 28 +++- > drivers/dax/hmem/hmem.c|8 > drivers/dax/kmem.c | 12 ++-- > drivers/dax/pmem/core.c|4 > tools/testing/nvdimm/dax-dev.c |8 > 8 files changed, 62 insertions(+), 38 deletions(-) > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c > index dffa4655e128..96bd64ba95a5 100644 > --- a/drivers/dax/bus.c > +++ b/drivers/dax/bus.c > @@ -271,7 +271,7 @@ static ssize_t size_show(struct device *dev, > struct device_attribute *attr, char *buf) > { > struct dev_dax *dev_dax = to_dev_dax(dev); > - unsigned long long size = resource_size(&dev_dax->region->res); > + unsigned long long size = range_len(&dev_dax->range); > > return sprintf(buf, "%llu\n", size); > } > @@ -293,19 +293,12 @@ static ssize_t target_node_show(struct device *dev, > } > static DEVICE_ATTR_RO(target_node); > > -static unsigned long long dev_dax_resource(struct dev_dax *dev_dax) > -{ > - struct dax_region *dax_region = dev_dax->region; > - > - return dax_region->res.start; > -} > - > static ssize_t resource_show(struct device *dev, > struct device_attribute *attr, char *buf) > { > struct dev_dax *dev_dax = 
to_dev_dax(dev); > > - return sprintf(buf, "%#llx\n", dev_dax_resource(dev_dax)); > + return sprintf(buf, "%#llx\n", dev_dax->range.start); > } > static DEVICE_ATTR(resource, 0400, resource_show, NULL); > > @@ -376,6 +369,7 @@ static void dev_dax_release(struct device *dev) > > dax_region_put(dax_region); > put_dax(dax_dev); > + kfree(dev_dax->pgmap); > kfree(dev_dax); > } > > @@ -412,7 +406,12 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data > *data) > if (!dev_dax) > return ERR_PTR(-ENOMEM); > > - memcpy(&dev_dax->pgmap, data->pgmap, sizeof(struct dev_pagemap)); > + if (data->pgmap) { > + dev_dax->pgmap = kmemdup(data->pgmap, > + sizeof(struct dev_pagemap), GFP_KERNEL); > + if (!dev_dax->pgmap) > + goto err_pgmap; > + } > > /* >* No 'host' or dax_operations since there is no access to this > @@ -421,18 +420,19 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data > *data) > dax_dev = alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC); > if (IS_ERR(dax_dev)) { > rc = PTR_ERR(dax_dev); > - goto err; > + goto err_alloc_dax; > } > > /* a device_dax instance is dead while the driver is not attached */ > kill_dax(dax_dev); > > - /* from here on we're committed to teardown via dax_dev_release() */ > + /* from here on we're committed to teardown via dev_dax_release() */ > dev = &dev_dax->dev; > device_initialize(dev); > > dev_dax->dax_dev = dax_dev; > dev_dax->region = dax_region; > + dev_dax->range = data->range; > dev_dax->target_node = dax_region->target_node; > kref_get(&dax_region->kref); > > @@ -458,8 +458,9 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data > *data) > return ERR_PTR(rc); > > return dev_dax; > - > - err: > +err_alloc_dax: > + kfree(dev_dax->pgmap); > +err_pgmap: > kfree(dev_dax); > > return ERR_PTR(rc); > diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h > index 299c2e7fac09..4aeb36da83a4 100644 > ---
Re: [PATCH v5 04/17] device-dax/kmem: replace release_resource() with release_mem_region()
On 25.09.20 21:12, Dan Williams wrote: > Towards removing the mode specific @dax_kmem_res attribute from the > generic 'struct dev_dax', and preparing for multi-range support, change > the kmem driver to use the idiomatic release_mem_region() to pair with > the initial request_mem_region(). This also eliminates the need to open > code the release of the resource allocated by request_mem_region(). > > As there are no more dax_kmem_res users, delete this struct member. > > Cc: David Hildenbrand > Cc: Vishal Verma > Cc: Dave Hansen > Cc: Pavel Tatashin > Cc: Brice Goglin > Cc: Dave Jiang > Cc: David Hildenbrand > Cc: Ira Weiny > Cc: Jia He > Cc: Joao Martins > Cc: Jonathan Cameron > Signed-off-by: Dan Williams > --- > drivers/dax/dax-private.h |3 --- > drivers/dax/kmem.c| 20 +++- > 2 files changed, 7 insertions(+), 16 deletions(-) > > diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h > index 6779f683671d..12a2dbc43b40 100644 > --- a/drivers/dax/dax-private.h > +++ b/drivers/dax/dax-private.h > @@ -42,8 +42,6 @@ struct dax_region { > * @dev - device core > * @pgmap - pgmap for memmap setup / lifetime (driver owned) > * @range: resource range for the instance > - * @dax_mem_res: physical address range of hotadded DAX memory > - * @dax_mem_name: name for hotadded DAX memory via > add_memory_driver_managed() > */ > struct dev_dax { > struct dax_region *region; > @@ -52,7 +50,6 @@ struct dev_dax { > struct device dev; > struct dev_pagemap *pgmap; > struct range range; > - struct resource *dax_kmem_res; > }; > > static inline u64 range_len(struct range *range) > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > index 6fe2cb1c5f7c..e56fc688bdc5 100644 > --- a/drivers/dax/kmem.c > +++ b/drivers/dax/kmem.c > @@ -33,7 +33,7 @@ int dev_dax_kmem_probe(struct device *dev) > { > struct dev_dax *dev_dax = to_dev_dax(dev); > struct range range = dax_kmem_range(dev_dax); > - struct resource *new_res; > + struct resource *res; > char *res_name; > int numa_node; > 
int rc; > @@ -56,8 +56,8 @@ int dev_dax_kmem_probe(struct device *dev) > return -ENOMEM; > > /* Region is permanently reserved if hotremove fails. */ > - new_res = request_mem_region(range.start, range_len(&range), res_name); > - if (!new_res) { > + res = request_mem_region(range.start, range_len(&range), res_name); > + if (!res) { > dev_warn(dev, "could not reserve region [%#llx-%#llx]\n", > range.start, range.end); > kfree(res_name); > return -EBUSY; > @@ -69,23 +69,20 @@ int dev_dax_kmem_probe(struct device *dev) >* inherit flags from the parent since it may set new flags >* unknown to us that will break add_memory() below. >*/ > - new_res->flags = IORESOURCE_SYSTEM_RAM; > + res->flags = IORESOURCE_SYSTEM_RAM; > > /* >* Ensure that future kexec'd kernels will not treat this as RAM >* automatically. >*/ > - rc = add_memory_driver_managed(numa_node, new_res->start, > -resource_size(new_res), kmem_name); > + rc = add_memory_driver_managed(numa_node, range.start, > range_len(&range), kmem_name); > if (rc) { > - release_resource(new_res); > - kfree(new_res); > + release_mem_region(range.start, range_len(&range)); > kfree(res_name); > return rc; > } > > dev_set_drvdata(dev, res_name); > - dev_dax->dax_kmem_res = new_res; > > return 0; > } > @@ -95,7 +92,6 @@ static int dev_dax_kmem_remove(struct device *dev) > { > struct dev_dax *dev_dax = to_dev_dax(dev); > struct range range = dax_kmem_range(dev_dax); > - struct resource *res = dev_dax->dax_kmem_res; > const char *res_name = dev_get_drvdata(dev); > int rc; > > @@ -114,10 +110,8 @@ static int dev_dax_kmem_remove(struct device *dev) > } > > /* Release and free dax resources */ > - release_resource(res); > - kfree(res); > + release_mem_region(range.start, range_len(&range)); Does that work? AFAIKs, __release_region(&iomem_resource, (start), (n)) -> __release_region() will only remove stuff that is IORESOURCE_BUSY. Maybe storing it in drvdata is indeed the easiest way to remove it from struct dax_region. 
-- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v5 03/17] device-dax/kmem: move resource name tracking to drvdata
On 25.09.20 21:11, Dan Williams wrote: > Towards removing the mode specific @dax_kmem_res attribute from the > generic 'struct dev_dax', and preparing for multi-range support, move > resource name tracking to driver data. The memory for the resource name > needs to have its own lifetime separate from the device bind lifetime > for cases where the driver is unbound, but the kmem range could not be > unplugged from the page allocator. > > Cc: David Hildenbrand > Cc: Vishal Verma > Cc: Dave Hansen > Cc: Pavel Tatashin > Cc: Brice Goglin > Cc: Dave Jiang > Cc: David Hildenbrand > Cc: Ira Weiny > Cc: Jia He > Cc: Joao Martins > Cc: Jonathan Cameron > Signed-off-by: Dan Williams > --- > drivers/dax/kmem.c | 16 +--- > 1 file changed, 9 insertions(+), 7 deletions(-) > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > index b0d6a99cf12d..6fe2cb1c5f7c 100644 > --- a/drivers/dax/kmem.c > +++ b/drivers/dax/kmem.c > @@ -34,7 +34,7 @@ int dev_dax_kmem_probe(struct device *dev) > struct dev_dax *dev_dax = to_dev_dax(dev); > struct range range = dax_kmem_range(dev_dax); > struct resource *new_res; > - const char *new_res_name; > + char *res_name; I wonder why that change ... > int numa_node; > int rc; > > @@ -51,15 +51,15 @@ int dev_dax_kmem_probe(struct device *dev) > return -EINVAL; > } > > - new_res_name = kstrdup(dev_name(dev), GFP_KERNEL); > - if (!new_res_name) > + res_name = kstrdup(dev_name(dev), GFP_KERNEL); > + if (!res_name) > return -ENOMEM; > > /* Region is permanently reserved if hotremove fails. 
*/ > - new_res = request_mem_region(range.start, range_len(&range), > new_res_name); > + new_res = request_mem_region(range.start, range_len(&range), res_name); > if (!new_res) { > dev_warn(dev, "could not reserve region [%#llx-%#llx]\n", > range.start, range.end); > - kfree(new_res_name); > + kfree(res_name); > return -EBUSY; > } > > @@ -80,9 +80,11 @@ int dev_dax_kmem_probe(struct device *dev) > if (rc) { > release_resource(new_res); > kfree(new_res); > - kfree(new_res_name); > + kfree(res_name); > return rc; > } > + > + dev_set_drvdata(dev, res_name); > dev_dax->dax_kmem_res = new_res; > > return 0; > @@ -94,7 +96,7 @@ static int dev_dax_kmem_remove(struct device *dev) > struct dev_dax *dev_dax = to_dev_dax(dev); > struct range range = dax_kmem_range(dev_dax); > struct resource *res = dev_dax->dax_kmem_res; > - const char *res_name = res->name; > + const char *res_name = dev_get_drvdata(dev); ... because here you're back to "const". > int rc; > > /* > I do wonder if it wouldn't all be easier to just have some struct { const char *res_name; struct resource *res; }; allocating that and storing it as drvdata. But let's see what the next patch brings :) -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v5 02/17] device-dax/kmem: introduce dax_kmem_range()
On 25.09.20 21:11, Dan Williams wrote: > Towards removing the mode specific @dax_kmem_res attribute from the > generic 'struct dev_dax', and preparing for multi-range support, teach > the driver to calculate the hotplug range from the device range. The > hotplug range is the trivially calculated memory-block-size aligned > version of the device range. > > Cc: David Hildenbrand > Cc: Vishal Verma > Cc: Dave Hansen > Cc: Pavel Tatashin > Cc: Brice Goglin > Cc: Dave Jiang > Cc: David Hildenbrand > Cc: Ira Weiny > Cc: Jia He > Cc: Joao Martins > Cc: Jonathan Cameron > Signed-off-by: Dan Williams > --- > drivers/dax/kmem.c | 40 +--- > 1 file changed, 17 insertions(+), 23 deletions(-) > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > index 5bb133df147d..b0d6a99cf12d 100644 > --- a/drivers/dax/kmem.c > +++ b/drivers/dax/kmem.c > @@ -19,13 +19,20 @@ static const char *kmem_name; > /* Set if any memory will remain added when the driver will be unloaded. */ > static bool any_hotremove_failed; > > +static struct range dax_kmem_range(struct dev_dax *dev_dax) > +{ > + struct range range; > + > + /* memory-block align the hotplug range */ > + range.start = ALIGN(dev_dax->range.start, memory_block_size_bytes()); > + range.end = ALIGN_DOWN(dev_dax->range.end + 1, > memory_block_size_bytes()) - 1; > + return range; > +} > + > int dev_dax_kmem_probe(struct device *dev) > { > struct dev_dax *dev_dax = to_dev_dax(dev); > - struct range *range = &dev_dax->range; > - resource_size_t kmem_start; > - resource_size_t kmem_size; > - resource_size_t kmem_end; > + struct range range = dax_kmem_range(dev_dax); > struct resource *new_res; > const char *new_res_name; > int numa_node; > @@ -44,25 +51,14 @@ int dev_dax_kmem_probe(struct device *dev) > return -EINVAL; > } > > - /* Hotplug starting at the beginning of the next block: */ > - kmem_start = ALIGN(range->start, memory_block_size_bytes()); > - > - kmem_size = range_len(range); > - /* Adjust the size down to compensate for moving 
up kmem_start: */ > - kmem_size -= kmem_start - range->start; > - /* Align the size down to cover only complete blocks: */ > - kmem_size &= ~(memory_block_size_bytes() - 1); > - kmem_end = kmem_start + kmem_size; > - > new_res_name = kstrdup(dev_name(dev), GFP_KERNEL); > if (!new_res_name) > return -ENOMEM; > > /* Region is permanently reserved if hotremove fails. */ > - new_res = request_mem_region(kmem_start, kmem_size, new_res_name); > + new_res = request_mem_region(range.start, range_len(&range), > new_res_name); > if (!new_res) { > - dev_warn(dev, "could not reserve region [%pa-%pa]\n", > - &kmem_start, &kmem_end); > + dev_warn(dev, "could not reserve region [%#llx-%#llx]\n", > range.start, range.end); > kfree(new_res_name); > return -EBUSY; > } > @@ -96,9 +92,8 @@ int dev_dax_kmem_probe(struct device *dev) > static int dev_dax_kmem_remove(struct device *dev) > { > struct dev_dax *dev_dax = to_dev_dax(dev); > + struct range range = dax_kmem_range(dev_dax); > struct resource *res = dev_dax->dax_kmem_res; > - resource_size_t kmem_start = res->start; > - resource_size_t kmem_size = resource_size(res); > const char *res_name = res->name; > int rc; > > @@ -108,12 +103,11 @@ static int dev_dax_kmem_remove(struct device *dev) >* there is no way to hotremove this memory until reboot because device >* unbind will succeed even if we return failure. >*/ > - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size); > + rc = remove_memory(dev_dax->target_node, range.start, > range_len(&range)); > if (rc) { > any_hotremove_failed = true; > - dev_err(dev, > - "DAX region %pR cannot be hotremoved until the next > reboot\n", > - res); > + dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next > reboot\n", > + range.start, range.end); > return rc; > } > > Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v6 5/6] mm: secretmem: use PMD-size pages to amortize direct map fragmentation
On 30.09.20 17:17, James Bottomley wrote: > On Wed, 2020-09-30 at 16:45 +0200, David Hildenbrand wrote: >> On 30.09.20 16:39, James Bottomley wrote: >>> On Wed, 2020-09-30 at 13:27 +0300, Mike Rapoport wrote: >>>> On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote: >>>>> On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote: >>>>>> On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra >>>>>> wrote: >>>>>>> It will drop them down to 4k pages. Given enough inodes, >>>>>>> and allocating only a single sekrit page per pmd, we'll >>>>>>> shatter the directmap into 4k. >>>>>> >>>>>> Why? Secretmem allocates PMD-size page per inode and uses it >>>>>> as a pool of 4K pages for that inode. This way it ensures >>>>>> that __kernel_map_pages() is always called on PMD boundaries. >>>>> >>>>> Oh, you unmap the 2m page upfront? I read it like you did the >>>>> unmap at the sekrit page alloc, not the pool alloc side of >>>>> things. >>>>> >>>>> Then yes, but then you're wasting gobs of memory. Basically you >>>>> can pin 2M per inode while only accounting a single page. >>>> >>>> Right, quite like THP :) >>>> >>>> I considered using a global pool of 2M pages for secretmem and >>>> handing 4K pages to each inode from that global pool. But I've >>>> decided to waste memory in favor of simplicity. >>> >>> I can also add that the user space consumer of this we wrote does >>> its user pool allocation at a 2M granularity, so nothing is >>> actually wasted. >> >> ... for that specific user space consumer. (or am I missing >> something?) > > I'm not sure I understand what you mean? It's designed to be either > the standard wrapper or an example of how to do the standard wrapper > for the syscall. It uses the same allocator system glibc uses for > malloc/free ... which pretty much everyone uses instead of calling > sys_brk directly. If you look at the granularity glibc uses for > sys_brk, it's not 4k either. 
Okay thanks, "the user space consumer of this we wrote" didn't sound as generic to me as "the standard wrapper". -- Thanks, David / dhildenb
Re: [PATCH v6 5/6] mm: secretmem: use PMD-size pages to amortize direct map fragmentation
On 30.09.20 16:39, James Bottomley wrote: > On Wed, 2020-09-30 at 13:27 +0300, Mike Rapoport wrote: >> On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote: >>> On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote: On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote: > It will drop them down to 4k pages. Given enough inodes, and > allocating only a single sekrit page per pmd, we'll shatter the > directmap into 4k. Why? Secretmem allocates PMD-size page per inode and uses it as a pool of 4K pages for that inode. This way it ensures that __kernel_map_pages() is always called on PMD boundaries. >>> >>> Oh, you unmap the 2m page upfront? I read it like you did the unmap >>> at the sekrit page alloc, not the pool alloc side of things. >>> >>> Then yes, but then you're wasting gobs of memory. Basically you can >>> pin 2M per inode while only accounting a single page. >> >> Right, quite like THP :) >> >> I considered using a global pool of 2M pages for secretmem and >> handing 4K pages to each inode from that global pool. But I've >> decided to waste memory in favor of simplicity. > > I can also add that the user space consumer of this we wrote does its > user pool allocation at a 2M granularity, so nothing is actually > wasted. ... for that specific user space consumer. (or am I missing something?) -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v6 5/6] mm: secretmem: use PMD-size pages to amortize direct map fragmentation
On 25.09.20 09:41, Peter Zijlstra wrote: > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote: >> From: Mike Rapoport >> >> Removing a PAGE_SIZE page from the direct map every time such page is >> allocated for a secret memory mapping will cause severe fragmentation of >> the direct map. This fragmentation can be reduced by using PMD-size pages >> as a pool for small pages for secret memory mappings. >> >> Add a gen_pool per secretmem inode and lazily populate this pool with >> PMD-size pages. > > What's the actual efficacy of this? Since the pmd is per inode, all I > need is a lot of inodes and we're in business to destroy the directmap, > no? > > Afaict there's no privs needed to use this, all a process needs is to > stay below the mlock limit, so a 'fork-bomb' that maps a single secret > page will utterly destroy the direct map. > > I really don't like this, at all. As I expressed earlier, I would prefer allowing allocation of secretmem only from a previously defined CMA area. This would physically locally limit the pain. But my suggestion was not well received :) -- Thanks, David / dhildenb
Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res
On 24.09.20 23:50, Dan Williams wrote: > On Thu, Sep 24, 2020 at 2:42 PM David Hildenbrand wrote: >> >> >> >>> Am 24.09.2020 um 23:26 schrieb Dan Williams : >>> >>> [..] >>>>> I'm not suggesting to busy the whole "virtio" range, just the portion >>>>> that's about to be passed to add_memory_driver_managed(). >>>> >>>> I'm afraid I don't get your point. For virtio-mem: >>>> >>>> Before: >>>> >>>> 1. Create virtio0 container resource >>>> >>>> 2. (somewhen in the future) add_memory_driver_managed() >>>> - Create resource (System RAM (virtio_mem)), marking it busy/driver >>>> managed >>>> >>>> After: >>>> >>>> 1. Create virtio0 container resource >>>> >>>> 2. (somewhen in the future) Create resource (System RAM (virtio_mem)), >>>> marking it busy/driver managed >>>> 3. add_memory_driver_managed() >>>> >>>> Not helpful or simpler IMHO. >>> >>> The concern I'm trying to address is the theoretical race window and >>> layering violation in this sequence in the kmem driver: >>> >>> 1/ res = request_mem_region(...); >>> 2/ res->flags = IORESOURCE_MEM; >>> 3/ add_memory_driver_managed(); >>> >>> Between 2/ and 3/ something can race and think that it owns the >>> region. Do I think it will happen in practice, no, but it's still a >>> pattern that deserves come cleanup. >> >> I think in that unlikely event (rather impossible), >> add_memory_driver_managed() should fail, detecting a conflicting (busy) >> resource. Not sure what will happen next ( and did not double-check). > > add_memory_driver_managed() will fail, but the release_mem_region() in > kmem to unwind on the error path will do the wrong thing because that > other driver thinks it got ownership of the region. > I think if somebody would race and claim the region for itself (after we unchecked the BUSY flag), there would be another memory resource below our resource container (e.g., via __request_region()). 
So, interestingly, the current code will do a release_resource->__release_resource(old, true); which will remove whatever somebody added below the resource. If we were to do a remove_resource->__release_resource(old, false); we would only remove what we temporarily added, relocating any children (someone nasty added). But yeah, I don't think we have to worry about this case. >> But yeah, the way the BUSY bit is cleared here is wrong - simply overwriting >> other bits. And it would be even better if we could avoid manually messing >> with flags here. > > I'm ok to leave it alone for now (hasn't been and likely never will be > a problem in practice), but I think it was still worth grumbling Definitely, it gives us a better understanding. > about. I'll leave that part of kmem alone in the upcoming split of > dax_kmem_res removal. Yeah, stuff is more complicated than I would have wished, so I guess it's better to leave it alone for now until we actually see issues with somebody else regarding *our* device-owned region (or we're able to come up with a cleanup that keeps all corner cases working for kmem and virtio-mem). -- Thanks, David / dhildenb
Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res
> Am 24.09.2020 um 23:26 schrieb Dan Williams : > > [..] >>> I'm not suggesting to busy the whole "virtio" range, just the portion >>> that's about to be passed to add_memory_driver_managed(). >> >> I'm afraid I don't get your point. For virtio-mem: >> >> Before: >> >> 1. Create virtio0 container resource >> >> 2. (somewhen in the future) add_memory_driver_managed() >> - Create resource (System RAM (virtio_mem)), marking it busy/driver >> managed >> >> After: >> >> 1. Create virtio0 container resource >> >> 2. (somewhen in the future) Create resource (System RAM (virtio_mem)), >> marking it busy/driver managed >> 3. add_memory_driver_managed() >> >> Not helpful or simpler IMHO. > > The concern I'm trying to address is the theoretical race window and > layering violation in this sequence in the kmem driver: > > 1/ res = request_mem_region(...); > 2/ res->flags = IORESOURCE_MEM; > 3/ add_memory_driver_managed(); > > Between 2/ and 3/ something can race and think that it owns the > region. Do I think it will happen in practice, no, but it's still a > pattern that deserves come cleanup. I think in that unlikely event (rather impossible), add_memory_driver_managed() should fail, detecting a conflicting (busy) resource. Not sure what will happen next ( and did not double-check). But yeah, the way the BUSY bit is cleared here is wrong - simply overwriting other bits. And it would be even better if we could avoid manually messing with flags here. > ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res
On 24.09.20 15:54, Dan Williams wrote: > On Thu, Sep 24, 2020 at 12:26 AM David Hildenbrand wrote: >> >> On 23.09.20 23:41, Dan Williams wrote: >>> On Wed, Sep 23, 2020 at 1:04 AM David Hildenbrand wrote: >>>> >>>> On 08.09.20 17:33, Joao Martins wrote: >>>>> [Sorry for the late response] >>>>> >>>>> On 8/21/20 11:06 AM, David Hildenbrand wrote: >>>>>> On 03.08.20 07:03, Dan Williams wrote: >>>>>>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev) >>>>>>> * could be mixed in a node with faster memory, causing >>>>>>> * unavoidable performance issues. >>>>>>> */ >>>>>>> - numa_node = dev_dax->target_node; >>>>>>> if (numa_node < 0) { >>>>>>> dev_warn(dev, "rejecting DAX region with invalid node: >>>>>>> %d\n", >>>>>>> numa_node); >>>>>>> return -EINVAL; >>>>>>> } >>>>>>> >>>>>>> - /* Hotplug starting at the beginning of the next block: */ >>>>>>> - kmem_start = ALIGN(range->start, memory_block_size_bytes()); >>>>>>> - >>>>>>> - kmem_size = range_len(range); >>>>>>> - /* Adjust the size down to compensate for moving up kmem_start: */ >>>>>>> - kmem_size -= kmem_start - range->start; >>>>>>> - /* Align the size down to cover only complete blocks: */ >>>>>>> - kmem_size &= ~(memory_block_size_bytes() - 1); >>>>>>> - kmem_end = kmem_start + kmem_size; >>>>>>> - >>>>>>> - new_res_name = kstrdup(dev_name(dev), GFP_KERNEL); >>>>>>> - if (!new_res_name) >>>>>>> + res_name = kstrdup(dev_name(dev), GFP_KERNEL); >>>>>>> + if (!res_name) >>>>>>> return -ENOMEM; >>>>>>> >>>>>>> - /* Region is permanently reserved if hotremove fails. */ >>>>>>> - new_res = request_mem_region(kmem_start, kmem_size, new_res_name); >>>>>>> - if (!new_res) { >>>>>>> - dev_warn(dev, "could not reserve region [%pa-%pa]\n", >>>>>>> -&kmem_start, &kmem_end); >>>>>>> - kfree(new_res_name); >>>>>>> + res = request_mem_region(range.start, range_len(&range), res_name); >>>>>> >>>>>> I think our range could be empty after aligning. 
I assume >>>>>> request_mem_region() would check that, but maybe we could report a >>>>>> better error/warning in that case. >>>>>> >>>>> dax_kmem_range() already returns a memory-block-aligned @range but >>>>> IIUC request_mem_region() isn't checking for that. Having said that >>>>> the returned @res wouldn't be different from the passed range.start. >>>>> >>>>>>> /* >>>>>>> * Ensure that future kexec'd kernels will not treat this as RAM >>>>>>> * automatically. >>>>>>> */ >>>>>>> - rc = add_memory_driver_managed(numa_node, new_res->start, >>>>>>> - resource_size(new_res), kmem_name); >>>>>>> + rc = add_memory_driver_managed(numa_node, res->start, >>>>>>> + resource_size(res), kmem_name); >>>>>>> + >>>>>>> + res->flags |= IORESOURCE_BUSY; >>>>>> >>>>>> Hm, I don't think that's correct. Any specific reason why to mark the >>>>>> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could >>>>>> suddenly stumble over it - and e.g., similarly kexec code when trying to >>>>>> find memory for placing kexec images. I think we should leave this >>>>>> !BUSY, just as it is right now. >>>>>> >>>>> Agreed. >>>>> >>>>>>> if (rc) { >>>>>>> - release_resource(new_res); >>>>>>> - kfree(new_res); >>>>>>> - kfree(new_res_name)
Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res
On 23.09.20 23:41, Dan Williams wrote: > On Wed, Sep 23, 2020 at 1:04 AM David Hildenbrand wrote: >> >> On 08.09.20 17:33, Joao Martins wrote: >>> [Sorry for the late response] >>> >>> On 8/21/20 11:06 AM, David Hildenbrand wrote: >>>> On 03.08.20 07:03, Dan Williams wrote: >>>>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev) >>>>> * could be mixed in a node with faster memory, causing >>>>> * unavoidable performance issues. >>>>> */ >>>>> - numa_node = dev_dax->target_node; >>>>> if (numa_node < 0) { >>>>> dev_warn(dev, "rejecting DAX region with invalid node: %d\n", >>>>> numa_node); >>>>> return -EINVAL; >>>>> } >>>>> >>>>> - /* Hotplug starting at the beginning of the next block: */ >>>>> - kmem_start = ALIGN(range->start, memory_block_size_bytes()); >>>>> - >>>>> - kmem_size = range_len(range); >>>>> - /* Adjust the size down to compensate for moving up kmem_start: */ >>>>> - kmem_size -= kmem_start - range->start; >>>>> - /* Align the size down to cover only complete blocks: */ >>>>> - kmem_size &= ~(memory_block_size_bytes() - 1); >>>>> - kmem_end = kmem_start + kmem_size; >>>>> - >>>>> - new_res_name = kstrdup(dev_name(dev), GFP_KERNEL); >>>>> - if (!new_res_name) >>>>> + res_name = kstrdup(dev_name(dev), GFP_KERNEL); >>>>> + if (!res_name) >>>>> return -ENOMEM; >>>>> >>>>> - /* Region is permanently reserved if hotremove fails. */ >>>>> - new_res = request_mem_region(kmem_start, kmem_size, new_res_name); >>>>> - if (!new_res) { >>>>> - dev_warn(dev, "could not reserve region [%pa-%pa]\n", >>>>> -&kmem_start, &kmem_end); >>>>> - kfree(new_res_name); >>>>> + res = request_mem_region(range.start, range_len(&range), res_name); >>>> >>>> I think our range could be empty after aligning. I assume >>>> request_mem_region() would check that, but maybe we could report a >>>> better error/warning in that case. >>>> >>> dax_kmem_range() already returns a memory-block-aligned @range but >>> IIUC request_mem_region() isn't checking for that. 
Having said that >>> the returned @res wouldn't be different from the passed range.start. >>> >>>>> /* >>>>> * Ensure that future kexec'd kernels will not treat this as RAM >>>>> * automatically. >>>>> */ >>>>> - rc = add_memory_driver_managed(numa_node, new_res->start, >>>>> - resource_size(new_res), kmem_name); >>>>> + rc = add_memory_driver_managed(numa_node, res->start, >>>>> + resource_size(res), kmem_name); >>>>> + >>>>> + res->flags |= IORESOURCE_BUSY; >>>> >>>> Hm, I don't think that's correct. Any specific reason why to mark the >>>> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could >>>> suddenly stumble over it - and e.g., similarly kexec code when trying to >>>> find memory for placing kexec images. I think we should leave this >>>> !BUSY, just as it is right now. >>>> >>> Agreed. >>> >>>>> if (rc) { >>>>> - release_resource(new_res); >>>>> - kfree(new_res); >>>>> - kfree(new_res_name); >>>>> + release_mem_region(range.start, range_len(&range)); >>>>> + kfree(res_name); >>>>> return rc; >>>>> } >>>>> - dev_dax->dax_kmem_res = new_res; >>>>> + >>>>> + dev_set_drvdata(dev, res_name); >>>>> >>>>> return 0; >>>>> } >>>>> >>>>> #ifdef CONFIG_MEMORY_HOTREMOVE >>>>> -static int dev_dax_kmem_remove(struct device *dev) >>>>> +static void dax_kmem_release(struct dev_dax *dev_dax) >>>>> { >>>>> - struct dev_dax *dev_dax
Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res
On 08.09.20 17:33, Joao Martins wrote:
> [Sorry for the late response]
>
> On 8/21/20 11:06 AM, David Hildenbrand wrote:
>> On 03.08.20 07:03, Dan Williams wrote:
>>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev)
>>>  	 * could be mixed in a node with faster memory, causing
>>>  	 * unavoidable performance issues.
>>>  	 */
>>> -	numa_node = dev_dax->target_node;
>>>  	if (numa_node < 0) {
>>>  		dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
>>>  			numa_node);
>>>  		return -EINVAL;
>>>  	}
>>>
>>> -	/* Hotplug starting at the beginning of the next block: */
>>> -	kmem_start = ALIGN(range->start, memory_block_size_bytes());
>>> -
>>> -	kmem_size = range_len(range);
>>> -	/* Adjust the size down to compensate for moving up kmem_start: */
>>> -	kmem_size -= kmem_start - range->start;
>>> -	/* Align the size down to cover only complete blocks: */
>>> -	kmem_size &= ~(memory_block_size_bytes() - 1);
>>> -	kmem_end = kmem_start + kmem_size;
>>> -
>>> -	new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>>> -	if (!new_res_name)
>>> +	res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>>> +	if (!res_name)
>>>  		return -ENOMEM;
>>>
>>> -	/* Region is permanently reserved if hotremove fails. */
>>> -	new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
>>> -	if (!new_res) {
>>> -		dev_warn(dev, "could not reserve region [%pa-%pa]\n",
>>> -			&kmem_start, &kmem_end);
>>> -		kfree(new_res_name);
>>> +	res = request_mem_region(range.start, range_len(&range), res_name);
>>
>> I think our range could be empty after aligning. I assume
>> request_mem_region() would check that, but maybe we could report a
>> better error/warning in that case.
>>
> dax_kmem_range() already returns a memory-block-aligned @range but
> IIUC request_mem_region() isn't checking for that. Having said that
> the returned @res wouldn't be different from the passed range.start.
>
>>>  	/*
>>>  	 * Ensure that future kexec'd kernels will not treat this as RAM
>>>  	 * automatically.
>>>  	 */
>>> -	rc = add_memory_driver_managed(numa_node, new_res->start,
>>> -				       resource_size(new_res), kmem_name);
>>> +	rc = add_memory_driver_managed(numa_node, res->start,
>>> +				       resource_size(res), kmem_name);
>>> +
>>> +	res->flags |= IORESOURCE_BUSY;
>>
>> Hm, I don't think that's correct. Any specific reason why to mark the
>> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could
>> suddenly stumble over it - and e.g., similarly kexec code when trying to
>> find memory for placing kexec images. I think we should leave this
>> !BUSY, just as it is right now.
>>
> Agreed.
>
>>>  	if (rc) {
>>> -		release_resource(new_res);
>>> -		kfree(new_res);
>>> -		kfree(new_res_name);
>>> +		release_mem_region(range.start, range_len(&range));
>>> +		kfree(res_name);
>>>  		return rc;
>>>  	}
>>> -	dev_dax->dax_kmem_res = new_res;
>>> +
>>> +	dev_set_drvdata(dev, res_name);
>>>
>>>  	return 0;
>>> }
>>>
>>> #ifdef CONFIG_MEMORY_HOTREMOVE
>>> -static int dev_dax_kmem_remove(struct device *dev)
>>> +static void dax_kmem_release(struct dev_dax *dev_dax)
>>> {
>>> -	struct dev_dax *dev_dax = to_dev_dax(dev);
>>> -	struct resource *res = dev_dax->dax_kmem_res;
>>> -	resource_size_t kmem_start = res->start;
>>> -	resource_size_t kmem_size = resource_size(res);
>>> -	const char *res_name = res->name;
>>>  	int rc;
>>> +	struct device *dev = &dev_dax->dev;
>>> +	const char *res_name = dev_get_drvdata(dev);
>>> +	struct range range = dax_kmem_range(dev_dax);
>>>
>>>  	/*
>>>  	 * We have one shot for removing memory, if some memory blocks were not
>>>  	 * offline prior to calling this function remove_memory() will fail, and
>>>  	 * there is no way to hotremove this memory until reboot because device
>>> -	 * unbind w
Re: [PATCH] kernel/resource: Fix use of ternary condition in release_mem_region_adjustable
On 22.09.20 08:07, Nathan Chancellor wrote:
> Clang warns:
>
> kernel/resource.c:1281:53: warning: operator '?:' has lower precedence
> than '|'; '|' will be evaluated first [-Wbitwise-conditional-parentheses]
>         new_res = alloc_resource(GFP_KERNEL | alloc_nofail ? __GFP_NOFAIL : 0);
>                                             ~ ^
> kernel/resource.c:1281:53: note: place parentheses around the '|'
> expression to silence this warning
>         new_res = alloc_resource(GFP_KERNEL | alloc_nofail ? __GFP_NOFAIL : 0);
>                                             ~ ^
> kernel/resource.c:1281:53: note: place parentheses around the '?:'
> expression to evaluate it first
>         new_res = alloc_resource(GFP_KERNEL | alloc_nofail ? __GFP_NOFAIL : 0);
>                                               ^
>                                               (                             )
> 1 warning generated.
>
> Add the parentheses as it was clearly intended for the ternary condition
> to be evaluated first.
>
> Fixes: 5fd23bd0d739 ("kernel/resource: make release_mem_region_adjustable() never fail")
> Link: https://github.com/ClangBuiltLinux/linux/issues/1159
> Signed-off-by: Nathan Chancellor
> ---
>
> Presumably, this will be squashed but I included a fixes tag
> nonetheless. Apologies if this has already been noticed and fixed
> already, I did not find anything on LKML.

Hasn't been noticed before (I guess most people build with GCC, which
does not warn in this instance, at least for me) thanks! Commit ids are
not stable yet, so Andrew will most probably squash it.

Reviewed-by: David Hildenbrand

>
> kernel/resource.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/resource.c b/kernel/resource.c
> index ca2a666e4317..3ae2f56cc79d 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1278,7 +1278,7 @@ void release_mem_region_adjustable(resource_size_t start, resource_size_t size)
>  	 * similarly).
>  	 */
>  retry:
> -	new_res = alloc_resource(GFP_KERNEL | alloc_nofail ? __GFP_NOFAIL : 0);
> +	new_res = alloc_resource(GFP_KERNEL | (alloc_nofail ? __GFP_NOFAIL : 0));
>
>  	p = &parent->child;
>  	write_lock(&resource_lock);
>
> base-commit: 40ee82f47bf297e31d0c47547cd8f24ede52415a
> --

Thanks,
David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] kernel/resource: make iomem_resource implicit in release_mem_region_adjustable()
On 16.09.20 14:10, Wei Yang wrote: > On Wed, Sep 16, 2020 at 12:03:20PM +0200, David Hildenbrand wrote: >> On 16.09.20 12:02, Wei Yang wrote: >>> On Wed, Sep 16, 2020 at 09:30:41AM +0200, David Hildenbrand wrote: >>>> "mem" in the name already indicates the root, similar to >>>> release_mem_region() and devm_request_mem_region(). Make it implicit. >>>> The only single caller always passes iomem_resource, other parents are >>>> not applicable. >>>> >>> >>> Looks good to me. >>> >>> Reviewed-by: Wei Yang >>> >> >> Thanks for the review! >> > > Would you send another version? I didn't take a look into the following > patches, since the 4th is missed. Not planning to send another one as long as there are no further comments. Seems to be an issue on your side because all patches arrived on linux-mm (see https://lore.kernel.org/linux-mm/20200911103459.10306-1-da...@redhat.com/) You can find patch #4 at https://lore.kernel.org/linux-mm/20200911103459.10306-5-da...@redhat.com/ (which has CC "Wei Yang ") -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] kernel/resource: make iomem_resource implicit in release_mem_region_adjustable()
On 16.09.20 12:02, Wei Yang wrote: > On Wed, Sep 16, 2020 at 09:30:41AM +0200, David Hildenbrand wrote: >> "mem" in the name already indicates the root, similar to >> release_mem_region() and devm_request_mem_region(). Make it implicit. >> The only single caller always passes iomem_resource, other parents are >> not applicable. >> > > Looks good to me. > > Reviewed-by: Wei Yang > Thanks for the review! -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH] kernel/resource: make iomem_resource implicit in release_mem_region_adjustable()
"mem" in the name already indicates the root, similar to
release_mem_region() and devm_request_mem_region(). Make it implicit.
The only single caller always passes iomem_resource, other parents are
not applicable.

Suggested-by: Wei Yang
Cc: Andrew Morton
Cc: Michal Hocko
Cc: Dan Williams
Cc: Jason Gunthorpe
Cc: Kees Cook
Cc: Ard Biesheuvel
Cc: Pankaj Gupta
Cc: Baoquan He
Cc: Wei Yang
Signed-off-by: David Hildenbrand
---

Based on next-20200915. Follow up on "[PATCH v4 0/8] selective merging
of system ram resources" [1]. That's in next-20200915. As noted during
review of v2 by Wei [2].

[1] https://lkml.kernel.org/r/20200911103459.10306-1-da...@redhat.com
[2] https://lkml.kernel.org/r/20200915021012.GC2007@L-31X9LVDL-1304.local

---
 include/linux/ioport.h | 3 +--
 kernel/resource.c      | 5 ++---
 mm/memory_hotplug.c    | 2 +-
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 7e61389dcb01..5135d4b86cd6 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -251,8 +251,7 @@ extern struct resource * __request_region(struct resource *,
 extern void __release_region(struct resource *, resource_size_t,
 				resource_size_t);
 #ifdef CONFIG_MEMORY_HOTREMOVE
-extern void release_mem_region_adjustable(struct resource *, resource_size_t,
-					  resource_size_t);
+extern void release_mem_region_adjustable(resource_size_t, resource_size_t);
 #endif
 #ifdef CONFIG_MEMORY_HOTPLUG
 extern void merge_system_ram_resource(struct resource *res);
diff --git a/kernel/resource.c b/kernel/resource.c
index 7a91b935f4c2..ca2a666e4317 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1240,7 +1240,6 @@ EXPORT_SYMBOL(__release_region);
 #ifdef CONFIG_MEMORY_HOTREMOVE
 /**
  * release_mem_region_adjustable - release a previously reserved memory region
- * @parent: parent resource descriptor
  * @start: resource start address
  * @size: resource region size
  *
@@ -1258,9 +1257,9 @@ EXPORT_SYMBOL(__release_region);
  * assumes that all children remain in the lower address entry for
  * simplicity. Enhance this logic when necessary.
  */
-void release_mem_region_adjustable(struct resource *parent,
-				   resource_size_t start, resource_size_t size)
+void release_mem_region_adjustable(resource_size_t start, resource_size_t size)
 {
+	struct resource *parent = &iomem_resource;
 	struct resource *new_res = NULL;
 	bool alloc_nofail = false;
 	struct resource **p;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 553c718226b3..7c5e4744ac51 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1764,7 +1764,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 		memblock_remove(start, size);
 	}
 
-	release_mem_region_adjustable(&iomem_resource, start, size);
+	release_mem_region_adjustable(start, size);
 
 	try_offline_node(nid);
-- 
2.26.2
Re: [PATCH v2 1/7] kernel/resource: make release_mem_region_adjustable() never fail
On 15.09.20 11:06, Wei Yang wrote: > On Tue, Sep 15, 2020 at 09:35:30AM +0200, David Hildenbrand wrote: >> >>>> static int __ref try_remove_memory(int nid, u64 start, u64 size) >>>> { >>>>int rc = 0; >>>> @@ -1777,7 +1757,7 @@ static int __ref try_remove_memory(int nid, u64 >>>> start, u64 size) >>>>memblock_remove(start, size); >>>>} >>>> >>>> - __release_memory_resource(start, size); >>>> + release_mem_region_adjustable(&iomem_resource, start, size); >>>> >>> >>> Seems the only user of release_mem_region_adjustable() is here, can we move >>> iomem_resource into the function body? Actually, we don't iterate the >>> resource >>> tree from any level. We always start from the root. >> >> You mean, making iomem_resource implicit? I can spot that something >> similar was done for >> >> #define devm_release_mem_region(dev, start, n) \ >> __devm_release_region(dev, &iomem_resource, (start), (n)) >> > > What I prefer is remove iomem_resource from the parameter list. Just use is in > the function body. > > For the example you listed, __release_region() would have varies of *parent*, > which looks reasonable to keep it here. Yeah I got that ("making iomem_resource implicit"), as I said: >> I'll send an addon patch for that, ok? - thanks. -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v2 2/7] kernel/resource: move and rename IORESOURCE_MEM_DRIVER_MANAGED
On 15.09.20 04:20, Wei Yang wrote: > On Tue, Sep 08, 2020 at 10:10:07PM +0200, David Hildenbrand wrote: >> IORESOURCE_MEM_DRIVER_MANAGED currently uses an unused PnP bit, which is >> always set to 0 by hardware. This is far from beautiful (and confusing), >> and the bit only applies to SYSRAM. So let's move it out of the >> bus-specific (PnP) defined bits. >> >> We'll add another SYSRAM specific bit soon. If we ever need more bits for >> other purposes, we can steal some from "desc", or reshuffle/regroup what we >> have. > > I think you make this definition because we use IORESOURCE_SYSRAM_RAM for > hotpluged memory? So we make them all in IORESOURCE_SYSRAM_XXX family? Yeah, to specify based on the extended MEM type SYSRAM. Because it really only applies to that. -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v2 1/7] kernel/resource: make release_mem_region_adjustable() never fail
>> static int __ref try_remove_memory(int nid, u64 start, u64 size) >> { >> int rc = 0; >> @@ -1777,7 +1757,7 @@ static int __ref try_remove_memory(int nid, u64 start, >> u64 size) >> memblock_remove(start, size); >> } >> >> -__release_memory_resource(start, size); >> +release_mem_region_adjustable(&iomem_resource, start, size); >> > > Seems the only user of release_mem_region_adjustable() is here, can we move > iomem_resource into the function body? Actually, we don't iterate the resource > tree from any level. We always start from the root. You mean, making iomem_resource implicit? I can spot that something similar was done for #define devm_release_mem_region(dev, start, n) \ __devm_release_region(dev, &iomem_resource, (start), (n)) I'll send an addon patch for that, ok? - thanks. -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH v4 5/8] mm/memory_hotplug: MEMHP_MERGE_RESOURCE to specify merging of System RAM resources
On 15.09.20 04:43, Wei Yang wrote: > On Fri, Sep 11, 2020 at 12:34:56PM +0200, David Hildenbrand wrote: >> Some add_memory*() users add memory in small, contiguous memory blocks. >> Examples include virtio-mem, hyper-v balloon, and the XEN balloon. >> >> This can quickly result in a lot of memory resources, whereby the actual >> resource boundaries are not of interest (e.g., it might be relevant for >> DIMMs, exposed via /proc/iomem to user space). We really want to merge >> added resources in this scenario where possible. >> >> Let's provide a flag (MEMHP_MERGE_RESOURCE) to specify that a resource >> either created within add_memory*() or passed via add_memory_resource() >> shall be marked mergeable and merged with applicable siblings. >> >> To implement that, we need a kernel/resource interface to mark selected >> System RAM resources mergeable (IORESOURCE_SYSRAM_MERGEABLE) and trigger >> merging. >> >> Note: We really want to merge after the whole operation succeeded, not >> directly when adding a resource to the resource tree (it would break >> add_memory_resource() and require splitting resources again when the >> operation failed - e.g., due to -ENOMEM). > > Oops, the latest version is here. Yeah, sorry, I dropped the "mm" prefix on the subject of the cover letter by mistake. > > BTW, I don't see patch 4. Not sure it is junked by my mail system? At least you're in the CC list below with your old mail address (I assume you monitor that). I'll try to use your new address in the future. -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH v4 8/8] hv_balloon: try to merge system ram resources
Let's try to merge system ram resources we add, to minimize the number
of resources in /proc/iomem. We don't care about the boundaries of
individual chunks we added.

Reviewed-by: Wei Liu
Cc: Andrew Morton
Cc: Michal Hocko
Cc: "K. Y. Srinivasan"
Cc: Haiyang Zhang
Cc: Stephen Hemminger
Cc: Wei Liu
Cc: Pankaj Gupta
Cc: Baoquan He
Cc: Wei Yang
Signed-off-by: David Hildenbrand
---
 drivers/hv/hv_balloon.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 3c0d52e244520..b64d2efbefe71 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -726,7 +726,7 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 	nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
 	ret = add_memory(nid, PFN_PHYS((start_pfn)),
-			(HA_CHUNK << PAGE_SHIFT), MHP_NONE);
+			(HA_CHUNK << PAGE_SHIFT), MEMHP_MERGE_RESOURCE);
 	if (ret) {
 		pr_err("hot_add memory failed error is %d\n", ret);
-- 
2.26.2
[PATCH v4 7/8] xen/balloon: try to merge system ram resources
Let's try to merge system ram resources we add, to minimize the number
of resources in /proc/iomem. We don't care about the boundaries of
individual chunks we added.

Reviewed-by: Juergen Gross
Cc: Andrew Morton
Cc: Michal Hocko
Cc: Boris Ostrovsky
Cc: Juergen Gross
Cc: Stefano Stabellini
Cc: Roger Pau Monné
Cc: Julien Grall
Cc: Pankaj Gupta
Cc: Baoquan He
Cc: Wei Yang
Signed-off-by: David Hildenbrand
---
 drivers/xen/balloon.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 9f40a294d398d..b57b2067ecbfb 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -331,7 +331,7 @@ static enum bp_state reserve_additional_memory(void)
 	mutex_unlock(&balloon_mutex);
 	/* add_memory_resource() requires the device_hotplug lock */
 	lock_device_hotplug();
-	rc = add_memory_resource(nid, resource, MHP_NONE);
+	rc = add_memory_resource(nid, resource, MEMHP_MERGE_RESOURCE);
 	unlock_device_hotplug();
 	mutex_lock(&balloon_mutex);
-- 
2.26.2
[PATCH v4 4/8] mm/memory_hotplug: prepare passing flags to add_memory() and friends
We soon want to pass flags, e.g., to mark added System RAM resources
mergeable. Prepare for that.

This patch is based on a similar patch by Oscar Salvador:

https://lkml.kernel.org/r/20190625075227.15193-3-osalva...@suse.de

Acked-by: Wei Liu
Reviewed-by: Juergen Gross # Xen related part
Reviewed-by: Pankaj Gupta
Cc: Andrew Morton
Cc: Michal Hocko
Cc: Dan Williams
Cc: Jason Gunthorpe
Cc: Pankaj Gupta
Cc: Baoquan He
Cc: Wei Yang
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: "Rafael J. Wysocki"
Cc: Len Brown
Cc: Greg Kroah-Hartman
Cc: Vishal Verma
Cc: Dave Jiang
Cc: "K. Y. Srinivasan"
Cc: Haiyang Zhang
Cc: Stephen Hemminger
Cc: Wei Liu
Cc: Heiko Carstens
Cc: Vasily Gorbik
Cc: Christian Borntraeger
Cc: David Hildenbrand
Cc: "Michael S. Tsirkin"
Cc: Jason Wang
Cc: Boris Ostrovsky
Cc: Juergen Gross
Cc: Stefano Stabellini
Cc: "Oliver O'Halloran"
Cc: Pingfan Liu
Cc: Nathan Lynch
Cc: Libor Pechacek
Cc: Anton Blanchard
Cc: Leonardo Bras
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-a...@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: David Hildenbrand
---
 arch/powerpc/platforms/powernv/memtrace.c       |  2 +-
 arch/powerpc/platforms/pseries/hotplug-memory.c |  2 +-
 drivers/acpi/acpi_memhotplug.c                  |  3 ++-
 drivers/base/memory.c                           |  3 ++-
 drivers/dax/kmem.c                              |  2 +-
 drivers/hv/hv_balloon.c                         |  2 +-
 drivers/s390/char/sclp_cmd.c                    |  2 +-
 drivers/virtio/virtio_mem.c                     |  2 +-
 drivers/xen/balloon.c                           |  2 +-
 include/linux/memory_hotplug.h                  | 16
 mm/memory_hotplug.c                             | 14 +++---
 11 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 13b369d2cc454..6828108486f83 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -224,7 +224,7 @@ static int memtrace_online(void)
 			ent->mem = 0;
 		}
 
-		if (add_memory(ent->nid, ent->start, ent->size)) {
+		if (add_memory(ent->nid, ent->start, ent->size, MHP_NONE)) {
 			pr_err("Failed to add trace memory to node %d\n",
 				ent->nid);
 			ret += 1;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 0ea976d1cac47..e1c9fa0d730f5 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -615,7 +615,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 	nid = memory_add_physaddr_to_nid(lmb->base_addr);
 
 	/* Add the memory */
-	rc = __add_memory(nid, lmb->base_addr, block_sz);
+	rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE);
 	if (rc) {
 		invalidate_lmb_associativity_index(lmb);
 		return rc;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index e294f44a78504..2067c3bc55763 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -207,7 +207,8 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (node < 0)
 			node = memory_add_physaddr_to_nid(info->start_addr);
 
-		result = __add_memory(node, info->start_addr, info->length);
+		result = __add_memory(node, info->start_addr, info->length,
+				      MHP_NONE);
 
 		/*
 		 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 4db3c660de831..b4c297dd04755 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -432,7 +432,8 @@ static ssize_t probe_store(struct device *dev, struct device_attribute *attr,
 
 	nid = memory_add_physaddr_to_nid(phys_addr);
 	ret = __add_memory(nid, phys_addr,
-			   MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+			   MIN_MEMORY_BLOCK_SIZE * sections_per_block,
+			   MHP_NONE);
 	if (ret)
 		goto out;
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 7dcb2902e9b1b..896cb9444e727 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -95,7 +95,7 @@ int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	 * this as RAM automatically.
 	 */
 	rc = add_