Re: [PATCH v19 5/8] mm: introduce memfd_secret system call to create "secret" memory areas

2021-05-18 Thread David Hildenbrand

On 18.05.21 12:31, Michal Hocko wrote:

On Tue 18-05-21 12:06:42, David Hildenbrand wrote:

On 18.05.21 11:59, Michal Hocko wrote:

On Sun 16-05-21 10:29:24, Mike Rapoport wrote:

On Fri, May 14, 2021 at 11:25:43AM +0200, David Hildenbrand wrote:

[...]

+   if (!page)
+   return VM_FAULT_OOM;
+
+   err = set_direct_map_invalid_noflush(page, 1);
+   if (err) {
+   put_page(page);
+   return vmf_error(err);


Would we want to translate that to a proper VM_FAULT_..., which would most
probably be VM_FAULT_OOM when we fail to allocate a pagetable?


That's what vmf_error does, it translates -ESOMETHING to VM_FAULT_XYZ.
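
(For reference, vmf_error() is roughly the following, paraphrasing
include/linux/mm.h around this kernel version; only -ENOMEM becomes
VM_FAULT_OOM, anything else becomes VM_FAULT_SIGBUS:

	static inline vm_fault_t vmf_error(int err)
	{
		if (err == -ENOMEM)
			return VM_FAULT_OOM;
		return VM_FAULT_SIGBUS;
	}

so the set_direct_map_invalid_noflush() failure above only turns into
VM_FAULT_OOM when that path actually returns -ENOMEM.)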


I haven't read through the rest but this has just caught my attention.
Is it really reasonable to trigger the OOM killer when you cannot
invalidate the direct mapping? From a quick look at the code it is quite
unlikely to see ENOMEM from that path (it allocates small pages), but this
can become quite subtle over time. Shouldn't this simply SIGBUS if it
cannot manipulate the direct mapping, regardless of the underlying reason
for that?



OTOH, it means our kernel zones are depleted, so we'd better reclaim somehow
...


Killing a userspace seems to be just a bad way around that.

Although I have to say openly that I am not a great fan of VM_FAULT_OOM
in general. It is usually the wrong way to handle the failure,
because it happens outside of the allocation context, so you lose all the
details (e.g. allocation constraints, NUMA policy etc.). Also, whenever
there is ENOMEM, the allocation itself has already made sure that
all the reclaim attempts have been exhausted. Just consider an
allocation with GFP_NOWAIT/NO_RETRY or similar failing and propagating
ENOMEM up the call stack. Turning that into the OOM killer sounds like a
bad idea to me. But that is a more general topic. I have tried to bring
this up in the past, but there was not much interest in fixing it, as
it was not a pressing problem...



I'm certainly interested; it would mean that we actually want to try 
recovering from VM_FAULT_OOM in various cases, and as you state, we 
might have to supply more information to make that work reliably.


Having that said, I guess what we have here is just the same as when our 
process fails to allocate a generic page table in __handle_mm_fault(), 
when we fail p4d_alloc() and friends ...
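
(For reference, the analogous path in __handle_mm_fault(), paraphrased
from mm/memory.c:

	p4d = p4d_alloc(mm, pgd, address);
	if (!p4d)
		return VM_FAULT_OOM;

i.e. a failed page-table allocation during a fault is already reported
as VM_FAULT_OOM today.)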


--
Thanks,

David / dhildenb


Re: [PATCH v19 6/8] PM: hibernate: disable when there are active secretmem users

2021-05-18 Thread David Hildenbrand

On 18.05.21 12:24, Mark Rutland wrote:

On Thu, May 13, 2021 at 09:47:32PM +0300, Mike Rapoport wrote:

From: Mike Rapoport 

It is unsafe to allow saving of secretmem areas to the hibernation
snapshot as they would be visible after resume, and this would essentially
defeat the purpose of secret memory mappings.

Prevent hibernation whenever there are active secret memory users.


Have we thought about how this is going to work in practice, e.g. on
mobile systems? It seems to me that there are a variety of common
applications which might want to use this which people don't expect to
inhibit hibernate (e.g. authentication agents, web browsers).

Are we happy to say that any userspace application can incidentally
inhibit hibernate?


It's worth noting that secretmem has to be explicitly enabled by the 
admin to even work.
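
(For context: in this series that switch is the module parameter shown
later in mm/secretmem.c, module_param_named(enable, secretmem_enable,
bool, 0400), so for the built-in case the admin would presumably boot
with something like "secretmem.enable=1" on the kernel command line
before memfd_secret() becomes usable.)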


--
Thanks,

David / dhildenb


Re: [PATCH v19 5/8] mm: introduce memfd_secret system call to create "secret" memory areas

2021-05-18 Thread David Hildenbrand

On 18.05.21 11:59, Michal Hocko wrote:

On Sun 16-05-21 10:29:24, Mike Rapoport wrote:

On Fri, May 14, 2021 at 11:25:43AM +0200, David Hildenbrand wrote:

[...]

+   if (!page)
+   return VM_FAULT_OOM;
+
+   err = set_direct_map_invalid_noflush(page, 1);
+   if (err) {
+   put_page(page);
+   return vmf_error(err);


Would we want to translate that to a proper VM_FAULT_..., which would most
probably be VM_FAULT_OOM when we fail to allocate a pagetable?


That's what vmf_error does, it translates -ESOMETHING to VM_FAULT_XYZ.


I haven't read through the rest but this has just caught my attention.
Is it really reasonable to trigger the OOM killer when you cannot
invalidate the direct mapping? From a quick look at the code it is quite
unlikely to see ENOMEM from that path (it allocates small pages), but this
can become quite subtle over time. Shouldn't this simply SIGBUS if it
cannot manipulate the direct mapping, regardless of the underlying reason
for that?



OTOH, it means our kernel zones are depleted, so we'd better reclaim 
somehow ...


--
Thanks,

David / dhildenb


Re: [PATCH v19 8/8] secretmem: test: add basic selftest for memfd_secret(2)

2021-05-14 Thread David Hildenbrand

On 13.05.21 20:47, Mike Rapoport wrote:

From: Mike Rapoport 

The test verifies that file descriptor created with memfd_secret does not
allow read/write operations, that secret memory mappings respect
RLIMIT_MEMLOCK and that remote accesses with process_vm_read() and
ptrace() to the secret memory fail.



[...]


@@ -0,0 +1,296 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2020


2021 ?


--
Thanks,

David / dhildenb


Re: [PATCH v19 7/8] arch, mm: wire up memfd_secret system call where relevant

2021-05-14 Thread David Hildenbrand

On 13.05.21 20:47, Mike Rapoport wrote:

From: Mike Rapoport 

Wire up memfd_secret system call on architectures that define
ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86.

Signed-off-by: Mike Rapoport 
Acked-by: Palmer Dabbelt 
Acked-by: Arnd Bergmann 
Acked-by: Catalin Marinas 
Cc: Alexander Viro 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Christopher Lameter 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Elena Reshetova 
Cc: Hagen Paul Pfeifer 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: James Bottomley 
Cc: "Kirill A. Shutemov" 
Cc: Mark Rutland 
Cc: Matthew Wilcox 
Cc: Michael Kerrisk 
Cc: Palmer Dabbelt 
Cc: Paul Walmsley 
Cc: Peter Zijlstra 
Cc: Rick Edgecombe 
Cc: Roman Gushchin 
Cc: Shakeel Butt 
Cc: Shuah Khan 
Cc: Thomas Gleixner 
Cc: Tycho Andersen 
Cc: Will Deacon 
---
  arch/arm64/include/uapi/asm/unistd.h   | 1 +
  arch/riscv/include/asm/unistd.h| 1 +
  arch/x86/entry/syscalls/syscall_32.tbl | 1 +
  arch/x86/entry/syscalls/syscall_64.tbl | 1 +
  include/linux/syscalls.h   | 1 +
  include/uapi/asm-generic/unistd.h  | 7 ++-
  scripts/checksyscalls.sh   | 4 
  7 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/uapi/asm/unistd.h 
b/arch/arm64/include/uapi/asm/unistd.h
index f83a70e07df8..ce2ee8f1e361 100644
--- a/arch/arm64/include/uapi/asm/unistd.h
+++ b/arch/arm64/include/uapi/asm/unistd.h
@@ -20,5 +20,6 @@
  #define __ARCH_WANT_SET_GET_RLIMIT
  #define __ARCH_WANT_TIME32_SYSCALLS
  #define __ARCH_WANT_SYS_CLONE3
+#define __ARCH_WANT_MEMFD_SECRET
  
  #include 

diff --git a/arch/riscv/include/asm/unistd.h b/arch/riscv/include/asm/unistd.h
index 977ee6181dab..6c316093a1e5 100644
--- a/arch/riscv/include/asm/unistd.h
+++ b/arch/riscv/include/asm/unistd.h
@@ -9,6 +9,7 @@
   */
  
  #define __ARCH_WANT_SYS_CLONE

+#define __ARCH_WANT_MEMFD_SECRET
  
  #include 
  
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl

index 28a1423ce32e..e44519020a43 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -451,3 +451,4 @@
  444   i386landlock_create_ruleset sys_landlock_create_ruleset
  445   i386landlock_add_rule   sys_landlock_add_rule
  446   i386landlock_restrict_self  sys_landlock_restrict_self
+447i386memfd_secretsys_memfd_secret
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index ecd551b08d05..a06f16106f24 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -368,6 +368,7 @@
  444   common  landlock_create_ruleset sys_landlock_create_ruleset
  445   common  landlock_add_rule   sys_landlock_add_rule
  446   common  landlock_restrict_self  sys_landlock_restrict_self
+447common  memfd_secretsys_memfd_secret
  
  #

  # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 050511e8f1f8..1a1b5d724497 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1050,6 +1050,7 @@ asmlinkage long sys_landlock_create_ruleset(const struct 
landlock_ruleset_attr _
  asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type 
rule_type,
const void __user *rule_attr, __u32 flags);
  asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
+asmlinkage long sys_memfd_secret(unsigned int flags);
  
  /*

   * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h 
b/include/uapi/asm-generic/unistd.h
index 6de5a7fc066b..28b388368cf6 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -873,8 +873,13 @@ __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
  #define __NR_landlock_restrict_self 446
  __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
  
+#ifdef __ARCH_WANT_MEMFD_SECRET

+#define __NR_memfd_secret 447
+__SYSCALL(__NR_memfd_secret, sys_memfd_secret)
+#endif
+
  #undef __NR_syscalls
-#define __NR_syscalls 447
+#define __NR_syscalls 448
  
  /*

   * 32 bit systems traditionally used different
diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh
index a18b47695f55..b7609958ee36 100755
--- a/scripts/checksyscalls.sh
+++ b/scripts/checksyscalls.sh
@@ -40,6 +40,10 @@ cat << EOF
  #define __IGNORE_setrlimit/* setrlimit */
  #endif
  
+#ifndef __ARCH_WANT_MEMFD_SECRET

+#define __IGNORE_memfd_secret
+#endif
+
  /* Missing flags argument */
  #define __IGNORE_renameat /* renameat2 */
  



Acked-by: David Hildenbrand 

--
Thanks,

David / dhildenb


Re: [PATCH v19 6/8] PM: hibernate: disable when there are active secretmem users

2021-05-14 Thread David Hildenbrand

On 13.05.21 20:47, Mike Rapoport wrote:

From: Mike Rapoport 

It is unsafe to allow saving of secretmem areas to the hibernation
snapshot as they would be visible after resume, and this would essentially
defeat the purpose of secret memory mappings.

Prevent hibernation whenever there are active secret memory users.

Signed-off-by: Mike Rapoport 
Cc: Alexander Viro 
Cc: Andy Lutomirski 
Cc: Arnd Bergmann 
Cc: Borislav Petkov 
Cc: Catalin Marinas 
Cc: Christopher Lameter 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Elena Reshetova 
Cc: Hagen Paul Pfeifer 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: James Bottomley 
Cc: "Kirill A. Shutemov" 
Cc: Mark Rutland 
Cc: Matthew Wilcox 
Cc: Michael Kerrisk 
Cc: Palmer Dabbelt 
Cc: Palmer Dabbelt 
Cc: Paul Walmsley 
Cc: Peter Zijlstra 
Cc: Rick Edgecombe 
Cc: Roman Gushchin 
Cc: Shakeel Butt 
Cc: Shuah Khan 
Cc: Thomas Gleixner 
Cc: Tycho Andersen 
Cc: Will Deacon 
---
  include/linux/secretmem.h |  6 ++
  kernel/power/hibernate.c  |  5 -
  mm/secretmem.c| 15 +++
  3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
index e617b4afcc62..21c3771e6a56 100644
--- a/include/linux/secretmem.h
+++ b/include/linux/secretmem.h
@@ -30,6 +30,7 @@ static inline bool page_is_secretmem(struct page *page)
  }
  
  bool vma_is_secretmem(struct vm_area_struct *vma);

+bool secretmem_active(void);
  
  #else
  
@@ -43,6 +44,11 @@ static inline bool page_is_secretmem(struct page *page)

return false;
  }
  
+static inline bool secretmem_active(void)

+{
+   return false;
+}
+
  #endif /* CONFIG_SECRETMEM */
  
  #endif /* _LINUX_SECRETMEM_H */

diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index da0b41914177..559acef3fddb 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -31,6 +31,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  
  #include "power.h"

@@ -81,7 +82,9 @@ void hibernate_release(void)
  
  bool hibernation_available(void)

  {
-   return nohibernate == 0 && !security_locked_down(LOCKDOWN_HIBERNATION);
+   return nohibernate == 0 &&
+   !security_locked_down(LOCKDOWN_HIBERNATION) &&
+   !secretmem_active();
  }
  
  /**

diff --git a/mm/secretmem.c b/mm/secretmem.c
index 1ae50089adf1..7c2499e4de22 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -40,6 +40,13 @@ module_param_named(enable, secretmem_enable, bool, 0400);
  MODULE_PARM_DESC(secretmem_enable,
 "Enable secretmem and memfd_secret(2) system call");
  
+static atomic_t secretmem_users;

+
+bool secretmem_active(void)
+{
+   return !!atomic_read(&secretmem_users);
+}
+
  static vm_fault_t secretmem_fault(struct vm_fault *vmf)
  {
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
@@ -94,6 +101,12 @@ static const struct vm_operations_struct secretmem_vm_ops = 
{
.fault = secretmem_fault,
  };
  
+static int secretmem_release(struct inode *inode, struct file *file)

+{
+   atomic_dec(&secretmem_users);
+   return 0;
+}
+
  static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
  {
unsigned long len = vma->vm_end - vma->vm_start;
@@ -116,6 +129,7 @@ bool vma_is_secretmem(struct vm_area_struct *vma)
  }
  
  static const struct file_operations secretmem_fops = {

+   .release= secretmem_release,
.mmap   = secretmem_mmap,
  };
  
@@ -202,6 +216,7 @@ SYSCALL_DEFINE1(memfd_secret, unsigned int, flags)

file->f_flags |= O_LARGEFILE;
  
  	fd_install(fd, file);

+   atomic_inc(&secretmem_users);
return fd;
  
  err_put_fd:




It looks a bit racy, but I guess we don't really care about these corner 
cases.


Acked-by: David Hildenbrand 

--
Thanks,

David / dhildenb


Re: [PATCH v19 5/8] mm: introduce memfd_secret system call to create "secret" memory areas

2021-05-14 Thread David Hildenbrand

  #ifdef CONFIG_IA64
  # include 
@@ -64,6 +65,9 @@ static inline int valid_mmap_phys_addr_range(unsigned long 
pfn, size_t size)
  #ifdef CONFIG_STRICT_DEVMEM
  static inline int page_is_allowed(unsigned long pfn)
  {
+   if (pfn_valid(pfn) && page_is_secretmem(pfn_to_page(pfn)))
+   return 0;
+


1. The memmap might be garbage. You should use pfn_to_online_page() instead.

page = pfn_to_online_page(pfn);
if (page && page_is_secretmem(page))
return 0;

2. What about !CONFIG_STRICT_DEVMEM?

3. Someone could map physical memory before a secretmem page gets
allocated and read the content after it got allocated and is in use. If
someone were to gain root privileges and wait for the target
application to (re)start, that could be problematic.



I do wonder if enforcing CONFIG_STRICT_DEVMEM would be cleaner. 
devmem_is_allowed() should disallow access to any system ram, and 
thereby, any possible secretmem pages, avoiding this check completely.
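
(To illustrate: with CONFIG_STRICT_DEVMEM, the pre-patch helper in
drivers/char/mem.c is essentially just

	static inline int page_is_allowed(unsigned long pfn)
	{
		return devmem_is_allowed(pfn);
	}

and devmem_is_allowed() already rejects system RAM, so the extra
secretmem check would indeed be redundant there; the open question is
only the !CONFIG_STRICT_DEVMEM case.)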



[...]

  
diff --git a/mm/secretmem.c b/mm/secretmem.c

new file mode 100644
index ..1ae50089adf1
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,239 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2021
+ *
+ * Author: Mike Rapoport 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include 
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+/*
+ * Define mode and flag masks to allow validation of the system call
+ * parameters.
+ */
+#define SECRETMEM_MODE_MASK(0x0)
+#define SECRETMEM_FLAGS_MASK   SECRETMEM_MODE_MASK
+
+static bool secretmem_enable __ro_after_init;
+module_param_named(enable, secretmem_enable, bool, 0400);
+MODULE_PARM_DESC(secretmem_enable,
+"Enable secretmem and memfd_secret(2) system call");
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+   struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+   struct inode *inode = file_inode(vmf->vma->vm_file);
+   pgoff_t offset = vmf->pgoff;
+   gfp_t gfp = vmf->gfp_mask;
+   unsigned long addr;
+   struct page *page;
+   int err;
+
+   if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+   return vmf_error(-EINVAL);
+
+retry:
+   page = find_lock_page(mapping, offset);
+   if (!page) {
+   page = alloc_page(gfp | __GFP_ZERO);


We'll end up here with gfp == GFP_HIGHUSER (via the mapping below), correct?


+   if (!page)
+   return VM_FAULT_OOM;
+
+   err = set_direct_map_invalid_noflush(page, 1);
+   if (err) {
+   put_page(page);
+   return vmf_error(err);


Would we want to translate that to a proper VM_FAULT_..., which would 
most probably be VM_FAULT_OOM when we fail to allocate a pagetable?



+   }
+
+   __SetPageUptodate(page);
+   err = add_to_page_cache_lru(page, mapping, offset, gfp);
+   if (unlikely(err)) {
+   put_page(page);
+   /*
+* If a split of large page was required, it
+* already happened when we marked the page invalid
+* which guarantees that this call won't fail
+*/
+   set_direct_map_default_noflush(page, 1);
+   if (err == -EEXIST)
+   goto retry;
+
+   return vmf_error(err);
+   }
+
+   addr = (unsigned long)page_address(page);
+   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);


Hmm, to me it feels like something like that belongs into the 
set_direct_map_invalid_*() calls? Otherwise it's just very easy to mess 
up ...
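
(A rough sketch of what folding the flush into the helper could look
like, using only the primitives already used above; the function name is
made up and this is illustrative only, not an existing API:

	static int secretmem_unmap_direct(struct page *page, int numpages)
	{
		unsigned long addr = (unsigned long)page_address(page);
		int err;

		err = set_direct_map_invalid_noflush(page, numpages);
		if (err)
			return err;

		flush_tlb_kernel_range(addr, addr + numpages * PAGE_SIZE);
		return 0;
	}

Callers would then not be able to forget the TLB flush.)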



I'm certainly not a filesystem guy. Nothing else jumped out at me.


To me, the overall approach makes sense and I consider it an improved 
mlock() mechanism for storing secrets, although I'd love to have some 
more information in the log regarding access via root, namely that there 
are still fancy ways to read secretmem memory once root via


1. warm reboot attacks especially in VMs (e.g., modifying the cmdline)
2. kexec-style reboot attacks (e.g., modifying the cmdline)
3. kdump attacks
4. kdb most probably
5. "letting the process read the memory for us" via Kees if that still
   applies
6. ... most probably something else

Just to make people aware that there are still some things to be sorted 
out when we fully want to protect against privilege escalations.


(maybe this information is buried in the cover letter already, where it 
usually gets lost)


--
Thanks,

David / dhildenb

Re: [PATCH v19 5/8] mm: introduce memfd_secret system call to create "secret" memory areas

2021-05-14 Thread David Hildenbrand

On 13.05.21 20:47, Mike Rapoport wrote:

From: Mike Rapoport 

Introduce "memfd_secret" system call with the ability to create
memory areas visible only in the context of the owning process and
not mapped not only to other processes but in the kernel page tables
as well.

The secretmem feature is off by default and the user must explicitly
enable it at the boot time.

Once secretmem is enabled, the user will be able to create a file 
descriptor using the memfd_secret() system call. The memory areas

created by mmap() calls from this file descriptor will be unmapped
from the kernel direct map and they will be only mapped in the page
table of the processes that have access to the file descriptor.

The file descriptor based memory has several advantages over the 
"traditional" mm interfaces, such as mlock(), mprotect(), madvise().

File descriptor approach allows explict and controlled sharing of the
memory


s/explict/explicit/


areas, it allows to seal the operations. Besides, file descriptor
based memory paves the way for VMMs to remove the secret memory range
from the userpace hipervisor process, for instance QEMU. Andy
Lutomirski says:


s/userpace hipervisor/userspace hypervisor/



"Getting fd-backed memory into a guest will take some possibly major
work in the kernel, but getting vma-backed memory into a guest
without mapping it in the host user address space seems much, much
worse."

memfd_secret() is made a dedicated system call rather than an
extention to


s/extention/extension/


memfd_create() because its purpose is to allow the user to create
more secure memory mappings rather than to simply allow file-based
access to the memory. Nowadays the cost of a new system call is negligible,
while it is way simpler for userspace to deal with a clear-cut system
call than with a multiplexer or an overloaded syscall. Moreover, the
initial implementation of memfd_secret() is completely distinct from
memfd_create(), so there is not much sense in overloading
memfd_create() to begin with. If there is ever a need for code
sharing between these implementations, it can easily be achieved
without a need to adjust user-visible APIs.

The secret memory remains accessible in the process context using
uaccess primitives, but it is not exposed to the kernel otherwise;
secret memory areas are removed from the direct map and functions in
the follow_page()/get_user_page() family will refuse to return a page
that belongs to the secret memory area.

Once there is a use case that requires exposing secretmem to
the kernel, it will be an opt-in request in the system call flags so
that the user would have to decide what data can be exposed to the
kernel.


Maybe spell out an example: like page migration.



Removing pages from the direct map may cause fragmentation of the direct
map on architectures that use large pages to map physical memory,
which affects system performance. However, the original Kconfig
text for CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct
map "... can improve the kernel's performance a tiny bit ..." (commit
00d1c5e05736 ("x86: add gbpages switches")) and the recent report [1]
showed that "... although 1G mappings are a good default choice,
there is no compelling evidence that it must be the only choice".
Hence, it is sufficient to have secretmem disabled by default with
the ability of a system administrator to enable it at boot time.


Maybe add a link to the Intel performance evaluation.



Pages in the secretmem regions are unevictable and unmovable to
avoid accidental exposure of the sensitive data via swap or during
page migration.

Since the secretmem mappings are locked in memory they cannot exceed 
RLIMIT_MEMLOCK. Since these mappings are already locked independently

from mlock(), an attempt to mlock()/munlock() secretmem range would
fail and mlockall()/munlockall() will ignore secretmem mappings.


Maybe add something like "similar to pages pinned by VFIO".



However, unlike mlock()ed memory, secretmem currently behaves more
like long-term GUP: secretmem mappings are unmovable mappings
directly consumed by user space. With default limits, there is no
excessive use of secretmem and it poses no real problem in
combination with ZONE_MOVABLE/CMA, but in the future this should be
addressed to allow balanced use of large amounts of secretmem along
with ZONE_MOVABLE/CMA.

A page that was a part of the secret memory area is cleared when it
is freed to ensure the data is not exposed to the next user of that
page.


You could skip that with init_on_free (and eventually also with 
init_on_alloc) set to avoid double clearing.
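
A rough sketch of that suggestion in the freeing path, assuming a
freepage-style callback; want_init_on_free() is the existing helper behind
init_on_free=1, while the function name and exact shape here are
assumptions rather than the code from this series:

	static void secretmem_freepage(struct page *page)
	{
		set_direct_map_default_noflush(page, 1);
		/* with init_on_free=1 the page allocator zeroes the page anyway */
		if (!want_init_on_free())
			clear_highpage(page);
	}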




The following example demonstrates creation of a secret mapping
(error handling is omitted):

	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

[1]
https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884e...@linux.intel.com/
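
For reference, a slightly fuller user-space sketch of that example
(hypothetical test program; MAP_SIZE, the syscall-number fallback and the
error handling are additions for illustration):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>

	#ifndef __NR_memfd_secret
	#define __NR_memfd_secret 447	/* number wired up in patch 7/8 */
	#endif

	#define MAP_SIZE (4UL * 4096)

	int main(void)
	{
		int fd = syscall(__NR_memfd_secret, 0);

		if (fd < 0) {
			perror("memfd_secret");	/* e.g. secretmem not enabled */
			return 1;
		}
		if (ftruncate(fd, MAP_SIZE))
			return 1;

		char *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);
		if (ptr == MAP_FAILED)
			return 1;

		/* faults in pages that are absent from the kernel direct map */
		memcpy(ptr, "secret", 6);

		munmap(ptr, MAP_SIZE);
		close(fd);
		return 0;
	}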


[my mail client messed up the remainder of the mail for whatever rea

Re: [PATCH v19 3/8] set_memory: allow set_direct_map_*_noflush() for multiple pages

2021-05-14 Thread David Hildenbrand

On 13.05.21 20:47, Mike Rapoport wrote:

From: Mike Rapoport 

The underlying implementations of set_direct_map_invalid_noflush() and
set_direct_map_default_noflush() allow updating multiple contiguous pages
at once.

Add numpages parameter to set_direct_map_*_noflush() to expose this
ability with these APIs.



[...]

Finally doing some in-depth review, sorry for not having a detailed look 
earlier.



  
-int set_direct_map_invalid_noflush(struct page *page)

+int set_direct_map_invalid_noflush(struct page *page, int numpages)
  {
struct page_change_data data = {
.set_mask = __pgprot(0),
.clear_mask = __pgprot(PTE_VALID),
};
+   unsigned long size = PAGE_SIZE * numpages;
  


Nit: I'd have made this const and added an early exit for !numpages. But 
whatever you prefer.
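
A minimal sketch of that nit, reusing the identifiers from the hunk above
(illustrative only):

	const unsigned long size = PAGE_SIZE * numpages;

	if (!numpages)
		return 0;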



if (!debug_pagealloc_enabled() && !rodata_full)
return 0;
  
  	return apply_to_page_range(&init_mm,

   (unsigned long)page_address(page),
-  PAGE_SIZE, change_page_range, &data);
+  size, change_page_range, &data);
  }
  
-int set_direct_map_default_noflush(struct page *page)

+int set_direct_map_default_noflush(struct page *page, int numpages)
  {
struct page_change_data data = {
.set_mask = __pgprot(PTE_VALID | PTE_WRITE),
.clear_mask = __pgprot(PTE_RDONLY),
};
+   unsigned long size = PAGE_SIZE * numpages;
  


Nit: dito


if (!debug_pagealloc_enabled() && !rodata_full)
return 0;
  
  	return apply_to_page_range(&init_mm,

   (unsigned long)page_address(page),
-  PAGE_SIZE, change_page_range, &data);
+  size, change_page_range, &data);
  }
  



[...]


  extern int kernel_set_to_readonly;
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 156cd235659f..15a55d6e9cec 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2192,14 +2192,14 @@ static int __set_pages_np(struct page *page, int 
numpages)
return __change_page_attr_set_clr(&cpa, 0);
  }
  
-int set_direct_map_invalid_noflush(struct page *page)

+int set_direct_map_invalid_noflush(struct page *page, int numpages)
  {
-   return __set_pages_np(page, 1);
+   return __set_pages_np(page, numpages);
  }
  
-int set_direct_map_default_noflush(struct page *page)

+int set_direct_map_default_noflush(struct page *page, int numpages)
  {
-   return __set_pages_p(page, 1);
+   return __set_pages_p(page, numpages);
  }
  


So, what happens if we succeeded setting 
set_direct_map_invalid_noflush() for some pages but fail when having to 
split a large mapping?


Did I miss something or would the current code not undo what it 
partially did? Or do we simply not care?


I guess to handle this cleanly we would either have to catch all error 
cases first (esp. splitting large mappings) before actually performing 
the set to invalid, or have some recovery code in place if possible.
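
One way such recovery code could look, purely as an illustration (the
helper name is made up, and it assumes invalidation is done one page at a
time so a partial failure can be undone):

	static int set_direct_map_invalid_checked(struct page *page, int numpages)
	{
		int i, err;

		for (i = 0; i < numpages; i++) {
			err = set_direct_map_invalid_noflush(nth_page(page, i), 1);
			if (err)
				goto revert;
		}
		return 0;

	revert:
		while (i--)
			set_direct_map_default_noflush(nth_page(page, i), 1);
		return err;
	}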



AFAIKs, your patch #5 right now only calls it with 1 page, do we need 
this change at all? Feels like a leftover from older versions to me 
where we could have had more than a single page.


--
Thanks,

David / dhildenb


Re: [PATCH v19 2/8] riscv/Kconfig: make direct map manipulation options depend on MMU

2021-05-14 Thread David Hildenbrand

On 13.05.21 20:47, Mike Rapoport wrote:

From: Mike Rapoport 

ARCH_HAS_SET_DIRECT_MAP and ARCH_HAS_SET_MEMORY configuration options have
no meaning when CONFIG_MMU is disabled and there is no point to enable
them for the nommu case.

Add an explicit dependency on MMU for these options.

Signed-off-by: Mike Rapoport 
Reported-by: kernel test robot 
---
  arch/riscv/Kconfig | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index a8ad8eb76120..c426e7d20907 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -26,8 +26,8 @@ config RISCV
select ARCH_HAS_KCOV
select ARCH_HAS_MMIOWB
select ARCH_HAS_PTE_SPECIAL
-   select ARCH_HAS_SET_DIRECT_MAP
-   select ARCH_HAS_SET_MEMORY
+   select ARCH_HAS_SET_DIRECT_MAP if MMU
+   select ARCH_HAS_SET_MEMORY if MMU
select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
select ARCH_HAS_STRICT_MODULE_RWX if MMU && !XIP_KERNEL
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST



Reviewed-by: David Hildenbrand 

--
Thanks,

David / dhildenb


Re: [PATCH v19 1/8] mmap: make mlock_future_check() global

2021-05-14 Thread David Hildenbrand

On 13.05.21 20:47, Mike Rapoport wrote:

From: Mike Rapoport 

It will be used by the upcoming secret memory implementation.

Signed-off-by: Mike Rapoport 
Cc: Alexander Viro 
Cc: Andy Lutomirski 
Cc: Arnd Bergmann 
Cc: Borislav Petkov 
Cc: Catalin Marinas 
Cc: Christopher Lameter 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Elena Reshetova 
Cc: Hagen Paul Pfeifer 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: James Bottomley 
Cc: "Kirill A. Shutemov" 
Cc: Mark Rutland 
Cc: Matthew Wilcox 
Cc: Michael Kerrisk 
Cc: Palmer Dabbelt 
Cc: Palmer Dabbelt 
Cc: Paul Walmsley 
Cc: Peter Zijlstra 
Cc: Rick Edgecombe 
Cc: Roman Gushchin 
Cc: Shakeel Butt 
Cc: Shuah Khan 
Cc: Thomas Gleixner 
Cc: Tycho Andersen 
Cc: Will Deacon 
---
  mm/internal.h | 3 +++
  mm/mmap.c | 5 ++---
  2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 54bd0dc2c23c..46eb82eaa195 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -373,6 +373,9 @@ static inline void munlock_vma_pages_all(struct 
vm_area_struct *vma)
  extern void mlock_vma_page(struct page *page);
  extern unsigned int munlock_vma_page(struct page *page);
  
+extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,

+ unsigned long len);
+
  /*
   * Clear the page's PageMlocked().  This can be useful in a situation where
   * we want to unconditionally remove a page from the pagecache -- e.g.,
diff --git a/mm/mmap.c b/mm/mmap.c
index 0584e540246e..81f5595a8490 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1352,9 +1352,8 @@ static inline unsigned long round_hint_to_min(unsigned 
long hint)
return hint;
  }
  
-static inline int mlock_future_check(struct mm_struct *mm,

-unsigned long flags,
-unsigned long len)
+int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+  unsigned long len)
  {
unsigned long locked, lock_limit;
  



Reviewed-by: David Hildenbrand 

--
Thanks,

David / dhildenb


Re: [PATCH v18 0/9] mm: introduce memfd_secret system call to create "secret" memory areas

2021-05-07 Thread David Hildenbrand

On 07.05.21 01:16, Nick Kossifidis wrote:

On 2021-05-06 20:05, James Bottomley wrote:

On Thu, 2021-05-06 at 18:45 +0200, David Hildenbrand wrote:


Also, there is a way to still read that memory when root by

1. Having kdump active (which would often be the case, but maybe not
to dump user pages )
2. Triggering a kernel crash (easy via proc as root)
3. Waiting for the reboot after kdump() created the dump and then
reading the content from disk.


Anything that can leave physical memory intact but boot to a kernel
where the missing direct map entry is restored could theoretically
extract the secret.  However, it's not exactly going to be a stealthy
extraction ...


Or, as an attacker, load a custom kexec() kernel and read memory
from the new environment. Of course, the latter two are advanced
mechanisms, but they are possible when root. We might be able to
mitigate, for example, by zeroing out secretmem pages before booting
into the kexec kernel, if we care :)


I think we could handle it by marking the region, yes, and a zero on
shutdown might be useful ... it would prevent all warm reboot type
attacks.



I had similar concerns about recovering secrets with kdump, and
considered cleaning up keyrings before jumping to the new kernel. The
problem is we can't provide guarantees in that case, once the kernel has
crashed and we are on our way to run crashkernel, we can't be sure we
can reliably zero-out anything, the more code we add to that path the


Well, I think it depends. Assume we do the following

1) Zero out any secretmem pages when handing them back to the buddy. 
(alternative: init_on_free=1) -- if not already done, I didn't check the 
code.


2) On kdump(), zero out all allocated secretmem. It'd be easier if we'd 
just allocated from a fixed physical memory area; otherwise we have to 
walk process page tables or use a PFN walker. And zeroing out secretmem 
pages without a direct mapping is a different challenge.
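
As a purely hypothetical sketch of point 2), wiping one secretmem page
during kdump preparation could look roughly like this (the helper name is
made up; it relies on the existing set_direct_map_*() primitives from this
series to make the page temporarily addressable again):

	static void secretmem_wipe_page(struct page *page)
	{
		set_direct_map_default_noflush(page, 1);
		clear_highpage(page);
		set_direct_map_invalid_noflush(page, 1);
	}

The hard part, as noted above, is finding all such pages reliably once
the kernel has crashed.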


Now, during 2) it can happen that

a) We crash in our clearing code (e.g., something is seriously messed 
up) and fail to start the kdump kernel. That's actually good, instead of 
leaking data we fail hard.


b) We don't find all secretmem pages, for example, because process page 
tables are messed up or something messed up our memmap (if we'd use that 
to identify secretmem pages via a PFN walker somehow)



But for the simple cases (e.g., malicious root tries to crash the kernel 
via /proc/sysrq-trigger) both a) and b) wouldn't apply.


Obviously, if an admin would want to mitigate right now, he would want 
to disable kdump completely, meaning any attempt to load a crashkernel 
would fail and cannot be enabled again for that kernel (also not via 
cmdline an attacker could modify to reboot into a system with the option 
for a crashkernel). Disabling kdump in the kernel when secretmem pages 
are allocated is one approach, although sub-optimal.



more risky it gets. However during reboot/normal kexec() we should do
some cleanup, it makes sense and secretmem can indeed be useful in that
case. Regarding loading custom kexec() kernels, we mitigate this with
the kexec file-based API where we can verify the signature of the loaded
kimage (assuming the system runs a kernel provided by a trusted 3rd
party and we 've maintained a chain of trust since booting).


For example in VMs (like QEMU), we often don't clear physical memory 
during a reboot. So if an attacker manages to load a kernel that you can 
trick into reading random physical memory areas, we can leak secretmem 
data I think.


And there might be ways to achieve that just using the cmdline, not 
necessarily loading a different kernel. For example if you limit the 
kernel footprint ("mem=256M") and disable strict_iomem_checks 
("strict_iomem_checks=relaxed") you can just extract that memory via 
/dev/mem if I am not wrong.


So as an attacker, modify the (grub) cmdline to "mem=256M 
strict_iomem_checks=relaxed", reboot, and read all memory via /dev/mem. 
Or load a signed kexec kernel with that cmdline and boot into it.


Interesting problem :)

--
Thanks,

David / dhildenb


Re: [PATCH v18 0/9] mm: introduce memfd_secret system call to create "secret" memory areas

2021-05-06 Thread David Hildenbrand

Is this intended to protect keys/etc after the attacker has
gained the ability to run arbitrary kernel-mode code?  If so,
that seems optimistic, doesn't it?


Not exactly: there are many types of kernel attack, but mostly the
attacker either manages to effect a privilege escalation to root or
gets the ability to run a ROP gadget.  The object of this code is
to be completely secure against root trying to extract the secret
(somewhat similar to the lockdown idea), thus defeating privilege
escalation and to provide "sufficient" protection against ROP
gadgets.


What stops "root" from mapping /dev/mem and reading that memory?


/dev/mem uses the direct map for the copy at least for read/write, so
it gets a fault in the same way root trying to use ptrace does.  I
think we've protected mmap, but Mike would know that better than I.



I'm more concerned about the mmap case -> remap_pfn_range(). Anybody 
going via the VMA shouldn't see the struct page, at least when 
vm_normal_page() is properly used; so you cannot detect secretmem

memory mapped via /dev/mem reliably. At least that's my theory :)

[...]


Also, there is a way to still read that memory when root by

1. Having kdump active (which would often be the case, but maybe not
to dump user pages )
2. Triggering a kernel crash (easy via proc as root)
3. Waiting for the reboot after kdump() created the dump and then
reading the content from disk.


Anything that can leave physical memory intact but boot to a kernel
where the missing direct map entry is restored could theoretically
extract the secret.  However, it's not exactly going to be a stealthy
extraction ...


Or, as an attacker, load a custom kexec() kernel and read memory
from the new environment. Of course, the latter two are advanced
mechanisms, but they are possible when root. We might be able to
mitigate, for example, by zeroing out secretmem pages before booting
into the kexec kernel, if we care :)


I think we could handle it by marking the region, yes, and a zero on
shutdown might be useful ... it would prevent all warm reboot type
attacks.


Right. But I guess when you're actually root, you can just write a 
kernel module to extract the information you need (unless we have signed 
modules, so it could be harder/impossible).


--
Thanks,

David / dhildenb


Re: [PATCH v18 0/9] mm: introduce memfd_secret system call to create "secret" memory areas

2021-05-06 Thread David Hildenbrand

On 06.05.21 17:26, James Bottomley wrote:

On Wed, 2021-05-05 at 12:08 -0700, Andrew Morton wrote:

On Wed,  3 Mar 2021 18:22:00 +0200 Mike Rapoport 
wrote:


This is an implementation of "secret" mappings backed by a file
descriptor.

The file descriptor backing secret memory mappings is created using
a dedicated memfd_secret system call The desired protection mode
for the memory is configured using flags parameter of the system
call. The mmap() of the file descriptor created with memfd_secret()
will create a "secret" memory mapping. The pages in that mapping
will be marked as not present in the direct map and will be present
only in the page table of the owning mm.

Although normally Linux userspace mappings are protected from other
users, such secret mappings are useful for environments where a
hostile tenant is trying to trick the kernel into giving them
access to other tenants mappings.


I continue to struggle with this and I don't recall seeing much
enthusiasm from others.  Perhaps we're all missing the value point
and some additional selling is needed.

Am I correct in understanding that the overall direction here is to
protect keys (and perhaps other things) from kernel bugs?  That if
the kernel was bug-free then there would be no need for this
feature?  If so, that's a bit sad.  But realistic I guess.


Secret memory really serves several purposes. The "increase the level
of difficulty of secret exfiltration" you describe.  And, as you say,
if the kernel were bug free this wouldn't be necessary.

But also:

1. Memory safety for user space code.  Once the secret memory is
   allocated, the user can't accidentally pass it into the kernel to be
   transmitted somewhere.


That's an interesting point I didn't realize so far.


2. It also serves as a basis for context protection of virtual
   machines, but other groups are working on this aspect, and it is
   broadly similar to the secret exfiltration from the kernel problem.



I was wondering if this also helps against CPU microcode issues like 
spectre and friends.




Is this intended to protect keys/etc after the attacker has gained
the ability to run arbitrary kernel-mode code?  If so, that seems
optimistic, doesn't it?


Not exactly: there are many types of kernel attack, but mostly the
attacker either manages to effect a privilege escalation to root or
gets the ability to run a ROP gadget.  The object of this code is to be
completely secure against root trying to extract the secret (somewhat
similar to the lockdown idea), thus defeating privilege escalation and
to provide "sufficient" protection against ROP gadgets.


What stops "root" from mapping /dev/mem and reading that memory?

IOW, would we want to enforce "CONFIG_STRICT_DEVMEM" with CONFIG_SECRETMEM?


Also, there is a way to still read that memory when root by

1. Having kdump active (which would often be the case, but maybe not to 
dump user pages )

2. Triggering a kernel crash (easy via proc as root)
3. Waiting for the reboot after kdump() created the dump and then reading
the content from disk.


Or, as an attacker, load a custom kexec() kernel and read memory from 
the new environment. Of course, the latter two are advanced mechanisms, 
but they are possible when root. We might be able to mitigate, for 
example, by zeroing out secretmem pages before booting into the kexec 
kernel, if we care :)


--
Thanks,

David / dhildenb


Re: [PATCH v3 1/2] secretmem/gup: don't check if page is secretmem without reference

2021-04-20 Thread David Hildenbrand

On 20.04.21 17:00, Mike Rapoport wrote:

From: Mike Rapoport 

The check in gup_pte_range() whether a page belongs to a secretmem mapping
is performed before grabbing the page reference.

To avoid potential race move the check after try_grab_compound_head().

Signed-off-by: Mike Rapoport 
---
  mm/gup.c | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index c3a17b189064..6515f82b0f32 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2080,13 +2080,15 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);
  
-		if (page_is_secretmem(page))

-   goto pte_unmap;
-
head = try_grab_compound_head(page, 1, flags);
if (!head)
goto pte_unmap;
  
+		if (unlikely(page_is_secretmem(page))) {

+   put_compound_head(head, 1, flags);
+   goto pte_unmap;
+   }
+
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
put_compound_head(head, 1, flags);
goto pte_unmap;



Reviewed-by: David Hildenbrand 

--
Thanks,

David / dhildenb


Re: [PATCH v2 1/2] secretmem/gup: don't check if page is secretmem without reference

2021-04-20 Thread David Hildenbrand

On 20.04.21 15:16, Mike Rapoport wrote:

From: Mike Rapoport 

The check in gup_pte_range() whether a page belongs to a secretmem mapping
is performed before grabbing the page reference.

To avoid potential race move the check after try_grab_compound_head().

Signed-off-by: Mike Rapoport 
---
  mm/gup.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index c3a17b189064..4b58c016e949 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2080,13 +2080,13 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);
  
-		if (page_is_secretmem(page))

-   goto pte_unmap;
-
head = try_grab_compound_head(page, 1, flags);
if (!head)
goto pte_unmap;
  
+		if (page_is_secretmem(page))

+   goto pte_unmap;
+


Looking at the hunk below, I wonder if you're missing a put_compound_head().

(also, I'd do if unlikely(page_is_secretmem()) but that's a different 
discussion)



if (unlikely(pte_val(pte) != pte_val(*ptep))) {
put_compound_head(head, 1, flags);
goto pte_unmap;




--
Thanks,

David / dhildenb


Re: [PATCH] secretmem: optimize page_is_secretmem()

2021-04-19 Thread David Hildenbrand

On 19.04.21 12:14, Mike Rapoport wrote:

On Mon, Apr 19, 2021 at 11:40:56AM +0200, David Hildenbrand wrote:

On 19.04.21 11:38, David Hildenbrand wrote:

On 19.04.21 11:36, Mike Rapoport wrote:

On Mon, Apr 19, 2021 at 11:15:02AM +0200, David Hildenbrand wrote:

On 19.04.21 10:42, Mike Rapoport wrote:

From: Mike Rapoport 

Kernel test robot reported -4.2% regression of will-it-scale.per_thread_ops
due to commit "mm: introduce memfd_secret system call to create "secret"
memory areas".

The perf profile of the test indicated that the regression is caused by
page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range):

 27.76  +2.5  30.23   perf-profile.children.cycles-pp.gup_pgd_range
  0.00  +3.2   3.19 ± 2%  perf-profile.children.cycles-pp.page_mapping
  0.00  +3.7   3.66 ± 2%  perf-profile.children.cycles-pp.page_is_secretmem

Further analysis showed that the slowdown happens because neither
page_is_secretmem() nor page_mapping() is inline and, moreover,
multiple page flag checks in page_mapping() involve calling
compound_head() several times for the same page.

Make page_is_secretmem() inline and replace page_mapping() with page flag
checks that do not imply page-to-head conversion.

Reported-by: kernel test robot 
Signed-off-by: Mike Rapoport 
---

@Andrew,
The patch is vs v5.12-rc7-mmots-2021-04-15-16-28, I'd appreciate if it would
be added as a fixup to the memfd_secret series.

 include/linux/secretmem.h | 26 +-
 mm/secretmem.c| 12 +---
 2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
index 907a6734059c..b842b38cbeb1 100644
--- a/include/linux/secretmem.h
+++ b/include/linux/secretmem.h
@@ -4,8 +4,32 @@
 #ifdef CONFIG_SECRETMEM
+extern const struct address_space_operations secretmem_aops;
+
+static inline bool page_is_secretmem(struct page *page)
+{
+   struct address_space *mapping;
+
+   /*
+* Using page_mapping() is quite slow because of the actual call
+* instruction and repeated compound_head(page) inside the
+* page_mapping() function.
+* We know that secretmem pages are not compound and LRU so we can
+* save a couple of cycles here.
+*/
+   if (PageCompound(page) || !PageLRU(page))
+   return false;


I'd assume secretmem pages are rare in basically every setup out there. So
maybe throwing in a couple of likely()/unlikely() might make sense.


I'd say we could do unlikely(page_is_secretmem()) at call sites. Here I can
hardly estimate which pages are going to be checked.

+
+   mapping = (struct address_space *)
+   ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
+


Not sure if open-coding page_mapping is really a good idea here -- or even
necessary after the fast path above is in place. Anyhow, just my 2 cents.


Well, most of the -4.2% performance regression kbuild reported was
due to repeated compound_head(page) in page_mapping(). So the whole point
of this patch is to avoid calling page_mapping().


I would have thought the fast path "(PageCompound(page) ||
!PageLRU(page))" would already avoid calling page_mapping() in many cases.


(and I do wonder if a generic page_mapping() optimization would make sense
instead)


Not sure. Replacing page_mapping() with page_file_mapping() at the
call sites at fs/ and mm/ increased the defconfig image by nearly 2k
and page_file_mapping() is way simpler than page_mapping()

add/remove: 1/0 grow/shrink: 35/0 up/down: 1960/0 (1960)
Function                         old     new   delta
shrink_page_list                3414    3670    +256
__set_page_dirty_nobuffers       242     349    +107
check_move_unevictable_pages     904     987     +83
move_to_new_page                 591     671     +80
shrink_active_list               912     970     +58
move_pages_to_lru                911     965     +54
migrate_pages                   2500    2554     +54
shmem_swapin_page               1145    1197     +52
shmem_undo_range                1669    1719     +50
__test_set_page_writeback        620     670     +50
__set_page_dirty_buffers         187     237     +50
__pagevec_lru_add                757     807     +50
__munlock_pagevec               1155    1205     +50
__dump_page                     1101    1151     +50
__cancel_dirty_page              182     232     +50
__remove_mapping                 461     510     +49
rmap_walk_file                   402     449     +47
isolate_movable_page             240     287     +47
test_clear_page_writeback        668     714     +46
page_cache_pipe_buf_try

Re: [PATCH] secretmem: optimize page_is_secretmem()

2021-04-19 Thread David Hildenbrand

On 19.04.21 11:38, David Hildenbrand wrote:

On 19.04.21 11:36, Mike Rapoport wrote:

On Mon, Apr 19, 2021 at 11:15:02AM +0200, David Hildenbrand wrote:

On 19.04.21 10:42, Mike Rapoport wrote:

From: Mike Rapoport 

Kernel test robot reported -4.2% regression of will-it-scale.per_thread_ops
due to commit "mm: introduce memfd_secret system call to create "secret"
memory areas".

The perf profile of the test indicated that the regression is caused by
page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range):

27.76  +2.5  30.23   perf-profile.children.cycles-pp.gup_pgd_range
 0.00  +3.2   3.19 ± 2%  perf-profile.children.cycles-pp.page_mapping
 0.00  +3.7   3.66 ± 2%  perf-profile.children.cycles-pp.page_is_secretmem

Further analysis showed that the slowdown happens because neither
page_is_secretmem() nor page_mapping() is inline and, moreover,
multiple page flag checks in page_mapping() involve calling
compound_head() several times for the same page.

Make page_is_secretmem() inline and replace page_mapping() with page flag
checks that do not imply page-to-head conversion.

Reported-by: kernel test robot 
Signed-off-by: Mike Rapoport 
---

@Andrew,
The patch is vs v5.12-rc7-mmots-2021-04-15-16-28, I'd appreciate if it would
be added as a fixup to the memfd_secret series.

include/linux/secretmem.h | 26 +-
mm/secretmem.c| 12 +---
2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
index 907a6734059c..b842b38cbeb1 100644
--- a/include/linux/secretmem.h
+++ b/include/linux/secretmem.h
@@ -4,8 +4,32 @@
#ifdef CONFIG_SECRETMEM
+extern const struct address_space_operations secretmem_aops;
+
+static inline bool page_is_secretmem(struct page *page)
+{
+   struct address_space *mapping;
+
+   /*
+* Using page_mapping() is quite slow because of the actual call
+* instruction and repeated compound_head(page) inside the
+* page_mapping() function.
+* We know that secretmem pages are not compound and LRU so we can
+* save a couple of cycles here.
+*/
+   if (PageCompound(page) || !PageLRU(page))
+   return false;


I'd assume secretmem pages are rare in basically every setup out there. So
maybe throwing in a couple of likely()/unlikely() might make sense.


I'd say we could do unlikely(page_is_secretmem()) at call sites. Here I can
hardly estimate which pages are going to be checked.
   

+
+   mapping = (struct address_space *)
+   ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
+


Not sure if open-coding page_mapping is really a good idea here -- or even
necessary after the fast path above is in place. Anyhow, just my 2 cents.


Well, most of the -4.2% performance regression kbuild reported was
due to repeated compound_head(page) in page_mapping(). So the whole point
of this patch is to avoid calling page_mapping().


I would have thought the fast path "(PageCompound(page) ||
!PageLRU(page))" would already avoid calling page_mapping() in many cases.


(and I do wonder if a generic page_mapping() optimization would make 
sense instead)


Willy can most probably give the best advise here :)

--
Thanks,

David / dhildenb


Re: [PATCH] secretmem: optimize page_is_secretmem()

2021-04-19 Thread David Hildenbrand

On 19.04.21 11:36, Mike Rapoport wrote:

On Mon, Apr 19, 2021 at 11:15:02AM +0200, David Hildenbrand wrote:

On 19.04.21 10:42, Mike Rapoport wrote:

From: Mike Rapoport 

Kernel test robot reported -4.2% regression of will-it-scale.per_thread_ops
due to commit "mm: introduce memfd_secret system call to create "secret"
memory areas".

The perf profile of the test indicated that the regression is caused by
page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range):

   27.76  +2.5  30.23   perf-profile.children.cycles-pp.gup_pgd_range
0.00  +3.2   3.19 ± 2%  perf-profile.children.cycles-pp.page_mapping
0.00  +3.7   3.66 ± 2%  perf-profile.children.cycles-pp.page_is_secretmem

Further analysis showed that the slowdown happens because neither
page_is_secretmem() nor page_mapping() is inline and, moreover,
multiple page flag checks in page_mapping() involve calling
compound_head() several times for the same page.

Make page_is_secretmem() inline and replace page_mapping() with page flag
checks that do not imply page-to-head conversion.

Reported-by: kernel test robot 
Signed-off-by: Mike Rapoport 
---

@Andrew,
The patch is vs v5.12-rc7-mmots-2021-04-15-16-28, I'd appreciate if it would
be added as a fixup to the memfd_secret series.

   include/linux/secretmem.h | 26 +-
   mm/secretmem.c| 12 +---
   2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
index 907a6734059c..b842b38cbeb1 100644
--- a/include/linux/secretmem.h
+++ b/include/linux/secretmem.h
@@ -4,8 +4,32 @@
   #ifdef CONFIG_SECRETMEM
+extern const struct address_space_operations secretmem_aops;
+
+static inline bool page_is_secretmem(struct page *page)
+{
+   struct address_space *mapping;
+
+   /*
+* Using page_mapping() is quite slow because of the actual call
+* instruction and repeated compound_head(page) inside the
+* page_mapping() function.
+* We know that secretmem pages are not compound and LRU so we can
+* save a couple of cycles here.
+*/
+   if (PageCompound(page) || !PageLRU(page))
+   return false;


I'd assume secretmem pages are rare in basically every setup out there. So
maybe throwing in a couple of likely()/unlikely() might make sense.


I'd say we could do unlikely(page_is_secretmem()) at call sites. Here I can
hardly estimate which pages are going to be checked.
  

+
+   mapping = (struct address_space *)
+   ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
+


Not sure if open-coding page_mapping is really a good idea here -- or even
necessary after the fast path above is in place. Anyhow, just my 2 cents.


Well, most of the -4.2% performance regression kbuild reported was
due to repeated compound_head(page) in page_mapping(). So the whole point
of this patch is to avoid calling page_mapping().


I would have thought the fast path "(PageCompound(page) || 
!PageLRU(page))" would already avoid calling page_mapping() in many cases.



--
Thanks,

David / dhildenb


Re: [PATCH] secretmem: optimize page_is_secretmem()

2021-04-19 Thread David Hildenbrand

On 19.04.21 10:42, Mike Rapoport wrote:

From: Mike Rapoport 

Kernel test robot reported -4.2% regression of will-it-scale.per_thread_ops
due to commit "mm: introduce memfd_secret system call to create "secret"
memory areas".

The perf profile of the test indicated that the regression is caused by
page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range):

  27.76  +2.5  30.23   perf-profile.children.cycles-pp.gup_pgd_range
   0.00  +3.2   3.19 ± 2%  perf-profile.children.cycles-pp.page_mapping
   0.00  +3.7   3.66 ± 2%  perf-profile.children.cycles-pp.page_is_secretmem

Further analysis showed that the slowdown happens because neither
page_is_secretmem() nor page_mapping() is inline and, moreover,
multiple page flag checks in page_mapping() involve calling
compound_head() several times for the same page.

Make page_is_secretmem() inline and replace page_mapping() with page flag
checks that do not imply page-to-head conversion.

Reported-by: kernel test robot 
Signed-off-by: Mike Rapoport 
---

@Andrew,
The patch is vs v5.12-rc7-mmots-2021-04-15-16-28, I'd appreciate if it would
be added as a fixup to the memfd_secret series.

  include/linux/secretmem.h | 26 +-
  mm/secretmem.c| 12 +---
  2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
index 907a6734059c..b842b38cbeb1 100644
--- a/include/linux/secretmem.h
+++ b/include/linux/secretmem.h
@@ -4,8 +4,32 @@
  
  #ifdef CONFIG_SECRETMEM
  
+extern const struct address_space_operations secretmem_aops;

+
+static inline bool page_is_secretmem(struct page *page)
+{
+   struct address_space *mapping;
+
+   /*
+* Using page_mapping() is quite slow because of the actual call
+* instruction and repeated compound_head(page) inside the
+* page_mapping() function.
+* We know that secretmem pages are not compound and LRU so we can
+* save a couple of cycles here.
+*/
+   if (PageCompound(page) || !PageLRU(page))
+   return false;


I'd assume secretmem pages are rare in basically every setup out there. 
So maybe throwing in a couple of likely()/unlikely() might make sense.



+
+   mapping = (struct address_space *)
+   ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
+


Not sure if open-coding page_mapping is really a good idea here -- or 
even necessary after the fast path above is in place. Anyhow, just my 2 
cents.


The idea of the patch makes sense to me.

--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] memfd_secret: use unsigned int rather than long as syscall flags type

2021-03-31 Thread David Hildenbrand

On 31.03.21 16:23, Mike Rapoport wrote:

From: Mike Rapoport 

Yuri Norov says:

   If parameter size is the same for native and compat ABIs, we may
   wire a syscall made by a compat client to the native handler. This is
   true for unsigned int, but not true for unsigned long or pointer.

   That's why I suggest using unsigned int and so avoid creating a compat
   entry point.

Use unsigned int as the type of the flags parameter in memfd_secret()
system call.

Signed-off-by: Mike Rapoport 
---

@Andrew,
The patch is vs v5.12-rc5-mmots-2021-03-30-23, I'd appreciate if it would
be added as a fixup to the memfd_secret series.

  include/linux/syscalls.h  | 2 +-
  mm/secretmem.c| 2 +-
  tools/testing/selftests/vm/memfd_secret.c | 2 +-
  3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 49c93c906893..1a1b5d724497 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1050,7 +1050,7 @@ asmlinkage long sys_landlock_create_ruleset(const struct 
landlock_ruleset_attr _
  asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type 
rule_type,
const void __user *rule_attr, __u32 flags);
  asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
-asmlinkage long sys_memfd_secret(unsigned long flags);
+asmlinkage long sys_memfd_secret(unsigned int flags);
  
  /*

   * Architecture-specific system calls
diff --git a/mm/secretmem.c b/mm/secretmem.c
index f2ae3f32a193..3b1ba3991964 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -199,7 +199,7 @@ static struct file *secretmem_file_create(unsigned long 
flags)
return file;
  }
  
-SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)

+SYSCALL_DEFINE1(memfd_secret, unsigned int, flags)
  {
struct file *file;
int fd, err;
diff --git a/tools/testing/selftests/vm/memfd_secret.c 
b/tools/testing/selftests/vm/memfd_secret.c
index c878c2b841fc..2462f52e9c96 100644
--- a/tools/testing/selftests/vm/memfd_secret.c
+++ b/tools/testing/selftests/vm/memfd_secret.c
@@ -38,7 +38,7 @@ static unsigned long page_size;
  static unsigned long mlock_limit_cur;
  static unsigned long mlock_limit_max;
  
-static int memfd_secret(unsigned long flags)

+static int memfd_secret(unsigned int flags)
  {
return syscall(__NR_memfd_secret, flags);
  }
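
Independent of the patch itself, a minimal userspace usage sketch may help
illustrate the call path being exercised here; it assumes the toolchain
headers provide __NR_memfd_secret and that secretmem is enabled on the
running kernel:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
        unsigned int flags = 0; /* unsigned int, matching the fixed ABI */
        int fd = syscall(__NR_memfd_secret, flags);
        char *p;

        if (fd < 0)
                return 1;
        if (ftruncate(fd, 4096))        /* size must be set before mmap() */
                return 1;
        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;
        strcpy(p, "secret");            /* only accessible via this mapping */
        munmap(p, 4096);
        close(fd);
        return 0;
}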



LGTM

--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove

2021-03-25 Thread David Hildenbrand

On 18.03.21 05:08, Dan Williams wrote:

Summary:

A dax_dev can be unbound from its driver at any time. Unbind can not
fail. The driver-core will always trigger ->remove() and the result from
->remove() is ignored. After ->remove() the driver-core proceeds to tear
down context. The filesystem-dax implementation can leave pfns mapped
after ->remove() if it is triggered while the filesystem is mounted.
Security and data-integrity is forfeit if the dax_dev is repurposed for
another security domain (new filesystem or change device modes), or if
the dax_dev is physically replaced. CXL is a hotplug bus that makes
dax_dev physical replace a real world prospect.

All dax_dev pfns must be unmapped at remove. Detect the "remove while
mounted" case and trigger memory_failure() over the entire dax_dev
range.

Details:

The get_user_pages_fast() path expects all synchronization to be handled
by the pattern of checking for pte presence, taking a page reference,
and then validating that the pte was stable over that event. The
gup-fast path for devmap / DAX pages additionally attempts to take/hold
a live reference against the hosting pgmap over the page pin. The
rationale for the pgmap reference is to synchronize against a dax-device
unbind / ->remove() event, but that is unnecessary if pte invalidation
is guaranteed in the ->remove() path.
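
A heavily simplified sketch of that lockless pattern (the real code is
mm/gup.c:gup_pte_range() and carries many more checks; the function below is
invented just to show the snapshot / reference / re-check sequence):

#include <linux/mm.h>
#include <linux/pgtable.h>

static int demo_gup_fast_one(pte_t *ptep, struct page **pagep)
{
        pte_t pte = ptep_get(ptep);             /* 1. snapshot the pte */
        struct page *page;

        if (!pte_present(pte))
                return 0;
        page = pte_page(pte);
        if (!try_get_page(page))                /* 2. take a page reference */
                return 0;
        if (pte_val(ptep_get(ptep)) != pte_val(pte)) {
                put_page(page);                 /* 3. pte changed: back off */
                return 0;
        }
        *pagep = page;
        return 1;
}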

Global dax-device pte invalidation *does* happen when the device is in
raw "device-dax" mode where there is a single shared inode to unmap at
remove, but the filesystem-dax path has a large number of actively
mapped inodes unknown to the driver at ->remove() time. So, that unmap
does not happen today for filesystem-dax. However, as Jason points out,
that unmap / invalidation *needs* to happen not only to cleanup
get_user_pages_fast() semantics, but in a future (see CXL) where dax_dev
->remove() is correlated with actual physical removal / replacement the
implications of allowing a physical pfn to be exchanged without tearing
down old mappings are severe (security and data-integrity).

What is not in this patch set is coordination with the dax_kmem driver
to trigger memory_failure() when the dax_dev is onlined as "System
RAM". The remove_memory() API was built with the assumption that
platform firmware negotiates all removal requests and the OS has a
chance to say "no". This is why dax_kmem today simply leaks
request_region() to burn that physical address space for any other
usage until the next reboot on a manual unbind event if the memory can't
be offlined. However a future to make sure that remove_memory() succeeds
after memory_failure() of the same range seems a better semantic than
permanently burning physical address space.


I'd have similar requirements for virtio-mem on forced driver unloading 
(although less of an issue, because in contrast to cxl, it doesn't 
really happen in sane environments). However, I'm afraid there are 
absolutely no guarantees that you can actually offline+rip out memory 
exposed to the buddy, at least in the general case.


I guess it might be possible to some degree if memory was onlined to 
ZONE_MOVABLE (there are still no guarantees, though), however, bets are 
completely off with ZONE_NORMAL. Just imagine the memmap of one dax 
device being allocated from another dax device. You cannot possibly 
invalidate via memory_failure() and rip out the memory without crashing 
the whole system afterwards.


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RFC 0/2] virtio-pmem: Asynchronous flush

2021-03-12 Thread David Hildenbrand

On 12.03.21 07:02, Dan Williams wrote:

On Thu, Mar 11, 2021 at 8:21 PM Pankaj Gupta
 wrote:


Hi David,


   Jeff reported preflush order issue with the existing implementation
   of virtio pmem preflush. Dan suggested[1] to implement asynchronous flush
   for virtio pmem using work queue as done in md/RAID. This patch series
   intends to solve the preflush ordering issue and also makes the flush
   asynchronous from the submitting thread POV.

   Submitting this patch series for feedback; it is still a WIP. I have
   done basic testing and am currently doing more testing.

Pankaj Gupta (2):
pmem: make nvdimm_flush asynchronous
virtio_pmem: Async virtio-pmem flush

   drivers/nvdimm/nd_virtio.c   | 66 ++--
   drivers/nvdimm/pmem.c| 15 
   drivers/nvdimm/region_devs.c |  3 +-
   drivers/nvdimm/virtio_pmem.c |  9 +
   drivers/nvdimm/virtio_pmem.h | 12 +++
   5 files changed, 78 insertions(+), 27 deletions(-)

[1] https://marc.info/?l=linux-kernel&m=157446316409937&w=2
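
A generic sketch of the workqueue-deferral idea referenced above; the
structures and function names are invented for illustration and are not the
code from the posted series:

#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/bio.h>

struct demo_flush_req {
        struct work_struct work;
        struct bio *bio;
};

static void demo_flush_fn(struct work_struct *work)
{
        struct demo_flush_req *req = container_of(work, struct demo_flush_req, work);

        /* ... perform the (possibly slow) flush request here ... */
        bio_endio(req->bio);            /* complete the original bio afterwards */
        kfree(req);
}

/* Submission path: hand the flush off to a worker and return immediately. */
static void demo_submit_flush(struct bio *bio)
{
        struct demo_flush_req *req = kmalloc(sizeof(*req), GFP_ATOMIC);

        if (!req) {
                bio->bi_status = BLK_STS_RESOURCE;
                bio_endio(bio);
                return;
        }
        req->bio = bio;
        INIT_WORK(&req->work, demo_flush_fn);
        queue_work(system_wq, &req->work);
}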



Just wondering, was there any follow up of this or are we still waiting
for feedback? :)


Thank you for bringing this up.

My apologies I could not followup on this. I have another version in my local
tree but could not post it as I was not sure if I solved the problem
correctly. I will
clean it up and post for feedback as soon as I can.

P.S.: Due to serious personal/family health issues and other professional
commitments, I am not able to devote much time to this. I feel bad that I
have this unfinished task. In just the last year things have not been
stable for me & my family and are still not getting better :(


No worries Pankaj. Take care of yourself and your family. The
community can handle this for you. I'm open to coaching somebody
through what's involved to get this fix landed.


Absolutely, no need to worry for now - take care of yourself and your 
loved ones! I was merely stumbling over this series while cleaning up my 
inbox, wondering if this is still stuck waiting for review/feedback. No 
need to rush anything or be stressed.


In case I have time to look into this in the future, I'd coordinate in 
this thread (especially, asking for feedback again so I know where this 
series stands)!


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RFC 0/2] virtio-pmem: Asynchronous flush

2021-03-11 Thread David Hildenbrand

On 20.04.20 15:19, Pankaj Gupta wrote:

  Jeff reported preflush order issue with the existing implementation
  of virtio pmem preflush. Dan suggested[1] to implement asynchronous flush
  for virtio pmem using work queue as done in md/RAID. This patch series
  intends to solve the preflush ordering issue and also makes the flush
  asynchronous from the submitting thread POV.

  Submitting this patch series for feedback; it is still a WIP. I have
  done basic testing and am currently doing more testing.

Pankaj Gupta (2):
   pmem: make nvdimm_flush asynchronous
   virtio_pmem: Async virtio-pmem flush

  drivers/nvdimm/nd_virtio.c   | 66 ++--
  drivers/nvdimm/pmem.c| 15 
  drivers/nvdimm/region_devs.c |  3 +-
  drivers/nvdimm/virtio_pmem.c |  9 +
  drivers/nvdimm/virtio_pmem.h | 12 +++
  5 files changed, 78 insertions(+), 27 deletions(-)

[1] https://marc.info/?l=linux-kernel&m=157446316409937&w=2



Just wondering, was there any follow up of this or are we still waiting 
for feedback? :)


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-22 Thread David Hildenbrand

On 22.02.21 10:38, David Hildenbrand wrote:

On 17.02.21 17:19, James Bottomley wrote:

On Tue, 2021-02-16 at 18:16 +0100, David Hildenbrand wrote:
[...]

The discussion regarding migratability only really popped up
because this is a user-visible thing and not being able to
migrate can be a real problem (fragmentation, ZONE_MOVABLE, ...).


I think the biggest use will potentially come from hardware
acceleration.  If it becomes simple to add say encryption to a
secret page with no cost, then no flag needed.  However, if we only
have a limited number of keys so once we run out no more encrypted
memory then it becomes a costly resource and users might want a
choice of being backed by encryption or not.


Right. But wouldn't HW support with configurable keys etc. need more
syscall parameters (meaning, even memfd_secret() as it is would not
be sufficient?). I suspect the simplistic flag approach might not
be sufficient. I might be wrong because I have no clue about MKTME
and friends.


The theory I was operating under is key management is automatic and
hidden, but key scarcity can't be, so if you flag requesting hardware
backing then you either get success (the kernel found a key) or failure
(the kernel is out of keys).  If we actually want to specify the key
then we need an extra argument and we *must* have a new system call.


Anyhow, I still think extending memfd_create() might just be good
enough - at least for now.


I really think this is the wrong approach for a user space ABI.  If we
think we'll ever need to move to a separate syscall, we should begin
with one.  The pain of trying to shift userspace from memfd_create to a
new syscall would be enormous.  It's not impossible (see clone3) but
it's a pain we should avoid if we know it's coming.


Sorry for the late reply, there is just too much going on :)

*If* we ever realize we need to pass more parameters we can easily have
a new syscall for that purpose. *Then*, we know what that syscall will
look like. Right now, it's just pure speculation.

Until then, going with memfd_create() works just fine IMHO.

The worst thing that could happen is that we might not be able to create
all fancy secretmem flavors in the future via memfd_create() but only
via a different, highly specialized syscall. I don't see a real problem
with that.



Adding to that, I'll give up arguing now as I have more important things 
to do. It has been questioned by various people why we need a dedicated
syscall and, at least for me, without a satisfying answer.


Worst thing is that we end up with a syscall that could have been 
avoided, for example, because
1. We add existing/future memfd_create() flags to memfd_secret() as well
when we need them (sealing, hugetlb, ...).
2. We decide in the future to still add MFD_SECRET support to
memfd_create().


So be it.

--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-22 Thread David Hildenbrand

On 17.02.21 17:19, James Bottomley wrote:

On Tue, 2021-02-16 at 18:16 +0100, David Hildenbrand wrote:
[...]

   The discussion regarding migratability only really popped up
because this is a user-visible thing and not being able to
migrate can be a real problem (fragmentation, ZONE_MOVABLE, ...).


I think the biggest use will potentially come from hardware
acceleration.  If it becomes simple to add say encryption to a
secret page with no cost, then no flag needed.  However, if we only
have a limited number of keys so once we run out no more encrypted
memory then it becomes a costly resource and users might want a
choice of being backed by encryption or not.


Right. But wouldn't HW support with configurable keys etc. need more
syscall parameters (meaning, even memfd_secret() as it is would not
be sufficient?). I suspect the simplistic flag approach might not
be sufficient. I might be wrong because I have no clue about MKTME
and friends.


The theory I was operating under is key management is automatic and
hidden, but key scarcity can't be, so if you flag requesting hardware
backing then you either get success (the kernel found a key) or failure
(the kernel is out of keys).  If we actually want to specify the key
then we need an extra argument and we *must* have a new system call.


Anyhow, I still think extending memfd_create() might just be good
enough - at least for now.


I really think this is the wrong approach for a user space ABI.  If we
think we'll ever need to move to a separate syscall, we should begin
with one.  The pain of trying to shift userspace from memfd_create to a
new syscall would be enormous.  It's not impossible (see clone3) but
it's a pain we should avoid if we know it's coming.


Sorry for the late reply, there is just too much going on :)

*If* we ever realize we need to pass more parameters we can easily have 
a new syscall for that purpose. *Then*, we know what that syscall will
look like. Right now, it's just pure speculation.


Until then, going with memfd_create() works just fine IMHO.

The worst thing that could happen is that we might not be able to create
all fancy secretmem flavors in the future via memfd_create() but only
via a different, highly specialized syscall. I don't see a real problem
with that.


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-16 Thread David Hildenbrand

  For the other parts, the question is what we actually want to let
user space configure.

Being able to specify "Very secure" "maximum secure" "average
secure"  all doesn't really make sense to me.


Well, it doesn't to me either unless the user feels a cost/benefit, so
if max cost $100 per invocation and average cost nothing, most people
would choose average unless they had a very good reason not to.  In your
migratable model, if we had separate limits for non-migratable and
migratable, with non-migratable being set low to prevent exhaustion,
max secure becomes a highly scarce resource, whereas average secure is
abundant then having the choice might make sense.


I hope that we can find a way to handle the migration part internally. 
Especially, because Mike wants the default to be "as secure as 
possible", so if there is a flag, it would have to be an opt-out flag.


I guess as long as we don't temporarily map it into the "owned" location 
in the direct map shared by all VCPUs we are in a good position. But this
needs more thought, of course.





  The discussion regarding migratability only really popped up because
this is a user-visible thing and not being able to migrate can be a
real problem (fragmentation, ZONE_MOVABLE, ...).


I think the biggest use will potentially come from hardware
acceleration.  If it becomes simple to add say encryption to a secret
page with no cost, then no flag needed.  However, if we only have a
limited number of keys so once we run out no more encrypted memory then
it becomes a costly resource and users might want a choice of being
backed by encryption or not.


Right. But wouldn't HW support with configurable keys etc. need more 
syscall parameters (meaning, even memfd_secret() as it is would not be
sufficient?). I suspect the simplistic flag approach might not be 
sufficient. I might be wrong because I have no clue about MKTME and friends.


Anyhow, I still think extending memfd_create() might just be good enough 
- at least for now. Things like HW support might have requirements we 
don't even know yet and that we cannot even model in memfd_secret() 
right now.


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-16 Thread David Hildenbrand

On 16.02.21 17:25, James Bottomley wrote:

On Mon, 2021-02-15 at 20:20 +0100, Michal Hocko wrote:
[...]

   What kind of flags are we talking about and why would that be a
problem with memfd_create interface? Could you be more specific
please?


You mean what were the ioctl flags in the patch series linked
above? They were SECRETMEM_EXCLUSIVE and SECRETMEM_UNCACHED in
patch 3/5.


OK I see. How many potential modes are we talking about? A few or
potentially many?
  
Well I initially thought there were two (uncached or not) until you
came up with the migratable or non-migratable, which affects the
security properties.  But now there's also potential for hardware
backing, like mktme,  described by flags as well.  I suppose you could
also use RDT to restrict which cache the data goes into: say L1 but not
L2, to lessen the impact of fully uncached (although the big thrust
of uncached was to blunt hyperthread side channels).  So there is
potential for quite a large expansion even though I'd be willing to bet
that a lot of the modes people have thought about turn out not to be
very effective in the field.


Thanks for the insight. I remember that even the "uncached" part was
effectively nacked by x86 maintainers (I might be wrong). For the other 
parts, the question is what we actually want to let user space configure.


Being able to specify "Very secure" "maximum secure" "average secure" 
all doesn't really make sense to me. The discussion regarding 
migratability only really popped up because this is a user-visible thing 
and not being able to migrate can be a real problem (fragmentation, 
ZONE_MOVABLE, ...).


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-14 Thread David Hildenbrand

> On 14.02.2021 at 10:20, Mike Rapoport wrote:
> 
> On Fri, Feb 12, 2021 at 10:18:19AM +0100, David Hildenbrand wrote:
>>> On 12.02.21 00:09, Mike Rapoport wrote:
>>> On Thu, Feb 11, 2021 at 01:07:10PM +0100, David Hildenbrand wrote:
>>>> On 11.02.21 12:27, Mike Rapoport wrote:
>>>>> On Thu, Feb 11, 2021 at 10:01:32AM +0100, David Hildenbrand wrote:
>>>> 
>>>> So let's talk about the main user-visible differences to other memfd files
>>>> (especially, other purely virtual files like hugetlbfs). With secretmem:
>>>> 
>>>> - File content can only be read/written via memory mappings.
>>>> - File content cannot be swapped out.
>>>> 
>>>> I think there are still valid ways to modify file content using syscalls:
>>>> e.g., fallocate(PUNCH_HOLE). Things like truncate also seems to work just
>>>> fine.
>>> These work perfectly with any file, so maybe we should have added
>>> memfd_create as a flag to open(2) back then and now the secretmem file
>>> descriptors?
>> 
>> I think open() vs memfd_create() makes sense: for open, the path specifies
>> main properties (tmpfs, hugetlbfs, filesystem). On memfd, there is no such
>> path and the "type" has to be specified differently.
>> 
>> Also, open() might open existing files - memfd always creates new files.
> 
> Yes, but still open() returns a handle to a file and memfd_create() returns
> a handle to a file. The differences may be well hidden by e.g. O_MEMORY and
> than features unique to memfd files will have their set of O_SOMETHING
> flags.
> 

Let's agree to disagree.

> It's the same logic that says "we already have an interface that's close
> enough and it's fine to add a bunch of new flags there".

No, not quite. But let's agree to disagree.

> 
> And here we come to the question "what are the differences that justify a
> new system call?" and the answer to this is very subjective. And as such we
> can continue bikeshedding forever.

I think this fits into the existing memfd_create() syscall just fine, and I 
heard no compelling argument why it shouldn't. That's all I can say.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-12 Thread David Hildenbrand

On 12.02.21 00:09, Mike Rapoport wrote:

On Thu, Feb 11, 2021 at 01:07:10PM +0100, David Hildenbrand wrote:

On 11.02.21 12:27, Mike Rapoport wrote:

On Thu, Feb 11, 2021 at 10:01:32AM +0100, David Hildenbrand wrote:


So let's talk about the main user-visible differences to other memfd files
(especially, other purely virtual files like hugetlbfs). With secretmem:

- File content can only be read/written via memory mappings.
- File content cannot be swapped out.

I think there are still valid ways to modify file content using syscalls:
e.g., fallocate(PUNCH_HOLE). Things like truncate also seem to work just
fine.
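
For concreteness, the calls mentioned above would look roughly like this on a
secretmem fd (a sketch, assuming PUNCH_HOLE really is supported there as
discussed; error handling omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

void demo_drop_first_page(int secret_fd, long page_size)
{
        /* discard the first page's contents without changing the file size */
        fallocate(secret_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, page_size);
        /* shrink the file, releasing the backing pages beyond the new size */
        ftruncate(secret_fd, page_size);
}
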
  
These work perfectly with any file, so maybe we should have added

memfd_create as a flag to open(2) back then and now the secretmem file
descriptors?


I think open() vs memfd_create() makes sense: for open, the path 
specifies main properties (tmpfs, hugetlbfs, filesystem). On memfd, 
there is no such path and the "type" has to be specified differently.


Also, open() might open existing files - memfd always creates new files.

  

AFAIKS, we would need MFD_SECRET and disallow
MFD_ALLOW_SEALING and MFD_HUGETLB.


So here we start to multiplex.


Yes. And as Michal said, maybe we can support combinations in the future.


Isn't there a general agreement that syscall multiplexing is not a good
thing?


Looking at mmap(), madvise(), fallocate(), I think multiplexing is just 
fine and flags can be mutually exclusive - as long as we're not 
squashing completely unrelated things into a single system call.


As one example: we don't have mmap_private() vs. mmap_shared() vs. 
mmap_shared_validate(). E.g., MAP_SYNC is only available for 
MAP_SHARED_VALIDATE.
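
The precedent cited here in code form (a sketch; the fd has to live on a
DAX-capable filesystem for MAP_SYNC to be accepted, and reasonably recent
kernel/libc headers are assumed):

#define _GNU_SOURCE
#include <sys/mman.h>

void *demo_map_dax_sync(int fd, size_t len)
{
        /* MAP_SYNC is only valid together with MAP_SHARED_VALIDATE, which
         * makes unknown/unsupported flags fail instead of being ignored. */
        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
}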




memfd_create already has flags validation that does not look very nice.


I assume you're talking about the hugetlb size specifications, right? 
It's not nice but fairly compact.



Adding there only MFD_SECRET will make it a bit less nice, but when we'll
grow new functionality into secretmem that will become horrible.


What do you have in mind? A couple of MFD_SECRET_* flags that only work 
with MFD_SECRET won't hurt IMHO. Just like we allow MFD_HUGE_* only with 
MFD_HUGETLB.
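
As a concrete illustration of that precedent (MFD_HUGETLB is a real flag and
the MFD_HUGE_* size flags multiplex on top of it; MFD_SECRET below is
hypothetical and does not exist, it is only shown to visualize what such
multiplexing would look like):

#define _GNU_SOURCE
#include <sys/mman.h>

int demo_memfd_flags(void)
{
        /* existing multiplexing: MFD_HUGE_* size flags (e.g. MFD_HUGE_2MB)
         * are only accepted together with MFD_HUGETLB */
        int huge_fd = memfd_create("guest-ram", MFD_HUGETLB);

        /* hypothetical: int secret_fd = memfd_create("keys", MFD_SECRET); */

        return huge_fd;
}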


Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-11 Thread David Hildenbrand

On 11.02.21 12:27, Mike Rapoport wrote:

On Thu, Feb 11, 2021 at 10:01:32AM +0100, David Hildenbrand wrote:

On 11.02.21 09:39, Michal Hocko wrote:

On Thu 11-02-21 09:13:19, Mike Rapoport wrote:

On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote:

On Tue 09-02-21 11:09:38, Mike Rapoport wrote:

[...]

Citing my older email:

  I've hesitated whether to continue to use new flags to memfd_create() or to
  add a new system call and I've decided to use a new system call after I've
  started to look into man pages update. There would have been two completely
  independent descriptions and I think it would have been very confusing.


Could you elaborate? Unmapping from the kernel address space can work
both for sealed or hugetlb memfds, no? Those features are completely
orthogonal AFAICS. With a dedicated syscall you will need to introduce
this functionality on top if that is required. Have you considered that?
I mean hugetlb pages are used to back guest memory very often. Is this
something that will be a secret memory usecase?

Please be really specific when giving arguments to back a new syscall
decision.


Isn't "syscalls have completely independent description" specific enough?


No, it's not as you can see from questions I've had above. More on that
below.


We are talking about API here, not the implementation details whether
secretmem supports large pages or not.

The purpose of memfd_create() is to create a file-like access to memory.
The purpose of memfd_secret() is to create a way to access memory hidden
from the kernel.

I don't think overloading memfd_create() with the secretmem flags because
they happen to return a file descriptor will be better for users, but
rather will be more confusing.


This is quite a subjective conclusion. I could very well argue that it
would be much better to have a single syscall to get a fd backed memory
with specific requirements (sealing, unmapping from the kernel address
space). Neither of us would be clearly right or wrong. A more important
point is a future extensibility and usability, though. So let's just
think of few usecases I have outlined above. Is it unrealistic to expect
that secret memory should be sealable? What about hugetlb? Because if
the answer is no then a new API is a clear win as the combination of
flags would never work and then we would just suffer from the syscall
multiplexing without much gain. On the other hand if combination of the
functionality is to be expected then you will have to jam it into
memfd_create and copy the interface likely causing more confusion. See
what I mean?

I by no means insist one way or the other, but from what I have
seen so far I have a feeling that the interface hasn't been thought
through enough. Sure, you have landed on an fd based approach and that
seems fair. But how to get that fd seems to still have some gaps IMHO.



I agree with Michal. This has been raised by different
people already, including on LWN (https://lwn.net/Articles/835342/).

I can follow Mike's reasoning (man page), and I am also fine if there is
a valid reason. However, IMHO the basic description seems to match quite good:

memfd_create() creates an anonymous file and returns a file descriptor that
refers to it.  The file behaves like a regular file, and so can be modified,
truncated, memory-mapped, and so on.  However, unlike a regular file, it lives
in RAM and has a volatile backing storage.  Once all references to the file
are dropped, it is automatically released.  Anonymous memory is used for all
backing pages of the file.  Therefore, files created by memfd_create() have
the same semantics as other anonymous memory allocations such as those
allocated using mmap(2) with the MAP_ANONYMOUS flag.


Even despite my laziness and huge amount of copy-paste you can spot the
differences (this is a very old version, update is due):

memfd_secret()  creates an anonymous file and returns a file descriptor
that refers to it.  The file can only be memory-mapped; the  memory  in
such  mapping  will  have  stronger protection than usual memory mapped
files, and so it can be used to store application  secrets.   Unlike  a
regular file, a file created with memfd_secret() lives in RAM and has a
volatile backing storage.  Once all references to the file are dropped,
it  is  automatically released.  The initial size of the file is set to
0.  Following the call, the file size should be set using ftruncate(2).

The memory areas obtained with mmap(2) from the file descriptor are ex‐
clusive to the owning context.  These areas are removed from the kernel
page tables and only the page table of the process holding the file de‐
scriptor maps the corresponding physical memory.
  


So let's talk about th

Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-11 Thread David Hildenbrand

On 11.02.21 10:38, Michal Hocko wrote:

On Thu 11-02-21 10:01:32, David Hildenbrand wrote:
[...]

AFAIKS, we would need MFD_SECRET and disallow
MFD_ALLOW_SEALING and MFD_HUGETLB.


Yes for an initial version. But I do expect a request to support both
features is just a matter of time.


In addition, we could add MFD_SECRET_NEVER_MAP, which could disallow any kind of
temporary mappings (e.g., for migration). TBC.


I believe this is the mode Mike wants to have by default. A more relaxed
one would be an opt-in, MFD_SECRET_RELAXED, which would allow temporary
mappings in the kernel for content copying (e.g. for migration).


---

Some random thoughts regarding files.

What is the page size of secretmem memory? Sometimes we use huge pages,
sometimes we fallback to 4k pages. So I assume huge pages in general?


Unless there is an explicit request for hugetlb I would say the page
size is not really important like for any other fds. Huge pages can be
used transparently.
  

What are semantics of MADV()/FALLOCATE() etc on such files?


I would expect the same semantic as regular shmem (memfd_create) except
the memory doesn't have _any_ backing storage which makes it
unevictable. So the reclaim related madv won't work but there shouldn't
be any real reason why e.g. MADV_DONTNEED, WILLNEED, DONT_FORK and
others don't work.


Another thought regarding "doesn't have _any_ backing storage"

What are the right semantics when it comes to memory accounting/commit?

As secretmem does not have
a) any backing storage
b) cannot go to swap

The MAP_NORESERVE vs. !MAP_NORESERVE handling gets a little unclear. Why 
"reserve swap space" if the allocations cannot ever go to swap? Sure, we 
want to "reserve physical memory", but in contrast to other users that 
can go to swap.


Of course, this is only relevant for MAP_PRIVATE secretmem mappings. 
Other MAP_SHARED assumes there is no need for reserving swap space as it 
can just go to the backing storage. (yeah, tmpfs/shmem is weird in that 
regard as well, but again, it's a bit different)


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-11 Thread David Hildenbrand

Some random thoughts regarding files.

What is the page size of secretmem memory? Sometimes we use huge pages,
sometimes we fallback to 4k pages. So I assume huge pages in general?


Unless there is an explicit request for hugetlb I would say the page
size is not really important like for any other fds. Huge pages can be
used transparently.


If everything is currently allocated/mapped on PTE granularity, then yes 
I agree. I remember previous versions used to "pool 2MB pages", which 
might have been problematic (thus, my concerns regarding mmap() etc.). 
If that part is now gone, good!


  

What are semantics of MADV()/FALLOCATE() etc on such files?


I would expect the same semantic as regular shmem (memfd_create) except
the memory doesn't have _any_ backing storage which makes it
unevictable. So the reclaim related madv won't work but there shouldn't
be any real reason why e.g. MADV_DONTNEED, WILLNEED, DONT_FORK and
others don't work.


Agreed if we don't have hugepage semantics.


Is userfaultfd() properly fenced? Or does it even work (doubt)?

How does it behave if I mmap(FIXED) something in between?
In which granularity can I do that (->page-size?)?


Again, nothing really exceptional here. This is a mapping like any
other from address space manipulation POV.


Agreed with the PTE mapping approach.




What are other granularity restrictions (->page size)?

Don't want to open a big discussion here, just some random thoughts.
Maybe it has all been already figured out and most of the answers
above are "Fails with -EINVAL".


I think that the behavior should be really in sync with shmem semantic
as much as possible. Most operations should simply work with an
additional direct map manipulation. There is no real reason to be
special. Some functionality might be missing, e.g. hugetlb support but
that has been traditionally added on top of shmem interface so nothing
really new here.


Agreed!

--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-11 Thread David Hildenbrand

On 11.02.21 09:39, Michal Hocko wrote:

On Thu 11-02-21 09:13:19, Mike Rapoport wrote:

On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote:

On Tue 09-02-21 11:09:38, Mike Rapoport wrote:

[...]

Citing my older email:

 I've hesitated whether to continue to use new flags to memfd_create() or to
 add a new system call and I've decided to use a new system call after I've
 started to look into man pages update. There would have been two completely
 independent descriptions and I think it would have been very confusing.


Could you elaborate? Unmapping from the kernel address space can work
both for sealed or hugetlb memfds, no? Those features are completely
orthogonal AFAICS. With a dedicated syscall you will need to introduce
this functionality on top if that is required. Have you considered that?
I mean hugetlb pages are used to back guest memory very often. Is this
something that will be a secret memory usecase?

Please be really specific when giving arguments to back a new syscall
decision.


Isn't "syscalls have completely independent description" specific enough?


No, it's not as you can see from questions I've had above. More on that
below.


We are talking about API here, not the implementation details whether
secretmem supports large pages or not.

The purpose of memfd_create() is to create a file-like access to memory.
The purpose of memfd_secret() is to create a way to access memory hidden
from the kernel.

I don't think overloading memfd_create() with the secretmem flags because
they happen to return a file descriptor will be better for users, but
rather will be more confusing.


This is quite a subjective conclusion. I could very well argue that it
would be much better to have a single syscall to get a fd backed memory
with specific requirements (sealing, unmapping from the kernel address
space). Neither of us would be clearly right or wrong. A more important
point is a future extensibility and usability, though. So let's just
think of few usecases I have outlined above. Is it unrealistic to expect
that secret memory should be sealable? What about hugetlb? Because if
the answer is no then a new API is a clear win as the combination of
flags would never work and then we would just suffer from the syscall
multiplexing without much gain. On the other hand if combination of the
functionality is to be expected then you will have to jam it into
memfd_create and copy the interface likely causing more confusion. See
what I mean?

I by no means insist one way or the other, but from what I have
seen so far I have a feeling that the interface hasn't been thought
through enough. Sure, you have landed on an fd based approach and that
seems fair. But how to get that fd seems to still have some gaps IMHO.



I agree with Michal. This has been raised by different
people already, including on LWN (https://lwn.net/Articles/835342/).

I can follow Mike's reasoning (man page), and I am also fine if there is
a valid reason. However, IMHO the basic description seems to match quite good:

   memfd_create() creates an anonymous file and returns a file descriptor that
   refers to it.  The file behaves like a regular file, and so can be modified,
   truncated, memory-mapped, and so on.  However, unlike a regular file, it
   lives in RAM and has a volatile backing storage.  Once all references to
   the file are dropped, it is automatically released.  Anonymous memory is
   used for all backing pages of the file.  Therefore, files created by
   memfd_create() have the same semantics as other anonymous memory
   allocations such as those allocated using mmap(2) with the MAP_ANONYMOUS
   flag.

AFAIKS, we would need MFD_SECRET and disallow
MFD_ALLOW_SEALING and MFD_HUGETLB.

In addition, we could add MFD_SECRET_NEVER_MAP, which could disallow any kind of
temporary mappings (e.g., for migration). TBC.

---

Some random thoughts regarding files.

What is the page size of secretmem memory? Sometimes we use huge pages,
sometimes we fallback to 4k pages. So I assume huge pages in general?

What are semantics of MADV()/FALLOCATE() etc on such files?
I assume PUNCH_HOLE fails in a nice way? Does it work?

Does mremap()/mremap(FIXED) work/is it blocked?

Does mprotect() fail in a nice way?

Is userfaultfd() properly fenced? Or does it even work (doubt)?

How does it behave if I mmap(FIXED) something in between?
In which granularity can I do that (->page-size?)?

What are other granularity restrictions (->page size)?

Don't want to open a big discussion here, just some random thoughts.
Maybe it has all been already figured out and most of the answers
above are "Fails with -EINVAL".

--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-09 Thread David Hildenbrand

On 09.02.21 14:25, Michal Hocko wrote:

On Tue 09-02-21 11:23:35, David Hildenbrand wrote:
[...]

I am constantly trying to fight for making more stuff MOVABLE instead of
going in the other direction (e.g., because it's easier to implement,
which feels like the wrong direction).

Maybe I am the only person that really cares about ZONE_MOVABLE these days
:) I can't stop such new stuff from popping up, so at least I want it to be
documented.


MOVABLE zone is certainly an important thing to keep working. And there
is still quite a lot of work on the way. But as I've said this is more
of a outlier than a norm. On the other hand movable zone is kinda hard
requirement for a lot of application and it is to be expected that
many features will be less than 100% compatible.  Some usecases even
impossible. That's why I am arguing that we should have a central
document where the movable zone is documented with all the potential
problems we have encountered over time and explicitly state which
features are fully/partially incompatible.



I'll send a mail during the next weeks to gather current restrictions to 
document them (and include my brain dump). We might see more excessive 
use of ZONE_MOVABLE in the future and as history told us, of CMA as 
well. We really should start documenting/caring.


@Mike, it would be sufficient for me if one of your patches at least 
mention the situation in the description like


"Please note that secretmem currently behaves much more like long-term 
GUP instead of mlocked memory; secretmem is unmovable memory directly 
consumed/controlled by user space. secretmem cannot be placed onto 
ZONE_MOVABLE/CMA.


As long as there is no excessive use of secretmem (e.g., maximum of 16 
MiB for selected processes) in combination with ZONE_MOVABLE/CMA, this 
is barely a real issue. However, it is something to keep in mind when a 
significant amount of system RAM might be used for secretmem. In the 
future, we might support migration of secretmem and make it look much 
more like mlocked memory instead."


Just a suggestion.

--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-09 Thread David Hildenbrand

On 09.02.21 11:23, David Hildenbrand wrote:

A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE.
As I've said it is quite easy to land at the similar situation even with
tmpfs/MAP_ANON|MAP_SHARED on swapless system. Neither of the two is
really uncommon. It would be even worse that those would be allowed to
consume both CMA/ZONE_MOVABLE.


IIRC, tmpfs/MAP_ANON|MAP_SHARED memory
a) Is movable, can land in ZONE_MOVABLE/CMA
b) Can be limited by sizing tmpfs appropriately

AFAIK, what you describe is a problem with memory overcommit, not with zone
imbalances (below). Or what am I missing?


It can be problem for both. If you have just too much of shm (do not
forget about MAP_SHARED|MAP_ANON which is much harder to size from an
admin POV) then migratability doesn't really help because you need
free memory to migrate. Without reclaimability this can easily become a
problem. That is why I am saying this is not really a new problem.
Swapless systems are not all that uncommon.


I get your point, it's similar but still different. "no memory in the
system" vs. "plenty of unusable free memory available in the system".

In many setups, memory for user space applications can go to
ZONE_MOVABLE just fine. ZONE_NORMAL etc. can be used for supporting user
space memory (e.g., page tables) and other kernel stuff.

Like, have 4GB of ZONE_MOVABLE with 2GB of ZONE_NORMAL. Have an
application (database) that allocates 4GB of memory. Works just fine.
The zone ratio ends up being a problem for example with many processes
(-> many page tables).

Not being able to put user space memory into the movable zone is a
special case. And we are introducing yet another special case here
(besides vfio, rdma, unmigratable huge pages like gigantic pages).

With plenty of secretmem, looking at /proc/meminfo Total vs. Free can be
a big lie of how your system behaves.

   

One has to be very careful when relying on CMA or movable zones. This is
definitely worth a comment in the kernel command line parameter
documentation. But this is not a new problem.


I see the following thing worth documenting:

Assume you have a system with 2GB of ZONE_NORMAL/ZONE_DMA and 4GB of
ZONE_MOVABLE/CMA.

Assume you make use of 1.5GB of secretmem. Your system might run into OOM
any time although you still have plenty of memory on ZONE_MOVABLE (and even
swap!), simply because you are making excessive use of unmovable allocations
(for user space!) in an environment where you should not make excessive use
of unmovable allocations (e.g., where should page tables go?).


yes, you are right of course and I am not really disputing this. But I
would argue that 2:1 Movable/Normal is something to expect problems
already. "Lowmem" allocations can easily trigger OOM even without secret
mem in the picture. All it takes is allocating a lot of GFP_KERNEL or
even GFP_{HIGH}USER. Really, it is CMA/MOVABLE that are elephant in the
room and one has to be really careful when relying on them.


Right, it's all about what the setup actually needs. Sure, there are
cases where you need significantly more GFP_KERNEL/GFP_{HIGH}USER such
that a 2:1 ratio is not feasible. But I claim that these are corner cases.

Secretmem gives user space the option to allocate a lot of
GFP_{HIGH}USER memory. If I am not wrong, "ulimit -a" tells me that each
application on F33 can allocate 16 GiB (!) of secretmem.


Got to learn to do my math. It's 16 MiB - so as a default it's less 
dangerous than I thought!


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-09 Thread David Hildenbrand

A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE.
As I've said it is quite easy to land at the similar situation even with
tmpfs/MAP_ANON|MAP_SHARED on swapless system. Neither of the two is
really uncommon. It would be even worse that those would be allowed to
consume both CMA/ZONE_MOVABLE.


IIRC, tmpfs/MAP_ANON|MAP_SHARED memory
a) Is movable, can land in ZONE_MOVABLE/CMA
b) Can be limited by sizing tmpfs appropriately

AFAIK, what you describe is a problem with memory overcommit, not with zone
imbalances (below). Or what am I missing?


It can be problem for both. If you have just too much of shm (do not
forget about MAP_SHARED|MAP_ANON which is much harder to size from an
admin POV) then migratability doesn't really help because you need
free memory to migrate. Without reclaimability this can easily become a
problem. That is why I am saying this is not really a new problem.
Swapless systems are not all that uncommon.


I get your point, it's similar but still different. "no memory in the 
system" vs. "plenty of unusable free memory available in the system".


In many setups, memory for user space applications can go to 
ZONE_MOVABLE just fine. ZONE_NORMAL etc. can be used for supporting user 
space memory (e.g., page tables) and other kernel stuff.


Like, have 4GB of ZONE_MOVABLE with 2GB of ZONE_NORMAL. Have an 
application (database) that allocates 4GB of memory. Works just fine. 
The zone ratio ends up being a problem for example with many processes 
(-> many page tables).


Not being able to put user space memory into the movable zone is a 
special case. And we are introducing yet another special case here 
(besides vfio, rdma, unmigratable huge pages like gigantic pages).


With plenty of secretmem, looking at /proc/meminfo Total vs. Free can be 
a big lie of how your system behaves.


  

One has to be very careful when relying on CMA or movable zones. This is
definitely worth a comment in the kernel command line parameter
documentation. But this is not a new problem.


I see the following thing worth documenting:

Assume you have a system with 2GB of ZONE_NORMAL/ZONE_DMA and 4GB of
ZONE_MOVABLE/CMA.

Assume you make use of 1.5GB of secretmem. Your system might run into OOM
any time although you still have plenty of memory on ZONE_MOVABLE (and even
swap!), simply because you are making excessive use of unmovable allocations
(for user space!) in an environment where you should not make excessive use
of unmovable allocations (e.g., where should page tables go?).


yes, you are right of course and I am not really disputing this. But I
would argue that 2:1 Movable/Normal is something to expect problems
already. "Lowmem" allocations can easily trigger OOM even without secret
mem in the picture. All it takes is allocating a lot of GFP_KERNEL or
even GFP_{HIGH}USER. Really, it is CMA/MOVABLE that are elephant in the
room and one has to be really careful when relying on them.


Right, it's all about what the setup actually needs. Sure, there are 
cases where you need significantly more GFP_KERNEL/GFP_{HIGH}USER such 
that a 2:1 ratio is not feasible. But I claim that these are corner cases.


Secretmem gives user space the option to allocate a lot of 
GFP_{HIGH}USER memory. If I am not wrong, "ulimit -a" tells me that each 
application on F33 can allocate 16 GiB (!) of secretmem.


Which other ways do you know where random user space can do something 
similar? I'd be curious what other scenarios there are where user space 
can easily allocate a lot of unmovable memory.


  

The existing controls (mlock limit) don't really match the current semantics
of that memory. I repeat it once again: secretmem *currently* resembles
long-term pinned memory, not mlocked memory.


Well, if we had a proper user space pinning accounting then I would
agree that there is a better model to use. But we don't. And previous
attempts to achieve that have failed. So I am afraid that we do not have
much choice left than using mlock as a model.


Yes, I agree.




Things will change when
implementing migration support for secretmem pages. Until then, the
semantics are different and this should be spelled out.

For long-term pinnings this is kind of obvious, still we're now documenting
it because it's dangerous to not be aware of. Secretmem behaves exactly the
same and I think this is worth spelling out: secretmem has the potential of
being used much more often than fairly special vfio/rdma/ ...


yeah I do agree that pinning is a problem for movable/CMA but most
people simply do not care about those. Movable is the thing for hotplug
and a really weird fragmentation avoidance IIRC and CMA is mostly to


+ handling gigantic pages dynamically


handle crap HW. If those are to be used along with secret mem or
longterm GUP then they will constantly bump into corner cases. Do not
take me wrong, we should be looking at those problems, we should even
document them but I do not see this as anything new. We 

Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-09 Thread David Hildenbrand

On 09.02.21 09:59, Michal Hocko wrote:

On Mon 08-02-21 22:38:03, David Hildenbrand wrote:



On 08.02.2021 at 22:13, Mike Rapoport wrote:

On Mon, Feb 08, 2021 at 10:27:18AM +0100, David Hildenbrand wrote:

On 08.02.21 09:49, Mike Rapoport wrote:

Some questions (and request to document the answers) as we now allow to have
unmovable allocations all over the place and I don't see a single comment
regarding that in the cover letter:

1. How will the issue of plenty of unmovable allocations for user space be
tackled in the future?

2. How has this issue been documented? E.g., interaction with ZONE_MOVABLE
and CMA, alloc_conig_range()/alloc_contig_pages?.


Secretmem sets the mappings gfp mask to GFP_HIGHUSER, so it does not
allocate movable pages at the first place.


That is not the point. Secretmem cannot go on CMA / ZONE_MOVABLE
memory and behaves like long-term pinnings in that sense. This is a
real issue when using a lot of secretmem.


A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE.
As I've said it is quite easy to land at the similar situation even with
tmpfs/MAP_ANON|MAP_SHARED on swapless system. Neither of the two is
really uncommon. It would be even worse that those would be allowed to
consume both CMA/ZONE_MOVABLE.


IIRC, tmpfs/MAP_ANON|MAP_SHARED memory
a) Is movable, can land in ZONE_MOVABLE/CMA
b) Can be limited by sizing tmpfs appropriately

AFAIK, what you describe is a problem with memory overcommit, not with 
zone imbalances (below). Or what am I missing?




One has to be very careful when relying on CMA or movable zones. This is
definitely worth a comment in the kernel command line parameter
documentation. But this is not a new problem.


I see the following thing worth documenting:

Assume you have a system with 2GB of ZONE_NORMAL/ZONE_DMA and 4GB of 
ZONE_MOVABLE/CMA.


Assume you make use of 1.5GB of secretmem. Your system might run into 
OOM any time although you still have plenty of memory on ZONE_MOVABLE
(and even swap!), simply because you are making excessive use of 
unmovable allocations (for user space!) in an environment where you 
should not make excessive use of unmovable allocations (e.g., where 
should page tables go?).


The existing controls (mlock limit) don't really match the current 
semantics of that memory. I repeat it once again: secretmem *currently* 
resembles long-term pinned memory, not mlocked memory. Things will 
change when implementing migration support for secretmem pages. Until 
then, the semantics are different and this should be spelled out.


For long-term pinnings this is kind of obvious, still we're now 
documenting it because it's dangerous to not be aware of. Secretmem 
behaves exactly the same and I think this is worth spelling out: 
secretmem has the potential of being used much more often than fairly 
special vfio/rdma/ ...


Looking at a cover letter that doesn't even mention the issue of 
unmovable allocations makes me think that we are either trying to ignore
the problem or are not aware of the problem.


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-08 Thread David Hildenbrand

> On 08.02.2021 at 22:13, Mike Rapoport wrote:
> 
> On Mon, Feb 08, 2021 at 10:27:18AM +0100, David Hildenbrand wrote:
>> On 08.02.21 09:49, Mike Rapoport wrote:
>> 
>> Some questions (and request to document the answers) as we now allow to have
>> unmovable allocations all over the place and I don't see a single comment
>> regarding that in the cover letter:
>> 
>> 1. How will the issue of plenty of unmovable allocations for user space be
>> tackled in the future?
>> 
>> 2. How has this issue been documented? E.g., interaction with ZONE_MOVABLE
>> and CMA, alloc_conig_range()/alloc_contig_pages?.
> 
> Secretmem sets the mappings gfp mask to GFP_HIGHUSER, so it does not
> allocate movable pages at the first place.

That is not the point. Secretmem cannot go on CMA / ZONE_MOVABLE memory and 
behaves like long-term pinnings in that sense. This is a real issue when using 
a lot of secretmem.

Please have a look at what Pavel documents regarding long term pinnings and 
ZONE_MOVABLE in his patches currently on the list.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users

2021-02-08 Thread David Hildenbrand

On 08.02.21 13:17, Michal Hocko wrote:

On Mon 08-02-21 12:26:31, David Hildenbrand wrote:
[...]

My F33 system happily hibernates to disk, even with an application that
succeeded in doing an mlockall().

And it somewhat makes sense. Even my freshly-booted, idle F33 has

$ cat /proc/meminfo  | grep lock
Mlocked:4860 kB

So, refusing to hibernate with mlocked memory would essentially prohibit any
modern Linux distro from ever hibernating.


My system seems to be completely fine without mlocked memory. It would
be interesting to see who mlocks memory on your system and check whether
the expected mlock semantic really works for those. This should be
documented at least.


I checked some other installations (Ubuntu, RHEL), and they also show no 
sign of Mlock. My notebook (F33) and desktop (F33) both have mlocked 
memory. Either related to F33 or due to some software (e.g., kerberos).


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users

2021-02-08 Thread David Hildenbrand

On 08.02.21 12:14, David Hildenbrand wrote:

On 08.02.21 12:13, David Hildenbrand wrote:

On 08.02.21 11:57, Michal Hocko wrote:

On Mon 08-02-21 11:53:58, David Hildenbrand wrote:

On 08.02.21 11:51, Michal Hocko wrote:

On Mon 08-02-21 11:32:11, David Hildenbrand wrote:

On 08.02.21 11:18, Michal Hocko wrote:

On Mon 08-02-21 10:49:18, Mike Rapoport wrote:

From: Mike Rapoport 

It is unsafe to allow saving of secretmem areas to the hibernation
snapshot as they would be visible after the resume and this essentially
will defeat the purpose of secret memory mappings.

Prevent hibernation whenever there are active secret memory users.


Does this feature need any special handling? As it is effectively
unevictable memory, it should behave the same as other mlock, ramfs
which should already disable hibernation as those cannot be swapped out,
no?



Why should unevictable memory not go to swap when hibernating? We're merely
dumping all of our system RAM (including any unmovable allocations) to swap
storage and the system is essentially completely halted.


My understanding is that mlock is never really made visible via swap
storage.


"Using swap storage for hibernation" and "swapping at runtime" are two
different things. I might be wrong, though.


Well, mlock is certainly used to keep sensitive information, not only to
protect from major/minor faults.



I think you're right in theory, the man page mentions "Cryptographic
security software often handles critical bytes like passwords or secret
keys as data structures" ...

however, I am not aware of any such swap handling and wasn't able to
spot it quickly. Let me take a closer look.


s/swap/hibernate/


My F33 system happily hibernates to disk, even with an application that 
succeeded in doing an mlockall().


And it somewhat makes sense. Even my freshly-booted, idle F33 has

$ cat /proc/meminfo  | grep lock
Mlocked:4860 kB

So, refusing to hibernate with mlocked memory would essentially prohibit 
any modern Linux distro from ever hibernating.
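
The test was essentially a process that does a successful mlockall() and
then just sits there while the machine hibernates - something along these
lines (a sketch, not necessarily the exact program used):

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* lock all current and future mappings into RAM */
        if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
                perror("mlockall");
                return 1;
        }
        /* keep the locked mappings alive while hibernating the machine */
        pause();
        return 0;
}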


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users

2021-02-08 Thread David Hildenbrand

On 08.02.21 12:13, David Hildenbrand wrote:

On 08.02.21 11:57, Michal Hocko wrote:

On Mon 08-02-21 11:53:58, David Hildenbrand wrote:

On 08.02.21 11:51, Michal Hocko wrote:

On Mon 08-02-21 11:32:11, David Hildenbrand wrote:

On 08.02.21 11:18, Michal Hocko wrote:

On Mon 08-02-21 10:49:18, Mike Rapoport wrote:

From: Mike Rapoport 

It is unsafe to allow saving of secretmem areas to the hibernation
snapshot as they would be visible after the resume and this essentially
will defeat the purpose of secret memory mappings.

Prevent hibernation whenever there are active secret memory users.


Does this feature need any special handling? As it is effectively
unevictable memory, it should behave the same as other mlock, ramfs
which should already disable hibernation as those cannot be swapped out,
no?



Why should unevictable memory not go to swap when hibernating? We're merely
dumping all of our system RAM (including any unmovable allocations) to swap
storage and the system is essentially completely halted.


My understanding is that mlock is never really made visible via swap
storage.


"Using swap storage for hibernation" and "swapping at runtime" are two
different things. I might be wrong, though.


Well, mlock is certainly used to keep sensitive information, not only to
protect from major/minor faults.



I think you're right in theory, the man page mentions "Cryptographic
security software often handles critical bytes like passwords or secret
keys as data structures" ...

however, I am not aware of any such swap handling and wasn't able to
spot it quickly. Let me take a closer look.


s/swap/hibernate/


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users

2021-02-08 Thread David Hildenbrand

On 08.02.21 11:57, Michal Hocko wrote:

On Mon 08-02-21 11:53:58, David Hildenbrand wrote:

On 08.02.21 11:51, Michal Hocko wrote:

On Mon 08-02-21 11:32:11, David Hildenbrand wrote:

On 08.02.21 11:18, Michal Hocko wrote:

On Mon 08-02-21 10:49:18, Mike Rapoport wrote:

From: Mike Rapoport 

It is unsafe to allow saving of secretmem areas to the hibernation
snapshot as they would be visible after the resume and this essentially
will defeat the purpose of secret memory mappings.

Prevent hibernation whenever there are active secret memory users.


Does this feature need any special handling? As it is effectively
unevictable memory, it should behave the same as other mlock, ramfs
which should already disable hibernation as those cannot be swapped out,
no?



Why should unevictable memory not go to swap when hibernating? We're merely
dumping all of our system RAM (including any unmovable allocations) to swap
storage and the system is essentially completely halted.


My understanding is that mlock is never really made visible via swap
storage.


"Using swap storage for hibernation" and "swapping at runtime" are two
different things. I might be wrong, though.


Well, mlock is certainly used to keep sensitive information, not only to
protect from major/minor faults.



I think you're right in theory, the man page mentions "Cryptographic 
security software often handles critical bytes like passwords or secret 
keys as data structures" ...


however, I am not aware of any such swap handling and wasn't able to 
spot it quickly. Let me take a closer look.



--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users

2021-02-08 Thread David Hildenbrand

On 08.02.21 11:51, Michal Hocko wrote:

On Mon 08-02-21 11:32:11, David Hildenbrand wrote:

On 08.02.21 11:18, Michal Hocko wrote:

On Mon 08-02-21 10:49:18, Mike Rapoport wrote:

From: Mike Rapoport 

It is unsafe to allow saving of secretmem areas to the hibernation
snapshot as they would be visible after the resume and this essentially
will defeat the purpose of secret memory mappings.

Prevent hibernation whenever there are active secret memory users.


Does this feature need any special handling? As it is effectively
unevictable memory, it should behave the same as other mlock, ramfs
which should already disable hibernation as those cannot be swapped out,
no?



Why should unevictable memory not go to swap when hibernating? We're merely
dumping all of our system RAM (including any unmovable allocations) to swap
storage and the system is essentially completely halted.


My understanding is that mlock is never really made visible via swap
storage.


"Using swap storage for hibernation" and "swapping at runtime" are two 
different things. I might be wrong, though.


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users

2021-02-08 Thread David Hildenbrand

On 08.02.21 11:18, Michal Hocko wrote:

On Mon 08-02-21 10:49:18, Mike Rapoport wrote:

From: Mike Rapoport 

It is unsafe to allow saving of secretmem areas to the hibernation
snapshot as they would be visible after the resume and this essentially
will defeat the purpose of secret memory mappings.

Prevent hibernation whenever there are active secret memory users.


Does this feature need any special handling? As it is effectively
unevictable memory, it should behave the same as other mlock, ramfs
which should already disable hibernation as those cannot be swapped out,
no?



Why should unevictable memory not go to swap when hibernating? We're 
merely dumping all of our system RAM (including any unmovable 
allocations) to swap storage and the system is essentially completely 
halted.


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-02-08 Thread David Hildenbrand

On 08.02.21 09:49, Mike Rapoport wrote:

From: Mike Rapoport 

Hi,

@Andrew, this is based on v5.11-rc5-mmotm-2021-01-27-23-30, with secretmem
and related patches dropped from there, I can rebase whatever way you
prefer.

This is an implementation of "secret" mappings backed by a file descriptor.

The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call The desired protection mode for the
memory is configured using flags parameter of the system call. The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping. The pages in that mapping will be marked as not present in
the direct map and will be present only in the page table of the owning mm.
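
(A minimal usage sketch to make that concrete - error handling trimmed, and
assuming the syscall number is exposed as __NR_memfd_secret once the series
is applied:)

#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        const size_t size = 4096;
        int fd = syscall(__NR_memfd_secret, 0);

        if (fd < 0 || ftruncate(fd, size))
                return 1;

        /* the pages backing this mapping are dropped from the direct map */
        char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        strcpy(p, "top secret");        /* visible only via this mm's page table */

        munmap(p, size);
        close(fd);
        return 0;
}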

Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants'
mappings.

Additionally, in the future the secret mappings may be used as a means to
protect guest memory in a virtual machine host.

For demonstration of secret memory usage we've created a userspace library

https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git

that does two things: the first is act as a preloader for openssl to
redirect all the OPENSSL_malloc calls to secret memory meaning any secret
keys get automatically protected this way and the other thing it does is
expose the API to the user who needs it. We anticipate that a lot of the
use cases would be like the openssl one: many toolkits that deal with
secret keys already have special handling for the memory to try to give
them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.

Hiding secret memory mappings behind an anonymous file allows usage of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.

The anonymous file may be also used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs in the future.

Removing pages from the direct map may cause fragmentation of the direct
map on architectures that use large pages to map physical memory, which
affects system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.

In addition, there is also a long term goal to improve management of the
direct map.


Some questions (and a request to document the answers), as we now allow 
unmovable allocations all over the place and I don't see a single 
comment regarding that in the cover letter:


1. How will the issue of plenty of unmovable allocations for user space 
be tackled in the future?


2. How has this issue been documented? E.g., interaction with 
ZONE_MOVABLE and CMA, alloc_contig_range()/alloc_contig_pages()?


3. What are the plans to support migration in the future, and which 
interface changes will be required? (Michal mentioned some good points 
to make this configurable via the interface, we should plan ahead and 
document)


Thanks!

--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] mm/pmem: Avoid inserting hugepage PTE entry with fsdax if hugepage support is disabled

2021-02-05 Thread David Hildenbrand

On 05.02.21 03:39, Aneesh Kumar K.V wrote:

Differentiate between hardware not supporting hugepages and user disabling THP
via 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'

For the devdax namespace, the kernel handles the above via the
supported_alignment attribute and failing to initialize the namespace
if the namespace align value is not supported on the platform.

For the fsdax namespace, the kernel will continue to initialize
the namespace. This can result in the kernel creating a huge pte
entry even though the hardware doesn't support it.

We do want hugepage support with pmem even if the end-user disabled THP
via sysfs file (/sys/kernel/mm/transparent_hugepage/enabled). Hence
differentiate between hardware/firmware lacking support vs user-controlled
disable of THP and prevent a huge fault if the hardware lacks hugepage
support.

Signed-off-by: Aneesh Kumar K.V 
---
  include/linux/huge_mm.h | 15 +--
  mm/huge_memory.c|  6 +-
  2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6a19f35f836b..ba973efcd369 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -78,6 +78,7 @@ static inline vm_fault_t vmf_insert_pfn_pud(struct vm_fault 
*vmf, pfn_t pfn,
  }
  
  enum transparent_hugepage_flag {

+   TRANSPARENT_HUGEPAGE_NEVER_DAX,
TRANSPARENT_HUGEPAGE_FLAG,
TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
@@ -123,6 +124,13 @@ extern unsigned long transparent_hugepage_flags;
   */
  static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
  {
+
+   /*
+* If the hardware/firmware marked hugepage support disabled.
+*/
+   if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_NEVER_DAX))
+   return false;
+
if (vma->vm_flags & VM_NOHUGEPAGE)
return false;
  
@@ -134,12 +142,7 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
  
  	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))

return true;
-   /*
-* For dax vmas, try to always use hugepage mappings. If the kernel does
-* not support hugepages, fsdax mappings will fallback to PAGE_SIZE
-* mappings, and device-dax namespaces, that try to guarantee a given
-* mapping size, will fail to enable
-*/
+
if (vma_is_dax(vma))
return true;
  
diff --git a/mm/huge_memory.c b/mm/huge_memory.c

index 9237976abe72..d698b7e27447 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -386,7 +386,11 @@ static int __init hugepage_init(void)
struct kobject *hugepage_kobj;
  
  	if (!has_transparent_hugepage()) {

-   transparent_hugepage_flags = 0;
+   /*
+* Hardware doesn't support hugepages, hence disable
+* DAX PMD support.
+*/
+   transparent_hugepage_flags = 1 << 
TRANSPARENT_HUGEPAGE_NEVER_DAX;
return -EINVAL;
}
  



Looks sane to me from my limited understanding of that code :)

--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation

2021-02-02 Thread David Hildenbrand

On 29.01.21 09:51, Michal Hocko wrote:

On Fri 29-01-21 09:21:28, Mike Rapoport wrote:

On Thu, Jan 28, 2021 at 02:01:06PM +0100, Michal Hocko wrote:

On Thu 28-01-21 11:22:59, Mike Rapoport wrote:


And hugetlb pools may be also depleted by anybody by calling
mmap(MAP_HUGETLB) and there is no any limiting knob for this, while
secretmem has RLIMIT_MEMLOCK.


Yes it can fail. But it would fail at the mmap time when the reservation
fails. Not during the #PF time which can be at any time.


It may fail at #PF time as well:

hugetlb_fault()
 hugeltb_no_page()
 ...
 alloc_huge_page()
 alloc_gigantic_page()
 cma_alloc()
 -ENOMEM;


I would have to double check. From what I remember cma allocator is an
optimization to increase chances to allocate hugetlb pages when
overcommitting, because pages should normally be pre-allocated in the pool
and reserved during mmap time. But even if a hugetlb page is not
pre-allocated, then this will get propagated as SIGBUS unless that has
changed.


It's an optimization to allocate gigantic pages dynamically later (so 
not using memblock during boot). Not just for overcommit, but for any 
kind of allocation.


The actual allocation from cma should happen when setting nr_pages:

nr_hugepages_store_common()->set_max_huge_pages()->alloc_pool_huge_page()...->alloc_gigantic_page()

The path described above seems to be trying to overcommit gigantic 
pages, something that can be expected to SIGBUS. Reservations are 
handled via the pre-allocated pool.


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation

2021-02-02 Thread David Hildenbrand

On 02.02.21 15:32, Michal Hocko wrote:

On Tue 02-02-21 15:26:20, David Hildenbrand wrote:

On 02.02.21 15:22, Michal Hocko wrote:

On Tue 02-02-21 15:12:21, David Hildenbrand wrote:
[...]

I think secretmem behaves much more like longterm GUP right now
("unmigratable", "lifetime controlled by user space", "cannot go on
CMA/ZONE_MOVABLE"). I'd either want to reasonably well control/limit it or
make it behave more like mlocked pages.


I thought I have already asked but I must have forgotten. Is there any
actual reason why the memory is not movable? Timing attacks?


I think the reason is simple: no direct map, no copying of memory.


This is an implementation detail though and not something terribly hard
to add on top later on. I was more worried there would be a really
fundamental reason why this is not possible. E.g. security implications.


I don't remember all the details. Let's see what Mike thinks regarding 
migration (e.g., security concerns).


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation

2021-02-02 Thread David Hildenbrand

On 02.02.21 15:22, Michal Hocko wrote:

On Tue 02-02-21 15:12:21, David Hildenbrand wrote:
[...]

I think secretmem behaves much more like longterm GUP right now
("unmigratable", "lifetime controlled by user space", "cannot go on
CMA/ZONE_MOVABLE"). I'd either want to reasonably well control/limit it or
make it behave more like mlocked pages.


I thought I have already asked but I must have forgotten. Is there any
actual reason why the memory is not movable? Timing attacks?


I think the reason is simple: no direct map, no copying of memory.

As I mentioned, we would have to temporarily map in order to copy. 
Mapping it somewhere else (like kmap), outside of the direct map might 
reduce possible attacks.
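
Something along these lines, purely as a sketch of the idea (function name
invented, not part of the series): map both pages through temporary vmap()
mappings so the copy never relies on the (invalidated) direct map:

static int secretmem_copy_page(struct page *dst, struct page *src)
{
        /* temporary kernel mappings; the direct map entries stay invalid */
        void *from = vmap(&src, 1, VM_MAP, PAGE_KERNEL);
        void *to = vmap(&dst, 1, VM_MAP, PAGE_KERNEL);
        int ret = -ENOMEM;

        if (from && to) {
                copy_page(to, from);
                ret = 0;
        }
        if (to)
                vunmap(to);
        if (from)
                vunmap(from);
        return ret;
}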


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation

2021-02-02 Thread David Hildenbrand

On 02.02.21 14:32, Michal Hocko wrote:

On Tue 02-02-21 14:14:09, David Hildenbrand wrote:
[...]

As already expressed, I dislike allowing user space to consume an unlimited
number of unmovable/unmigratable allocations. We already have that in some
cases with huge pages (when the arch does not support migration) - but there
we can at least manage the consumption using the whole max/reserved/free/...
infrastructure. In addition, adding arch support for migration shouldn't be
too complicated.


Well, mlock is not too different here as well. Hugepages are arguably an
easier model because they require explicit pre-configuration by an
admin. Mlock doesn't have anything like that. Please also note that
while mlock pages are migratable by default, this is not the case in
general because they can be configured to disallow migration to prevent
minor page faults, as some workloads require that (e.g. RT).


Yeah, however that is a very special case. In most cases mlock() simply 
prevents swapping, you still have movable pages you can place anywhere 
you like (including on ZONE_MOVABLE).



Another example is ramdisk or even tmpfs (with swap storage depleted or
not configured). Both are PITA from the OOM POV but they are manageable
if people are careful.


Right, but again, special cases - e.g., tmpfs explicitly has to be resized.


If secretmem behaves along those existing models
then we know what to expect at least.


I think secretmem behaves much more like longterm GUP right now 
("unmigratable", "lifetime controlled by user space", "cannot go on 
CMA/ZONE_MOVABLE"). I'd either want to reasonably well control/limit it 
or make it behave more like mlocked pages.


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation

2021-02-02 Thread David Hildenbrand

On 02.02.21 13:48, Mike Rapoport wrote:

On Tue, Feb 02, 2021 at 10:35:05AM +0100, Michal Hocko wrote:

On Mon 01-02-21 08:56:19, James Bottomley wrote:

I have also proposed potential ways out of this. Either the pool is not
fixed sized and you make it a regular unevictable memory (if direct map
fragmentation is not considered a major problem)


I think that the direct map fragmentation is not a major problem, and the
data we have confirms it, so I'd be more than happy to entirely drop the
pool, allocate memory page by page and remove each page from the direct
map.

Still, we cannot prove a negative and it could happen that there is a
workload that would suffer a lot from the direct map fragmentation, so
having a pool of large pages upfront is better than trying to fix it
afterwards. As we get more confidence that the direct map fragmentation is
not the issue it is commonly believed to be, we may remove the pool altogether.

I think that using PMD_ORDER allocations for the pool with a fallback to
order 0 will do the job, but unfortunately I doubt we'll reach a consensus
about this because dogmatic beliefs are hard to shake...

A more restrictive possibility is to still use plain PMD_ORDER allocations
to fill the pool, without relying on CMA. In this case there will be no
global secretmem specific pool to exhaust, but then it's possible to drain
high order free blocks in a system, so CMA has an advantage of limiting
secretmem pools to certain amount of memory with somewhat higher
probability for high order allocation to succeed.


I am not really concerned about fragmenting/breaking up the direct map 
as long as the feature has to be explicitly enabled (similar to 
fragmenting the vmemmap).


As already expressed, I dislike allowing user space to consume an 
unlimited number of unmovable/unmigratable allocations. We already have 
that in some cases with huge pages (when the arch does not support 
migration) - but there we can at least manage the consumption using the 
whole max/reserved/free/... infrastructure. In addition, adding arch 
support for migration shouldn't be too complicated.


The idea of using CMA is quite good IMHO, because there we can locally 
limit the direct map fragmentation and don't have to bother about 
migration at all. We own the area, so we can place as many unmovable 
allocations on it as we can fit.


But it sounds like, we would also need some kind of reservation 
mechanism in either scenario (CMA vs. no CMA).


If we don't want to go full-circle on max/reserved/free/..., allowing 
for migration of secretmem pages would make sense. Then, these pages 
become "less special". Map source, copy, unmap destination. The security 
implementations are the ugly part. I wonder if we could temporarily map 
somewhere else, so avoiding to touch the direct map during migration.


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation

2021-01-26 Thread David Hildenbrand

On 26.01.21 12:46, Michal Hocko wrote:

On Thu 21-01-21 14:27:19, Mike Rapoport wrote:

From: Mike Rapoport 

Removing a PAGE_SIZE page from the direct map every time such page is
allocated for a secret memory mapping will cause severe fragmentation of
the direct map. This fragmentation can be reduced by using PMD-size pages
as a pool for small pages for secret memory mappings.

Add a gen_pool per secretmem inode and lazily populate this pool with
PMD-size pages.

As pages allocated by secretmem become unmovable, use CMA to back large
page caches so that page allocator won't be surprised by failing attempt to
migrate these pages.

The CMA area used by secretmem is controlled by the "secretmem=" kernel
parameter. This allows explicit control over the memory available for
secretmem and provides upper hard limit for secretmem consumption.


OK, so I have finally had a look at this closer and this is really not
acceptable. I have already mentioned that in a response to another patch,
but any task is able to deprive other tasks of access to secret memory
and cause the OOM killer, which wouldn't really ever recover and could
potentially panic the system. Now you could be less drastic and only make
SIGBUS on fault, but that would still be quite terrible. There is a very
good reason why hugetlb implements its non-trivial reservation system to
avoid exactly these problems.

So unless I am really misreading the code
Nacked-by: Michal Hocko 

That doesn't mean I reject the whole idea. There are some details to
sort out as mentioned elsewhere but you cannot really depend on
pre-allocated pool which can fail at a fault time like that.


So, to do it similar to hugetlbfs (e.g., with CMA), there would have to 
be a mechanism to actually try pre-reserving (e.g., from the CMA area), 
at which point in time the pages would get moved to the secretmem pool, 
and a mechanism for mmap() etc. to "reserve" from this secretmem pool, 
such that there are guarantees at fault time?


What we have right now feels like some kind of overcommit (reading: as 
with overcommitting huge pages, we might get SIGBUS at fault time).


TBH, the SIGBUS thingy doesn't sound terrible to me - if this behavior 
is to be expected right now by applications using it and they can handle 
it - no guarantees. I fully agree that some kind of 
reservation/guarantee mechanism would be preferable.


--
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v16 06/11] mm: introduce memfd_secret system call to create "secret" memory areas

2021-01-26 Thread David Hildenbrand
On 26.01.21 10:49, Michal Hocko wrote:
> On Tue 26-01-21 11:20:11, Mike Rapoport wrote:
>> On Tue, Jan 26, 2021 at 10:00:13AM +0100, Michal Hocko wrote:
>>> On Tue 26-01-21 10:33:11, Mike Rapoport wrote:
 On Tue, Jan 26, 2021 at 08:16:14AM +0100, Michal Hocko wrote:
> On Mon 25-01-21 23:36:18, Mike Rapoport wrote:
>> On Mon, Jan 25, 2021 at 06:01:22PM +0100, Michal Hocko wrote:
>>> On Thu 21-01-21 14:27:18, Mike Rapoport wrote:
 From: Mike Rapoport 

 Introduce "memfd_secret" system call with the ability to create memory
 areas visible only in the context of the owning process and not mapped
 either to other processes or in the kernel page tables.

 The user will create a file descriptor using the memfd_secret() system
 call. The memory areas created by mmap() calls from this file 
 descriptor
 will be unmapped from the kernel direct map and they will be only 
 mapped in
 the page table of the owning mm.

 The secret memory remains accessible in the process context using 
 uaccess
 primitives, but it is not accessible using direct/linear map addresses.

 Functions in the follow_page()/get_user_page() family will refuse to 
 return
 a page that belongs to the secret memory area.

 A page that was a part of the secret memory area is cleared when it is
 freed.

 The following example demonstrates creation of a secret mapping (error
 handling is omitted):

fd = memfd_secret(0);
ftruncate(fd, MAP_SIZE);
ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, 
 fd, 0);
>>>
>>> I do not see any access control or permission model for this feature.
>>> Is this feature generally safe to anybody?
>>
>> The mappings obey memlock limit. Besides, this feature should be enabled
>> explicitly at boot with the kernel parameter that says what is the 
>> maximal
>> memory size secretmem can consume.
>
> Why is such a model sufficient and future proof? I mean even when it has
> to be enabled by an admin it is still an all-or-nothing approach. Mlock
> limit is not really useful because it is per mm rather than per user.
>
> Is there any reason why this is allowed for non-privileged processes?
> Maybe this has been discussed in the past but is there any reason why
> this cannot be done by a special device which will allow to provide at
> least some permission policy?
  
 Why should this not be allowed for non-privileged processes? This behaves
 similarly to mlocked memory, so I don't see a reason why secretmem should
 have a different permissions model.
>>>
>>> Because appart from the reclaim aspect it fragments the direct mapping
>>> IIUC. That might have an impact on all others, right?
>>
>> It does fragment the direct map, but first it only splits 1G pages to 2M
>> pages and, as was discussed several times already, it's not that clear which
>> page size in the direct map is best; this is very much workload
>> dependent.
> 
> I do appreciate this has been discussed but this changelog is not
> specific on any of that reasoning and I am pretty sure nobody will
> remember the details in a few years. Also some numbers would be
> appropriate.
> 
>> These are the results of the benchmarks I've run with the default direct
>> mapping covered with 1G pages, with disabled 1G pages using "nogbpages" in
>> the kernel command line and with the entire direct map forced to use 4K
>> pages using a simple patch to arch/x86/mm/init.c.
>>
>> https://docs.google.com/spreadsheets/d/1tdD-cu8e93vnfGsTFxZ5YdaEfs2E1GELlvWNOGkJV2U/edit?usp=sharing
> 
> A good start for the data I am asking above.

I assume you've seen the benchmark results provided by Xing Zhengjun

https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884e...@linux.intel.com/

-- 
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 5/6] mm: Fix memory_failure() handling of dax-namespace metadata

2021-01-13 Thread David Hildenbrand
On 13.01.21 08:35, Dan Williams wrote:
> Given 'struct dev_pagemap' spans both data pages and metadata pages be
> careful to consult the altmap if present to delineate metadata. In fact
> the pfn_first() helper already identifies the first valid data pfn, so
> export that helper for other code paths via pgmap_pfn_valid().
> 
> Other usages of get_dev_pagemap() are not a concern because those are
> operating on known data pfns that have been looked up by get_user_pages().
> I.e. metadata pfns are never user mapped.
> 
> Cc: Naoya Horiguchi 
> Cc: Andrew Morton 
> Reported-by: David Hildenbrand 
> Signed-off-by: Dan Williams 
> ---
>  include/linux/memremap.h |6 ++
>  mm/memory-failure.c  |6 ++
>  mm/memremap.c|   15 +++
>  3 files changed, 27 insertions(+)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 79c49e7f5c30..f5b464daeeca 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -137,6 +137,7 @@ void *devm_memremap_pages(struct device *dev, struct 
> dev_pagemap *pgmap);
>  void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
>  struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>   struct dev_pagemap *pgmap);
> +bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>  
>  unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
>  void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
> @@ -165,6 +166,11 @@ static inline struct dev_pagemap 
> *get_dev_pagemap(unsigned long pfn,
>   return NULL;
>  }
>  
> +static inline bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long 
> pfn)
> +{
> + return false;
> +}
> +
>  static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
>  {
>   return 0;
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 78b173c7190c..541569cb4a99 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1308,6 +1308,12 @@ static int memory_failure_dev_pagemap(unsigned long 
> pfn, int flags,
>*/
>   put_page(page);
>  
> + /* device metadata space is not recoverable */
> + if (!pgmap_pfn_valid(pgmap, pfn)) {
> + rc = -ENXIO;
> + goto out;
> + }
> +
>   /*
>* Prevent the inode from being freed while we are interrogating
>* the address_space, typically this would be handled by
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 16b2fb482da1..2455bac89506 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -80,6 +80,21 @@ static unsigned long pfn_first(struct dev_pagemap *pgmap, 
> int range_id)
>   return pfn + vmem_altmap_offset(pgmap_altmap(pgmap));
>  }
>  
> +bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn)
> +{
> + int i;
> +
> + for (i = 0; i < pgmap->nr_range; i++) {
> + struct range *range = &pgmap->ranges[i];
> +
> + if (pfn >= PHYS_PFN(range->start) &&
> + pfn <= PHYS_PFN(range->end))
> + return pfn >= pfn_first(pgmap, i);
> + }
> +
> + return false;
> +}
> +
>  static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
>  {
>   const struct range *range = &pgmap->ranges[range_id];
> 

LGTM

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 2/6] mm: Teach pfn_to_online_page() to consider subsection validity

2021-01-13 Thread David Hildenbrand
On 13.01.21 08:35, Dan Williams wrote:
> pfn_section_valid() determines pfn validity on subsection granularity
> where pfn_valid() may be limited to coarse section granularity.
> Explicitly validate subsections after pfn_valid() succeeds.
> 
> Fixes: b13bc35193d9 ("mm/hotplug: invalid PFNs from pfn_to_online_page()")
> Cc: Qian Cai 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Reported-by: David Hildenbrand 
> Signed-off-by: Dan Williams 
> ---
>  mm/memory_hotplug.c |   24 
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 55a69d4396e7..9f37f8a68da4 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -308,11 +308,27 @@ static int check_hotplug_memory_addressable(unsigned 
> long pfn,
>  struct page *pfn_to_online_page(unsigned long pfn)
>  {
>   unsigned long nr = pfn_to_section_nr(pfn);
> + struct mem_section *ms;
> +
> + if (nr >= NR_MEM_SECTIONS)
> + return NULL;
> +
> + ms = __nr_to_section(nr);
> + if (!online_section(ms))
> + return NULL;
> +
> + /*
> +  * Save some code text when online_section() +
> +  * pfn_section_valid() are sufficient.
> +  */
> + if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID))
> + if (!pfn_valid(pfn))
> + return NULL;

Nit:

if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) &&
    !pfn_valid(pfn))
        return NULL;

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 4/5] mm: Fix page reference leak in soft_offline_page()

2021-01-12 Thread David Hildenbrand
On 12.01.21 10:34, Dan Williams wrote:
> The conversion to move pfn_to_online_page() internal to
> soft_offline_page() missed that the get_user_pages() reference needs to
> be dropped when pfn_to_online_page() fails.
> 
> When soft_offline_page() is handed a pfn_valid() &&
> !pfn_to_online_page() pfn the kernel hangs at dax-device shutdown due to
> a leaked reference.
> 
> Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn")
> Cc: Andrew Morton 
> Cc: Naoya Horiguchi 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: 
> Signed-off-by: Dan Williams 
> ---
>  mm/memory-failure.c |   20 
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 5a38e9eade94..78b173c7190c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1885,6 +1885,12 @@ static int soft_offline_free_page(struct page *page)
>   return rc;
>  }
>  
> +static void put_ref_page(struct page *page)
> +{
> + if (page)
> + put_page(page);
> +}
> +
>  /**
>   * soft_offline_page - Soft offline a page.
>   * @pfn: pfn to soft-offline
> @@ -1910,20 +1916,26 @@ static int soft_offline_free_page(struct page *page)
>  int soft_offline_page(unsigned long pfn, int flags)
>  {
>   int ret;
> - struct page *page;
>   bool try_again = true;
> + struct page *page, *ref_page = NULL;
> +
> + WARN_ON_ONCE(!pfn_valid(pfn) && (flags & MF_COUNT_INCREASED));
>  
>   if (!pfn_valid(pfn))
>   return -ENXIO;
> + if (flags & MF_COUNT_INCREASED)
> + ref_page = pfn_to_page(pfn);
> +
>   /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
>   page = pfn_to_online_page(pfn);
> - if (!page)
> + if (!page) {
> + put_ref_page(ref_page);
>   return -EIO;
> + }
>  
>   if (PageHWPoison(page)) {
>   pr_info("%s: %#lx page already poisoned\n", __func__, pfn);
> - if (flags & MF_COUNT_INCREASED)
> - put_page(page);
> + put_ref_page(ref_page);
>   return 0;
>   }

Reviewed-by: David Hildenbrand 


-- 
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 3/5] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions

2021-01-12 Thread David Hildenbrand
On 12.01.21 10:34, Dan Williams wrote:
> While pfn_to_online_page() is able to determine pfn_valid() at
> subsection granularity it is not able to reliably determine if a given
> pfn is also online if the section is mixes ZONE_{NORMAL,MOVABLE} with
> ZONE_DEVICE. This means that pfn_to_online_page() may return invalid
> @page objects. For example with a memory map like:
> 
> 1-1fbff : System RAM
>   14200-143002e16 : Kernel code
>   14320-143713fff : Kernel rodata
>   14380-143b15b7f : Kernel data
>   144227000-144ff : Kernel bss
> 1fc00-2fbff : Persistent Memory (legacy)
>   1fc00-2fbff : namespace0.0
> 
> This command:
> 
> echo 0x1fc00 > /sys/devices/system/memory/soft_offline_page
> 
> ...succeeds when it should fail. When it succeeds it touches
> an uninitialized page and may crash or cause other damage (see
> dissolve_free_huge_page()).
> 
> While the memory map above is contrived via the memmap=ss!nn kernel
> command line option, the collision happens in practice on shipping
> platforms. The memory controller resources that decode spans of
> physical address space are a limited resource. One technique
> platform-firmware uses to conserve those resources is to share a decoder
> across 2 devices to keep the address range contiguous. Unfortunately the
> unit of operation of a decoder is 64MiB while the Linux section size is
> 128MiB. This results in situations where, without subsection hotplug
> memory mappings with different lifetimes collide into one object that
> can only express one lifetime.
> 
> Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
> section that mixes ZONE_DEVICE pfns with other online pfns. With
> SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall
> back to a slow-path check for ZONE_DEVICE pfns in an online section. In
> the fast path online_section() for a full ZONE_DEVICE section returns
> false.
> 
> Because the collision case is rare, and for simplicity, the
> SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.
> 
> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> Cc: Andrew Morton 
> Reported-by: Michal Hocko 
> Reported-by: David Hildenbrand 
> Signed-off-by: Dan Williams 
> ---
>  include/linux/mmzone.h |   22 +++---
>  mm/memory_hotplug.c|   38 ++
>  2 files changed, 53 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b593316bff3d..0b5c44f730b4 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1273,13 +1273,14 @@ extern size_t mem_section_usage_size(void);
>   *  which results in PFN_SECTION_SHIFT equal 6.
>   * To sum it up, at least 6 bits are available.
>   */
> -#define  SECTION_MARKED_PRESENT  (1UL<<0)
> -#define SECTION_HAS_MEM_MAP  (1UL<<1)
> -#define SECTION_IS_ONLINE(1UL<<2)
> -#define SECTION_IS_EARLY (1UL<<3)
> -#define SECTION_MAP_LAST_BIT (1UL<<4)
> -#define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
> -#define SECTION_NID_SHIFT3
> +#define SECTION_MARKED_PRESENT   (1UL<<0)
> +#define SECTION_HAS_MEM_MAP  (1UL<<1)
> +#define SECTION_IS_ONLINE(1UL<<2)
> +#define SECTION_IS_EARLY (1UL<<3)
> +#define SECTION_TAINT_ZONE_DEVICE(1UL<<4)
> +#define SECTION_MAP_LAST_BIT (1UL<<5)
> +#define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
> +#define SECTION_NID_SHIFT3
>  
>  static inline struct page *__section_mem_map_addr(struct mem_section 
> *section)
>  {
> @@ -1318,6 +1319,13 @@ static inline int online_section(struct mem_section 
> *section)
>   return (section && (section->section_mem_map & SECTION_IS_ONLINE));
>  }
>  
> +static inline int online_device_section(struct mem_section *section)
> +{
> + unsigned long flags = SECTION_IS_ONLINE | SECTION_TAINT_ZONE_DEVICE;
> +
> + return section && ((section->section_mem_map & flags) == flags);
> +}
> +
>  static inline int online_section_nr(unsigned long nr)
>  {
>   return online_section(__nr_to_section(nr));
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index a845b3979bc0..b2ccb84c3082 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -308,6 +308,7 @@ static int check_hotplug_memory_addressable(unsigned long 
> pfn,
>  struct page *pfn_to_online_page(unsigned long pfn)
>  {
>   unsigned long nr = pfn_to_section_nr(pfn);
> + struct dev_pagemap *pgmap;
>   struct mem_section *ms;
>  

Re: [PATCH v2 2/5] mm: Teach pfn_to_online_page() to consider subsection validity

2021-01-12 Thread David Hildenbrand
On 12.01.21 10:34, Dan Williams wrote:
> pfn_section_valid() determines pfn validity on subsection granularity.
> 
> pfn_valid_within() internally uses pfn_section_valid(), but gates it
> with early_section() to preserve the traditional behavior of pfn_valid()
> before subsection support was added.
> 
> pfn_to_online_page() wants the explicit precision that pfn_valid() does
> not offer, so use pfn_section_valid() directly. Since
> pfn_to_online_page() already open codes the validity of the section
> number vs NR_MEM_SECTIONS, there's not much value to using
> pfn_valid_within(), just use pfn_section_valid(). This loses the
> valid_section() check that pfn_valid_within() was performing, but that
> was already redundant with the online check.
> 
> Fixes: b13bc35193d9 ("mm/hotplug: invalid PFNs from pfn_to_online_page()")
> Cc: Qian Cai 
> Cc: Michal Hocko 
> Reported-by: David Hildenbrand 
> ---
>  mm/memory_hotplug.c |   16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 55a69d4396e7..a845b3979bc0 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -308,11 +308,19 @@ static int check_hotplug_memory_addressable(unsigned 
> long pfn,
>  struct page *pfn_to_online_page(unsigned long pfn)
>  {
>   unsigned long nr = pfn_to_section_nr(pfn);
> + struct mem_section *ms;
> +
> + if (nr >= NR_MEM_SECTIONS)
> + return NULL;
> +
> + ms = __nr_to_section(nr);
> + if (!online_section(ms))
> + return NULL;
> +
> + if (!pfn_section_valid(ms, pfn))
> + return NULL;

That's not sufficient for alternative implementations of pfn_valid().

You still need some kind of pfn_valid(pfn) for alternative versions of
pfn_valid(). Consider arm64 memory holes in the memmap. See their
current (yet to be fixed/reworked) pfn_valid() implementation.
(pfn_valid_within() is implicitly active on arm64)

Actually, I think we should add something like the following, to make
this clearer (pfn_valid_within() is confusing)

#ifdef CONFIG_HAVE_ARCH_PFN_VALID
/* We might have to check for holes inside the memmap. */
if (!pfn_valid(pfn))
return NULL;
#endif

>  
> - if (nr < NR_MEM_SECTIONS && online_section_nr(nr) &&
> - pfn_valid_within(pfn))
> - return pfn_to_page(pfn);
> - return NULL;
> + return pfn_to_page(pfn);
>  }
>  EXPORT_SYMBOL_GPL(pfn_to_online_page);
>  
> 


-- 
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 1/5] mm: Move pfn_to_online_page() out of line

2021-01-12 Thread David Hildenbrand
On 12.01.21 10:34, Dan Williams wrote:
> pfn_to_online_page() is already too large to be a macro or an inline
> function. In anticipation of further logic changes / growth, move it out
> of line.
> 
> No functional change, just code movement.
> 
> Cc: David Hildenbrand 
> Reported-by: Michal Hocko 
> Signed-off-by: Dan Williams 
> ---
>  include/linux/memory_hotplug.h |   17 +
>  mm/memory_hotplug.c|   16 
>  2 files changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 15acce5ab106..3d99de0db2dd 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -16,22 +16,7 @@ struct resource;
>  struct vmem_altmap;
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -/*
> - * Return page for the valid pfn only if the page is online. All pfn
> - * walkers which rely on the fully initialized page->flags and others
> - * should use this rather than pfn_valid && pfn_to_page
> - */
> -#define pfn_to_online_page(pfn) \
> -({  \
> - struct page *___page = NULL;   \
> - unsigned long ___pfn = pfn;\
> - unsigned long ___nr = pfn_to_section_nr(___pfn);   \
> -\
> - if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
> - pfn_valid_within(___pfn))  \
> - ___page = pfn_to_page(___pfn); \
> - ___page;   \
> -})
> +struct page *pfn_to_online_page(unsigned long pfn);
>  
>  /*
>   * Types for free bootmem stored in page->lru.next. These have to be in
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f9d57b9be8c7..55a69d4396e7 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -300,6 +300,22 @@ static int check_hotplug_memory_addressable(unsigned 
> long pfn,
>   return 0;
>  }
>  
> +/*
> + * Return page for the valid pfn only if the page is online. All pfn
> + * walkers which rely on the fully initialized page->flags and others
> + * should use this rather than pfn_valid && pfn_to_page
> + */
> +struct page *pfn_to_online_page(unsigned long pfn)
> +{
> + unsigned long nr = pfn_to_section_nr(pfn);
> +
> + if (nr < NR_MEM_SECTIONS && online_section_nr(nr) &&
> + pfn_valid_within(pfn))
> +     return pfn_to_page(pfn);
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(pfn_to_online_page);
> +
>  /*
>   * Reasonably generic function for adding memory.  It is
>   * expected that archs that support memory hotplug will
> 

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps

2020-12-09 Thread David Hildenbrand
On 08.12.20 18:28, Joao Martins wrote:
> Hey,
> 
> This small series attempts to minimize 'struct page' overhead by
> pursuing a similar approach as Muchun Song's series "Free some vmemmap
> pages of hugetlb page"[0] but applied to devmap/ZONE_DEVICE. 
> 
> [0] 
> https://lore.kernel.org/linux-mm/20201130151838.11208-1-songmuc...@bytedance.com/
> 
> The link above describes it quite nicely, but the idea is to reuse tail
> page vmemmap areas, in particular the area which only describes tail pages.
> So a vmemmap page describes 64 struct pages, and the first page for a given
> ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
> vmemmap page would contain only tail pages, and that's what gets reused across
> the rest of the subsection/section. The bigger the page size, the bigger the
> savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap 
> pages).
> 
> In terms of savings, per 1Tb of memory, the struct page cost would go down
> with compound pagemap:
> 
> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total 
> memory)
> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total 
> memory)
> 

That's the dream :)
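
For anybody double-checking those numbers (assuming a 64-byte struct page
and 4K base pages):

  1 TiB / 4 KiB * 64 B = 16 GiB of memmap (~1.56 % of memory)
  2M page:  8 vmemmap pages, 2 kept -> 16 GiB * 2/8    = 4 GiB (~0.39 %)
  1G page:  4096 vmemmap pages, 2 kept -> 16 GiB * 2/4096 = 8 MiB (~0.0008 %)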

> Along the way I've extended it past 'struct page' overhead *trying* to 
> address a
> few performance issues we knew about for pmem, specifically on the
> {pin,get}_user_pages* function family with device-dax vmas which are really
> slow even for the fast variants. THP is great on -fast variants but all except
> hugetlbfs perform rather poorly on non-fast gup.
> 
> So to summarize what the series does:
> 
> Patches 1-5: Much like Muchun's series, we reuse tail page areas across a given
> page size (namely @align was referred by remaining memremap/dax code) and
> enabling of memremap to initialize the ZONE_DEVICE pages as compound pages or 
> a
> given @align order. The main difference though, is that contrary to the 
> hugetlbfs
> series, there's no vmemmap for the area, because we are onlining it.

Yeah, I'd argue that this case is a lot easier to handle. When the buddy
is involved, things get more complicated.

-- 
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v12 04/10] set_memory: allow querying whether set_direct_map_*() is actually enabled

2020-11-25 Thread David Hildenbrand
On 25.11.20 13:11, Mike Rapoport wrote:
> On Wed, Nov 25, 2020 at 12:22:52PM +0100, David Hildenbrand wrote:
>>>  #include 
>>>  
>>>  #endif /* __ASM_CACHEFLUSH_H */
>>> diff --git a/arch/arm64/include/asm/set_memory.h 
>>> b/arch/arm64/include/asm/set_memory.h
>>> new file mode 100644
>>> index ..ecb6b0f449ab
>>> --- /dev/null
>>> +++ b/arch/arm64/include/asm/set_memory.h
>>> @@ -0,0 +1,17 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +
>>> +#ifndef _ASM_ARM64_SET_MEMORY_H
>>> +#define _ASM_ARM64_SET_MEMORY_H
>>> +
>>> +#include 
>>> +
>>> +bool can_set_direct_map(void);
>>> +#define can_set_direct_map can_set_direct_map
>>
>> Well, that looks weird.
>> [...]
> 
> We have a lot of those e.g. in linux/pgtable.h
> 
>>>  }
>>> +#else /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
>>> +/*
>>> + * Some architectures, e.g. ARM64 can disable direct map modifications at
>>> + * boot time. Let them override this query.
>>> + */
>>> +#ifndef can_set_direct_map
>>> +static inline bool can_set_direct_map(void)
>>> +{
>>> +   return true;
>>> +}
>>
>> I think we prefer __weak functions for something like that, avoids the
>> ifdefery.
> 
> I'd prefer this for two reasons: first, static inline can be optimized
> away and second, there is no really good place to put a generic
> implementation.

Fair enough

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v12 04/10] set_memory: allow querying whether set_direct_map_*() is actually enabled

2020-11-25 Thread David Hildenbrand
>  #include 
>  
>  #endif /* __ASM_CACHEFLUSH_H */
> diff --git a/arch/arm64/include/asm/set_memory.h 
> b/arch/arm64/include/asm/set_memory.h
> new file mode 100644
> index ..ecb6b0f449ab
> --- /dev/null
> +++ b/arch/arm64/include/asm/set_memory.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +
> +#ifndef _ASM_ARM64_SET_MEMORY_H
> +#define _ASM_ARM64_SET_MEMORY_H
> +
> +#include 
> +
> +bool can_set_direct_map(void);
> +#define can_set_direct_map can_set_direct_map

Well, that looks weird.
[...]

>  }
> +#else /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
> +/*
> + * Some architectures, e.g. ARM64 can disable direct map modifications at
> + * boot time. Let them overrive this query.
> + */
> +#ifndef can_set_direct_map
> +static inline bool can_set_direct_map(void)
> +{
> + return true;
> +}

I think we prefer __weak functions for something like that, avoids the
ifdefery.
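
For reference, the __weak variant would look something like this (just a
sketch of the pattern, not a replacement patch):

/* generic fallback somewhere in common code */
bool __weak can_set_direct_map(void)
{
        return true;
}

An architecture that needs different behavior simply provides a strong
definition of can_set_direct_map() and the linker picks it over the weak
one, with no #ifdef / #define dance required.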

Apart from that, LGTM.

-- 
Thanks,

David / dhildenb
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v8 2/9] mmap: make mlock_future_check() global

2020-11-17 Thread David Hildenbrand

On 15.11.20 09:26, Mike Rapoport wrote:

On Thu, Nov 12, 2020 at 09:15:18PM +0100, David Hildenbrand wrote:



On 12.11.2020 20:08, Mike Rapoport wrote:

On Thu, Nov 12, 2020 at 05:22:00PM +0100, David Hildenbrand wrote:

On 10.11.20 19:06, Mike Rapoport wrote:
On Tue, Nov 10, 2020 at 06:17:26PM +0100, David Hildenbrand wrote:

On 10.11.20 16:14, Mike Rapoport wrote:

From: Mike Rapoport 

It will be used by the upcoming secret memory implementation.

Signed-off-by: Mike Rapoport 
---
   mm/internal.h | 3 +++
   mm/mmap.c | 5 ++---
   2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c43ccdddb0f6..ae146a260b14 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -348,6 +348,9 @@ static inline void munlock_vma_pages_all(struct 
vm_area_struct *vma)
   extern void mlock_vma_page(struct page *page);
   extern unsigned int munlock_vma_page(struct page *page);
+extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+  unsigned long len);
+
   /*
* Clear the page's PageMlocked().  This can be useful in a situation where
* we want to unconditionally remove a page from the pagecache -- e.g.,
diff --git a/mm/mmap.c b/mm/mmap.c
index 61f72b09d990..c481f088bd50 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1348,9 +1348,8 @@ static inline unsigned long round_hint_to_min(unsigned 
long hint)
   return hint;
   }
-static inline int mlock_future_check(struct mm_struct *mm,
- unsigned long flags,
- unsigned long len)
+int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+   unsigned long len)
   {
   unsigned long locked, lock_limit;



So, an interesting question is if you actually want to charge secretmem
pages against mlock now, or if you want a dedicated secretmem cgroup
controller instead?


Well, with the current implementation there are three limits an
administrator can use to control secretmem limits: mlock, memcg and
kernel parameter.

The kernel parameter puts a global upper limit for secretmem usage,
memcg accounts all secretmem allocations, including the unused memory in
the large page cache, and mlock allows a per-task limit for secretmem
mappings, well, like mlock does.

I didn't consider a dedicated cgroup, as it seems we already have enough
existing knobs and a new one would be unnecessary.


To me it feels like the mlock() limit is a wrong fit for secretmem. But
maybe there are other cases of using the mlock() limit without actually
doing mlock() that I am not aware of (most probably :) )?


Secretmem does not explicitly call mlock(), but it does what mlock()
does and a bit more. Citing mlock(2):

  mlock(),  mlock2(),  and  mlockall()  lock  part  or all of the calling
  process's virtual address space into RAM, preventing that  memory  from
  being paged to the swap area.

So, given that secretmem pages are not swappable, I think that
RLIMIT_MEMLOCK is appropriate here.



The page explicitly lists mlock() system calls.


Well, it's mlock() man page, isn't it? ;-)


;)



My thinking was that since secretmem does what mlock() does wrt
swappability, it should at least obey the same limit, i.e.
RLIMIT_MEMLOCK.


Right, but at least currently, it behaves like any other CMA allocation 
(IIRC they are all unmovable and, therefore, not swappable). In the 
future, if pages were movable (but not swappable), I guess it might 
make more sense. I assume we never ever want to swap secretmem.


"man getrlimit" states for RLIMIT_MEMLOCK:

"This is the maximum number of bytes of memory that may be
 locked into RAM.  [...] This limit affects
 mlock(2), mlockall(2), and the mmap(2) MAP_LOCKED operation.
 Since Linux 2.6.9, it also affects the shmctl(2) SHM_LOCK op‐
 eration [...]"

So that place has to be updated as well I guess? Otherwise this might 
come as a surprise for users.





E.g., we also don't
account for gigantic pages - which might be allocated from CMA and are
not swappable.
  
Do you mean gigantic pages in hugetlbfs?


Yes


It seems to me that hugetlbfs accounting is a completely different
story.


I'd say it is right now comparable to secretmem - which is why I thought 
similar accounting would make sense.
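
Just so we are talking about the same accounting, here is roughly how I
picture the now-global helper being used on the secretmem side - only a
sketch under my assumptions (the vm_ops name is made up), not Mike's
actual patch:

	#include "internal.h"	/* for the now-global mlock_future_check() */

	static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
	{
		unsigned long len = vma->vm_end - vma->vm_start;

		/* charge the mapping against RLIMIT_MEMLOCK, just like mlock() would */
		if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
			return -EAGAIN;

		/* the pages behave like mlocked ones: not swappable (and not movable) */
		vma->vm_flags |= VM_LOCKED | VM_DONTDUMP;
		vma->vm_ops = &secretmem_vm_ops;	/* hypothetical vm_ops */

		return 0;
	}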



--
Thanks,

David / dhildenb


Re: [PATCH v8 2/9] mmap: make mlock_future_check() global

2020-11-12 Thread David Hildenbrand

> Am 12.11.2020 um 20:08 schrieb Mike Rapoport :
> 
> On Thu, Nov 12, 2020 at 05:22:00PM +0100, David Hildenbrand wrote:
>>> On 10.11.20 19:06, Mike Rapoport wrote:
>>> On Tue, Nov 10, 2020 at 06:17:26PM +0100, David Hildenbrand wrote:
>>>> On 10.11.20 16:14, Mike Rapoport wrote:
>>>>> From: Mike Rapoport 
>>>>> 
>>>>> It will be used by the upcoming secret memory implementation.
>>>>> 
>>>>> Signed-off-by: Mike Rapoport 
>>>>> ---
>>>>>   mm/internal.h | 3 +++
>>>>>   mm/mmap.c | 5 ++---
>>>>>   2 files changed, 5 insertions(+), 3 deletions(-)
>>>>> 
>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>> index c43ccdddb0f6..ae146a260b14 100644
>>>>> --- a/mm/internal.h
>>>>> +++ b/mm/internal.h
>>>>> @@ -348,6 +348,9 @@ static inline void munlock_vma_pages_all(struct 
>>>>> vm_area_struct *vma)
>>>>>   extern void mlock_vma_page(struct page *page);
>>>>>   extern unsigned int munlock_vma_page(struct page *page);
>>>>> +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
>>>>> +  unsigned long len);
>>>>> +
>>>>>   /*
>>>>>* Clear the page's PageMlocked().  This can be useful in a situation 
>>>>> where
>>>>>* we want to unconditionally remove a page from the pagecache -- e.g.,
>>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>>> index 61f72b09d990..c481f088bd50 100644
>>>>> --- a/mm/mmap.c
>>>>> +++ b/mm/mmap.c
>>>>> @@ -1348,9 +1348,8 @@ static inline unsigned long 
>>>>> round_hint_to_min(unsigned long hint)
>>>>>   return hint;
>>>>>   }
>>>>> -static inline int mlock_future_check(struct mm_struct *mm,
>>>>> - unsigned long flags,
>>>>> - unsigned long len)
>>>>> +int mlock_future_check(struct mm_struct *mm, unsigned long flags,
>>>>> +   unsigned long len)
>>>>>   {
>>>>>   unsigned long locked, lock_limit;
>>>>> 
>>>> 
>>>> So, an interesting question is if you actually want to charge secretmem
>>>> pages against mlock now, or if you want a dedicated secretmem cgroup
>>>> controller instead?
>>> 
>>> Well, with the current implementation there are three limits an
>>> administrator can use to control secretmem limits: mlock, memcg and
>>> kernel parameter.
>>> 
>>> The kernel parameter puts a global upper limit for secretmem usage,
>>> memcg accounts all secretmem allocations, including the unused memory in
>>> large pages caching and mlock allows per task limit for secretmem
>>> mappings, well, like mlock does.
>>> 
>>> I didn't consider a dedicated cgroup, as it seems we already have enough
>>> existing knobs and a new one would be unnecessary.
>> 
>> To me it feels like the mlock() limit is a wrong fit for secretmem. But
>> maybe there are other cases of using the mlock() limit without actually
>> doing mlock() that I am not aware of (most probably :) )?
> 
> Secretmem does not explicitly call mlock(), but it does what mlock()
> does and a bit more. Citing mlock(2):
> 
>  mlock(),  mlock2(),  and  mlockall()  lock  part  or all of the calling
>  process's virtual address space into RAM, preventing that  memory  from
>  being paged to the swap area.
> 
> So, based on that secretmem pages are not swappable, I think that
> RLIMIT_MEMLOCK is appropriate here.
> 

The page explicitly lists mlock() system calls. E.g., we also don't account for 
gigantic pages - which might be allocated from CMA and are not swappable.



>> I mean, my concern is not earth shattering, this can be reworked later. As I
>> said, it just feels wrong.
>> 
>> -- 
>> Thanks,
>> 
>> David / dhildenb
>> 
> 
> -- 
> Sincerely yours,
> Mike.
> 


Re: [PATCH v8 2/9] mmap: make mlock_future_check() global

2020-11-12 Thread David Hildenbrand

On 10.11.20 19:06, Mike Rapoport wrote:

On Tue, Nov 10, 2020 at 06:17:26PM +0100, David Hildenbrand wrote:

On 10.11.20 16:14, Mike Rapoport wrote:

From: Mike Rapoport 

It will be used by the upcoming secret memory implementation.

Signed-off-by: Mike Rapoport 
---
   mm/internal.h | 3 +++
   mm/mmap.c | 5 ++---
   2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c43ccdddb0f6..ae146a260b14 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -348,6 +348,9 @@ static inline void munlock_vma_pages_all(struct 
vm_area_struct *vma)
   extern void mlock_vma_page(struct page *page);
   extern unsigned int munlock_vma_page(struct page *page);
+extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+ unsigned long len);
+
   /*
* Clear the page's PageMlocked().  This can be useful in a situation where
* we want to unconditionally remove a page from the pagecache -- e.g.,
diff --git a/mm/mmap.c b/mm/mmap.c
index 61f72b09d990..c481f088bd50 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1348,9 +1348,8 @@ static inline unsigned long round_hint_to_min(unsigned 
long hint)
return hint;
   }
-static inline int mlock_future_check(struct mm_struct *mm,
-unsigned long flags,
-unsigned long len)
+int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+  unsigned long len)
   {
unsigned long locked, lock_limit;



So, an interesting question is if you actually want to charge secretmem
pages against mlock now, or if you want a dedicated secretmem cgroup
controller instead?


Well, with the current implementation there are three limits an
administrator can use to control secretmem limits: mlock, memcg and
kernel parameter.

The kernel parameter puts a global upper limit for secretmem usage,
memcg accounts all secretmem allocations, including the unused memory in
large pages caching and mlock allows per task limit for secretmem
mappings, well, like mlock does.

I didn't consider a dedicated cgroup, as it seems we already have enough
existing knobs and a new one would be unnecessary.


To me it feels like the mlock() limit is a wrong fit for secretmem. But 
maybe there are other cases of using the mlock() limit without actually 
doing mlock() that I am not aware of (most probably :) )?


I mean, my concern is not earth shattering, this can be reworked later. 
As I said, it just feels wrong.


--
Thanks,

David / dhildenb


Re: [PATCH v8 2/9] mmap: make mlock_future_check() global

2020-11-10 Thread David Hildenbrand

On 10.11.20 16:14, Mike Rapoport wrote:

From: Mike Rapoport 

It will be used by the upcoming secret memory implementation.

Signed-off-by: Mike Rapoport 
---
  mm/internal.h | 3 +++
  mm/mmap.c | 5 ++---
  2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c43ccdddb0f6..ae146a260b14 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -348,6 +348,9 @@ static inline void munlock_vma_pages_all(struct 
vm_area_struct *vma)
  extern void mlock_vma_page(struct page *page);
  extern unsigned int munlock_vma_page(struct page *page);
  
+extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,

+ unsigned long len);
+
  /*
   * Clear the page's PageMlocked().  This can be useful in a situation where
   * we want to unconditionally remove a page from the pagecache -- e.g.,
diff --git a/mm/mmap.c b/mm/mmap.c
index 61f72b09d990..c481f088bd50 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1348,9 +1348,8 @@ static inline unsigned long round_hint_to_min(unsigned 
long hint)
return hint;
  }
  
-static inline int mlock_future_check(struct mm_struct *mm,

-unsigned long flags,
-unsigned long len)
+int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+  unsigned long len)
  {
unsigned long locked, lock_limit;
  



So, an interesting question is if you actually want to charge secretmem 
pages against mlock now, or if you want a dedicated secretmem cgroup 
controller instead?


--
Thanks,

David / dhildenb


Re: [PATCH v6 0/6] mm: introduce memfd_secret system call to create "secret" memory areas

2020-11-03 Thread David Hildenbrand

On 03.11.20 10:52, Mike Rapoport wrote:

On Mon, Nov 02, 2020 at 06:51:09PM +0100, David Hildenbrand wrote:

Assume you have a system with quite some ZONE_MOVABLE memory (esp. in
virtualized environments), eating up a significant amount of !ZONE_MOVABLE
memory dynamically at runtime can lead to non-obvious issues. It looks like
you have plenty of free memory, but the kernel might still OOM when trying
to do kernel allocations e.g., for pagetables. With CMA we at least know
what we're dealing with - it behaves like ZONE_MOVABLE except for the owner
that can place unmovable pages there. We can use it to compute statically
the amount of ZONE_MOVABLE memory we can have in the system without doing
harm to the system.


Why would you say that secretmem allocates from !ZONE_MOVABLE?
If we put boot time reservations aside, the memory allocation for
secretmem follows the same rules as the memory allocations for any file
descriptor. That means we allocate memory with GFP_HIGHUSER_MOVABLE.


Oh, okay - I missed that! I had the impression that pages are unmovable and
allocating from ZONE_MOVABLE would be a violation of that?


After the allocation the memory indeed becomes unmovable but it's not
like we are eating memory from other zones here.


... and here you have your problem. That's a no-no. We only allow it in very
special cases where it can't be avoided - e.g., vfio having to pin guest
memory when passing through memory to VMs.

Hotplug memory, online it to ZONE_MOVABLE. Allocate secretmem. Try to unplug
the memory again -> endless loop in offline_pages().

Or have a CMA area that gets used with GFP_HIGHUSER_MOVABLE. Allocate
secretmem. The owner of the area tries to allocate memory - always fails.
Purpose of CMA destroyed.




Ideally, we would want to support page migration/compaction and allow for
allocation from ZONE_MOVABLE as well. Would involve temporarily mapping,
copying, unmapping. Sounds feasible, but not sure which roadblocks we would
find on the way.


We can support migration/compaction with temporary mapping. The first
roadblock I've hit there was that migration allocates 4K destination
page and if we use it in secret map we are back to scrambling the direct
map into 4K pieces. It still sounds feasible but not as trivial :)


That sounds like the proper way for me to do it then.
  
Although migration of secretmem pages sounds feasible now, there may be
other issues I didn't see because I'm not very familiar with
migration/compaction code.


Migration of PMDs might also be feasible -  and it would be even 
cleaner. But I agree that that might require more work and starting with 
something simpler (!movable) is the right way to move forward.
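
For the 4K case, the temporarily-map/copy/unmap step itself could look
something like this - just a sketch under my assumptions (single-page
set_direct_map_*() signatures as in this series; TLB flushing and the
migration bookkeeping around it are left out):

	#include <linux/set_memory.h>
	#include <linux/highmem.h>

	static int secretmem_copy_page(struct page *dst, struct page *src)
	{
		int err;

		/* src is the secretmem page, currently absent from the direct map */
		err = set_direct_map_default_noflush(src);
		if (err)
			return err;

		/* dst is a freshly allocated, normally mapped destination page */
		copy_highpage(dst, src);

		/*
		 * dst now holds the secret data, so hide it from the direct map;
		 * this is exactly the spot where a 4K dst would shatter a PMD
		 * mapping. src goes back to the page allocator and stays mapped.
		 */
		return set_direct_map_invalid_noflush(dst);
	}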


--
Thanks,

David / dhildenb


Re: [PATCH v6 0/6] mm: introduce memfd_secret system call to create "secret" memory areas

2020-11-02 Thread David Hildenbrand

Assume you have a system with quite some ZONE_MOVABLE memory (esp. in
virtualized environments), eating up a significant amount of !ZONE_MOVABLE
memory dynamically at runtime can lead to non-obvious issues. It looks like
you have plenty of free memory, but the kernel might still OOM when trying
to do kernel allocations e.g., for pagetables. With CMA we at least know
what we're dealing with - it behaves like ZONE_MOVABLE except for the owner
that can place unmovable pages there. We can use it to compute statically
the amount of ZONE_MOVABLE memory we can have in the system without doing
harm to the system.


Why would you say that secretmem allocates from !ZONE_MOVABLE?
If we put boot time reservations aside, the memory allocation for
secretmem follows the same rules as the memory allocations for any file
descriptor. That means we allocate memory with GFP_HIGHUSER_MOVABLE.


Oh, okay - I missed that! I had the impression that pages are unmovable 
and allocating from ZONE_MOVABLE would be a violation of that?



After the allocation the memory indeed becomes unmovable but it's not
like we are eating memory from other zones here.


... and here you have your problem. That's a no-no. We only allow it in 
very special cases where it can't be avoided - e.g., vfio having to pin 
guest memory when passing through memory to VMs.


Hotplug memory, online it to ZONE_MOVABLE. Allocate secretmem. Try to 
unplug the memory again -> endless loop in offline_pages().


Or have a CMA area that gets used with GFP_HIGHUSER_MOVABLE. Allocate 
secretmem. The owner of the area tries to allocate memory - always 
fails. Purpose of CMA destroyed.





Ideally, we would want to support page migration/compaction and allow for
allocation from ZONE_MOVABLE as well. Would involve temporarily mapping,
copying, unmapping. Sounds feasible, but not sure which roadblocks we would
find on the way.


We can support migration/compaction with temporary mapping. The first
roadblock I've hit there was that migration allocates 4K destination
page and if we use it in secret map we are back to scrambling the direct
map into 4K pieces. It still sounds feasible but not as trivial :)


That sounds like the proper way for me to do it then.



But again, there is nothing in the current form of secretmem that
prevents allocation from ZONE_MOVABLE.


Oh, there is something: That the pages are not movable.

--
Thanks,

David / dhildenb


Re: [PATCH v6 0/6] mm: introduce memfd_secret system call to create "secret" memory areas

2020-11-02 Thread David Hildenbrand

On 02.11.20 10:11, David Hildenbrand wrote:

On 24.09.20 15:28, Mike Rapoport wrote:

From: Mike Rapoport 

Hi,

This is an implementation of "secret" mappings backed by a file descriptor.
I've dropped the boot time reservation patch for now as it is not strictly
required for the basic usage and can be easily added later either with or
without CMA.


Hi Mike,

I'd like to stress again that I'd prefer *any* secretmem allocations
going via CMA as long as these pages are unmovable. The user can
allocate a non-significant amount of unmovable allocations only fenced


lol, "non-neglectable" or "significant". Guess I need another coffee :)


--
Thanks,

David / dhildenb


Re: [PATCH v6 0/6] mm: introduce memfd_secret system call to create "secret" memory areas

2020-11-02 Thread David Hildenbrand

On 24.09.20 15:28, Mike Rapoport wrote:

From: Mike Rapoport 

Hi,

This is an implementation of "secret" mappings backed by a file descriptor.
I've dropped the boot time reservation patch for now as it is not strictly
required for the basic usage and can be easily added later either with or
without CMA.


Hi Mike,

I'd like to stress again that I'd prefer *any* secretmem allocations 
going via CMA as long as these pages are unmovable. The user can 
allocate a non-significant amount of unmovable allocations only fenced 
by the mlock limit, which behave very different to mlocked pages - they 
are not movable for page compaction/migration.


Assume you have a system with quite some ZONE_MOVABLE memory (esp. in 
virtualized environments), eating up a significant amount of 
!ZONE_MOVABLE memory dynamically at runtime can lead to non-obvious 
issues. It looks like you have plenty of free memory, but the kernel 
might still OOM when trying to do kernel allocations e.g., for 
pagetables. With CMA we at least know what we're dealing with - it 
behaves like ZONE_MOVABLE except for the owner that can place unmovable 
pages there. We can use it to compute statically the amount of 
ZONE_MOVABLE memory we can have in the system without doing harm to the 
system.


Ideally, we would want to support page migration/compaction and allow 
for allocation from ZONE_MOVABLE as well. Would involve temporarily 
mapping, copying, unmapping. Sounds feasible, but not sure which 
roadblocks we would find on the way.


[...]



The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call The desired protection mode for the
memory is configured using flags parameter of the system call. The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping. The pages in that mapping will be marked as not present in
the direct map and will have desired protection bits set in the user page
table. For instance, current implementation allows uncached mappings.

Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants
mappings.

Additionally, the secret mappings may be used as a mean to protect guest
memory in a virtual machine host.

For demonstration of secret memory usage we've created a userspace library
[1] that does two things: the first is act as a preloader for openssl to
redirect all the OPENSSL_malloc calls to secret memory meaning any secret
keys get automatically protected this way and the other thing it does is
expose the API to the user who needs it. We anticipate that a lot of the
use cases would be like the openssl one: many toolkits that deal with
secret keys already have special handling for the memory to try to give
them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.

I've hesitated whether to continue to use new flags to memfd_create() or to
add a new system call and I've decided to use a new system call after I've
started to look into man pages update. There would have been two completely
independent descriptions and I think it would have been very confusing.


This was also raised on lwn.net by "dullfire" [1]. I do wonder if it 
would be the right place as well.


[1] https://lwn.net/Articles/835342/#Comments



Hiding secret memory mappings behind an anonymous file allows (ab)use of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.

The anonymous file may be also used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs in the future.

As the fragmentation of the direct map was one of the major concerns raised
during the previous postings, I've added an amortizing cache of PMD-size
pages to each file descriptor that is used as an allocation pool for the
secret memory areas.
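
(Unrelated side note for anyone who wants to play with this: a minimal
userspace consumer of such a file descriptor would look roughly like the
sketch below. The syscall number is an assumption on my side - it is
whatever the series ends up wiring up - and error handling is trimmed.)

	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <string.h>

	#ifndef __NR_memfd_secret
	#define __NR_memfd_secret 447	/* assumption, check your arch/kernel */
	#endif

	int main(void)
	{
		size_t len = 4096;
		int fd = syscall(__NR_memfd_secret, 0);	/* flags: none in this sketch */
		char *p;

		if (fd < 0 || ftruncate(fd, len))
			return 1;

		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;

		strcpy(p, "not reachable via the kernel direct map");

		munmap(p, len);
		close(fd);
		return 0;
	}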


--
Thanks,

David / dhildenb


Re: [PATCH] x86/mm: Fix phys_to_target_node() export

2020-10-30 Thread David Hildenbrand

On 30.10.20 03:29, Dan Williams wrote:

The core-mm has a default __weak implementation of phys_to_target_node()
when the architecture does not override it. That symbol is exported
for modules. However, while the export in mm/memory_hotplug.c exported
the symbol in the configuration cases of:

CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=y

...and:

CONFIG_NUMA_KEEP_MEMINFO=n
CONFIG_MEMORY_HOTPLUG=y

...it failed to export the symbol in the case of:

CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=n

Always export the symbol from the CONFIG_NUMA_KEEP_MEMINFO section of
arch/x86/mm/numa.c, and teach mm/memory_hotplug.c to optionally export
in case arch/x86/mm/numa.c has already performed the export.

The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
now that the symbol is properly exported in all combinations of
CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG. Note that in the
CONFIG_NUMA=n case no export is needed since there is a dummy static
inline implementation of phys_to_target_node() in that case.

Reported-by: Randy Dunlap 
Reported-by: Thomas Gleixner 
Reported-by: kernel test robot 
Fixes: a035b6bf863e ("mm/memory_hotplug: introduce default phys_to_target_node() 
implementation")
Cc: Joao Martins 
Cc: Andrew Morton 
Cc: x...@kernel.org
Cc: Vishal Verma 
Signed-off-by: Dan Williams 
---
  arch/x86/mm/numa.c  |1 +
  drivers/dax/Kconfig |1 -
  mm/memory_hotplug.c |5 +
  3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 44148691d78b..e025947f19e0 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -938,6 +938,7 @@ int phys_to_target_node(phys_addr_t start)
  
  	return meminfo_to_nid(&numa_reserved_meminfo, start);

  }
+EXPORT_SYMBOL_GPL(phys_to_target_node);
  
  int memory_add_physaddr_to_nid(u64 start)

  {
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 567428e10b7b..d2834c2cfa10 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -50,7 +50,6 @@ config DEV_DAX_HMEM
  Say M if unsure.
  
  config DEV_DAX_HMEM_DEVICES

-   depends on NUMA_KEEP_MEMINFO # for phys_to_target_node()
depends on DEV_DAX_HMEM && DAX=y
def_bool y
  
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c

index b44d4c7ba73b..ed326b489674 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -365,9 +365,14 @@ int __weak phys_to_target_node(u64 start)
start);
return 0;
  }
+
+/* If the arch did not export a strong symbol, export the weak one. */
+#ifndef CONFIG_NUMA_KEEP_MEMINFO
  EXPORT_SYMBOL_GPL(phys_to_target_node);
  #endif
  
+#endif

+
  /* find the smallest valid pfn in the range [start_pfn, end_pfn) */
  static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
 unsigned long start_pfn,




Reviewed-by: David Hildenbrand 

--
Thanks,

David / dhildenb


Re: [PATCH v7 1/7] mm: add definition of PMD_PAGE_ORDER

2020-10-27 Thread David Hildenbrand
On 26.10.20 09:37, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> The definition of PMD_PAGE_ORDER denoting the number of base pages in the
> second-level leaf page is already used by DAX and may be handy in other
> cases as well.
> 
> Several architectures already have definition of PMD_ORDER as the size of
> second level page table, so to avoid conflict with these definitions use
> PMD_PAGE_ORDER name and update DAX respectively.
> 
> Signed-off-by: Mike Rapoport 

Reviewed-by: David Hildenbrand 


-- 
Thanks,

David / dhildenb


Re: [PATCH v7 3/7] set_memory: allow set_direct_map_*_noflush() for multiple pages

2020-10-27 Thread David Hildenbrand
On 26.10.20 20:01, Edgecombe, Rick P wrote:
> On Mon, 2020-10-26 at 10:37 +0200, Mike Rapoport wrote:
>> +++ b/arch/x86/mm/pat/set_memory.c
>> @@ -2184,14 +2184,14 @@ static int __set_pages_np(struct page *page,
>> int numpages)
>> return __change_page_attr_set_clr(&cpa, 0);
>>  }
>>  
>> -int set_direct_map_invalid_noflush(struct page *page)
>> +int set_direct_map_invalid_noflush(struct page *page, int numpages)
>>  {
>> -   return __set_pages_np(page, 1);
>> +   return __set_pages_np(page, numpages);
>>  }
>>  
>> -int set_direct_map_default_noflush(struct page *page)
>> +int set_direct_map_default_noflush(struct page *page, int numpages)
>>  {
>> -   return __set_pages_p(page, 1);
>> +   return __set_pages_p(page, numpages);
>>  }
> 
> Somewhat related to your other series, this could result in large NP
> pages and trip up hibernate.
> 

It feels somewhat desirable to disable hibernation once secretmem is
enabled, right? Otherwise you'll be writing out your secrets to swap,
where they will remain even after booting up again ...

Skipping secretmem pages when hibernating is the wrong approach I guess ...
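
Something as simple as refusing to hibernate while there are active
secretmem users would do the trick - a rough sketch of the idea (function
names are made up, this is not an actual patch):

	#include <linux/atomic.h>
	#include <linux/errno.h>
	#include <linux/types.h>

	static atomic_t secretmem_users = ATOMIC_INIT(0);

	/* called when a secretmem file is created / finally released */
	void secretmem_users_inc(void) { atomic_inc(&secretmem_users); }
	void secretmem_users_dec(void) { atomic_dec(&secretmem_users); }

	bool secretmem_active(void)
	{
		return !!atomic_read(&secretmem_users);
	}

	/* ... and early in the hibernation entry point: */
	static int hibernate_check_secretmem(void)
	{
		return secretmem_active() ? -EBUSY : 0;
	}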

-- 
Thanks,

David / dhildenb


Re: [PATCH 1/2] device-dax/kmem: Fix resource release

2020-10-15 Thread David Hildenbrand
On 15.10.20 02:42, Dan Williams wrote:
> The conversion to request_mem_region() is broken because it assumes that
> the range is marked busy prior to release. However, due to the way that
> the kmem driver manipulates the IORESOURCE_BUSY flag (clears it to
> let {add,remove}_memory() handle busy) it requires a manual
> release_resource() to perform cleanup.
> 
> Given that the actual 'struct resource *' needs to be recalled, not just
> the range, add that tracking to the kmem driver-data.
> 
> Reported-by: David Hildenbrand 
> Fixes: 0513bd5bb114 ("device-dax/kmem: replace release_resource() with 
> release_mem_region()")
> Cc: Vishal Verma 
> Cc: Dave Hansen 
> Cc: Pavel Tatashin 
> Cc: Brice Goglin 
> Cc: Dave Jiang 
> Cc: Ira Weiny 
> Cc: Jia He 
> Cc: Joao Martins 
> Cc: Jonathan Cameron 
> Signed-off-by: Dan Williams 
> ---
>  drivers/dax/kmem.c |   48 ++--
>  1 file changed, 34 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 6c933f2b604e..af04b6d1d263 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -35,11 +35,17 @@ static int dax_kmem_range(struct dev_dax *dev_dax, int i, 
> struct range *r)
>   return 0;
>  }
>  
> +struct dax_kmem_data {
> + const char *res_name;
> + struct resource *res[];
> +};
> +
>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  {
>   struct device *dev = &dev_dax->dev;
> + struct dax_kmem_data *data;
> + int rc = -ENOMEM;
>   int i, mapped = 0;
> - char *res_name;
>   int numa_node;
>  
>   /*
> @@ -55,14 +61,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>   return -EINVAL;
>   }
>  
> - res_name = kstrdup(dev_name(dev), GFP_KERNEL);
> - if (!res_name)
> + data = kzalloc(sizeof(*data) + sizeof(struct resource *) * 
> dev_dax->nr_range, GFP_KERNEL);
> + if (!data)
>   return -ENOMEM;
>  
> + data->res_name = kstrdup(dev_name(dev), GFP_KERNEL);
> + if (!data->res_name)
> + goto err_res_name;
> +
>   for (i = 0; i < dev_dax->nr_range; i++) {
>   struct resource *res;
>   struct range range;
> - int rc;
>  
>   rc = dax_kmem_range(dev_dax, i, &range);
>   if (rc) {
> @@ -72,7 +81,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>   }
>  
>   /* Region is permanently reserved if hotremove fails. */
> - res = request_mem_region(range.start, range_len(&range), 
> res_name);
> + res = request_mem_region(range.start, range_len(&range), 
> data->res_name);
>   if (!res) {
>   dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve 
> region\n",
>   i, range.start, range.end);
> @@ -82,9 +91,10 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>*/
>   if (mapped)
>   continue;
> - kfree(res_name);
> - return -EBUSY;
> + rc = -EBUSY;
> + goto err_request_mem;
>   }
> + data->res[i] = res;
>  
>   /*
>* Set flags appropriate for System RAM.  Leave ..._BUSY clear
> @@ -104,18 +114,25 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>   if (rc) {
>   dev_warn(dev, "mapping%d: %#llx-%#llx memory add 
> failed\n",
>   i, range.start, range.end);
> - release_mem_region(range.start, range_len(&range));
> + release_resource(res);
> + kfree(res);
> + data->res[i] = NULL;
>   if (mapped)
>   continue;
> - kfree(res_name);
> - return rc;
> + goto err_request_mem;
>   }
>   mapped++;
>   }
>  
> - dev_set_drvdata(dev, res_name);
> + dev_set_drvdata(dev, data);
>  
>   return 0;
> +
> +err_request_mem:
> + kfree(data->res_name);
> +err_res_name:
> + kfree(data);
> + return rc;
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> @@ -123,7 +140,7 @@ static int dev_dax_kmem_remove(struct dev_dax *dev_dax)
>  {
>   int i, success = 0;
>   struct device *dev = &dev_dax->dev;

Re: [PATCH v6 03/11] device-dax/kmem: move resource tracking to drvdata

2020-10-06 Thread David Hildenbrand
On 06.10.20 08:55, Dan Williams wrote:
> Towards removing the mode specific @dax_kmem_res attribute from the
> generic 'struct dev_dax', and preparing for multi-range support, move
> resource tracking to driver data.  The memory for the resource name
> needs to have its own lifetime separate from the device bind lifetime
> for cases where the driver is unbound, but the kmem range could not be
> unplugged from the page allocator.
> 
> The resource reservation also needs to be released manually via
> release_resource() given the awkward manipulation of the
> IORESOURCE_BUSY flag.
> 
> Cc: David Hildenbrand 
> Cc: Vishal Verma 
> Cc: Dave Hansen 
> Cc: Pavel Tatashin 
> Cc: Brice Goglin 
> Cc: Dave Jiang 
> Cc: David Hildenbrand 
> Cc: Ira Weiny 
> Cc: Jia He 
> Cc: Joao Martins 
> Cc: Jonathan Cameron 
> Signed-off-by: Dan Williams 

Reviewed-by: David Hildenbrand 

Thanks!


-- 
Thanks,

David / dhildenb


Re: [PATCH v5 01/17] device-dax: make pgmap optional for instance creation

2020-10-01 Thread David Hildenbrand
On 01.10.20 18:54, Dan Williams wrote:
> On Thu, Oct 1, 2020 at 1:41 AM David Hildenbrand  wrote:
>>
>> On 25.09.20 21:11, Dan Williams wrote:
>>> The passed in dev_pagemap is only required in the pmem case as the
>>> libnvdimm core may have reserved a vmem_altmap for dev_memremap_pages() to
>>> place the memmap in pmem directly.  In the hmem case there is no agent
>>> reserving an altmap so it can all be handled by a core internal default.
>>>
>>> Pass the resource range via a new @range property of 'struct
>>> dev_dax_data'.
>>>
>>> Link: 
>>> https://lkml.kernel.org/r/159643099958.4062302.10379230791041872886.st...@dwillia2-desk3.amr.corp.intel.com
>>> Cc: David Hildenbrand 
>>> Cc: Vishal Verma 
>>> Cc: Dave Hansen 
>>> Cc: Pavel Tatashin 
>>> Cc: Brice Goglin 
>>> Cc: Dave Jiang 
>>> Cc: David Hildenbrand 
>>> Cc: Ira Weiny 
>>> Cc: Jia He 
>>> Cc: Joao Martins 
>>> Cc: Jonathan Cameron 
>>> Signed-off-by: Dan Williams 
>>> ---
>>>  drivers/dax/bus.c  |   29 +++--
>>>  drivers/dax/bus.h  |2 ++
>>>  drivers/dax/dax-private.h  |9 -
>>>  drivers/dax/device.c   |   28 +++-
>>>  drivers/dax/hmem/hmem.c|8 
>>>  drivers/dax/kmem.c |   12 ++--
>>>  drivers/dax/pmem/core.c|4 
>>>  tools/testing/nvdimm/dax-dev.c |8 
>>>  8 files changed, 62 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
>>> index dffa4655e128..96bd64ba95a5 100644
>>> --- a/drivers/dax/bus.c
>>> +++ b/drivers/dax/bus.c
>>> @@ -271,7 +271,7 @@ static ssize_t size_show(struct device *dev,
>>>   struct device_attribute *attr, char *buf)
>>>  {
>>>   struct dev_dax *dev_dax = to_dev_dax(dev);
>>> - unsigned long long size = resource_size(&dev_dax->region->res);
>>> + unsigned long long size = range_len(&dev_dax->range);
>>>
>>>   return sprintf(buf, "%llu\n", size);
>>>  }
>>> @@ -293,19 +293,12 @@ static ssize_t target_node_show(struct device *dev,
>>>  }
>>>  static DEVICE_ATTR_RO(target_node);
>>>
>>> -static unsigned long long dev_dax_resource(struct dev_dax *dev_dax)
>>> -{
>>> - struct dax_region *dax_region = dev_dax->region;
>>> -
>>> - return dax_region->res.start;
>>> -}
>>> -
>>>  static ssize_t resource_show(struct device *dev,
>>>   struct device_attribute *attr, char *buf)
>>>  {
>>>   struct dev_dax *dev_dax = to_dev_dax(dev);
>>>
>>> - return sprintf(buf, "%#llx\n", dev_dax_resource(dev_dax));
>>> + return sprintf(buf, "%#llx\n", dev_dax->range.start);
>>>  }
>>>  static DEVICE_ATTR(resource, 0400, resource_show, NULL);
>>>
>>> @@ -376,6 +369,7 @@ static void dev_dax_release(struct device *dev)
>>>
>>>   dax_region_put(dax_region);
>>>   put_dax(dax_dev);
>>> + kfree(dev_dax->pgmap);
>>>   kfree(dev_dax);
>>>  }
>>>
>>> @@ -412,7 +406,12 @@ struct dev_dax *devm_create_dev_dax(struct 
>>> dev_dax_data *data)
>>>   if (!dev_dax)
>>>   return ERR_PTR(-ENOMEM);
>>>
>>> - memcpy(&dev_dax->pgmap, data->pgmap, sizeof(struct dev_pagemap));
>>> + if (data->pgmap) {
>>> + dev_dax->pgmap = kmemdup(data->pgmap,
>>> + sizeof(struct dev_pagemap), GFP_KERNEL);
>>> + if (!dev_dax->pgmap)
>>> + goto err_pgmap;
>>> + }
>>>
>>>   /*
>>>* No 'host' or dax_operations since there is no access to this
>>> @@ -421,18 +420,19 @@ struct dev_dax *devm_create_dev_dax(struct 
>>> dev_dax_data *data)
>>>   dax_dev = alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);
>>>   if (IS_ERR(dax_dev)) {
>>>   rc = PTR_ERR(dax_dev);
>>> - goto err;
>>> + goto err_alloc_dax;
>>>   }
>>>
>>>   /* a device_dax instance is dead while the driver is not attached */
>>>   kill_dax(dax_dev);
>>>
>>> - /* 

Re: [PATCH v5 01/17] device-dax: make pgmap optional for instance creation

2020-10-01 Thread David Hildenbrand
On 25.09.20 21:11, Dan Williams wrote:
> The passed in dev_pagemap is only required in the pmem case as the
> libnvdimm core may have reserved a vmem_altmap for dev_memremap_pages() to
> place the memmap in pmem directly.  In the hmem case there is no agent
> reserving an altmap so it can all be handled by a core internal default.
> 
> Pass the resource range via a new @range property of 'struct
> dev_dax_data'.
> 
> Link: 
> https://lkml.kernel.org/r/159643099958.4062302.10379230791041872886.st...@dwillia2-desk3.amr.corp.intel.com
> Cc: David Hildenbrand 
> Cc: Vishal Verma 
> Cc: Dave Hansen 
> Cc: Pavel Tatashin 
> Cc: Brice Goglin 
> Cc: Dave Jiang 
> Cc: David Hildenbrand 
> Cc: Ira Weiny 
> Cc: Jia He 
> Cc: Joao Martins 
> Cc: Jonathan Cameron 
> Signed-off-by: Dan Williams 
> ---
>  drivers/dax/bus.c  |   29 +++--
>  drivers/dax/bus.h  |2 ++
>  drivers/dax/dax-private.h  |9 -
>  drivers/dax/device.c   |   28 +++-
>  drivers/dax/hmem/hmem.c|8 
>  drivers/dax/kmem.c |   12 ++--
>  drivers/dax/pmem/core.c|4 
>  tools/testing/nvdimm/dax-dev.c |8 
>  8 files changed, 62 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index dffa4655e128..96bd64ba95a5 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -271,7 +271,7 @@ static ssize_t size_show(struct device *dev,
>   struct device_attribute *attr, char *buf)
>  {
>   struct dev_dax *dev_dax = to_dev_dax(dev);
> - unsigned long long size = resource_size(&dev_dax->region->res);
> + unsigned long long size = range_len(&dev_dax->range);
>  
>   return sprintf(buf, "%llu\n", size);
>  }
> @@ -293,19 +293,12 @@ static ssize_t target_node_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(target_node);
>  
> -static unsigned long long dev_dax_resource(struct dev_dax *dev_dax)
> -{
> - struct dax_region *dax_region = dev_dax->region;
> -
> - return dax_region->res.start;
> -}
> -
>  static ssize_t resource_show(struct device *dev,
>   struct device_attribute *attr, char *buf)
>  {
>   struct dev_dax *dev_dax = to_dev_dax(dev);
>  
> - return sprintf(buf, "%#llx\n", dev_dax_resource(dev_dax));
> + return sprintf(buf, "%#llx\n", dev_dax->range.start);
>  }
>  static DEVICE_ATTR(resource, 0400, resource_show, NULL);
>  
> @@ -376,6 +369,7 @@ static void dev_dax_release(struct device *dev)
>  
>   dax_region_put(dax_region);
>   put_dax(dax_dev);
> + kfree(dev_dax->pgmap);
>   kfree(dev_dax);
>  }
>  
> @@ -412,7 +406,12 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data 
> *data)
>   if (!dev_dax)
>   return ERR_PTR(-ENOMEM);
>  
> - memcpy(&dev_dax->pgmap, data->pgmap, sizeof(struct dev_pagemap));
> + if (data->pgmap) {
> + dev_dax->pgmap = kmemdup(data->pgmap,
> + sizeof(struct dev_pagemap), GFP_KERNEL);
> + if (!dev_dax->pgmap)
> + goto err_pgmap;
> + }
>  
>   /*
>* No 'host' or dax_operations since there is no access to this
> @@ -421,18 +420,19 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data 
> *data)
>   dax_dev = alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);
>   if (IS_ERR(dax_dev)) {
>   rc = PTR_ERR(dax_dev);
> - goto err;
> + goto err_alloc_dax;
>   }
>  
>   /* a device_dax instance is dead while the driver is not attached */
>   kill_dax(dax_dev);
>  
> - /* from here on we're committed to teardown via dax_dev_release() */
> + /* from here on we're committed to teardown via dev_dax_release() */
>   dev = &dev_dax->dev;
>   device_initialize(dev);
>  
>   dev_dax->dax_dev = dax_dev;
>   dev_dax->region = dax_region;
> + dev_dax->range = data->range;
>   dev_dax->target_node = dax_region->target_node;
>   kref_get(&dax_region->kref);
>  
> @@ -458,8 +458,9 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data 
> *data)
>   return ERR_PTR(rc);
>  
>   return dev_dax;
> -
> - err:
> +err_alloc_dax:
> + kfree(dev_dax->pgmap);
> +err_pgmap:
>   kfree(dev_dax);
>  
>   return ERR_PTR(rc);
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index 299c2e7fac09..4aeb36da83a4 100644
> --- 

Re: [PATCH v5 04/17] device-dax/kmem: replace release_resource() with release_mem_region()

2020-09-30 Thread David Hildenbrand
On 25.09.20 21:12, Dan Williams wrote:
> Towards removing the mode specific @dax_kmem_res attribute from the
> generic 'struct dev_dax', and preparing for multi-range support, change
> the kmem driver to use the idiomatic release_mem_region() to pair with
> the initial request_mem_region(). This also eliminates the need to open
> code the release of the resource allocated by request_mem_region().
> 
> As there are no more dax_kmem_res users, delete this struct member.
> 
> Cc: David Hildenbrand 
> Cc: Vishal Verma 
> Cc: Dave Hansen 
> Cc: Pavel Tatashin 
> Cc: Brice Goglin 
> Cc: Dave Jiang 
> Cc: David Hildenbrand 
> Cc: Ira Weiny 
> Cc: Jia He 
> Cc: Joao Martins 
> Cc: Jonathan Cameron 
> Signed-off-by: Dan Williams 
> ---
>  drivers/dax/dax-private.h |3 ---
>  drivers/dax/kmem.c|   20 +++-
>  2 files changed, 7 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index 6779f683671d..12a2dbc43b40 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -42,8 +42,6 @@ struct dax_region {
>   * @dev - device core
>   * @pgmap - pgmap for memmap setup / lifetime (driver owned)
>   * @range: resource range for the instance
> - * @dax_mem_res: physical address range of hotadded DAX memory
> - * @dax_mem_name: name for hotadded DAX memory via 
> add_memory_driver_managed()
>   */
>  struct dev_dax {
>   struct dax_region *region;
> @@ -52,7 +50,6 @@ struct dev_dax {
>   struct device dev;
>   struct dev_pagemap *pgmap;
>   struct range range;
> - struct resource *dax_kmem_res;
>  };
>  
>  static inline u64 range_len(struct range *range)
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 6fe2cb1c5f7c..e56fc688bdc5 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -33,7 +33,7 @@ int dev_dax_kmem_probe(struct device *dev)
>  {
>   struct dev_dax *dev_dax = to_dev_dax(dev);
>   struct range range = dax_kmem_range(dev_dax);
> - struct resource *new_res;
> + struct resource *res;
>   char *res_name;
>   int numa_node;
>   int rc;
> @@ -56,8 +56,8 @@ int dev_dax_kmem_probe(struct device *dev)
>   return -ENOMEM;
>  
>   /* Region is permanently reserved if hotremove fails. */
> - new_res = request_mem_region(range.start, range_len(&range), res_name);
> - if (!new_res) {
> + res = request_mem_region(range.start, range_len(&range), res_name);
> + if (!res) {
>   dev_warn(dev, "could not reserve region [%#llx-%#llx]\n", 
> range.start, range.end);
>   kfree(res_name);
>   return -EBUSY;
> @@ -69,23 +69,20 @@ int dev_dax_kmem_probe(struct device *dev)
>* inherit flags from the parent since it may set new flags
>* unknown to us that will break add_memory() below.
>*/
> - new_res->flags = IORESOURCE_SYSTEM_RAM;
> + res->flags = IORESOURCE_SYSTEM_RAM;
>  
>   /*
>* Ensure that future kexec'd kernels will not treat this as RAM
>* automatically.
>*/
> - rc = add_memory_driver_managed(numa_node, new_res->start,
> -resource_size(new_res), kmem_name);
> + rc = add_memory_driver_managed(numa_node, range.start, 
> range_len(&range), kmem_name);
>   if (rc) {
> - release_resource(new_res);
> - kfree(new_res);
> + release_mem_region(range.start, range_len(&range));
>   kfree(res_name);
>   return rc;
>   }
>  
>   dev_set_drvdata(dev, res_name);
> - dev_dax->dax_kmem_res = new_res;
>  
>   return 0;
>  }
> @@ -95,7 +92,6 @@ static int dev_dax_kmem_remove(struct device *dev)
>  {
>   struct dev_dax *dev_dax = to_dev_dax(dev);
>   struct range range = dax_kmem_range(dev_dax);
> - struct resource *res = dev_dax->dax_kmem_res;
>   const char *res_name = dev_get_drvdata(dev);
>   int rc;
>  
> @@ -114,10 +110,8 @@ static int dev_dax_kmem_remove(struct device *dev)
>   }
>  
>   /* Release and free dax resources */
> - release_resource(res);
> - kfree(res);
> + release_mem_region(range.start, range_len(&range));

Does that work? AFAIKs,

__release_region(&iomem_resource, (start), (n)) -> __release_region()

will only remove stuff that is IORESOURCE_BUSY.

Maybe storing it in drvdata is indeed the easiest way to remove it from
struct dax_region.

-- 
Thanks,

David / dhildenb


Re: [PATCH v5 03/17] device-dax/kmem: move resource name tracking to drvdata

2020-09-30 Thread David Hildenbrand
On 25.09.20 21:11, Dan Williams wrote:
> Towards removing the mode specific @dax_kmem_res attribute from the
> generic 'struct dev_dax', and preparing for multi-range support, move
> resource name tracking to driver data.  The memory for the resource name
> needs to have its own lifetime separate from the device bind lifetime
> for cases where the driver is unbound, but the kmem range could not be
> unplugged from the page allocator.
> 
> Cc: David Hildenbrand 
> Cc: Vishal Verma 
> Cc: Dave Hansen 
> Cc: Pavel Tatashin 
> Cc: Brice Goglin 
> Cc: Dave Jiang 
> Cc: David Hildenbrand 
> Cc: Ira Weiny 
> Cc: Jia He 
> Cc: Joao Martins 
> Cc: Jonathan Cameron 
> Signed-off-by: Dan Williams 
> ---
>  drivers/dax/kmem.c |   16 +---
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index b0d6a99cf12d..6fe2cb1c5f7c 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -34,7 +34,7 @@ int dev_dax_kmem_probe(struct device *dev)
>   struct dev_dax *dev_dax = to_dev_dax(dev);
>   struct range range = dax_kmem_range(dev_dax);
>   struct resource *new_res;
> - const char *new_res_name;
> + char *res_name;

I wonder why that change ...

>   int numa_node;
>   int rc;
>  
> @@ -51,15 +51,15 @@ int dev_dax_kmem_probe(struct device *dev)
>   return -EINVAL;
>   }
>  
> - new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
> - if (!new_res_name)
> + res_name = kstrdup(dev_name(dev), GFP_KERNEL);
> + if (!res_name)
>   return -ENOMEM;
>  
>   /* Region is permanently reserved if hotremove fails. */
> - new_res = request_mem_region(range.start, range_len(&range), 
> new_res_name);
> + new_res = request_mem_region(range.start, range_len(&range), res_name);
>   if (!new_res) {
>   dev_warn(dev, "could not reserve region [%#llx-%#llx]\n", 
> range.start, range.end);
> - kfree(new_res_name);
> + kfree(res_name);
>   return -EBUSY;
>   }
>  
> @@ -80,9 +80,11 @@ int dev_dax_kmem_probe(struct device *dev)
>   if (rc) {
>   release_resource(new_res);
>   kfree(new_res);
> - kfree(new_res_name);
> + kfree(res_name);
>   return rc;
>   }
> +
> + dev_set_drvdata(dev, res_name);
>   dev_dax->dax_kmem_res = new_res;
>  
>   return 0;
> @@ -94,7 +96,7 @@ static int dev_dax_kmem_remove(struct device *dev)
>   struct dev_dax *dev_dax = to_dev_dax(dev);
>   struct range range = dax_kmem_range(dev_dax);
>   struct resource *res = dev_dax->dax_kmem_res;
> - const char *res_name = res->name;
> + const char *res_name = dev_get_drvdata(dev);

... because here you're back to "const".

>   int rc;
>  
>   /*
> 

I do wonder if it wouldn't all be easier to just have some

struct {
const char *res_name;
struct resource *res;
};

allocating that and storing it as drvdata. But let's see what the next
patch brings :)

-- 
Thanks,

David / dhildenb


Re: [PATCH v5 02/17] device-dax/kmem: introduce dax_kmem_range()

2020-09-30 Thread David Hildenbrand
On 25.09.20 21:11, Dan Williams wrote:
> Towards removing the mode specific @dax_kmem_res attribute from the
> generic 'struct dev_dax', and preparing for multi-range support, teach
> the driver to calculate the hotplug range from the device range. The
> hotplug range is the trivially calculated memory-block-size aligned
> version of the device range.
> 
> Cc: David Hildenbrand 
> Cc: Vishal Verma 
> Cc: Dave Hansen 
> Cc: Pavel Tatashin 
> Cc: Brice Goglin 
> Cc: Dave Jiang 
> Cc: David Hildenbrand 
> Cc: Ira Weiny 
> Cc: Jia He 
> Cc: Joao Martins 
> Cc: Jonathan Cameron 
> Signed-off-by: Dan Williams 
> ---
>  drivers/dax/kmem.c |   40 +---
>  1 file changed, 17 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 5bb133df147d..b0d6a99cf12d 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -19,13 +19,20 @@ static const char *kmem_name;
>  /* Set if any memory will remain added when the driver will be unloaded. */
>  static bool any_hotremove_failed;
>  
> +static struct range dax_kmem_range(struct dev_dax *dev_dax)
> +{
> + struct range range;
> +
> + /* memory-block align the hotplug range */
> + range.start = ALIGN(dev_dax->range.start, memory_block_size_bytes());
> + range.end = ALIGN_DOWN(dev_dax->range.end + 1, 
> memory_block_size_bytes()) - 1;
> + return range;
> +}
> +
>  int dev_dax_kmem_probe(struct device *dev)
>  {
>   struct dev_dax *dev_dax = to_dev_dax(dev);
> - struct range *range = &dev_dax->range;
> - resource_size_t kmem_start;
> - resource_size_t kmem_size;
> - resource_size_t kmem_end;
> + struct range range = dax_kmem_range(dev_dax);
>   struct resource *new_res;
>   const char *new_res_name;
>   int numa_node;
> @@ -44,25 +51,14 @@ int dev_dax_kmem_probe(struct device *dev)
>   return -EINVAL;
>   }
>  
> - /* Hotplug starting at the beginning of the next block: */
> - kmem_start = ALIGN(range->start, memory_block_size_bytes());
> -
> - kmem_size = range_len(range);
> - /* Adjust the size down to compensate for moving up kmem_start: */
> - kmem_size -= kmem_start - range->start;
> - /* Align the size down to cover only complete blocks: */
> - kmem_size &= ~(memory_block_size_bytes() - 1);
> - kmem_end = kmem_start + kmem_size;
> -
>   new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>   if (!new_res_name)
>   return -ENOMEM;
>  
>   /* Region is permanently reserved if hotremove fails. */
> - new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
> + new_res = request_mem_region(range.start, range_len(&range), 
> new_res_name);
>   if (!new_res) {
> - dev_warn(dev, "could not reserve region [%pa-%pa]\n",
> -  &kmem_start, &kmem_end);
> + dev_warn(dev, "could not reserve region [%#llx-%#llx]\n", 
> range.start, range.end);
>   kfree(new_res_name);
>   return -EBUSY;
>   }
> @@ -96,9 +92,8 @@ int dev_dax_kmem_probe(struct device *dev)
>  static int dev_dax_kmem_remove(struct device *dev)
>  {
>   struct dev_dax *dev_dax = to_dev_dax(dev);
> + struct range range = dax_kmem_range(dev_dax);
>   struct resource *res = dev_dax->dax_kmem_res;
> - resource_size_t kmem_start = res->start;
> - resource_size_t kmem_size = resource_size(res);
>   const char *res_name = res->name;
>   int rc;
>  
> @@ -108,12 +103,11 @@ static int dev_dax_kmem_remove(struct device *dev)
>* there is no way to hotremove this memory until reboot because device
>* unbind will succeed even if we return failure.
>*/
> - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
> + rc = remove_memory(dev_dax->target_node, range.start, 
> range_len(&range));
>   if (rc) {
>   any_hotremove_failed = true;
> - dev_err(dev,
> - "DAX region %pR cannot be hotremoved until the next 
> reboot\n",
> - res);
> + dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next 
> reboot\n",
> + range.start, range.end);
>   return rc;
>   }
>  
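
Just to double-check the rounding for myself, a toy example (assuming
memory_block_size_bytes() == 128 MiB == 0x8000000; the numbers are made up):

	/*
	 *   dev_dax->range                      = [0x104000000, 0x143ffffff]
	 *   ALIGN(0x104000000, 0x8000000)       = 0x108000000  (start rounded up)
	 *   ALIGN_DOWN(0x144000000, 0x8000000)  = 0x140000000, - 1 -> 0x13fffffff
	 *
	 * i.e. the partial memory blocks at both ends are simply not hotplugged.
	 */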
> 

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb


Re: [PATCH v6 5/6] mm: secretmem: use PMD-size pages to amortize direct map fragmentation

2020-09-30 Thread David Hildenbrand
On 30.09.20 17:17, James Bottomley wrote:
> On Wed, 2020-09-30 at 16:45 +0200, David Hildenbrand wrote:
>> On 30.09.20 16:39, James Bottomley wrote:
>>> On Wed, 2020-09-30 at 13:27 +0300, Mike Rapoport wrote:
>>>> On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
>>>>> On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
>>>>>> On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra
>>>>>> wrote:
>>>>>>> It will drop them down to 4k pages. Given enough inodes,
>>>>>>> and allocating only a single sekrit page per pmd, we'll
>>>>>>> shatter the directmap into 4k.
>>>>>>
>>>>>> Why? Secretmem allocates PMD-size page per inode and uses it
>>>>>> as a pool of 4K pages for that inode. This way it ensures
>>>>>> that __kernel_map_pages() is always called on PMD boundaries.
>>>>>
>>>>> Oh, you unmap the 2m page upfront? I read it like you did the
>>>>> unmap at the sekrit page alloc, not the pool alloc side of
>>>>> things.
>>>>>
>>>>> Then yes, but then you're wasting gobs of memory. Basically you
>>>>> can pin 2M per inode while only accounting a single page.
>>>>
>>>> Right, quite like THP :)
>>>>
>>>> I considered using a global pool of 2M pages for secretmem and
>>>> handing 4K pages to each inode from that global pool. But I've
>>>> decided to waste memory in favor of simplicity.
>>>
>>> I can also add that the user space consumer of this we wrote does
>>> its user pool allocation at a 2M granularity, so nothing is
>>> actually wasted.
>>
>> ... for that specific user space consumer. (or am I missing
>> something?)
> 
> I'm not sure I understand what you mean?  It's designed to be either
> the standard wrapper or an example of how to do the standard wrapper
> for the syscall.  It uses the same allocator system glibc uses for
> malloc/free ... which pretty much everyone uses instead of calling
> sys_brk directly.  If you look at the granularity glibc uses for
> sys_brk, it's not 4k either.

Okay thanks, "the user space consumer of this we wrote" didn't sound as
generic to me as "the standard wrapper".

-- 
Thanks,

David / dhildenb


Re: [PATCH v6 5/6] mm: secretmem: use PMD-size pages to amortize direct map fragmentation

2020-09-30 Thread David Hildenbrand
On 30.09.20 16:39, James Bottomley wrote:
> On Wed, 2020-09-30 at 13:27 +0300, Mike Rapoport wrote:
>> On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
>>> On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
 On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
> It will drop them down to 4k pages. Given enough inodes, and
> allocating only a single sekrit page per pmd, we'll shatter the
> directmap into 4k.

 Why? Secretmem allocates PMD-size page per inode and uses it as a
 pool of 4K pages for that inode. This way it ensures that
 __kernel_map_pages() is always called on PMD boundaries.
>>>
>>> Oh, you unmap the 2m page upfront? I read it like you did the unmap
>>> at the sekrit page alloc, not the pool alloc side of things.
>>>
>>> Then yes, but then you're wasting gobs of memory. Basically you can
>>> pin 2M per inode while only accounting a single page.
>>
>> Right, quite like THP :)
>>
>> I considered using a global pool of 2M pages for secretmem and
>> handing 4K pages to each inode from that global pool. But I've
>> decided to waste memory in favor of simplicity.
> 
> I can also add that the user space consumer of this we wrote does its
> user pool allocation at a 2M granularity, so nothing is actually
> wasted.

... for that specific user space consumer. (or am I missing something?)

-- 
Thanks,

David / dhildenb


Re: [PATCH v6 5/6] mm: secretmem: use PMD-size pages to amortize direct map fragmentation

2020-09-25 Thread David Hildenbrand
On 25.09.20 09:41, Peter Zijlstra wrote:
> On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
>> From: Mike Rapoport 
>>
>> Removing a PAGE_SIZE page from the direct map every time such page is
>> allocated for a secret memory mapping will cause severe fragmentation of
>> the direct map. This fragmentation can be reduced by using PMD-size pages
>> as a pool for small pages for secret memory mappings.
>>
>> Add a gen_pool per secretmem inode and lazily populate this pool with
>> PMD-size pages.
> 
> What's the actual efficacy of this? Since the pmd is per inode, all I
> need is a lot of inodes and we're in business to destroy the directmap,
> no?
> 
> Afaict there's no privs needed to use this, all a process needs is to
> stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> page will utterly destroy the direct map.
> 
> I really don't like this, at all.

As I expressed earlier, I would prefer allowing allocation of secretmem
only from a previously defined CMA area. This would physically locally
limit the pain.

But my suggestion was not well received :)
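
To spell the CMA idea out a bit: carve out one dedicated area at boot
(e.g. via a kernel parameter) and let secretmem take its PMD-sized chunks
only from there, so the damage stays physically contained. A rough sketch
under my assumptions - parameter wiring and error handling omitted, and
not a tested implementation:

	#include <linux/cma.h>
	#include <linux/mm.h>

	static struct cma *secretmem_cma;	/* reserved at boot, hypothetical */

	static struct page *secretmem_alloc_pmd_chunk(void)
	{
		const unsigned int order = PMD_SHIFT - PAGE_SHIFT;

		/* all secretmem memory comes from this one area */
		return cma_alloc(secretmem_cma, 1UL << order, order, false);
	}

	static void secretmem_free_pmd_chunk(struct page *page)
	{
		const unsigned int order = PMD_SHIFT - PAGE_SHIFT;

		cma_release(secretmem_cma, page, 1UL << order);
	}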

-- 
Thanks,

David / dhildenb


Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

2020-09-25 Thread David Hildenbrand
On 24.09.20 23:50, Dan Williams wrote:
> On Thu, Sep 24, 2020 at 2:42 PM David Hildenbrand  wrote:
>>
>>
>>
>>> Am 24.09.2020 um 23:26 schrieb Dan Williams :
>>>
>>> [..]
>>>>> I'm not suggesting to busy the whole "virtio" range, just the portion
>>>>> that's about to be passed to add_memory_driver_managed().
>>>>
>>>> I'm afraid I don't get your point. For virtio-mem:
>>>>
>>>> Before:
>>>>
>>>> 1. Create virtio0 container resource
>>>>
>>>> 2. (somewhen in the future) add_memory_driver_managed()
>>>> - Create resource (System RAM (virtio_mem)), marking it busy/driver
>>>>   managed
>>>>
>>>> After:
>>>>
>>>> 1. Create virtio0 container resource
>>>>
>>>> 2. (somewhen in the future) Create resource (System RAM (virtio_mem)),
>>>>   marking it busy/driver managed
>>>> 3. add_memory_driver_managed()
>>>>
>>>> Not helpful or simpler IMHO.
>>>
>>> The concern I'm trying to address is the theoretical race window and
>>> layering violation in this sequence in the kmem driver:
>>>
>>> 1/ res = request_mem_region(...);
>>> 2/ res->flags = IORESOURCE_MEM;
>>> 3/ add_memory_driver_managed();
>>>
>>> Between 2/ and 3/ something can race and think that it owns the
>>> region. Do I think it will happen in practice, no, but it's still a
> pattern that deserves some cleanup.
>>
>> I think in that unlikely event (rather impossible), 
>> add_memory_driver_managed() should fail, detecting a conflicting (busy) 
>> resource. Not sure what will happen next ( and did not double-check).
> 
> add_memory_driver_managed() will fail, but the release_mem_region() in
> kmem to unwind on the error path will do the wrong thing because that
> other driver thinks it got ownership of the region.
> 

I think if somebody were to race and claim the region for itself (after we
cleared the BUSY flag), there would be another memory resource below
our resource container (e.g., via __request_region()).

So, interestingly, the current code will do a

release_resource->__release_resource(old, true);

which will remove whatever somebody added below the resource.

If we were to do a

remove_resource->__release_resource(old, false);

we would only remove what we temporarily added, relocating any children
(that someone nasty added).

But yeah, I don't think we have to worry about this case.
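
To spell out the difference (a simplified rendering of the two helpers
in kernel/resource.c, locking dropped; not a verbatim copy):

int release_resource(struct resource *old)
{
	/* detaches 'old' and, along with it, drops any child resources
	 * that a racing driver might have hung below it */
	return __release_resource(old, true);
}

int remove_resource(struct resource *old)
{
	/* detaches only 'old'; existing children are re-parented to
	 * old->parent, so a racing owner's sub-resource would survive */
	return __release_resource(old, false);
}

So an error unwind based on remove_resource() would be the more
conservative variant, should such a race ever materialize.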

>> But yeah, the way the BUSY bit is cleared here is wrong - simply overwriting 
>> other bits. And it would be even better if we could avoid manually messing 
>> with flags here.
> 
> I'm ok to leave it alone for now (hasn't been and likely never will be
> a problem in practice), but I think it was still worth grumbling

Definitely, it gives us a better understanding.

> about. I'll leave that part of kmem alone in the upcoming split of
> dax_kmem_res removal.

Yeah, stuff is more complicated than I would have wished, so I guess it's
better to leave it alone for now until we actually see issues with
somebody else regarding *our* device-owned region (or we're able to come
up with a cleanup that keeps all corner cases working for kmem and
virtio-mem).

-- 
Thanks,

David / dhildenb


Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

2020-09-24 Thread David Hildenbrand


> Am 24.09.2020 um 23:26 schrieb Dan Williams :
> 
> [..]
>>> I'm not suggesting to busy the whole "virtio" range, just the portion
>>> that's about to be passed to add_memory_driver_managed().
>> 
>> I'm afraid I don't get your point. For virtio-mem:
>> 
>> Before:
>> 
>> 1. Create virtio0 container resource
>> 
>> 2. (somewhen in the future) add_memory_driver_managed()
>> - Create resource (System RAM (virtio_mem)), marking it busy/driver
>>   managed
>> 
>> After:
>> 
>> 1. Create virtio0 container resource
>> 
>> 2. (somewhen in the future) Create resource (System RAM (virtio_mem)),
>>   marking it busy/driver managed
>> 3. add_memory_driver_managed()
>> 
>> Not helpful or simpler IMHO.
> 
> The concern I'm trying to address is the theoretical race window and
> layering violation in this sequence in the kmem driver:
> 
> 1/ res = request_mem_region(...);
> 2/ res->flags = IORESOURCE_MEM;
> 3/ add_memory_driver_managed();
> 
> Between 2/ and 3/ something can race and think that it owns the
> region. Do I think it will happen in practice, no, but it's still a
> pattern that deserves some cleanup.

I think in that unlikely event (rather impossible), add_memory_driver_managed() 
should fail, detecting a conflicting (busy) resource. Not sure what will happen 
next (and did not double-check).

But yeah, the way the BUSY bit is cleared here is wrong - simply overwriting 
other bits. And it would be even better if we could avoid manually messing with 
flags here.
> 


Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

2020-09-24 Thread David Hildenbrand
On 24.09.20 15:54, Dan Williams wrote:
> On Thu, Sep 24, 2020 at 12:26 AM David Hildenbrand  wrote:
>>
>> On 23.09.20 23:41, Dan Williams wrote:
>>> On Wed, Sep 23, 2020 at 1:04 AM David Hildenbrand  wrote:
>>>>
>>>> On 08.09.20 17:33, Joao Martins wrote:
>>>>> [Sorry for the late response]
>>>>>
>>>>> On 8/21/20 11:06 AM, David Hildenbrand wrote:
>>>>>> On 03.08.20 07:03, Dan Williams wrote:
>>>>>>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev)
>>>>>>>  * could be mixed in a node with faster memory, causing
>>>>>>>  * unavoidable performance issues.
>>>>>>>  */
>>>>>>> -   numa_node = dev_dax->target_node;
>>>>>>> if (numa_node < 0) {
>>>>>>> dev_warn(dev, "rejecting DAX region with invalid node: 
>>>>>>> %d\n",
>>>>>>> numa_node);
>>>>>>> return -EINVAL;
>>>>>>> }
>>>>>>>
>>>>>>> -   /* Hotplug starting at the beginning of the next block: */
>>>>>>> -   kmem_start = ALIGN(range->start, memory_block_size_bytes());
>>>>>>> -
>>>>>>> -   kmem_size = range_len(range);
>>>>>>> -   /* Adjust the size down to compensate for moving up kmem_start: */
>>>>>>> -   kmem_size -= kmem_start - range->start;
>>>>>>> -   /* Align the size down to cover only complete blocks: */
>>>>>>> -   kmem_size &= ~(memory_block_size_bytes() - 1);
>>>>>>> -   kmem_end = kmem_start + kmem_size;
>>>>>>> -
>>>>>>> -   new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>>>>>>> -   if (!new_res_name)
>>>>>>> +   res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>>>>>>> +   if (!res_name)
>>>>>>> return -ENOMEM;
>>>>>>>
>>>>>>> -   /* Region is permanently reserved if hotremove fails. */
>>>>>>> -   new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
>>>>>>> -   if (!new_res) {
>>>>>>> -   dev_warn(dev, "could not reserve region [%pa-%pa]\n",
>>>>>>> -&kmem_start, &kmem_end);
>>>>>>> -   kfree(new_res_name);
>>>>>>> +   res = request_mem_region(range.start, range_len(&range), res_name);
>>>>>>
>>>>>> I think our range could be empty after aligning. I assume
>>>>>> request_mem_region() would check that, but maybe we could report a
>>>>>> better error/warning in that case.
>>>>>>
>>>>> dax_kmem_range() already returns a memory-block-aligned @range but
>>>>> IIUC request_mem_region() isn't checking for that. Having said that
>>>>> the returned @res wouldn't be different from the passed range.start.
>>>>>
>>>>>>> /*
>>>>>>>  * Ensure that future kexec'd kernels will not treat this as RAM
>>>>>>>  * automatically.
>>>>>>>  */
>>>>>>> -   rc = add_memory_driver_managed(numa_node, new_res->start,
>>>>>>> -  resource_size(new_res), kmem_name);
>>>>>>> +   rc = add_memory_driver_managed(numa_node, res->start,
>>>>>>> +  resource_size(res), kmem_name);
>>>>>>> +
>>>>>>> +   res->flags |= IORESOURCE_BUSY;
>>>>>>
>>>>>> Hm, I don't think that's correct. Any specific reason why to mark the
>>>>>> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could
>>>>>> suddenly stumble over it - and e.g., similarly kexec code when trying to
>>>>>> find memory for placing kexec images. I think we should leave this
>>>>>> !BUSY, just as it is right now.
>>>>>>
>>>>> Agreed.
>>>>>
>>>>>>> if (rc) {
>>>>>>> -   release_resource(new_res);
>>>>>>> -   kfree(new_res);
>>>>>>> -   kfree(new_res_name)
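
For concreteness, the "our range could be empty after aligning" concern
discussed above boils down to arithmetic like this (made-up numbers and
a hypothetical 128 MiB memory block size; this is not code from any of
the patches):

	resource_size_t block = memory_block_size_bytes();	 /* 128 MiB */
	resource_size_t start = 0x24000000;			 /* 576 MiB */
	resource_size_t len   = 0x06000000;			 /*  96 MiB */

	resource_size_t kmem_start = ALIGN(start, block);	 /* 640 MiB */
	resource_size_t kmem_size  = len - (kmem_start - start);/*  32 MiB */

	kmem_size &= ~(block - 1);				 /* 0: empty */

A region smaller than one memory block can thus end up with nothing left
to hotplug, which is why a dedicated check or better warning was being
discussed.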

Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

2020-09-24 Thread David Hildenbrand
On 23.09.20 23:41, Dan Williams wrote:
> On Wed, Sep 23, 2020 at 1:04 AM David Hildenbrand  wrote:
>>
>> On 08.09.20 17:33, Joao Martins wrote:
>>> [Sorry for the late response]
>>>
>>> On 8/21/20 11:06 AM, David Hildenbrand wrote:
>>>> On 03.08.20 07:03, Dan Williams wrote:
>>>>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev)
>>>>>  * could be mixed in a node with faster memory, causing
>>>>>  * unavoidable performance issues.
>>>>>  */
>>>>> -   numa_node = dev_dax->target_node;
>>>>> if (numa_node < 0) {
>>>>> dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
>>>>> numa_node);
>>>>> return -EINVAL;
>>>>> }
>>>>>
>>>>> -   /* Hotplug starting at the beginning of the next block: */
>>>>> -   kmem_start = ALIGN(range->start, memory_block_size_bytes());
>>>>> -
>>>>> -   kmem_size = range_len(range);
>>>>> -   /* Adjust the size down to compensate for moving up kmem_start: */
>>>>> -   kmem_size -= kmem_start - range->start;
>>>>> -   /* Align the size down to cover only complete blocks: */
>>>>> -   kmem_size &= ~(memory_block_size_bytes() - 1);
>>>>> -   kmem_end = kmem_start + kmem_size;
>>>>> -
>>>>> -   new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>>>>> -   if (!new_res_name)
>>>>> +   res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>>>>> +   if (!res_name)
>>>>> return -ENOMEM;
>>>>>
>>>>> -   /* Region is permanently reserved if hotremove fails. */
>>>>> -   new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
>>>>> -   if (!new_res) {
>>>>> -   dev_warn(dev, "could not reserve region [%pa-%pa]\n",
>>>>> -&kmem_start, &kmem_end);
>>>>> -   kfree(new_res_name);
>>>>> +   res = request_mem_region(range.start, range_len(&range), res_name);
>>>>
>>>> I think our range could be empty after aligning. I assume
>>>> request_mem_region() would check that, but maybe we could report a
>>>> better error/warning in that case.
>>>>
>>> dax_kmem_range() already returns a memory-block-aligned @range but
>>> IIUC request_mem_region() isn't checking for that. Having said that
>>> the returned @res wouldn't be different from the passed range.start.
>>>
>>>>> /*
>>>>>  * Ensure that future kexec'd kernels will not treat this as RAM
>>>>>  * automatically.
>>>>>  */
>>>>> -   rc = add_memory_driver_managed(numa_node, new_res->start,
>>>>> -  resource_size(new_res), kmem_name);
>>>>> +   rc = add_memory_driver_managed(numa_node, res->start,
>>>>> +  resource_size(res), kmem_name);
>>>>> +
>>>>> +   res->flags |= IORESOURCE_BUSY;
>>>>
>>>> Hm, I don't think that's correct. Any specific reason why to mark the
>>>> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could
>>>> suddenly stumble over it - and e.g., similarly kexec code when trying to
>>>> find memory for placing kexec images. I think we should leave this
>>>> !BUSY, just as it is right now.
>>>>
>>> Agreed.
>>>
>>>>> if (rc) {
>>>>> -   release_resource(new_res);
>>>>> -   kfree(new_res);
>>>>> -   kfree(new_res_name);
>>>>> +   release_mem_region(range.start, range_len(&range));
>>>>> +   kfree(res_name);
>>>>> return rc;
>>>>> }
>>>>> -   dev_dax->dax_kmem_res = new_res;
>>>>> +
>>>>> +   dev_set_drvdata(dev, res_name);
>>>>>
>>>>> return 0;
>>>>>  }
>>>>>
>>>>>  #ifdef CONFIG_MEMORY_HOTREMOVE
>>>>> -static int dev_dax_kmem_remove(struct device *dev)
>>>>> +static void dax_kmem_release(struct dev_dax *dev_dax)
>>>>>  {
>>>>> -   struct dev_dax *dev_dax

Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

2020-09-23 Thread David Hildenbrand
On 08.09.20 17:33, Joao Martins wrote:
> [Sorry for the late response]
> 
> On 8/21/20 11:06 AM, David Hildenbrand wrote:
>> On 03.08.20 07:03, Dan Williams wrote:
>>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev)
>>>  * could be mixed in a node with faster memory, causing
>>>  * unavoidable performance issues.
>>>  */
>>> -   numa_node = dev_dax->target_node;
>>> if (numa_node < 0) {
>>> dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
>>> numa_node);
>>> return -EINVAL;
>>> }
>>>  
>>> -   /* Hotplug starting at the beginning of the next block: */
>>> -   kmem_start = ALIGN(range->start, memory_block_size_bytes());
>>> -
>>> -   kmem_size = range_len(range);
>>> -   /* Adjust the size down to compensate for moving up kmem_start: */
>>> -   kmem_size -= kmem_start - range->start;
>>> -   /* Align the size down to cover only complete blocks: */
>>> -   kmem_size &= ~(memory_block_size_bytes() - 1);
>>> -   kmem_end = kmem_start + kmem_size;
>>> -
>>> -   new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>>> -   if (!new_res_name)
>>> +   res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>>> +   if (!res_name)
>>> return -ENOMEM;
>>>  
>>> -   /* Region is permanently reserved if hotremove fails. */
>>> -   new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
>>> -   if (!new_res) {
>>> -   dev_warn(dev, "could not reserve region [%pa-%pa]\n",
>>> -&kmem_start, &kmem_end);
>>> -   kfree(new_res_name);
>>> +   res = request_mem_region(range.start, range_len(&range), res_name);
>>
>> I think our range could be empty after aligning. I assume
>> request_mem_region() would check that, but maybe we could report a
>> better error/warning in that case.
>>
> dax_kmem_range() already returns a memory-block-aligned @range but
> IIUC request_mem_region() isn't checking for that. Having said that
> the returned @res wouldn't be different from the passed range.start.
> 
>>> /*
>>>  * Ensure that future kexec'd kernels will not treat this as RAM
>>>  * automatically.
>>>  */
>>> -   rc = add_memory_driver_managed(numa_node, new_res->start,
>>> -  resource_size(new_res), kmem_name);
>>> +   rc = add_memory_driver_managed(numa_node, res->start,
>>> +  resource_size(res), kmem_name);
>>> +
>>> +   res->flags |= IORESOURCE_BUSY;
>>
>> Hm, I don't think that's correct. Any specific reason why to mark the
>> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could
>> suddenly stumble over it - and e.g., similarly kexec code when trying to
>> find memory for placing kexec images. I think we should leave this
>> !BUSY, just as it is right now.
>>
> Agreed.
> 
>>> if (rc) {
>>> -   release_resource(new_res);
>>> -   kfree(new_res);
>>> -   kfree(new_res_name);
>>> +   release_mem_region(range.start, range_len(&range));
>>> +   kfree(res_name);
>>> return rc;
>>> }
>>> -   dev_dax->dax_kmem_res = new_res;
>>> +
>>> +   dev_set_drvdata(dev, res_name);
>>>  
>>> return 0;
>>>  }
>>>  
>>>  #ifdef CONFIG_MEMORY_HOTREMOVE
>>> -static int dev_dax_kmem_remove(struct device *dev)
>>> +static void dax_kmem_release(struct dev_dax *dev_dax)
>>>  {
>>> -   struct dev_dax *dev_dax = to_dev_dax(dev);
>>> -   struct resource *res = dev_dax->dax_kmem_res;
>>> -   resource_size_t kmem_start = res->start;
>>> -   resource_size_t kmem_size = resource_size(res);
>>> -   const char *res_name = res->name;
>>> int rc;
>>> +   struct device *dev = &dev_dax->dev;
>>> +   const char *res_name = dev_get_drvdata(dev);
>>> +   struct range range = dax_kmem_range(dev_dax);
>>>  
>>> /*
>>>  * We have one shot for removing memory, if some memory blocks were not
>>>  * offline prior to calling this function remove_memory() will fail, and
>>>  * there is no way to hotremove this memory until reboot because device
>>> -* unbind w

Re: [PATCH] kernel/resource: Fix use of ternary condition in release_mem_region_adjustable

2020-09-21 Thread David Hildenbrand
On 22.09.20 08:07, Nathan Chancellor wrote:
> Clang warns:
> 
> kernel/resource.c:1281:53: warning: operator '?:' has lower precedence
> than '|'; '|' will be evaluated first
> [-Wbitwise-conditional-parentheses]
> new_res = alloc_resource(GFP_KERNEL | alloc_nofail ? __GFP_NOFAIL : 0);
>  ~ ^
> kernel/resource.c:1281:53: note: place parentheses around the '|'
> expression to silence this warning
> new_res = alloc_resource(GFP_KERNEL | alloc_nofail ? __GFP_NOFAIL : 0);
>  ~ ^
> kernel/resource.c:1281:53: note: place parentheses around the '?:'
> expression to evaluate it first
> new_res = alloc_resource(GFP_KERNEL | alloc_nofail ? __GFP_NOFAIL : 0);
>^
>   (  )
> 1 warning generated.
> 
> Add the parentheses as it was clearly intended for the ternary condition
> to be evaluated first.
> 
> Fixes: 5fd23bd0d739 ("kernel/resource: make release_mem_region_adjustable() never fail")
> Link: https://github.com/ClangBuiltLinux/linux/issues/1159
> Signed-off-by: Nathan Chancellor 
> ---
> 
> Presumably, this will be squashed but I included a fixes tag
> nonetheless. Apologies if this has already been noticed and fixed
> already, I did not find anything on LKML.

Hasn't been noticed before (I guess most people build with GCC, which
does not warn in this instance, at least for me) thanks!
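
The underlying C precedence rule is easy to demonstrate in isolation
(toy userspace program; the flag values are placeholders, not the real
GFP constants):

#include <stdio.h>

#define MY_GFP_KERNEL	0x0cc0u
#define MY_GFP_NOFAIL	0x0100u

int main(void)
{
	unsigned int alloc_nofail = 0;

	/* '|' binds tighter than '?:', so this evaluates as
	 * (MY_GFP_KERNEL | alloc_nofail) ? MY_GFP_NOFAIL : 0
	 * and the GFP_KERNEL bits are silently lost */
	unsigned int wrong = MY_GFP_KERNEL | alloc_nofail ? MY_GFP_NOFAIL : 0;

	/* the intended meaning */
	unsigned int right = MY_GFP_KERNEL | (alloc_nofail ? MY_GFP_NOFAIL : 0);

	printf("wrong=%#x right=%#x\n", wrong, right);	/* 0x100 vs 0xcc0 */
	return 0;
}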

Commit ids are not stable yet, so Andrew will most probably squash it.

Reviewed-by: David Hildenbrand 

> 
>  kernel/resource.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/resource.c b/kernel/resource.c
> index ca2a666e4317..3ae2f56cc79d 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1278,7 +1278,7 @@ void release_mem_region_adjustable(resource_size_t start, resource_size_t size)
>* similarly).
>*/
>  retry:
> - new_res = alloc_resource(GFP_KERNEL | alloc_nofail ? __GFP_NOFAIL : 0);
> + new_res = alloc_resource(GFP_KERNEL | (alloc_nofail ? __GFP_NOFAIL : 0));
>  
>   p = &parent->child;
>   write_lock(&resource_lock);
> 
> base-commit: 40ee82f47bf297e31d0c47547cd8f24ede52415a
> 


-- 
Thanks,

David / dhildenb


Re: [PATCH] kernel/resource: make iomem_resource implicit in release_mem_region_adjustable()

2020-09-16 Thread David Hildenbrand
On 16.09.20 14:10, Wei Yang wrote:
> On Wed, Sep 16, 2020 at 12:03:20PM +0200, David Hildenbrand wrote:
>> On 16.09.20 12:02, Wei Yang wrote:
>>> On Wed, Sep 16, 2020 at 09:30:41AM +0200, David Hildenbrand wrote:
>>>> "mem" in the name already indicates the root, similar to
>>>> release_mem_region() and devm_request_mem_region(). Make it implicit.
>>>> The only single caller always passes iomem_resource, other parents are
>>>> not applicable.
>>>>
>>>
>>> Looks good to me.
>>>
>>> Reviewed-by: Wei Yang 
>>>
>>
>> Thanks for the review!
>>
> 
> Would you send another version? I didn't take a look into the following
> patches, since the 4th is missing.

Not planning to send another one as long as there are no further
comments. Seems to be an issue on your side because all patches arrived
on linux-mm (see
https://lore.kernel.org/linux-mm/20200911103459.10306-1-da...@redhat.com/)

You can find patch #4 at
https://lore.kernel.org/linux-mm/20200911103459.10306-5-da...@redhat.com/

(which has CC "Wei Yang ")

-- 
Thanks,

David / dhildenb


Re: [PATCH] kernel/resource: make iomem_resource implicit in release_mem_region_adjustable()

2020-09-16 Thread David Hildenbrand
On 16.09.20 12:02, Wei Yang wrote:
> On Wed, Sep 16, 2020 at 09:30:41AM +0200, David Hildenbrand wrote:
>> "mem" in the name already indicates the root, similar to
>> release_mem_region() and devm_request_mem_region(). Make it implicit.
>> The only single caller always passes iomem_resource, other parents are
>> not applicable.
>>
> 
> Looks good to me.
> 
> Reviewed-by: Wei Yang 
>

Thanks for the review!

-- 
Thanks,

David / dhildenb


[PATCH] kernel/resource: make iomem_resource implicit in release_mem_region_adjustable()

2020-09-16 Thread David Hildenbrand
"mem" in the name already indicates the root, similar to
release_mem_region() and devm_request_mem_region(). Make it implicit.
The only single caller always passes iomem_resource, other parents are
not applicable.

Suggested-by: Wei Yang 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Jason Gunthorpe 
Cc: Kees Cook 
Cc: Ard Biesheuvel 
Cc: Pankaj Gupta 
Cc: Baoquan He 
Cc: Wei Yang 
Signed-off-by: David Hildenbrand 
---

Based on next-20200915. Follow up on
"[PATCH v4 0/8] selective merging of system ram resources" [1]
That's in next-20200915. As noted during review of v2 by Wei [2].

[1] https://lkml.kernel.org/r/20200911103459.10306-1-da...@redhat.com
[2] https://lkml.kernel.org/r/20200915021012.GC2007@L-31X9LVDL-1304.local

---
 include/linux/ioport.h | 3 +--
 kernel/resource.c  | 5 ++---
 mm/memory_hotplug.c| 2 +-
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 7e61389dcb01..5135d4b86cd6 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -251,8 +251,7 @@ extern struct resource * __request_region(struct resource *,
 extern void __release_region(struct resource *, resource_size_t,
resource_size_t);
 #ifdef CONFIG_MEMORY_HOTREMOVE
-extern void release_mem_region_adjustable(struct resource *, resource_size_t,
- resource_size_t);
+extern void release_mem_region_adjustable(resource_size_t, resource_size_t);
 #endif
 #ifdef CONFIG_MEMORY_HOTPLUG
 extern void merge_system_ram_resource(struct resource *res);
diff --git a/kernel/resource.c b/kernel/resource.c
index 7a91b935f4c2..ca2a666e4317 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1240,7 +1240,6 @@ EXPORT_SYMBOL(__release_region);
 #ifdef CONFIG_MEMORY_HOTREMOVE
 /**
  * release_mem_region_adjustable - release a previously reserved memory region
- * @parent: parent resource descriptor
  * @start: resource start address
  * @size: resource region size
  *
@@ -1258,9 +1257,9 @@ EXPORT_SYMBOL(__release_region);
  *   assumes that all children remain in the lower address entry for
  *   simplicity.  Enhance this logic when necessary.
  */
-void release_mem_region_adjustable(struct resource *parent,
-  resource_size_t start, resource_size_t size)
+void release_mem_region_adjustable(resource_size_t start, resource_size_t size)
 {
+   struct resource *parent = &iomem_resource;
struct resource *new_res = NULL;
bool alloc_nofail = false;
struct resource **p;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 553c718226b3..7c5e4744ac51 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1764,7 +1764,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
memblock_remove(start, size);
}
 
-   release_mem_region_adjustable(&iomem_resource, start, size);
+   release_mem_region_adjustable(start, size);
 
try_offline_node(nid);
 
-- 
2.26.2


Re: [PATCH v2 1/7] kernel/resource: make release_mem_region_adjustable() never fail

2020-09-15 Thread David Hildenbrand
On 15.09.20 11:06, Wei Yang wrote:
> On Tue, Sep 15, 2020 at 09:35:30AM +0200, David Hildenbrand wrote:
>>
>>>> static int __ref try_remove_memory(int nid, u64 start, u64 size)
>>>> {
>>>>int rc = 0;
>>>> @@ -1777,7 +1757,7 @@ static int __ref try_remove_memory(int nid, u64 
>>>> start, u64 size)
>>>>memblock_remove(start, size);
>>>>}
>>>>
>>>> -  __release_memory_resource(start, size);
>>>> +  release_mem_region_adjustable(&iomem_resource, start, size);
>>>>
>>>
>>> Seems the only user of release_mem_region_adjustable() is here, can we move
>>> iomem_resource into the function body? Actually, we don't iterate the 
>>> resource
>>> tree from any level. We always start from the root.
>>
>> You mean, making iomem_resource implicit? I can spot that something
>> similar was done for
>>
>> #define devm_release_mem_region(dev, start, n) \
>>  __devm_release_region(dev, &iomem_resource, (start), (n))
>>
> 
> What I prefer is to remove iomem_resource from the parameter list. Just use it in
> the function body.
> 
> For the example you listed, __release_region() would have varies of *parent*,
> which looks reasonable to keep it here.

Yeah I got that ("making iomem_resource implicit"), as I said:

>> I'll send an addon patch for that, ok? - thanks.

-- 
Thanks,

David / dhildenb


Re: [PATCH v2 2/7] kernel/resource: move and rename IORESOURCE_MEM_DRIVER_MANAGED

2020-09-15 Thread David Hildenbrand
On 15.09.20 04:20, Wei Yang wrote:
> On Tue, Sep 08, 2020 at 10:10:07PM +0200, David Hildenbrand wrote:
>> IORESOURCE_MEM_DRIVER_MANAGED currently uses an unused PnP bit, which is
>> always set to 0 by hardware. This is far from beautiful (and confusing),
>> and the bit only applies to SYSRAM. So let's move it out of the
>> bus-specific (PnP) defined bits.
>>
>> We'll add another SYSRAM specific bit soon. If we ever need more bits for
>> other purposes, we can steal some from "desc", or reshuffle/regroup what we
>> have.
> 
> I think you make this definition because we use IORESOURCE_SYSRAM_RAM for
> hotplugged memory? So we make them all in the IORESOURCE_SYSRAM_XXX family?

Yeah, to specify based on the extended MEM type SYSRAM. Because it
really only applies to that.

-- 
Thanks,

David / dhildenb


Re: [PATCH v2 1/7] kernel/resource: make release_mem_region_adjustable() never fail

2020-09-15 Thread David Hildenbrand


>> static int __ref try_remove_memory(int nid, u64 start, u64 size)
>> {
>>  int rc = 0;
>> @@ -1777,7 +1757,7 @@ static int __ref try_remove_memory(int nid, u64 start, 
>> u64 size)
>>  memblock_remove(start, size);
>>  }
>>
>> -__release_memory_resource(start, size);
>> +release_mem_region_adjustable(&iomem_resource, start, size);
>>
> 
> Seems the only user of release_mem_region_adjustable() is here, can we move
> iomem_resource into the function body? Actually, we don't iterate the resource
> tree from any level. We always start from the root.

You mean, making iomem_resource implicit? I can spot that something
similar was done for

#define devm_release_mem_region(dev, start, n) \
__devm_release_region(dev, &iomem_resource, (start), (n))

I'll send an addon patch for that, ok? - thanks.

-- 
Thanks,

David / dhildenb


Re: [PATCH v4 5/8] mm/memory_hotplug: MEMHP_MERGE_RESOURCE to specify merging of System RAM resources

2020-09-15 Thread David Hildenbrand
On 15.09.20 04:43, Wei Yang wrote:
> On Fri, Sep 11, 2020 at 12:34:56PM +0200, David Hildenbrand wrote:
>> Some add_memory*() users add memory in small, contiguous memory blocks.
>> Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
>>
>> This can quickly result in a lot of memory resources, whereby the actual
>> resource boundaries are not of interest (e.g., it might be relevant for
>> DIMMs, exposed via /proc/iomem to user space). We really want to merge
>> added resources in this scenario where possible.
>>
>> Let's provide a flag (MEMHP_MERGE_RESOURCE) to specify that a resource
>> either created within add_memory*() or passed via add_memory_resource()
>> shall be marked mergeable and merged with applicable siblings.
>>
>> To implement that, we need a kernel/resource interface to mark selected
>> System RAM resources mergeable (IORESOURCE_SYSRAM_MERGEABLE) and trigger
>> merging.
>>
>> Note: We really want to merge after the whole operation succeeded, not
>> directly when adding a resource to the resource tree (it would break
>> add_memory_resource() and require splitting resources again when the
>> operation failed - e.g., due to -ENOMEM).
> 
> Oops, the latest version is here.

Yeah, sorry, I dropped the "mm" prefix on the subject of the cover
letter by mistake.

> 
> BTW, I don't see patch 4. Not sure it is junked by my mail system?

At least you're in the CC list below with your old mail address (I
assume you monitor that).

I'll try to use your new address in the future.


-- 
Thanks,

David / dhildenb


[PATCH v4 8/8] hv_balloon: try to merge system ram resources

2020-09-11 Thread David Hildenbrand
Let's try to merge system ram resources we add, to minimize the number
of resources in /proc/iomem. We don't care about the boundaries of
individual chunks we added.

Reviewed-by: Wei Liu 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Wei Liu 
Cc: Pankaj Gupta 
Cc: Baoquan He 
Cc: Wei Yang 
Signed-off-by: David Hildenbrand 
---
 drivers/hv/hv_balloon.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 3c0d52e244520..b64d2efbefe71 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -726,7 +726,7 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 
nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
ret = add_memory(nid, PFN_PHYS((start_pfn)),
-   (HA_CHUNK << PAGE_SHIFT), MHP_NONE);
+   (HA_CHUNK << PAGE_SHIFT), MEMHP_MERGE_RESOURCE);
 
if (ret) {
pr_err("hot_add memory failed error is %d\n", ret);
-- 
2.26.2


[PATCH v4 7/8] xen/balloon: try to merge system ram resources

2020-09-11 Thread David Hildenbrand
Let's try to merge system ram resources we add, to minimize the number
of resources in /proc/iomem. We don't care about the boundaries of
individual chunks we added.

Reviewed-by: Juergen Gross 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Roger Pau Monné 
Cc: Julien Grall 
Cc: Pankaj Gupta 
Cc: Baoquan He 
Cc: Wei Yang 
Signed-off-by: David Hildenbrand 
---
 drivers/xen/balloon.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 9f40a294d398d..b57b2067ecbfb 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -331,7 +331,7 @@ static enum bp_state reserve_additional_memory(void)
mutex_unlock(&balloon_mutex);
/* add_memory_resource() requires the device_hotplug lock */
lock_device_hotplug();
-   rc = add_memory_resource(nid, resource, MHP_NONE);
+   rc = add_memory_resource(nid, resource, MEMHP_MERGE_RESOURCE);
unlock_device_hotplug();
mutex_lock(&balloon_mutex);
 
-- 
2.26.2


[PATCH v4 4/8] mm/memory_hotplug: prepare passing flags to add_memory() and friends

2020-09-11 Thread David Hildenbrand
We soon want to pass flags, e.g., to mark added System RAM resources
mergeable. Prepare for that.

This patch is based on a similar patch by Oscar Salvador:

https://lkml.kernel.org/r/20190625075227.15193-3-osalva...@suse.de

Acked-by: Wei Liu 
Reviewed-by: Juergen Gross  # Xen related part
Reviewed-by: Pankaj Gupta 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Jason Gunthorpe 
Cc: Pankaj Gupta 
Cc: Baoquan He 
Cc: Wei Yang 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Wei Liu 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: David Hildenbrand 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: "Oliver O'Halloran" 
Cc: Pingfan Liu 
Cc: Nathan Lynch 
Cc: Libor Pechacek 
Cc: Anton Blanchard 
Cc: Leonardo Bras 
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-a...@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/platforms/powernv/memtrace.c   |  2 +-
 arch/powerpc/platforms/pseries/hotplug-memory.c |  2 +-
 drivers/acpi/acpi_memhotplug.c  |  3 ++-
 drivers/base/memory.c   |  3 ++-
 drivers/dax/kmem.c  |  2 +-
 drivers/hv/hv_balloon.c |  2 +-
 drivers/s390/char/sclp_cmd.c|  2 +-
 drivers/virtio/virtio_mem.c |  2 +-
 drivers/xen/balloon.c   |  2 +-
 include/linux/memory_hotplug.h  | 16 
 mm/memory_hotplug.c | 14 +++---
 11 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 13b369d2cc454..6828108486f83 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -224,7 +224,7 @@ static int memtrace_online(void)
ent->mem = 0;
}
 
-   if (add_memory(ent->nid, ent->start, ent->size)) {
+   if (add_memory(ent->nid, ent->start, ent->size, MHP_NONE)) {
pr_err("Failed to add trace memory to node %d\n",
ent->nid);
ret += 1;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 0ea976d1cac47..e1c9fa0d730f5 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -615,7 +615,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
nid = memory_add_physaddr_to_nid(lmb->base_addr);
 
/* Add the memory */
-   rc = __add_memory(nid, lmb->base_addr, block_sz);
+   rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE);
if (rc) {
invalidate_lmb_associativity_index(lmb);
return rc;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index e294f44a78504..2067c3bc55763 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -207,7 +207,8 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
 
-   result = __add_memory(node, info->start_addr, info->length);
+   result = __add_memory(node, info->start_addr, info->length,
+ MHP_NONE);
 
/*
 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 4db3c660de831..b4c297dd04755 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -432,7 +432,8 @@ static ssize_t probe_store(struct device *dev, struct device_attribute *attr,
 
nid = memory_add_physaddr_to_nid(phys_addr);
ret = __add_memory(nid, phys_addr,
-  MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+  MIN_MEMORY_BLOCK_SIZE * sections_per_block,
+  MHP_NONE);
 
if (ret)
goto out;
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 7dcb2902e9b1b..896cb9444e727 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -95,7 +95,7 @@ int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 * this as RAM automatically.
 */
rc = add_
