Re: [PATCH RFC 0/4] static key support for error injection functions

2024-06-02 Thread David Rientjes
On Fri, 31 May 2024, Vlastimil Babka wrote:

> Patches 3 and 4 implement the static keys for the two mm fault injection
> sites in slab and page allocators. For a quick demonstration I've run a
> VM and the simple test from [1] that stresses the slab allocator and got
> this time before the series:
> 
> real    0m8.349s
> user    0m0.694s
> sys     0m7.648s
> 
> with perf showing
> 
>0.61%  nonexistent  [kernel.kallsyms]  [k] should_failslab.constprop.0
>0.00%  nonexistent  [kernel.kallsyms]  [k] should_fail_alloc_page
> 
> And after the series
> 
> real    0m7.924s
> user    0m0.727s
> sys     0m7.191s
> 
> and the functions gone from perf report.
> 

Impressive results that will no doubt be a win for kernels that enable 
these options.

Both CONFIG_FAILSLAB and CONFIG_FAIL_PAGE_ALLOC go out of their way to 
have no overhead, both in performance and kernel text overhead, when the 
.config options are disabled.

Do we have any insight into the distros or users that enable either of 
these options and are expecting optimal performance?  I would have assumed 
that, while CONFIG_FAULT_INJECTION may be enabled, any users who care 
deeply about this would have disabled both of these debug options.
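
For reference, the general shape of the optimization under discussion is a
static key guarding the fault injection hook. A minimal sketch follows; the
key and helper names are illustrative, not the series' actual symbols:

/*
 * Minimal sketch of the static-key pattern under discussion; the key and
 * helper names are illustrative, not the actual symbols from the series.
 */
#include <linux/jump_label.h>

struct kmem_cache;

/* Out-of-line slow path, only reached once fault injection is armed. */
bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags);

static DEFINE_STATIC_KEY_FALSE(failslab_active);

static inline bool should_failslab_fast(struct kmem_cache *s, gfp_t gfpflags)
{
	/*
	 * With the key disabled this compiles down to a patched no-op,
	 * so the hot allocation path avoids the function call entirely.
	 */
	if (static_branch_unlikely(&failslab_active))
		return __should_failslab(s, gfpflags);
	return false;
}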

> There might be other such fault injection callsites in hotpaths of other
> subsystems but I didn't search for them at this point.
> 
> [1] https://lore.kernel.org/all/6d5bb852-8703-4abf-a52b-90816bccb...@suse.cz/
> [2] https://lore.kernel.org/all/3j5d3p22ssv7xoaghzraa7crcfih3h2qqjlhmjppbp6f42pg2t@kg7qoicog5ye/
> 
> Signed-off-by: Vlastimil Babka 
> ---
> Vlastimil Babka (4):
>   fault-inject: add support for static keys around fault injection sites
>   error-injection: support static keys around injectable functions
>   mm, slab: add static key for should_failslab()
>   mm, page_alloc: add static key for should_fail_alloc_page()
> 
>  include/asm-generic/error-injection.h | 13 ++-
>  include/asm-generic/vmlinux.lds.h |  2 +-
>  include/linux/error-injection.h   |  9 +---
>  include/linux/fault-inject.h  |  7 +-
>  kernel/fail_function.c| 22 +++---
>  lib/error-inject.c|  6 -
>  lib/fault-inject.c| 43 ++-
>  mm/fail_page_alloc.c  |  3 ++-
>  mm/failslab.c |  2 +-
>  mm/internal.h |  2 ++
>  mm/page_alloc.c   | 11 ++---
>  mm/slab.h |  3 +++
>  mm/slub.c | 10 +---
>  13 files changed, 114 insertions(+), 19 deletions(-)
> ---
> base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
> change-id: 20240530-fault-injection-statickeys-66b7222e91b7
> 
> Best regards,
> -- 
> Vlastimil Babka 
> 
> 

Re: [PATCH v2] vhost-vdpa: account iommu allocations

2023-12-26 Thread David Rientjes
On Tue, 26 Dec 2023, Pasha Tatashin wrote:

> iommu allocations should be accounted in order to allow admins to
> monitor and limit the amount of iommu memory.
> 
> Signed-off-by: Pasha Tatashin 
> Acked-by: Michael S. Tsirkin 

Acked-by: David Rientjes 



Re: [PATCH v2] mm/compaction:let proactive compaction order configurable

2021-04-16 Thread David Rientjes
On Sat, 17 Apr 2021, chukaiping wrote:

> Currently the proactive compaction order is fixed to
> COMPACTION_HPAGE_ORDER(9), it's OK in most machines with lots of
> normal 4KB memory, but it's too high for the machines with small
> normal memory, for example the machines with most memory configured
> as 1GB hugetlbfs huge pages. In these machines the max order of
> free pages is often below 9, and it's always below 9 even with hard
> compaction. This will lead to proactive compaction being triggered very
> frequently. In these machines we only care about orders of 3 or 4.
> This patch exports the order to proc and lets it be configured by the
> user; the default value is still COMPACTION_HPAGE_ORDER.
> 

Still not entirely clear on the use case beyond hugepages.  In your 
response from v1, you indicated you were not concerned with allocation 
latency of hugepages but rather had a thundering herd problem where once 
fragmentation got bad, many threads started compacting all at once.

I'm not sure that tuning the proactive compaction order is the right 
solution.  I think the proactive compaction order is more about starting 
compaction when a known order of interest (like a hugepage) is fully 
depleted and we want a page of that order so the idea is to start 
recovering from that situation.

Is this not a userspace policy decision?  I'm wondering if it would simply 
be better to manually invoke compaction periodically or when the 
fragmentation ratio has reached a certain level.  You can manually invoke 
compaction yourself through sysfs for each node on the system.
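
For illustration, manually compacting a single node from userspace only
takes a write to its per-node sysfs trigger. A hedged sketch, assuming the
standard /sys/devices/system/node/nodeN/compact file and minimal error
handling:

/* Userspace sketch: trigger compaction of one NUMA node by writing to
 * its sysfs "compact" file, as suggested above. */
#include <stdio.h>

static int compact_node(int nid)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/compact", nid);
	f = fopen(path, "w");
	if (!f)
		return -1;		/* needs root; node may not exist */
	fputs("1\n", f);		/* any write starts compaction */
	return fclose(f) ? -1 : 0;
}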

> Signed-off-by: chukaiping 
> Reported-by: kernel test robot 
> ---
> 
> Changes in v2:
> - fix the compile error in ia64 and powerpc
> - change the hard coded max order number from 10 to MAX_ORDER - 1
> 
>  include/linux/compaction.h |1 +
>  kernel/sysctl.c|   11 +++
>  mm/compaction.c|   14 +++---
>  3 files changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index ed4070e..151ccd1 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -83,6 +83,7 @@ static inline unsigned long compact_gap(unsigned int order)
>  #ifdef CONFIG_COMPACTION
>  extern int sysctl_compact_memory;
>  extern unsigned int sysctl_compaction_proactiveness;
> +extern unsigned int sysctl_compaction_order;
>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>   void *buffer, size_t *length, loff_t *ppos);
>  extern int sysctl_extfrag_threshold;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 62fbd09..a607d4d 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -195,6 +195,8 @@ enum sysctl_writes_mode {
>  #endif /* CONFIG_SMP */
>  #endif /* CONFIG_SCHED_DEBUG */
>  
> +static int max_buddy_zone = MAX_ORDER - 1;
> +
>  #ifdef CONFIG_COMPACTION
>  static int min_extfrag_threshold;
>  static int max_extfrag_threshold = 1000;
> @@ -2871,6 +2873,15 @@ int proc_do_static_key(struct ctl_table *table, int write,
>   .extra2 = &one_hundred,
>   },
>   {
> + .procname   = "compaction_order",
> + .data   = &sysctl_compaction_order,
> + .maxlen = sizeof(sysctl_compaction_order),
> + .mode   = 0644,
> + .proc_handler   = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = &max_buddy_zone,
> + },
> + {
>   .procname   = "extfrag_threshold",
>   .data   = &sysctl_extfrag_threshold,
>   .maxlen = sizeof(int),
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e04f447..bfd1d5e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1925,16 +1925,16 @@ static bool kswapd_is_running(pg_data_t *pgdat)
>  
>  /*
>   * A zone's fragmentation score is the external fragmentation wrt to the
> - * COMPACTION_HPAGE_ORDER. It returns a value in the range [0, 100].
> + * sysctl_compaction_order. It returns a value in the range [0, 100].
>   */
>  static unsigned int fragmentation_score_zone(struct zone *zone)
>  {
> - return extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
> + return extfrag_for_order(zone, sysctl_compaction_order);
>  }
>  
>  /*
>   * A weighted zone's fragmentation score is the external fragmentation
> - * wrt to the COMPACTION_HPAGE_ORDER scaled by the zone's size. It
> + * wrt to the sysctl_compaction_order scaled by the zone's size. It
>   * returns a value in the range [0, 100].
>   *
>   * The scaling factor ensures that proactive compaction focuses on larger
> @@ -2666,6 +2666,7 @@ static void compact_nodes(void)
>   * background. It takes values in the range [0, 100].
>   */
>  unsigned int __read_mostly sysctl_compaction_proactiveness = 20;
> +unsigned int __read_mostly sysctl_compaction_order;
>  
>  /*
>   * This is the entry poin

Re: [PATCH] mm/compaction:let proactive compaction order configurable

2021-04-12 Thread David Rientjes
On Mon, 12 Apr 2021, chukaiping wrote:

> Currently the proactive compaction order is fixed to
> COMPACTION_HPAGE_ORDER(9), it's OK in most machines with lots of
> normal 4KB memory, but it's too high for the machines with small
> normal memory, for example the machines with most memory configured
> as 1GB hugetlbfs huge pages. In these machines the max order of
> free pages is often below 9, and it's always below 9 even with hard
> compaction. This will lead to proactive compaction being triggered very
> frequently. In these machines we only care about orders of 3 or 4.
> This patch exports the order to proc and lets it be configured by the
> user; the default value is still COMPACTION_HPAGE_ORDER.
> 

I'm curious why you have proactive compaction enabled at all in this case?

The order-9 threshold is likely to optimize for hugepage availability, but 
in your setup it appears that's not a goal.

So what benefit does proactive compaction provide if only done for order-3 
or order-4?

> Signed-off-by: chukaiping 
> ---
>  include/linux/compaction.h |1 +
>  kernel/sysctl.c|   10 ++
>  mm/compaction.c|7 ---
>  3 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index ed4070e..151ccd1 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -83,6 +83,7 @@ static inline unsigned long compact_gap(unsigned int order)
>  #ifdef CONFIG_COMPACTION
>  extern int sysctl_compact_memory;
>  extern unsigned int sysctl_compaction_proactiveness;
> +extern unsigned int sysctl_compaction_order;
>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>   void *buffer, size_t *length, loff_t *ppos);
>  extern int sysctl_extfrag_threshold;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 62fbd09..277df31 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -114,6 +114,7 @@
>  static int __maybe_unused neg_one = -1;
>  static int __maybe_unused two = 2;
>  static int __maybe_unused four = 4;
> +static int __maybe_unused ten = 10;
>  static unsigned long zero_ul;
>  static unsigned long one_ul = 1;
>  static unsigned long long_max = LONG_MAX;
> @@ -2871,6 +2872,15 @@ int proc_do_static_key(struct ctl_table *table, int write,
>   .extra2 = &one_hundred,
>   },
>   {
> + .procname   = "compaction_order",
> + .data   = &sysctl_compaction_order,
> + .maxlen = sizeof(sysctl_compaction_order),
> + .mode   = 0644,
> + .proc_handler   = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = &ten,
> + },
> + {
>   .procname   = "extfrag_threshold",
>   .data   = &sysctl_extfrag_threshold,
>   .maxlen = sizeof(int),
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e04f447..a192996 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1925,16 +1925,16 @@ static bool kswapd_is_running(pg_data_t *pgdat)
>  
>  /*
>   * A zone's fragmentation score is the external fragmentation wrt to the
> - * COMPACTION_HPAGE_ORDER. It returns a value in the range [0, 100].
> + * sysctl_compaction_order. It returns a value in the range [0, 100].
>   */
>  static unsigned int fragmentation_score_zone(struct zone *zone)
>  {
> - return extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
> + return extfrag_for_order(zone, sysctl_compaction_order);
>  }
>  
>  /*
>   * A weighted zone's fragmentation score is the external fragmentation
> - * wrt to the COMPACTION_HPAGE_ORDER scaled by the zone's size. It
> + * wrt to the sysctl_compaction_order scaled by the zone's size. It
>   * returns a value in the range [0, 100].
>   *
>   * The scaling factor ensures that proactive compaction focuses on larger
> @@ -2666,6 +2666,7 @@ static void compact_nodes(void)
>   * background. It takes values in the range [0, 100].
>   */
>  unsigned int __read_mostly sysctl_compaction_proactiveness = 20;
> +unsigned int __read_mostly sysctl_compaction_order = COMPACTION_HPAGE_ORDER;
>  
>  /*
>   * This is the entry point for compacting all nodes via
> -- 
> 1.7.1
> 
> 


Re: [PATCH v2] KVM: Explicitly use GFP_KERNEL_ACCOUNT for 'struct kvm_vcpu' allocations

2021-04-07 Thread David Rientjes
On Tue, 6 Apr 2021, Sean Christopherson wrote:

> Use GFP_KERNEL_ACCOUNT when allocating vCPUs to make it more obvious that
> that the allocations are accounted, to make it easier to audit KVM's
> allocations in the future, and to be consistent with other cache usage in
> KVM.
> 
> When using SLAB/SLUB, this is a nop as the cache itself is created with
> SLAB_ACCOUNT.
> 
> When using SLOB, there are caveats within caveats.  SLOB doesn't honor
> SLAB_ACCOUNT, so passing GFP_KERNEL_ACCOUNT will result in vCPU
> allocations now being accounted.   But, even that depends on internal
> SLOB details as SLOB will only go to the page allocator when its cache is
> depleted.  That just happens to be extremely likely for vCPUs because the
> size of kvm_vcpu is larger than a page for almost all combinations of
> architecture and page size.  Whether or not the SLOB behavior is by
> design is unknown; it's just as likely that no SLOB users care about
> accounting and so no one has bothered to implement support in SLOB.
> Regardless, accounting vCPU allocations will not break SLOB+KVM+cgroup
> users, if any exist.
> 
> Cc: Wanpeng Li 
> Signed-off-by: Sean Christopherson 

Always happy to see this ambiguity (SLAB_ACCOUNT vs GFP_KERNEL_ACCOUNT) 
resolved for slab allocations.

Acked-by: David Rientjes 


Re: [PATCH v6 1/2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-24 Thread David Rientjes
On Mon, 22 Mar 2021, Zi Yan wrote:

> From: Zi Yan 
> 
> We did not have a direct user interface of splitting the compound page
> backing a THP and there is no need unless we want to expose the THP
> implementation details to users. Make /split_huge_pages accept
> a new command to do that.
> 
> By writing ",," to
> /split_huge_pages, THPs within the given virtual address range
> from the process with the given pid are split. It is used to test
> split_huge_page function. In addition, a selftest program is added to
> tools/testing/selftests/vm to utilize the interface by splitting
> PMD THPs and PTE-mapped THPs.
> 

I'm curious if this is the only use case or whether you have additional 
use cases or extensions in mind?

Specifically, I'm wondering if you are looking into more appropriately 
dividing the limited number of hugepages available on the system amongst 
the most latency sensitive processes?

The set of hugepages available on the system can be limited by 
fragmentation.  We've found opportunities where latency sensitive 
processes would benefit from memory backed by thp, but they cannot be 
allocated at fault for this reason.  Yet, there are other processes that 
have memory backed by hugepages that may not be benefiting from them.

Could this be useful to split a hugepage for a latency tolerant 
application, migrate the pages elsewhere, and make the now-free hugepage 
available for a latency sensitive application (through something like my 
MADV_COLLAPSE proposal?)

Couple questions inline.

> This does not change the old behavior, i.e., writing 1 to the interface
> to split all THPs in the system.
> 
> Changelog:
> 
> From v5:
> 1. Skipped special VMAs and other fixes. (suggested by Yang Shi)
> 
> From v4:
> 1. Fixed the error code return issue, spotted by kernel test robot
>.
> 
> From v3:
> 1. Factored out split huge pages in the given pid code to a separate
>function.
> 2. Added the missing put_page for not split pages.
> 3. pr_debug -> pr_info, make reading results simpler.
> 
> From v2:
> 1. Reused existing /split_huge_pages interface. (suggested by
>Yang Shi)
> 
> From v1:
> 1. Removed unnecessary calling to vma_migratable, spotted by kernel test
>robot .
> 2. Dropped the use of find_mm_struct and code it directly, since there
>is no need for the permission check in that function and the function
>is only available when migration is on.
> 3. Added some comments in the selftest program to clarify how PTE-mapped
>THPs are formed.
> 
> Signed-off-by: Zi Yan 
> Reviewed-by: Yang Shi 
> ---
>  mm/huge_memory.c  | 151 -
>  tools/testing/selftests/vm/.gitignore |   1 +
>  tools/testing/selftests/vm/Makefile   |   1 +
>  .../selftests/vm/split_huge_page_test.c   | 318 ++
>  4 files changed, 464 insertions(+), 7 deletions(-)
>  create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bff92dea5ab3..b653255a548e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -7,6 +7,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -2922,16 +2923,14 @@ static struct shrinker deferred_split_shrinker = {
>  };
>  
>  #ifdef CONFIG_DEBUG_FS
> -static int split_huge_pages_set(void *data, u64 val)
> +static void split_huge_pages_all(void)
>  {
>   struct zone *zone;
>   struct page *page;
>   unsigned long pfn, max_zone_pfn;
>   unsigned long total = 0, split = 0;
>  
> - if (val != 1)
> - return -EINVAL;
> -
> + pr_info("Split all THPs\n");

Is this necessary to print?

>   for_each_populated_zone(zone) {
>   max_zone_pfn = zone_end_pfn(zone);
>   for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
> @@ -2959,11 +2958,149 @@ static int split_huge_pages_set(void *data, u64 val)
>   }
>  
>   pr_info("%lu of %lu THP split\n", split, total);
> +}
>  
> - return 0;
> +static inline bool vma_not_suitable_for_thp_split(struct vm_area_struct *vma)
> +{
> + return vma_is_special_huge(vma) || (vma->vm_flags & VM_IO) ||
> + is_vm_hugetlb_page(vma);
>  }
> -DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set,
> - "%llu\n");
> +
> +static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
> + unsigned long vaddr_end)
> +{
> + int ret = 0;
> + struct task_struct *task;
> + struct mm_struct *mm;
> + unsigned long total = 0, split = 0;
> + unsigned long addr;
> +
> + vaddr_start &= PAGE_MASK;
> + vaddr_end &= PAGE_MASK;
> +
> + /* Find the task_struct from pid */
> + rcu_read_lock();
> + task = find_task_by_vpid(pid);
> + if (!task) {
> + rcu_read_unlock();
> + ret = -ESRCH;
> + goto out;
> + }
> + get_task_struct(task);
> + rcu_read_unlock();

Re: [PATCH v2 3/4] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-24 Thread David Rientjes
On Wed, 24 Mar 2021, Matthew Wilcox (Oracle) wrote:

> If we're trying to allocate 4MB of memory, the table will be 8KiB in size
> (1024 pointers * 8 bytes per pointer), which can usually be satisfied
> by a kmalloc (which is significantly faster).  Instead of changing this
> open-coded implementation, just use kvmalloc().
> 
> This improves the allocation speed of vmalloc(4MB) by approximately
> 5% in our benchmark.  It's still dominated by the 1024 calls to
> alloc_pages_node(), which will be the subject of a later patch.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

Acked-by: David Rientjes 
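
As a rough sketch of the pattern being adopted (illustrative only, not the
actual patch): kvmalloc() tries kmalloc() first and transparently falls
back to vmalloc() when a physically contiguous allocation fails, so the
common 8KiB table comes from the fast path:

/* Illustrative sketch of the kvmalloc() pattern, not the actual patch:
 * allocate the page-pointer table with kmalloc when possible, falling
 * back to vmalloc transparently; free with kvfree() either way. */
static struct page **alloc_page_table(unsigned int nr_pages, int node)
{
	struct page **pages;

	pages = kvmalloc_node(array_size(nr_pages, sizeof(*pages)),
			      GFP_KERNEL | __GFP_ZERO, node);
	return pages;		/* caller frees with kvfree(pages) */
}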


Re: [PATCH v2 2/4] mm/util: Add kvmalloc_node_caller

2021-03-24 Thread David Rientjes
On Wed, 24 Mar 2021, Matthew Wilcox (Oracle) wrote:

> Allow the caller of kvmalloc to specify who counts as the allocator
> of the memory instead of assuming it's the immediate caller.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

Acked-by: David Rientjes 


Re: [PATCH v2 4/4] MAINTAINERS: Add Vlad Rezki as vmalloc maintainer

2021-03-24 Thread David Rientjes
On Wed, 24 Mar 2021, Matthew Wilcox (Oracle) wrote:

> People should know to cc Vlad on vmalloc-related patches.  With this,
> get-maintainer.pl suggests:
> 
> Uladzislau Rezki  (maintainer:VMALLOC)
> Andrew Morton  (maintainer:MEMORY MANAGEMENT)
> linux...@kvack.org (open list:VMALLOC)
> linux-kernel@vger.kernel.org (open list)
> 
> which looks appropriate to me.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  MAINTAINERS   | 7 +++
>  arch/arm64/kernel/module.c| 3 +--
>  arch/arm64/net/bpf_jit_comp.c | 3 +--
>  arch/parisc/kernel/module.c   | 5 ++---
>  arch/x86/hyperv/hv_init.c | 3 +--
>  5 files changed, 12 insertions(+), 9 deletions(-)

Looks like we got some extra cleanups with this :)

> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 8c44fd8fd85d..8b9cb5525cb1 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -19282,6 +19282,13 @@ S:   Maintained
>  F:   drivers/vlynq/vlynq.c
>  F:   include/linux/vlynq.h
>  
> +VMALLOC
> +M:   Uladzislau Rezki 

Should be 'M' or 'R'?  (Not sure if Vlad will be pushing patches or not.)

> +L:   linux...@kvack.org
> +S:   Maintained
> +F:   mm/vmalloc.c
> +F:   include/linux/vmalloc.h
> +
>  VME SUBSYSTEM
>  M:   Martyn Welch 
>  M:   Manohar Vanga 
> diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> index fa4186459927..5e31bc901631 100644
> --- a/arch/arm64/kernel/module.c
> +++ b/arch/arm64/kernel/module.c
> @@ -53,8 +53,7 @@ void *module_alloc(unsigned long size)
>*/
>   p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
>   module_alloc_base + SZ_2G, GFP_KERNEL,
> - PAGE_KERNEL, 0, NUMA_NO_NODE,
> - _RET_IP_);
> + PAGE_KERNEL, 0, NUMA_NO_NODE, _RET_IP_);
>  
>   if (p && (kasan_module_alloc(p, size) < 0)) {
>   vfree(p);
> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> index 8aca5bf74685..f3a6c1b49c4e 100644
> --- a/arch/arm64/net/bpf_jit_comp.c
> +++ b/arch/arm64/net/bpf_jit_comp.c
> @@ -1133,8 +1133,7 @@ void *bpf_jit_alloc_exec(unsigned long size)
>  {
>   return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
>   BPF_JIT_REGION_END, GFP_KERNEL,
> - PAGE_KERNEL, 0, NUMA_NO_NODE,
> - _RET_IP_);
> + PAGE_KERNEL, 0, NUMA_NO_NODE, _RET_IP_);
>  }
>  
>  void bpf_jit_free_exec(void *addr)
> diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
> index 320807f755a7..7181c4f99059 100644
> --- a/arch/parisc/kernel/module.c
> +++ b/arch/parisc/kernel/module.c
> @@ -198,9 +198,8 @@ void *module_alloc(unsigned long size)
>* easier than trying to map the text, data, init_text and
>* init_data correctly */
>   return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> - GFP_KERNEL,
> - PAGE_KERNEL_RWX, 0, NUMA_NO_NODE,
> - _RET_IP_);
> + GFP_KERNEL, PAGE_KERNEL_RWX, 0,
> + NUMA_NO_NODE, _RET_IP_);
>  }
>  
>  #ifndef CONFIG_64BIT
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 4866351282b0..a256a955d05f 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -427,8 +427,7 @@ void __init hyperv_init(void)
>  
>   hv_hypercall_pg = __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START,
>   VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_ROX,
> - VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
> - _RET_IP_);
> + VM_FLUSH_RESET_PERMS, NUMA_NO_NODE, _RET_IP_);
>   if (hv_hypercall_pg == NULL) {
>   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
>   goto remove_cpuhp_state;
> -- 
> 2.30.2
> 
> 
> 


Re: [PATCH] mm/slub.c: Trivial typo fixes

2021-03-24 Thread David Rientjes
On Wed, 24 Mar 2021, Bhaskar Chowdhury wrote:

> diff --git a/mm/slub.c b/mm/slub.c
> index 3021ce9bf1b3..cd3c7be33f69 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3,7 +3,7 @@
>   * SLUB: A slab allocator that limits cache line use instead of queuing
>   * objects in per cpu and per node lists.
>   *
> - * The allocator synchronizes using per slab locks or atomic operatios
> + * The allocator synchronizes using per slab locks or atomic operation

operations


Re: [PATCH 2/2] slub: remove resiliency_test() function

2021-03-17 Thread David Rientjes
On Tue, 16 Mar 2021, glit...@gmail.com wrote:

> From: Oliver Glitta 
> 
> Function resiliency_test() is hidden behind #ifdef
> SLUB_RESILIENCY_TEST that is not part of Kconfig, so nobody
> runs it.
> 
> This function is replaced with kselftest for SLUB added
> by the previous patch "selftests: add a kselftest for SLUB
> debugging functionality".
> 
> Signed-off-by: Oliver Glitta 

Very nice!

Acked-by: David Rientjes 


Re: [PATCH 1/2] selftests: add a kselftest for SLUB debugging functionality

2021-03-17 Thread David Rientjes
On Tue, 16 Mar 2021, glit...@gmail.com wrote:

> From: Oliver Glitta 
> 
> SLUB has resiliency_test() function which is hidden behind #ifdef
> SLUB_RESILIENCY_TEST that is not part of Kconfig, so nobody
> runs it. A kselftest should be a proper replacement for it.
> 
> Try changing byte in redzone after allocation and changing
> pointer to next free node, first byte, 50th byte and redzone
> byte. Check if validation finds errors.
> 
> There are several differences from the original resiliency test:
> Tests create own caches with known state instead of corrupting
> shared kmalloc caches.
> 
> The corruption of freepointer uses correct offset, the original
> resiliency test got broken with freepointer changes.
> 
> Scratch changing random byte test, because it does not have
> meaning in this form where we need deterministic results.
> 
> Add new option CONFIG_TEST_SLUB in Kconfig.
> 
> Add parameter to function validate_slab_cache() to return
> number of errors in cache.
> 
> Signed-off-by: Oliver Glitta 

Acked-by: David Rientjes 
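
To make the test idea concrete, here is a rough sketch of the
redzone-clobber case described above; the cache name and sizes are
illustrative, and the real test additionally runs the validation pass and
checks the reported error count:

/*
 * Rough sketch of the redzone-clobber case; illustrative only.
 */
static void sketch_clobber_redzone(void)
{
	struct kmem_cache *s = kmem_cache_create("TestSlub_RZ_alloc", 64, 0,
						 SLAB_RED_ZONE, NULL);
	u8 *p = kmem_cache_alloc(s, GFP_KERNEL);

	p[64] = 0x12;	/* one byte past the object, into the red zone */

	/* SLUB debugging should now report "Redzone overwritten" */
	kmem_cache_free(s, p);
	kmem_cache_destroy(s);
}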


Re: [selftests] e48d82b67a: BUG_TestSlub_RZ_alloc(Not_tainted):Redzone_overwritten

2021-03-17 Thread David Rientjes
On Wed, 17 Mar 2021, Vlastimil Babka wrote:

> > Greeting,
> > 
> > FYI, we noticed the following commit (built with gcc-9):
> > 
> > commit: e48d82b67a2b760eedf7b95ca15f41267496386c ("[PATCH 1/2] selftests: 
> > add a kselftest for SLUB debugging functionality")
> > url: 
> > https://github.com/0day-ci/linux/commits/glittao-gmail-com/selftests-add-a-kselftest-for-SLUB-debugging-functionality/20210316-204257
> > base: 
> > https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git next
> > 
> > in testcase: trinity
> > version: trinity-static-i386-x86_64-f93256fb_2019-08-28
> > with following parameters:
> > 
> > group: group-04
> > 
> > test-description: Trinity is a linux system call fuzz tester.
> > test-url: http://codemonkey.org.uk/projects/trinity/
> > 
> > 
> > on test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 8G
> > 
> > caused below changes (please refer to attached dmesg/kmsg for entire 
> > log/backtrace):
> > 
> > 
> > +---+---++
> > |   
> > | v5.12-rc2 | e48d82b67a |
> > +---+---++
> > | BUG_TestSlub_RZ_alloc(Not_tainted):Redzone_overwritten
> > | 0 | 69 |
> > | INFO:0x(ptrval)-0x(ptrval)@offset=#.First_byte#instead_of 
> > | 0 | 69 |
> > | INFO:Allocated_in_resiliency_test_age=#cpu=#pid=  
> > | 0 | 69 |
> > | INFO:Slab0x(ptrval)objects=#used=#fp=0x(ptrval)flags= 
> > | 0 | 69 |
> > | INFO:Object0x(ptrval)@offset=#fp=0x(ptrval)   
> > | 0 | 69 |
> > | BUG_TestSlub_next_ptr_free(Tainted:G_B):Freechain_corrupt 
> > | 0 | 69 |
> > | INFO:Freed_in_resiliency_test_age=#cpu=#pid=  
> > | 0 | 69 |
> > | 
> > BUG_TestSlub_next_ptr_free(Tainted:G_B):Wrong_object_count.Counter_is#but_counted_were
> > | 0 | 69 |
> > | BUG_TestSlub_next_ptr_free(Tainted:G_B):Redzone_overwritten   
> > | 0 | 69 |
> > | 
> > BUG_TestSlub_next_ptr_free(Tainted:G_B):Objects_remaining_in_TestSlub_next_ptr_free_on__kmem_cache_shutdown()
> >  | 0 | 69 |
> > | INFO:Object0x(ptrval)@offset= 
> > | 0 | 69 |
> > | BUG_TestSlub_1th_word_free(Tainted:G_B):Poison_overwritten
> > | 0 | 69 |
> > | BUG_TestSlub_50th_word_free(Tainted:G_B):Poison_overwritten   
> > | 0 | 69 |
> > | BUG_TestSlub_RZ_free(Tainted:G_B):Redzone_overwritten 
> > | 0 | 69 |
> > +---+---++
> > 
> > 
> > If you fix the issue, kindly add following tag
> > Reported-by: kernel test robot 
> > 
> > 
> > 
> > [   22.154049] random: get_random_u32 called from 
> > __kmem_cache_create+0x23/0x3e0 with crng_init=0 
> > [   22.154070] random: get_random_u32 called from 
> > cache_random_seq_create+0x7c/0x140 with crng_init=0 
> > [   22.154167] random: get_random_u32 called from allocate_slab+0x155/0x5e0 
> > with crng_init=0 
> > [   22.154690] test_slub: 1. kmem_cache: Clobber Redzone 0x12->0x(ptrval)
> > [   22.164499] 
> > =
> > [   22.166629] BUG TestSlub_RZ_alloc (Not tainted): Redzone overwritten
> > [   22.168179] 
> > -
> > [   22.168179]
> > [   22.168372] Disabling lock debugging due to kernel taint
> > [   22.168372] INFO: 0x(ptrval)-0x(ptrval) @offset=1064. First byte 0x12 
> > instead of 0xcc
> > [   22.168372] INFO: Allocated in resiliency_test+0x47/0x1be age=3 cpu=0 
> > pid=1 
> > [   22.168372] __slab_alloc+0x57/0x80 
> > [   22.168372] kmem_cache_alloc (kbuild/src/consumer/mm/slub.c:2871 
> > kbuild/src/consumer/mm/slub.c:2915 kbuild/src/consumer/mm/slub.c:2920) 
> > [   22.168372] resiliency_test (kbuild/src/consumer/lib/test_slub.c:34 
> > kbuild/src/consumer/lib/t

Re: [PATCH] [PATCH] mm, slub: enable slub_debug static key when creating cache with explicit debug flags

2021-03-15 Thread David Rientjes
On Mon, 15 Mar 2021, Vlastimil Babka wrote:

> Commit ca0cab65ea2b ("mm, slub: introduce static key for slub_debug()")
> introduced a static key to optimize the case where no debugging is enabled for
> any cache. The static key is enabled when slub_debug boot parameter is passed,
> or CONFIG_SLUB_DEBUG_ON enabled.
> 
> However, some caches might be created with one or more debugging flags
> explicitly passed to kmem_cache_create(), and the commit missed this. Thus the
> debugging functionality would not be actually performed for these caches 
> unless
> the static key gets enabled by boot param or config.
> 
> This patch fixes it by checking for debugging flags passed to
> kmem_cache_create() and enabling the static key accordingly.
> 
> Note such explicit debugging flags should not be used outside of debugging and
> testing as they will now enable the static key globally. btrfs_init_cachep()
> creates a cache with SLAB_RED_ZONE but that's a mistake that's being corrected
> [1]. rcu_torture_stats() creates a cache with SLAB_STORE_USER, but that is a
> testing module so it's OK and will start working as intended after this patch.
> 
> Also note that in case of backports to kernels before v5.12 that don't have
> 59450bbc12be ("mm, slab, slub: stop taking cpu hotplug lock"),
> static_branch_enable_cpuslocked() should be used.
> 

Since this affects 5.9+, is the plan to propose backports to stable with 
static_branch_enable_cpuslocked() once this is merged?  (I notice the 
absence of the stable tag here, which I believe is intended.)
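
For kernels predating 59450bbc12be, the note in the quoted changelog
suggests the cpuslocked variant; a hedged sketch of what such a backport
hunk could look like (not an actual stable submission):

#ifdef CONFIG_SLUB_DEBUG
	/*
	 * Backport sketch only: on kernels where this path still runs
	 * under the cpu hotplug lock, the _cpuslocked variant must be
	 * used to enable the key.
	 */
	if (flags & SLAB_DEBUG_FLAGS)
		static_branch_enable_cpuslocked(&slub_debug_enabled);
#endif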

> [1] https://lore.kernel.org/linux-btrfs/20210315141824.26099-1-dste...@suse.com/
> 
> Reported-by: Oliver Glitta 
> Fixes: ca0cab65ea2b ("mm, slub: introduce static key for slub_debug()")
> Signed-off-by: Vlastimil Babka 

Acked-by: David Rientjes 

> ---
>  mm/slub.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 350a37f30e60..cd6694ad1a0a 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3827,6 +3827,15 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
>  
>  static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
>  {
> +#ifdef CONFIG_SLUB_DEBUG
> + /*
> +  * If no slub_debug was enabled globally, the static key is not yet
> +  * enabled by setup_slub_debug(). Enable it if the cache is being
> +  * created with any of the debugging flags passed explicitly.
> +  */
> + if (flags & SLAB_DEBUG_FLAGS)
> + static_branch_enable(&slub_debug_enabled);
> +#endif
>   s->flags = kmem_cache_flags(s->size, flags, s->name);
>  #ifdef CONFIG_SLAB_FREELIST_HARDENED
>   s->random = get_random_long();


Re: [PATCH v6 2/3] mm, slub: don't combine pr_err with INFO

2021-03-14 Thread David Rientjes
On Sun, 14 Mar 2021, Yafang Shao wrote:

> It is strange to combine "pr_err" with "INFO", so let's remove the
> prefix completely.
> This patch is motivated by David's comment[1].
> 
> - before the patch
> [ 8846.517809] INFO: Slab 0xf42a2c60 objects=33 used=3 
> fp=0x60d32ca8 flags=0x17c0010200(slab|head)
> 
> - after the patch
> [ 6343.396602] Slab 0x4382e02b objects=33 used=3 
> fp=0x9ae06ffc flags=0x17c0010200(slab|head)
> 
> [1]. https://lore.kernel.org/linux-mm/b9c0f2b6-e9b0-0c36-ebdd-2bc684c5a...@redhat.com/#t
> 
> Suggested-by: Vlastimil Babka 
> Signed-off-by: Yafang Shao 
> Acked-by: Vlastimil Babka 
> Reviewed-by: Miaohe Lin 
> Reviewed-by: Andy Shevchenko 
> Reviewed-by: David Hildenbrand 
> Cc: Matthew Wilcox 

Acked-by: David Rientjes 


Re: [PATCH] memcg: enable memcg oom-kill for __GFP_NOFAIL

2021-02-25 Thread David Rientjes
On Tue, 23 Feb 2021, Shakeel Butt wrote:

> In the era of async memcg oom-killer, the commit a0d8b00a3381 ("mm:
> memcg: do not declare OOM from __GFP_NOFAIL allocations") added the code
> to skip memcg oom-killer for __GFP_NOFAIL allocations. The reason was
> that the __GFP_NOFAIL callers will not enter aync oom synchronization
> path and will keep the task marked as in memcg oom. At that time the
> tasks marked in memcg oom can bypass the memcg limits and the oom
> synchronization would have happened later in the later userspace
> triggered page fault. Thus letting the task marked as under memcg oom
> bypass the memcg limit for arbitrary time.
> 
> With the synchronous memcg oom-killer (commit 29ef680ae7c21 ("memcg,
> oom: move out_of_memory back to the charge path")) and not letting the
> task marked under memcg oom to bypass the memcg limits (commit
> 1f14c1ac19aa4 ("mm: memcg: do not allow task about to OOM kill to bypass
> the limit")), we can again allow __GFP_NOFAIL allocations to trigger
> memcg oom-kill. This will make memcg oom behavior closer to page
> allocator oom behavior.
> 
> Signed-off-by: Shakeel Butt 

Acked-by: David Rientjes 


Re: [PATCH v5 2/3] mm, slub: don't combine pr_err with INFO

2021-02-17 Thread David Rientjes
On Mon, 15 Feb 2021, Yafang Shao wrote:

> It is strange to combine "pr_err" with "INFO", so let's remove the
> prefix completely.
> This patch is motivated by David's comment[1].
> 
> - before the patch
> [ 8846.517809] INFO: Slab 0xf42a2c60 objects=33 used=3 
> fp=0x60d32ca8 flags=0x17c0010200(slab|head)
> 
> - after the patch
> [ 6343.396602] Slab 0x4382e02b objects=33 used=3 
> fp=0x9ae06ffc flags=0x17c0010200(slab|head)
> 
> [1]. https://lore.kernel.org/linux-mm/b9c0f2b6-e9b0-0c36-ebdd-2bc684c5a...@redhat.com/#t
> 
> Suggested-by: Vlastimil Babka 
> Signed-off-by: Yafang Shao 
> Acked-by: Vlastimil Babka 
> Reviewed-by: Miaohe Lin 
> Reviewed-by: Andy Shevchenko 
> Reviewed-by: David Hildenbrand 
> Cc: Matthew Wilcox 

Acked-by: David Rientjes 


Re: [PATCH] mm, compaction: make fast_isolate_freepages() stay within zone

2021-02-17 Thread David Rientjes
On Wed, 17 Feb 2021, Vlastimil Babka wrote:

> Compaction always operates on pages from a single given zone when isolating
> both pages to migrate and freepages. Pageblock boundaries are intersected with
> zone boundaries to be safe in case zone starts or ends in the middle of
> pageblock. The use of pageblock_pfn_to_page() protects against non-contiguous
> pageblocks.
> 
> The functions fast_isolate_freepages() and fast_isolate_around() don't
> currently protect the fast freepage isolation thoroughly enough against these
> corner cases, and can result in freepage isolation operate outside of zone
> boundaries:
> 
> - in fast_isolate_freepages() if we get a pfn from the first pageblock of a
>   zone that starts in the middle of that pageblock, 'highest' can be a pfn
>   outside of the zone. If we fail to isolate anything in this function, we
>   may then call fast_isolate_around() on a pfn outside of the zone and there
>   effectively do a set_pageblock_skip(page_to_pfn(highest)) which may 
> currently
>   hit a VM_BUG_ON() in some configurations
> - fast_isolate_around() checks only the zone end boundary and not beginning,
>   nor that the pageblock is contiguous (with pageblock_pfn_to_page()) so it's
>   possible that we end up calling isolate_freepages_block() on a range of 
> pfn's
>   from two different zones and end up e.g. isolating freepages under the wrong
>   zone's lock.
> 
> This patch should fix the above issues.
> 
> Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a migration target")
> Cc: 
> Signed-off-by: Vlastimil Babka 

Acked-by: David Rientjes 
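
The general shape of the fix described above can be sketched as clamping
the pageblock to its zone before isolating; an illustrative helper under
that assumption, not the actual patch:

/*
 * Illustrative helper: intersect a pageblock with its zone and verify it
 * is contiguous before isolating freepages, so fast isolation never
 * strays past a zone boundary.
 */
static bool clamp_pageblock_to_zone(struct zone *zone, unsigned long pfn,
				    unsigned long *start_pfn,
				    unsigned long *end_pfn)
{
	*start_pfn = max(pageblock_start_pfn(pfn), zone->zone_start_pfn);
	*end_pfn = min(pageblock_end_pfn(pfn), zone_end_pfn(zone));

	/* pageblock_pfn_to_page() also rejects non-contiguous ranges */
	return pageblock_pfn_to_page(*start_pfn, *end_pfn, zone) != NULL;
}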


Re: [RFC PATCH] mm, oom: introduce vm.sacrifice_hugepage_on_oom

2021-02-16 Thread David Rientjes
On Tue, 16 Feb 2021, Michal Hocko wrote:

> > Hugepages can be preallocated to avoid unpredictable allocation latency.
> > If we run into 4k page shortage, the kernel can trigger OOM even though
> > there were free hugepages. When OOM is triggered by user address page
> > fault handler, we can use oom notifier to free hugepages in user space
> > but if it's triggered by memory allocation for kernel, there is no way
> > to synchronously handle it in user space.
> 
> Can you expand some more on what kind of problem do you see?
> Hugetlb pages are, by definition, a preallocated, unreclaimable and
> admin controlled pool of pages.

Small nit: true of non-surplus hugetlb pages.

> Under those conditions it is expected
> and required that the sizing would be done very carefully. Why is that a
> problem in your particular setup/scenario?
> 
> If the sizing is really done properly and then a random process can
> trigger OOM then this can lead to malfunctioning of those workloads
> which do depend on hugetlb pool, right? So isn't this a kinda DoS
> scenario?
> 
> > This patch introduces a new sysctl vm.sacrifice_hugepage_on_oom. If
> > enabled, it first tries to free a hugepage if available before invoking
> > the oom-killer. The default value is disabled not to change the current
> > behavior.
> 
> Why is this interface not hugepage size aware? It is quite different to
> release a GB huge page or 2MB one. Or is it expected to release the
> smallest one? To the implementation...
> 
> [...]
> > +static int sacrifice_hugepage(void)
> > +{
> > +   int ret;
> > +
> > +   spin_lock(&hugetlb_lock);
> > +   ret = free_pool_huge_page(&default_hstate, &node_states[N_MEMORY], 0);
> 
> ... no it is going to release the default huge page. This will be 2MB in
> most cases but this is not given.
> 
> Unless I am mistaken this will free up also reserved hugetlb pages. This
> would mean that a page fault would SIGBUS which is very likely not
> something we want to do right? You also want to use oom nodemask rather
> than a full one.
> 
> Overall, I am not really happy about this feature even when above is
> fixed, but let's hear more the actual problem first.

Shouldn't this behavior be possible as an oomd plugin instead, perhaps 
triggered by psi?  I'm not sure if oomd is intended only to kill something 
(oomkilld? lol) or if it can be made to do sysadmin level behavior, such 
as shrinking the hugetlb pool, to solve the oom condition.

If so, it seems like we want to do this at the absolute last minute.  In 
other words, reclaim has failed to free memory by other means so we would 
like to shrink the hugetlb pool.  (It's the reason why it's implemented as 
a predecessor to oom as opposed to part of reclaim in general.)

Do we have the ability to suppress the oom killer until oomd has a chance 
to react in this scenario?


Re: [PATCH V2] mm: compaction: update the COMPACT[STALL|FAIL] events properly

2021-02-12 Thread David Rientjes
On Fri, 12 Feb 2021, Charan Teja Reddy wrote:

> By definition, COMPACT[STALL|FAIL] events needs to be counted when there
> is 'At least in one zone compaction wasn't deferred or skipped from the
> direct compaction'. And when compaction is skipped or deferred,
> COMPACT_SKIPPED will be returned but it will still go and update these
> compaction events which is wrong in the sense that COMPACT[STALL|FAIL]
> is counted without even trying the compaction.
> 
> Correct this by skipping the counting of these events when
> COMPACT_SKIPPED is returned for compaction. This indirectly also avoid
> the unnecessary try into the get_page_from_freelist() when compaction is
> not even tried.
> 
> There is a corner case where compaction is skipped but still count
> COMPACTSTALL event, which is that IRQ came and freed the page and the
> same is captured in capture_control.
> 
> Signed-off-by: Charan Teja Reddy 

Acked-by: David Rientjes 


Re: [PATCH] mm/slub: minor coding style tweaks

2021-02-10 Thread David Rientjes
On Tue, 9 Feb 2021, Zhiyuan Dai wrote:

> This patch adds whitespace to fix coding style issues,
> and improve code readability.
> 
> Signed-off-by: Zhiyuan Dai 

Acked-by: David Rientjes 


Re: [PATCH] mm/slab: minor coding style tweaks

2021-02-10 Thread David Rientjes
On Tue, 9 Feb 2021, Zhiyuan Dai wrote:

> Fixed some coding style issues to improve code readability.
> This patch adds whitespace to clearly separate the parameters.
> 
> Signed-off-by: Zhiyuan Dai 

Acked-by: David Rientjes 


RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-07 Thread David Rientjes
On Sun, 7 Feb 2021, Song Bao Hua (Barry Song) wrote:

> NUMA balancer is just one of many reasons for page migration. Even one
> simple alloc_pages() can cause memory migration in just single NUMA
> node or UMA system.
> 
> The other reasons for page migration include but are not limited to:
> * memory move due to CMA
> * memory move due to huge pages creation
> 
> Hardly we can ask users to disable the COMPACTION, CMA and Huge Page
> in the whole system.
> 

What about only for mlocked memory, i.e. disable 
vm.compact_unevictable_allowed?

Adding syscalls is a big deal, we can make a reasonable inference that 
we'll have to support this forever if it's merged.  I haven't seen mention 
of what other unevictable memory *should* be migratable that would be 
adversely affected if we disable that sysctl.  Maybe that gets you part of 
the way there and there are some other deficiencies, but it seems like a 
good start would be to describe how CONFIG_NUMA_BALANCING=n + 
vm.compact_unevictable_allowed + mlock() doesn't get you mostly there and 
then look into what's missing.

If it's a very compelling case where there simply are no alternatives, it 
would make sense.  Alternative is to find a more generic way, perhaps in 
combination with vm.compact_unevictable_allowed, to achieve what you're 
looking to do that can be useful even beyond your originally intended use 
case.
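
A hedged userspace sketch of the alternative being suggested: pin the
buffer with mlock() and rely on vm.compact_unevictable_allowed being
cleared (via sysctl, outside this snippet) so compaction leaves the
unevictable pages alone:

/* Userspace sketch of the suggested alternative: mlock() the buffer so it
 * is unevictable; with vm.compact_unevictable_allowed=0 (set separately),
 * compaction will not migrate it either. Error handling kept minimal. */
#include <stdlib.h>
#include <sys/mman.h>

static void *alloc_pinned(size_t len)
{
	void *buf;

	if (posix_memalign(&buf, 4096, len))
		return NULL;
	if (mlock(buf, len)) {		/* may need CAP_IPC_LOCK / rlimit */
		free(buf);
		return NULL;
	}
	return buf;
}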


Re: [PATCH] mm: page_alloc: update the COMPACT[STALL|FAIL] events properly

2021-02-05 Thread David Rientjes
On Tue, 2 Feb 2021, Charan Teja Kalla wrote:

> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 519a60d..531f244 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -4152,6 +4152,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, 
> >> unsigned int order,
> >>memalloc_noreclaim_restore(noreclaim_flag);
> >>psi_memstall_leave(&pflags);
> >>  
> >> +  if (*compact_result == COMPACT_SKIPPED)
> >> +  return NULL;
> >>/*
> >> * At least in one zone compaction wasn't deferred or skipped, so let's
> >> * count a compaction stall
> > 
> > This makes sense, I wonder if it would also be useful to check that 
> > page == NULL, either in try_to_compact_pages() or here for 
> > COMPACT_SKIPPED?
> 
> In the code, when COMPACT_SKIPPED is being returned, the page will
> always be NULL, so I'm not sure how useful the page == NULL check is
> here. Or have I misunderstood your point?
> 

Your code is short-circuiting the rest of  __alloc_pages_direct_compact() 
where the return value is dictated by whether page is NULL or non-NULL.  
We can't leak a captured page if we are testing for it being NULL or 
non-NULL, which is what the rest of __alloc_pages_direct_compact() does 
*before* your change.  So the idea was to add a check the page is actually 
NULL here since you are now relying on the return value of 
compact_zone_order() to be COMPACT_SKIPPED to infer page == NULL.

I agree that's currently true in the code, I was trying to catch any 
errors where current->capture_control.page was non-NULL but 
try_to_compact_pages() returns COMPACT_SKIPPED.  There's some complexity 
here.

So my idea was the expand this out to:

if (*compact_result == COMPACT_SKIPPED) {
VM_BUG_ON(page);
return NULL;
}


Re: [PATCH 03/14] cxl/mem: Find device capabilities

2021-02-01 Thread David Rientjes
On Mon, 1 Feb 2021, Dan Williams wrote:

> > > I don't have an objection to binding, but doesn't this require that the
> > > check in cxl_validate_cmd_from_user() guarantees send_cmd->size_in cannot
> > > be greater than 1MB?
> >
> > You're correct. I'd need to add:
> > cxlm->mbox.payload_size =
> > min_t(size_t, 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE), 
> > 1<<20)
> 
> nit, use the existing SZ_1M define.
> 

Sounds good, thanks both!  I'll assume you'll follow-up on this in the 
next revision for patch 7 ("cxl/mem: Add send command") and we can 
consider this resolved :)


Re: [PATCH 03/14] cxl/mem: Find device capabilities

2021-02-01 Thread David Rientjes
On Mon, 1 Feb 2021, Ben Widawsky wrote:

> > I haven't seen the update to 8.2.8.4.5 to know yet :)
> > 
> > You make a good point of at least being able to interact with the driver.  
> > I think you could argue that if the driver binds, then the payload size is 
> > accepted, in which case it would be strange to get an EINVAL when using 
> > the ioctl with anything >1MB.
> > 
> > Concern was that if we mask off the reserved bits from the command 
> > register that we could be masking part of the payload size that is being 
> > passed if the accepted max is >1MB.  Idea was to avoid any possibility of 
> > this inconsistency.  If this is being checked for ioctl, seems like it's 
> > checking reserved bits.
> > 
> > But maybe I should just wait for the spec update.
> 
> Well, I wouldn't hold your breath (it would be an errata in this case anyway).
> My preference would be to just allow allow mailbox payload size to be 2^31 and
> not deal with this.
> 
> My question was how strongly do you feel it's an error that should prevent
> binding.
> 

I don't have an objection to binding, but doesn't this require that the 
check in cxl_validate_cmd_from_user() guarantees send_cmd->size_in cannot 
be greater than 1MB?


Re: [PATCH 03/14] cxl/mem: Find device capabilities

2021-02-01 Thread David Rientjes
On Mon, 1 Feb 2021, Ben Widawsky wrote:

> > I think that's what 8.2.8.4.3 says, no?  And then 8.2.8.4.5 says you 
> > can use up to Payload Size.  That's why my recommendation was to enforce 
> > this in cxl_mem_setup_mailbox() up front.
> 
> Yeah. I asked our spec people to update 8.2.8.4.5 to make it clearer. I'd 
> argue
> the intent is how you describe it, but the implementation isn't.
> 
> My argument was silly anyway because if you specify greater than 1M as your
> payload, you will get EINVAL at the ioctl.
> 
> The value of how it works today is the driver will at least bind and allow you
> to interact with it.
> 
> How strongly do you feel about this?
> 

I haven't seen the update to 8.2.8.4.5 to know yet :)

You make a good point of at least being able to interact with the driver.  
I think you could argue that if the driver binds, then the payload size is 
accepted, in which case it would be strange to get an EINVAL when using 
the ioctl with anything >1MB.

Concern was that if we mask off the reserved bits from the command 
register that we could be masking part of the payload size that is being 
passed if the accepted max is >1MB.  Idea was to avoid any possibility of 
this inconsistency.  If this is being checked for ioctl, seems like it's 
checking reserved bits.

But maybe I should just wait for the spec update.


Re: [PATCH 03/14] cxl/mem: Find device capabilities

2021-02-01 Thread David Rientjes
On Mon, 1 Feb 2021, Ben Widawsky wrote:

> > > > > > > > +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> > > > > > > > +{
> > > > > > > > +   const int cap = cxl_read_mbox_reg32(cxlm, 
> > > > > > > > CXLDEV_MB_CAPS_OFFSET);
> > > > > > > > +
> > > > > > > > +   cxlm->mbox.payload_size =
> > > > > > > > +   1 << CXL_GET_FIELD(cap, 
> > > > > > > > CXLDEV_MB_CAP_PAYLOAD_SIZE);
> > > > > > > > +
> > > > > > > > +   /* 8.2.8.4.3 */
> > > > > > > > +   if (cxlm->mbox.payload_size < 256) {
> > > > > > > > +   dev_err(&cxlm->pdev->dev, "Mailbox is too small 
> > > > > > > > (%zub)",
> > > > > > > > +   cxlm->mbox.payload_size);
> > > > > > > > +   return -ENXIO;
> > > > > > > > +   }
> > > > > > > 
> > > > > > > Any reason not to check cxlm->mbox.payload_size > (1 << 20) as 
> > > > > > > well and 
> > > > > > > return ENXIO if true?
> > > > > > 
> > > > > > If some crazy vendor wanted to ship a mailbox larger than 1M, why 
> > > > > > should the
> > > > > > driver not allow it?
> > > > > > 
> > > > > 
> > > > > Because the spec disallows it :)
> > > > 
> > > > I don't see it being the driver's responsibility to enforce spec 
> > > > correctness
> > > > though. In certain cases, I need to use the spec, like I have to pick 
> > > > /some/
> > > > mailbox timeout. For other cases... 
> > > > 
> > > > I'm not too familiar with what other similar drivers may or may not do 
> > > > in
> > > > situations like this. The current 256 limit is mostly a reflection of 
> > > > that being
> > > > too small to even support advertised mandatory commands. So things 
> > > > can't work in
> > > > that scenario, but things can work if they have a larger register size 
> > > > (so long
> > > > as the BAR advertises enough space).
> > > > 
> > > 
> > > I don't think things can work above 1MB, either, though.  Section 
> > > 8.2.8.4.5 specifies 20 bits to define the payload length, if this is 
> > > larger than cxlm->mbox.payload_size it would venture into the reserved 
> > > bits of the command register.
> > > 
> > > So is the idea to allow cxl_mem_setup_mailbox() to succeed with a payload 
> > > size > 1MB and then only check 20 bits for the command register?
> > 
> > So it's probably a spec bug, but actually the payload size is 21 bits... 
> > I'll
> > check if that was a mistake.
> 
> Well I guess they wanted to be able to specify 1M exactly... Spec should
> probably say you can't go over 1M
> 

I think that's what 8.2.8.4.3 says, no?  And then 8.2.8.4.5 says you 
can use up to Payload Size.  That's why my recommendation was to enforce 
this in cxl_mem_setup_mailbox() up front.


Re: [PATCH 03/14] cxl/mem: Find device capabilities

2021-02-01 Thread David Rientjes
On Mon, 1 Feb 2021, Ben Widawsky wrote:

> > > > > +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> > > > > +{
> > > > > + const int cap = cxl_read_mbox_reg32(cxlm, 
> > > > > CXLDEV_MB_CAPS_OFFSET);
> > > > > +
> > > > > + cxlm->mbox.payload_size =
> > > > > + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> > > > > +
> > > > > + /* 8.2.8.4.3 */
> > > > > + if (cxlm->mbox.payload_size < 256) {
> > > > > + dev_err(&cxlm->pdev->dev, "Mailbox is too small (%zub)",
> > > > > + cxlm->mbox.payload_size);
> > > > > + return -ENXIO;
> > > > > + }
> > > > 
> > > > Any reason not to check cxlm->mbox.payload_size > (1 << 20) as well and 
> > > > return ENXIO if true?
> > > 
> > > If some crazy vendor wanted to ship a mailbox larger than 1M, why should 
> > > the
> > > driver not allow it?
> > > 
> > 
> > Because the spec disallows it :)
> 
> I don't see it being the driver's responsibility to enforce spec correctness
> though. In certain cases, I need to use the spec, like I have to pick /some/
> mailbox timeout. For other cases... 
> 
> I'm not too familiar with what other similar drivers may or may not do in
> situations like this. The current 256 limit is mostly a reflection of that 
> being
> too small to even support advertised mandatory commands. So things can't work 
> in
> that scenario, but things can work if they have a larger register size (so 
> long
> as the BAR advertises enough space).
> 

I don't think things can work above 1MB, either, though.  Section 
8.2.8.4.5 specifies 20 bits to define the payload length, if this is 
larger than cxlm->mbox.payload_size it would venture into the reserved 
bits of the command register.

So is the idea to allow cxl_mem_setup_mailbox() to succeed with a payload 
size > 1MB and then only check 20 bits for the command register?


Re: [PATCH 05/14] cxl/mem: Register CXL memX devices

2021-02-01 Thread David Rientjes
On Mon, 1 Feb 2021, Ben Widawsky wrote:

> > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl 
> > > b/Documentation/ABI/testing/sysfs-bus-cxl
> > > new file mode 100644
> > > index ..fe7b87eba988
> > > --- /dev/null
> > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > @@ -0,0 +1,26 @@
> > > +What:/sys/bus/cxl/devices/memX/firmware_version
> > > +Date:December, 2020
> > > +KernelVersion:   v5.12
> > > +Contact: linux-...@vger.kernel.org
> > > +Description:
> > > + (RO) "FW Revision" string as reported by the Identify
> > > + Memory Device Output Payload in the CXL-2.0
> > > + specification.
> > > +
> > > +What:/sys/bus/cxl/devices/memX/ram/size
> > > +Date:December, 2020
> > > +KernelVersion:   v5.12
> > > +Contact: linux-...@vger.kernel.org
> > > +Description:
> > > + (RO) "Volatile Only Capacity" as reported by the
> > > + Identify Memory Device Output Payload in the CXL-2.0
> > > + specification.
> > > +
> > > +What:/sys/bus/cxl/devices/memX/pmem/size
> > > +Date:December, 2020
> > > +KernelVersion:   v5.12
> > > +Contact: linux-...@vger.kernel.org
> > > +Description:
> > > + (RO) "Persistent Only Capacity" as reported by the
> > > + Identify Memory Device Output Payload in the CXL-2.0
> > > + specification.
> > 
> > Aren't volatile and persistent capacities expressed in multiples of 256MB?
> 
> As of the spec today, volatile and persistent capacities are required to be
> in multiples of 256MB, however, future specs may not have such a requirement 
> and
> I think keeping sysfs ABI easily forward portable makes sense.
> 

Makes sense, can we add that these are expressed in bytes or is that 
already implied?


Re: [PATCH 03/14] cxl/mem: Find device capabilities

2021-02-01 Thread David Rientjes
On Mon, 1 Feb 2021, Ben Widawsky wrote:

> On 21-01-30 15:51:49, David Rientjes wrote:
> > On Fri, 29 Jan 2021, Ben Widawsky wrote:
> > 
> > > +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> > > +{
> > > + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> > > +
> > > + cxlm->mbox.payload_size =
> > > + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> > > +
> > > + /* 8.2.8.4.3 */
> > > + if (cxlm->mbox.payload_size < 256) {
> > > + dev_err(&cxlm->pdev->dev, "Mailbox is too small (%zub)",
> > > + cxlm->mbox.payload_size);
> > > + return -ENXIO;
> > > + }
> > 
> > Any reason not to check cxlm->mbox.payload_size > (1 << 20) as well and 
> > return ENXIO if true?
> 
> If some crazy vendor wanted to ship a mailbox larger than 1M, why should the
> driver not allow it?
> 

Because the spec disallows it :)


Re: [PATCH] hugetlbfs: show pagesize in unit of GB if possible

2021-02-01 Thread David Rientjes
On Mon, 1 Feb 2021, Miaohe Lin wrote:

> >> Hugepage size in unit of GB is supported. We could show pagesize in unit of
> >> GB to make it more friendly to read. Also rework the calculation code of
> >> page size unit to make it more readable.
> >>
> >> Signed-off-by: Miaohe Lin 
> >> ---
> >>  fs/hugetlbfs/inode.c | 12 
> >>  1 file changed, 8 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> >> index 3a08fbae3b53..40a9795f250a 100644
> >> --- a/fs/hugetlbfs/inode.c
> >> +++ b/fs/hugetlbfs/inode.c
> >> @@ -1014,11 +1014,15 @@ static int hugetlbfs_show_options(struct seq_file 
> >> *m, struct dentry *root)
> >>if (sbinfo->max_inodes != -1)
> >>seq_printf(m, ",nr_inodes=%lu", sbinfo->max_inodes);
> >>  
> >> -  hpage_size /= 1024;
> >> -  mod = 'K';
> >> -  if (hpage_size >= 1024) {
> >> -  hpage_size /= 1024;
> >> +  if (hpage_size >= SZ_1G) {
> >> +  hpage_size /= SZ_1G;
> >> +  mod = 'G';
> >> +  } else if (hpage_size >= SZ_1M) {
> >> +  hpage_size /= SZ_1M;
> >>mod = 'M';
> >> +  } else {
> >> +  hpage_size /= SZ_1K;
> >> +  mod = 'K';
> >>}
> >>seq_printf(m, ",pagesize=%lu%c", hpage_size, mod);
> >>if (spool) {
> > 
> > NACK, this can break existing userspace parsing.
> > .
> > 
> 
> I see. I should take care of this. Many thanks.
> 

Thanks.  It's an important point to emphasize because it shows how 
important user-facing interfaces are.  Once the hugetlbfs mount options 
are printed in the way they are, it becomes very difficult to change them 
because there can be userspace in the wild that would break as a result.  
This is why it's crucial to be very careful when specifying user-facing 
interfaces the first time so they are extensible.


Re: [PATCH] mm/huge_memory.c: use helper range_in_vma() in __split_huge_p[u|m]d_locked()

2021-02-01 Thread David Rientjes
On Mon, 1 Feb 2021, Miaohe Lin wrote:

> The helper range_in_vma() is introduced via commit 017b1660df89 ("mm:
> migration: fix migration of huge PMD shared pages"). But we forgot to
> use it in __split_huge_pud_locked() and __split_huge_pmd_locked().
> 
> Signed-off-by: Miaohe Lin 
> ---
>  mm/huge_memory.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 987cf5e4cf90..33353a4f95fb 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1959,8 +1959,7 @@ static void __split_huge_pud_locked(struct 
> vm_area_struct *vma, pud_t *pud,
>   unsigned long haddr)
>  {
>   VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
> - VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
> - VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
> + VM_BUG_ON_VMA(!range_in_vma(vma, haddr, haddr + HPAGE_PUD_SIZE), vma);
>   VM_BUG_ON(!pud_trans_huge(*pud) && !pud_devmap(*pud));
>  
>   count_vm_event(THP_SPLIT_PUD);
> @@ -2039,8 +2038,7 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   int i;
>  
>   VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
> - VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
> - VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
> + VM_BUG_ON_VMA(!range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE), vma);
>   VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
>   && !pmd_devmap(*pmd));
>  

This actually loses information, right?  Before the patch, we can 
determine which conditional is failing because we know the line number 
that the VM_BUG_ON() is happening on.  After the patch, we don't know 
this.

I don't think that's crucial, but I'm not sure it makes sense to do this 
if the only upside is that we removed one total line of code :)


Re: [PATCH] mm: page_alloc: update the COMPACT[STALL|FAIL] events properly

2021-02-01 Thread David Rientjes
On Mon, 1 Feb 2021, Charan Teja Reddy wrote:

> By defination, COMPACT[STALL|FAIL] events needs to be counted when there

s/defination/definition/

> is 'At least in one zone compaction wasn't deferred or skipped from the
> direct compaction'. And when compaction is skipped or deferred,
> COMPACT_SKIPPED will be returned but it will still go and update these
> compaction events which is wrong in the sense that COMPACT[STALL|FAIL]
> is counted without even trying the compaction.
> 
> Correct this by skipping the counting of these events when
> COMPACT_SKIPPED is returned for compaction. This indirectly also avoid
> the unnecessary try into the get_page_from_freelist() when compaction is
> not even tried.
> 
> Signed-off-by: Charan Teja Reddy 
> ---
>  mm/page_alloc.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 519a60d..531f244 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4152,6 +4152,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned 
> int order,
>   memalloc_noreclaim_restore(noreclaim_flag);
>   psi_memstall_leave(&pflags);
>  
> + if (*compact_result == COMPACT_SKIPPED)
> + return NULL;
>   /*
>* At least in one zone compaction wasn't deferred or skipped, so let's
>* count a compaction stall

This makes sense; I wonder if it would also be useful to check that 
page == NULL for COMPACT_SKIPPED, either in try_to_compact_pages() or 
here?
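
Something like this (untested, and assuming "page" here is the capture 
target that was passed to try_to_compact_pages()):

	if (*compact_result == COMPACT_SKIPPED) {
		/*
		 * Compaction didn't even run, so nothing should have
		 * been captured for us either.
		 */
		VM_WARN_ON_ONCE(page);
		return NULL;
	}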


Re: [RFC][PATCH 04/13] mm/numa: node demotion data structure and lookup

2021-01-30 Thread David Rientjes
On Mon, 25 Jan 2021, Dave Hansen wrote:

> diff -puN mm/migrate.c~0006-node-Define-and-export-memory-migration-path 
> mm/migrate.c
> --- a/mm/migrate.c~0006-node-Define-and-export-memory-migration-path  
> 2021-01-25 16:23:09.553866709 -0800
> +++ b/mm/migrate.c2021-01-25 16:23:09.558866709 -0800
> @@ -1161,6 +1161,22 @@ out:
>   return rc;
>  }
>  
> +static int node_demotion[MAX_NUMNODES] = {[0 ...  MAX_NUMNODES - 1] = 
> NUMA_NO_NODE};

__read_mostly?
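
I.e. something like:

	static int node_demotion[MAX_NUMNODES] __read_mostly =
		{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};

since the table should presumably only be rewritten at hotplug time but 
read on every demotion.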

> +
> +/**
> + * next_demotion_node() - Get the next node in the demotion path
> + * @node: The starting node to lookup the next node
> + *
> + * @returns: node id for next memory node in the demotion path hierarchy
> + * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
> + * @node online or guarantee that it *continues* to be the next demotion
> + * target.
> + */
> +int next_demotion_node(int node)
> +{
> + return node_demotion[node];
> +}
> +
>  /*
>   * Obtain the lock on page, remove all ptes and migrate the page
>   * to the newly allocated page in newpage.
> _
> 


Re: [RFC][PATCH 00/13] [v5] Migrate Pages in lieu of discard

2021-01-30 Thread David Rientjes
On Mon, 25 Jan 2021, Dave Hansen wrote:

> This also contains a few prerequisite patches that fix up an issue
> with the vm.zone_reclaim_mode sysctl ABI.
> 

I think these patches (patches 1-3) can be staged in -mm now since they 
fix vm.zone_reclaim_mode correctness and consistency.

Andrew, would it be possible to take patches 1-3 now since they are fix an 
existing issue?


Re: [RFC][PATCH 03/13] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks

2021-01-30 Thread David Rientjes
On Mon, 25 Jan 2021, Dave Hansen wrote:

> 
> From: Dave Hansen 
> 
> RECLAIM_ZONE was assumed to be unused because it was never explicitly
> used in the kernel.  However, there were a number of places where it
> was checked implicitly by checking 'node_reclaim_mode' for a zero
> value.
> 
> These zero checks are not great because it is not obvious what a zero
> mode *means* in the code.  Replace them with a helper which makes it
> more obvious: node_reclaim_enabled().
> 
> This helper also provides a handy place to explicitly check the
> RECLAIM_ZONE bit itself.  Check it explicitly there to make it more
> obvious where the bit can affect behavior.
> 
> This should have no functional impact.
> 
> Signed-off-by: Dave Hansen 
> Reviewed-by: Ben Widawsky 
> Acked-by: Christoph Lameter 
> Cc: Alex Shi 
> Cc: "Tobin C. Harding" 
> Cc: Christoph Lameter 
> Cc: Andrew Morton 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: Qian Cai 
> Cc: Daniel Wagner 
> Cc: osalvador 

Acked-by: David Rientjes 


Re: [PATCH] drm/ttm: Use __GFP_NOWARN for huge pages in ttm_pool_alloc_page

2021-01-30 Thread David Rientjes
On Thu, 28 Jan 2021, Michel Dänzer wrote:

> From: Michel Dänzer 
> 
> Without __GFP_NOWARN, attempts at allocating huge pages can trigger
> dmesg splats like below (which are essentially noise, since TTM falls
> back to normal pages if it can't get a huge one).
> 
> [ 9556.710241] clinfo: page allocation failure: order:9, 
> mode:0x194dc2(GFP_HIGHUSER|__GFP_RETRY_MAYFAIL|__GFP_NORETRY|__GFP_ZERO|__GFP_NOMEMALLOC),
>  nodemask=(null),cpuset=user.slice,mems_allowed=0
> [ 9556.710259] CPU: 1 PID: 470821 Comm: clinfo Tainted: GE 
> 5.10.10+ #4
> [ 9556.710264] Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 
> TOMAHAWK (MS-7A34), BIOS 1.OR 11/29/2019
> [ 9556.710268] Call Trace:
> [ 9556.710281]  dump_stack+0x6b/0x83
> [ 9556.710288]  warn_alloc.cold+0x7b/0xdf
> [ 9556.710297]  ? __alloc_pages_direct_compact+0x137/0x150
> [ 9556.710303]  __alloc_pages_slowpath.constprop.0+0xc1b/0xc50
> [ 9556.710312]  __alloc_pages_nodemask+0x2ec/0x320
> [ 9556.710325]  ttm_pool_alloc+0x2e4/0x5e0 [ttm]
> [ 9556.710332]  ? kvmalloc_node+0x46/0x80
> [ 9556.710341]  ttm_tt_populate+0x37/0xe0 [ttm]
> [ 9556.710350]  ttm_bo_handle_move_mem+0x142/0x180 [ttm]
> [ 9556.710359]  ttm_bo_validate+0x11d/0x190 [ttm]
> [ 9556.710391]  ? drm_vma_offset_add+0x2f/0x60 [drm]
> [ 9556.710399]  ttm_bo_init_reserved+0x2a7/0x320 [ttm]
> [ 9556.710529]  amdgpu_bo_do_create+0x1b8/0x500 [amdgpu]
> [ 9556.710657]  ? amdgpu_bo_subtract_pin_size+0x60/0x60 [amdgpu]
> [ 9556.710663]  ? get_page_from_freelist+0x11f9/0x1450
> [ 9556.710789]  amdgpu_bo_create+0x40/0x270 [amdgpu]
> [ 9556.710797]  ? _raw_spin_unlock+0x16/0x30
> [ 9556.710927]  amdgpu_gem_create_ioctl+0x123/0x310 [amdgpu]
> [ 9556.711062]  ? amdgpu_gem_force_release+0x150/0x150 [amdgpu]
> [ 9556.711098]  drm_ioctl_kernel+0xaa/0xf0 [drm]
> [ 9556.711133]  drm_ioctl+0x20f/0x3a0 [drm]
> [ 9556.711267]  ? amdgpu_gem_force_release+0x150/0x150 [amdgpu]
> [ 9556.711276]  ? preempt_count_sub+0x9b/0xd0
> [ 9556.711404]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
> [ 9556.711411]  __x64_sys_ioctl+0x83/0xb0
> [ 9556.711417]  do_syscall_64+0x33/0x80
> [ 9556.711421]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Fixes: bf9eee249ac2 ("drm/ttm: stop using GFP_TRANSHUGE_LIGHT")
> Signed-off-by: Michel Dänzer 

Acked-by: David Rientjes 

Mikhail Gavrilov  reported the same issue.

Re: [bug] 5.11-rc5 brought page allocation failure issue [ttm][amdgpu]

2021-01-30 Thread David Rientjes
On Sat, 30 Jan 2021, David Rientjes wrote:

> On Sun, 31 Jan 2021, Mikhail Gavrilov wrote:
> 
> > The 5.11-rc5 (git 76c057c84d28) brought a new issue.
> > Now the kernel log is flooded with the message "page allocation failure".
> > 
> > Trace:
> > msedge:cs0: page allocation failure: order:10,
> 
> Order-10, wow!
> 
> ttm_pool_alloc() will start at order-10 and back off trying smaller orders 
> if necessary.  This is a regression introduced in
> 
> commit bf9eee249ac2032521677dd74e31ede5429afbc0
> Author: Christian König 
> Date:   Wed Jan 13 14:02:04 2021 +0100
> 
> drm/ttm: stop using GFP_TRANSHUGE_LIGHT
> 
> Namely, it removed the __GFP_NOWARN that we otherwise require.  I'll send 
> a patch in reply.
> 

Looks like Michel Dänzer  already sent a patch that 
should fix this:
https://lore.kernel.org/lkml/20210128095346.2421-1-mic...@daenzer.net/

Re: [bug] 5.11-rc5 brought page allocation failure issue [ttm][amdgpu]

2021-01-30 Thread David Rientjes
On Sun, 31 Jan 2021, Mikhail Gavrilov wrote:

> The 5.11-rc5 (git 76c057c84d28) brought a new issue.
> Now the kernel log is flooded with the message "page allocation failure".
> 
> Trace:
> msedge:cs0: page allocation failure: order:10,

Order-10, wow!

ttm_pool_alloc() will start at order-10 and back off trying smaller orders 
if necessary.  This is a regression introduced in

commit bf9eee249ac2032521677dd74e31ede5429afbc0
Author: Christian König 
Date:   Wed Jan 13 14:02:04 2021 +0100

drm/ttm: stop using GFP_TRANSHUGE_LIGHT

Namely, it removed the __GFP_NOWARN that we otherwise require.  I'll send 
a patch in reply.
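
Presumably something as simple as re-adding __GFP_NOWARN to the 
opportunistic higher-order attempts, e.g. (untested):

	/* in ttm_pool_alloc_page() */
	if (order)
		gfp_flags |= __GFP_NOWARN;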

> mode:0x190cc2(GFP_HIGHUSER|__GFP_NORETRY|__GFP_NOMEMALLOC),
> nodemask=(null),cpuset=/,mems_allowed=0
> CPU: 18 PID: 4540 Comm: msedge:cs0 Tainted: GW
> - ---  5.11.0-0.rc5.20210128git76c057c84d28.138.fc34.x86_64 #1
> Hardware name: System manufacturer System Product Name/ROG STRIX
> X570-I GAMING, BIOS 3402 01/13/2021
> Call Trace:
>  dump_stack+0x8b/0xb0
>  warn_alloc.cold+0x72/0xd6
>  ? _cond_resched+0x16/0x50
>  ? __alloc_pages_direct_compact+0x1a1/0x210
>  __alloc_pages_slowpath.constprop.0+0xf64/0xf90
>  ? kmem_cache_alloc+0x299/0x310
>  ? lock_acquire+0x173/0x380
>  ? trace_hardirqs_on+0x1b/0xe0
>  ? lock_release+0x1e9/0x400
>  __alloc_pages_nodemask+0x37d/0x400
>  ttm_pool_alloc+0x2a3/0x630 [ttm]
>  ttm_tt_populate+0x37/0xe0 [ttm]
>  ttm_bo_handle_move_mem+0x142/0x180 [ttm]
>  ttm_bo_evict+0x12e/0x1b0 [ttm]
>  ? kfree+0xeb/0x660
>  ? amdgpu_vram_mgr_new+0x34d/0x3d0 [amdgpu]
>  ttm_mem_evict_first+0x101/0x4d0 [ttm]
>  ttm_bo_mem_space+0x2c8/0x330 [ttm]
>  ttm_bo_validate+0x163/0x1c0 [ttm]
>  amdgpu_cs_bo_validate+0x82/0x190 [amdgpu]
>  amdgpu_cs_list_validate+0x105/0x150 [amdgpu]
>  amdgpu_cs_ioctl+0x803/0x1ef0 [amdgpu]
>  ? trace_hardirqs_off_caller+0x41/0xd0
>  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
>  drm_ioctl_kernel+0x8c/0xe0 [drm]
>  drm_ioctl+0x20f/0x3c0 [drm]
>  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
>  ? selinux_file_ioctl+0x147/0x200
>  ? lock_acquired+0x1fa/0x380
>  ? lock_release+0x1e9/0x400
>  ? trace_hardirqs_on+0x1b/0xe0
>  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
>  __x64_sys_ioctl+0x82/0xb0
>  do_syscall_64+0x33/0x40
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7f829c36c11b
> Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c
> c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d
> 01 f0 ff ff 73 01 c3 48 8b 0d 25 bd 0c 00 f7 d8 64 89 01 48
> RSP: 002b:7f8282c14f38 EFLAGS: 0246 ORIG_RAX: 0010
> RAX: ffda RBX: 7f8282c14fa0 RCX: 7f829c36c11b
> RDX: 7f8282c14fa0 RSI: c0186444 RDI: 0018
> RBP: c0186444 R08: 7f8282c15640 R09: 7f8282c14f80
> R10:  R11: 0246 R12: 1f592c0fe088
> R13: 0018 R14:  R15: fffd
> Mem-Info:
> active_anon:24325 inactive_anon:3569299 isolated_anon:0
>  active_file:704540 inactive_file:2709725 isolated_file:0
>  unevictable:1230 dirty:256317 writeback:7074
>  slab_reclaimable:222328 slab_unreclaimable:112852
>  mapped:838359 shmem:469422 pagetables:47722 bounce:0
>  free:107165 free_pcp:1298 free_cma:0
> Node 0 active_anon:97300kB inactive_anon:14277196kB
> active_file:2818160kB inactive_file:10838900kB unevictable:4920kB
> isolated(anon):0kB isolated(file):0kB mapped:3353436kB dirty:1025268kB
> writeback:28296kB shmem:1877688kB shmem_thp: 0kB shmem_pmdmapped: 0kB
> anon_thp: 0kB writeback_tmp:0kB kernel_stack:62528kB
> pagetables:190888kB all_unreclaimable? no
> Node 0 DMA free:11800kB min:32kB low:44kB high:56kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB
> present:15992kB managed:15900kB mlocked:0kB bounce:0kB free_pcp:0kB
> local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 3056 31787 31787 31787
> Node 0 DMA32 free:303044kB min:6492kB low:9620kB high:12748kB
> reserved_highatomic:0KB active_anon:20kB inactive_anon:1322808kB
> active_file:5136kB inactive_file:483136kB unevictable:0kB
> writepending:220876kB present:3314552kB managed:3246620kB mlocked:0kB
> bounce:0kB free_pcp:4kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 0 28731 28731 28731
> Node 0 Normal free:113816kB min:61052kB low:90472kB high:119892kB
> reserved_highatomic:0KB active_anon:97280kB inactive_anon:12953852kB
> active_file:2812656kB inactive_file:10355000kB unevictable:4920kB
> writepending:832688kB present:30133248kB managed:29421044kB
> mlocked:4920kB bounce:0kB free_pcp:5180kB local_pcp:4kB free_cma:0kB
> lowmem_reserve[]: 0 0 0 0 0
> Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U)
> 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11800kB
> Node 0 DMA32: 1009*4kB (UME) 724*8kB (UME) 488*16kB (UME) *32kB
> (UME) 950*64kB (UME) 620*128kB (UME) 223*256kB (UME) 74*512kB (M)
> 11*1024kB (M) 2*2048kB (ME) 0*4096kB = 303684kB
> Node 

Re: [PATCH 03/14] cxl/mem: Find device capabilities

2021-01-30 Thread David Rientjes
On Fri, 29 Jan 2021, Ben Widawsky wrote:

> +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> +{
> + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> +
> + cxlm->mbox.payload_size =
> + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> +
> + /* 8.2.8.4.3 */
> + if (cxlm->mbox.payload_size < 256) {
> + dev_err(&cxlm->pdev->dev, "Mailbox is too small (%zub)",
> + cxlm->mbox.payload_size);
> + return -ENXIO;
> + }

Any reason not to check cxlm->mbox.payload_size > (1 << 20) as well and 
return ENXIO if true?
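
I.e., something like this (untested, SZ_1M from linux/sizes.h), since 
8.2.8.4.3 bounds the payload size to [256B, 1MB]:

	/* 8.2.8.4.3: mailbox payload size must be between 256B and 1MB */
	if (cxlm->mbox.payload_size < 256 ||
	    cxlm->mbox.payload_size > SZ_1M) {
		dev_err(&cxlm->pdev->dev, "Invalid mailbox size (%zub)",
			cxlm->mbox.payload_size);
		return -ENXIO;
	}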


Re: [PATCH 05/14] cxl/mem: Register CXL memX devices

2021-01-30 Thread David Rientjes
On Fri, 29 Jan 2021, Ben Widawsky wrote:

> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl 
> b/Documentation/ABI/testing/sysfs-bus-cxl
> new file mode 100644
> index ..fe7b87eba988
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -0,0 +1,26 @@
> +What:/sys/bus/cxl/devices/memX/firmware_version
> +Date:December, 2020
> +KernelVersion:   v5.12
> +Contact: linux-...@vger.kernel.org
> +Description:
> + (RO) "FW Revision" string as reported by the Identify
> + Memory Device Output Payload in the CXL-2.0
> + specification.
> +
> +What:/sys/bus/cxl/devices/memX/ram/size
> +Date:December, 2020
> +KernelVersion:   v5.12
> +Contact: linux-...@vger.kernel.org
> +Description:
> + (RO) "Volatile Only Capacity" as reported by the
> + Identify Memory Device Output Payload in the CXL-2.0
> + specification.
> +
> +What:/sys/bus/cxl/devices/memX/pmem/size
> +Date:December, 2020
> +KernelVersion:   v5.12
> +Contact: linux-...@vger.kernel.org
> +Description:
> + (RO) "Persistent Only Capacity" as reported by the
> + Identify Memory Device Output Payload in the CXL-2.0
> + specification.

Aren't volatile and persistent capacities expressed in multiples of 256MB?


Re: [PATCH 04/14] cxl/mem: Implement polled mode mailbox

2021-01-30 Thread David Rientjes
On Fri, 29 Jan 2021, Ben Widawsky wrote:

> Provide enough functionality to utilize the mailbox of a memory device.
> The mailbox is used to interact with the firmware running on the memory
> device.
> 
> The CXL specification defines separate capabilities for the mailbox and
> the memory device. The mailbox interface has a doorbell to indicate
> ready to accept commands and the memory device has a capability register
> that indicates the mailbox interface is ready. The expectation is that
> the doorbell-ready is always later than the memory-device-indication
> that the mailbox is ready.
> 
> Create a function to handle sending a command, optionally with a
> payload, to the memory device, polling on a result, and then optionally
> copying out the payload. The algorithm for doing this comes straight out
> of the CXL 2.0 specification.
> 
> Primary mailboxes are capable of generating an interrupt when submitting
> a command in the background. That implementation is saved for a later
> time.
> 
> Secondary mailboxes aren't implemented at this time.
> 
> The flow is proven with one implemented command, "identify". Because the
> class code has already told the driver this is a memory device and the
> identify command is mandatory.
> 
> Signed-off-by: Ben Widawsky 
> ---
>  drivers/cxl/Kconfig |  14 ++
>  drivers/cxl/cxl.h   |  39 +
>  drivers/cxl/mem.c   | 342 +++-
>  3 files changed, 394 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 3b66b46af8a0..fe591f74af96 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -32,4 +32,18 @@ config CXL_MEM
> Chapter 2.3 Type 3 CXL Device in the CXL 2.0 specification.
>  
> If unsure say 'm'.
> +
> +config CXL_MEM_INSECURE_DEBUG
> + bool "CXL.mem debugging"
> + depends on CXL_MEM
> + help
> +   Enable debug of all CXL command payloads.
> +
> +   Some CXL devices and controllers support encryption and other
> +   security features. The payloads for the commands that enable
> +   those features may contain sensitive clear-text security
> +   material. Disable debug of those command payloads by default.
> +   If you are a kernel developer actively working on CXL
> +   security enabling say Y, otherwise say N.

Not specific to this patch, but the reference to encryption made me 
curious about integrity: are all CXL.mem devices compatible with DIMP? 
Some?  None?

> +
>  endif
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index a3da7f8050c4..df3d97154b63 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -31,9 +31,36 @@
>  #define CXLDEV_MB_CAPS_OFFSET 0x00
>  #define   CXLDEV_MB_CAP_PAYLOAD_SIZE_MASK GENMASK(4, 0)
>  #define CXLDEV_MB_CTRL_OFFSET 0x04
> +#define   CXLDEV_MB_CTRL_DOORBELL BIT(0)
>  #define CXLDEV_MB_CMD_OFFSET 0x08
> +#define   CXLDEV_MB_CMD_COMMAND_OPCODE_MASK GENMASK(15, 0)
> +#define   CXLDEV_MB_CMD_PAYLOAD_LENGTH_MASK GENMASK(36, 16)
>  #define CXLDEV_MB_STATUS_OFFSET 0x10
> +#define   CXLDEV_MB_STATUS_RET_CODE_MASK GENMASK(47, 32)
>  #define CXLDEV_MB_BG_CMD_STATUS_OFFSET 0x18
> +#define CXLDEV_MB_PAYLOAD_OFFSET 0x20
> +
> +/* Memory Device (CXL 2.0 - 8.2.8.5.1.1) */
> +#define CXLMDEV_STATUS_OFFSET 0x0
> +#define   CXLMDEV_DEV_FATAL BIT(0)
> +#define   CXLMDEV_FW_HALT BIT(1)
> +#define   CXLMDEV_STATUS_MEDIA_STATUS_MASK GENMASK(3, 2)
> +#define CXLMDEV_MS_NOT_READY 0
> +#define CXLMDEV_MS_READY 1
> +#define CXLMDEV_MS_ERROR 2
> +#define CXLMDEV_MS_DISABLED 3
> +#define   CXLMDEV_READY(status) \
> + (CXL_GET_FIELD(status, CXLMDEV_STATUS_MEDIA_STATUS) == 
> CXLMDEV_MS_READY)
> +#define   CXLMDEV_MBOX_IF_READY BIT(4)
> +#define   CXLMDEV_RESET_NEEDED_SHIFT 5
> +#define   CXLMDEV_RESET_NEEDED_MASK GENMASK(7, 5)
> +#define CXLMDEV_RESET_NEEDED_NOT 0
> +#define CXLMDEV_RESET_NEEDED_COLD 1
> +#define CXLMDEV_RESET_NEEDED_WARM 2
> +#define CXLMDEV_RESET_NEEDED_HOT 3
> +#define CXLMDEV_RESET_NEEDED_CXL 4
> +#define   CXLMDEV_RESET_NEEDED(status) \
> + (CXL_GET_FIELD(status, CXLMDEV_RESET_NEEDED) != 
> CXLMDEV_RESET_NEEDED_NOT)
>  
>  /**
>   * struct cxl_mem - A CXL memory device
> @@ -44,6 +71,16 @@ struct cxl_mem {
>   struct pci_dev *pdev;
>   void __iomem *regs;
>  
> + struct {
> + struct range range;
> + } pmem;
> +
> + struct {
> + struct range range;
> + } ram;
> +
> + char firmware_version[0x10];
> +
>   /* Cap 0001h - CXL_CAP_CAP_ID_DEVICE_STATUS */
>   struct {
>   void __iomem *regs;
> @@ -51,6 +88,7 @@ struct cxl_mem {
>  
>   /* Cap 0002h - CXL_CAP_CAP_ID_PRIMARY_MAILBOX */
>   struct {
> + struct mutex mutex; /* Protects device mailbox and firmware */
>   void __iomem *regs;
>   size_t payload_size;
>   } mbox;
> @@ -89,5 +127,6 @@ struct cxl_mem {
>  
>  cxl_reg(status);
>  cxl_reg(mbox);
>

Re: [PATCH 02/14] cxl/mem: Map memory device registers

2021-01-30 Thread David Rientjes
On Fri, 29 Jan 2021, Ben Widawsky wrote:

> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> new file mode 100644
> index ..d81d0ba4617c
> --- /dev/null
> +++ b/drivers/cxl/cxl.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/* Copyright(c) 2020 Intel Corporation. */
> +
> +#ifndef __CXL_H__
> +#define __CXL_H__
> +
> +/**
> + * struct cxl_mem - A CXL memory device
> + * @pdev: The PCI device associated with this CXL device.
> + * @regs: IO mappings to the device's MMIO
> + */
> +struct cxl_mem {
> + struct pci_dev *pdev;
> + void __iomem *regs;
> +};
> +
> +#endif

Stupid question: can there be more than one CXL.mem capable logical 
device?  I only ask to determine if an ordinal is needed to enumerate 
multiple LDs.

> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index f4ee9a507ac9..a869c8dc24cc 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -4,6 +4,58 @@
>  #include 
>  #include 
>  #include "pci.h"
> +#include "cxl.h"
> +
> +/**
> + * cxl_mem_create() - Create a new &struct cxl_mem.
> + * @pdev: The pci device associated with the new &struct cxl_mem.
> + * @reg_lo: Lower 32b of the register locator
> + * @reg_hi: Upper 32b of the register locator.
> + *
> + * Return: The new &struct cxl_mem on success, NULL on failure.
> + *
> + * Map the BAR for a CXL memory device. This BAR has the memory device's
> + * registers for the device as specified in CXL specification.
> + */
> +static struct cxl_mem *cxl_mem_create(struct pci_dev *pdev, u32 reg_lo,
> +   u32 reg_hi)
> +{
> + struct device *dev = &pdev->dev;
> + struct cxl_mem *cxlm;
> + void __iomem *regs;
> + u64 offset;
> + u8 bar;
> + int rc;
> +
> + offset = ((u64)reg_hi << 32) | (reg_lo & CXL_REGLOC_ADDR_MASK);
> + bar = (reg_lo >> CXL_REGLOC_BIR_SHIFT) & CXL_REGLOC_BIR_MASK;
> +
> + /* Basic sanity check that BAR is big enough */
> + if (pci_resource_len(pdev, bar) < offset) {
> + dev_err(dev, "BAR%d: %pr: too small (offset: %#llx)\n", bar,
> + &pdev->resource[bar], (unsigned long long)offset);
> + return NULL;
> + }
> +
> + rc = pcim_iomap_regions(pdev, BIT(bar), pci_name(pdev));
> + if (rc != 0) {
> + dev_err(dev, "failed to map registers\n");
> + return NULL;
> + }
> +
> + cxlm = devm_kzalloc(&pdev->dev, sizeof(*cxlm), GFP_KERNEL);
> + if (!cxlm) {
> + dev_err(dev, "No memory available\n");
> + return NULL;
> + }
> +
> + regs = pcim_iomap_table(pdev)[bar];
> + cxlm->pdev = pdev;
> + cxlm->regs = regs + offset;
> +
> + dev_dbg(dev, "Mapped CXL Memory Device resource\n");
> + return cxlm;
> +}
>  
>  static int cxl_mem_dvsec(struct pci_dev *pdev, int dvsec)
>  {
> @@ -32,15 +84,42 @@ static int cxl_mem_dvsec(struct pci_dev *pdev, int dvsec)
>  static int cxl_mem_probe(struct pci_dev *pdev, const struct pci_device_id 
> *id)
>  {
>   struct device *dev = &pdev->dev;
> - int regloc;
> + struct cxl_mem *cxlm;
> + int rc, regloc, i;
> +
> + rc = pcim_enable_device(pdev);
> + if (rc)
> + return rc;
>  
>   regloc = cxl_mem_dvsec(pdev, PCI_DVSEC_ID_CXL_REGLOC);
>   if (!regloc) {
>   dev_err(dev, "register location dvsec not found\n");
>   return -ENXIO;
>   }
> + regloc += 0xc; /* Skip DVSEC + reserved fields */

Assuming the DVSEC revision number is always 0x0 or there's no value in 
storing this in struct cxl_mem for the future.

Acked-by: David Rientjes 


Re: [PATCH 01/14] cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints

2021-01-30 Thread David Rientjes
On Fri, 29 Jan 2021, Ben Widawsky wrote:

> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> new file mode 100644
> index ..3b66b46af8a0
> --- /dev/null
> +++ b/drivers/cxl/Kconfig
> @@ -0,0 +1,35 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +menuconfig CXL_BUS
> + tristate "CXL (Compute Express Link) Devices Support"
> + depends on PCI
> + help
> +   CXL is a bus that is electrically compatible with PCI Express, but
> +   layers three protocols on that signalling (CXL.io, CXL.cache, and
> +   CXL.mem). The CXL.cache protocol allows devices to hold cachelines
> +   locally, the CXL.mem protocol allows devices to be fully coherent
> +   memory targets, the CXL.io protocol is equivalent to PCI Express.
> +   Say 'y' to enable support for the configuration and management of
> +   devices supporting these protocols.
> +
> +if CXL_BUS
> +
> +config CXL_MEM
> + tristate "CXL.mem: Endpoint Support"

Nit: "CXL.mem: Memory Devices" or "CXL Memory Devices: CXL.mem" might look 
better, but feel free to ignore.

Acked-by: David Rientjes 


Re: [PATCH] hugetlbfs: show pagesize in unit of GB if possible

2021-01-30 Thread David Rientjes
On Sat, 30 Jan 2021, Miaohe Lin wrote:

> Hugepage size in unit of GB is supported. We could show pagesize in unit of
> GB to make it more friendly to read. Also rework the calculation code of
> page size unit to make it more readable.
> 
> Signed-off-by: Miaohe Lin 
> ---
>  fs/hugetlbfs/inode.c | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 3a08fbae3b53..40a9795f250a 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -1014,11 +1014,15 @@ static int hugetlbfs_show_options(struct seq_file *m, 
> struct dentry *root)
>   if (sbinfo->max_inodes != -1)
>   seq_printf(m, ",nr_inodes=%lu", sbinfo->max_inodes);
>  
> - hpage_size /= 1024;
> - mod = 'K';
> - if (hpage_size >= 1024) {
> - hpage_size /= 1024;
> + if (hpage_size >= SZ_1G) {
> + hpage_size /= SZ_1G;
> + mod = 'G';
> + } else if (hpage_size >= SZ_1M) {
> + hpage_size /= SZ_1M;
>   mod = 'M';
> + } else {
> + hpage_size /= SZ_1K;
> + mod = 'K';
>   }
>   seq_printf(m, ",pagesize=%lu%c", hpage_size, mod);
>   if (spool) {

NACK, this can break existing userspace parsing.


Re: [PATCH v3] mm/page_alloc: count CMA pages per zone and print them in /proc/zoneinfo

2021-01-30 Thread David Rientjes
On Fri, 29 Jan 2021, David Hildenbrand wrote:

> Let's count the number of CMA pages per zone and print them in
> /proc/zoneinfo.
> 
> Having access to the total number of CMA pages per zone is helpful for
> debugging purposes to know where exactly the CMA pages ended up, and to
> figure out how many pages of a zone might behave differently, even after
> some of these pages might already have been allocated.
> 
> As one example, CMA pages part of a kernel zone cannot be used for
> ordinary kernel allocations but instead behave more like ZONE_MOVABLE.
> 
> For now, we are only able to get the global nr+free cma pages from
> /proc/meminfo and the free cma pages per zone from /proc/zoneinfo.
> 
> Example after this patch when booting a 6 GiB QEMU VM with
> "hugetlb_cma=2G":
>   # cat /proc/zoneinfo | grep cma
>   cma  0
> nr_free_cma  0
>   cma  0
> nr_free_cma  0
>   cma  524288
> nr_free_cma  493016
>   cma  0
>   cma  0
>   # cat /proc/meminfo | grep Cma
>   CmaTotal:2097152 kB
>   CmaFree: 1972064 kB
> 
> Note: We print even without CONFIG_CMA, just like "nr_free_cma"; this way,
>   one can be sure when spotting "cma 0", that there are definetly no
>   CMA pages located in a zone.
> 
> Cc: Andrew Morton 
> Cc: Thomas Gleixner 
> Cc: "Peter Zijlstra (Intel)" 
> Cc: Mike Rapoport 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: David Rientjes 
> Cc: linux-...@vger.kernel.org
> Signed-off-by: David Hildenbrand 

Acked-by: David Rientjes 


Re: [PATCH v2 net-next 3/4] net: introduce common dev_page_is_reserved()

2021-01-28 Thread David Rientjes
On Wed, 27 Jan 2021, Alexander Lobakin wrote:

> A bunch of drivers test the page before reusing/recycling for two
> common conditions:
>  - if a page was allocated under memory pressure (pfmemalloc page);
>  - if a page was allocated at a distant memory node (to exclude
>slowdowns).
> 
> Introduce and use a new common function for doing this and eliminate
> all functions-duplicates from drivers.
> 
> Suggested-by: David Rientjes 
> Signed-off-by: Alexander Lobakin 

Looks even better than I thought!

(Since all of the changes are in drivers/net/ethernet/, I assume 
everything directly or indirectly includes skbuff.h already.)

Acked-by: David Rientjes 

Thanks for doing this.


Re: [PATCH v2 net-next 4/4] net: page_pool: simplify page recycling condition tests

2021-01-28 Thread David Rientjes
On Wed, 27 Jan 2021, Alexander Lobakin wrote:

> pool_page_reusable() is a leftover from pre-NUMA-aware times. For now,
> this function is just a redundant wrapper over page_is_pfmemalloc(),
> so Inline it into its sole call site.
> 
> Signed-off-by: Alexander Lobakin 
> Acked-by: Jesper Dangaard Brouer 
> Reviewed-by: Ilias Apalodimas 

Acked-by: David Rientjes 


Re: [PATCH v2 net-next 2/4] skbuff: constify skb_propagate_pfmemalloc() "page" argument

2021-01-28 Thread David Rientjes
On Wed, 27 Jan 2021, Alexander Lobakin wrote:

> The function doesn't write anything to the page struct itself,
> so this argument can be const.
> 
> Misc: align second argument to the brace while at it.
> 
> Signed-off-by: Alexander Lobakin 

Acked-by: David Rientjes 


Re: [PATCH v2 net-next 1/4] mm: constify page_is_pfmemalloc() argument

2021-01-28 Thread David Rientjes
On Wed, 27 Jan 2021, Alexander Lobakin wrote:

> The function only tests for page->index, so its argument should be
> const.
> 
> Signed-off-by: Alexander Lobakin 

Acked-by: David Rientjes 


Re: [PATCH 1/3] mm, slub: use pGp to print page flags

2021-01-28 Thread David Rientjes
On Thu, 28 Jan 2021, Yafang Shao wrote:

> As pGp has been already introduced in printk, we'd better use it to make
> the output human readable.
> 
> Before this change, the output is,
> [ 6155.716018] INFO: Slab 0x4027dd4f objects=33 used=3 
> fp=0x8cd1579c flags=0x17c0010200
> 
> While after this change, the output is,
> [ 8846.517809] INFO: Slab 0xf42a2c60 objects=33 used=3 
> fp=0x60d32ca8 flags=0x17c0010200(slab|head)
> 
> Reviewed-by: David Hildenbrand 
> Signed-off-by: Yafang Shao 

Acked-by: David Rientjes 


Re: [PATCH v2] mm/page_alloc: count CMA pages per zone and print them in /proc/zoneinfo

2021-01-28 Thread David Rientjes
On Thu, 28 Jan 2021, David Hildenbrand wrote:

> > On Thu, 28 Jan 2021, David Hildenbrand wrote:
> > 
> >> diff --git a/mm/vmstat.c b/mm/vmstat.c
> >> index 7758486097f9..957680db41fa 100644
> >> --- a/mm/vmstat.c
> >> +++ b/mm/vmstat.c
> >> @@ -1650,6 +1650,11 @@ static void zoneinfo_show_print(struct seq_file *m, 
> >> pg_data_t *pgdat,
> >>   zone->spanned_pages,
> >>   zone->present_pages,
> >>   zone_managed_pages(zone));
> >> +#ifdef CONFIG_CMA
> >> +seq_printf(m,
> >> +   "\ncma  %lu",
> >> +   zone->cma_pages);
> >> +#endif
> >> 
> >>seq_printf(m,
> >>   "\nprotection: (%ld",
> > 
> > Hmm, not sure about this.  If cma is only printed for CONFIG_CMA, we can't 
> > distinguish between (1) a kernel without your patch without including some 
> > version checking and (2) a kernel without CONFIG_CMA enabled.  IOW, 
> > "cma 0" carries value: we know immediately that we do not have any CMA 
> > pages on this zone, period.
> > 
> > /proc/zoneinfo is also not known for its conciseness so I think printing 
> > "cma 0" even for !CONFIG_CMA is helpful :)
> > 
> > I think this #ifdef should be removed and it should call into a 
> > zone_cma_pages(struct zone *zone) which returns 0UL if disabled.
> > 
> 
> Yeah, that’s also what I proposed in a sub-thread here.
> 

Ah, I certainly think your original intuition was correct.

> The last option would be going the full mile and not printing nr_free_cma. 
> Code might get a bit uglier though, but we could also remove that stats 
> counter ;)
> 
> I don‘t particularly care, while printing „0“ might be easier, removing 
> nr_free_cma might be cleaner.
> 
> But then, maybe there are tools that expect that value to be around on any 
> kernel?
> 

Yeah, that's probably undue risk, the ship has sailed and there's no 
significant upside.

I still think "cma 0" in /proc/zoneinfo carries value, though, especially 
for NUMA and it looks like this is how it's done in linux-next.  With a 
single read of the file, userspace can make the determination what CMA 
pages exist on this node.

In general, I think the rule-of-thumb is that the fewer ifdefs in 
/proc/zoneinfo, the easier it is for userspace to parse it.

(I made that change to /proc/zoneinfo to even print non-existent zones for 
each node because otherwise you cannot determine what the indices of 
things like vm.lowmem_reserve_ratio represent.)

Re: [PATCH v2] mm/page_alloc: count CMA pages per zone and print them in /proc/zoneinfo

2021-01-28 Thread David Rientjes
On Thu, 28 Jan 2021, David Hildenbrand wrote:

> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 7758486097f9..957680db41fa 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1650,6 +1650,11 @@ static void zoneinfo_show_print(struct seq_file *m, 
> pg_data_t *pgdat,
>  zone->spanned_pages,
>  zone->present_pages,
>  zone_managed_pages(zone));
> +#ifdef CONFIG_CMA
> + seq_printf(m,
> +"\ncma  %lu",
> +zone->cma_pages);
> +#endif
>  
>   seq_printf(m,
>  "\nprotection: (%ld",

Hmm, not sure about this.  If cma is only printed for CONFIG_CMA, we can't 
distinguish between (1) a kernel without your patch without including some 
version checking and (2) a kernel without CONFIG_CMA enabled.  IOW, 
"cma 0" carries value: we know immediately that we do not have any CMA 
pages on this zone, period.

/proc/zoneinfo is also not known for its conciseness so I think printing 
"cma 0" even for !CONFIG_CMA is helpful :)

I think this #ifdef should be removed and it should call into a 
zone_cma_pages(struct zone *zone) which returns 0UL if disabled.
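
E.g. (rough sketch, untested):

/* in include/linux/mmzone.h, for example */
static inline unsigned long zone_cma_pages(struct zone *zone)
{
#ifdef CONFIG_CMA
	return zone->cma_pages;
#else
	return 0;
#endif
}

and then zoneinfo_show_print() can print "cma %lu" unconditionally.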


Re: [PATCH] mm, slub: remove slub_memcg_sysfs boot param and CONFIG_SLUB_MEMCG_SYSFS_ON

2021-01-27 Thread David Rientjes
On Wed, 27 Jan 2021, Vlastimil Babka wrote:

> The boot param and config determine the value of memcg_sysfs_enabled, which is
> unused since commit 10befea91b61 ("mm: memcg/slab: use a single set of
> kmem_caches for all allocations") as there are no per-memcg kmem caches
> anymore.
> 
> Signed-off-by: Vlastimil Babka 

Acked-by: David Rientjes 


Re: [Patch v4 1/2] cgroup: svm: Add Encryption ID controller

2021-01-26 Thread David Rientjes
On Thu, 21 Jan 2021, Sean Christopherson wrote:

> True, but the expected dual-usage is more about backwards compatibility than
> anything else.  Running an SEV-ES VM requires a heavily enlightened guest 
> vBIOS
> and kernel, which means that a VM that was created as an SEV guest cannot 
> easily
> be converted to an SEV-ES guest, and it would require cooperation from the 
> guest
> (if it's even feasible?).
> 
> SEV-SNP, another incremental enhancement (on SEV-ES), further strengthens the
> argument for SEV and SEV-* coexistenence.  SEV-SNP and SEV-ES will share the
> same ASID range, so the question is really, "do we expect to run SEV guests 
> and
> any flavor of SEV-* guests on the same platform".  And due to SEV-* not being
> directly backward compatible with SEV, the answer will eventually be "yes", as
> we'll want to keep running existing SEV guest while also spinning up new SEV-*
> guests.
> 

Agreed, cloud providers will most certainly want to run both SEV and SEV-* 
guests on the same platform.

> That being said, it's certainly possible to abstract the different key types
> between AMD and Intel (assuming s390 won't use the cgroup due to it's plethora
> of keys).  TDX private keys are equivalent to SEV-ES ASIDs, and MKTME keys (if
> the kernel ever gains a user) could be thrown into the same bucket as SEV IDs,
> albeit with some minor mental gymnastics.
> 
> E.g. this mapping isn't horrendous:
> 
>   encrpytion_ids.basic.*   == SEV   == MKTME
>   encrpytion_ids.enhanced.*== SEV-* == TDX
> 
> The names will likely be a bit vague, but I don't think they'll be so far off
> that it'd be impossible for someone with SEV/TDX knowledge to glean their 
> intent.
> And realistically, if anyone gets to the point where they care about 
> controlling
> SEV or TDX IDs, they've already plowed through hundreds of pages of dense
> documentation; having to read a few lines of cgroup docs to understand basic 
> vs.
> enhanced probably won't faze them at all.
> 

The abstraction makes sense for both AMD and Intel offerings today.  It 
makes me wonder if we want a read-only 
encryption_ids.{basic,enhanced}.type file to describe the underlying 
technology ("SEV-ES/SEV-SNP", "TDX", etc).  Since the technology is 
discoverable by other means and we are assuming one encryption type per 
pool of encryption ids, we likely don't need this.

I'm slightly concerned about extensibility if there is to be an 
incremental enhancement atop SEV-* or TDX with yet another pool of 
encryption ids.  (For example, when we only had hugepages, this name was 
perfect; then we got 1GB pages which became "gigantic pages", so are 512GB 
pages "enormous"? :)  I could argue (encryption_ids.basic.*,
encryption_ids.enhanced.*) should map to 
(encryption_ids.legacy.*, encryption_ids.*) but that's likely 
bikeshedding.

Thomas: does encryption_ids.{basic,enhanced}.* make sense for ASID 
partitioning?

Tejun: if this makes sense for legacy SEV and SEV-* per Thomas, and this 
is now abstracted to be technology (vendor) neutral, does this make sense 
to you?


Re: [External] Re: [PATCH v13 02/12] mm: hugetlb: introduce a new config HUGETLB_PAGE_FREE_VMEMMAP

2021-01-26 Thread David Rientjes
On Tue, 26 Jan 2021, Muchun Song wrote:

> > I'm not sure that Kconfig is the right place to document functional
> > behavior of the kernel, especially for non-configurable options.  Seems
> > like this is already served by existing comments added by this patch
> > series in the files where the description is helpful.
> 
> OK. So do you mean just remove the help text here?
> 

Yeah, I'd suggest removing the help text from a non-configurable Kconfig 
option and double checking that its substance is available elsewhere (like 
the giant comment in mm/hugetlb_vmemmap.c).


Re: [External] Re: [PATCH v13 02/12] mm: hugetlb: introduce a new config HUGETLB_PAGE_FREE_VMEMMAP

2021-01-26 Thread David Rientjes
On Mon, 25 Jan 2021, Muchun Song wrote:

> > >> I'm not sure I understand the rationale for providing this help text if
> > >> this is def_bool depending on CONFIG_HUGETLB_PAGE.  Are you intending 
> > >> that
> > >> this is actually configurable and we want to provide guidance to the 
> > >> admin
> > >> on when to disable it (which it currently doesn't)?  If not, why have the
> > >> help text?
> > >
> > > This is __not__ configurable. Seems like a comment to help others
> > > understand this option. Like Randy said.
> >
> > Yes, it could be written with '#' (or "comment") comment syntax instead of 
> > as help text.
> 
> Got it. I will update in the next version. Thanks.
> 

I'm not sure that Kconfig is the right place to document functional 
behavior of the kernel, especially for non-configurable options.  Seems 
like this is already served by existing comments added by this patch 
series in the files where the description is helpful.


Re: [PATCH net-next 2/3] net: constify page_is_pfmemalloc() argument at call sites

2021-01-25 Thread David Rientjes
On Mon, 25 Jan 2021, Alexander Lobakin wrote:

> Constify "page" argument for page_is_pfmemalloc() users where applicable.
> 
> Signed-off-by: Alexander Lobakin 
> ---
>  drivers/net/ethernet/hisilicon/hns3/hns3_enet.c   | 2 +-
>  drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 2 +-
>  drivers/net/ethernet/intel/iavf/iavf_txrx.c   | 2 +-
>  drivers/net/ethernet/intel/ice/ice_txrx.c | 2 +-
>  drivers/net/ethernet/intel/igb/igb_main.c | 2 +-
>  drivers/net/ethernet/intel/igc/igc_main.c | 2 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   | 2 +-
>  include/linux/skbuff.h| 4 ++--
>  11 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
> b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
> index 512080640cbc..0f8e962b5010 100644
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
> @@ -2800,7 +2800,7 @@ static void hns3_nic_alloc_rx_buffers(struct 
> hns3_enet_ring *ring,
>   writel(i, ring->tqp->io_base + HNS3_RING_RX_RING_HEAD_REG);
>  }
>  
> -static bool hns3_page_is_reusable(struct page *page)
> +static bool hns3_page_is_reusable(const struct page *page)
>  {
>   return page_to_nid(page) == numa_mem_id() &&
>   !page_is_pfmemalloc(page);

Hi Alexander,

All of these functions appear to be doing the same thing, would it make 
sense to simply add this to a header file and remove all the code 
duplication as well?
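
E.g. something like this in a common header (the name is just a strawman):

static inline bool dev_page_is_reserved(const struct page *page)
{
	/* pfmemalloc pages and pages from remote nodes aren't worth reusing */
	return unlikely(page_to_nid(page) != numa_mem_id() ||
			page_is_pfmemalloc(page));
}

so hns3_page_is_reusable() and all of its lookalikes simply become 
!dev_page_is_reserved(page).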


Re: [PATCH v13 06/12] mm: hugetlb: set the PageHWPoison to the raw error page

2021-01-24 Thread David Rientjes
On Sun, 17 Jan 2021, Muchun Song wrote:

> Because we reuse the first tail vmemmap page frame and remap it
> with read-only, we cannot set the PageHWPosion on a tail page.
> So we can use the head[4].private to record the real error page
> index and set the raw error page PageHWPoison later.
> 
> Signed-off-by: Muchun Song 
> Reviewed-by: Oscar Salvador 

Acked-by: David Rientjes 


Re: [PATCH v13 05/12] mm: hugetlb: allocate the vmemmap pages associated with each HugeTLB page

2021-01-24 Thread David Rientjes
On Sun, 17 Jan 2021, Muchun Song wrote:

> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index ce4be1fa93c2..3b146d5949f3 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -29,6 +29,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -40,7 +41,8 @@
>   * @remap_pte:   called for each non-empty PTE (lowest-level) 
> entry.
>   * @reuse_page:  the page which is reused for the tail vmemmap 
> pages.
>   * @reuse_addr:  the virtual address of the @reuse_page page.
> - * @vmemmap_pages:   the list head of the vmemmap pages that can be freed.
> + * @vmemmap_pages:   the list head of the vmemmap pages that can be freed
> + *   or is mapped from.
>   */
>  struct vmemmap_remap_walk {
>   void (*remap_pte)(pte_t *pte, unsigned long addr,
> @@ -50,6 +52,10 @@ struct vmemmap_remap_walk {
>   struct list_head *vmemmap_pages;
>  };
>  
> +/* The gfp mask of allocating vmemmap page */
> +#define GFP_VMEMMAP_PAGE \
> + (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | __GFP_THISNODE)
> +

This is unnecessary; just use the gfp mask directly in the allocator, e.g.:
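
	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_RETRY_MAYFAIL |
				     __GFP_NOWARN | __GFP_THISNODE, 0);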

>  static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> unsigned long end,
> struct vmemmap_remap_walk *walk)
> @@ -228,6 +234,75 @@ void vmemmap_remap_free(unsigned long start, unsigned 
> long end,
>   free_vmemmap_page_list(&vmemmap_pages);
>  }
>  
> +static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> + struct vmemmap_remap_walk *walk)
> +{
> + pgprot_t pgprot = PAGE_KERNEL;
> + struct page *page;
> + void *to;
> +
> + BUG_ON(pte_page(*pte) != walk->reuse_page);
> +
> + page = list_first_entry(walk->vmemmap_pages, struct page, lru);
> + list_del(&page->lru);
> + to = page_to_virt(page);
> + copy_page(to, (void *)walk->reuse_addr);
> +
> + set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> +}
> +
> +static void alloc_vmemmap_page_list(struct list_head *list,
> + unsigned long start, unsigned long end)
> +{
> + unsigned long addr;
> +
> + for (addr = start; addr < end; addr += PAGE_SIZE) {
> + struct page *page;
> + int nid = page_to_nid((const void *)addr);
> +
> +retry:
> + page = alloc_pages_node(nid, GFP_VMEMMAP_PAGE, 0);
> + if (unlikely(!page)) {
> + msleep(100);
> + /*
> +  * We should retry infinitely, because we cannot
> +  * handle allocation failures. Once we allocate
> +  * vmemmap pages successfully, then we can free
> +  * a HugeTLB page.
> +  */
> + goto retry;

Ugh, I don't think this will work, there's no guarantee that we'll ever 
succeed and now we can't free a 2MB hugepage because we cannot allocate a 
4KB page.  We absolutely have to ensure we make forward progress here.

We're going to be freeing the hugetlb page after this succeeds, so can we 
not use part of the hugetlb page that we're freeing for this memory 
instead?

> + }
> + list_add_tail(&page->lru, list);
> + }
> +}
> +
> +/**
> + * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, 
> end)
> + *to the page which is from the @vmemmap_pages
> + *respectively.
> + * @start:   start address of the vmemmap virtual address range.
> + * @end: end address of the vmemmap virtual address range.
> + * @reuse:   reuse address.
> + */
> +void vmemmap_remap_alloc(unsigned long start, unsigned long end,
> +  unsigned long reuse)
> +{
> + LIST_HEAD(vmemmap_pages);
> + struct vmemmap_remap_walk walk = {
> + .remap_pte  = vmemmap_restore_pte,
> + .reuse_addr = reuse,
> + .vmemmap_pages  = &vmemmap_pages,
> + };
> +
> + might_sleep();
> +
> + /* See the comment in the vmemmap_remap_free(). */
> + BUG_ON(start - reuse != PAGE_SIZE);
> +
> + alloc_vmemmap_page_list(&vmemmap_pages, start, end);
> + vmemmap_remap_range(reuse, end, &walk);
> +}
> +
>  /*
>   * Allocate a block of memory to be used to back the virtual memory map
>   * or to back the page tables that are used to create the mapping.
> -- 
> 2.11.0
> 
> 


Re: [PATCH v13 02/12] mm: hugetlb: introduce a new config HUGETLB_PAGE_FREE_VMEMMAP

2021-01-24 Thread David Rientjes
On Sun, 17 Jan 2021, Muchun Song wrote:

> The HUGETLB_PAGE_FREE_VMEMMAP option is used to enable the freeing
> of unnecessary vmemmap associated with HugeTLB pages. The config
> option is introduced early so that supporting code can be written
> to depend on the option. The initial version of the code only
> provides support for x86-64.
> 
> Like other code which frees vmemmap, this config option depends on
> HAVE_BOOTMEM_INFO_NODE. The routine register_page_bootmem_info() is
> used to register bootmem info. Therefore, make sure
> register_page_bootmem_info is enabled if HUGETLB_PAGE_FREE_VMEMMAP
> is defined.
> 
> Signed-off-by: Muchun Song 
> Reviewed-by: Oscar Salvador 
> Acked-by: Mike Kravetz 
> ---
>  arch/x86/mm/init_64.c |  2 +-
>  fs/Kconfig| 18 ++
>  2 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 0a45f062826e..0435bee2e172 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
>  
>  static void __init register_page_bootmem_info(void)
>  {
> -#ifdef CONFIG_NUMA
> +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
>   int i;
>  
>   for_each_online_node(i)
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 976e8b9033c4..e7c4c2a79311 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -245,6 +245,24 @@ config HUGETLBFS
>  config HUGETLB_PAGE
>   def_bool HUGETLBFS
>  
> +config HUGETLB_PAGE_FREE_VMEMMAP
> + def_bool HUGETLB_PAGE

I'm not sure I understand the rationale for providing this help text if 
this is def_bool depending on CONFIG_HUGETLB_PAGE.  Are you intending that 
this is actually configurable and we want to provide guidance to the admin 
on when to disable it (which it currently doesn't)?  If not, why have the 
help text?

> + depends on X86_64
> + depends on SPARSEMEM_VMEMMAP
> + depends on HAVE_BOOTMEM_INFO_NODE
> + help
> +   The option HUGETLB_PAGE_FREE_VMEMMAP allows for the freeing of
> +   some vmemmap pages associated with pre-allocated HugeTLB pages.
> +   For example, on X86_64 6 vmemmap pages of size 4KB each can be
> +   saved for each 2MB HugeTLB page.  4094 vmemmap pages of size 4KB
> +   each can be saved for each 1GB HugeTLB page.
> +
> +   When a HugeTLB page is allocated or freed, the vmemmap array
> +   representing the range associated with the page will need to be
> +   remapped.  When a page is allocated, vmemmap pages are freed
> +   after remapping.  When a page is freed, previously discarded
> +   vmemmap pages must be allocated before remapping.
> +
>  config MEMFD_CREATE
>   def_bool TMPFS || HUGETLBFS
>  


Re: [PATCH v13 04/12] mm: hugetlb: defer freeing of HugeTLB pages

2021-01-24 Thread David Rientjes
On Sun, 17 Jan 2021, Muchun Song wrote:

> In the subsequent patch, we should allocate the vmemmap pages when
> freeing HugeTLB pages. But update_and_free_page() is always called
> with holding hugetlb_lock, so we cannot use GFP_KERNEL to allocate
> vmemmap pages. However, we can defer the actual freeing in a kworker
> to prevent from using GFP_ATOMIC to allocate the vmemmap pages.
> 
> The update_hpage_vmemmap_workfn() is where the call to allocate
> vmemmmap pages will be inserted.
> 

I think it's reasonable to assume that userspace can release free hugetlb 
pages from the pool on oom conditions when reclaim has become too 
expensive.  This approach now requires that we can allocate vmemmap pages 
in a potential oom condition as a prerequisite for freeing memory, which 
seems less than ideal.

And, by doing this through a kworker, we can presumably get queued behind 
another work item that requires memory to make forward progress in this 
oom condition.

Two thoughts:

- We're going to be freeing the hugetlb page after we can allocate the 
  vmemmap pages, so why do we need to allocate with GFP_KERNEL?  Can't we
  simply dip into memory reserves using GFP_ATOMIC (and thus can be 
  holding hugetlb_lock) because we know we'll be freeing more memory than
  we'll be allocating?  I think requiring a GFP_KERNEL allocation to block
  to free memory for vmemmap when we'll be freeing memory ourselves is
  dubious.  This simplifies all of this.

- If the answer is that we actually have to use GFP_KERNEL for other 
  reasons, what are your thoughts on pre-allocating the vmemmap as opposed
  to deferring to a kworker?  In other words, preallocate the necessary
  memory with GFP_KERNEL and put it on a linked list in struct hstate 
  before acquiring hugetlb_lock.
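
For the second option, a very rough (untested) sketch of what I mean, with 
made-up names:

/* called without hugetlb_lock held */
static int hugetlb_vmemmap_prealloc(struct hstate *h, struct list_head *list)
{
	int i;

	for (i = 0; i < free_vmemmap_pages_per_hpage(h); i++) {
		struct page *page = alloc_page(GFP_KERNEL);

		if (!page)
			return -ENOMEM;	/* caller frees whatever is on list */
		list_add(&page->lru, list);
	}
	return 0;
}

Then the free path only consumes pages from that list under hugetlb_lock, 
so nothing needs to be allocated (or punted to a kworker) while the lock 
is held.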

> Signed-off-by: Muchun Song 
> Reviewed-by: Mike Kravetz 
> ---
>  mm/hugetlb.c | 74 
> ++--
>  mm/hugetlb_vmemmap.c | 12 -
>  mm/hugetlb_vmemmap.h | 17 
>  3 files changed, 89 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 140135fc8113..c165186ec2cf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1292,15 +1292,85 @@ static inline void 
> destroy_compound_gigantic_page(struct page *page,
>   unsigned int order) { }
>  #endif
>  
> -static void update_and_free_page(struct hstate *h, struct page *page)
> +static void __free_hugepage(struct hstate *h, struct page *page);
> +
> +/*
> + * As update_and_free_page() is always called with holding hugetlb_lock, so 
> we
> + * cannot use GFP_KERNEL to allocate vmemmap pages. However, we can defer the
> + * actual freeing in a workqueue to prevent from using GFP_ATOMIC to allocate
> + * the vmemmap pages.
> + *
> + * The update_hpage_vmemmap_workfn() is where the call to allocate vmemmmap
> + * pages will be inserted.
> + *
> + * update_hpage_vmemmap_workfn() locklessly retrieves the linked list of 
> pages
> + * to be freed and frees them one-by-one. As the page->mapping pointer is 
> going
> + * to be cleared in update_hpage_vmemmap_workfn() anyway, it is reused as the
> + * llist_node structure of a lockless linked list of huge pages to be freed.
> + */
> +static LLIST_HEAD(hpage_update_freelist);
> +
> +static void update_hpage_vmemmap_workfn(struct work_struct *work)
>  {
> - int i;
> + struct llist_node *node;
> +
> + node = llist_del_all(&hpage_update_freelist);
> +
> + while (node) {
> + struct page *page;
> + struct hstate *h;
> +
> + page = container_of((struct address_space **)node,
> +  struct page, mapping);
> + node = node->next;
> + page->mapping = NULL;
> + h = page_hstate(page);
> +
> + spin_lock(&hugetlb_lock);
> + __free_hugepage(h, page);
> + spin_unlock(&hugetlb_lock);
>  
> + cond_resched();

Wouldn't it be better to hold hugetlb_lock for the iteration rather than 
constantly dropping it and reacquiring it?  Use 
cond_resched_lock(&hugetlb_lock) instead?
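
I.e. something like (untested):

	spin_lock(&hugetlb_lock);
	while (node) {
		struct page *page;

		page = container_of((struct address_space **)node,
				    struct page, mapping);
		node = node->next;
		page->mapping = NULL;
		__free_hugepage(page_hstate(page), page);

		/* only drops and retakes the lock when actually needed */
		cond_resched_lock(&hugetlb_lock);
	}
	spin_unlock(&hugetlb_lock);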


Re: [PATCH V3] mm/compaction: correct deferral logic for proactive compaction

2021-01-24 Thread David Rientjes
On Wed, 20 Jan 2021, Vlastimil Babka wrote:

> On 1/19/21 8:26 PM, David Rientjes wrote:
> > On Mon, 18 Jan 2021, Charan Teja Reddy wrote:
> > 
> >> should_proactive_compact_node() returns true when sum of the
> >> weighted fragmentation score of all the zones in the node is greater
> >> than the wmark_high of compaction, which then triggers the proactive
> >> compaction that operates on the individual zones of the node. But
> >> proactive compaction runs on the zone only when its weighted
> >> fragmentation score is greater than wmark_low(=wmark_high - 10).
> >> 
> >> This means that the sum of the weighted fragmentation scores of all the
> >> zones can exceed the wmark_high but individual weighted fragmentation
> >> zone scores can still be less than wmark_low which makes the unnecessary
> >> trigger of the proactive compaction only to return doing nothing.
> >> 
> >> Issue with the return of proactive compaction with out even trying is
> >> its deferral. It is simply deferred for 1 << COMPACT_MAX_DEFER_SHIFT if
> >> the scores across the proactive compaction is same, thinking that
> >> compaction didn't make any progress but in reality it didn't even try.
> > 
> > Isn't this an issue in deferred compaction as well?  It seems like 
> > deferred compaction should check that work was actually performed before 
> > deferring subsequent calls to compaction.
> 
> Direct compaction does, proactive not.
> 
> > In other words, I don't believe deferred compaction is intended to avoid 
> > checks to determine if compaction is worth it; it should only defer 
> > *additional* work that was not productive.
> 
> Yeah, that should be more optimal.
> 

Charan, is this something you'd like to follow up on, or should I take a 
look instead?

Thanks!


Re: [PATCH V3] mm/compaction: correct deferral logic for proactive compaction

2021-01-19 Thread David Rientjes
On Mon, 18 Jan 2021, Charan Teja Reddy wrote:

> should_proactive_compact_node() returns true when sum of the
> weighted fragmentation score of all the zones in the node is greater
> than the wmark_high of compaction, which then triggers the proactive
> compaction that operates on the individual zones of the node. But
> proactive compaction runs on the zone only when its weighted
> fragmentation score is greater than wmark_low(=wmark_high - 10).
> 
> This means that the sum of the weighted fragmentation scores of all the
> zones can exceed the wmark_high but individual weighted fragmentation
> zone scores can still be less than wmark_low which makes the unnecessary
> trigger of the proactive compaction only to return doing nothing.
> 
> Issue with the return of proactive compaction with out even trying is
> its deferral. It is simply deferred for 1 << COMPACT_MAX_DEFER_SHIFT if
> the scores across the proactive compaction is same, thinking that
> compaction didn't make any progress but in reality it didn't even try.

Isn't this an issue in deferred compaction as well?  It seems like 
deferred compaction should check that work was actually performed before 
deferring subsequent calls to compaction.

In other words, I don't believe deferred compaction is intended to avoid 
checks to determine if compaction is worth it; it should only defer 
*additional* work that was not productive.

Thoughts?

> With the delay between successive retries for proactive compaction is
> 500msec, it can result into the deferral for ~30sec with out even trying
> the proactive compaction.
> 
> Test scenario is that: compaction_proactiveness=50 thus the wmark_low =
> 50 and wmark_high = 60. System have 2 zones(Normal and Movable) with
> sizes 5GB and 6GB respectively. After opening some apps on the android,
> the weighted fragmentation scores of these zones are 47 and 49
> respectively. Since the sum of these fragmentation scores is above
> wmark_high, the proactive compaction is triggered, but since the
> individual zones' weighted fragmentation scores are below wmark_low, it
> returns without trying the proactive compaction. As a result the
> weighted fragmentation scores of the zones are still 47 and 49, which
> makes the existing logic defer the compaction, thinking that no
> progress was made across the compaction.
> 
> Fix this by checking just zone fragmentation score, not the weighted, in
> __compact_finished() and use the zones weighted fragmentation score in
> fragmentation_score_node(). In the test case above, if the weighted
> average is above wmark_high, then the individual (unadjusted) score of
> at least one zone has to be above wmark_high. Thus it avoids the
> unnecessary trigger and deferrals of the proactive compaction.
> 
> Fix-suggested-by: Vlastimil Babka 

Suggested-by

> Signed-off-by: Charan Teja Reddy 

Acked-by: David Rientjes 


Re: [PATCH v2] mm, oom: Fix a comment in dump_task

2021-01-15 Thread David Rientjes
On Fri, 15 Jan 2021, Tang Yizhou wrote:

> If p is a kthread, it will be checked in oom_unkillable_task() so
> we can delete the corresponding comment.
> 
> Signed-off-by: Tang Yizhou 

Acked-by: David Rientjes 


Re: [PATCH v2] mm/slub: disable user tracing for kmemleak caches by default

2021-01-15 Thread David Rientjes
On Wed, 13 Jan 2021, Johannes Berg wrote:

> From: Johannes Berg 
> 
> If kmemleak is enabled, it uses a kmem cache for its own objects.
> These objects are used to hold information kmemleak uses, including
> a stack trace. If slub_debug is also turned on, each of them has
> *another* stack trace, so the overhead adds up, and on my tests (on
> ARCH=um, admittedly) 2/3rds of the allocations end up being doing
> the stack tracing.
> 
> Turn off SLAB_STORE_USER if SLAB_NOLEAKTRACE was given, to avoid
> storing the essentially same data twice.
> 
> Signed-off-by: Johannes Berg 

Acked-by: David Rientjes 

> ---
> Perhaps instead it should go the other way around, and kmemleak
> could even use/access the stack trace that's already in there ...
> But I don't really care too much, I can just turn off slub debug
> for the kmemleak caches via the command line anyway :-)
> 

I think making the change to kmem_cache_flags() is likely the simplest.
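
i.e. something along these lines (sketch only; the argument list is
abbreviated from the real function):

	slab_flags_t kmem_cache_flags(unsigned int object_size,
				      slab_flags_t flags, const char *name)
	{
		/* ... existing slub_debug command line parsing stays as-is ... */

		/*
		 * kmemleak's own caches already record a stack trace per
		 * object, so don't also store slub's alloc/free tracking for
		 * them -- that would duplicate essentially the same data.
		 */
		if (flags & SLAB_NOLEAKTRACE)
			flags &= ~SLAB_STORE_USER;

		return flags;
	}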


Re: [PATCH] MAINTAINERS: add myself as slab allocators maintainer

2021-01-15 Thread David Rientjes
On Thu, 14 Jan 2021, Vlastimil Babka wrote:

> On 1/8/21 7:46 PM, Christoph Lameter wrote:
> > I am ok with you as a slab maintainer. I have seen some good work from
> > you.
> > 
> > Acked-by: Christoph Lameter 
> 
> Thanks!
> 

Acked-by: David Rientjes 

Great addition!


Re: [PATCH V2] x86/sev-es: Fix SEV-ES #VC handler for string port IO

2021-01-11 Thread David Rientjes
On Mon, 11 Jan 2021, Tom Lendacky wrote:

> > Don't assume dest/source buffers are userspace addresses when manually
> > copying data for string I/O or MOVS MMIO, as {get,put}_user() will fail
> > if handed a kernel address and ultimately lead to a kernel panic.
> > 
> > Signed-off-by: Hyunwook (Wooky) Baek 
> > Acked-by: David Rientjes 
> > ---
> > 
> > This patch is tested by invoking INSB/OUTSB instructions in kernel space in
> > a
> > SEV-ES-enabled VM. Without the patch, the kernel crashed with the following
> > message:
> >"SEV-ES: Unsupported exception in #VC instruction emulation - can't
> > continue"
> > With the patch, the instructions successfully read/wrote the string from/to
> > the I/O port.
> 
> Shouldn't this have a Fixes: tag?
> 

Makes sense, I think this should likely be:

Fixes: f980f9c31a92 ("x86/sev-es: Compile early handler code into kernel image")


Re: [PATCH 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-08 Thread David Rientjes
On Fri, 8 Jan 2021, Suren Baghdasaryan wrote:

> > > @@ -1197,12 +1197,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, 
> > > const struct iovec __user *, vec,
> > >   goto release_task;
> > >   }
> > >
> > > - mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> > > + /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
> > > + mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
> > >   if (IS_ERR_OR_NULL(mm)) {
> > >   ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> > >   goto release_task;
> > >   }
> > >
> > > + /*
> > > +  * Require CAP_SYS_NICE for influencing process performance. Note 
> > > that
> > > +  * only non-destructive hints are currently supported.
> > > +  */
> > > + if (!capable(CAP_SYS_NICE)) {
> > > + ret = -EPERM;
> > > + goto release_task;
> >
> > mmput?
> 
> Ouch! Thanks for pointing it out! Will include in the next respin.
> 

With the fix, feel free to add:

Acked-by: David Rientjes 

Thanks Suren!
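
For reference, the error path with the missing cleanup added would
presumably end up looking like this (illustrative only):

	/*
	 * Require CAP_SYS_NICE for influencing process performance. Note that
	 * only non-destructive hints are currently supported.
	 */
	if (!capable(CAP_SYS_NICE)) {
		ret = -EPERM;
		mmput(mm);	/* drop the reference taken via mm_access() */
		goto release_task;
	}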


Re: [PATCH] mm, slub: Consider rest of partial list if acquire_slab() fails

2020-12-28 Thread David Rientjes
On Mon, 28 Dec 2020, Jann Horn wrote:

> acquire_slab() fails if there is contention on the freelist of the page
> (probably because some other CPU is concurrently freeing an object from the
> page). In that case, it might make sense to look for a different page
> (since there might be more remote frees to the page from other CPUs, and we
> don't want contention on struct page).
> 
> However, the current code accidentally stops looking at the partial list
> completely in that case. Especially on kernels without CONFIG_NUMA set,
> this means that get_partial() fails and new_slab_objects() falls back to
> new_slab(), allocating new pages. This could lead to an unnecessary
> increase in memory fragmentation.
> 
> Fixes: 7ced37197196 ("slub: Acquire_slab() avoid loop")
> Signed-off-by: Jann Horn 

Acked-by: David Rientjes 

Indeed, it looks like commit 7ced37197196 ("slub: Acquire_slab() avoid 
loop") stopped the iteration prematurely.


Re: [PATCH] mm/slab: Perform init_on_free earlier

2020-12-10 Thread David Rientjes
On Thu, 10 Dec 2020, Alexander Popov wrote:

> Currently in CONFIG_SLAB init_on_free happens too late, and heap
> objects go to the heap quarantine not being erased.
> 
> Let's move init_on_free clearing before calling kasan_slab_free().
> In that case heap quarantine will store erased objects, similarly
> to CONFIG_SLUB=y behavior.
> 
> Signed-off-by: Alexander Popov 
> Reviewed-by: Alexander Potapenko 

Acked-by: David Rientjes 


Re: [PATCH] mm, slab, slub: clear the slab_cache field when freeing page

2020-12-10 Thread David Rientjes
On Thu, 10 Dec 2020, Vlastimil Babka wrote:

> The page allocator expects that page->mapping is NULL for a page being freed.
> SLAB and SLUB use the slab_cache field which is in union with mapping, but
> before freeing the page, the field is referenced with the "mapping" name when
> set to NULL.
> 
> It's IMHO more correct (albeit functionally the same) to use the slab_cache
> name as that's the field we use in SL*B, and document why we clear it in a
> comment (we don't clear fields such as s_mem or freelist, as page allocator
> doesn't care about those). While using the 'mapping' name would automagically
> keep the code correct if the unions in struct page changed, such changes 
> should
> be done consciously and needed changes evaluated - the comment should help 
> with
> that.
> 
> Signed-off-by: Vlastimil Babka 

Acked-by: David Rientjes 


Re: [Patch v3 0/2] cgroup: KVM: New Encryption IDs cgroup controller

2020-12-10 Thread David Rientjes
On Thu, 10 Dec 2020, Christian Borntraeger wrote:

> > * However, the boilerplate to usefulness ratio doesn't look too good and I
> >   wonder whether what we should do is adding a generic "misc" controller
> >   which can host this sort of static hierarchical counting. I'll think more
> >   on it.
> 
> We first discussed having
> encryption_ids.stat
> encryption_ids.max
> encryption_ids.current
> 
> and we added the sev in later, so that we can also have tdx, seid, sgx or 
> whatever.
> Maybe also 2 or more things at the same time.
> 
> Right now this code has
> 
> encryption_ids.sev.stat
> encryption_ids.sev.max
> encryption_ids.sev.current
> 
> And it would be trivial to extend it to have
> encryption_ids.seid.stat
> encryption_ids.seid.max
> encryption_ids.seid.current
> on s390 instead (for our secure guests).
> 
> So in the end this is almost already a misc controller, the only thing that we
> need to change is the capability to also define things other than 
> encryption.*.*
> And of course we would need to avoid adding lots of random garbage to such a 
> thing.
> 
> But if you feel ok with the burden to keep things kind of organized a misc
> controller would certainly work for the encryption ID usecase as well. 
> So I would be fine with the thing as is or a misc controller.
> 

Yeah, I think generalization of this would come in the form of either (1) 
the dumping ground of an actual "misc" controller, that you allude to, or 
(2) a kernel abstraction so you can spin up your own generic controller 
that has the {current, max, stat} support.  In the case of the latter, 
encryption IDs becomes a user of that abstraction.

Concern with a single misc controller would be that any subsystem that 
wants to use it has to exactly fit this support: current, max, stat, 
nothing more.  The moment a controller needs some additional support, and 
its controller is already implemented in a previous kernel version as a 
part of "misc," we face a problem.

On the other hand, a kernel abstraction that provides just the basic 
{current, max, stat} support might be interesting if it can be extended by 
the subsystem instance using it.
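
Purely as a hypothetical sketch (all names made up), that abstraction
could be little more than:

	/*
	 * A generic counted resource that a controller such as encryption IDs
	 * could instantiate and extend.
	 */
	struct counted_resource {
		atomic64_t usage;	/* exposed as <name>.current */
		s64 max;		/* exposed as <name>.max */
		s64 capacity;		/* platform total, exposed via <name>.stat */
	};

	/* Charge @n units against @res; return false if that exceeds the limit. */
	static inline bool counted_resource_try_charge(struct counted_resource *res,
						       s64 n)
	{
		if (atomic64_add_return(n, &res->usage) > res->max) {
			atomic64_sub(n, &res->usage);
			return false;
		}
		return true;
	}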

Re: [PATCH] KVM/SVM: add support for SEV attestation command

2020-12-10 Thread David Rientjes
On Wed, 9 Dec 2020, Brijesh Singh wrote:

> Noted, I will send v2 with these fixed.
> 

And with those changes:

Acked-by: David Rientjes 

Thanks Brijesh!


Re: [RFC Patch 0/2] KVM: SVM: Cgroup support for SVM SEV ASIDs

2020-11-24 Thread David Rientjes
On Tue, 24 Nov 2020, Vipin Sharma wrote:

> > > Looping Janosch and Christian back into the thread.   
> > > 
> > >   
> > > 
> > > I interpret this suggestion as
> > > 
> > > encryption.{sev,sev_es,keyids}.{max,current,events} for AMD and Intel 
> > > 
> > 
> > I think it makes sense to use encryption_ids instead of simply encryption, 
> > that
> > way it's clear the cgroup is accounting ids as opposed to restricting what
> > techs can be used on yes/no basis.
> > 

Agreed.

> > > offerings, which was my thought on this as well.  
> > > 
> > >   
> > > 
> > > Certainly the kernel could provide a single interface for all of these 
> > > and
> > > key value pairs depending on the underlying encryption technology but it  
> > > 
> > > seems to only introduce additional complexity in the kernel in string 
> > > 
> > > parsing that can otherwise be avoided.  I think we all agree that a 
> > > single
> > > interface for all encryption keys or one-value-per-file could be done in  
> > > 
> > > the kernel and handled by any userspace agent that is configuring these   
> > > 
> > > values.   
> > > 
> > >   
> > > 
> > > I think Vipin is adding a root level file that describes how many keys we 
> > > 
> > > have available on the platform for each technology.  So I think this 
> > > comes
> > > down to, for example, a single encryption.max file vs 
> > > 
> > > encryption.{sev,sev_es,keyid}.max.  SEV and SEV-ES ASIDs are provisioned  
> > > 
> > 
> > Are you suggesting that the cgroup omit "current" and "events"?  I agree 
> > there's
> > no need to enumerate platform total, but not knowing how many of the 
> > allowed IDs
> > have been allocated seems problematic.
> > 
> 
> We will be showing encryption_ids.{sev,sev_es}.{max,current}
> I am inclined to not provide "events" as I am not using it, let me know
> if this file is required, I can provide it then.
> 
> I will provide an encryption_ids.{sev,sev_es}.stat file, which shows
> total available ids on the platform. This one will be useful for
> scheduling jobs in the cloud infrastructure based on total supported
> capacity.
> 

Makes sense.  I assume the stat file is only at the cgroup root level 
since it would otherwise be duplicating its contents in every cgroup under 
it.  Probably not very helpful for child cgroup to see stat = 509 ASIDs 
but max = 100 :)


Re: Pinning ZONE_MOVABLE pages

2020-11-22 Thread David Rientjes
On Fri, 20 Nov 2020, Pavel Tatashin wrote:

> Recently, I encountered a hang that is happening during memory hot
> remove operation. It turns out that the hang is caused by pinned user
> pages in ZONE_MOVABLE.
> 
> Kernel expects that all pages in ZONE_MOVABLE can be migrated, but
> this is not the case if a user applications such as through dpdk
> libraries pinned them via vfio dma map. Kernel keeps trying to
> hot-remove them, but refcnt never gets to zero, so we are looping
> until the hardware watchdog kicks in.
> 
> We cannot do dma unmaps before hot-remove, because hot-remove is a
> slow operation, and we have thousands of network flows handled by
> dpdk that we just cannot suspend for the duration of hot-remove
> operation.
> 
> The solution is for dpdk to allocate pages from a zone below
> ZONE_MOVABLE, i.e. ZONE_NORMAL/ZONE_HIGHMEM, but this is not possible.
> There is no user interface that we have that allows applications to
> select what zone the memory should come from.
> 
> I've spoken with Stephen Hemminger, and he said that DPDK is moving in
> the direction of using transparent huge pages instead of HugeTLBs,
> which means that we need to allow at least anonymous, and anonymous
> transparent huge pages to come from non-movable zones on demand.
> 

I'd like to know more about this use case.  ZONE_MOVABLE is typically a 
great way to optimize for thp availability because, absent memory pinning, 
this memory can always be defragmented.  So the idea is that DPDK will now 
allocate all of its thp from ZONE_NORMAL or only a small subset?  Seems 
like an invitation for oom kill if the sizing of ZONE_NORMAL is 
insufficient.

> Here is what I am proposing:
> 1. Add a new flag that is passed through pin_user_pages_* down to
> fault handlers, and allow the fault handler to allocate from a
> non-movable zone.
> 
> Sample function stacks through which this info needs to be passed is this:
> 
> pin_user_pages_remote(gup_flags)
>  __get_user_pages_remote(gup_flags)
>   __gup_longterm_locked(gup_flags)
>__get_user_pages_locked(gup_flags)
> __get_user_pages(gup_flags)
>  faultin_page(gup_flags)
>   Convert gup_flags into fault_flags
>   handle_mm_fault(fault_flags)
> 
> From handle_mm_fault(), the stack diverges into various faults,
> examples include:
> 
> Transparent Huge Page
> handle_mm_fault(fault_flags)
> __handle_mm_fault(fault_flags)
> Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
> create_huge_pmd(vmf);
> do_huge_pmd_anonymous_page(vmf);
> mm_get_huge_zero_page(vma->vm_mm); -> flag is lost, so flag from
> vmf.gfp_mask should be passed as well.
> 
> There are several other similar paths in a transparent huge page, also
> there is a named path where allocation is based on filesystems, and
> the flag should be honored there as well, but it does not have to be
> added at the same time.
> 
> Regular Pages
> handle_mm_fault(fault_flags)
> __handle_mm_fault(fault_flags)
> Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
> handle_pte_fault(vmf)
> do_anonymous_page(vmf);
> page = alloc_zeroed_user_highpage_movable(vma, vmf->address); ->
> replace change this call according to gfp_mask.
> 

This would likely be useful for AMD SEV as well, which requires guest 
pages to be pinned because the encryption algorithm depends on the host 
physical address.  This ensures that plaintext memory for two pages don't 
result in the same ciphertext.
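
To make the proposed plumbing a bit more concrete, the end of that chain
could look roughly like the following (the flag name is hypothetical and
this is only an illustration of the idea, not an implementation):

	/* Hypothetical gup flag: request non-movable placement for pages
	 * that are about to be long-term pinned. */
	#define FOLL_NONMOVABLE		0x100000

	static inline gfp_t gup_fault_gfp(unsigned int gup_flags)
	{
		gfp_t gfp = GFP_HIGHUSER_MOVABLE;

		/* Long-term pinned pages must not come from ZONE_MOVABLE. */
		if (gup_flags & FOLL_NONMOVABLE)
			gfp &= ~__GFP_MOVABLE;

		return gfp;
	}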


Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order

2020-11-18 Thread David Rientjes
On Wed, 18 Nov 2020, Roman Gushchin wrote:

> On Wed, Nov 18, 2020 at 12:25:38PM +0100, Vlastimil Babka wrote:
> > On 11/18/20 9:27 AM, Bharata B Rao wrote:
> > > The page order of the slab that gets chosen for a given slab
> > > cache depends on the number of objects that can be fit in the
> > > slab while meeting other requirements. We start with a value
> > > of minimum objects based on nr_cpu_ids that is driven by
> > > possible number of CPUs and hence could be higher than the
> > > actual number of CPUs present in the system. This leads to
> > > calculate_order() choosing a page order that is on the higher
> > > side leading to increased slab memory consumption on systems
> > > that have bigger page sizes.
> > > 
> > > Hence rely on the number of online CPUs when determining the
> > > minimum objects, thereby increasing the chances of choosing
> > > a lower conservative page order for the slab.
> > > 
> > > Signed-off-by: Bharata B Rao 
> > 
> > Acked-by: Vlastimil Babka 
> > 
> > Ideally, we would react to hotplug events and update existing caches
> > accordingly. But for that, recalculation of order for existing caches would
> > have to be made safe, while not affecting hot paths. We have removed the
> > sysfs interface with 32a6f409b693 ("mm, slub: remove runtime allocation
> > order changes") as it didn't seem easy and worth the trouble.
> > 
> > In case somebody wants to start with a large order right from the boot
> > because they know they will hotplug lots of cpus later, they can use
> > slub_min_objects= boot param to override this heuristic. So in case this
> > change regresses somebody's performance, there's a way around it and thus
> > the risk is low IMHO.
> 
> I agree. For the absolute majority of users there will be no difference.
> And there is a good workaround for the rest.
> 
> Acked-by: Roman Gushchin 
> 

Acked-by: David Rientjes 
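
For reference, the heuristic being changed boils down to something like
the following (simplified sketch, not the literal patch):

	static unsigned int slab_min_objects(void)
	{
		if (slub_min_objects)
			return slub_min_objects;	/* slub_min_objects= boot override */

		/* was based on nr_cpu_ids, i.e. possible CPUs */
		return 4 * (fls(num_online_cpus()) + 1);
	}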


Re: [RFC Patch 0/2] KVM: SVM: Cgroup support for SVM SEV ASIDs

2020-11-13 Thread David Rientjes
On Mon, 2 Nov 2020, Sean Christopherson wrote:

> On Fri, Oct 02, 2020 at 01:48:10PM -0700, Vipin Sharma wrote:
> > On Fri, Sep 25, 2020 at 03:22:20PM -0700, Vipin Sharma wrote:
> > > I agree with you that the abstract name is better than the concrete
> > > name, I also feel that we must provide HW extensions. Here is one
> > > approach:
> > > 
> > > Cgroup name: cpu_encryption, encryption_slots, or memcrypt (open to
> > > suggestions)
> > > 
> > > Control files: slots.{max, current, events}
> 
> I don't particularly like the "slots" name, mostly because it could be 
> confused
> with KVM's memslots.  Maybe encryption_ids.ids.{max, current, events}?  I 
> don't
> love those names either, but "encryption" and "IDs" are the two obvious
> commonalities between TDX's encryption key IDs and SEV's encryption address
> space IDs.
> 

Looping Janosch and Christian back into the thread.

I interpret this suggestion as
encryption.{sev,sev_es,keyids}.{max,current,events} for AMD and Intel 
offerings, which was my thought on this as well.

Certainly the kernel could provide a single interface for all of these and 
key value pairs depending on the underlying encryption technology but it 
seems to only introduce additional complexity in the kernel in string 
parsing that can otherwise be avoided.  I think we all agree that a single 
interface for all encryption keys or one-value-per-file could be done in 
the kernel and handled by any userspace agent that is configuring these 
values.

I think Vipin is adding a root level file that describes how many keys we 
have available on the platform for each technology.  So I think this comes 
down to, for example, a single encryption.max file vs 
encryption.{sev,sev_es,keyid}.max.  SEV and SEV-ES ASIDs are provisioned 
separately so we treat them as their own resource here.

So which is easier?

$ cat encryption.sev.max
10
$ echo -n 15 > encryption.sev.max

or

$ cat encryption.max
sev 10
sev_es 10
keyid 0
$ echo -n "sev 10" > encryption.max

I would argue the former is simplest (always preferring 
one-value-per-file) and avoids any string parsing or resource controller 
lookups that need to match on that string in the kernel.

The set of encryption.{sev,sev_es,keyid} files that exist would depend on
CONFIG_CGROUP_ENCRYPTION and whether CONFIG_AMD_MEM_ENCRYPT or 
CONFIG_INTEL_TDX is configured.  Both can be configured so we have all 
three files, but the root file will obviously indicate 0 keys available 
for one of them (can't run on AMD and Intel at the same time :).

So I'm inclined to suggest that the one-value-per-file format is the ideal 
way to go unless there are objections to it.


Re: [PATCH v2] mm: memcg/slab: Fix root memcg vmstats

2020-11-10 Thread David Rientjes
On Tue, 10 Nov 2020, Muchun Song wrote:

> If we reparent the slab objects to the root memcg, when we free
> the slab object, we need to update the per-memcg vmstats to keep
> it correct for the root memcg. Now this at least affects the vmstat
> of NR_KERNEL_STACK_KB for !CONFIG_VMAP_STACK when the thread stack
> size is smaller than the PAGE_SIZE.
> 
> Fixes: ec9f02384f60 ("mm: workingset: fix vmstat counters for shadow nodes")
> Signed-off-by: Muchun Song 
> Acked-by: Roman Gushchin 

Acked-by: David Rientjes 

I assume that without this fix the root memcg's vmstat would always 
be inflated if we reparented?  If that's accurate, perhaps this is 
deserving of a

Cc: sta...@vger.kernel.org # 5.3+


Re: [PATCH rfc 0/3] mm: memcg: deprecate cgroup v1 non-hierarchical mode

2020-11-04 Thread David Rientjes
On Tue, 3 Nov 2020, Roman Gushchin wrote:

> The non-hierarchical cgroup v1 mode is a legacy of early days
> of the memory controller and doesn't bring any value today.
> However, it complicates the code and creates many edge cases
> all over the memory controller code.
> 
> It's a good time to deprecate it completely. This patchset removes
> the internal logic, adjusts the user interface and updates
> the documentation. The alt patch removes some bits of the cgroup
> core code, which become obsolete.
> 
> 
> Roman Gushchin (3):
>   mm: memcg: deprecate the non-hierarchical mode
>   docs: cgroup-v1: reflect the deprecation of the non-hierarchical mode
>   cgroup: remove obsoleted broken_hierarchy and warned_broken_hierarchy
> 

For all three patches:

Acked-by: David Rientjes 

Very welcome change to see, we've always prevented the non-hierarchical 
mode from being set in our kernel.


Re: [cma] 1ea6c22c9b: page_allocation_failure:order:#,mode:#(__GFP_RECLAIMABLE),nodemask=(null)

2020-11-03 Thread David Rientjes
On Tue, 3 Nov 2020, kernel test robot wrote:

> Greeting,
> 
> FYI, we noticed the following commit (built with gcc-9):
> 
> commit: 1ea6c22c9b85ec176bb78d7076be06a4142f8bdd ("[PATCH 1/2] cma: redirect 
> page allocation to CMA")
> url: 
> https://github.com/0day-ci/linux/commits/Chris-Goldsworthy/Increasing-CMA-Utilization-with-a-GFP-Flag/20201102-224143
> base: https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git 
> for-next
> 
> in testcase: boot
> 
> on test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 8G
> 
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
> 
> 
> +----------------------------------------------------------------------------+------------+------------+
> |                                                                            | 67b6d2ce11 | 1ea6c22c9b |
> +----------------------------------------------------------------------------+------------+------------+
> | boot_successes                                                             | 6          | 0          |
> | boot_failures                                                              | 0          | 8          |
> | page_allocation_failure:order:#,mode:#(__GFP_RECLAIMABLE),nodemask=(null)  | 0          | 8          |
> | Mem-Info                                                                   | 0          | 8          |
> | kernel_BUG_at_kernel/workqueue.c                                           | 0          | 8          |
> | invalid_opcode:#[##]                                                       | 0          | 8          |
> | EIP:workqueue_init_early                                                   | 0          | 8          |
> | Kernel_panic-not_syncing:Fatal_exception                                   | 0          | 8          |
> +----------------------------------------------------------------------------+------------+------------+
> 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot 
> 
> 
> [2.108390] Memory: 8203240K/8388088K available (12819K kernel code, 6610K 
> rwdata, 11292K rodata, 628K init, 12064K bss, 184848K reserved, 0K 
> cma-reserved, 7481224K highmem)
> [2.111999] Checking if this processor honours the WP bit even in 
> supervisor mode...Ok.
> [2.113918] random: get_random_u32 called from kmem_cache_open+0x1f/0x240 
> with crng_init=0
> [2.114387] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
> [2.117656] Kernel/User page tables isolation: enabled
> [2.119610] swapper: page allocation failure: order:0, 
> mode:0x10(__GFP_RECLAIMABLE), nodemask=(null)
> [2.121604] CPU: 0 PID: 0 Comm: swapper Not tainted 
> 5.10.0-rc1-00365-g1ea6c22c9b85 #1
> [2.123401] Call Trace:
> [2.123990]  dump_stack+0x1b/0x1d
> [2.124787]  warn_alloc+0x81/0xd9
> [2.125599]  __alloc_pages_slowpath+0x79c/0x7a9
> [2.126690]  ? get_page_from_freelist+0xb8/0x20d
> [2.127750]  __alloc_pages_nodemask+0x107/0x188
> [2.128795]  __alloc_pages_node+0x17/0x1c
> [2.129757]  alloc_slab_page+0x26/0x4e
> [2.130649]  allocate_slab+0x80/0x27e
> [2.131521]  ___slab_alloc+0x247/0x2ec
> [2.132605]  ? radix_tree_node_alloc+0x5e/0x8e
> [2.133915]  ? validate_chain+0x5a8/0x5c3
> [2.134838]  __slab_alloc+0x34/0x4d
> [2.135830]  ? __slab_alloc+0x34/0x4d
> [2.136909]  ? fs_reclaim_release+0x8/0x13
> [2.137910]  kmem_cache_alloc+0x46/0x157
> [2.138854]  ? radix_tree_node_alloc+0x5e/0x8e
> [2.140103]  radix_tree_node_alloc+0x5e/0x8e
> [2.141369]  idr_get_free+0xc1/0x21a
> [2.142230]  idr_alloc_u32+0x4d/0x80
> [2.143016]  idr_alloc+0x30/0x3e
> [2.143751]  worker_pool_assign_id+0x37/0x47
> [2.144682]  workqueue_init_early+0x9f/0x1f6
> [2.145707]  start_kernel+0x206/0x467
> [2.146592]  ? early_idt_handler_common+0x44/0x44
> [2.147671]  i386_start_kernel+0x42/0x44
> [2.148607]  startup_32_smp+0x164/0x168
> [2.149567] Mem-Info:
> [2.150101] active_anon:0 inactive_anon:0 isolated_anon:0
> [2.150101]  active_file:0 inactive_file:0 isolated_file:0
> [2.150101]  unevictable:0 dirty:0 writeback:0
> [2.150101]  slab_reclaimable:0 slab_unreclaimable:88
> [2.150101]  mapped:0 shmem:0 pagetables:0 bounce:0
> [2.150101]  free:2050715 free_pcp:0 free_cma:0
> [2.156588] Node 0 active_anon:0kB inactive_anon:0kB active_file:0kB 
> inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
> mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 
> 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:0kB all_unreclaimable? no
> [2.162313] Normal free:721636kB min:0kB low:0kB high:0kB 
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB 
> inactive_file:0kB unevictable:0kB writepending:0kB present:906864kB 
> managed:722016kB mlocked:0kB pagetables:0kB bounce:0kB free_pcp:0kB 
> local_pcp:0kB free_cma:0kB
> [2.

Re: [mm/page_alloc] 7fef431be9: vm-scalability.throughput 87.8% improvement

2020-10-23 Thread David Rientjes
On Wed, 21 Oct 2020, kernel test robot wrote:

> Greeting,
> 
> FYI, we noticed a 87.8% improvement of vm-scalability.throughput due to 
> commit:
> 
> 
> commit: 7fef431be9c9ac255838a9578331567b9dba4477 ("mm/page_alloc: place pages 
> to tail in __free_pages_core()")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> 
> in testcase: vm-scalability
> on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz 
> with 192G memory
> with following parameters:
> 
>   runtime: 300s
>   size: 512G
>   test: anon-wx-rand-mt
>   cpufreq_governor: performance
>   ucode: 0x5002f01
> 
> test-description: The motivation behind this suite is to exercise functions 
> and regions of the mm/ of the Linux kernel which are of interest to us.
> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
> 

I'm curious why we are not able to reproduce this improvement on Skylake 
and actually see a slight performance degradation, at least for 
300s_128G_truncate_throughput.

Axel Rasmussen  can provide more details on our 
results.


Re: [PATCH] mm: memcontrol: add file_thp, shmem_thp to memory.stat

2020-10-22 Thread David Rientjes
On Thu, 22 Oct 2020, Johannes Weiner wrote:

> As huge page usage in the page cache and for shmem files proliferates
> in our production environment, the performance monitoring team has
> asked for per-cgroup stats on those pages.
> 
> We already track and export anon_thp per cgroup. We already track file
> THP and shmem THP per node, so making them per-cgroup is only a matter
> of switching from node to lruvec counters. All callsites are in places
> where the pages are charged and locked, so page->memcg is stable.
> 
> Signed-off-by: Johannes Weiner 

Acked-by: David Rientjes 

Nice!


Re: [mm, thp] 85b9f46e8e: vm-scalability.throughput -8.7% regression

2020-10-20 Thread David Rientjes
On Tue, 20 Oct 2020, Huang, Ying wrote:

> >> =
> >> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase/ucode:
> >>   
> >> gcc-9/performance/x86_64-rhel-8.3/debian-10.4-x86_64-20200603.cgz/300s/1T/lkp-skl-fpga01/lru-shm/vm-scalability/0x2006906
> >> 
> >> commit: 
> >>   dcdf11ee14 ("mm, shmem: add vmstat for hugepage fallback")
> >>   85b9f46e8e ("mm, thp: track fallbacks due to failed memcg charges 
> >> separately")
> >> 
> >> dcdf11ee14413332 85b9f46e8ea451633ccd60a7d8c 
> >>  --- 
> >>fail:runs  %reproductionfail:runs
> >>| | |
> >>   1:4   24%   2:4 
> >> perf-profile.calltrace.cycles-pp.sync_regs.error_entry.do_access
> >>   3:4   53%   5:4 
> >> perf-profile.calltrace.cycles-pp.error_entry.do_access
> >>   9:4  -27%   8:4 
> >> perf-profile.children.cycles-pp.error_entry
> >>   4:4  -10%   4:4 
> >> perf-profile.self.cycles-pp.error_entry
> >>  %stddev %change %stddev
> >>  \  |\  
> >> 477291-9.1% 434041vm-scalability.median
> >>   49791027-8.7%   45476799vm-scalability.throughput
> >> 223.67+1.6% 227.36
> >> vm-scalability.time.elapsed_time
> >> 223.67+1.6% 227.36
> >> vm-scalability.time.elapsed_time.max
> >>  50364 ±  6% +24.1%  62482 ± 10%  
> >> vm-scalability.time.involuntary_context_switches
> >>   2237+7.8%   2412
> >> vm-scalability.time.percent_of_cpu_this_job_got
> >>   3084   +18.2%   3646
> >> vm-scalability.time.system_time
> >>   1921-4.2%   1839vm-scalability.time.user_time
> >>  13.68+2.2   15.86mpstat.cpu.all.sys%
> >>  28535 ± 30% -47.0%  15114 ± 79%  
> >> numa-numastat.node0.other_node
> >> 142734 ± 11% -19.4% 115000 ± 17%  numa-meminfo.node0.AnonPages
> >>  11168 ±  3%  +8.8%  12150 ±  5%  numa-meminfo.node1.PageTables
> >>  76.00-1.6%  74.75vmstat.cpu.id
> >>   3626-1.9%   3555vmstat.system.cs
> >>2214928 ±166% -96.6%  75321 ±  7%  cpuidle.C1.usage
> >> 200981 ±  7% -18.0% 164861 ±  7%  cpuidle.POLL.time
> >>  52675 ±  3% -16.7%  43866 ± 10%  cpuidle.POLL.usage
> >>  35659 ± 11% -19.4%  28754 ± 17%  
> >> numa-vmstat.node0.nr_anon_pages
> >>1248014 ±  3% +10.9%1384236numa-vmstat.node1.nr_mapped
> >>   2722 ±  4% +10.6%   3011 ±  5%  
> >> numa-vmstat.node1.nr_page_table_pages
> >
> > I'm not sure that I'm reading this correctly, but I suspect that this just 
> > happens because of NUMA: memory affinity will obviously impact 
> > vm-scalability.throughput quite substantially, but I don't think the 
> > bisected commit can be to be blame.  Commit 85b9f46e8ea4 ("mm, thp: track 
> > fallbacks due to failed memcg charges separately") simply adds new 
> > count_vm_event() calls in a couple areas to track thp fallback due to 
> > memcg limits separate from fragmentation.
> >
> > It's likely a question about the testing methodology in general: for 
> > memory intensive benchmarks, I suggest it is configured in a manner that 
> > we can expect consistent memory access latency at the hardware level when 
> > running on a NUMA system.
> 
> So you think it's better to bind processes to NUMA node or CPU?  But we
> want to use this test case to capture NUMA/CPU placement/balance issue
> too.
> 

No, because binding to a specific socket may cause other performance 
"improvements" or "degradations" depending on how fragmented local memory 
is, or whether or not it's under memory pressure.  Is the system rebooted 
before testing so that we have a consistent state of memory availability 
and fragmentation across sockets?

> 0day solves the problem in another way.  We run the test case
> multiple-times and calculate the average and standard deviation, then
> compare.
> 

Depending on fragmentation or memory availability, any benchmark that 
assesses performance may be adversely affected if its results can be 
impacted by hugepage backing.

Re: [PATCH v3 1/2] tracing: support "bool" type in synthetic trace events

2020-10-13 Thread David Rientjes
On Fri, 9 Oct 2020, Axel Rasmussen wrote:

> It's common [1] to define tracepoint fields as "bool" when they contain
> a true / false value. Currently, defining a synthetic event with a
> "bool" field yields EINVAL. It's possible to work around this by using
> e.g. u8 (assuming sizeof(bool) is 1, and bool is unsigned; if either of
> these properties don't match, you get EINVAL [2]).
> 
> Supporting "bool" explicitly makes hooking this up easier and more
> portable for userspace.
> 
> [1]: grep -r "bool" include/trace/events/
> [2]: check_synth_field() in kernel/trace/trace_events_hist.c
> 
> Acked-by: Michel Lespinasse 
> Signed-off-by: Axel Rasmussen 

Acked-by: David Rientjes 


Re: [PATCH v3 2/2] mmap_lock: add tracepoints around lock acquisition

2020-10-13 Thread David Rientjes
On Fri, 9 Oct 2020, Axel Rasmussen wrote:

> The goal of these tracepoints is to be able to debug lock contention
> issues. This lock is acquired on most (all?) mmap / munmap / page fault
> operations, so a multi-threaded process which does a lot of these can
> experience significant contention.
> 
> We trace just before we start acquisition, when the acquisition returns
> (whether it succeeded or not), and when the lock is released (or
> downgraded). The events are broken out by lock type (read / write).
> 
> The events are also broken out by memcg path. For container-based
> workloads, users often think of several processes in a memcg as a single
> logical "task", so collecting statistics at this level is useful.
> 
> The end goal is to get latency information. This isn't directly included
> in the trace events. Instead, users are expected to compute the time
> between "start locking" and "acquire returned", using e.g. synthetic
> events or BPF. The benefit we get from this is simpler code.
> 
> Because we use tracepoint_enabled() to decide whether or not to trace,
> this patch has effectively no overhead unless tracepoints are enabled at
> runtime. If tracepoints are enabled, there is a performance impact, but
> how much depends on exactly what e.g. the BPF program does.
> 
> Signed-off-by: Axel Rasmussen 

Acked-by: David Rientjes 


Re: [PATCH] mm: memcontrol: Remove unused mod_memcg_obj_state()

2020-10-13 Thread David Rientjes
On Tue, 13 Oct 2020, Muchun Song wrote:

> Since commit:
> 
>   991e7673859e ("mm: memcontrol: account kernel stack per node")
> 
> There is no user of the mod_memcg_obj_state(). This patch just remove
> it. Also rework type of the idx parameter of the mod_objcg_state()
> from int to enum node_stat_item.
> 
> Signed-off-by: Muchun Song 

Acked-by: David Rientjes 


Re: [mm, thp] 85b9f46e8e: vm-scalability.throughput -8.7% regression

2020-10-04 Thread David Rientjes
On Sun, 4 Oct 2020, kernel test robot wrote:

> Greeting,
> 
> FYI, we noticed a -8.7% regression of vm-scalability.throughput due to commit:
> 
> 
> commit: 85b9f46e8ea451633ccd60a7d8cacbfff9f34047 ("mm, thp: track fallbacks 
> due to failed memcg charges separately")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> 
> in testcase: vm-scalability
> on test machine: 104 threads Skylake with 192G memory
> with following parameters:
> 
>   runtime: 300s
>   size: 1T
>   test: lru-shm
>   cpufreq_governor: performance
>   ucode: 0x2006906
> 
> test-description: The motivation behind this suite is to exercise functions 
> and regions of the mm/ of the Linux kernel which are of interest to us.
> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
> 
> 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot 
> 
> 
> Details are as below:
> -->
> 
> 
> To reproduce:
> 
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml  # job file is attached in this email
> bin/lkp run job.yaml
> 
> =
> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase/ucode:
>   
> gcc-9/performance/x86_64-rhel-8.3/debian-10.4-x86_64-20200603.cgz/300s/1T/lkp-skl-fpga01/lru-shm/vm-scalability/0x2006906
> 
> commit: 
>   dcdf11ee14 ("mm, shmem: add vmstat for hugepage fallback")
>   85b9f46e8e ("mm, thp: track fallbacks due to failed memcg charges 
> separately")
> 
> dcdf11ee14413332 85b9f46e8ea451633ccd60a7d8c 
>  --- 
>fail:runs  %reproductionfail:runs
>| | |
>   1:4   24%   2:4 
> perf-profile.calltrace.cycles-pp.sync_regs.error_entry.do_access
>   3:4   53%   5:4 
> perf-profile.calltrace.cycles-pp.error_entry.do_access
>   9:4  -27%   8:4 
> perf-profile.children.cycles-pp.error_entry
>   4:4  -10%   4:4 
> perf-profile.self.cycles-pp.error_entry
>  %stddev %change %stddev
>  \  |\  
> 477291-9.1% 434041vm-scalability.median
>   49791027-8.7%   45476799vm-scalability.throughput
> 223.67+1.6% 227.36vm-scalability.time.elapsed_time
> 223.67+1.6% 227.36
> vm-scalability.time.elapsed_time.max
>  50364 ±  6% +24.1%  62482 ± 10%  
> vm-scalability.time.involuntary_context_switches
>   2237+7.8%   2412
> vm-scalability.time.percent_of_cpu_this_job_got
>   3084   +18.2%   3646vm-scalability.time.system_time
>   1921-4.2%   1839vm-scalability.time.user_time
>  13.68+2.2   15.86mpstat.cpu.all.sys%
>  28535 ± 30% -47.0%  15114 ± 79%  numa-numastat.node0.other_node
> 142734 ± 11% -19.4% 115000 ± 17%  numa-meminfo.node0.AnonPages
>  11168 ±  3%  +8.8%  12150 ±  5%  numa-meminfo.node1.PageTables
>  76.00-1.6%  74.75vmstat.cpu.id
>   3626-1.9%   3555vmstat.system.cs
>2214928 ±166% -96.6%  75321 ±  7%  cpuidle.C1.usage
> 200981 ±  7% -18.0% 164861 ±  7%  cpuidle.POLL.time
>  52675 ±  3% -16.7%  43866 ± 10%  cpuidle.POLL.usage
>  35659 ± 11% -19.4%  28754 ± 17%  numa-vmstat.node0.nr_anon_pages
>1248014 ±  3% +10.9%1384236numa-vmstat.node1.nr_mapped
>   2722 ±  4% +10.6%   3011 ±  5%  
> numa-vmstat.node1.nr_page_table_pages

I'm not sure that I'm reading this correctly, but I suspect that this just 
happens because of NUMA: memory affinity will obviously impact 
vm-scalability.throughput quite substantially, but I don't think the 
bisected commit can be to blame.  Commit 85b9f46e8ea4 ("mm, thp: track 
fallbacks due to failed memcg charges separately") simply adds new 
count_vm_event() calls in a couple areas to track thp fallback due to 
memcg limits separate from fragmentation.

It's likely a question about the testing methodology in general: for 
memory intensive benchmarks, I suggest it is configured in a manner that 
we can expect consistent memory access latency at the hardware level when 
running on a NUMA system.

Re: [patch] KVM: SVM: Periodically schedule when unregistering regions on destroy

2020-09-11 Thread David Rientjes
Paolo, ping?

On Tue, 25 Aug 2020, David Rientjes wrote:

> There may be many encrypted regions that need to be unregistered when a
> SEV VM is destroyed.  This can lead to soft lockups.  For example, on a
> host running 4.15:
> 
> watchdog: BUG: soft lockup - CPU#206 stuck for 11s! [t_virtual_machi:194348]
> CPU: 206 PID: 194348 Comm: t_virtual_machi
> RIP: 0010:free_unref_page_list+0x105/0x170
> ...
> Call Trace:
>  [<0>] release_pages+0x159/0x3d0
>  [<0>] sev_unpin_memory+0x2c/0x50 [kvm_amd]
>  [<0>] __unregister_enc_region_locked+0x2f/0x70 [kvm_amd]
>  [<0>] svm_vm_destroy+0xa9/0x200 [kvm_amd]
>  [<0>] kvm_arch_destroy_vm+0x47/0x200
>  [<0>] kvm_put_kvm+0x1a8/0x2f0
>  [<0>] kvm_vm_release+0x25/0x30
>  [<0>] do_exit+0x335/0xc10
>  [<0>] do_group_exit+0x3f/0xa0
>  [<0>] get_signal+0x1bc/0x670
>  [<0>] do_signal+0x31/0x130
> 
> Although the CLFLUSH is no longer issued on every encrypted region to be
> unregistered, there are no other changes that can prevent soft lockups for
> very large SEV VMs in the latest kernel.
> 
> Periodically schedule if necessary.  This still holds kvm->lock across the
> resched, but since this only happens when the VM is destroyed this is
> assumed to be acceptable.
> 
> Signed-off-by: David Rientjes 
> ---
>  arch/x86/kvm/svm/sev.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -1106,6 +1106,7 @@ void sev_vm_destroy(struct kvm *kvm)
>   list_for_each_safe(pos, q, head) {
>   __unregister_enc_region_locked(kvm,
>   list_entry(pos, struct enc_region, list));
> + cond_resched();
>   }
>   }
>  
> 


Re: [PATCH] mm/memory_hotplug: drain per-cpu pages again during memory offline

2020-09-01 Thread David Rientjes
On Tue, 1 Sep 2020, Pavel Tatashin wrote:

> There is a race during page offline that can lead to infinite loop:
> a page never ends up on a buddy list and __offline_pages() keeps
> retrying infinitely or until a termination signal is received.
> 
> Thread#1 - a new process:
> 
> load_elf_binary
>  begin_new_exec
>   exec_mmap
>mmput
> exit_mmap
>  tlb_finish_mmu
>   tlb_flush_mmu
>release_pages
> free_unref_page_list
>  free_unref_page_prepare
>   set_pcppage_migratetype(page, migratetype);
>  // Set page->index migration type below  MIGRATE_PCPTYPES
> 
> Thread#2 - hot-removes memory
> __offline_pages
>   start_isolate_page_range
> set_migratetype_isolate
>   set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> Set migration type to MIGRATE_ISOLATE-> set
> drain_all_pages(zone);
>  // drain per-cpu page lists to buddy allocator.
> 
> Thread#1 - continue
>  free_unref_page_commit
>migratetype = get_pcppage_migratetype(page);
>   // get old migration type
>list_add(&page->lru, &pcp->lists[migratetype]);
>   // add new page to already drained pcp list
> 
> Thread#2
> Never drains pcp again, and therefore gets stuck in the loop.
> 
> The fix is to try to drain per-cpu lists again after
> check_pages_isolated_cb() fails.
> 
> Signed-off-by: Pavel Tatashin 
> Cc: sta...@vger.kernel.org

Acked-by: David Rientjes 


Re: [PATCH] ia64: fix min_low_pfn/max_low_pfn build errors

2020-08-29 Thread David Rientjes
On Fri, 28 Aug 2020, Randy Dunlap wrote:

> Fix min_low_pfn/max_low_pfn build errors for arch/ia64/: (e.g.)
> 
>  ERROR: "max_low_pfn" [drivers/rpmsg/virtio_rpmsg_bus.ko] undefined!
>  ERROR: "min_low_pfn" [drivers/rpmsg/virtio_rpmsg_bus.ko] undefined!
>  ERROR: "min_low_pfn" [drivers/hwtracing/intel_th/intel_th_msu.ko] undefined!
>  ERROR: "max_low_pfn" [drivers/hwtracing/intel_th/intel_th_msu.ko] undefined!
>  ERROR: "min_low_pfn" [drivers/crypto/cavium/nitrox/n5pf.ko] undefined!
>  ERROR: "max_low_pfn" [drivers/crypto/cavium/nitrox/n5pf.ko] undefined!
>  ERROR: "max_low_pfn" [drivers/md/dm-integrity.ko] undefined!
>  ERROR: "min_low_pfn" [drivers/md/dm-integrity.ko] undefined!
>  ERROR: "max_low_pfn" [crypto/tcrypt.ko] undefined!
>  ERROR: "min_low_pfn" [crypto/tcrypt.ko] undefined!
>  ERROR: "min_low_pfn" [security/keys/encrypted-keys/encrypted-keys.ko] 
> undefined!
>  ERROR: "max_low_pfn" [security/keys/encrypted-keys/encrypted-keys.ko] 
> undefined!
>  ERROR: "min_low_pfn" [arch/ia64/kernel/mca_recovery.ko] undefined!
>  ERROR: "max_low_pfn" [arch/ia64/kernel/mca_recovery.ko] undefined!
> 
> David suggested just exporting min_low_pfn & max_low_pfn in
> mm/memblock.c:
> https://lore.kernel.org/lkml/alpine.deb.2.22.394.2006291911220.1118...@chino.kir.corp.google.com/
> 
> Reported-by: kernel test robot 
> Signed-off-by: Randy Dunlap 
> Cc: linux...@kvack.org
> Cc: Andrew Morton 
> Cc: David Rientjes 
> Cc: Mike Rapoport 
> Cc: Tony Luck 
> Cc: Fenghua Yu 
> Cc: linux-i...@vger.kernel.org

Acked-by: David Rientjes 

Thanks Randy!
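
For reference, the suggested mm/memblock.c change amounts to little more
than (illustrative):

	/* next to the existing definitions in mm/memblock.c */
	EXPORT_SYMBOL(min_low_pfn);
	EXPORT_SYMBOL(max_low_pfn);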


Re: [PATCH] microblaze: fix min_low_pfn/max_low_pfn build errors

2020-08-29 Thread David Rientjes
On Fri, 28 Aug 2020, Randy Dunlap wrote:

> Fix min_low_pfn/max_low_pfn build errors for arch/microblaze/: (e.g.)
> 
>   ERROR: "min_low_pfn" [drivers/rpmsg/virtio_rpmsg_bus.ko] undefined!
>   ERROR: "min_low_pfn" [drivers/hwtracing/intel_th/intel_th_msu_sink.ko] 
> undefined!
>   ERROR: "min_low_pfn" [drivers/hwtracing/intel_th/intel_th_msu.ko] undefined!
>   ERROR: "min_low_pfn" [drivers/mmc/core/mmc_core.ko] undefined!
>   ERROR: "min_low_pfn" [drivers/md/dm-crypt.ko] undefined!
>   ERROR: "min_low_pfn" [drivers/net/wireless/ath/ath6kl/ath6kl_sdio.ko] 
> undefined!
>   ERROR: "min_low_pfn" [crypto/tcrypt.ko] undefined!
>   ERROR: "min_low_pfn" [crypto/asymmetric_keys/asym_tpm.ko] undefined!
> 
> Mike had/has an alternate patch for Microblaze:
> https://lore.kernel.org/lkml/20200630111519.ga1951...@linux.ibm.com/
> 
> David suggested just exporting min_low_pfn & max_low_pfn in
> mm/memblock.c:
> https://lore.kernel.org/lkml/alpine.deb.2.22.394.2006291911220.1118...@chino.kir.corp.google.com/
> 
> Reported-by: kernel test robot 
> Signed-off-by: Randy Dunlap 
> Cc: linux...@kvack.org
> Cc: Andrew Morton 
> Cc: David Rientjes 
> Cc: Mike Rapoport 
> Cc: Michal Simek 
> Cc: Michal Simek 

Acked-by: David Rientjes 

Thanks Randy!


Re: Kernel 5.9-rc Regression: Boot failure with nvme

2020-08-29 Thread David Rientjes
On Sat, 29 Aug 2020, Christoph Hellwig wrote:

> > Just adding Christoph to the participants list, since at a guess it's
> > due to his changes whether they came from the nvme side or the dma
> > side..
> > 
> > Christoph?
> 
> This kinda looks like the sqsize regression we had in earlier 5.9-rc,
> but that should have been fixed in -rc2 with
> 
> 7442ddcedc344b6fa073692f165dffdd1889e780
> Author: John Garry 
> Date:   Fri Aug 14 23:34:25 2020 +0800
> 
> nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth
> 
> Daniel, can you double check that you don't have that commit?
> 

Looks like Daniel has confirmed that this indeed does fix his issue -- 
great!

Christoph, re the plan to backport the atomic DMA pool support to 5.4 LTS 
for the purposes of fixing the AMD SEV allocation issues, I've composed 
the following list:

e860c299ac0d dma-remap: separate DMA atomic pools from direct remap code
c84dc6e68a1d dma-pool: add additional coherent pools to map to gfp mask
54adadf9b085 dma-pool: dynamically expanding atomic pools
76a19940bd62 dma-direct: atomic allocations must come from atomic coherent pools
2edc5bb3c5cc dma-pool: add pool sizes to debugfs
1d659236fb43 dma-pool: scale the default DMA coherent pool size with memory 
capacity
3ee06a6d532f dma-pool: fix too large DMA pools on medium memory size systems
dbed452a078d dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOL
** 633d5fce78a6 dma-direct: always align allocation size in 
dma_direct_alloc_pages()
** 96a539fa3bb7 dma-direct: re-encrypt memory if dma_direct_alloc_pages() fails
** 56fccf21d196 dma-direct: check return value when encrypting or decrypting 
memory
** 1a2b3357e860 dma-direct: add missing set_memory_decrypted() for coherent 
mapping
d07ae4c48690 dma-mapping: DMA_COHERENT_POOL should select GENERIC_ALLOCATOR
71cdec4fab76 dma-mapping: warn when coherent pool is depleted
567f6a6eba0c dma-direct: provide function to check physical memory area validity
23e469be6239 dma-pool: get rid of dma_in_atomic_pool()
48b6703858dd dma-pool: introduce dma_guess_pool()
81e9d894e03f dma-pool: make sure atomic pool suits device
d9765e41d8e9 dma-pool: do not allocate pool memory from CMA
9420139f516d dma-pool: fix coherent pool allocations for IOMMU mappings
d7e673ec2c8e dma-pool: Only allocate from CMA when in same memory zone

 [ The commits prefixed with ** are not absolutely required for atomic DMA
   but rather fix other issues with SEV in the DMA layer that I found 
   along the way.  They are likely deserving of their own stable 
   backports, but added them here because it's probably best to backport 
   in order to minimize conflicts.  We'll simply make a note of that in 
   the cover letter for the stable backport series. ]

Do you know of any others to add?  NVMe specific fixes, perhaps John 
Garry's fix above, Intel IOMMU fixes maybe?


[patch] KVM: SVM: Periodically schedule when unregistering regions on destroy

2020-08-25 Thread David Rientjes
There may be many encrypted regions that need to be unregistered when a
SEV VM is destroyed.  This can lead to soft lockups.  For example, on a
host running 4.15:

watchdog: BUG: soft lockup - CPU#206 stuck for 11s! [t_virtual_machi:194348]
CPU: 206 PID: 194348 Comm: t_virtual_machi
RIP: 0010:free_unref_page_list+0x105/0x170
...
Call Trace:
 [<0>] release_pages+0x159/0x3d0
 [<0>] sev_unpin_memory+0x2c/0x50 [kvm_amd]
 [<0>] __unregister_enc_region_locked+0x2f/0x70 [kvm_amd]
 [<0>] svm_vm_destroy+0xa9/0x200 [kvm_amd]
 [<0>] kvm_arch_destroy_vm+0x47/0x200
 [<0>] kvm_put_kvm+0x1a8/0x2f0
 [<0>] kvm_vm_release+0x25/0x30
 [<0>] do_exit+0x335/0xc10
 [<0>] do_group_exit+0x3f/0xa0
 [<0>] get_signal+0x1bc/0x670
 [<0>] do_signal+0x31/0x130

Although the CLFLUSH is no longer issued on every encrypted region to be
unregistered, there are no other changes that can prevent soft lockups for
very large SEV VMs in the latest kernel.

Periodically schedule if necessary.  This still holds kvm->lock across the
resched, but since this only happens when the VM is destroyed this is
assumed to be acceptable.

Signed-off-by: David Rientjes 
---
 arch/x86/kvm/svm/sev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1106,6 +1106,7 @@ void sev_vm_destroy(struct kvm *kvm)
list_for_each_safe(pos, q, head) {
__unregister_enc_region_locked(kvm,
list_entry(pos, struct enc_region, list));
+   cond_resched();
}
}
 


Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()

2020-08-12 Thread David Rientjes
On Wed, 12 Aug 2020, Charan Teja Kalla wrote:

> >> Signed-off-by: Charan Teja Reddy 
> >> ---
> >>
> >> v1: https://patchwork.kernel.org/patch/11707637/
> >>
> >>  mm/page_alloc.c | 5 +
> >>  1 file changed, 5 insertions(+)
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index e4896e6..839039f 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -1304,6 +1304,11 @@ static void free_pcppages_bulk(struct zone *zone, 
> >> int count,
> >>struct page *page, *tmp;
> >>LIST_HEAD(head);
> >>  
> >> +  /*
> >> +   * Ensure proper count is passed which otherwise would stuck in the
> >> +   * below while (list_empty(list)) loop.
> >> +   */
> >> +  count = min(pcp->count, count);
> >>while (count) {
> >>struct list_head *list;
> >>  
> >>
> > 
> > Fixes: and Cc: stable... tags?
> 
> Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into
> one-list-per-migrate-type")
> Cc:  [2.6+]
> 

Acked-by: David Rientjes 


Re: [PATCH] Revert "mm/vmstat.c: do not show lowmem reserve protection information of empty zone"

2020-08-11 Thread David Rientjes
On Tue, 11 Aug 2020, Baoquan He wrote:

> This reverts commit 26e7deadaae1755faf1f6d1a68988c4b8348df59.
> 
> Sonny reported that one of their tests started failing on the latest
> kernel on their Chrome OS platform. The root cause is that the above
> commit removed the protection line of empty zone, while the parser used
> in the test relies on the protection line to mark the end of each zone.
> 
> Let's revert it to avoid breaking userspace testing or applications.
> 

Acked-by: David Rientjes 

No objection since I noted userspace parsing as a potential risk in 
https://marc.info/?l=linux-mm&m=158507790824995 but I didn't catch this 
earlier patch which could have led to the same thing.


Re: [PATCH] mm/slub: fix missing ALLOC_SLOWPATH stat when bulk alloc

2020-08-11 Thread David Rientjes
On Tue, 11 Aug 2020, wuyun...@huawei.com wrote:

> From: Abel Wu 
> 
> The ALLOC_SLOWPATH statistics is missing in bulk allocation now.
> Fix it by doing statistics in alloc slow path.
> 
> Signed-off-by: Abel Wu 
> ---
>  mm/slub.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index df93a5a0e9a4..5d89e4064f83 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2600,6 +2600,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t 
> gfpflags, int node,
>   void *freelist;
>   struct page *page;
>  
> + stat(s, ALLOC_SLOWPATH);
> +
>   page = c->page;
>   if (!page) {
>   /*
> @@ -2788,7 +2790,6 @@ static __always_inline void *slab_alloc_node(struct 
> kmem_cache *s,
>   page = c->page;
>   if (unlikely(!object || !node_match(page, node))) {
>   object = __slab_alloc(s, gfpflags, node, addr, c);
> - stat(s, ALLOC_SLOWPATH);
>   } else {
>   void *next_object = get_freepointer_safe(s, object);
>  

Acked-by: David Rientjes 

> -- 
> 2.28.0.windows.1

Lol :)


Re: [PATCH] mm/slub: remove useless kmem_cache_debug

2020-08-10 Thread David Rientjes
On Mon, 10 Aug 2020, wuyun...@huawei.com wrote:

> From: Abel Wu 
> 
> The commit below is incomplete, as it didn't handle the add_full() part.
> commit a4d3f8916c65 ("slub: remove useless kmem_cache_debug() before 
> remove_full()")
> 
> Signed-off-by: Abel Wu 
> ---
>  mm/slub.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index fe81773..0b021b7 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2182,7 +2182,8 @@ static void deactivate_slab(struct kmem_cache *s, 
> struct page *page,
>   }
>   } else {
>   m = M_FULL;
> - if (kmem_cache_debug(s) && !lock) {
> +#ifdef CONFIG_SLUB_DEBUG
> + if (!lock) {
>   lock = 1;
>   /*
>* This also ensures that the scanning of full
> @@ -2191,6 +2192,7 @@ static void deactivate_slab(struct kmem_cache *s, 
> struct page *page,
>*/
>   spin_lock(&n->list_lock);
>   }
> +#endif
>   }
> 
>   if (l != m) {

This should be functionally safe.  I wonder, however, if it would make 
sense to only check for SLAB_STORE_USER here instead of 
kmem_cache_debug(), since that should be the only context in which we 
need the list_lock for add_full().  It seems more explicit.
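
i.e. something like this against the hunk above (untested):

	} else {
		m = M_FULL;
		/*
		 * add_full() only does anything when SLAB_STORE_USER is set,
		 * so that is the only case that needs n->list_lock taken
		 * here.
		 */
		if ((s->flags & SLAB_STORE_USER) && !lock) {
			lock = 1;
			spin_lock(&n->list_lock);
		}
	}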


Re: [PATCH] mm, page_alloc: fix core hung in free_pcppages_bulk()

2020-08-10 Thread David Rientjes
On Mon, 10 Aug 2020, Charan Teja Reddy wrote:

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e4896e6..25e7e12 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3106,6 +3106,7 @@ static void free_unref_page_commit(struct page *page, 
> unsigned long pfn)
>   struct zone *zone = page_zone(page);
>   struct per_cpu_pages *pcp;
>   int migratetype;
> + int high;
>  
>   migratetype = get_pcppage_migratetype(page);
>   __count_vm_event(PGFREE);
> @@ -3128,8 +3129,19 @@ static void free_unref_page_commit(struct page *page, 
> unsigned long pfn)
>   pcp = &this_cpu_ptr(zone->pageset)->pcp;
>   list_add(&page->lru, &pcp->lists[migratetype]);
>   pcp->count++;
> - if (pcp->count >= pcp->high) {
> - unsigned long batch = READ_ONCE(pcp->batch);
> + high = READ_ONCE(pcp->high);
> + if (pcp->count >= high) {
> + int batch;
> +
> + batch = READ_ONCE(pcp->batch);
> + /*
> +  * For non-default pcp struct values, high is always
> +  * greater than the batch. If high < batch then pass
> +  * proper count to free the pcp's list pages.
> +  */
> + if (unlikely(high < batch))
> + batch = min(pcp->count, batch);
> +
>   free_pcppages_bulk(zone, batch, pcp);
>   }
>  }

I'm wondering if a fix to free_pcppages_bulk() is more appropriate here 
because the count passed into it seems otherwise fragile if this results 
in a hung core?
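
i.e. clamping the count at the top of free_pcppages_bulk() itself,
roughly (untested sketch):

	static void free_pcppages_bulk(struct zone *zone, int count,
				       struct per_cpu_pages *pcp)
	{
		/*
		 * Never try to free more pages than the pcp lists actually
		 * hold, so the list scan below cannot spin on permanently
		 * empty lists.
		 */
		count = min(pcp->count, count);

		/* ... rest of the function unchanged ... */
	}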

