[PATCH] vfio/type1: Empty batch for pfnmap pages

2021-03-24 Thread Daniel Jordan
When vfio_pin_pages_remote() returns with a partial batch consisting of
a single VM_PFNMAP pfn, a subsequent call will unfortunately try
restoring it from batch->pages, resulting in vfio mapping the wrong page
and unbalancing the page refcount.

Prevent the function from returning with this kind of partial batch to
avoid the issue.  There's no explicit check for a VM_PFNMAP pfn because
such a check is awkward to make here, so infer it from characteristics of
the batch
instead.  This may result in occasional false positives but keeps the
code simpler.

Fixes: 4d83de6da265 ("vfio/type1: Batch page pinning")
Link: https://lkml.kernel.org/r/20210323133254.33ed9...@omen.home.shazbot.org/
Reported-by: Alex Williamson 
Suggested-by: Alex Williamson 
Signed-off-by: Daniel Jordan 
---

Alex, I couldn't immediately find a way to trigger this bug, but I can
run your test case if you like.

This is the minimal fix, but it should still protect all calls of
vfio_batch_unpin() from this problem.

 drivers/vfio/vfio_iommu_type1.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index be07664a..45cbfd4879a5 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -739,6 +739,12 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
ret = vfio_lock_acct(dma, lock_acct, false);
 
 unpin_out:
+   if (batch->size == 1 && !batch->offset) {
+   /* May be a VM_PFNMAP pfn, which the batch can't remember. */
+   put_pfn(pfn, dma->prot);
+   batch->size = 0;
+   }
+
if (ret < 0) {
if (pinned && !rsvd) {
for (pfn = *pfn_base ; pinned ; pfn++, pinned--)

base-commit: 84196390620ac0e5070ae36af84c137c6216a7dc
-- 
2.31.0



Re: [PATCH v2 3/3] vfio/type1: Batch page pinning

2021-03-23 Thread Daniel Jordan
Hi Alex,

Alex Williamson  writes:
> I've found a bug in this patch that we need to fix.  The diff is a
> little difficult to follow,

It was an awful diff, I remember...

> so I'll discuss it in the resulting function below...
>
> (1) Imagine the user has passed a vaddr range that alternates pfnmaps
> and pinned memory per page.
>
>
> static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>   long npage, unsigned long *pfn_base,
>   unsigned long limit, struct vfio_batch 
> *batch)
> {
> unsigned long pfn;
> struct mm_struct *mm = current->mm;
> long ret, pinned = 0, lock_acct = 0;
> bool rsvd;
> dma_addr_t iova = vaddr - dma->vaddr + dma->iova;
>
> /* This code path is only user initiated */
> if (!mm)
> return -ENODEV;
>
> if (batch->size) {
> /* Leftover pages in batch from an earlier call. */
> *pfn_base = page_to_pfn(batch->pages[batch->offset]);
> pfn = *pfn_base;
> rsvd = is_invalid_reserved_pfn(*pfn_base);
>
> (4) We're called again and fill our local variables from the batch.  The
> batch only has one page, so we'll complete the inner loop below and 
> refill.
>
> (6) We're called again, batch->size is 1, but it was for a pfnmap, the pages
> array still contains the last pinned page, so we end up incorrectly using
> this pfn again for the next entry.
>
> } else {
> *pfn_base = 0;
> }
>
> while (npage) {
> if (!batch->size) {
> /* Empty batch, so refill it. */
> long req_pages = min_t(long, npage, batch->capacity);
>
> ret = vaddr_get_pfns(mm, vaddr, req_pages, dma->prot,
>  &pfn, batch->pages);
> if (ret < 0)
> goto unpin_out;
>
> (2) Assume the 1st page is pfnmap, the 2nd is pinned memory

Just to check we're on the same wavelength, I think you can hit this bug
with one less call to vfio_pin_pages_remote() if the 1st page in the
vaddr range is pinned memory and the 2nd is pfnmap.  Then you'd have the
following sequence:

vfio_pin_pages_remote() call #1:

 - In the first batch refill, you'd get a size=1 batch with pinned
   memory and complete the inner loop, breaking at "if (!batch->size)".
   
 - In the second batch refill, you'd get another size=1 batch with a
   pfnmap page, take the "goto unpin_out" in the inner loop, and return
   from the function with the batch still containing a single pfnmap
   page.

vfio_pin_pages_remote() call #2:

 - *pfn_base is set from the first element of the pages array, which
   unfortunately has the non-pfnmap pfn.

Did I follow you?

>
> batch->size = ret;
> batch->offset = 0;
>
> if (!*pfn_base) {
> *pfn_base = pfn;
> rsvd = is_invalid_reserved_pfn(*pfn_base);
> }
> }
>
> /*
>  * pfn is preset for the first iteration of this inner loop 
> and
>  * updated at the end to handle a VM_PFNMAP pfn.  In that 
> case,
>  * batch->pages isn't valid (there's no struct page), so allow
>  * batch->pages to be touched only when there's more than one
>  * pfn to check, which guarantees the pfns are from a
>  * !VM_PFNMAP vma.
>  */
> while (true) {
> if (pfn != *pfn_base + pinned ||
> rsvd != is_invalid_reserved_pfn(pfn))
> goto out;
>
> (3) On the 2nd page, both tests are probably true here, so we take this goto,
> leaving the batch with the next page.
>
> (5) Now we've refilled batch, but the next page is pfnmap, so likely both of 
> the
> above tests are true... but this is a pfnmap'ing!
>
> (7) Do we add something like if (batch->size == 1 && !batch->offset) {
> put_pfn(pfn, dma->prot); batch->size = 0; }?

Yes, that could work, maybe with a check for a pfnmap mapping (rsvd)
instead of those two conditions.

I'd rejected two approaches where the batch stores pfns instead of
pages.  Allocating two pages (one for pages, one for pfns) seems
overkill, though the allocation is transient.  Using a union for "struct
page **pages" and "unsigned long *pfns" seems fragile due to the sizes
of each type needing to match, and possibly slow from having to loop
over the array twice (once to convert them all after pin_user_pages and
again for the inner loop).  Neither seems much better, at least to me,
even with this bug as additional motivation.
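
For reference, the union variant would have looked something like this
(illustration only, not something I'm proposing, with the fragile size
assumption spelled out):

	struct vfio_batch {
		union {
			struct page **pages;	/* filled by pin_user_pages_remote() */
			unsigned long *pfns;	/* pages converted in place afterward */
		};
		struct page *fallback_page;
		int capacity;
		int size;
		int offset;
	};

	/* The in-place conversion only works if these sizes match. */
	static_assert(sizeof(struct page *) == sizeof(unsigned long));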

It'd be better if pup returned pfns in some form, but that's another

Re: [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

2021-03-18 Thread Daniel Jordan
Alexey Klimov  writes:
> When a CPU is offlined and onlined via device_offline() and device_online(),
> userspace gets a uevent notification. If, after receiving the "online" uevent,
> userspace executes sched_setaffinity() on some task trying to move it
> to a recently onlined CPU, then it sometimes fails with -EINVAL. Userspace
> needs to wait around 5..30 ms before sched_setaffinity() will succeed for a
> recently onlined CPU after receiving the uevent.
>
> If in_mask argument for sched_setaffinity() has only recently onlined CPU,
> it could fail with such flow:
>
>   sched_setaffinity()
> cpuset_cpus_allowed()
>   guarantee_online_cpus()   <-- cs->effective_cpus mask does not
> contain recently onlined cpu
> cpumask_and()   <-- final new_mask is empty
> __set_cpus_allowed_ptr()
>   cpumask_any_and_distribute() <-- returns dest_cpu equal to nr_cpu_ids
>   returns -EINVAL
>
> Cpusets used in guarantee_online_cpus() are updated using workqueue from
> cpuset_update_active_cpus() which in its turn is called from cpu hotplug 
> callback
> sched_cpu_activate() hence it may not be observable by sched_setaffinity() if
> it is called immediately after uevent.
>
> Out of line uevent can be avoided if we will ensure that cpuset_hotplug_work
> has run to completion using cpuset_wait_for_hotplug() after onlining the
> cpu in cpu_device_up() and in cpuhp_smt_enable().
>
> Cc: Daniel Jordan 
> Reviewed-by: Qais Yousef 
> Co-analyzed-by: Joshua Baker 
> Signed-off-by: Alexey Klimov 

Looks good to me.

Reviewed-by: Daniel Jordan 


Re: [PATCH v2] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

2021-03-18 Thread Daniel Jordan
Alexey Klimov  writes:
> The first section of comment seems problematic to me with regards to such 
> move:
>
>  * As this needs to hold the cpu maps lock it's impossible
>  * to call device_offline() because that ends up calling
>  * cpu_down() which takes cpu maps lock. cpu maps lock
>  * needs to be held as this might race against in kernel
>  * abusers of the hotplug machinery (thermal management).
>
> Cpu maps lock is released in cpu_maps_update_done() hence we will move
> dev->offline out of cpu maps lock. Maybe I misunderstood the comment
> and it relates to calling cpu_down_maps_locked() under lock to avoid
> race?

Yes, that's what I take from the comment: the cpu maps lock protects
against racing hotplug operations.


Re: [PATCH 4/4] sched/cpuacct: Make user/system times in cpuacct.stat more precise

2021-03-17 Thread Daniel Jordan
Andrey Ryabinin  writes:
>  static int cpuacct_stats_show(struct seq_file *sf, void *v)
>  {
...
>   for_each_possible_cpu(cpu) {
>   u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
>  
> - val[CPUACCT_STAT_USER] += cpustat[CPUTIME_USER];
> - val[CPUACCT_STAT_USER] += cpustat[CPUTIME_NICE];
> - val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
> - val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
> - val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
> + cputime.utime += cpustat[CPUTIME_USER];
> + cputime.utime += cpustat[CPUTIME_NICE];
> + cputime.stime += cpustat[CPUTIME_SYSTEM];
> + cputime.stime += cpustat[CPUTIME_IRQ];
> + cputime.stime += cpustat[CPUTIME_SOFTIRQ];
> +
> + cputime.sum_exec_runtime += this_cpu_read(*ca->cpuusage);
>   }

cputime.sum_exec_runtime += *per_cpu_ptr(ca->cpuusage, cpu);

Or the stats can all be 0...


Re: [PATCH 3/4] sched/cpuacct: fix user/system in shown cpuacct.usage*

2021-03-17 Thread Daniel Jordan
Andrey Ryabinin  writes:

> cpuacct has 2 different ways of accounting and showing user
> and system times.
>
> The first one uses cpuacct_account_field() to account times
> and cpuacct.stat file to expose them. And this one seems to work ok.
>
> The second one uses the cpuacct_charge() function for accounting and a
> set of cpuacct.usage* files to show times. Despite some attempts to
> fix it in the past it still doesn't work. E.g. while running KVM
> guest the cpuacct_charge() accounts most of the guest time as
> system time. This doesn't match with user times shown in
> cpuacct.stat or /proc/<pid>/stat.

I couldn't reproduce this running a cpu bound load in a kvm guest on a
nohz_full cpu on 5.11.  The time is almost entirely in cpuacct.usage and
_user, while _sys stays low.

Could you say more about how you're seeing this?  Don't really doubt
there's a problem, just wondering what you're doing.

> diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
> index 941c28cf9738..7eff79faab0d 100644
> --- a/kernel/sched/cpuacct.c
> +++ b/kernel/sched/cpuacct.c
> @@ -29,7 +29,7 @@ struct cpuacct_usage {
>  struct cpuacct {
>   struct cgroup_subsys_state  css;
>   /* cpuusage holds pointer to a u64-type object on every CPU */
> - struct cpuacct_usage __percpu   *cpuusage;

Definition of struct cpuacct_usage can go away now.

> @@ -99,7 +99,8 @@ static void cpuacct_css_free(struct cgroup_subsys_state 
> *css)
>  static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
>enum cpuacct_stat_index index)
>  {
> - struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
> + u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
> + u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
>   u64 data;

There's a BUG_ON below this that could probably be a WARN_ON_ONCE while
you're here.

> @@ -278,8 +274,8 @@ static int cpuacct_stats_show(struct seq_file *sf, void 
> *v)
>   for_each_possible_cpu(cpu) {
>   u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
>  
> - val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
> - val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
> + val[CPUACCT_STAT_USER] += cpustat[CPUTIME_USER];
> + val[CPUACCT_STAT_USER] += cpustat[CPUTIME_NICE];

unnecessary whitespace change?


Re: [PATCH 2/4] cgroup: Fix 'usage_usec' time in root's cpu.stat

2021-03-17 Thread Daniel Jordan
Andrey Ryabinin  writes:

> Global CPUTIME_USER counter already includes CPUTIME_GUEST
> Also CPUTIME_NICE already includes CPUTIME_GUEST_NICE.
>
> Remove additions of CPUTIME_GUEST[_NICE] to total ->sum_exec_runtime
> to not account them twice.

Yes, that's just wrong.  usage_usec looks ok now.

Reviewed-by: Daniel Jordan 
Tested-by: Daniel Jordan 


Re: [PATCH 1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat

2021-03-17 Thread Daniel Jordan
Andrey Ryabinin  writes:

> cpuacct.stat in no-root cgroups shows user time without guest time
> included in it. This doesn't match with user time shown in root
> cpuacct.stat and /proc/<pid>/stat.

Yeah, that's inconsistent.

> Make account_guest_time() to add user time to cgroup's cpustat to
> fix this.

Yep.

cgroup2's cpu.stat is broken the same way for child cgroups, and this
happily fixes it.  Probably deserves a mention in the changelog.

The problem with cgroup2 was, if the workload was mostly guest time,
cpu.stat's user and system together reflected it, but it was split
unevenly across the two.  I think guest time wasn't actually included in
either bucket, it was just that the little user and system time there
was got scaled up in cgroup_base_stat_cputime_show -> cputime_adjust to
match sum_exec_runtime, which did have it.

The stats look ok now for both cgroup1 and 2.  Just slightly unsure
whether we want to change the way both interfaces expose the accounting
in case something out there depends on it.  Seems like we should, but
it'd be good to hear more opinions.

> @@ -148,11 +146,11 @@ void account_guest_time(struct task_struct *p, u64 
> cputime)
>  
>   /* Add guest time to cpustat. */
>   if (task_nice(p) > 0) {
> - cpustat[CPUTIME_NICE] += cputime;
> - cpustat[CPUTIME_GUEST_NICE] += cputime;
> + task_group_account_field(p, CPUTIME_NICE, cputime);
> + task_group_account_field(p, CPUTIME_GUEST_NICE, cputime);
>   } else {
> - cpustat[CPUTIME_USER] += cputime;
> - cpustat[CPUTIME_GUEST] += cputime;
> + task_group_account_field(p, CPUTIME_USER, cputime);
> + task_group_account_field(p, CPUTIME_GUEST, cputime);
>   }

Makes sense for _USER and _NICE, but it doesn't seem cgroup1 or 2
actually use _GUEST and _GUEST_NICE.

Could go either way.  Consistency is nice, but I probably wouldn't
change the GUEST ones so people aren't confused about why they're
accounted.  It's also extra cycles for nothing, even though most of the
data is probably in the cache.


[PATCH] vfio/type1: fix vaddr_get_pfns() return in vfio_pin_page_external()

2021-03-08 Thread Daniel Jordan
vaddr_get_pfns() now returns the positive number of pfns successfully
gotten instead of zero.  If vaddr_get_pfns() succeeds but
vfio_pin_page_external() doesn't reach vfio_lock_acct(),
vfio_pin_page_external() might return 1 to vfio_iommu_type1_pin_pages(),
which will treat it as an error.

Fix it up in vfio_pin_page_external().  Found by inspection.

Fixes: be16c1fd99f4 ("vfio/type1: Change success value of vaddr_get_pfn()")
Signed-off-by: Daniel Jordan 
---

I couldn't test this due to lack of hardware.

 drivers/vfio/vfio_iommu_type1.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 4bb162c1d649..2a0e3b3ce206 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -785,7 +785,12 @@ static int vfio_pin_page_external(struct vfio_dma *dma, 
unsigned long vaddr,
return -ENODEV;
 
ret = vaddr_get_pfns(mm, vaddr, 1, dma->prot, pfn_base, pages);
-   if (ret == 1 && do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
+   if (ret != 1)
+   goto out;
+
+   ret = 0;
+
+   if (do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
ret = vfio_lock_acct(dma, 1, true);
if (ret) {
put_pfn(*pfn_base, dma->prot);
@@ -797,6 +802,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, 
unsigned long vaddr,
}
}
 
+out:
mmput(mm);
return ret;
 }

base-commit: 144c79ef33536b4ecb4951e07dbc1f2b7fa99d32
-- 
2.30.1



[PATCH v2 0/3] vfio/type1: Batch page pinning

2021-02-19 Thread Daniel Jordan
v2:
 - Fixed missing error unwind in patch 3 (Alex).  After more thought,
   the ENODEV case is fine, so it stayed the same.

 - Rebased on linux-vfio.git/next (no conflicts).

---

The VFIO type1 driver is calling pin_user_pages_remote() once per 4k page, so
let's do it once per 512 4k pages to bring VFIO in line with other drivers such
as IB and vDPA.
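
(The 512 comes from using a single page to hold the page pointers; assuming
4k pages and 8-byte pointers, that's

	PAGE_SIZE / sizeof(struct page *) = 4096 / 8 = 512

which is the VFIO_BATCH_MAX_CAPACITY that patch 2 introduces.)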

qemu guests with at least 2G memory start about 8% faster on a Xeon server,
with more detailed results in the last changelog.

Thanks to Matthew, who first suggested the idea to me.

Daniel


Test Cases
--

 1) qemu passthrough with IOMMU-capable PCI device

 2) standalone program to hit
vfio_pin_map_dma() -> vfio_pin_pages_remote()

 3) standalone program to hit
vfio_iommu_replay() -> vfio_pin_pages_remote()

Each was run...

 - with varying sizes
 - with/without disable_hugepages=1
 - with/without LOCKED_VM exceeded

I didn't test vfio_pin_page_external() because there was no readily available
hardware, but the changes there are pretty minimal.

Daniel Jordan (3):
  vfio/type1: Change success value of vaddr_get_pfn()
  vfio/type1: Prepare for batched pinning with struct vfio_batch
  vfio/type1: Batch page pinning

 drivers/vfio/vfio_iommu_type1.c | 215 +++-
 1 file changed, 155 insertions(+), 60 deletions(-)

base-commit: 76adb20f924f8d27ed50d02cd29cadedb59fd88f
-- 
2.30.1



[PATCH v2 2/3] vfio/type1: Prepare for batched pinning with struct vfio_batch

2021-02-19 Thread Daniel Jordan
Get ready to pin more pages at once with struct vfio_batch, which
represents a batch of pinned pages.

The struct has a fallback page pointer to avoid two unlikely scenarios:
pointlessly allocating a page if disable_hugepages is enabled or failing
the whole pinning operation if the kernel can't allocate memory.

vaddr_get_pfn() becomes vaddr_get_pfns() to prepare for handling
multiple pages, though for now only one page is stored in the pages
array.

Signed-off-by: Daniel Jordan 
---
 drivers/vfio/vfio_iommu_type1.c | 71 +++--
 1 file changed, 58 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 7abaaad518a6..b7247a2fc87e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -103,6 +103,12 @@ struct vfio_dma {
unsigned long   *bitmap;
 };
 
+struct vfio_batch {
+   struct page **pages;/* for pin_user_pages_remote */
+   struct page *fallback_page; /* if pages alloc fails */
+   int capacity;   /* length of pages array */
+};
+
 struct vfio_group {
struct iommu_group  *iommu_group;
struct list_headnext;
@@ -459,6 +465,31 @@ static int put_pfn(unsigned long pfn, int prot)
return 0;
 }
 
+#define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
+
+static void vfio_batch_init(struct vfio_batch *batch)
+{
+   if (unlikely(disable_hugepages))
+   goto fallback;
+
+   batch->pages = (struct page **) __get_free_page(GFP_KERNEL);
+   if (!batch->pages)
+   goto fallback;
+
+   batch->capacity = VFIO_BATCH_MAX_CAPACITY;
+   return;
+
+fallback:
+   batch->pages = &batch->fallback_page;
+   batch->capacity = 1;
+}
+
+static void vfio_batch_fini(struct vfio_batch *batch)
+{
+   if (batch->capacity == VFIO_BATCH_MAX_CAPACITY)
+   free_page((unsigned long)batch->pages);
+}
+
 static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
unsigned long vaddr, unsigned long *pfn,
bool write_fault)
@@ -489,10 +520,10 @@ static int follow_fault_pfn(struct vm_area_struct *vma, 
struct mm_struct *mm,
  * Returns the positive number of pfns successfully obtained or a negative
  * error code.
  */
-static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
-int prot, unsigned long *pfn)
+static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
+ long npages, int prot, unsigned long *pfn,
+ struct page **pages)
 {
-   struct page *page[1];
struct vm_area_struct *vma;
unsigned int flags = 0;
int ret;
@@ -501,10 +532,10 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
flags |= FOLL_WRITE;
 
mmap_read_lock(mm);
-   ret = pin_user_pages_remote(mm, vaddr, 1, flags | FOLL_LONGTERM,
-   page, NULL, NULL);
-   if (ret == 1) {
-   *pfn = page_to_pfn(page[0]);
+   ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
+   pages, NULL, NULL);
+   if (ret > 0) {
+   *pfn = page_to_pfn(pages[0]);
goto done;
}
 
@@ -592,7 +623,7 @@ static int vfio_wait_all_valid(struct vfio_iommu *iommu)
  */
 static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
  long npage, unsigned long *pfn_base,
- unsigned long limit)
+ unsigned long limit, struct vfio_batch *batch)
 {
unsigned long pfn = 0;
long ret, pinned = 0, lock_acct = 0;
@@ -603,7 +634,8 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
if (!current->mm)
return -ENODEV;
 
-   ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, pfn_base);
+   ret = vaddr_get_pfns(current->mm, vaddr, 1, dma->prot, pfn_base,
+batch->pages);
if (ret < 0)
return ret;
 
@@ -630,7 +662,8 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
/* Lock all the consecutive pages from pfn_base */
for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
 pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) {
-   ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, &pfn);
+   ret = vaddr_get_pfns(current->mm, vaddr, 1, dma->prot, &pfn,
+batch->pages);
if (ret < 0)
break;
 
@@ -693,6 +726,7 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, 
dma_addr_t iova,
 static

[PATCH v2 1/3] vfio/type1: Change success value of vaddr_get_pfn()

2021-02-19 Thread Daniel Jordan
vaddr_get_pfn() simply returns 0 on success.  Have it report the number
of pfns successfully gotten instead, whether from page pinning or
follow_fault_pfn(), which will be used later when batching pinning.

Change the last check in vfio_pin_pages_remote() for consistency with
the other two.

Signed-off-by: Daniel Jordan 
---
 drivers/vfio/vfio_iommu_type1.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ec9fd95a138b..7abaaad518a6 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -485,6 +485,10 @@ static int follow_fault_pfn(struct vm_area_struct *vma, 
struct mm_struct *mm,
return ret;
 }
 
+/*
+ * Returns the positive number of pfns successfully obtained or a negative
+ * error code.
+ */
 static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 int prot, unsigned long *pfn)
 {
@@ -501,7 +505,6 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
page, NULL, NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
-   ret = 0;
goto done;
}
 
@@ -515,8 +518,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
if (ret == -EAGAIN)
goto retry;
 
-   if (!ret && !is_invalid_reserved_pfn(*pfn))
-   ret = -EFAULT;
+   if (!ret) {
+   if (is_invalid_reserved_pfn(*pfn))
+   ret = 1;
+   else
+   ret = -EFAULT;
+   }
}
 done:
mmap_read_unlock(mm);
@@ -597,7 +604,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
return -ENODEV;
 
ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, pfn_base);
-   if (ret)
+   if (ret < 0)
return ret;
 
pinned++;
@@ -624,7 +631,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
 pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) {
ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, &pfn);
-   if (ret)
+   if (ret < 0)
break;
 
if (pfn != *pfn_base + pinned ||
@@ -650,7 +657,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
ret = vfio_lock_acct(dma, lock_acct, false);
 
 unpin_out:
-   if (ret) {
+   if (ret < 0) {
if (!rsvd) {
for (pfn = *pfn_base ; pinned ; pfn++, pinned--)
put_pfn(pfn, dma->prot);
@@ -694,7 +701,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, 
unsigned long vaddr,
return -ENODEV;
 
ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base);
-   if (!ret && do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
+   if (ret == 1 && do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
ret = vfio_lock_acct(dma, 1, true);
if (ret) {
put_pfn(*pfn_base, dma->prot);
-- 
2.30.1



[PATCH v2 3/3] vfio/type1: Batch page pinning

2021-02-19 Thread Daniel Jordan
Pinning one 4K page at a time is inefficient, so do it in batches of 512
instead.  This is just an optimization with no functional change
intended, and in particular the driver still calls iommu_map() with the
largest physically contiguous range possible.

Add two fields in vfio_batch to remember where to start between calls to
vfio_pin_pages_remote(), and use vfio_batch_unpin() to handle remaining
pages in the batch in case of error.

qemu pins pages for guests around 8% faster on my test system, a
two-node Broadwell server with 128G memory per node.  The qemu process
was bound to one node with its allocations constrained there as well.

   guest                     base             test
  mem (GB)   speedup   avg sec (std)    avg sec (std)
      1        7.4%      0.61 (0.00)      0.56 (0.00)
      2        8.3%      0.93 (0.00)      0.85 (0.00)
      4        8.4%      1.46 (0.00)      1.34 (0.00)
      8        8.6%      2.54 (0.01)      2.32 (0.00)
     16        8.3%      4.66 (0.00)      4.27 (0.01)
     32        8.3%      8.94 (0.01)      8.20 (0.01)
     64        8.2%     17.47 (0.01)     16.04 (0.03)
    120        8.5%     32.45 (0.13)     29.69 (0.01)

perf diff confirms less time spent in pup.  Here are the top ten
functions:

 Baseline  Delta Abs  Symbol

   78.63% +6.64%  clear_page_erms
1.50% -1.50%  __gup_longterm_locked
1.27% -0.78%  __get_user_pages
  +0.76%  kvm_zap_rmapp.constprop.0
0.54% -0.53%  vmacache_find
0.55% -0.51%  get_pfnblock_flags_mask
0.48% -0.48%  __get_user_pages_remote
  +0.39%  slot_rmap_walk_next
  +0.32%  vfio_pin_map_dma
  +0.26%  kvm_handle_hva_range
...

Suggested-by: Matthew Wilcox (Oracle) 
Signed-off-by: Daniel Jordan 
---
 drivers/vfio/vfio_iommu_type1.c | 135 +---
 1 file changed, 89 insertions(+), 46 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b7247a2fc87e..cec2083dd556 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -107,6 +107,8 @@ struct vfio_batch {
struct page **pages;/* for pin_user_pages_remote */
struct page *fallback_page; /* if pages alloc fails */
int capacity;   /* length of pages array */
+   int size;   /* of batch currently */
+   int offset; /* of next entry in pages */
 };
 
 struct vfio_group {
@@ -469,6 +471,9 @@ static int put_pfn(unsigned long pfn, int prot)
 
 static void vfio_batch_init(struct vfio_batch *batch)
 {
+   batch->size = 0;
+   batch->offset = 0;
+
if (unlikely(disable_hugepages))
goto fallback;
 
@@ -484,6 +489,17 @@ static void vfio_batch_init(struct vfio_batch *batch)
batch->capacity = 1;
 }
 
+static void vfio_batch_unpin(struct vfio_batch *batch, struct vfio_dma *dma)
+{
+   while (batch->size) {
+   unsigned long pfn = page_to_pfn(batch->pages[batch->offset]);
+
+   put_pfn(pfn, dma->prot);
+   batch->offset++;
+   batch->size--;
+   }
+}
+
 static void vfio_batch_fini(struct vfio_batch *batch)
 {
if (batch->capacity == VFIO_BATCH_MAX_CAPACITY)
@@ -625,65 +641,88 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
  long npage, unsigned long *pfn_base,
  unsigned long limit, struct vfio_batch *batch)
 {
-   unsigned long pfn = 0;
+   unsigned long pfn;
+   struct mm_struct *mm = current->mm;
long ret, pinned = 0, lock_acct = 0;
bool rsvd;
dma_addr_t iova = vaddr - dma->vaddr + dma->iova;
 
/* This code path is only user initiated */
-   if (!current->mm)
+   if (!mm)
return -ENODEV;
 
-   ret = vaddr_get_pfns(current->mm, vaddr, 1, dma->prot, pfn_base,
-batch->pages);
-   if (ret < 0)
-   return ret;
-
-   pinned++;
-   rsvd = is_invalid_reserved_pfn(*pfn_base);
-
-   /*
-* Reserved pages aren't counted against the user, externally pinned
-* pages are already counted against the user.
-*/
-   if (!rsvd && !vfio_find_vpfn(dma, iova)) {
-   if (!dma->lock_cap && current->mm->locked_vm + 1 > limit) {
-   put_pfn(*pfn_base, dma->prot);
-   pr_warn("%s: RLIMIT_MEMLOC

Re: [PATCH 3/3] vfio/type1: batch page pinning

2021-02-18 Thread Daniel Jordan
Alex Williamson  writes:
> This might not be the first batch we fill, I think this needs to unwind
> rather than direct return.

So it does, well spotted.  And it's the same thing with the ENODEV case
farther up.

> Series looks good otherwise.

Thanks for going through it!


Re: [PATCH v2] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

2021-02-12 Thread Daniel Jordan
Alexey Klimov  writes:
> int cpu_device_up(struct device *dev)

Yeah, definitely better to do the wait here.

>  int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
>  {
> - int cpu, ret = 0;
> + struct device *dev;
> + cpumask_var_t mask;
> + int cpu, ret;
> +
> + if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
> + return -ENOMEM;
>  
> + ret = 0;
>   cpu_maps_update_begin();
>   for_each_online_cpu(cpu) {
>   if (topology_is_primary_thread(cpu))
> @@ -2099,18 +2098,35 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
>* called under the sysfs hotplug lock, so it is properly
>* serialized against the regular offline usage.
>*/
> - cpuhp_offline_cpu_device(cpu);
> + dev = get_cpu_device(cpu);
> + dev->offline = true;
> +
> + cpumask_set_cpu(cpu, mask);
>   }
>   if (!ret)
>   cpu_smt_control = ctrlval;
>   cpu_maps_update_done();
> +
> + /* Tell user space about the state changes */
> + for_each_cpu(cpu, mask) {
> + dev = get_cpu_device(cpu);
> + kobject_uevent(&dev->kobj, KOBJ_OFFLINE);
> + }
> +
> + free_cpumask_var(mask);
>   return ret;
>  }

Hrm, should the dev manipulation be kept in one place, something like
this?

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 8817ccdc8e112..aa21219a7b7c4 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2085,11 +2085,20 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
ret = cpu_down_maps_locked(cpu, CPUHP_OFFLINE);
if (ret)
break;
+
+   cpumask_set_cpu(cpu, mask);
+   }
+   if (!ret)
+   cpu_smt_control = ctrlval;
+   cpu_maps_update_done();
+
+   /* Tell user space about the state changes */
+   for_each_cpu(cpu, mask) {
/*
-* As this needs to hold the cpu maps lock it's impossible
+* When the cpu maps lock was taken above it was impossible
 * to call device_offline() because that ends up calling
 * cpu_down() which takes cpu maps lock. cpu maps lock
-* needs to be held as this might race against in kernel
+* needed to be held as this might race against in kernel
 * abusers of the hotplug machinery (thermal management).
 *
 * So nothing would update device:offline state. That would
@@ -2100,16 +2109,6 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 */
dev = get_cpu_device(cpu);
dev->offline = true;
-
-   cpumask_set_cpu(cpu, mask);
-   }
-   if (!ret)
-   cpu_smt_control = ctrlval;
-   cpu_maps_update_done();
-
-   /* Tell user space about the state changes */
-   for_each_cpu(cpu, mask) {
-   dev = get_cpu_device(cpu);
kobject_uevent(&dev->kobj, KOBJ_OFFLINE);
}
 
@@ -2136,9 +2135,6 @@ int cpuhp_smt_enable(void)
ret = _cpu_up(cpu, 0, CPUHP_ONLINE);
if (ret)
break;
-   /* See comment in cpuhp_smt_disable() */
-   dev = get_cpu_device(cpu);
-   dev->offline = false;
 
cpumask_set_cpu(cpu, mask);
}
@@ -2152,7 +2148,9 @@ int cpuhp_smt_enable(void)
 
/* Tell user space about the state changes */
for_each_cpu(cpu, mask) {
+   /* See comment in cpuhp_smt_disable() */
dev = get_cpu_device(cpu);
+   dev->offline = false;
kobject_uevent(&dev->kobj, KOBJ_ONLINE);
}
 


Re: [PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

2021-02-04 Thread Daniel Jordan
Alexey Klimov  writes:

> When a CPU is offlined and onlined via device_offline() and device_online(),
> userspace gets a uevent notification. If, after receiving the "online" uevent,
> userspace executes sched_setaffinity() on some task trying to move it
> to a recently onlined CPU, then it often fails with -EINVAL. Userspace needs
> to wait around 5..30 ms before sched_setaffinity() will succeed for the
> recently onlined CPU after receiving the uevent.
>
> If in_mask argument for sched_setaffinity() has only recently onlined CPU,
> it often fails with such flow:
>
>   sched_setaffinity()
> cpuset_cpus_allowed()
>   guarantee_online_cpus()   <-- cs->effective_cpus mask does not
> contain recently onlined cpu
> cpumask_and()   <-- final new_mask is empty
> __set_cpus_allowed_ptr()
>   cpumask_any_and_distribute() <-- returns dest_cpu equal to nr_cpu_ids
>   returns -EINVAL
>
> Cpusets used in guarantee_online_cpus() are updated using workqueue from
> cpuset_update_active_cpus() which in its turn is called from cpu hotplug 
> callback
> sched_cpu_activate() hence it may not be observable by sched_setaffinity() if
> it is called immediately after uevent.
> Out of line uevent can be avoided if we will ensure that cpuset_hotplug_work
> has run to completion using cpuset_wait_for_hotplug() after onlining the
> cpu in cpu_up() and in cpuhp_smt_enable().

Nice writeup.  I just have some nits, patch looks ok otherwise.

> @@ -1281,6 +1282,11 @@ static int cpu_up(unsigned int cpu, enum cpuhp_state 
> target)
>   err = _cpu_up(cpu, 0, target);
>  out:
>   cpu_maps_update_done();
> +
> + /* To avoid out of line uevent */

Not sure this will make sense out of context.  Maybe,

/*
 * Wait for cpuset updates to cpumasks to finish.  Later on this path
 * may generate uevents whose consumers rely on the updates.
 */

> @@ -2062,8 +2068,6 @@ static void cpuhp_offline_cpu_device(unsigned int cpu)
>   struct device *dev = get_cpu_device(cpu);
>  
>   dev->offline = true;
>  }
>  
> @@ -2071,14 +2075,18 @@ static void cpuhp_online_cpu_device(unsigned int cpu)
>   struct device *dev = get_cpu_device(cpu);
>  
>   dev->offline = false;
>  }

You could get rid of these functions and just put the few remaining bits
in the callers.  They each have only one.
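
E.g. for the offline side, the caller could just do (untested sketch):

	dev = get_cpu_device(cpu);
	dev->offline = true;

and set it back to false the same way on the online side.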


Re: [PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

2021-02-04 Thread Daniel Jordan
Peter Zijlstra  writes:

> On Thu, Feb 04, 2021 at 12:50:34PM +, Alexey Klimov wrote:
>> On Thu, Feb 4, 2021 at 9:46 AM Peter Zijlstra  wrote:
>> >
>> > On Thu, Feb 04, 2021 at 01:01:57AM +, Alexey Klimov wrote:
>> > > @@ -1281,6 +1282,11 @@ static int cpu_up(unsigned int cpu, enum 
>> > > cpuhp_state target)
>> > >   err = _cpu_up(cpu, 0, target);
>> > >  out:
>> > >   cpu_maps_update_done();
>> > > +
>> > > + /* To avoid out of line uevent */
>> > > + if (!err)
>> > > + cpuset_wait_for_hotplug();
>> > > +
>> > >   return err;
>> > >  }
>> > >
>> >
>> > > @@ -2071,14 +2075,18 @@ static void cpuhp_online_cpu_device(unsigned int 
>> > > cpu)
>> > >   struct device *dev = get_cpu_device(cpu);
>> > >
>> > >   dev->offline = false;
>> > > - /* Tell user space about the state change */
>> > > - kobject_uevent(&dev->kobj, KOBJ_ONLINE);
>> > >  }
>> > >
>> >
>> > One concequence of this is that you'll now get a bunch of notifications
>> > across things like suspend/hybernate.
>> 
>> The patch doesn't change the number of kobject_uevent()s. The
>> userspace will get the same number of uevents as before the patch (at
>> least if I can rely on my eyes).
>
> bringup_hibernate_cpu() didn't used to generate an event, it does now.
> Same for bringup_nonboot_cpus().

Both of those call cpu_up(), which only gets a cpuset_wait_for_hotplug()
in this patch.  No new events generated from that, right, it's just a
wrapper for a flush_work()?

> Also, looking again, you don't seem to be reinstating the OFFLINE event
> you took out.

It seems to be reinstated in cpuhp_smt_disable()?


[PATCH 3/3] vfio/type1: batch page pinning

2021-02-03 Thread Daniel Jordan
Pinning one 4K page at a time is inefficient, so do it in batches of 512
instead.  This is just an optimization with no functional change
intended, and in particular the driver still calls iommu_map() with the
largest physically contiguous range possible.

Add two fields in vfio_batch to remember where to start between calls to
vfio_pin_pages_remote(), and use vfio_batch_unpin() to handle remaining
pages in the batch in case of error.

qemu pins pages for guests around 8% faster on my test system, a
two-node Broadwell server with 128G memory per node.  The qemu process
was bound to one node with its allocations constrained there as well.

   guest                     base             test
  mem (GB)   speedup   avg sec (std)    avg sec (std)
      1        7.4%      0.61 (0.00)      0.56 (0.00)
      2        8.3%      0.93 (0.00)      0.85 (0.00)
      4        8.4%      1.46 (0.00)      1.34 (0.00)
      8        8.6%      2.54 (0.01)      2.32 (0.00)
     16        8.3%      4.66 (0.00)      4.27 (0.01)
     32        8.3%      8.94 (0.01)      8.20 (0.01)
     64        8.2%     17.47 (0.01)     16.04 (0.03)
    120        8.5%     32.45 (0.13)     29.69 (0.01)

perf diff confirms less time spent in pup.  Here are the top ten
functions:

 Baseline  Delta Abs  Symbol

   78.63% +6.64%  clear_page_erms
1.50% -1.50%  __gup_longterm_locked
1.27% -0.78%  __get_user_pages
  +0.76%  kvm_zap_rmapp.constprop.0
0.54% -0.53%  vmacache_find
0.55% -0.51%  get_pfnblock_flags_mask
0.48% -0.48%  __get_user_pages_remote
  +0.39%  slot_rmap_walk_next
  +0.32%  vfio_pin_map_dma
  +0.26%  kvm_handle_hva_range
...

Suggested-by: Matthew Wilcox (Oracle) 
Signed-off-by: Daniel Jordan 
---
 drivers/vfio/vfio_iommu_type1.c | 133 +---
 1 file changed, 88 insertions(+), 45 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c26c1a4697e5..ac59bfc4e332 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -101,6 +101,8 @@ struct vfio_batch {
struct page **pages;/* for pin_user_pages_remote */
struct page *fallback_page; /* if pages alloc fails */
int capacity;   /* length of pages array */
+   int size;   /* of batch currently */
+   int offset; /* of next entry in pages */
 };
 
 struct vfio_group {
@@ -425,6 +427,9 @@ static int put_pfn(unsigned long pfn, int prot)
 
 static void vfio_batch_init(struct vfio_batch *batch)
 {
+   batch->size = 0;
+   batch->offset = 0;
+
if (unlikely(disable_hugepages))
goto fallback;
 
@@ -440,6 +445,17 @@ static void vfio_batch_init(struct vfio_batch *batch)
batch->capacity = 1;
 }
 
+static void vfio_batch_unpin(struct vfio_batch *batch, struct vfio_dma *dma)
+{
+   while (batch->size) {
+   unsigned long pfn = page_to_pfn(batch->pages[batch->offset]);
+
+   put_pfn(pfn, dma->prot);
+   batch->offset++;
+   batch->size--;
+   }
+}
+
 static void vfio_batch_fini(struct vfio_batch *batch)
 {
if (batch->capacity == VFIO_BATCH_MAX_CAPACITY)
@@ -526,65 +542,88 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
  long npage, unsigned long *pfn_base,
  unsigned long limit, struct vfio_batch *batch)
 {
-   unsigned long pfn = 0;
+   unsigned long pfn;
+   struct mm_struct *mm = current->mm;
long ret, pinned = 0, lock_acct = 0;
bool rsvd;
dma_addr_t iova = vaddr - dma->vaddr + dma->iova;
 
/* This code path is only user initiated */
-   if (!current->mm)
+   if (!mm)
return -ENODEV;
 
-   ret = vaddr_get_pfns(current->mm, vaddr, 1, dma->prot, pfn_base,
-batch->pages);
-   if (ret < 0)
-   return ret;
-
-   pinned++;
-   rsvd = is_invalid_reserved_pfn(*pfn_base);
-
-   /*
-* Reserved pages aren't counted against the user, externally pinned
-* pages are already counted against the user.
-*/
-   if (!rsvd && !vfio_find_vpfn(dma, iova)) {
-   if (!dma->lock_cap && current->mm->locked_vm + 1 > limit) {
-   put_pfn(*pfn_base, dma->prot);
-   pr_warn("%s: RLIMIT_MEMLOC

[PATCH 2/3] vfio/type1: prepare for batched pinning with struct vfio_batch

2021-02-03 Thread Daniel Jordan
Get ready to pin more pages at once with struct vfio_batch, which
represents a batch of pinned pages.

The struct has a fallback page pointer to avoid two unlikely scenarios:
pointlessly allocating a page if disable_hugepages is enabled or failing
the whole pinning operation if the kernel can't allocate memory.

vaddr_get_pfn() becomes vaddr_get_pfns() to prepare for handling
multiple pages, though for now only one page is stored in the pages
array.

Signed-off-by: Daniel Jordan 
---
 drivers/vfio/vfio_iommu_type1.c | 71 +++--
 1 file changed, 58 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 4d608bc552a4..c26c1a4697e5 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -97,6 +97,12 @@ struct vfio_dma {
unsigned long   *bitmap;
 };
 
+struct vfio_batch {
+   struct page **pages;/* for pin_user_pages_remote */
+   struct page *fallback_page; /* if pages alloc fails */
+   int capacity;   /* length of pages array */
+};
+
 struct vfio_group {
struct iommu_group  *iommu_group;
struct list_headnext;
@@ -415,6 +421,31 @@ static int put_pfn(unsigned long pfn, int prot)
return 0;
 }
 
+#define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
+
+static void vfio_batch_init(struct vfio_batch *batch)
+{
+   if (unlikely(disable_hugepages))
+   goto fallback;
+
+   batch->pages = (struct page **) __get_free_page(GFP_KERNEL);
+   if (!batch->pages)
+   goto fallback;
+
+   batch->capacity = VFIO_BATCH_MAX_CAPACITY;
+   return;
+
+fallback:
+   batch->pages = &batch->fallback_page;
+   batch->capacity = 1;
+}
+
+static void vfio_batch_fini(struct vfio_batch *batch)
+{
+   if (batch->capacity == VFIO_BATCH_MAX_CAPACITY)
+   free_page((unsigned long)batch->pages);
+}
+
 static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
unsigned long vaddr, unsigned long *pfn,
bool write_fault)
@@ -445,10 +476,10 @@ static int follow_fault_pfn(struct vm_area_struct *vma, 
struct mm_struct *mm,
  * Returns the positive number of pfns successfully obtained or a negative
  * error code.
  */
-static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
-int prot, unsigned long *pfn)
+static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
+ long npages, int prot, unsigned long *pfn,
+ struct page **pages)
 {
-   struct page *page[1];
struct vm_area_struct *vma;
unsigned int flags = 0;
int ret;
@@ -457,10 +488,10 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
flags |= FOLL_WRITE;
 
mmap_read_lock(mm);
-   ret = pin_user_pages_remote(mm, vaddr, 1, flags | FOLL_LONGTERM,
-   page, NULL, NULL);
-   if (ret == 1) {
-   *pfn = page_to_pfn(page[0]);
+   ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
+   pages, NULL, NULL);
+   if (ret > 0) {
+   *pfn = page_to_pfn(pages[0]);
goto done;
}
 
@@ -493,7 +524,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
  */
 static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
  long npage, unsigned long *pfn_base,
- unsigned long limit)
+ unsigned long limit, struct vfio_batch *batch)
 {
unsigned long pfn = 0;
long ret, pinned = 0, lock_acct = 0;
@@ -504,7 +535,8 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
if (!current->mm)
return -ENODEV;
 
-   ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, pfn_base);
+   ret = vaddr_get_pfns(current->mm, vaddr, 1, dma->prot, pfn_base,
+batch->pages);
if (ret < 0)
return ret;
 
@@ -531,7 +563,8 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
/* Lock all the consecutive pages from pfn_base */
for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
 pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) {
-   ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, &pfn);
+   ret = vaddr_get_pfns(current->mm, vaddr, 1, dma->prot, &pfn,
+batch->pages);
if (ret < 0)
break;
 
@@ -594,6 +627,7 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, 
dma_addr_t iova,
 static

[PATCH 1/3] vfio/type1: change success value of vaddr_get_pfn()

2021-02-03 Thread Daniel Jordan
vaddr_get_pfn() simply returns 0 on success.  Have it report the number
of pfns successfully gotten instead, whether from page pinning or
follow_fault_pfn(), which will be used later when batching pinning.

Change the last check in vfio_pin_pages_remote() for consistency with
the other two.

Signed-off-by: Daniel Jordan 
---
 drivers/vfio/vfio_iommu_type1.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0b4dedaa9128..4d608bc552a4 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -441,6 +441,10 @@ static int follow_fault_pfn(struct vm_area_struct *vma, 
struct mm_struct *mm,
return ret;
 }
 
+/*
+ * Returns the positive number of pfns successfully obtained or a negative
+ * error code.
+ */
 static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 int prot, unsigned long *pfn)
 {
@@ -457,7 +461,6 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
page, NULL, NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
-   ret = 0;
goto done;
}
 
@@ -471,8 +474,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
if (ret == -EAGAIN)
goto retry;
 
-   if (!ret && !is_invalid_reserved_pfn(*pfn))
-   ret = -EFAULT;
+   if (!ret) {
+   if (is_invalid_reserved_pfn(*pfn))
+   ret = 1;
+   else
+   ret = -EFAULT;
+   }
}
 done:
mmap_read_unlock(mm);
@@ -498,7 +505,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
return -ENODEV;
 
ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, pfn_base);
-   if (ret)
+   if (ret < 0)
return ret;
 
pinned++;
@@ -525,7 +532,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
 pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) {
ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, &pfn);
-   if (ret)
+   if (ret < 0)
break;
 
if (pfn != *pfn_base + pinned ||
@@ -551,7 +558,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
ret = vfio_lock_acct(dma, lock_acct, false);
 
 unpin_out:
-   if (ret) {
+   if (ret < 0) {
if (!rsvd) {
for (pfn = *pfn_base ; pinned ; pfn++, pinned--)
put_pfn(pfn, dma->prot);
@@ -595,7 +602,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, 
unsigned long vaddr,
return -ENODEV;
 
ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base);
-   if (!ret && do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
+   if (ret == 1 && do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
ret = vfio_lock_acct(dma, 1, true);
if (ret) {
put_pfn(*pfn_base, dma->prot);
-- 
2.30.0



[PATCH 0/3] vfio/type1: batch page pinning

2021-02-03 Thread Daniel Jordan
Hi,

The VFIO type1 driver is calling pin_user_pages_remote() once per 4k page, so
let's do it once per 512 4k pages to bring VFIO in line with other drivers such
as IB and vDPA.

qemu guests with at least 2G memory start about 8% faster on a Xeon server,
with more detailed results in the last changelog.

Thanks to Matthew, who first suggested the idea to me.

Daniel


Test Cases
--

 1) qemu passthrough with IOMMU-capable PCI device

 2) standalone program to hit
vfio_pin_map_dma() -> vfio_pin_pages_remote()

 3) standalone program to hit
vfio_iommu_replay() -> vfio_pin_pages_remote()

Each was run...

 - with varying sizes
 - with/without disable_hugepages=1
 - with/without LOCKED_VM exceeded

I didn't test vfio_pin_page_external() because there was no readily available
hardware, but the changes there are pretty minimal.

Series based on v5.11-rc6.


Daniel Jordan (3):
  vfio/type1: change success value of vaddr_get_pfn()
  vfio/type1: prepare for batched pinning with struct vfio_batch
  vfio/type1: batch page pinning

 drivers/vfio/vfio_iommu_type1.c | 213 +++-
 1 file changed, 154 insertions(+), 59 deletions(-)


base-commit: 1048ba83fb1c00cd24172e23e8263972f6b5d9ac
-- 
2.30.0



Re: [RFC][PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu online

2021-01-20 Thread Daniel Jordan
Alexey Klimov  writes:
> Daniel, thank you for taking a look. I don't mind reviewing+testing
> another approach that you described.

Eh, I like yours better :)

>> Absent further discussion, Alexey, do you plan to post another version?
>
> I plan to update this patch and re-send in the next couple of days. It
> looks like it might be a series of two patches. Sorry for delays.

Not at all, this is just something I'm messing with when I have the
time.


Re: [RFC][PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu online

2021-01-14 Thread Daniel Jordan
Daniel Jordan  writes:
> Peter Zijlstra  writes:
>>> The nature of this bug is also described here (with different consequences):
>>> https://lore.kernel.org/lkml/20200211141554.24181-1-qais.you...@arm.com/
>>
>> Yeah, pesky deadlocks.. someone was going to try again.
>
> I dug up the synchronous patch
>
> 
> https://lore.kernel.org/lkml/1579878449-10164-1-git-send-email-prs...@codeaurora.org/
>
> but surprisingly wasn't able to reproduce the lockdep splat from
>
> https://lore.kernel.org/lkml/f0388d99-84d7-453b-9b6b-eeff0e7be...@lca.pw/
>
> even though I could hit it a few weeks ago.

oh okay, you need to mount a legacy cpuset hierarchy.

So as the above splat shows, making cpuset_hotplug_workfn() synchronous
means cpu_hotplug_lock (and "cpuhp_state-down") can be acquired before
cgroup_mutex.

But there are at least four cgroup paths that take the locks in the
opposite order.  They're all the same, they take cgroup_mutex and then
cpu_hotplug_lock later on to modify one or more static keys.

cpu_hotplug_lock should probably be ahead of cgroup_mutex because the
latter is taken in a hotplug callback, and we should keep the static
branches in cgroup, so the only way out I can think of is moving
cpu_hotplug_lock to just before cgroup_mutex is taken and switching to
_cpuslocked flavors of the static key calls.
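
The shape of that change in the cgroup paths would be roughly this, where
the key name is made up and stands in for whichever static key a given
path flips (illustration only, the real call sites differ):

	cpus_read_lock();
	mutex_lock(&cgroup_mutex);
	...
	static_branch_enable_cpuslocked(&some_cgroup_key);
	...
	mutex_unlock(&cgroup_mutex);
	cpus_read_unlock();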

lockdep quiets down with that change everywhere, but it puts another big
lock around a lot of cgroup paths.  Seems less heavyhanded to go with
this RFC.  What do you all think?

Absent further discussion, Alexey, do you plan to post another version?


Re: [PATCH] cgroup: Remove unnecessary call to strstrip()

2021-01-14 Thread Daniel Jordan
Hello Michal,

Michal Koutný  writes:
> On Sun, Jan 03, 2021 at 02:50:01AM +, Hao Lee  
> wrote:
>> The string buf will be stripped in cgroup_procs_write_start() before it
>> is converted to int, so remove this unnecessary call to strstrip().
> Good catch, Hao.
>
> Perhaps the code be then simplified a bit
>
> -- >8 --
> From: =?UTF-8?q?Michal=20Koutn=C3=BD?= 
> Date: Thu, 14 Jan 2021 13:23:39 +0100
> Subject: [PATCH] cgroup: cgroup.{procs,threads} factor out common parts
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> The functions cgroup_threads_start and cgroup_procs_start are almost

You meant cgroup_threads_write and cgroup_procs_write.

>  kernel/cgroup/cgroup.c | 55 +++---
>  1 file changed, 14 insertions(+), 41 deletions(-)

Ok, sure, that's a good thing.

Reviewed-by: Daniel Jordan 


Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

2020-12-22 Thread Daniel Jordan
Liang Li  writes:
> The first version can be found at: https://lkml.org/lkml/2020/4/12/42
>
> Zero out the page content usually happens when allocating pages with
> the flag of __GFP_ZERO, this is a time consuming operation, it makes
> the population of a large vma area very slowly. This patch introduce
> a new feature for zero out pages before page allocation, it can help
> to speed up page allocation with __GFP_ZERO.

kzeropaged appears to escape some of the kernel's resource controls, at
least if I'm understanding this right.

The heavy part of a page fault is moved out of the faulting task's
context so the CPU controller can't throttle it.  A task that uses
these pages can benefit from clearing done by CPUs that it's not allowed
to run on.  How can it handle these cases?


Re: [RFC][PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu online

2020-12-08 Thread Daniel Jordan
Peter Zijlstra  writes:
>> The nature of this bug is also described here (with different consequences):
>> https://lore.kernel.org/lkml/20200211141554.24181-1-qais.you...@arm.com/
>
> Yeah, pesky deadlocks.. someone was going to try again.

I dug up the synchronous patch


https://lore.kernel.org/lkml/1579878449-10164-1-git-send-email-prs...@codeaurora.org/

but surprisingly wasn't able to reproduce the lockdep splat from

https://lore.kernel.org/lkml/f0388d99-84d7-453b-9b6b-eeff0e7be...@lca.pw/

even though I could hit it a few weeks ago.  I'm going to try to mess
with it later, but don't let me hold this up.


Re: [RFC][PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu online

2020-12-08 Thread Daniel Jordan
Alexey Klimov  writes:
> I also in doubts if we need cpuset_wait_for_hotplug() in 
> cpuhp_online_cpu_device()
> since an online uevent is sent there too.

We do need it there if we go with this fix.  Your reproducer hits the
same issue when it's changed to exercise smt/control instead of
cpuN/online.


Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

2020-12-07 Thread Daniel Jordan
Jason Gunthorpe  writes:
> On Fri, Dec 04, 2020 at 03:05:46PM -0500, Daniel Jordan wrote:
>> Well Alex can correct me, but I went digging and a comment from the
>> first type1 vfio commit says the iommu API didn't promise to unmap
>> subpages of previous mappings, so doing page at a time gave flexibility
>> at the cost of inefficiency.
>
> iommu restrictions are not related to with gup. vfio needs to get the
> page list from the page tables as efficiently as possible, then you
> break it up into what you want to feed into the IOMMU how the iommu
> wants.
>
> vfio must maintain a page list to call unpin_user_pages() anyhow, so

It does in some cases but not others, namely the expensive
VFIO_IOMMU_MAP_DMA/UNMAP_DMA path where the iommu page tables are used
to find the pfns when unpinning.

I don't see why vfio couldn't do as you say, though, and the worst case
memory overhead of using scatterlist to remember the pfns of a 300g VM
backed by huge but physically discontiguous pages is only a few meg, not
bad at all.
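
(Back of the envelope, assuming 2M huge pages and something like 32 bytes
per scatterlist entry, both of which are my assumptions:

	300G / 2M          =  153,600 entries
	153,600 * 32 bytes ~= 4.7M

even in the worst case where every huge page needs its own entry.)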

> it makes alot of sense to assemble the page list up front, then do the
> iommu, instead of trying to do both things page at a time.
>
> It would be smart to rebuild vfio to use scatter lists to store the
> page list and then break the sgl into pages for iommu
> configuration. SGLs will consume alot less memory for the usual case
> of THPs backing the VFIO registrations.
>
> ib_umem_get() has some example of how to code this, I've been thinking
> we could make this some common API, and it could be further optimized.

Agreed, great suggestions, both above and in the rest of your response.

>> Yesterday I tried optimizing vfio to skip gup calls for tail pages after
>> Matthew pointed out this same issue to me by coincidence last week.
>
> Please don't just hack up vfio like this.

Yeah, you've cured me of that idea.  I'll see where I get experimenting
with some of this stuff.


Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

2020-12-07 Thread Daniel Jordan
Pavel Tatashin  writes:

> On Fri, Dec 4, 2020 at 3:06 PM Daniel Jordan  
> wrote:
>>
>> Jason Gunthorpe  writes:
>>
>> > On Wed, Dec 02, 2020 at 08:34:32PM -0500, Pavel Tatashin wrote:
>> >> What I meant is the users of the interface do it incrementally not in
>> >> large chunks. For example:
>> >>
>> >> vfio_pin_pages_remote
>> >>vaddr_get_pfn
>> >> ret = pin_user_pages_remote(mm, vaddr, 1, flags |
>> >> FOLL_LONGTERM, page, NULL, NULL);
>> >> 1 -> pin only one pages at a time
>> >
>> > I don't know why vfio does this, it is why it so ridiculously slow at
>> > least.
>>
>> Well Alex can correct me, but I went digging and a comment from the
>> first type1 vfio commit says the iommu API didn't promise to unmap
>> subpages of previous mappings, so doing page at a time gave flexibility
>> at the cost of inefficiency.
>>
>> Then 166fd7d94afd allowed the iommu to use larger pages in vfio, but
>> vfio kept pinning pages at a time.  I couldn't find an explanation for
>> why that stayed the same.
>>
>> Yesterday I tried optimizing vfio to skip gup calls for tail pages after
>> Matthew pointed out this same issue to me by coincidence last week.
>> Currently debugging, but if there's a fundamental reason this won't work
>> on the vfio side, it'd be nice to know.
>
> Hi Daniel,
>
> I do not think there are any fundamental reasons why it won't work. I
> have also thinking increasing VFIO chunking for a different reason:
>
> If a client touches pages before doing a VFIO DMA map, those pages
> might be huge, and pinning a small page at a time and migrating a
> small page at a time can break-up the huge pages. So, it is not only
> inefficient to pin, but it can also inadvertently slow down the
> runtime.

Hi Pasha,

I see, and I'm curious, do you experience this case where a user has
touched the pages before doing a VFIO DMA map, and if so where?

The usual situation on my side is that the pages are faulted in during
pinning.

Daniel


Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

2020-12-04 Thread Daniel Jordan
Jason Gunthorpe  writes:

> On Wed, Dec 02, 2020 at 08:34:32PM -0500, Pavel Tatashin wrote:
>> What I meant is the users of the interface do it incrementally not in
>> large chunks. For example:
>> 
>> vfio_pin_pages_remote
>>vaddr_get_pfn
>> ret = pin_user_pages_remote(mm, vaddr, 1, flags |
>> FOLL_LONGTERM, page, NULL, NULL);
>> 1 -> pin only one pages at a time
>
> I don't know why vfio does this, it is why it so ridiculously slow at
> least.

Well Alex can correct me, but I went digging and a comment from the
first type1 vfio commit says the iommu API didn't promise to unmap
subpages of previous mappings, so doing page at a time gave flexibility
at the cost of inefficiency.

Then 166fd7d94afd allowed the iommu to use larger pages in vfio, but
vfio kept pinning pages at a time.  I couldn't find an explanation for
why that stayed the same.

Yesterday I tried optimizing vfio to skip gup calls for tail pages after
Matthew pointed out this same issue to me by coincidence last week.
Currently debugging, but if there's a fundamental reason this won't work
on the vfio side, it'd be nice to know.


[tip: sched/core] cpuset: fix race between hotplug work and later CPU offline

2020-11-20 Thread tip-bot2 for Daniel Jordan
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 406100f3da08066c00105165db8520bbc7694a36
Gitweb:
https://git.kernel.org/tip/406100f3da08066c00105165db8520bbc7694a36
Author:Daniel Jordan 
AuthorDate:Thu, 12 Nov 2020 12:17:11 -05:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 19 Nov 2020 11:25:45 +01:00

cpuset: fix race between hotplug work and later CPU offline

One of our machines keeled over trying to rebuild the scheduler domains.
Mainline produces the same splat:

  BUG: unable to handle page fault for address: 607f820054db
  CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
  Workqueue: events cpuset_hotplug_workfn
  RIP: build_sched_domains
  Call Trace:
   partition_sched_domains_locked
   rebuild_sched_domains_locked
   cpuset_hotplug_workfn

It happens with cgroup2 and exclusive cpusets only.  This reproducer
triggers it on an 8-cpu vm and works most effectively with no
preexisting child cgroups:

  cd $UNIFIED_ROOT
  mkdir cg1
  echo 4-7 > cg1/cpuset.cpus
  echo root > cg1/cpuset.cpus.partition

  # with smt/control reading 'on',
  echo off > /sys/devices/system/cpu/smt/control

RIP maps to

  sd->shared = *per_cpu_ptr(sdd->sds, sd_id);

from sd_init().  sd_id is calculated earlier in the same function:

  cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
  sd_id = cpumask_first(sched_domain_span(sd));

tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
value from per_cpu_ptr() above.

The problem is a race between cpuset_hotplug_workfn() and a later
offline of CPU N.  cpuset_hotplug_workfn() updates the effective masks
when N is still online, the offline clears N from cpu_sibling_map, and
then the worker uses the stale effective masks that still have N to
generate the scheduling domains, leading the worker to read
N's empty cpu_sibling_map in sd_init().
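
Roughly:

    cpuset_hotplug_workfn()                     CPU N offline
    -----------------------                     -------------
    update effective masks (N still online)
                                                clear N from cpu_sibling_map
    generate sched domains from the stale
      effective masks that still contain N
    sd_init() reads N's now-empty
      cpu_sibling_map -> page fault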

rebuild_sched_domains_locked() prevented the race during the cgroup2
cpuset series up until the Fixes commit changed its check.  Make the
check more robust so that it can detect an offline CPU in any exclusive
cpuset's effective mask, not just the top one.

Fixes: 0ccea8feb980 ("cpuset: Make generate_sched_domains() work with 
partition")
Signed-off-by: Daniel Jordan 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Tejun Heo 
Cc: sta...@vger.kernel.org
Link: 
https://lkml.kernel.org/r/20201112171711.639541-1-daniel.m.jor...@oracle.com
---
 kernel/cgroup/cpuset.c | 33 -
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 57b5b5d..53c70c4 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -983,25 +983,48 @@ partition_and_rebuild_sched_domains(int ndoms_new, 
cpumask_var_t doms_new[],
  */
 static void rebuild_sched_domains_locked(void)
 {
+   struct cgroup_subsys_state *pos_css;
struct sched_domain_attr *attr;
cpumask_var_t *doms;
+   struct cpuset *cs;
int ndoms;
 
lockdep_assert_cpus_held();
percpu_rwsem_assert_held(&cpuset_rwsem);
 
/*
-* We have raced with CPU hotplug. Don't do anything to avoid
+* If we have raced with CPU hotplug, return early to avoid
 * passing doms with offlined cpu to partition_sched_domains().
-* Anyways, hotplug work item will rebuild sched domains.
+* Anyways, cpuset_hotplug_workfn() will rebuild sched domains.
+*
+* With no CPUs in any subpartitions, top_cpuset's effective CPUs
+* should be the same as the active CPUs, so checking only top_cpuset
+* is enough to detect racing CPU offlines.
 */
if (!top_cpuset.nr_subparts_cpus &&
!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
return;
 
-   if (top_cpuset.nr_subparts_cpus &&
-  !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
-   return;
+   /*
+* With subpartition CPUs, however, the effective CPUs of a partition
+* root should be only a subset of the active CPUs.  Since a CPU in any
+* partition root could be offlined, all must be checked.
+*/
+   if (top_cpuset.nr_subparts_cpus) {
+   rcu_read_lock();
+   cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+   if (!is_partition_root(cs)) {
+   pos_css = css_rightmost_descendant(pos_css);
+   continue;
+   }
+   if (!cpumask_subset(cs->effective_cpus,
+   cpu_active_mask)) {
+   rcu_read_unlock();
+   return;
+   }
+   }
+   rcu_read_unlock();

[PATCH v2] cpuset: fix race between hotplug work and later CPU offline

2020-11-12 Thread Daniel Jordan
One of our machines keeled over trying to rebuild the scheduler domains.
Mainline produces the same splat:

  BUG: unable to handle page fault for address: 607f820054db
  CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
  Workqueue: events cpuset_hotplug_workfn
  RIP: build_sched_domains
  Call Trace:
   partition_sched_domains_locked
   rebuild_sched_domains_locked
   cpuset_hotplug_workfn

It happens with cgroup2 and exclusive cpusets only.  This reproducer
triggers it on an 8-cpu vm and works most effectively with no
preexisting child cgroups:

  cd $UNIFIED_ROOT
  mkdir cg1
  echo 4-7 > cg1/cpuset.cpus
  echo root > cg1/cpuset.cpus.partition

  # with smt/control reading 'on',
  echo off > /sys/devices/system/cpu/smt/control

RIP maps to

  sd->shared = *per_cpu_ptr(sdd->sds, sd_id);

from sd_init().  sd_id is calculated earlier in the same function:

  cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
  sd_id = cpumask_first(sched_domain_span(sd));

tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
value from per_cpu_ptr() above.

The problem is a race between cpuset_hotplug_workfn() and a later
offline of CPU N.  cpuset_hotplug_workfn() updates the effective masks
when N is still online, the offline clears N from cpu_sibling_map, and
then the worker uses the stale effective masks that still have N to
generate the scheduling domains, leading the worker to read
N's empty cpu_sibling_map in sd_init().

rebuild_sched_domains_locked() prevented the race during the cgroup2
cpuset series up until the Fixes commit changed its check.  Make the
check more robust so that it can detect an offline CPU in any exclusive
cpuset's effective mask, not just the top one.

Fixes: 0ccea8feb980 ("cpuset: Make generate_sched_domains() work with 
partition")
Signed-off-by: Daniel Jordan 
Cc: Johannes Weiner 
Cc: Li Zefan 
Cc: Peter Zijlstra 
Cc: Prateek Sood 
Cc: Tejun Heo 
Cc: Waiman Long 
Cc: cgro...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: sta...@vger.kernel.org
---

v2:
 - Made the comment more descriptive (Peter)

 kernel/cgroup/cpuset.c | 33 -
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 57b5b5d0a5fd..53c70c470a38 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -983,25 +983,48 @@ partition_and_rebuild_sched_domains(int ndoms_new, 
cpumask_var_t doms_new[],
  */
 static void rebuild_sched_domains_locked(void)
 {
+   struct cgroup_subsys_state *pos_css;
struct sched_domain_attr *attr;
cpumask_var_t *doms;
+   struct cpuset *cs;
int ndoms;
 
lockdep_assert_cpus_held();
percpu_rwsem_assert_held(&cpuset_rwsem);
 
/*
-* We have raced with CPU hotplug. Don't do anything to avoid
+* If we have raced with CPU hotplug, return early to avoid
 * passing doms with offlined cpu to partition_sched_domains().
-* Anyways, hotplug work item will rebuild sched domains.
+* Anyways, cpuset_hotplug_workfn() will rebuild sched domains.
+*
+* With no CPUs in any subpartitions, top_cpuset's effective CPUs
+* should be the same as the active CPUs, so checking only top_cpuset
+* is enough to detect racing CPU offlines.
 */
if (!top_cpuset.nr_subparts_cpus &&
!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
return;
 
-   if (top_cpuset.nr_subparts_cpus &&
-  !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
-   return;
+   /*
+* With subpartition CPUs, however, the effective CPUs of a partition
+* root should be only a subset of the active CPUs.  Since a CPU in any
+* partition root could be offlined, all must be checked.
+*/
+   if (top_cpuset.nr_subparts_cpus) {
+   rcu_read_lock();
+   cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+   if (!is_partition_root(cs)) {
+   pos_css = css_rightmost_descendant(pos_css);
+   continue;
+   }
+   if (!cpumask_subset(cs->effective_cpus,
+   cpu_active_mask)) {
+   rcu_read_unlock();
+   return;
+   }
+   }
+   rcu_read_unlock();
+   }
 
/* Generate domain masks and attrs */
ndoms = generate_sched_domains(&doms, &attr);

base-commit: 3d5e28bff7ad55aea081c1af516cc1c94a5eca7d
-- 
2.29.2



Re: [PATCH] cpuset: fix race between hotplug work and later CPU offline

2020-11-10 Thread Daniel Jordan
Peter Zijlstra  writes:
> On Thu, Oct 29, 2020 at 02:18:45PM -0400, Daniel Jordan wrote:
>> rebuild_sched_domains_locked() prevented the race during the cgroup2
>> cpuset series up until the Fixes commit changed its check.  Make the
>> check more robust so that it can detect an offline CPU in any exclusive
>> cpuset's effective mask, not just the top one.
>
> *groan*, what a mess...

Ah, the joys of cpu hotplug!

>> I think the right thing to do long-term is make the hotplug work
>> synchronous, fixing the lockdep splats of past attempts, and then take
>> these checks out of rebuild_sched_domains_locked, but this fixes the
>> immediate issue and is small enough for stable.  Open to suggestions.
>> 
>> Prateek, are you planning on picking up your patches again?
>
> Yeah, that might help, but those deadlocks were nasty iirc :/

It might end up being too invasive to be worth it, but I'm being
optimistic for now.

>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 57b5b5d0a5fd..ac3124010b2a 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -983,8 +983,10 @@ partition_and_rebuild_sched_domains(int ndoms_new, 
>> cpumask_var_t doms_new[],
>>   */
>>  static void rebuild_sched_domains_locked(void)
>>  {
>> +struct cgroup_subsys_state *pos_css;
>>  struct sched_domain_attr *attr;
>>  cpumask_var_t *doms;
>> +struct cpuset *cs;
>>  int ndoms;
>>  
>>  lockdep_assert_cpus_held();
>> @@ -999,9 +1001,21 @@ static void rebuild_sched_domains_locked(void)
>>  !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
>>  return;
>
> So you argued above that effective_cpus was stale, I suppose the above
> one works because its an equality test instead of a subset?

Yep, fortunately enough.

> Does that wants a comment?

Ok, I'll change the comments to this absent other ideas.

/*
 * If we have raced with CPU hotplug, return early to avoid
 * passing doms with offlined cpu to partition_sched_domains().
 * Anyways, cpuset_hotplug_workfn() will rebuild sched domains.
 *
 * With no CPUs in any subpartitions, top_cpuset's effective CPUs
 * should be the same as the active CPUs, so checking only top_cpuset
 * is enough to detect racing CPU offlines.
 */
if (!top_cpuset.nr_subparts_cpus &&
!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
return;

/*
 * With subpartition CPUs, however, the effective CPUs of a partition
 * root should be only a subset of the active CPUs.  Since a CPU in any
 * partition root could be offlined, all must be checked.
 */
if (top_cpuset.nr_subparts_cpus) {
rcu_read_lock();
...


Thanks for looking.


[PATCH] cpuset: fix race between hotplug work and later CPU offline

2020-10-29 Thread Daniel Jordan
One of our machines keeled over trying to rebuild the scheduler domains.
Mainline produces the same splat:

  BUG: unable to handle page fault for address: 607f820054db
  CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
  Workqueue: events cpuset_hotplug_workfn
  RIP: build_sched_domains
  Call Trace:
   partition_sched_domains_locked
   rebuild_sched_domains_locked
   cpuset_hotplug_workfn

It happens with cgroup2 and exclusive cpusets only.  This reproducer
triggers it on an 8-cpu vm and works most effectively with no
preexisting child cgroups:

  cd $UNIFIED_ROOT
  mkdir cg1
  echo 4-7 > cg1/cpuset.cpus
  echo root > cg1/cpuset.cpus.partition

  # with smt/control reading 'on',
  echo off > /sys/devices/system/cpu/smt/control

RIP maps to

  sd->shared = *per_cpu_ptr(sdd->sds, sd_id);

from sd_init().  sd_id is calculated earlier in the same function:

  cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
  sd_id = cpumask_first(sched_domain_span(sd));

tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
value from per_cpu_ptr() above.

The problem is a race between cpuset_hotplug_workfn() and a later
offline of CPU N.  cpuset_hotplug_workfn() updates the effective masks
when N is still online, the offline clears N from cpu_sibling_map, and
then the worker uses the stale effective masks that still have N to
generate the scheduling domains, leading the worker to read
N's empty cpu_sibling_map in sd_init().

rebuild_sched_domains_locked() prevented the race during the cgroup2
cpuset series up until the Fixes commit changed its check.  Make the
check more robust so that it can detect an offline CPU in any exclusive
cpuset's effective mask, not just the top one.

Fixes: 0ccea8feb980 ("cpuset: Make generate_sched_domains() work with 
partition")
Signed-off-by: Daniel Jordan 
Cc: Johannes Weiner 
Cc: Li Zefan 
Cc: Peter Zijlstra 
Cc: Prateek Sood 
Cc: Tejun Heo 
Cc: Waiman Long 
Cc: cgro...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: sta...@vger.kernel.org
---

I think the right thing to do long-term is make the hotplug work
synchronous, fixing the lockdep splats of past attempts, and then take
these checks out of rebuild_sched_domains_locked, but this fixes the
immediate issue and is small enough for stable.  Open to suggestions.

Prateek, are you planning on picking up your patches again?

 kernel/cgroup/cpuset.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 57b5b5d0a5fd..ac3124010b2a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -983,8 +983,10 @@ partition_and_rebuild_sched_domains(int ndoms_new, 
cpumask_var_t doms_new[],
  */
 static void rebuild_sched_domains_locked(void)
 {
+   struct cgroup_subsys_state *pos_css;
struct sched_domain_attr *attr;
cpumask_var_t *doms;
+   struct cpuset *cs;
int ndoms;
 
lockdep_assert_cpus_held();
@@ -999,9 +1001,21 @@ static void rebuild_sched_domains_locked(void)
!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
return;
 
-   if (top_cpuset.nr_subparts_cpus &&
-  !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
-   return;
+   if (top_cpuset.nr_subparts_cpus) {
+   rcu_read_lock();
+   cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+   if (!is_partition_root(cs)) {
+   pos_css = css_rightmost_descendant(pos_css);
+   continue;
+   }
+   if (!cpumask_subset(cs->effective_cpus,
+   cpu_active_mask)) {
+   rcu_read_unlock();
+   return;
+   }
+   }
+   rcu_read_unlock();
+   }
 
/* Generate domain masks and attrs */
ndoms = generate_sched_domains(&doms, &attr);

base-commit: 23859ae44402f4d935b9ee548135dd1e65e2cbf4
-- 
2.29.0



Re: [PATCH v2] Remove __init from padata_do_multithreaded and padata_mt_helper.

2020-10-27 Thread Daniel Jordan
On 10/27/20 12:46 AM, Nico Pache wrote:
> On Wed, Jul 08, 2020 at 03:51:40PM -0400, Daniel Jordan wrote:
> > (I was away for a while)
> > 
> > On Thu, Jul 02, 2020 at 11:55:48AM -0400, Nico Pache wrote:
> > > Allow padata_do_multithreaded function to be called after bootstrap.
> > 
> > The functions are __init because they're currently only needed during boot, 
> > and
> > using __init allows the text to be freed once it's over, saving some memory.
> > 
> > So this change, in isolation, doesn't make sense.  If there were an 
> > enhancement
> > you were thinking of making, this patch could then be bundled with it so the
> > change is made only when it's used.
> > 
> > However, there's still work that needs to be merged before
> > padata_do_multithreaded can be called after boot.  See the parts about 
> > priority
> > adjustments (MAX_NICE/renicing) and concurrency limits in this branch
> > 
> >   
> > https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.5
> > 
> > and the ktask discussions from linux-mm/lkml where concerns about these 
> > issues
> > were raised.  I plan to post these parts fairly soon and can include you if 
> > you
> > want.
>
> I really like the speed benefits I've been able to achieve by using your
> padata multithreaded interface in the branch you linked me to. Do you
> still have plans on moving forward with this upstream?

Yes, I'm still planning to push these patches upstream, but it's going to take
some time with all the prerequisites.  I'm working on remote charging in the
CPU controller now, which is the biggest unfinished task.  A little background
on that here:

https://lore.kernel.org/linux-mm/20200219220859.gf54...@cmpxchg.org/


[PATCH] module: statically initialize init section freeing data

2020-10-08 Thread Daniel Jordan
Corentin hit the following workqueue warning when running with
CRYPTO_MANAGER_EXTRA_TESTS:

  WARNING: CPU: 2 PID: 147 at kernel/workqueue.c:1473 __queue_work+0x3b8/0x3d0
  Modules linked in: ghash_generic
  CPU: 2 PID: 147 Comm: modprobe Not tainted
  5.6.0-rc1-next-20200214-00068-g166c9264f0b1-dirty #545
  Hardware name: Pine H64 model A (DT)
  pc : __queue_work+0x3b8/0x3d0
  Call trace:
   __queue_work+0x3b8/0x3d0
   queue_work_on+0x6c/0x90
   do_init_module+0x188/0x1f0
   load_module+0x1d00/0x22b0

I wasn't able to reproduce on x86 or rpi 3b+.

This is

  WARN_ON(!list_empty(&work->entry))

from __queue_work(), and it happens because the init_free_wq work item
isn't initialized in time for a crypto test that requests the gcm
module.  Some crypto tests were recently moved earlier in boot as
explained in commit c4741b230597 ("crypto: run initcalls for generic
implementations earlier"), which went into mainline less than two weeks
before the Fixes commit.

Avoid the warning by statically initializing init_free_wq and the
corresponding llist.

Link: https://lore.kernel.org/lkml/20200217204803.GA13479@Red/
Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
Reported-by: Corentin Labbe 
Tested-by: Corentin Labbe 
Tested-on: sun50i-h6-pine-h64
Tested-on: imx8mn-ddr4-evk
Tested-on: sun50i-a64-bananapi-m64
Signed-off-by: Daniel Jordan 
---
 kernel/module.c | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 1c5cff34d9f2..8486123ffd7a 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -91,8 +91,9 @@ EXPORT_SYMBOL_GPL(module_mutex);
 static LIST_HEAD(modules);
 
 /* Work queue for freeing init sections in success case */
-static struct work_struct init_free_wq;
-static struct llist_head init_free_list;
+static void do_free_init(struct work_struct *w);
+static DECLARE_WORK(init_free_wq, do_free_init);
+static LLIST_HEAD(init_free_list);
 
 #ifdef CONFIG_MODULES_TREE_LOOKUP
 
@@ -3579,14 +3580,6 @@ static void do_free_init(struct work_struct *w)
}
 }
 
-static int __init modules_wq_init(void)
-{
-   INIT_WORK(&init_free_wq, do_free_init);
-   init_llist_head(&init_free_list);
-   return 0;
-}
-module_init(modules_wq_init);
-
 /*
  * This is where the real work happens.
  *

base-commit: c85fb28b6f999db9928b841f63f1beeb3074eeca
-- 
2.28.0



Re: WARNING: at kernel/workqueue.c:1473 __queue_work+0x3b8/0x3d0

2020-10-08 Thread Daniel Jordan
On Wed, Oct 07, 2020 at 09:41:17PM +0200, Corentin Labbe wrote:
> I have added CONFIG_FTRACE=y and your second patch.
> The boot log can be seen at http://kernel.montjoie.ovh/108789.log
> 
> But it seems the latest dump_stack addition flood a bit.

Heh, sorry for making it spew, there wasn't such a flood when I tried.  Your
output is sufficiently incriminating, so I'll go post the fix now.

> I have started to read ftrace documentation, but if you have a quick what to 
> do in /sys/kernel/debug/tracing, it will be helpfull.

Sure, you can view the trace in /sys/kernel/debug/tracing/trace and
kernel-parameters.txt has the boot options documented.


Re: WARNING: at kernel/workqueue.c:1473 __queue_work+0x3b8/0x3d0

2020-10-05 Thread Daniel Jordan
On Thu, Oct 01, 2020 at 07:50:22PM +0200, Corentin Labbe wrote:
> On Tue, Mar 03, 2020 at 04:30:17PM -0500, Daniel Jordan wrote:
> > Barring other ideas, Corentin, would you be willing to boot with
> > 
> > trace_event=initcall:*,module:* trace_options=stacktrace
> > 
> > and
> > 
> > diff --git a/kernel/module.c b/kernel/module.c
> > index 33569a01d6e1..393be6979a27 100644
> > --- a/kernel/module.c
> > +++ b/kernel/module.c
> > @@ -3604,8 +3604,11 @@ static noinline int do_init_module(struct module 
> > *mod)
> >  * be cleaned up needs to sync with the queued work - ie
> >  * rcu_barrier()
> >  */
> > -   if (llist_add(&freeinit->node, &init_free_list))
> > +   if (llist_add(&freeinit->node, &init_free_list)) {
> > +   pr_warn("%s: schedule_work for mod=%s\n", __func__, mod->name);
> > +   dump_stack();
> > schedule_work(&init_free_wq);
> > +   }
> >  
> > mutex_unlock(&module_mutex);
> > wake_up_all(&module_wq);
> > 
> > but not my earlier fix and share the dmesg and ftrace output to see if the
> > theory holds?
> > 
> > Also, could you attach your config?  Curious now what your crypto options 
> > look
> > like after fiddling with some of them today while trying and failing to see
> > this on x86.
> > 
> > thanks,
> > Daniel
> 
> Hello
> 
> Sorry for the very delayed answer.
> 
> I fail to reproduce it on x86 (qemu and  real hw) and arm.
> It seems to only happen on arm64.

Thanks for the config and dmesg, but there's no ftrace.  I see it's not
configured in your kernel, so could you boot with my earlier debug patch plus
this one and the kernel argument initcall_debug instead?

I'm trying to see whether it really is a request module call from the crypto
tests that's triggering this warning.  Preeetty likely that's what's happening,
but want to be sure since I can't reproduce this.  Then I can post the fix.

diff --git a/crypto/algapi.c b/crypto/algapi.c
index fdabf2675b63..0667c6b4588e 100644
--- a/crypto/algapi.c
+++ b/crypto/algapi.c
@@ -393,6 +393,10 @@ static void crypto_wait_for_test(struct crypto_larval 
*larval)
 {
int err;
 
+   pr_warn("%s: cra_name %s cra_driver_name %s\n", __func__,
+   larval->adult->cra_name, larval->adult->cra_driver_name);
+   dump_stack();
+
err = crypto_probing_notify(CRYPTO_MSG_ALG_REGISTER, larval->adult);
if (err != NOTIFY_STOP) {
if (WARN_ON(err != NOTIFY_DONE))
diff --git a/kernel/kmod.c b/kernel/kmod.c
index 3cd075ce2a1e..46c4645be763 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -166,6 +166,8 @@ int __request_module(bool wait, const char *fmt, ...)
}
 
trace_module_request(module_name, wait, _RET_IP_);
+   pr_warn("%s: %s\n", __func__, module_name);
+   dump_stack();
 
ret = call_modprobe(module_name, wait ? UMH_WAIT_PROC : UMH_WAIT_EXEC);
 


Re: WARNING: at kernel/workqueue.c:1473 __queue_work+0x3b8/0x3d0

2020-09-30 Thread Daniel Jordan
On Fri, Sep 25, 2020 at 08:12:03PM +0200, Corentin Labbe wrote:
> On Tue, Mar 03, 2020 at 04:31:11PM -0500, Daniel Jordan wrote:
> > On Tue, Mar 03, 2020 at 08:48:19AM +0100, Corentin Labbe wrote:
> > > The patch fix the issue. Thanks!
> > 
> > Thanks for trying it!
> > 
> > > So you could add:
> > > Reported-by: Corentin Labbe 
> > > Tested-by: Corentin Labbe 
> > > Tested-on: sun50i-h6-pine-h64
> > > Tested-on: imx8mn-ddr4-evk
> > > Tested-on: sun50i-a64-bananapi-m64
> > 
> > I definitely will if the patch turns out to be the right fix.
> > 
> > thanks,
> > Daniel
> 
> Hello
> 
> I forgot about this problem since the patch is in my branch since.
> But a co-worker hit this problem recently and without this patch my CI still 
> have it.

Hi,

Sure, I'm happy to help get a fix merged, but let's nail down what the problem
is first.  It'd be useful to have the things requested here:

https://lore.kernel.org/linux-crypto/20200303213017.tanczhqd3nhpe...@ca-dmjordan1.us.oracle.com/

thanks,
Daniel


Re: [PATCH v19 00/20] per memcg lru_lock

2020-09-24 Thread Daniel Jordan
On Thu, Sep 24, 2020 at 11:28:15AM +0800, Alex Shi wrote:
> The new version rebased on v5.9-rc6 with line by line review by Hugh Dickins.

These thpscale numbers are from a proto-v19, which now seems to be gone from
Alex Shi's github.  Once again, this is a regression test where memcg is
enabled but not used.  There's a reasonable amount of isolations from
compaction as expected, as well as reclaim to a lesser extent.

I confined the runs to one node of a two-socket broadwell server, the same
machine as my other results.  thpscale's total_size argument is about 80% of
the node's memory, its madvise option was enabled, and the system's thp
settings are enabled=always and defrag=madvise.

Both base and lru kernels show huge variance run to run, which is surprising
because I was expecting a microbenchmark like this to be more stable.  There
seems to be an overall decrease in mean fault latency with the lru changes, but
it's in the noise, so it's hard to say how much of an improvement it really is.

I only ran up to four threads so these would be ready before I left.  I'll be
away for a few days.


thpscale Fault Latencies
                                  thp                  thp                  thp                  thp
                                base1                base2                 lru1                 lru2
Min       fault-base-1        0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
Min       fault-base-3        0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
Min       fault-base-4        0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
Min       fault-huge-1      214.00 (   0.00%)    227.00 (  -6.07%)    235.00 (  -9.81%)    240.00 ( -12.15%)
Min       fault-huge-3      321.00 (   0.00%)    366.00 ( -14.02%)    323.00 (  -0.62%)    407.00 ( -26.79%)
Min       fault-huge-4      441.00 (   0.00%)    401.00 (   9.07%)    525.00 ( -19.05%)    434.00 (   1.59%)
Min       fault-both-1      214.00 (   0.00%)    227.00 (  -6.07%)    235.00 (  -9.81%)    240.00 ( -12.15%)
Min       fault-both-3      321.00 (   0.00%)    366.00 ( -14.02%)    323.00 (  -0.62%)    407.00 ( -26.79%)
Min       fault-both-4      441.00 (   0.00%)    401.00 (   9.07%)    525.00 ( -19.05%)    434.00 (   1.59%)
Amean     fault-base-1        0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
Amean     fault-base-3        0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
Amean     fault-base-4        0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
Amean     fault-huge-1     1549.55 (   0.00%)   1667.13 *  -7.59%*   1361.50 *  12.14%*   1458.10 *   5.90%*
Amean     fault-huge-3     2582.29 (   0.00%)   2556.45 (   1.00%)   2756.65 *  -6.75%*   2157.29 *  16.46%*
Amean     fault-huge-4     2352.21 (   0.00%)   2709.37 * -15.18%*   2323.89 (   1.20%)   1888.94 *  19.69%*
Amean     fault-both-1     1549.55 (   0.00%)   1667.13 *  -7.59%*   1361.50 *  12.14%*   1458.10 *   5.90%*
Amean     fault-both-3     2582.29 (   0.00%)   2556.45 (   1.00%)   2756.65 *  -6.75%*   2157.29 *  16.46%*
Amean     fault-both-4     2352.21 (   0.00%)   2709.37 * -15.18%*   2323.89 (   1.20%)   1888.94 *  19.69%*
Stddev    fault-base-1        0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
Stddev    fault-base-3        0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
Stddev    fault-base-4        0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
Stddev    fault-huge-1     2011.48 (   0.00%)   3765.64 ( -87.21%)   3061.45 ( -52.20%)   1758.56 (  12.57%)
Stddev    fault-huge-3    11153.49 (   0.00%)   8339.83 (  25.23%)   7017.28 (  37.08%)   3976.86 (  64.34%)
Stddev    fault-huge-4     8817.67 (   0.00%)  16241.48 ( -84.19%)  11595.28 ( -31.50%)   6631.47 (  24.79%)
Stddev    fault-both-1     2011.48 (   0.00%)   3765.64 ( -87.21%)   3061.45 ( -52.20%)   1758.56 (  12.57%)
Stddev    fault-both-3    11153.49 (   0.00%)   8339.83 (  25.23%)   7017.28 (  37.08%)   3976.86 (  64.34%)
Stddev    fault-both-4     8817.67 (   0.00%)  16241.48 ( -84.19%)  11595.28 ( -31.50%)   6631.47 (  24.79%)
CoeffVar  fault-base-1        0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
CoeffVar  fault-base-3        0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
CoeffVar  fault-base-4        0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
CoeffVar  fault-huge-1      129.81 (   0.00%)    225.88 ( -74.00%)    224.86 ( -73.22%)    120.61 (   7.09%)
CoeffVar  fault-huge-3  

Re: [PATCH] mm/migrate: correct thp migration stats.

2020-09-17 Thread Daniel Jordan
On Thu, Sep 17, 2020 at 04:27:29PM -0400, Zi Yan wrote:
> From: Zi Yan 
> 
> PageTransHuge returns true for both thp and hugetlb, so thp stats was
> counting both thp and hugetlb migrations. Exclude hugetlb migration by
> setting is_thp variable right.

Yeah, shoot.

> Fixes: 1a5bae25e3cf ("mm/vmstat: add events for THP migration without split")
> Signed-off-by: Zi Yan 

Reviewed-by: Daniel Jordan 

If you wanted, you could also do this.

diff --git a/mm/migrate.c b/mm/migrate.c
index d1ad964165e5..6bc9559afc70 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1471,7 +1471,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 * we encounter them after the rest of the list
 * is processed.
 */
-   if (PageTransHuge(page) && !PageHuge(page)) {
+   if (is_thp) {
lock_page(page);
rc = split_huge_page_to_list(page, from);
unlock_page(page);
@@ -1480,8 +1480,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
nr_thp_split++;
goto retry;
}
-   }
-   if (is_thp) {
+
nr_thp_failed++;
nr_failed += nr_subpages;
goto out;


Re: [PATCH v18 00/32] per memcg lru_lock: reviews

2020-09-17 Thread Daniel Jordan
On Thu, Sep 17, 2020 at 08:39:34AM -0700, Alexander Duyck wrote:
> On Thu, Sep 17, 2020 at 7:26 AM Daniel Jordan
>  wrote:
> >
> > On Thu, Sep 17, 2020 at 10:37:45AM +0800, Alex Shi wrote:
> > > 在 2020/9/16 上午12:58, Daniel Jordan 写道:
> > > > On Tue, Sep 15, 2020 at 01:21:56AM -0700, Hugh Dickins wrote:
> > > >> On Sun, 13 Sep 2020, Alex Shi wrote:
> > > >>> Uh, I updated the testing with some new results here:
> > > >>> https://lkml.org/lkml/2020/8/26/212
> > > >> Right, I missed that, that's better, thanks.  Any other test results?
> > > > Alex, you were doing some will-it-scale runs earlier.  Are you planning 
> > > > to do
> > > > more of those?  Otherwise I can add them in.
> > >
> > > Hi Daniel,
> > >
> > > Does compaction perf scalable, like thpscale, I except they could get 
> > > some benefit.
> >
> > Yep, I plan to stress compaction.  Reclaim as well.
> >
> > I should have said which Alex I meant.  I was asking Alex Duyck since he'd 
> > done
> > some will-it-scale runs.
> 
> I probably won't be able to do any will-it-scale runs any time soon.
> If I recall I ran them for this latest v18 patch set and didn't see
> any regressions like I did with the previous set. However the system I
> was using is tied up for other purposes and it may be awhile before I
> can free it up to look into this again.

Ok, sure.  I hadn't seen the regressions were taken care of, that's good to
hear.  Might still add them to my testing for v19 and beyond, we'll see.


Re: [PATCH v18 00/32] per memcg lru_lock: reviews

2020-09-17 Thread Daniel Jordan
On Thu, Sep 17, 2020 at 10:37:45AM +0800, Alex Shi wrote:
> 在 2020/9/16 上午12:58, Daniel Jordan 写道:
> > On Tue, Sep 15, 2020 at 01:21:56AM -0700, Hugh Dickins wrote:
> >> On Sun, 13 Sep 2020, Alex Shi wrote:
> >>> Uh, I updated the testing with some new results here:
> >>> https://lkml.org/lkml/2020/8/26/212
> >> Right, I missed that, that's better, thanks.  Any other test results?
> > Alex, you were doing some will-it-scale runs earlier.  Are you planning to 
> > do
> > more of those?  Otherwise I can add them in.
> 
> Hi Daniel,
> 
> Does compaction perf scalable, like thpscale, I except they could get some 
> benefit.

Yep, I plan to stress compaction.  Reclaim as well.

I should have said which Alex I meant.  I was asking Alex Duyck since he'd done
some will-it-scale runs.


Re: [PATCH v18 00/32] per memcg lru_lock: reviews

2020-09-15 Thread Daniel Jordan
                 5.9-rc2      5.9-rc2-lru
Min   1 8.96 (   0.00%)9.04 (  -0.89%)
Min   4 4.63 (   0.00%)4.74 (  -2.38%)
Min   7 3.34 (   0.00%)3.38 (  -1.20%)
Min   122.65 (   0.00%)2.95 ( -11.32%)
Min   183.54 (   0.00%)3.80 (  -7.34%)
Min   203.74 (   0.00%)4.02 (  -7.49%)
Amean 111.00 (   0.00%)   11.11 (  -0.98%)
Amean 4 4.92 (   0.00%)4.95 (  -0.59%)
Amean 7 3.65 (   0.00%)3.65 (  -0.16%)
Amean 123.29 (   0.00%)3.32 (  -0.89%)
Amean 184.20 (   0.00%)5.22 * -24.39%*
Amean 206.02 (   0.00%)9.14 * -51.98%*
Stddev1 3.33 (   0.00%)3.45 (  -3.40%)
Stddev4 0.23 (   0.00%)0.21 (   7.89%)
Stddev7 0.25 (   0.00%)0.22 (   9.87%)
Stddev120.35 (   0.00%)0.19 (  45.09%)
Stddev180.38 (   0.00%)1.75 (-354.74%)
Stddev202.93 (   0.00%)4.73 ( -61.72%)
CoeffVar  130.30 (   0.00%)   31.02 (  -2.40%)
CoeffVar  4 4.63 (   0.00%)4.24 (   8.43%)
CoeffVar  7 6.77 (   0.00%)6.10 (  10.02%)
CoeffVar  12   10.74 (   0.00%)5.85 (  45.57%)
CoeffVar  189.15 (   0.00%)   33.45 (-265.58%)
CoeffVar  20   48.64 (   0.00%)   51.75 (  -6.41%)
Max   117.01 (   0.00%)   17.36 (  -2.06%)
Max   4 5.33 (   0.00%)5.40 (  -1.31%)
Max   7 4.14 (   0.00%)4.18 (  -0.97%)
Max   123.89 (   0.00%)3.67 (   5.66%)
Max   184.82 (   0.00%)8.64 ( -79.25%)
Max   20   11.09 (   0.00%)   19.26 ( -73.67%)
BAmean-50 1 9.12 (   0.00%)9.16 (  -0.49%)
BAmean-50 4 4.73 (   0.00%)4.80 (  -1.55%)
BAmean-50 7 3.46 (   0.00%)3.48 (  -0.58%)
BAmean-50 123.02 (   0.00%)3.18 (  -5.24%)
BAmean-50 183.90 (   0.00%)4.08 (  -4.52%)
BAmean-50 204.02 (   0.00%)5.90 ( -46.56%)
BAmean-95 110.45 (   0.00%)   10.54 (  -0.82%)
BAmean-95 4 4.88 (   0.00%)4.91 (  -0.52%)
BAmean-95 7 3.60 (   0.00%)3.60 (  -0.08%)
BAmean-95 123.23 (   0.00%)3.28 (  -1.60%)
BAmean-95 184.14 (   0.00%)4.91 ( -18.58%)
BAmean-95 205.56 (   0.00%)8.22 ( -48.04%)
BAmean-99 110.45 (   0.00%)   10.54 (  -0.82%)
BAmean-99 4 4.88 (   0.00%)4.91 (  -0.52%)
BAmean-99 7 3.60 (   0.00%)3.60 (  -0.08%)
BAmean-99 123.23 (   0.00%)3.28 (  -1.60%)
BAmean-99 184.14 (   0.00%)4.91 ( -18.58%)
BAmean-99 205.56 (   0.00%)8.22 ( -48.04%)


docker-ized readtwice microbenchmark


This is Alex's modified readtwice case.  Needed a few fixes, and I made it into
a script.  Updated version attached.

Same machine, three runs per kernel, 40 containers per test.  This is average
MB/s over all containers.

5.9-rc2  5.9-rc2-lru
---  ---
220.5 (3.3)  356.9 (0.5)

That's a 62% improvement.
FROM centos:8
MAINTAINER Alexs 
#WORKDIR /vm-scalability 
#RUN yum update -y && yum groupinstall "Development Tools" -y && yum clean all && \
#examples https://www.linuxtechi.com/build-docker-container-images-with-dockerfile/
RUN yum install git xfsprogs patch make gcc -y && yum clean all && \
git clone https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/ && \
cd vm-scalability && make usemem

COPY readtwice.patch /vm-scalability/

RUN cd vm-scalability && patch -p1 < readtwice.patch
#!/usr/bin/env bash
#
# Originally by Alex Shi 
# Changes from Daniel Jordan 

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
TAG='lrulock'
runtime=300

nr_cont=$(nproc)

cd "$SCRIPT_DIR"

echo -e "starting $nr_cont containers\n"

pids=()

sudo docker build -t "$TAG" .

nr_running_cont=$(sudo docker ps | sed '1 d' | wc -l)
if (( nr_running_cont != 0 )); then
echo "error: $nr_running_cont containers already running"
exit 1
fi

# start some testing containers
for ((i=0; i < nr_cont; i++)); do
sudo docker run --privileged=true --rm "$TAG" bash -c "sleep infinity" &
done

nr_running_cont=$(sudo docker ps | sed '1 d' | wc -l)
until (( nr_running_cont == nr_cont )); do
sleep .5
nr_running_cont=$(sudo docker ps | sed '1 d' | wc -l)
done

# do testing env setup
for i in `sudo docker ps | sed '1 d' | awk '{print $1}'`; do
sudo docker exec --privileged=true -t $i \
bash -c "cd /vm-scalability/; bash ./case-lru-file-readtwice m" &

[PATCH] padata: fix possible padata_works_lock deadlock

2020-09-02 Thread Daniel Jordan
syzbot reports,

  WARNING: inconsistent lock state
  5.9.0-rc2-syzkaller #0 Not tainted
  
  inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
  syz-executor.0/26715 takes:
  (padata_works_lock){+.?.}-{2:2}, at: padata_do_parallel kernel/padata.c:220
  {IN-SOFTIRQ-W} state was registered at:
spin_lock include/linux/spinlock.h:354 [inline]
padata_do_parallel kernel/padata.c:220
...
__do_softirq kernel/softirq.c:298
...
sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1091
asm_sysvec_apic_timer_interrupt arch/x86/include/asm/idtentry.h:581

   Possible unsafe locking scenario:

       CPU0
       ----
  lock(padata_works_lock);
  <Interrupt>
    lock(padata_works_lock);

padata_do_parallel() takes padata_works_lock with softirqs enabled, so a
deadlock is possible if, on the same CPU, the lock is acquired in
process context and then softirq handling done in an interrupt leads to
the same path.

Fix by leaving softirqs disabled while do_parallel holds
padata_works_lock.

Reported-by: syzbot+f4b9f49e38e25eb4e...@syzkaller.appspotmail.com
Fixes: 4611ce2246889 ("padata: allocate work structures for parallel jobs from 
a pool")
Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-cry...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 kernel/padata.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index 16cb894dc272..d4d3ba6e1728 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -215,12 +215,13 @@ int padata_do_parallel(struct padata_shell *ps,
padata->pd = pd;
padata->cb_cpu = *cb_cpu;
 
-   rcu_read_unlock_bh();
-
spin_lock(&padata_works_lock);
padata->seq_nr = ++pd->seq_nr;
pw = padata_work_alloc();
spin_unlock(&padata_works_lock);
+
+   rcu_read_unlock_bh();
+
if (pw) {
padata_work_init(pw, padata_parallel_worker, padata, 0);
queue_work(pinst->parallel_wq, &pw->pw_work);

base-commit: 9c7d619be5a002ea29c172df5e3c1227c22cbb41
-- 
2.28.0



[PATCH v2] padata: add another maintainer and another list

2020-08-27 Thread Daniel Jordan
At Steffen's request, I'll help maintain padata for the foreseeable
future.

While at it, let's have patches go to lkml too since the code is now
used outside of crypto.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-cry...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3b186ade3597..06a1b8a6d953 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13024,7 +13024,9 @@ F:  lib/packing.c
 
 PADATA PARALLEL EXECUTION MECHANISM
 M: Steffen Klassert 
+M: Daniel Jordan 
 L: linux-cry...@vger.kernel.org
+L: linux-kernel@vger.kernel.org
 S: Maintained
 F: Documentation/core-api/padata.rst
 F: include/linux/padata.h
-- 
2.28.0



Re: [PATCH] padata: add a reviewer

2020-08-27 Thread Daniel Jordan
On Thu, Aug 27, 2020 at 08:44:09AM +0200, Steffen Klassert wrote:
> Please also consider to add yourself as one of the maintainers.

Ok, sure!  I'll take you up on that.


Re: [PATCH v18 00/32] per memcg lru_lock

2020-08-27 Thread Daniel Jordan
On Wed, Aug 26, 2020 at 04:59:28PM +0800, Alex Shi wrote:
> I clean up my testing and make it reproducable by a Dockerfile and a case 
> patch which
> attached. 

Ok, I'll give that a shot once I've taken care of sysbench.

> >>> Even better would be a description of the problem you're having in 
> >>> production
> >>> with lru_lock.  We might be able to create at least a simulation of it to 
> >>> show
> >>> what the expected improvement of your real workload is.
> >>
> >> we are using thousands memcgs in a machine, but as a simulation, I guess 
> >> above case
> >> could be helpful to show the problem.
> > 
> > Using thousands of memcgs to do what?  Any particulars about the type of
> > workload?  Surely it's more complicated than page cache reads :)
> 
> Yes, the workload are quit different on different business, some use cpu a
> lot, some use memory a lot, and some are may mixed.

That's pretty vague, but I don't suppose I could do much better describing what
all runs on our systems  :-/

I went back to your v1 post to see what motivated you originally, and you had
some results from aim9 but nothing about where this reared its head in the
first place.  How did you discover the bottleneck?  I'm just curious about how
lru_lock hurts in practice.

> > Neither kernel compile nor git checkout in the root cgroup changed much, 
> > just
> > 0.31% slower on elapsed time for the compile, so no significant regressions
> > there.  Now for sysbench again.

Still working on getting repeatable sysbench runs, no luck so far.  The numbers
have stayed fairly consistent with your series but vary a lot on the base
kernel, not sure why yet.


[PATCH] padata: add a reviewer

2020-08-26 Thread Daniel Jordan
I volunteer to review padata changes for the foreseeable future.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-cry...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3b186ade3597..1481d47cfd75 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13024,6 +13024,7 @@ F:  lib/packing.c
 
 PADATA PARALLEL EXECUTION MECHANISM
 M: Steffen Klassert 
+R: Daniel Jordan 
 L: linux-cry...@vger.kernel.org
 S: Maintained
 F: Documentation/core-api/padata.rst
-- 
2.27.0



Re: [PATCH v18 00/32] per memcg lru_lock

2020-08-25 Thread Daniel Jordan
On Tue, Aug 25, 2020 at 11:26:58AM +0800, Alex Shi wrote:
> 在 2020/8/25 上午9:56, Daniel Jordan 写道:
> > Alex, do you have a pointer to the modified readtwice case?
> 
> Sorry, no. my developer machine crashed, so I lost case my container and 
> modified
> case. I am struggling to get my container back from a account problematic 
> repository. 
> 
> But some testing scripts is here, generally, the original readtwice case will
> run each of threads on each of cpus. The new case will run one container on 
> each cpus,
> and just run one readtwice thead in each of containers.

Ok, what you've sent so far gives me an idea of what you did.  My readtwice
changes were similar, except I used the cgroup interface directly instead of
docker and shared a filesystem between all the cgroups whereas it looks like
you had one per memcg.  30 second runs on 5.9-rc2 and v18 gave 11% more data
read with v18.  This was using 16 cgroups (32 dd tasks) on a 40 CPU, 2 socket
machine.
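(Concretely, though this is from memory rather than the exact commands: one
memcg per dd pair, e.g. under the v1 memory controller mount, the dd pids
written into each group's cgroup.procs, and all of them reading the same
files from a single filesystem.)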

> > Even better would be a description of the problem you're having in 
> > production
> > with lru_lock.  We might be able to create at least a simulation of it to 
> > show
> > what the expected improvement of your real workload is.
> 
> we are using thousands memcgs in a machine, but as a simulation, I guess 
> above case
> could be helpful to show the problem.

Using thousands of memcgs to do what?  Any particulars about the type of
workload?  Surely it's more complicated than page cache reads :)

> > I ran a few benchmarks on v17 last week (sysbench oltp readonly, kerndevel 
> > from
> > mmtests, a memcg-ized version of the readtwice case I cooked up) and then 
> > today
> > discovered there's a chance I wasn't running the right kernels, so I'm 
> > redoing
> > them on v18.

Neither kernel compile nor git checkout in the root cgroup changed much, just
0.31% slower on elapsed time for the compile, so no significant regressions
there.  Now for sysbench again.


Re: [PATCH v18 00/32] per memcg lru_lock

2020-08-24 Thread Daniel Jordan
On Mon, Aug 24, 2020 at 01:24:20PM -0700, Hugh Dickins wrote:
> On Mon, 24 Aug 2020, Andrew Morton wrote:
> > On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi  
> > wrote:
> Andrew demurred on version 17 for lack of review.  Alexander Duyck has
> been doing a lot on that front since then.  I have intended to do so,
> but it's a mirage that moves away from me as I move towards it: I have

Same, I haven't been able to keep up with the versions or the recent review
feedback.  I got through about half of v17 last week and hope to have more time
for the rest this week and beyond.

> > > Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104
> > > containers on a 2s * 26cores * HT box with a modefied case:

Alex, do you have a pointer to the modified readtwice case?

Even better would be a description of the problem you're having in production
with lru_lock.  We might be able to create at least a simulation of it to show
what the expected improvement of your real workload is.

> > > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> > > With this patchset, the readtwice performance increased about 80%
> > > in concurrent containers.
> > 
> > That's rather a slight amount of performance testing for a huge
> > performance patchset!
> 
> Indeed.  And I see that clause about readtwice performance increased 80%
> going back eight months to v6: a lot of fundamental bugs have been fixed
> in it since then, so I do think it needs refreshing.  It could be faster
> now: v16 or v17 fixed the last bug I knew of, which had been slowing
> down reclaim considerably.
> 
> When I last timed my repetitive swapping loads (not loads anyone sensible
> would be running with), across only two memcgs, Alex's patchset was
> slightly faster than without: it really did make a difference.  But
> I tend to think that for all patchsets, there exists at least one
> test that shows it faster, and another that shows it slower.
> 
> > Is more detailed testing planned?
> 
> Not by me, performance testing is not something I trust myself with,
> just get lost in the numbers: Alex, this is what we hoped for months
> ago, please make a more convincing case, I hope Daniel and others
> can make more suggestions.  But my own evidence suggests it's good.

I ran a few benchmarks on v17 last week (sysbench oltp readonly, kerndevel from
mmtests, a memcg-ized version of the readtwice case I cooked up) and then today
discovered there's a chance I wasn't running the right kernels, so I'm redoing
them on v18.  Plan to look into what other, more "macro" tests would be
sensitive to these changes.


Re: [PATCH 2/2] mm, util: account_locked_vm() does not hold mmap_lock

2020-07-30 Thread Daniel Jordan
On Wed, Jul 29, 2020 at 12:21:11PM -0700, Hugh Dickins wrote:
> On Sun, 26 Jul 2020, Pengfei Li wrote:
> 
> > Since mm->locked_vm is already an atomic counter, account_locked_vm()
> > does not need to hold mmap_lock.
> 
> I am worried that this patch, already added to mmotm, along with its
> 1/2 making locked_vm an atomic64, might be rushed into v5.9 with just
> that two-line commit description, and no discussion at all.
> 
> locked_vm belongs fundamentally to mm/mlock.c, and the lock to guard
> it is mmap_lock; and mlock() has some complicated stuff to do under
> that lock while it decides how to adjust locked_vm.
>
> It is very easy to convert an unsigned long to an atomic64_t, but
> "atomic read, check limit and do stuff, atomic add" does not give
> the same guarantee as holding the right lock around it all.

Yes, this is why I withdrew my attempt to do something similar last year, I
didn't want to make the accounting racy.  Stack and heap growing and mremap
would be affected in addition to mlock.
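
To spell out the worry with a sketch (pseudo-code, not from any particular
caller; npages and lock_limit are stand-ins):

	s64 locked = atomic64_read(&mm->locked_vm);
	if (locked + npages > lock_limit)	/* two tasks can both pass... */
		return -ENOMEM;
	atomic64_add(npages, &mm->locked_vm);	/* ...and together overshoot the limit */

Holding mmap_lock across both the check and the update is what keeps that
from happening today.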

It'd help to hear more about the motivation for this.

Daniel


Re: [PATCH V4] mm/vmstat: Add events for THP migration without split

2020-07-24 Thread Daniel Jordan
I'm assuming the newly-enlarged positive error return of migrate_pages(2) won't
have adverse effects in userspace.  Didn't see issues with any user in debian
codesearch, and can't imagine how it could be relied on.

This looks ok.  Just some nits, take them or leave them as you prefer.

Reviewed-by: Daniel Jordan 

> diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> index 705b33d1e395..4d434398d64d 100644
> --- a/include/trace/events/migrate.h
> +++ b/include/trace/events/migrate.h
> @@ -46,13 +46,18 @@ MIGRATE_REASON
>  TRACE_EVENT(mm_migrate_pages,
>  
>   TP_PROTO(unsigned long succeeded, unsigned long failed,
> -  enum migrate_mode mode, int reason),
> +  unsigned long thp_succeeded, unsigned long thp_failed,
> +  unsigned long thp_split, enum migrate_mode mode, int reason),
>  
> - TP_ARGS(succeeded, failed, mode, reason),
> + TP_ARGS(succeeded, failed, thp_succeeded, thp_failed,
> + thp_split, mode, reason),
>  
>   TP_STRUCT__entry(
>   __field(unsigned long,  succeeded)
>   __field(unsigned long,  failed)
> + __field(unsigned long,  thp_succeeded)
> + __field(unsigned long,  thp_failed)
> + __field(unsigned long,  thp_split)

These three are ints in the code, not unsigned long.  It can save space in the
trace event struct, 8 bytes on my machine.
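
i.e. something like this (untested, and TP_PROTO/TP_ARGS would want the same
treatment):

	__field(int,	thp_succeeded)
	__field(int,	thp_failed)
	__field(int,	thp_split)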

> >>>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>>> @@ -1429,22 +1429,35 @@ int migrate_pages(struct list_head *from, 
> >>>> new_page_t get_new_page,
...
> >>>> +bool is_thp = false;

Don't need to initialize, could declare with rc/nr_subpages.

> >>>> +if (is_thp) {
> >>>> +nr_thp_failed++;
> >>>> +nr_failed += nr_subpages;
> >>>> +goto out;
> >>>> +}
> >>>>  nr_failed++;
> >>>>  goto out;

This instead, in each of the three places with this pattern?:

if (is_thp)
nr_thp_failed++;
nr_failed += nr_subpages;
goto out;

> diff --git a/Documentation/vm/page_migration.rst 
> b/Documentation/vm/page_migration.rst
...
> +5. THP_MIGRATION_SPLIT: A THP was migrated, but not as such: first, the THP 
> had
> +   to be split. After splitting, a migration retry was used for it's 
> sub-pages.

The first part of this might be misinterpreted for the same reason Anshuman
changed this earlier (migration didn't necessarily happen).  We could just
delete "A THP was migrated, but not as such: first, "


Re: [PATCH v3] x86/mm: use max memory block size on bare metal

2020-07-15 Thread Daniel Jordan
On Tue, Jul 14, 2020 at 04:54:50PM -0400, Daniel Jordan wrote:
> Some of our servers spend significant time at kernel boot initializing
> memory block sysfs directories and then creating symlinks between them
> and the corresponding nodes.  The slowness happens because the machines
> get stuck with the smallest supported memory block size on x86 (128M),
> which results in 16,288 directories to cover the 2T of installed RAM.
> The search for each memory block is noticeable even with
> commit 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in
> xarray to accelerate lookup").
> 
> Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based
> on the end of boot memory") chooses the block size based on alignment
> with memory end.  That addresses hotplug failures in qemu guests, but
> for bare metal systems whose memory end isn't aligned to even the
> smallest size, it leaves them at 128M.
> 
> Make kernels that aren't running on a hypervisor use the largest
> supported size (2G) to minimize overhead on big machines.  Kernel boot
> goes 7% faster on the aforementioned servers, shaving off half a second.
> 
> Signed-off-by: Daniel Jordan 
> Cc: Andrew Morton 
> Cc: Andy Lutomirski 
> Cc: Dave Hansen 
> Cc: David Hildenbrand 

Darn.  David, I forgot to add your ack from v2.  My assumption is that it still
stands after the minor change in this version, but please do correct me if I'm
wrong.


[PATCH v3] x86/mm: use max memory block size on bare metal

2020-07-14 Thread Daniel Jordan
Some of our servers spend significant time at kernel boot initializing
memory block sysfs directories and then creating symlinks between them
and the corresponding nodes.  The slowness happens because the machines
get stuck with the smallest supported memory block size on x86 (128M),
which results in 16,288 directories to cover the 2T of installed RAM.
The search for each memory block is noticeable even with
commit 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in
xarray to accelerate lookup").
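(For scale: 2T / 128M works out to 16,384 possible blocks, 16,288 after what
is presumably holes, versus 1,024 blocks at the 2G maximum.)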

Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based
on the end of boot memory") chooses the block size based on alignment
with memory end.  That addresses hotplug failures in qemu guests, but
for bare metal systems whose memory end isn't aligned to even the
smallest size, it leaves them at 128M.

Make kernels that aren't running on a hypervisor use the largest
supported size (2G) to minimize overhead on big machines.  Kernel boot
goes 7% faster on the aforementioned servers, shaving off half a second.

Signed-off-by: Daniel Jordan 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Peter Zijlstra 
Cc: Steven Sistare 
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---

v3:
 - Add more accurate hypervisor check.  Someone kindly pointed me to
   517c3ba00916 ("x86/speculation/mds: Apply more accurate check on
   hypervisor platform"), and v2 had the same issue.
 - Rebase on v5.8-rc5

v2:
 - Thanks to David for the idea to make this conditional based on
   virtualization.
 - Update performance numbers to account for 4fb6eabf1037 (David)

 arch/x86/mm/init_64.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index dbae185511cdf..51ea8b8e2959d 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1406,6 +1406,15 @@ static unsigned long probe_memory_block_size(void)
goto done;
}
 
+   /*
+* Use max block size to minimize overhead on bare metal, where
+* alignment for memory hotplug isn't a concern.
+*/
+   if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
+   bz = MAX_BLOCK_SIZE;
+   goto done;
+   }
+
/* Find the largest allowed block size that aligns to memory end */
for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
if (IS_ALIGNED(boot_mem_end, bz))

base-commit: 11ba468877bb23f28956a35e896356252d63c983
-- 
2.27.0



[PATCH 5/6] padata: fold padata_alloc_possible() into padata_alloc()

2020-07-14 Thread Daniel Jordan
There's no reason to have two interfaces when there's only one caller.
Removing _possible saves text and simplifies future changes.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-cry...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 Documentation/core-api/padata.rst |  2 +-
 crypto/pcrypt.c   |  2 +-
 include/linux/padata.h|  2 +-
 kernel/padata.c   | 33 +--
 4 files changed, 8 insertions(+), 31 deletions(-)

diff --git a/Documentation/core-api/padata.rst 
b/Documentation/core-api/padata.rst
index 771d50330e5b5..35175710b43cc 100644
--- a/Documentation/core-api/padata.rst
+++ b/Documentation/core-api/padata.rst
@@ -27,7 +27,7 @@ padata_instance structure for overall control of how jobs are 
to be run::
 
 #include 
 
-struct padata_instance *padata_alloc_possible(const char *name);
+struct padata_instance *padata_alloc(const char *name);
 
 'name' simply identifies the instance.
 
diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index 7374dfecaf70f..812892732a5e5 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -320,7 +320,7 @@ static int pcrypt_init_padata(struct padata_instance 
**pinst, const char *name)
 {
int ret = -ENOMEM;
 
-   *pinst = padata_alloc_possible(name);
+   *pinst = padata_alloc(name);
if (!*pinst)
return ret;
 
diff --git a/include/linux/padata.h b/include/linux/padata.h
index a941b96b7119e..070a7d43e8af8 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -192,7 +192,7 @@ extern void __init padata_init(void);
 static inline void __init padata_init(void) {}
 #endif
 
-extern struct padata_instance *padata_alloc_possible(const char *name);
+extern struct padata_instance *padata_alloc(const char *name);
 extern void padata_free(struct padata_instance *pinst);
 extern struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
 extern void padata_free_shell(struct padata_shell *ps);
diff --git a/kernel/padata.c b/kernel/padata.c
index 4f0a57e5738c9..1c0b97891edb8 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -979,18 +979,12 @@ static struct kobj_type padata_attr_type = {
 };
 
 /**
- * padata_alloc - allocate and initialize a padata instance and specify
- *cpumasks for serial and parallel workers.
- *
+ * padata_alloc - allocate and initialize a padata instance
  * @name: used to identify the instance
- * @pcpumask: cpumask that will be used for padata parallelization
- * @cbcpumask: cpumask that will be used for padata serialization
  *
  * Return: new instance on success, NULL on error
  */
-static struct padata_instance *padata_alloc(const char *name,
-   const struct cpumask *pcpumask,
-   const struct cpumask *cbcpumask)
+struct padata_instance *padata_alloc(const char *name)
 {
struct padata_instance *pinst;
 
@@ -1016,14 +1010,11 @@ static struct padata_instance *padata_alloc(const char 
*name,
free_cpumask_var(pinst->cpumask.pcpu);
goto err_free_serial_wq;
}
-   if (!padata_validate_cpumask(pinst, pcpumask) ||
-   !padata_validate_cpumask(pinst, cbcpumask))
-   goto err_free_masks;
 
	INIT_LIST_HEAD(&pinst->pslist);
 
-   cpumask_copy(pinst->cpumask.pcpu, pcpumask);
-   cpumask_copy(pinst->cpumask.cbcpu, cbcpumask);
+   cpumask_copy(pinst->cpumask.pcpu, cpu_possible_mask);
+   cpumask_copy(pinst->cpumask.cbcpu, cpu_possible_mask);
 
if (padata_setup_cpumasks(pinst))
goto err_free_masks;
@@ -1057,21 +1048,7 @@ static struct padata_instance *padata_alloc(const char 
*name,
 err:
return NULL;
 }
-
-/**
- * padata_alloc_possible - Allocate and initialize padata instance.
- * Use the cpu_possible_mask for serial and
- * parallel workers.
- *
- * @name: used to identify the instance
- *
- * Return: new instance on success, NULL on error
- */
-struct padata_instance *padata_alloc_possible(const char *name)
-{
-   return padata_alloc(name, cpu_possible_mask, cpu_possible_mask);
-}
-EXPORT_SYMBOL(padata_alloc_possible);
+EXPORT_SYMBOL(padata_alloc);
 
 /**
  * padata_free - free a padata instance
-- 
2.27.0



[PATCH 4/6] padata: remove effective cpumasks from the instance

2020-07-14 Thread Daniel Jordan
A padata instance has effective cpumasks that store the user-supplied
masks ANDed with the online mask, but this middleman is unnecessary.
parallel_data keeps the same information around.  Removing this saves
text and code churn in future changes.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-cry...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/padata.h |  2 --
 kernel/padata.c| 30 +++---
 2 files changed, 3 insertions(+), 29 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 7d53208b43daa..a941b96b7119e 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -167,7 +167,6 @@ struct padata_mt_job {
  * @serial_wq: The workqueue used for serial work.
  * @pslist: List of padata_shell objects attached to this instance.
  * @cpumask: User supplied cpumasks for parallel and serial works.
- * @rcpumask: Actual cpumasks based on user cpumask and cpu_online_mask.
  * @kobj: padata instance kernel object.
  * @lock: padata instance lock.
  * @flags: padata flags.
@@ -179,7 +178,6 @@ struct padata_instance {
struct workqueue_struct *serial_wq;
struct list_headpslist;
struct padata_cpumask   cpumask;
-   struct padata_cpumask   rcpumask;
struct kobject   kobj;
struct mutex lock;
u8   flags;
diff --git a/kernel/padata.c b/kernel/padata.c
index 27f90a3c4dc6b..4f0a57e5738c9 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -571,13 +571,8 @@ static void padata_init_pqueues(struct parallel_data *pd)
 static struct parallel_data *padata_alloc_pd(struct padata_shell *ps)
 {
struct padata_instance *pinst = ps->pinst;
-   const struct cpumask *cbcpumask;
-   const struct cpumask *pcpumask;
struct parallel_data *pd;
 
-   cbcpumask = pinst->rcpumask.cbcpu;
-   pcpumask = pinst->rcpumask.pcpu;
-
pd = kzalloc(sizeof(struct parallel_data), GFP_KERNEL);
if (!pd)
goto err;
@@ -597,8 +592,8 @@ static struct parallel_data *padata_alloc_pd(struct 
padata_shell *ps)
	if (!alloc_cpumask_var(&pd->cpumask.cbcpu, GFP_KERNEL))
goto err_free_pcpu;
 
-   cpumask_copy(pd->cpumask.pcpu, pcpumask);
-   cpumask_copy(pd->cpumask.cbcpu, cbcpumask);
+   cpumask_and(pd->cpumask.pcpu, pinst->cpumask.pcpu, cpu_online_mask);
+   cpumask_and(pd->cpumask.cbcpu, pinst->cpumask.cbcpu, cpu_online_mask);
 
padata_init_pqueues(pd);
padata_init_squeues(pd);
@@ -668,12 +663,6 @@ static int padata_replace(struct padata_instance *pinst)
 
pinst->flags |= PADATA_RESET;
 
-   cpumask_and(pinst->rcpumask.pcpu, pinst->cpumask.pcpu,
-   cpu_online_mask);
-
-   cpumask_and(pinst->rcpumask.cbcpu, pinst->cpumask.cbcpu,
-   cpu_online_mask);
-
	list_for_each_entry(ps, &pinst->pslist, list) {
err = padata_replace_one(ps);
if (err)
@@ -856,8 +845,6 @@ static void __padata_free(struct padata_instance *pinst)
 
	WARN_ON(!list_empty(&pinst->pslist));
 
-   free_cpumask_var(pinst->rcpumask.cbcpu);
-   free_cpumask_var(pinst->rcpumask.pcpu);
free_cpumask_var(pinst->cpumask.pcpu);
free_cpumask_var(pinst->cpumask.cbcpu);
destroy_workqueue(pinst->serial_wq);
@@ -1033,20 +1020,13 @@ static struct padata_instance *padata_alloc(const char 
*name,
!padata_validate_cpumask(pinst, cbcpumask))
goto err_free_masks;
 
-   if (!alloc_cpumask_var(&pinst->rcpumask.pcpu, GFP_KERNEL))
-   goto err_free_masks;
-   if (!alloc_cpumask_var(&pinst->rcpumask.cbcpu, GFP_KERNEL))
-   goto err_free_rcpumask_pcpu;
-
INIT_LIST_HEAD(>pslist);
 
cpumask_copy(pinst->cpumask.pcpu, pcpumask);
cpumask_copy(pinst->cpumask.cbcpu, cbcpumask);
-   cpumask_and(pinst->rcpumask.pcpu, pcpumask, cpu_online_mask);
-   cpumask_and(pinst->rcpumask.cbcpu, cbcpumask, cpu_online_mask);
 
if (padata_setup_cpumasks(pinst))
-   goto err_free_rcpumask_cbcpu;
+   goto err_free_masks;
 
__padata_start(pinst);
 
@@ -1064,10 +1044,6 @@ static struct padata_instance *padata_alloc(const char 
*name,
 
return pinst;
 
-err_free_rcpumask_cbcpu:
-   free_cpumask_var(pinst->rcpumask.cbcpu);
-err_free_rcpumask_pcpu:
-   free_cpumask_var(pinst->rcpumask.pcpu);
 err_free_masks:
free_cpumask_var(pinst->cpumask.pcpu);
free_cpumask_var(pinst->cpumask.cbcpu);
-- 
2.27.0



[PATCH 3/6] padata: inline single call of pd_setup_cpumasks()

2020-07-14 Thread Daniel Jordan
pd_setup_cpumasks() has only one caller.  Move its contents inline to
prepare for the next cleanup.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-cry...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 kernel/padata.c | 32 +---
 1 file changed, 9 insertions(+), 23 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index 8f55e717ba50b..27f90a3c4dc6b 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -441,28 +441,6 @@ static int padata_setup_cpumasks(struct padata_instance 
*pinst)
return err;
 }
 
-static int pd_setup_cpumasks(struct parallel_data *pd,
-const struct cpumask *pcpumask,
-const struct cpumask *cbcpumask)
-{
-   int err = -ENOMEM;
-
-   if (!alloc_cpumask_var(&pd->cpumask.pcpu, GFP_KERNEL))
-   goto out;
-   if (!alloc_cpumask_var(&pd->cpumask.cbcpu, GFP_KERNEL))
-   goto free_pcpu_mask;
-
-   cpumask_copy(pd->cpumask.pcpu, pcpumask);
-   cpumask_copy(pd->cpumask.cbcpu, cbcpumask);
-
-   return 0;
-
-free_pcpu_mask:
-   free_cpumask_var(pd->cpumask.pcpu);
-out:
-   return err;
-}
-
 static void __init padata_mt_helper(struct work_struct *w)
 {
struct padata_work *pw = container_of(w, struct padata_work, pw_work);
@@ -613,8 +591,14 @@ static struct parallel_data *padata_alloc_pd(struct 
padata_shell *ps)
goto err_free_pqueue;
 
pd->ps = ps;
-   if (pd_setup_cpumasks(pd, pcpumask, cbcpumask))
+
+   if (!alloc_cpumask_var(&pd->cpumask.pcpu, GFP_KERNEL))
	goto err_free_squeue;
+   if (!alloc_cpumask_var(&pd->cpumask.cbcpu, GFP_KERNEL))
+   goto err_free_pcpu;
+
+   cpumask_copy(pd->cpumask.pcpu, pcpumask);
+   cpumask_copy(pd->cpumask.cbcpu, cbcpumask);
 
padata_init_pqueues(pd);
padata_init_squeues(pd);
@@ -626,6 +610,8 @@ static struct parallel_data *padata_alloc_pd(struct 
padata_shell *ps)
 
return pd;
 
+err_free_pcpu:
+   free_cpumask_var(pd->cpumask.pcpu);
 err_free_squeue:
free_percpu(pd->squeue);
 err_free_pqueue:
-- 
2.27.0



[PATCH 1/6] padata: remove start function

2020-07-14 Thread Daniel Jordan
padata_start() is only used right after pcrypt allocates an instance
with all possible CPUs, when PADATA_INVALID can't happen, so there's no
need for a separate "start" step.  It can be done during allocation to
save text, make using padata easier, and avoid unneeded calls in the
future.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-cry...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 crypto/pcrypt.c|  3 ---
 include/linux/padata.h |  1 -
 kernel/padata.c| 26 +-
 3 files changed, 1 insertion(+), 29 deletions(-)

diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index 8bddc65cd5092..4f5707a3dd1e9 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -359,9 +359,6 @@ static int __init pcrypt_init(void)
if (err)
goto err_deinit_pencrypt;
 
-   padata_start(pencrypt);
-   padata_start(pdecrypt);
-
	return crypto_register_template(&pcrypt_tmpl);
 
 err_deinit_pencrypt:
diff --git a/include/linux/padata.h b/include/linux/padata.h
index 7302efff5e656..20294cddc7396 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -204,6 +204,5 @@ extern void padata_do_serial(struct padata_priv *padata);
 extern void __init padata_do_multithreaded(struct padata_mt_job *job);
 extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
  cpumask_var_t cpumask);
-extern int padata_start(struct padata_instance *pinst);
 extern void padata_stop(struct padata_instance *pinst);
 #endif
diff --git a/kernel/padata.c b/kernel/padata.c
index 4373f7adaa40a..9317623166124 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -789,30 +789,6 @@ int padata_set_cpumask(struct padata_instance *pinst, int 
cpumask_type,
 }
 EXPORT_SYMBOL(padata_set_cpumask);
 
-/**
- * padata_start - start the parallel processing
- *
- * @pinst: padata instance to start
- *
- * Return: 0 on success or negative error code
- */
-int padata_start(struct padata_instance *pinst)
-{
-   int err = 0;
-
-   mutex_lock(&pinst->lock);
-
-   if (pinst->flags & PADATA_INVALID)
-   err = -EINVAL;
-
-   __padata_start(pinst);
-
-   mutex_unlock(&pinst->lock);
-
-   return err;
-}
-EXPORT_SYMBOL(padata_start);
-
 /**
  * padata_stop - stop the parallel processing
  *
@@ -1100,7 +1076,7 @@ static struct padata_instance *padata_alloc(const char 
*name,
if (padata_setup_cpumasks(pinst))
goto err_free_rcpumask_cbcpu;
 
-   pinst->flags = 0;
+   __padata_start(pinst);
 
	kobject_init(&pinst->kobj, &padata_attr_type);
	mutex_init(&pinst->lock);
-- 
2.27.0



[PATCH 6/6] padata: remove padata_parallel_queue

2020-07-14 Thread Daniel Jordan
Only its reorder field is actually used now, so remove the struct and
embed @reorder directly in parallel_data.

No functional change, just a cleanup.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-cry...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/padata.h | 15 ++
 kernel/padata.c| 46 ++
 2 files changed, 22 insertions(+), 39 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 070a7d43e8af8..a433f13fc4bf7 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -66,17 +66,6 @@ struct padata_serial_queue {
struct parallel_data *pd;
 };
 
-/**
- * struct padata_parallel_queue - The percpu padata parallel queue
- *
- * @reorder: List to wait for reordering after parallel processing.
- * @num_obj: Number of objects that are processed by this cpu.
- */
-struct padata_parallel_queue {
-   struct padata_listreorder;
-   atomic_t  num_obj;
-};
-
 /**
  * struct padata_cpumask - The cpumasks for the parallel/serial workers
  *
@@ -93,7 +82,7 @@ struct padata_cpumask {
  * that depends on the cpumask in use.
  *
  * @ps: padata_shell object.
- * @pqueue: percpu padata queues used for parallelization.
+ * @reorder_list: percpu reorder lists
  * @squeue: percpu padata queues used for serialuzation.
  * @refcnt: Number of objects holding a reference on this parallel_data.
  * @seq_nr: Sequence number of the parallelized data object.
@@ -105,7 +94,7 @@ struct padata_cpumask {
  */
 struct parallel_data {
struct padata_shell *ps;
-   struct padata_parallel_queue__percpu *pqueue;
+   struct padata_list  __percpu *reorder_list;
struct padata_serial_queue  __percpu *squeue;
atomic_trefcnt;
unsigned intseq_nr;
diff --git a/kernel/padata.c b/kernel/padata.c
index 1c0b97891edb8..16cb894dc272b 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -250,13 +250,11 @@ EXPORT_SYMBOL(padata_do_parallel);
 static struct padata_priv *padata_find_next(struct parallel_data *pd,
bool remove_object)
 {
-   struct padata_parallel_queue *next_queue;
struct padata_priv *padata;
struct padata_list *reorder;
int cpu = pd->cpu;
 
-   next_queue = per_cpu_ptr(pd->pqueue, cpu);
-   reorder = &next_queue->reorder;
+   reorder = per_cpu_ptr(pd->reorder_list, cpu);
 
	spin_lock(&reorder->lock);
	if (list_empty(&reorder->list)) {
@@ -291,7 +289,7 @@ static void padata_reorder(struct parallel_data *pd)
int cb_cpu;
struct padata_priv *padata;
struct padata_serial_queue *squeue;
-   struct padata_parallel_queue *next_queue;
+   struct padata_list *reorder;
 
/*
 * We need to ensure that only one cpu can work on dequeueing of
@@ -339,9 +337,8 @@ static void padata_reorder(struct parallel_data *pd)
 */
smp_mb();
 
-   next_queue = per_cpu_ptr(pd->pqueue, pd->cpu);
-   if (!list_empty(&next_queue->reorder.list) &&
-   padata_find_next(pd, false))
+   reorder = per_cpu_ptr(pd->reorder_list, pd->cpu);
+   if (!list_empty(>list) && padata_find_next(pd, false))
	queue_work(pinst->serial_wq, &pd->reorder_work);
 }
 
@@ -401,17 +398,16 @@ void padata_do_serial(struct padata_priv *padata)
 {
struct parallel_data *pd = padata->pd;
int hashed_cpu = padata_cpu_hash(pd, padata->seq_nr);
-   struct padata_parallel_queue *pqueue = per_cpu_ptr(pd->pqueue,
-  hashed_cpu);
+   struct padata_list *reorder = per_cpu_ptr(pd->reorder_list, hashed_cpu);
struct padata_priv *cur;
 
-   spin_lock(&pqueue->reorder.lock);
+   spin_lock(&reorder->lock);
	/* Sort in ascending order of sequence number. */
-   list_for_each_entry_reverse(cur, &pqueue->reorder.list, list)
+   list_for_each_entry_reverse(cur, &reorder->list, list)
	if (cur->seq_nr < padata->seq_nr)
		break;
	list_add(&padata->list, &cur->list);
-   spin_unlock(&pqueue->reorder.lock);
+   spin_unlock(&reorder->lock);
 
/*
 * Ensure the addition to the reorder list is ordered correctly
@@ -553,17 +549,15 @@ static void padata_init_squeues(struct parallel_data *pd)
}
 }
 
-/* Initialize all percpu queues used by parallel workers */
-static void padata_init_pqueues(struct parallel_data *pd)
+/* Initialize per-CPU reorder lists */
+static void padata_init_reorder_list(struct parallel_data *pd)
 {
int cpu;
-   struct padata_parallel_queue *pqueue;
+   struct padata_list *list;
 
for_each_cpu(cpu, pd->cpumask.pcpu) {
-   pqueue = per_cpu_ptr(pd->pqueue, cpu);
-
-   __padata_list_init

[PATCH 2/6] padata: remove stop function

2020-07-14 Thread Daniel Jordan
padata_stop() has two callers and is unnecessary in both cases.  When
pcrypt calls it before padata_free(), it's being unloaded so there are
no outstanding padata jobs[0].  When __padata_free() calls it, it's
either along the same path or else pcrypt initialization failed, which
of course means there are also no outstanding jobs.

Removing it simplifies padata and saves text.

[0] 
https://lore.kernel.org/linux-crypto/20191119225017.mjrak2fwa5vcc...@gondor.apana.org.au/

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-cry...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 Documentation/core-api/padata.rst | 16 ++--
 crypto/pcrypt.c   | 12 +++-
 include/linux/padata.h|  1 -
 kernel/padata.c   | 14 --
 4 files changed, 5 insertions(+), 38 deletions(-)

diff --git a/Documentation/core-api/padata.rst 
b/Documentation/core-api/padata.rst
index 0830e5b0e8211..771d50330e5b5 100644
--- a/Documentation/core-api/padata.rst
+++ b/Documentation/core-api/padata.rst
@@ -31,18 +31,7 @@ padata_instance structure for overall control of how jobs 
are to be run::
 
 'name' simply identifies the instance.
 
-There are functions for enabling and disabling the instance::
-
-int padata_start(struct padata_instance *pinst);
-void padata_stop(struct padata_instance *pinst);
-
-These functions are setting or clearing the "PADATA_INIT" flag; if that flag is
-not set, other functions will refuse to work.  padata_start() returns zero on
-success (flag set) or -EINVAL if the padata cpumask contains no active CPU
-(flag not set).  padata_stop() clears the flag and blocks until the padata
-instance is unused.
-
-Finally, complete padata initialization by allocating a padata_shell::
+Then, complete padata initialization by allocating a padata_shell::
 
struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
 
@@ -155,11 +144,10 @@ submitted.
 Destroying
 --
 
-Cleaning up a padata instance predictably involves calling the three free
+Cleaning up a padata instance predictably involves calling the two free
 functions that correspond to the allocation in reverse::
 
 void padata_free_shell(struct padata_shell *ps);
-void padata_stop(struct padata_instance *pinst);
 void padata_free(struct padata_instance *pinst);
 
 It is the user's responsibility to ensure all outstanding jobs are complete
diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index 4f5707a3dd1e9..7374dfecaf70f 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -331,12 +331,6 @@ static int pcrypt_init_padata(struct padata_instance 
**pinst, const char *name)
return ret;
 }
 
-static void pcrypt_fini_padata(struct padata_instance *pinst)
-{
-   padata_stop(pinst);
-   padata_free(pinst);
-}
-
 static struct crypto_template pcrypt_tmpl = {
.name = "pcrypt",
.create = pcrypt_create,
@@ -362,7 +356,7 @@ static int __init pcrypt_init(void)
	return crypto_register_template(&pcrypt_tmpl);
 
 err_deinit_pencrypt:
-   pcrypt_fini_padata(pencrypt);
+   padata_free(pencrypt);
 err_unreg_kset:
kset_unregister(pcrypt_kset);
 err:
@@ -373,8 +367,8 @@ static void __exit pcrypt_exit(void)
 {
	crypto_unregister_template(&pcrypt_tmpl);
 
-   pcrypt_fini_padata(pencrypt);
-   pcrypt_fini_padata(pdecrypt);
+   padata_free(pencrypt);
+   padata_free(pdecrypt);
 
kset_unregister(pcrypt_kset);
 }
diff --git a/include/linux/padata.h b/include/linux/padata.h
index 20294cddc7396..7d53208b43daa 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -204,5 +204,4 @@ extern void padata_do_serial(struct padata_priv *padata);
 extern void __init padata_do_multithreaded(struct padata_mt_job *job);
 extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
  cpumask_var_t cpumask);
-extern void padata_stop(struct padata_instance *pinst);
 #endif
diff --git a/kernel/padata.c b/kernel/padata.c
index 9317623166124..8f55e717ba50b 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -789,19 +789,6 @@ int padata_set_cpumask(struct padata_instance *pinst, int 
cpumask_type,
 }
 EXPORT_SYMBOL(padata_set_cpumask);
 
-/**
- * padata_stop - stop the parallel processing
- *
- * @pinst: padata instance to stop
- */
-void padata_stop(struct padata_instance *pinst)
-{
-   mutex_lock(&pinst->lock);
-   __padata_stop(pinst);
-   mutex_unlock(&pinst->lock);
-}
-EXPORT_SYMBOL(padata_stop);
-
 #ifdef CONFIG_HOTPLUG_CPU
 
 static int __padata_add_cpu(struct padata_instance *pinst, int cpu)
@@ -883,7 +870,6 @@ static void __padata_free(struct padata_instance *pinst)
 
	WARN_ON(!list_empty(&pinst->pslist));
 
-   padata_stop(pinst);
free_cpumask_var(pinst->rcpumask.cbcpu);
free_cpumask_var(pinst->rcpumask.pcpu);
free_cpumask_var(pinst->cpumask.pcpu);
-- 
2.27.0



[PATCH 0/6] padata cleanups

2020-07-14 Thread Daniel Jordan
These cleanups save ~5% of the padata text/data and make it a little
easier to use and develop going forward.

In particular, they pave the way to extend padata's multithreading support to
VFIO, a work-in-progress version of which can be found here:


https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.5

Based on v5.8-rc5.  As always, feedback is welcome.

Daniel

Daniel Jordan (6):
  padata: remove start function
  padata: remove stop function
  padata: inline single call of pd_setup_cpumasks()
  padata: remove effective cpumasks from the instance
  padata: fold padata_alloc_possible() into padata_alloc()
  padata: remove padata_parallel_queue

 Documentation/core-api/padata.rst |  18 +--
 crypto/pcrypt.c   |  17 +--
 include/linux/padata.h|  21 +---
 kernel/padata.c   | 177 ++
 4 files changed, 46 insertions(+), 187 deletions(-)


base-commit: 11ba468877bb23f28956a35e896356252d63c983
-- 
2.27.0



Re: [PATCH v2] Remove __init from padata_do_multithreaded and padata_mt_helper.

2020-07-08 Thread Daniel Jordan
(I was away for a while)

On Thu, Jul 02, 2020 at 11:55:48AM -0400, Nico Pache wrote:
> Allow padata_do_multithreaded function to be called after bootstrap.

The functions are __init because they're currently only needed during boot, and
using __init allows their text to be freed once boot is over, saving some memory.

So this change, in isolation, doesn't make sense.  If there were an enhancement
you were thinking of making, this patch could then be bundled with it so the
change is made only when it's used.

However, there's still work that needs to be merged before
padata_do_multithreaded can be called after boot.  See the parts about priority
adjustments (MAX_NICE/renicing) and concurrency limits in this branch

  
https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.5

and the ktask discussions from linux-mm/lkml where concerns about these issues
were raised.  I plan to post these parts fairly soon and can include you if you
want.


Re: [PATCH v2] x86/mm: use max memory block size on bare metal

2020-07-08 Thread Daniel Jordan
(I'm back now)

On Fri, Jun 26, 2020 at 02:47:06PM +0200, Michal Hocko wrote:
> On Mon 22-06-20 15:17:39, Daniel Jordan wrote:
> > Hello Michal,
> > 
> > (I've been away and may be slow to respond for a little while)
> > 
> > On Fri, Jun 19, 2020 at 02:07:04PM +0200, Michal Hocko wrote:
> > > I believe that we should think about a future interface rather than
> > > trying to ducktape the blocksize anytime it causes problems. I would be
> > > even tempted to simply add a kernel command line option 
> > > memory_hotplug=disable,legacy,new_shiny
> > > 
> > > for disable it would simply drop all the sysfs crud and speed up boot
> > > for most users who simply do not care about memory hotplug. new_shiny
> > > would ideally provide an interface that would either export logically
> > > hotplugable memory ranges (e.g. DIMMs) or a query/action interface which
> > > accepts physical ranges as input. Having gazillions of sysfs files is
> > > simply unsustainable.
> > 
> > So in this idea, presumably the default would start off being legacy and 
> > then
> > later be changed to new_shiny?
> 
> Well it really depends. Going with disable as a default would suit most
> users much better because the vast majority simply doesn't use the
> functionality. On the other hand real users would regress unless they
> enable the option. Which is definitely not nice.

Agreed.

> Another and much less
> intrusive change would be creating sysfs interface on-demand. So until
> somebody actually tries to use the interface it won't exist. I haven't
> tried to explore how complex that would be. I am not really familiar
> with sysfs to be honest. But fundamentally nothing should prevent such a
> solution.

Hm, don't know sysfs internals either.  It seems possible that someone (or
something...) navigating around could trigger creation unintentionally--for
instance, by reading the symlinks to the memory block dirs in
/sys/devices/system/node/nodeN when trying to find another file to read in the
same place.

> Another option would be to create sysfs interface only if there is a
> hotplugable memory reported by the platform. But I am not sure we have a
> proper interface for that for all arches.

Systems that happen to have hotpluggable ranges but don't want the
overhead would still be stuck, though, it seems.


FWIW, the ideas for new_shiny sound promising.


Re: [PATCH v2] x86/mm: use max memory block size on bare metal

2020-06-22 Thread Daniel Jordan
Hello Michal,

(I've been away and may be slow to respond for a little while)

On Fri, Jun 19, 2020 at 02:07:04PM +0200, Michal Hocko wrote:
> On Tue 09-06-20 18:54:51, Daniel Jordan wrote:
> [...]
> > @@ -1390,6 +1391,15 @@ static unsigned long probe_memory_block_size(void)
> > goto done;
> > }
> >  
> > +   /*
> > +* Use max block size to minimize overhead on bare metal, where
> > +* alignment for memory hotplug isn't a concern.
> 
> This really begs a clarification why this is not a concern. Bare metal
> can see physical memory hotadd as well. I just suspect that you do not
> consider that to be very common so it is not a big deal?

It's not only uncommon, it's also that boot_mem_end on bare metal may not align
with any available memory block size.  For instance, this server's boot_mem_end
is only 4M aligned and FWIW my desktop's is 2M aligned.  As far as I can tell,
the logic that picks the size wasn't intended for bare metal.
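
To make that concrete, here is a standalone sketch (userspace, simplified
power-of-two macros, hypothetical end address -- not the kernel code itself)
of the selection loop falling through to the 128M minimum when memory end is
only 4M aligned:

    #include <stdio.h>

    #define MIN_MEMORY_BLOCK_SIZE	(128UL << 20)	/* 128M */
    #define MAX_BLOCK_SIZE		(2UL << 30)	/* 2G */
    /* power-of-two alignment check, same shape as the kernel macro */
    #define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

    int main(void)
    {
    	/* made-up bare metal memory end, aligned to 4M but nothing larger */
    	unsigned long boot_mem_end = 0x207fc00000UL;
    	unsigned long bz;

    	/* same loop shape as probe_memory_block_size(): largest aligned size wins */
    	for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1)
    		if (IS_ALIGNED(boot_mem_end, bz))
    			break;

    	/* no candidate down to 256M aligns, so bz ends up at 128M */
    	printf("block size: %luM\n", bz >> 20);
    	return 0;
    }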

> And I would
> tend to agree but still we are just going to wait until first user
> stumbles over this.

This isn't something new with this patch, 2G has been the default on big
machines for years.  This is addressing an unintended side effect of
078eb6aa50dc50, which was for qemu, by restoring the original behavior on bare
metal to avoid oodles of sysfs files.

> Btw. memblock interface just doesn't scale and it is a terrible
> interface for large machines and for the memory hotplug in general (just
> look at ppc and their insanely small memblocks).

I agree that the status quo isn't ideal and is something to address going
forward.

> Most usecases I have seen simply want to either offline some portion of
> memory without a strong requirement of the physical memory range as long
> as it is from a particular node or simply offline and remove the full
> node.

Interesting, would've thought that removing a single bad DIMM for RAS purposes
would also be common relative to how often hotplug is done on real systems.

> I believe that we should think about a future interface rather than
> trying to ducktape the blocksize anytime it causes problems. I would be
> even tempted to simply add a kernel command line option 
> memory_hotplug=disable,legacy,new_shiny
> 
> for disable it would simply drop all the sysfs crud and speed up boot
> for most users who simply do not care about memory hotplug. new_shiny
> would ideally provide an interface that would either export logically
> hotplugable memory ranges (e.g. DIMMs) or a query/action interface which
> accepts physical ranges as input. Having gazillions of sysfs files is
> simply unsustainable.

So in this idea, presumably the default would start off being legacy and then
later be changed to new_shiny?

If new_shiny scales well, maybe 'disable' wouldn't be needed and so using the
option could be avoided most of the time.  If some users really don't want it,
they can build without it.


Re: [PATCH v2] x86/mm: use max memory block size on bare metal

2020-06-11 Thread Daniel Jordan
On Thu, Jun 11, 2020 at 10:05:38AM -0700, Dave Hansen wrote:
> One other nit for this.  We *do* have actual hardware hotplug, and I'm
> pretty sure the alignment guarantees for hardware hotplug are pretty
> weak.  For instance, the alignment guarantees for persistent memory are
> still only 64MB even on modern platforms.
>
> Let's say we're on bare metal and we see an SRAT table that has some
> areas that show that hotplug might happen there.  Is this patch still
> ideal there?

Well, not if there's concern about hardware hotplug.

My assumption going in was that this wasn't a problem in practice.
078eb6aa50dc50 ("x86/mm/memory_hotplug: determine block size based on the end
of boot memory") was merged in 2018 to address qemu hotplug failures and >64G
systems have used a 2G block since 2014 with no complaints about alignment
issues, to my knowledge anyway.


Re: [PATCH v2] x86/mm: use max memory block size on bare metal

2020-06-11 Thread Daniel Jordan
On Thu, Jun 11, 2020 at 07:16:02AM -0700, Dave Hansen wrote:
> On 6/9/20 3:54 PM, Daniel Jordan wrote:
> > +   /*
> > +* Use max block size to minimize overhead on bare metal, where
> > +* alignment for memory hotplug isn't a concern.
> > +*/
> > +   if (hypervisor_is_type(X86_HYPER_NATIVE)) {
> > +   bz = MAX_BLOCK_SIZE;
> > +   goto done;
> > +   }
> 
> What ends up being the worst case scenario?  Booting a really small
> bare-metal x86 system, say with 64MB or 128MB of RAM?  What's the
> overhead there?

Might not be following you, so bear with me, but we only get to this check on a
system with a physical address end of at least MEM_SIZE_FOR_LARGE_BLOCK (64G),
and this would still (ever so slightly...) reduce overhead of memory block init
at boot in that case.


Re: [PATCH v2] x86/mm: use max memory block size on bare metal

2020-06-10 Thread Daniel Jordan
On Wed, Jun 10, 2020 at 09:30:00AM +0200, David Hildenbrand wrote:
> On 10.06.20 09:20, David Hildenbrand wrote:
> > On 10.06.20 00:54, Daniel Jordan wrote:
> >> @@ -1390,6 +1391,15 @@ static unsigned long probe_memory_block_size(void)
> >>goto done;
> >>}
> >>  
> >> +  /*
> >> +   * Use max block size to minimize overhead on bare metal, where
> >> +   * alignment for memory hotplug isn't a concern.
> >> +   */
> >> +  if (hypervisor_is_type(X86_HYPER_NATIVE)) {
> >> +  bz = MAX_BLOCK_SIZE;
> >> +  goto done;
> >> +  }
> > 
> > I'd assume that bioses on physical machines >= 64GB will not align
> > bigger (>= 2GB) DIMMs to something < 2GB.
> > 
> > Acked-by: David Hildenbrand 

Thanks!

> FTWT, setup_arch() does the init_hypervisor_platform() call. I assume
> that should be early enough.

I checked all the places where x86 uses memory_block_size_bytes(), it looks ok.


Re: [PATCH v2] x86/mm: use max memory block size on bare metal

2020-06-09 Thread Daniel Jordan
On Tue, Jun 09, 2020 at 06:54:51PM -0400, Daniel Jordan wrote:
> Some of our servers spend significant time at kernel boot initializing
> memory block sysfs directories and then creating symlinks between them
> and the corresponding nodes.  The slowness happens because the machines
> get stuck with the smallest supported memory block size on x86 (128M),
> which results in 16,288 directories to cover the 2T of installed RAM.
> The search for each memory block is noticeable even with
> commit 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in
> xarray to accelerate lookup").
> 
> Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based
> on the end of boot memory") chooses the block size based on alignment
> with memory end.  That addresses hotplug failures in qemu guests, but
> for bare metal systems whose memory end isn't aligned to even the
> smallest size, it leaves them at 128M.
> 
> Make kernels that aren't running on a hypervisor use the largest
> supported size (2G) to minimize overhead on big machines.  Kernel boot
> goes 7% faster on the aforementioned servers, shaving off half a second.
> 
> Signed-off-by: Daniel Jordan 
> Cc: Andrew Morton 
> Cc: Andy Lutomirski 
> Cc: Dave Hansen 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Pavel Tatashin 
> Cc: Peter Zijlstra 
> Cc: Steven Sistare 
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---

Forgot the v1 changes:

 - Thanks to David for the idea to make this conditional based on
   virtualization.
 - Update performance numbers to account for 4fb6eabf1037 (David)


[PATCH v2] x86/mm: use max memory block size on bare metal

2020-06-09 Thread Daniel Jordan
Some of our servers spend significant time at kernel boot initializing
memory block sysfs directories and then creating symlinks between them
and the corresponding nodes.  The slowness happens because the machines
get stuck with the smallest supported memory block size on x86 (128M),
which results in 16,288 directories to cover the 2T of installed RAM.
The search for each memory block is noticeable even with
commit 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in
xarray to accelerate lookup").

Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based
on the end of boot memory") chooses the block size based on alignment
with memory end.  That addresses hotplug failures in qemu guests, but
for bare metal systems whose memory end isn't aligned to even the
smallest size, it leaves them at 128M.

Make kernels that aren't running on a hypervisor use the largest
supported size (2G) to minimize overhead on big machines.  Kernel boot
goes 7% faster on the aforementioned servers, shaving off half a second.

Signed-off-by: Daniel Jordan 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Peter Zijlstra 
Cc: Steven Sistare 
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---

Applies to 5.7 and today's mainline

 arch/x86/mm/init_64.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8b5f73f5e207c..906fbdb060748 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "mm_internal.h"
 
@@ -1390,6 +1391,15 @@ static unsigned long probe_memory_block_size(void)
goto done;
}
 
+   /*
+* Use max block size to minimize overhead on bare metal, where
+* alignment for memory hotplug isn't a concern.
+*/
+   if (hypervisor_is_type(X86_HYPER_NATIVE)) {
+   bz = MAX_BLOCK_SIZE;
+   goto done;
+   }
+
/* Find the largest allowed block size that aligns to memory end */
for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
if (IS_ALIGNED(boot_mem_end, bz))
-- 
2.26.2



Re: [PATCH V2] mm/vmstat: Add events for THP migration without split

2020-06-09 Thread Daniel Jordan
On Tue, Jun 09, 2020 at 05:05:45PM +0530, Anshuman Khandual wrote:
> On 06/05/2020 07:54 PM, Zi Yan wrote:
> > On 4 Jun 2020, at 23:35, Anshuman Khandual wrote:
> >> Thanks Zi, for such a detailed explanation. Ideally, we should separate THP
> >> migration from base page migration in terms of statistics. 
> >> PGMIGRATE_SUCCESS
> >> and PGMIGRATE_FAIL should continue to track statistics when migration 
> >> starts
> >> with base pages. But for THP, we should track the following events.
> > 
> > You mean PGMIGRATE_SUCCESS and PGMIGRATE_FAIL will not track the number of 
> > migrated subpages
> > from split THPs? Will it cause userspace issues since their semantics are 
> > changed?
> 
> Yeah, basic idea is to carve out all THP migration related statistics from
> the normal page migration. Not sure if that might cause any issue for user
> space as you have mentioned. Does /proc/vmstat indicate some sort of an ABI
> whose semantics can not really change ? Some more opinions here from others
> would be helpful.

vmstats have had their implementations changed for THP before, so there's at
least some precedent.

https://lore.kernel.org/linux-mm/1559025859-72759-2-git-send-email-yang@linux.alibaba.com/

Others know better than I do.

> The current situation is definitely bit problematic where PGMIGRATE_SUCCESS
> increments (+1) both for normal and successful THP migration. Same situation
> for PGMIGRATE_FAILURE as well.

Yeah, that's suboptimal.  Maybe a separate patch to get thps accounted properly
in those stats?

> Hence, there are two clear choices available.
> 
> 1. THP and normal page migration stats are separate and mutually exclusive
> 
> OR
> 
> 2. THP migration has specific counters but normal page migration counters
>still account for everything in THP migration in terms of normal pages

FWIW, 2 makes more sense to me because it suits the existing names (it's
PGMIGRATE_SUCCESS not PGMIGRATE_SMALL_PAGES_SUCCESS) and it's less surprising
than leaving thp out of them.

> But IIUC, either way the current PGMIGRATE_SUCCESS or PGMIGRATE_FAIL stats
> will change for the user as visible from /proc/vmstat.
> 
> > 
> >>
> >> 1. THP_MIGRATION_SUCCESS   - THP migration is successful, without split
> >> 2. THP_MIGRATION_FAILURE   - THP could neither be migrated, nor be split
> > 
> > They make sense to me.> 
> >> 3. THP_MIGRATION_SPLIT_SUCCESS  - THP got split and all sub pages migrated
> >> 4. THP_MIGRATION_SPLIT_FAILURE  - THP got split but all sub pages could 
> >> not be migrated
> >>
> >> THP_MIGRATION_SPLIT_FAILURE could either increment once for a single THP or
> >> number of subpages that did not get migrated after split. As you mentioned,
> >> this will need some extra code in the core migration. Nonetheless, if these
> >> new events look good, will be happy to make required changes.
> > 
> > Maybe THP_MIGRATION_SPLIT would be simpler? My concern is that whether we 
> > need such
> 
> Also, it will not require a new migration queue tracking the split THP sub 
> pages.
> 
> > detailed information or not. Maybe trace points would be good enough for 3 
> > and 4.

I share the concern about whether that many new stats are necessary.  The
changelog says this is for performance debugging, so it seems THP success and
THP split would cover that.

The split stat becomes redundant if the other information is needed in the
future, but it can presumably be removed in that case.
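
(For concreteness, a rough sketch of how events like these get wired up,
using the names from this thread rather than anything merged: new entries go
into enum vm_event_item with matching strings in mm/vmstat.c's vmstat_text[],
and the migration path bumps them with count_vm_event().  The helper name
below is made up for illustration.)

    /* hypothetical accounting helper, event names taken from the thread above */
    static void thp_migration_account(bool migrated, bool split)
    {
    	if (split)
    		count_vm_event(THP_MIGRATION_SPLIT);	/* fell back to base pages */
    	else if (migrated)
    		count_vm_event(THP_MIGRATION_SUCCESS);	/* moved as one unit */
    	else
    		count_vm_event(THP_MIGRATION_FAILURE);	/* neither migrated nor split */
    }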


[PATCH] padata: upgrade smp_mb__after_atomic to smp_mb in padata_do_serial

2020-06-08 Thread Daniel Jordan
A 5.7 kernel hangs during a tcrypt test of padata that waits for an AEAD
request to finish.  This is only seen on large machines running many
concurrent requests.

The issue is that padata never serializes the request.  The removal of
the reorder_objects atomic missed that the memory barrier in
padata_do_serial() depends on it.

Upgrade the barrier from smp_mb__after_atomic to smp_mb to get correct
ordering again.
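
To sketch the pairing that needs the full barrier (simplified; the comments
in the diff below give the real description):

    /*
     *   padata_do_serial()                    padata_reorder()
     *   ------------------                    ----------------
     *   add object to reorder list            drop pd->lock
     *   smp_mb()                              smp_mb()
     *   trylock pd->lock (via padata_reorder) re-check reorder list
     *
     * smp_mb__after_atomic() is only guaranteed to be a full barrier when
     * it follows an atomic RMW op.  With the reorder_objects atomic gone
     * there is no such op before it (on x86 it's just a compiler barrier),
     * so the two sides can miss each other's update and the object is
     * never serialized.
     */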

Fixes: 3facced7aeed1 ("padata: remove reorder_objects")
Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-cry...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 kernel/padata.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index a6afa12fb75ee..7b701bc3e7922 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -260,7 +260,7 @@ static void padata_reorder(struct parallel_data *pd)
 *
 * Ensure reorder queue is read after pd->lock is dropped so we see
 * new objects from another task in padata_do_serial.  Pairs with
-* smp_mb__after_atomic in padata_do_serial.
+* smp_mb in padata_do_serial.
 */
smp_mb();
 
@@ -342,7 +342,7 @@ void padata_do_serial(struct padata_priv *padata)
 * with the trylock of pd->lock in padata_reorder.  Pairs with smp_mb
 * in padata_reorder.
 */
-   smp_mb__after_atomic();
+   smp_mb();
 
padata_reorder(pd);
 }

base-commit: 3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162
-- 
2.26.2



Re: [PATCH] x86/mm: use max memory block size with unaligned memory end

2020-06-04 Thread Daniel Jordan
On Thu, Jun 04, 2020 at 01:00:55PM -0700, Dave Hansen wrote:
> On 6/4/20 11:12 AM, Daniel Jordan wrote:
> >> E.g., on powerpc that's 16MB so they have *a lot* of memory blocks.
> >> That's why that's not papering over the problem. Increasing the memory
> >> block size isn't always the answer.
> > Ok.  If you don't mind, what's the purpose of hotplugging at that 
> > granularity?
> > I'm simply curious.
> 
> FWIW, the 128MB on x86 came from the original sparsemem/hotplug
> implementation.  It was the size of the smallest DIMM that my server
> system at the time would take.  ppc64's huge page size was and is 16MB
> and that's also the granularity with which hypervisors did hot-add way
> back then.  I'm not actually sure what they do now.

Interesting, that tells me a lot more than the "matt - 128 is convenient right
now" comment that has always weirdly stuck out at me.

> I actually can't think of anything that's *keeping* it at 128MB on x86
> though.  We don't, for instance, require a whole section to be
> pfn_valid().

Hm, something to look into.


Re: [PATCH] x86/mm: use max memory block size with unaligned memory end

2020-06-04 Thread Daniel Jordan
On Thu, Jun 04, 2020 at 08:55:19PM +0200, David Hildenbrand wrote:
> >> E.g., on powerpc that's 16MB so they have *a lot* of memory blocks.
> >> That's why that's not papering over the problem. Increasing the memory
> >> block size isn't always the answer.
> > 
> > Ok.  If you don't mind, what's the purpose of hotplugging at that 
> > granularity?
> > I'm simply curious.
> 
> On bare metal: none with that big machines AFAIKS. :)

Sounds about right :)

> For VMs/partitions it gives you much more flexibility ("cloud", kata
> containers, memory overcommit, ...).
> 
> Assume you have a VM with some initial memory size (e.g., 32GB). By
> hotplugging up to 256 DIMMs you cab grow in small steps (e.g., 128MB, up
> to 64GB, 256MB, up to 96GB, ...). And if you online all the memory
> blocks MOVABLE, you can shrink in these small steps.

Yeah, sorry for not being clear, I meant why does powerpc hotplug at "only" 16M.

> Regarding PowerPC, AFAIK it also gives the OS more flexibility to find
> memory blocks that can be offlined and unplugged, especially without the
> MOVABLE zone. Finding some scattered 16MB blocks that can be offlined is
> easier than finding one bigger (e.g., 2GB) memory block that can be
> offlined. And the history of powerpc dlpar dates back to pre-MOVABLE
> days (there is a paper from 2003).

Makes sense, thanks!

> I do think your change mostly affects bare metal where you do not care
> about hotplugging small memory blocks. Maybe an even better check would be
> 
> if (!in_vm() {
>   bz = MAX_BLOCK_SIZE;
>   goto none;
> }
> 
> because I doubt we have bare metal machines > 64 where we want to
> hot(un)plug DIMMs < 2G.

Yeah, agreed, not these days.

> But maybe there is a use case I am not aware of
> ... and I don't know an easy way to check whether we are running inside
> a VM or not (like kvm_para_available() ... ).

What about this?  Works on bare metal and kvm, so presumably all the other HVs
too.

 if (x86_hyper_type == X86_HYPER_NATIVE) {
bz = MAX_BLOCK_SIZE;
goto done;
 }


Re: [PATCH] x86/mm: use max memory block size with unaligned memory end

2020-06-04 Thread Daniel Jordan
On Thu, Jun 04, 2020 at 07:45:40PM +0200, David Hildenbrand wrote:
> On 04.06.20 19:22, Daniel Jordan wrote:
> > IMHO the root cause of this is really the small block size.  Building a 
> > cache
> > on top to avoid iterating over tons of small blocks seems like papering over
> > the problem, especially when one of the two affected paths in boot is a
> 
> The memory block size dictates your memory hot(un)plug granularity.

Indeed.

> E.g., on powerpc that's 16MB so they have *a lot* of memory blocks.
> That's why that's not papering over the problem. Increasing the memory
> block size isn't always the answer.

Ok.  If you don't mind, what's the purpose of hotplugging at that granularity?
I'm simply curious.

> > cautious check that might be ready to be removed by now[0]:
> 
> Yeah, we discussed that somewhere already. My change only highlighted
> the problem. And now that it's cheap, it can just stay unless there is a
> very good reason not to do it.

Agreed.

> > Yeah, but of course it's not as bad as it was now that it's fully 
> > parallelized.
> 
> Right. I also observed that computing if a zone is contiguous can be
> expensive.

That's right, I remember that.  It's on my list :)


Re: [PATCH] x86/mm: use max memory block size with unaligned memory end

2020-06-04 Thread Daniel Jordan
On Thu, Jun 04, 2020 at 09:22:03AM +0200, David Hildenbrand wrote:
> On 04.06.20 05:54, Daniel Jordan wrote:
> > Some of our servers spend 14 out of the 21 seconds of kernel boot
> > initializing memory block sysfs directories and then creating symlinks
> > between them and the corresponding nodes.  The slowness happens because
> > the machines get stuck with the smallest supported memory block size on
> > x86 (128M), which results in 16,288 directories to cover the 2T of
> > installed RAM, and each of these paths does a linear search of the
> > memory blocks for every block id, with atomic ops at each step.
> 
> With 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in xarray
> to accelerate lookup") merged by Linus' today (strange, I thought this
> would be long upstream)

Ah, thanks for pointing this out!  It was only posted to LKML so I missed it.

> all linear searches should be gone and at least
> the performance observation in this patch no longer applies.

The performance numbers as stated, that's certainly true, but this patch on top
still improves kernel boot by 7%.  It's a savings of half a second -- I'll take
it.

IMHO the root cause of this is really the small block size.  Building a cache
on top to avoid iterating over tons of small blocks seems like papering over
the problem, especially when one of the two affected paths in boot is a
cautious check that might be ready to be removed by now[0]:

static int init_memory_block(struct memory_block **memory,
 unsigned long block_id, unsigned long state)
{
...
mem = find_memory_block_by_id(block_id);
if (mem) {
	put_device(&mem->dev);
return -EEXIST;
}

Anyway, I guess I'll redo the changelog and post again.

> The memmap init should nowadays consume most time.

Yeah, but of course it's not as bad as it was now that it's fully parallelized.

[0] 
https://lore.kernel.org/linux-mm/a8e96df6-dc6d-037f-491c-92182d4ad...@redhat.com/


[PATCH] x86/mm: use max memory block size with unaligned memory end

2020-06-03 Thread Daniel Jordan
Some of our servers spend 14 out of the 21 seconds of kernel boot
initializing memory block sysfs directories and then creating symlinks
between them and the corresponding nodes.  The slowness happens because
the machines get stuck with the smallest supported memory block size on
x86 (128M), which results in 16,288 directories to cover the 2T of
installed RAM, and each of these paths does a linear search of the
memory blocks for every block id, with atomic ops at each step.

Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based
on the end of boot memory") chooses the block size based on alignment
with memory end.  That addresses hotplug failures in qemu guests, but
for bare metal systems whose memory end isn't aligned to the smallest
size, it leaves them at 128M.

For such systems, use the largest supported size (2G) to minimize
overhead on big machines.  That saves nearly all of the 14 seconds so
the kernel boots 3x faster.

There are some simple ways to avoid the linear searches, but for now it
makes no difference with a 2G block.

Signed-off-by: Daniel Jordan 
---
 arch/x86/mm/init_64.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8b5f73f5e207c..d388127d1b519 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1390,6 +1390,15 @@ static unsigned long probe_memory_block_size(void)
goto done;
}
 
+   /*
+* Memory end isn't aligned to any allowed block size, so default to
+* the largest to minimize overhead on large memory systems.
+*/
+   if (!IS_ALIGNED(boot_mem_end, MIN_MEMORY_BLOCK_SIZE)) {
+   bz = MAX_BLOCK_SIZE;
+   goto done;
+   }
+
/* Find the largest allowed block size that aligns to memory end */
for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
if (IS_ALIGNED(boot_mem_end, bz))

base-commit: 3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162
-- 
2.26.2



Re: [PATCH] mm/vmstat: Add events for PMD based THP migration without split

2020-06-03 Thread Daniel Jordan
On Wed, Jun 03, 2020 at 10:06:31AM +0530, Anshuman Khandual wrote:
> Does this look okay and sufficient ?
> 
> --- a/Documentation/vm/page_migration.rst
> +++ b/Documentation/vm/page_migration.rst
> @@ -253,5 +253,20 @@ which are function pointers of struct 
> address_space_operations.
>   PG_isolated is alias with PG_reclaim flag so driver shouldn't use the 
> flag
>   for own purpose.
>  
> +Quantifying Migration
> +=
> +Following events can be used to quantify page migration.
> +
> +- PGMIGRATE_SUCCESS
> +- PGMIGRATE_FAIL
> +- THP_MIGRATION_SUCCESS
> +- THP_MIGRATION_FAILURE
> +
> +THP_MIGRATION_FAILURE in particular represents an event when a THP could not 
> be
> +migrated as a single entity following an allocation failure and ended up 
> getting
> +split into constituent normal pages before being retried. This event, along 
> with
> +PGMIGRATE_SUCCESS and PGMIGRATE_FAIL will help in quantifying and analyzing 
> THP
> +migration events including both success and failure cases.

Looks great!

> Sure, will update the commit message accordingly.

Thanks.  Hopefully these will help someone in the future.


Re: [PATCH] mm/vmstat: Add events for PMD based THP migration without split

2020-06-02 Thread Daniel Jordan
On Mon, Jun 01, 2020 at 09:48:09PM -0700, John Hubbard wrote:
> However, the fact that this is under discussion hints at the need for a
> bit of documentation help. What do you think about adding some notes about
> all of this to, say, Documentation/vm/page_migration.rst ?

Yes, that would be good.  I understand the intent better now but still think
the 'failure' event could be misinterpreted as outright failure instead of
splitting followed by successfully moving the constituent pages.

It might help to clarify in the changelog as well.


Re: [PATCH] mm/vmstat: Add events for PMD based THP migration without split

2020-06-01 Thread Daniel Jordan
Hi Anshuman,

On Fri, May 22, 2020 at 09:04:04AM +0530, Anshuman Khandual wrote:
> This adds the following two new VM events which will help in validating PMD
> based THP migration without split. Statistics reported through these events
> will help in performance debugging.
> 
> 1. THP_PMD_MIGRATION_SUCCESS
> 2. THP_PMD_MIGRATION_FAILURE

The names suggest a binary event similar to the existing
pgmigrate_success/fail, but FAILURE only tracks one kind of migration error,
and then only when the thp is successfully split, so shouldn't it be called
SPLIT instead?


Re: [PATCH v6 12/12] mmap locking API: convert mmap_sem comments

2020-05-29 Thread Daniel Jordan
On Tue, May 19, 2020 at 10:29:08PM -0700, Michel Lespinasse wrote:
> Convert comments that reference mmap_sem to reference mmap_lock instead.
> 
> Signed-off-by: Michel Lespinasse 

Not opposed to leaving lockaphores in :)

Reviewed-by: Daniel Jordan 


Re: [PATCH v6 11/12] mmap locking API: convert mmap_sem API comments

2020-05-29 Thread Daniel Jordan
On Tue, May 19, 2020 at 10:29:07PM -0700, Michel Lespinasse wrote:
> Convert comments that reference old mmap_sem APIs to reference
> corresponding new mmap locking APIs instead.
> 
> Signed-off-by: Michel Lespinasse 

Reviewed-by: Daniel Jordan 


Re: [PATCH v6 10/12] mmap locking API: rename mmap_sem to mmap_lock

2020-05-29 Thread Daniel Jordan
On Tue, May 19, 2020 at 10:29:06PM -0700, Michel Lespinasse wrote:
> Rename the mmap_sem field to mmap_lock. Any new uses of this lock
> should now go through the new mmap locking api. The mmap_lock is
> still implemented as a rwsem, though this could change in the future.
> 
> Signed-off-by: Michel Lespinasse 
> Reviewed-by: Vlastimil Babka 

Reviewed-by: Daniel Jordan 


Re: [PATCH v6 09/12] mmap locking API: add mmap_assert_locked() and mmap_assert_write_locked()

2020-05-29 Thread Daniel Jordan
On Tue, May 19, 2020 at 10:29:05PM -0700, Michel Lespinasse wrote:
> Add new APIs to assert that mmap_sem is held.
> 
> Using this instead of rwsem_is_locked and lockdep_assert_held[_write]
> makes the assertions more tolerant of future changes to the lock type.
> 
> Signed-off-by: Michel Lespinasse 

Reviewed-by: Daniel Jordan 


Re: [PATCH v6 08/12] mmap locking API: add MMAP_LOCK_INITIALIZER

2020-05-29 Thread Daniel Jordan
On Tue, May 19, 2020 at 10:29:04PM -0700, Michel Lespinasse wrote:
> Define a new initializer for the mmap locking api.
> Initially this just evaluates to __RWSEM_INITIALIZER as the API
> is defined as wrappers around rwsem.
> 
> Signed-off-by: Michel Lespinasse 
> Reviewed-by: Laurent Dufour 
> Reviewed-by: Vlastimil Babka 

Reviewed-by: Daniel Jordan 


Re: [PATCH -V3] swap: Reduce lock contention on swap cache from swap slots allocation

2020-05-28 Thread Daniel Jordan
On Thu, May 28, 2020 at 01:32:40PM +0800, Huang, Ying wrote:
> Daniel Jordan  writes:
> 
> > On Mon, May 25, 2020 at 08:26:48AM +0800, Huang Ying wrote:
> >> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> index 423c234aca15..0abd93d2a4fc 100644
> >> --- a/mm/swapfile.c
> >> +++ b/mm/swapfile.c
> >> @@ -615,7 +615,8 @@ static bool scan_swap_map_try_ssd_cluster(struct 
> >> swap_info_struct *si,
> >> * discarding, do discard now and reclaim them
> >> */
> >>swap_do_scheduled_discard(si);
> >> -  *scan_base = *offset = si->cluster_next;
> >> +  *scan_base = this_cpu_read(*si->cluster_next_cpu);
> >> +  *offset = *scan_base;
> >>goto new_cluster;
> >
> > Why is this done?  As far as I can tell, the values always get overwritten 
> > at
> > the end of the function with tmp and tmp isn't derived from them.  Seems
> > ebc2a1a69111 moved some logic that used to make sense but doesn't have any
> > effect now.
> 
> If we fail to allocate from cluster, "scan_base" and "offset" will not
> be overridden.

Ok, if another task races to allocate the clusters the first just discarded.

> And "cluster_next" or "cluster_next_cpu" may be changed
> in swap_do_scheduled_discard(), because the lock is released and
> re-acquired there.

I see, by another task on the same cpu for cluster_next_cpu.

Both probably unlikely, but at least it tries to pick up where the racing task
left off.  You might tack this onto the comment:

		 * discarding, do discard now and reclaim them, then reread
		 * cluster_next_cpu since we dropped si->lock
		 */

> The code may not have much value.

No, it makes sense.

> > These aside, patch looks good to me.
> 
> Thanks for your review!  It really help me to improve the quality of the
> patch.  Can I add your "Reviewed-by" in the next version?

Sure,
Reviewed-by: Daniel Jordan 


Re: [PATCH -V3] swap: Reduce lock contention on swap cache from swap slots allocation

2020-05-27 Thread Daniel Jordan
On Mon, May 25, 2020 at 08:26:48AM +0800, Huang Ying wrote:
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 423c234aca15..0abd93d2a4fc 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -615,7 +615,8 @@ static bool scan_swap_map_try_ssd_cluster(struct 
> swap_info_struct *si,
>* discarding, do discard now and reclaim them
>*/
>   swap_do_scheduled_discard(si);
> - *scan_base = *offset = si->cluster_next;
> + *scan_base = this_cpu_read(*si->cluster_next_cpu);
> + *offset = *scan_base;
>   goto new_cluster;

Why is this done?  As far as I can tell, the values always get overwritten at
the end of the function with tmp and tmp isn't derived from them.  Seems
ebc2a1a69111 moved some logic that used to make sense but doesn't have any
effect now.

>   } else
>   return false;
> @@ -721,6 +722,34 @@ static void swap_range_free(struct swap_info_struct *si, 
> unsigned long offset,
>   }
>  }
>  
> +static void set_cluster_next(struct swap_info_struct *si, unsigned long next)
> +{
> + unsigned long prev;
> +
> + if (!(si->flags & SWP_SOLIDSTATE)) {
> + si->cluster_next = next;
> + return;
> + }
> +
> + prev = this_cpu_read(*si->cluster_next_cpu);
> + /*
> +  * Cross the swap address space size aligned trunk, choose
> +  * another trunk randomly to avoid lock contention on swap
> +  * address space if possible.
> +  */
> + if ((prev >> SWAP_ADDRESS_SPACE_SHIFT) !=
> + (next >> SWAP_ADDRESS_SPACE_SHIFT)) {
> + /* No free swap slots available */
> + if (si->highest_bit <= si->lowest_bit)
> + return;
> + next = si->lowest_bit +
> + prandom_u32_max(si->highest_bit - si->lowest_bit + 1);
> + next = ALIGN(next, SWAP_ADDRESS_SPACE_PAGES);
> + next = max_t(unsigned int, next, si->lowest_bit);

next is always greater than lowest_bit because it's aligned up.  I think the
intent of the max_t line is to handle when next is aligned outside the valid
range, so it'd have to be ALIGN_DOWN instead?
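
A standalone sketch of the difference (simplified power-of-two macros and
made-up values, not the swapfile code itself):

    #include <stdio.h>

    #define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))
    #define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

    int main(void)
    {
    	unsigned long lowest_bit = 100, highest_bit = 300;
    	unsigned long trunk = 256;		/* stand-in for SWAP_ADDRESS_SPACE_PAGES */
    	unsigned long next = lowest_bit + 50;	/* somewhere in [lowest, highest] */

    	/* rounding up can never go below lowest_bit, so the max_t() is dead code */
    	printf("ALIGN:      %lu\n", ALIGN(next, trunk));	/* 256 */
    	/* rounding down can, which is what max_t(..., lowest_bit) would catch */
    	printf("ALIGN_DOWN: %lu\n", ALIGN_DOWN(next, trunk));	/* 0 */
    	return 0;
    }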


These aside, patch looks good to me.


[PATCH v3 7/8] mm: make deferred init's max threads arch-specific

2020-05-27 Thread Daniel Jordan
Using padata during deferred init has only been tested on x86, so for
now limit it to this architecture.

If another arch wants this, it can find the max thread limit that's best
for it and override deferred_page_init_max_threads().
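
For illustration only (not part of this series), an override on some other
arch might look like the sketch below; the cap of four threads is made up:

	#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
	int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
	{
		/* Hypothetical policy: bandwidth saturates after a few threads. */
		return min_t(int, cpumask_weight(node_cpumask), 4);
	}
	#endif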

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 arch/x86/mm/init_64.c| 12 
 include/linux/memblock.h |  3 +++
 mm/page_alloc.c  | 13 -
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8b5f73f5e207c..2d749ec12ea8a 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1260,6 +1260,18 @@ void __init mem_init(void)
mem_init_print_info(NULL);
 }
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+   /*
+* More CPUs always led to greater speedups on tested systems, up to
+* all the nodes' CPUs.  Use all since the system is otherwise idle
+* now.
+*/
+   return max_t(int, cpumask_weight(node_cpumask), 1);
+}
+#endif
+
 int kernel_set_to_readonly;
 
 void mark_rodata_ro(void)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 6bc37a731d27b..2b289df44194f 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -275,6 +275,9 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone 
*zone,
 #define for_each_free_mem_pfn_range_in_zone_from(i, zone, p_start, p_end) \
for (; i != U64_MAX;  \
	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+
+int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask);
+
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1d47016849531..329fd1a809c59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1835,6 +1835,13 @@ deferred_init_memmap_chunk(unsigned long start_pfn, 
unsigned long end_pfn,
}
 }
 
+/* An arch may override for more concurrency. */
+__weak int __init
+deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+   return 1;
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
@@ -1883,11 +1890,7 @@ static int __init deferred_init_memmap(void *data)
 first_init_pfn))
goto zone_empty;
 
-   /*
-* More CPUs always led to greater speedups on tested systems, up to
-* all the nodes' CPUs.  Use all since the system is otherwise idle now.
-*/
-   max_threads = max(cpumask_weight(cpumask), 1u);
+   max_threads = deferred_page_init_max_threads(cpumask);
 
while (spfn < epfn) {
unsigned long epfn_align = ALIGN(epfn, PAGES_PER_SECTION);
-- 
2.26.2



[PATCH v3 1/8] padata: remove exit routine

2020-05-27 Thread Daniel Jordan
padata_driver_exit() is unnecessary because padata isn't built as a
module and doesn't exit.

padata's init routine will soon allocate memory, so getting rid of the
exit function now avoids pointless code to free it.

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 kernel/padata.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index a6afa12fb75ee..835919c745266 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -1072,10 +1072,4 @@ static __init int padata_driver_init(void)
 }
 module_init(padata_driver_init);
 
-static __exit void padata_driver_exit(void)
-{
-   cpuhp_remove_multi_state(CPUHP_PADATA_DEAD);
-   cpuhp_remove_multi_state(hp_online);
-}
-module_exit(padata_driver_exit);
 #endif
-- 
2.26.2



[PATCH v3 0/8] padata: parallelize deferred page init

2020-05-27 Thread Daniel Jordan
Thanks to Alex for his continued review and Josh for running v2!  Please
continue to review and test, and acks for the padata parts would be
appreciated.

Daniel

--

Deferred struct page init is a bottleneck in kernel boot--the biggest
for us and probably others.  Optimizing it maximizes availability for
large-memory systems and allows spinning up short-lived VMs as needed
without having to leave them running.  It also benefits bare metal
machines hosting VMs that are sensitive to downtime.  In projects such
as VMM Fast Restart[1], where guest state is preserved across kexec
reboot, it helps prevent application and network timeouts in the guests.

So, multithread deferred init to take full advantage of system memory
bandwidth.

Extend padata, a framework that handles many parallel singlethreaded
jobs, to handle multithreaded jobs as well by adding support for
splitting up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.  More documentation in patches 4 and 8.

This series is the first step in a project to address other memory
proportional bottlenecks in the kernel such as pmem struct page init,
vfio page pinning, hugetlb fallocate, and munmap.  Deferred page init
doesn't require concurrency limits, resource control, or priority
adjustments like these other users will because it happens during boot
when the system is otherwise idle and waiting for page init to finish.

This has been run on a variety of x86 systems and speeds up kernel boot
by 4% to 49%, saving up to 1.6 out of 4 seconds.  Patch 6 has more
numbers.

The powerpc and s390 lists are included in case they want to give this a
try, they had enabled this feature when it was configured per arch.

Series based on v5.7-rc7 plus these three from mmotm

  mm-call-touch_nmi_watchdog-on-max-order-boundaries-in-deferred-init.patch
  mm-initialize-deferred-pages-with-interrupts-enabled.patch
  mm-call-cond_resched-from-deferred_init_memmap.patch

and it's available here:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-definit-v3
  
https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-definit-v3

and the future users and related features are available as
work-in-progress:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-wip-v0.5
  
https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.5

v3:
 - Remove nr_pages accounting as suggested by Alex, adding a new patch
 - Align deferred init ranges up not down, simplify surrounding code (Alex)
 - Add Josh's T-b's from v2 (Josh's T-b's for v1 lost in rebase, apologies!)
 - Move padata.h include up in init/main.c to reduce patch collisions (Andrew)
 - Slightly reword Documentation patch
 - Rebase on v5.7-rc7 and retest

v2:
 - Improve the problem statement (Andrew, Josh, Pavel)
 - Add T-b's to unchanged patches (Josh)
 - Fully initialize max-order blocks to avoid buddy issues (Alex)
 - Parallelize on section-aligned boundaries to avoid potential
   false sharing (Alex)
 - Return the maximum thread count from a function that architectures
   can override, with the generic version returning 1 (current
   behavior).  Override for x86 since that's the only arch this series
   has been tested on so far.  Other archs can test with more threads
   by dropping patch 6.
 - Rebase to v5.7-rc6, rerun tests

RFC v4 [2] -> v1:
 - merged with padata (Peter)
 - got rid of the 'task' nomenclature (Peter, Jon)

future work branch:
 - made lockdep-aware (Jason, Peter)
 - adjust workqueue worker priority with renice_or_cancel() (Tejun)
 - fixed undo problem in VFIO (Alex)

The remaining feedback, mainly resource control awareness (cgroup etc),
is TODO for later series.

[1] 
https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
https://www.youtube.com/watch?v=pBsHnf93tcQ

https://lore.kernel.org/linux-mm/1588812129-8596-1-git-send-email-anthony.yzn...@oracle.com/

[2] 
https://lore.kernel.org/linux-mm/20181105165558.11698-1-daniel.m.jor...@oracle.com/

Daniel Jordan (8):
  padata: remove exit routine
  padata: initialize earlier
  padata: allocate work structures for parallel jobs from a pool
  padata: add basic support for multithreaded jobs
  mm: don't track number of pages during deferred initialization
  mm: parallelize deferred_init_memmap()
  mm: make deferred init's max threads arch-specific
  padata: document multithreaded jobs

 Documentation/core-api/padata.rst |  41 +++--
 arch/x86/mm/init_64.c |  12 ++
 include/linux/memblock.h  |   3 +
 include/linux/padata.h|  43 -
 init/main.c   |   2 +
 kernel/padata.c   | 277 --
 mm/Kconfig|   6 +-
 mm/page_alloc.c   |  59 +--
 8 files changed, 361 insertions(+), 82 deletions(-)


base-com

[PATCH v3 8/8] padata: document multithreaded jobs

2020-05-27 Thread Daniel Jordan
Add Documentation for multithreaded jobs.

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 Documentation/core-api/padata.rst | 41 +++
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/Documentation/core-api/padata.rst 
b/Documentation/core-api/padata.rst
index 9a24c111781d9..0830e5b0e8211 100644
--- a/Documentation/core-api/padata.rst
+++ b/Documentation/core-api/padata.rst
@@ -4,23 +4,26 @@
 The padata parallel execution mechanism
 ===
 
-:Date: December 2019
+:Date: May 2020
 
 Padata is a mechanism by which the kernel can farm jobs out to be done in
-parallel on multiple CPUs while retaining their ordering.  It was developed for
-use with the IPsec code, which needs to be able to perform encryption and
-decryption on large numbers of packets without reordering those packets.  The
-crypto developers made a point of writing padata in a sufficiently general
-fashion that it could be put to other uses as well.
+parallel on multiple CPUs while optionally retaining their ordering.
 
-Usage
-=
+It was originally developed for IPsec, which needs to perform encryption and
+decryption on large numbers of packets without reordering those packets.  This
+is currently the sole consumer of padata's serialized job support.
+
+Padata also supports multithreaded jobs, splitting up the job evenly while load
+balancing and coordinating between threads.
+
+Running Serialized Jobs
+===
 
 Initializing
 
 
-The first step in using padata is to set up a padata_instance structure for
-overall control of how jobs are to be run::
+The first step in using padata to run serialized jobs is to set up a
+padata_instance structure for overall control of how jobs are to be run::
 
 #include 
 
@@ -162,6 +165,24 @@ functions that correspond to the allocation in reverse::
 It is the user's responsibility to ensure all outstanding jobs are complete
 before any of the above are called.
 
+Running Multithreaded Jobs
+==
+
+A multithreaded job has a main thread and zero or more helper threads, with the
+main thread participating in the job and then waiting until all helpers have
+finished.  padata splits the job into units called chunks, where a chunk is a
+piece of the job that one thread completes in one call to the thread function.
+
+A user has to do three things to run a multithreaded job.  First, describe the
+job by defining a padata_mt_job structure, which is explained in the Interface
+section.  This includes a pointer to the thread function, which padata will
+call each time it assigns a job chunk to a thread.  Then, define the thread
+function, which accepts three arguments, ``start``, ``end``, and ``arg``, where
+the first two delimit the range that the thread operates on and the last is a
+pointer to the job's shared state, if any.  Prepare the shared state, which is
+typically allocated on the main thread's stack.  Last, call
+padata_do_multithreaded(), which will return once the job is finished.
+
 Interface
 =
 
-- 
2.26.2



[PATCH v3 4/8] padata: add basic support for multithreaded jobs

2020-05-27 Thread Daniel Jordan
Sometimes the kernel doesn't take full advantage of system memory
bandwidth, leading to a single CPU spending excessive time in
initialization paths where the data scales with memory size.

Multithreading naturally addresses this problem.

Extend padata, a framework that handles many parallel yet singlethreaded
jobs, to also handle multithreaded jobs by adding support for splitting
up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.
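
For illustration only (not part of the patch), a caller of the interface
added below might look like this; the thread function, state struct, and
numbers are made up:

	static void my_thread_fn(unsigned long start, unsigned long end, void *arg)
	{
		/* Operate on units [start, end) using the shared state in arg. */
	}

	static void __init run_my_job(struct my_state *state, unsigned long nr_units)
	{
		struct padata_mt_job job = {
			.thread_fn   = my_thread_fn,
			.fn_arg      = state,
			.start       = 0,
			.size        = nr_units,
			.align       = 1,
			.min_chunk   = 1024,
			.max_threads = 8,
		};

		/* __init because padata_do_multithreaded() is __init in this series. */
		padata_do_multithreaded(&job);
	}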

This is inspired by work from Pavel Tatashin and Steve Sistare.

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 include/linux/padata.h |  29 
 kernel/padata.c| 152 -
 2 files changed, 178 insertions(+), 3 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 3bfa503503ac5..b0affa466a841 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -4,6 +4,9 @@
  *
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert 
+ *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan 
  */
 
 #ifndef PADATA_H
@@ -130,6 +133,31 @@ struct padata_shell {
	struct list_head	list;
 };
 
+/**
+ * struct padata_mt_job - represents one multithreaded job
+ *
+ * @thread_fn: Called for each chunk of work that a padata thread does.
+ * @fn_arg: The thread function argument.
+ * @start: The start of the job (units are job-specific).
+ * @size: size of this node's work (units are job-specific).
+ * @align: Ranges passed to the thread function fall on this boundary, with the
+ * possible exceptions of the beginning and end of the job.
+ * @min_chunk: The minimum chunk size in job-specific units.  This allows
+ * the client to communicate the minimum amount of work that's
+ * appropriate for one worker thread to do at once.
+ * @max_threads: Max threads to use for the job, actual number may be less
+ *   depending on task size and minimum chunk size.
+ */
+struct padata_mt_job {
+   void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
+   void	*fn_arg;
+   unsigned long   start;
+   unsigned long   size;
+   unsigned long   align;
+   unsigned long   min_chunk;
+   int max_threads;
+};
+
 /**
  * struct padata_instance - The overall control structure.
  *
@@ -171,6 +199,7 @@ extern void padata_free_shell(struct padata_shell *ps);
 extern int padata_do_parallel(struct padata_shell *ps,
  struct padata_priv *padata, int *cb_cpu);
 extern void padata_do_serial(struct padata_priv *padata);
+extern void __init padata_do_multithreaded(struct padata_mt_job *job);
 extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
  cpumask_var_t cpumask);
 extern int padata_start(struct padata_instance *pinst);
diff --git a/kernel/padata.c b/kernel/padata.c
index 78ff9aa529204..e78f57d9aef90 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -7,6 +7,9 @@
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert 
  *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan 
+ *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
  * version 2, as published by the Free Software Foundation.
@@ -21,6 +24,7 @@
  * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -32,6 +36,8 @@
 #include 
 #include 
 
#define	PADATA_WORK_ONSTACK 1   /* Work's memory is on stack */
+
 struct padata_work {
struct work_struct  pw_work;
	struct list_head	pw_list;  /* padata_free_works linkage */
@@ -42,7 +48,17 @@ static DEFINE_SPINLOCK(padata_works_lock);
 static struct padata_work *padata_works;
 static LIST_HEAD(padata_free_works);
 
+struct padata_mt_job_state {
+   spinlock_t  lock;
+   struct completion   completion;
+   struct padata_mt_job	*job;
+   int nworks;
+   int nworks_fini;
+   unsigned long   chunk_size;
+};
+
 static void padata_free_pd(struct parallel_data *pd);
+static void __init padata_mt_helper(struct work_struct *work);
 
 static int padata_index_to_cpu(struct parallel_data *pd, int cpu_index)
 {
@@ -81,18 +97,56 @@ static struct padata_work *padata_work_alloc(void)
 }
 
 static void padata_work_init(struct padata_work *pw, work_func_t work_fn,
-void *data)
+void *data, int flags)
 {
-   INIT_WORK(&pw->pw_work, work_fn);
+   if (flags & PADATA_WORK_

[PATCH v3 3/8] padata: allocate work structures for parallel jobs from a pool

2020-05-27 Thread Daniel Jordan
padata allocates per-CPU, per-instance work structs for parallel jobs.
A do_parallel call assigns a job to a sequence number and hashes the
number to a CPU, where the job will eventually run using the
corresponding work.

This approach fit with how padata used to bind a job to each CPU
round-robin, but makes less sense after commit bfde23ce200e6 ("padata:
unbind parallel jobs from specific CPUs") because a work isn't bound to
a particular CPU anymore, and isn't needed at all for multithreaded jobs
because they don't have sequence numbers.

Replace the per-CPU works with a preallocated pool, which allows sharing
them between existing padata users and the upcoming multithreaded user.
The pool will also facilitate setting NUMA-aware concurrency limits with
later users.

The pool is sized according to the number of possible CPUs.  With this
limit, MAX_OBJ_NUM no longer makes sense, so remove it.

If the global pool is exhausted, a parallel job is run in the current
task instead to throttle a system trying to do too much in parallel.
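
Roughly, the fallback in padata_do_parallel() then looks like this
(paraphrased sketch, not the literal hunk; the real code allocates under
padata_works_lock):

	pw = padata_work_alloc();
	if (pw) {
		padata_work_init(pw, padata_parallel_worker, padata);
		queue_work(pinst->parallel_wq, &pw->pw_work);
	} else {
		/* Pool exhausted: throttle by running in the calling task. */
		padata->parallel(padata);
	}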

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 include/linux/padata.h |   8 +--
 kernel/padata.c| 118 +++--
 2 files changed, 78 insertions(+), 48 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 476ecfa41f363..3bfa503503ac5 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -24,7 +24,6 @@
  * @list: List entry, to attach to the padata lists.
  * @pd: Pointer to the internal control structure.
  * @cb_cpu: Callback cpu for serializatioon.
- * @cpu: Cpu for parallelization.
  * @seq_nr: Sequence number of the parallelized data object.
  * @info: Used to pass information from the parallel to the serial function.
  * @parallel: Parallel execution function.
@@ -34,7 +33,6 @@ struct padata_priv {
	struct list_head	list;
	struct parallel_data	*pd;
	int cb_cpu;
-   int cpu;
	unsigned int	seq_nr;
	int info;
	void	(*parallel)(struct padata_priv *padata);
@@ -68,15 +66,11 @@ struct padata_serial_queue {
 /**
  * struct padata_parallel_queue - The percpu padata parallel queue
  *
- * @parallel: List to wait for parallelization.
  * @reorder: List to wait for reordering after parallel processing.
- * @work: work struct for parallelization.
  * @num_obj: Number of objects that are processed by this cpu.
  */
 struct padata_parallel_queue {
-   struct padata_list	parallel;
	struct padata_list	reorder;
-   struct work_struct	work;
atomic_t  num_obj;
 };
 
@@ -111,7 +105,7 @@ struct parallel_data {
	struct padata_parallel_queue	__percpu *pqueue;
	struct padata_serial_queue	__percpu *squeue;
	atomic_t	refcnt;
-   atomic_t	seq_nr;
+   unsigned int	seq_nr;
	unsigned int	processed;
int cpu;
struct padata_cpumask   cpumask;
diff --git a/kernel/padata.c b/kernel/padata.c
index 6f709bc0fc413..78ff9aa529204 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -32,7 +32,15 @@
 #include 
 #include 
 
-#define MAX_OBJ_NUM 1000
+struct padata_work {
+   struct work_struct  pw_work;
+   struct list_head	pw_list;  /* padata_free_works linkage */
+   void	*pw_data;
+};
+
+static DEFINE_SPINLOCK(padata_works_lock);
+static struct padata_work *padata_works;
+static LIST_HEAD(padata_free_works);
 
 static void padata_free_pd(struct parallel_data *pd);
 
@@ -58,30 +66,44 @@ static int padata_cpu_hash(struct parallel_data *pd, 
unsigned int seq_nr)
return padata_index_to_cpu(pd, cpu_index);
 }
 
-static void padata_parallel_worker(struct work_struct *parallel_work)
+static struct padata_work *padata_work_alloc(void)
 {
-   struct padata_parallel_queue *pqueue;
-   LIST_HEAD(local_list);
+   struct padata_work *pw;
 
-   local_bh_disable();
-   pqueue = container_of(parallel_work,
- struct padata_parallel_queue, work);
+   lockdep_assert_held(&padata_works_lock);
 
-   spin_lock(&pqueue->parallel.lock);
-   list_replace_init(&pqueue->parallel.list, &local_list);
-   spin_unlock(&pqueue->parallel.lock);
+   if (list_empty(&padata_free_works))
+   return NULL;/* No more work items allowed to be queued. */
 
-   while (!list_empty(&local_list)) {
-   struct padata_priv *padata;
+   pw = list_first_entry(&padata_free_works, struct padata_work, pw_list);
+   list_del(&pw->pw_list);
+   return pw;
+}
 
-   padata = list_entry(local_list.next,
-   struct padata_priv, list);
+static void padata_work_init(struct padata_work *pw, work_func_t work_fn,
+void *data

[PATCH v3 6/8] mm: parallelize deferred_init_memmap()

2020-05-27 Thread Daniel Jordan
)  79.8% 48.7 (  7.4)
 100% ( 16)           21.0%    813.7 ( 21.0)      80.5%     47.0 (  5.2)

Server-oriented distros that enable deferred page init sometimes run in
small VMs, and they still benefit even though the fraction of boot time
saved is smaller:

AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
  1 node * 2 cores * 2 threads = 4 CPUs
  16G/node = 16G memory

                    kernel boot                 deferred init
                    ------------------------    ------------------------
node% (thr)         speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)              --    716.0 ( 14.0)         --     49.7 (  0.6)
  25% (  1)            1.8%    703.0 (  5.3)      -4.0%     51.7 (  0.6)
  50% (  2)            1.6%    704.7 (  1.2)      43.0%     28.3 (  0.6)
  75% (  3)            2.7%    696.7 ( 13.1)      49.7%     25.0 (  0.0)
 100% (  4)            4.1%    687.0 ( 10.4)      55.7%     22.0 (  0.0)

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
  1 node * 2 cores * 2 threads = 4 CPUs
  14G/node = 14G memory

                    kernel boot                 deferred init
                    ------------------------    ------------------------
node% (thr)         speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)              --    787.7 (  6.4)         --    122.3 (  0.6)
  25% (  1)            0.2%    786.3 ( 10.8)      -2.5%    125.3 (  2.1)
  50% (  2)            5.9%    741.0 ( 13.9)      37.6%     76.3 ( 19.7)
  75% (  3)            8.3%    722.0 ( 19.0)      49.9%     61.3 (  3.2)
 100% (  4)            9.3%    714.7 (  9.5)      56.4%     53.3 (  1.5)

On Josh's 96-CPU and 192G memory system:

Without this patch series:
[0.487132] node 0 initialised, 23398907 pages in 292ms
[0.499132] node 1 initialised, 24189223 pages in 304ms
...
[0.629376] Run /sbin/init as init process

With this patch series:
[0.231435] node 1 initialised, 24189223 pages in 32ms
[0.236718] node 0 initialised, 23398907 pages in 36ms

[1] 
https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 mm/Kconfig  |  6 +++---
 mm/page_alloc.c | 46 --
 2 files changed, 43 insertions(+), 9 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index c1acc34c1c358..04c1da3f9f44c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -750,13 +750,13 @@ config DEFERRED_STRUCT_PAGE_INIT
depends on SPARSEMEM
depends on !NEED_PER_CPU_KM
depends on 64BIT
+   select PADATA
help
  Ordinarily all struct pages are initialised during early boot in a
  single thread. On very large machines this can take a considerable
  amount of time. If this option is set, large machines will bring up
- a subset of memmap at boot and then initialise the rest in parallel
- by starting one-off "pgdatinitX" kernel thread for each node X. This
- has a potential performance impact on processes running early in the
+ a subset of memmap at boot and then initialise the rest in parallel.
+ This has a potential performance impact on tasks running early in the
  lifetime of the system until these kthreads finish the
  initialisation.
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d64f3027fdfa6..1d47016849531 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -68,6 +68,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1814,6 +1815,26 @@ deferred_init_maxorder(u64 *i, struct zone *zone, 
unsigned long *start_pfn,
return nr_pages;
 }
 
+static void __init
+deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
+  void *arg)
+{
+   unsigned long spfn, epfn;
+   struct zone *zone = arg;
+   u64 i;
+
+   deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
+
+   /*
+* Initialize and free pages in MAX_ORDER sized increments so that we
+* can avoid introducing any issues with the buddy allocator.
+*/
+   while (spfn < end_pfn) {
+   deferred_init_maxorder(&i, zone, &spfn, &epfn);
+   cond_resched();
+   }
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
@@ -1823,7 +1844,7 @@ static int __init deferred_init_memmap(void *data)
unsigned long first_init_pfn, flags;
unsigned long start = jiffies;
struct zone *zone;
-   int zid;
+   int zid, max_threads;
u64 i;
 
/* Bind memory initialisation thread to a local node if possible */
@@ -1863,13 +1884,26 @@ static int __init deferred_init_memmap(void *data)
goto zone_empty;
 
/*
-* Initialize and free pages in MAX_ORDER sized increments so
-* that we can avoid introduci

[PATCH v3 2/8] padata: initialize earlier

2020-05-27 Thread Daniel Jordan
padata will soon initialize the system's struct pages in parallel, so it
needs to be ready by page_alloc_init_late().

The error return from padata_driver_init() triggers an initcall warning,
so add a warning to padata_init() to avoid silent failure.

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 include/linux/padata.h |  6 ++
 init/main.c|  2 ++
 kernel/padata.c| 17 -
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index a0d8b41850b25..476ecfa41f363 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -164,6 +164,12 @@ struct padata_instance {
#define	PADATA_INVALID  4
 };
 
+#ifdef CONFIG_PADATA
+extern void __init padata_init(void);
+#else
+static inline void __init padata_init(void) {}
+#endif
+
 extern struct padata_instance *padata_alloc_possible(const char *name);
 extern void padata_free(struct padata_instance *pinst);
 extern struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
diff --git a/init/main.c b/init/main.c
index 03371976d3872..df32f67214d23 100644
--- a/init/main.c
+++ b/init/main.c
@@ -63,6 +63,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1482,6 +1483,7 @@ static noinline void __init kernel_init_freeable(void)
smp_init();
sched_init_smp();
 
+   padata_init();
page_alloc_init_late();
/* Initialize page ext after all struct pages are initialized. */
page_ext_init();
diff --git a/kernel/padata.c b/kernel/padata.c
index 835919c745266..6f709bc0fc413 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -31,7 +31,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #define MAX_OBJ_NUM 1000
 
@@ -1050,26 +1049,26 @@ void padata_free_shell(struct padata_shell *ps)
 }
 EXPORT_SYMBOL(padata_free_shell);
 
-#ifdef CONFIG_HOTPLUG_CPU
-
-static __init int padata_driver_init(void)
+void __init padata_init(void)
 {
+#ifdef CONFIG_HOTPLUG_CPU
int ret;
 
ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "padata:online",
  padata_cpu_online, NULL);
if (ret < 0)
-   return ret;
+   goto err;
hp_online = ret;
 
ret = cpuhp_setup_state_multi(CPUHP_PADATA_DEAD, "padata:dead",
  NULL, padata_cpu_dead);
if (ret < 0) {
cpuhp_remove_multi_state(hp_online);
-   return ret;
+   goto err;
}
-   return 0;
-}
-module_init(padata_driver_init);
 
+   return;
+err:
+   pr_warn("padata: initialization failed\n");
 #endif
+}
-- 
2.26.2



[PATCH v3 5/8] mm: don't track number of pages during deferred initialization

2020-05-27 Thread Daniel Jordan
Deferred page init used to report the number of pages initialized:

  node 0 initialised, 32439114 pages in 97ms

Tracking this makes the code more complicated when using multiple
threads.  Given that the statistic probably has limited value,
especially since a zone grows on demand so that the page count can vary,
just remove it.

The boot message now looks like

  node 0 deferred pages initialised in 97ms

Signed-off-by: Daniel Jordan 
Suggested-by: Alexander Duyck 
---
 mm/page_alloc.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0c0d9364aa6d..d64f3027fdfa6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1819,7 +1819,7 @@ static int __init deferred_init_memmap(void *data)
 {
pg_data_t *pgdat = data;
const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
-   unsigned long spfn = 0, epfn = 0, nr_pages = 0;
+   unsigned long spfn = 0, epfn = 0;
unsigned long first_init_pfn, flags;
unsigned long start = jiffies;
struct zone *zone;
@@ -1868,15 +1868,15 @@ static int __init deferred_init_memmap(void *data)
 * allocator.
 */
while (spfn < epfn) {
-   nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+   deferred_init_maxorder(&i, zone, &spfn, &epfn);
cond_resched();
}
 zone_empty:
/* Sanity check that the next zone really is unpopulated */
WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
 
-   pr_info("node %d initialised, %lu pages in %ums\n",
-   pgdat->node_id, nr_pages, jiffies_to_msecs(jiffies - start));
+   pr_info("node %d deferred pages initialised in %ums\n",
+   pgdat->node_id, jiffies_to_msecs(jiffies - start));
 
pgdat_init_report_one_done();
return 0;
-- 
2.26.2



Re: [PATCH 4.4 26/65] sched/fair, cpumask: Export for_each_cpu_wrap()

2020-05-27 Thread Daniel Jordan
On 5/27/20 4:09 AM, Greg KH wrote:
> On Wed, May 27, 2020 at 07:50:56AM +, nobuhiro1.iwama...@toshiba.co.jp 
> wrote:
>>> Subject: [PATCH 4.4 26/65] sched/fair, cpumask: Export for_each_cpu_wrap()  
...
>>
>> This commit also needs the following commits:
>>
>> commit d207af2eab3f8668b95ad02b21930481c42806fd
>> Author: Michael Kelley 
>> Date:   Wed Feb 14 02:54:03 2018 +
>>
>> cpumask: Make for_each_cpu_wrap() available on UP as well
>> 
>> for_each_cpu_wrap() was originally added in the #else half of a
>> large "#if NR_CPUS == 1" statement, but was omitted in the #if
>> half.  This patch adds the missing #if half to prevent compile
>> errors when NR_CPUS is 1.
>> 
>> Reported-by: kbuild test robot 
>> Signed-off-by: Michael Kelley 
>> Cc: Linus Torvalds 
>> Cc: Peter Zijlstra 
>> Cc: Thomas Gleixner 
>> Cc: k...@microsoft.com
>> Cc: martin.peter...@oracle.com
>> Cc: mikel...@microsoft.com
>> Fixes: c743f0a5c50f ("sched/fair, cpumask: Export for_each_cpu_wrap()")
>> Link: 
>> http://lkml.kernel.org/r/sn6pr1901mb2045f087f59450507d4fcc17cb...@sn6pr1901mb2045.namprd19.prod.outlook.com
>> Signed-off-by: Ingo Molnar 
>>
>> Please apply this commit.
> 
> Good catch, now queued up, thanks.

I left this commit out because the 4.4 kernel only uses
cpumask_next_wrap in padata, which is only enabled for SMP kernels, but
it's probably best to be safe.
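
(For reference, the UP half that commit adds is essentially a degenerate
single-iteration loop, something like:

	#define for_each_cpu_wrap(cpu, mask, start)	\
		for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask, (void)(start))

so it only matters for build coverage.)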

Daniel


Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()

2020-05-21 Thread Daniel Jordan
On Thu, May 21, 2020 at 09:46:35AM -0700, Alexander Duyck wrote:
> It is more about not bothering with the extra tracking. We don't
> really need it and having it doesn't really add much in the way of
> value.

Yeah, it can probably go.

> > > > @@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void 
> > > > *data)
> > > > goto zone_empty;
> > > >
> > > > /*
> > > > -* Initialize and free pages in MAX_ORDER sized increments so
> > > > -* that we can avoid introducing any issues with the buddy
> > > > -* allocator.
> > > > +* More CPUs always led to greater speedups on tested systems, 
> > > > up to
> > > > +* all the nodes' CPUs.  Use all since the system is otherwise 
> > > > idle now.
> > > >  */
> > > > +   max_threads = max(cpumask_weight(cpumask), 1u);
> > > > +
> > > > while (spfn < epfn) {
> > > > +   epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
> > > > +
> > > > +   if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
> > > > +   epfn_align - spfn >= PAGES_PER_SECTION) {
> > > > +   struct definit_args arg = { zone, 
> > > > ATOMIC_LONG_INIT(0) };
> > > > +   struct padata_mt_job job = {
> > > > +   .thread_fn   = 
> > > > deferred_init_memmap_chunk,
> > > > +   .fn_arg  = &arg,
> > > > +   .start   = spfn,
> > > > +   .size= epfn_align - spfn,
> > > > +   .align   = PAGES_PER_SECTION,
> > > > +   .min_chunk   = PAGES_PER_SECTION,
> > > > +   .max_threads = max_threads,
> > > > +   };
> > > > +
> > > > +   padata_do_multithreaded(&job);
> > > > +   nr_pages += atomic_long_read(&arg.nr_pages);
> > > > +   spfn = epfn_align;
> > > > +   }
> > > > +
> > > > nr_pages += deferred_init_maxorder(&i, zone, &spfn,
> > > > &epfn);
> > > > cond_resched();
> > > > }
> > >
> > > This doesn't look right. You are basically adding threads in addition
> > > to calls to deferred_init_maxorder.
> >
> > The deferred_init_maxorder call is there to do the remaining, non-section
> > aligned part of a range.  It doesn't have to be done this way.
> 
> It is also doing the advancing though isn't it?

Yes.  Not sure what you're getting at.  There's the 'spfn = epfn_align' before
so nothing is skipped.  It's true that the nonaligned part is done outside of
padata when it could be done by a thread that'd otherwise be waiting or idle,
which should be addressed in the next version.

> I think I resolved this with the fix for it I described in the other
> email. We just need to swap out spfn for epfn and make sure we align
> spfn with epfn_align. Then I think that takes care of possible skips.

Right, though your fix looks a lot like deferred_init_mem_pfn_range_in_zone().
Seems better to just use that and not repeat ourselves.  Lame that it's
starting at the beginning of the ranges every time, maybe it could be
generalized somehow, but I think it should be fast enough.

> > We could use deferred_init_mem_pfn_range_in_zone() instead of the for_each
> > loop.
> >
> > What I was trying to avoid by aligning down is creating a discontiguous pfn
> > range that get passed to padata.  We already discussed how those are handled
> > by the zone iterator in the thread function, but job->size can be 
> > exaggerated
> > to include parts of the range that are never touched.  Thinking more about 
> > it
> > though, it's a small fraction of the total work and shouldn't matter.
> 
> So the problem with aligning down is that you are going to be slowed
> up as you have to go single threaded to initialize whatever remains.
> So worst case scenario is that you have a section aligned block and
> you will process all but 1 section in parallel, and then have to
> process the remaining section one max order block at a time.

Yes, aligning up is better.

> > > This should accomplish the same thing, but much more efficiently.
> >
> > Well, more cleanly.  I'll give it a try.
> 
> I agree I am not sure if it will make a big difference on x86, however
> the more ranges you have to process the faster this approach should be
> as it stays parallel the entire time rather than having to drop out
> and process the last section one max order block at a time.

Right.


Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()

2020-05-21 Thread Daniel Jordan
On Thu, May 21, 2020 at 08:00:31AM -0700, Alexander Duyck wrote:
> So I was thinking about my suggestion further and the loop at the end
> isn't quite correct as I believe it could lead to gaps. The loop on
> the end should probably be:
> for_each_free_mem_pfn_range_in_zone_from(i, zone, &spfn, &epfn)
> {
> if (epfn <= epfn_align)
> continue;
> if (spfn < epfn_align)
> spfn = epfn_align;
> break;
> }
> 
> That would generate a new range where epfn_align has actually ended
> and there is a range of new PFNs to process.

Whoops, my email crossed with yours.  Agreed, but see the other message.


Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()

2020-05-21 Thread Daniel Jordan
On Wed, May 20, 2020 at 06:29:32PM -0700, Alexander Duyck wrote:
> On Wed, May 20, 2020 at 11:27 AM Daniel Jordan
> > @@ -1814,16 +1815,44 @@ deferred_init_maxorder(u64 *i, struct zone *zone, 
> > unsigned long *start_pfn,
> > return nr_pages;
> >  }
> >
> > +struct definit_args {
> > +   struct zone *zone;
> > +   atomic_long_t nr_pages;
> > +};
> > +
> > +static void __init
> > +deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> > +  void *arg)
> > +{
> > +   unsigned long spfn, epfn, nr_pages = 0;
> > +   struct definit_args *args = arg;
> > +   struct zone *zone = args->zone;
> > +   u64 i;
> > +
> > +   deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> > start_pfn);
> > +
> > +   /*
> > +* Initialize and free pages in MAX_ORDER sized increments so that 
> > we
> > +* can avoid introducing any issues with the buddy allocator.
> > +*/
> > +   while (spfn < end_pfn) {
> > +   nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > +   cond_resched();
> > +   }
> > +
> > +   atomic_long_add(nr_pages, &args->nr_pages);
> > +}
> > +
> 
> Personally I would get rid of nr_pages entirely. It isn't worth the
> cache thrash to have this atomic variable bouncing around.

One of the things I tried to optimize was the managed_pages atomic adds in
__free_pages_core, but performance stayed the same on the biggest machine I
tested when it was done once at the end of page init instead of in every thread
for every pageblock.

I'm not sure this atomic would matter either, given it's less frequent.

> You could
> probably just have this function return void since all nr_pages is
> used for is a pr_info  statement at the end of the initialization
> which will be completely useless now anyway since we really have the
> threads running in parallel anyway.

The timestamp is still useful for observability, page init is a significant
part of kernel boot on big machines, over 10% sometimes with these patches.

It's mostly the time that matters though, I agree the number of pages is less
important and is probably worth removing just to simplify the code.  I'll do it
if no one sees a reason to keep it.

> We only really need the nr_pages logic in deferred_grow_zone in order
> to track if we have freed enough pages to allow us to go back to what
> we were doing.
>
> > @@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
> > goto zone_empty;
> >
> > /*
> > -* Initialize and free pages in MAX_ORDER sized increments so
> > -* that we can avoid introducing any issues with the buddy
> > -* allocator.
> > +* More CPUs always led to greater speedups on tested systems, up to
> > +* all the nodes' CPUs.  Use all since the system is otherwise idle 
> > now.
> >  */
> > +   max_threads = max(cpumask_weight(cpumask), 1u);
> > +
> > while (spfn < epfn) {
> > +   epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
> > +
> > +   if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
> > +   epfn_align - spfn >= PAGES_PER_SECTION) {
> > +   struct definit_args arg = { zone, 
> > ATOMIC_LONG_INIT(0) };
> > +   struct padata_mt_job job = {
> > +   .thread_fn   = deferred_init_memmap_chunk,
> > +   .fn_arg  = &arg,
> > +   .start   = spfn,
> > +   .size= epfn_align - spfn,
> > +   .align   = PAGES_PER_SECTION,
> > +   .min_chunk   = PAGES_PER_SECTION,
> > +   .max_threads = max_threads,
> > +   };
> > +
> > +   padata_do_multithreaded(&job);
> > +   nr_pages += atomic_long_read(&arg.nr_pages);
> > +   spfn = epfn_align;
> > +   }
> > +
> > nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > cond_resched();
> > }
> 
> This doesn't look right. You are basically adding threads in addition
> to calls to deferred_init_maxorder.

The deferred_init_maxorder call is there to do the remaining, non-section
aligned part of a range.  It doesn't have to be done this way.

> In addition you are spawning on
