Re: [PATCH v2 0/5] kvfree_rcu() miscellaneous fixes

2021-04-16 Thread Uladzislau Rezki
On Thu, Apr 15, 2021 at 06:10:26PM -0700, Paul E. McKenney wrote:
> On Thu, Apr 15, 2021 at 07:19:55PM +0200, Uladzislau Rezki (Sony) wrote:
> > This is a v2 of a small series. See the changelog below:
> > 
> > V1 -> V2:
> > - document the rcu_delay_page_cache_fill_msec parameter;
> > - drop the "kvfree_rcu: introduce "flags" variable" patch;
> > - reword commit messages;
> > - in the patch [1], do not use READ_ONCE() in
> >   get_cached_bnode()/put_cached_bnode(), since access there is
> >   protected by the lock;
> > - Capitalize the word following the ":" in commit messages.
> > 
> > Uladzislau Rezki (Sony) (4):
> > [1]  kvfree_rcu: Use [READ/WRITE]_ONCE() macros to access to nr_bkv_objs
> > [2]  kvfree_rcu: Add a bulk-list check when a scheduler is run
> > [3]  kvfree_rcu: Update "monitor_todo" once a batch is started
> > [4]  kvfree_rcu: Use kfree_rcu_monitor() instead of open-coded variant
> > 
> > Zhang Qiang (1):
> > [5]  kvfree_rcu: Release a page cache under memory pressure
> 
> I have queued these, thank you both!  And they pass touch tests, but
> could you please check that "git am -3" correctly resolved a couple of
> conflicts, one in Documentation/admin-guide/kernel-parameters.txt and
> the other in kernel/rcu/tree.c?
> 
Thanks!

I have double-checked it. Everything is in place and has been
correctly applied on top of your latest "dev" branch.

--
Vlad Rezki


[PATCH v2 4/5] kvfree_rcu: Update "monitor_todo" once a batch is started

2021-04-15 Thread Uladzislau Rezki (Sony)
Before attempting to start a new batch, the "monitor_todo" variable
is set to "false", and it is set back to "true" if a previous RCU
batch is still in progress.

Instead, clear it to "false" only when a new batch has been
successfully queued; otherwise it stays active anyway, so there is
no reason to toggle it back and forth.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/tree.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 3ddc9dc97487..17c128d93825 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3415,15 +3415,14 @@ static inline void kfree_rcu_drain_unlock(struct kfree_rcu_cpu *krcp,
  unsigned long flags)
 {
// Attempt to start a new batch.
-   krcp->monitor_todo = false;
if (queue_kfree_rcu_work(krcp)) {
// Success! Our job is done here.
+   krcp->monitor_todo = false;
raw_spin_unlock_irqrestore(&krcp->lock, flags);
return;
}
 
// Previous RCU batch still in progress, try again later.
-   krcp->monitor_todo = true;
schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
raw_spin_unlock_irqrestore(&krcp->lock, flags);
 }
-- 
2.20.1



[PATCH v2 5/5] kvfree_rcu: Use kfree_rcu_monitor() instead of open-coded variant

2021-04-15 Thread Uladzislau Rezki (Sony)
To queue a new batch we already have the kfree_rcu_monitor() function,
which checks the "monitor_todo" variable and invokes
kfree_rcu_drain_unlock() to start a new batch after a grace period.
Get rid of the open-coded variant in the shrinker path by switching
it to that function.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/tree.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 17c128d93825..b3e04c4fefcf 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3670,7 +3670,6 @@ static unsigned long
 kfree_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
int cpu, freed = 0;
-   unsigned long flags;
 
for_each_possible_cpu(cpu) {
int count;
@@ -3678,12 +3677,7 @@ kfree_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 
count = krcp->count;
count += drain_page_cache(krcp);
-
-   raw_spin_lock_irqsave(&krcp->lock, flags);
-   if (krcp->monitor_todo)
-   kfree_rcu_drain_unlock(krcp, flags);
-   else
-   raw_spin_unlock_irqrestore(&krcp->lock, flags);
+   kfree_rcu_monitor(&krcp->monitor_work.work);
 
sc->nr_to_scan -= count;
freed += count;
-- 
2.20.1

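For readers less familiar with the shrinker interface this series hooks into, below is a minimal, illustrative sketch of how a count/scan callback pair is registered on kernels of this era (single-argument register_shrinker()). Every name here is a hypothetical placeholder; the real callbacks in this series are kfree_rcu_shrink_count() and kfree_rcu_shrink_scan() in kernel/rcu/tree.c.

#include <linux/shrinker.h>
#include <linux/atomic.h>
#include <linux/kernel.h>
#include <linux/init.h>

static atomic_long_t my_cached_objs = ATOMIC_LONG_INIT(0);

/* Report how many objects could be reclaimed right now. */
static unsigned long my_shrink_count(struct shrinker *shrink,
				     struct shrink_control *sc)
{
	return atomic_long_read(&my_cached_objs);
}

/* Free up to sc->nr_to_scan objects and report how many were freed. */
static unsigned long my_shrink_scan(struct shrinker *shrink,
				    struct shrink_control *sc)
{
	unsigned long freed = min_t(unsigned long, sc->nr_to_scan,
				    atomic_long_read(&my_cached_objs));

	atomic_long_sub(freed, &my_cached_objs);
	return freed;
}

static struct shrinker my_shrinker = {
	.count_objects	= my_shrink_count,
	.scan_objects	= my_shrink_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* Under memory pressure, the MM core invokes the callbacks above. */
static int __init my_cache_shrinker_init(void)
{
	return register_shrinker(&my_shrinker);
}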


[PATCH v2 3/5] kvfree_rcu: Add a bulk-list check when a scheduler is run

2021-04-15 Thread Uladzislau Rezki (Sony)
RCU_SCHEDULER_RUNNING is set once scheduling is available. That
signal is used to check for and queue a "monitor work" item in
order to reclaim any objects freed during the boot-up phase.

We need it because the main path of the kvfree_rcu() call cannot
queue the work until the scheduler is up and running.

Currently that helper checks only "krcp->head" to figure out
whether there are outstanding objects to be released, but that is
only one channel. After adding the bulk interface there are two
extra channels that have to be checked as well: "krcp->bkvhead[0]"
and "krcp->bkvhead[1]". So queue the "monitor work" if _any_ of
the corresponding channels is not empty.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/tree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 07e718fdea12..3ddc9dc97487 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3712,7 +3712,8 @@ void __init kfree_rcu_scheduler_running(void)
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);

raw_spin_lock_irqsave(&krcp->lock, flags);
-   if (!krcp->head || krcp->monitor_todo) {
+   if ((!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head) ||
+   krcp->monitor_todo) {
raw_spin_unlock_irqrestore(&krcp->lock, flags);
continue;
}
-- 
2.20.1



[PATCH v2 2/5] kvfree_rcu: Use [READ/WRITE]_ONCE() macros to access to nr_bkv_objs

2021-04-15 Thread Uladzislau Rezki (Sony)
nr_bkv_objs is the counter of objects in the page cache, and
accessing it currently requires taking the lock. Switch to the
READ_ONCE()/WRITE_ONCE() macros so that the counter can also be
sampled atomically without the lock; the shrinker is one such
lockless user.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/tree.c | 14 --
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 742152d6b952..07e718fdea12 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3223,7 +3223,7 @@ get_cached_bnode(struct kfree_rcu_cpu *krcp)
if (!krcp->nr_bkv_objs)
return NULL;
 
-   krcp->nr_bkv_objs--;
+   WRITE_ONCE(krcp->nr_bkv_objs, krcp->nr_bkv_objs - 1);
return (struct kvfree_rcu_bulk_data *)
llist_del_first(&krcp->bkvcache);
 }
@@ -3237,9 +3237,8 @@ put_cached_bnode(struct kfree_rcu_cpu *krcp,
return false;
 
llist_add((struct llist_node *) bnode, &krcp->bkvcache);
-   krcp->nr_bkv_objs++;
+   WRITE_ONCE(krcp->nr_bkv_objs, krcp->nr_bkv_objs + 1);
return true;
-
 }
 
 static int
@@ -3251,7 +3250,7 @@ drain_page_cache(struct kfree_rcu_cpu *krcp)
 
raw_spin_lock_irqsave(&krcp->lock, flags);
page_list = llist_del_all(&krcp->bkvcache);
-   krcp->nr_bkv_objs = 0;
+   WRITE_ONCE(krcp->nr_bkv_objs, 0);
raw_spin_unlock_irqrestore(&krcp->lock, flags);
 
llist_for_each_safe(pos, n, page_list) {
@@ -3655,18 +3654,13 @@ kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 {
int cpu;
unsigned long count = 0;
-   unsigned long flags;
 
/* Snapshot count of all CPUs */
for_each_possible_cpu(cpu) {
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
 
count += READ_ONCE(krcp->count);
-
-   raw_spin_lock_irqsave(&krcp->lock, flags);
-   count += krcp->nr_bkv_objs;
-   raw_spin_unlock_irqrestore(&krcp->lock, flags);
-
+   count += READ_ONCE(krcp->nr_bkv_objs);
atomic_set(&krcp->backoff_page_cache_fill, 1);
}
 
-- 
2.20.1

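As a side note, here is a minimal sketch, not taken from the patch, of the pattern this commit message relies on: updates still happen under the lock but are marked with WRITE_ONCE(), so a lockless reader such as a shrinker callback can take a snapshot with READ_ONCE(). The structure and function names are illustrative only.

#include <linux/spinlock.h>
#include <linux/compiler.h>

struct obj_cache {
	raw_spinlock_t lock;
	int nr_objs;		/* updated under ->lock, sampled locklessly */
};

/* Writer side: still holds the lock, but marks the store for lockless readers. */
static void obj_cache_add_one(struct obj_cache *c)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&c->lock, flags);
	WRITE_ONCE(c->nr_objs, c->nr_objs + 1);
	raw_spin_unlock_irqrestore(&c->lock, flags);
}

/* Reader side (e.g. a shrinker's count callback): lockless snapshot only. */
static int obj_cache_snapshot(struct obj_cache *c)
{
	return READ_ONCE(c->nr_objs);
}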


[PATCH v2 1/5] kvfree_rcu: Release a page cache under memory pressure

2021-04-15 Thread Uladzislau Rezki (Sony)
From: Zhang Qiang 

Add a drain_page_cache() function to drain the per-cpu page cache.
The reason behind it is that a system can run into a low-memory
condition, in which case the page shrinker asks its users to free
their caches in order to make extra memory available for other
needs in the system.

When a system hits such a condition, the page cache is drained on
all CPUs, and subsequent page-cache refill work is delayed, by
default with a 5-second interval, until the memory pressure
disappears. If needed, the delay can be changed via the
rcu_delay_page_cache_fill_msec module parameter.

Co-developed-by: Uladzislau Rezki (Sony) 
Signed-off-by: Uladzislau Rezki (Sony) 
Signed-off-by: Zqiang 
---
 .../admin-guide/kernel-parameters.txt |  5 ++
 kernel/rcu/tree.c | 82 +--
 2 files changed, 78 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 78dc87435ca7..6b769f5cf14c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4154,6 +4154,11 @@
whole algorithm to behave better in low memory
condition.
 
+   rcutree.rcu_delay_page_cache_fill_msec= [KNL]
+   Set delay for a page-cache refill when a low memory
+   condition occurs. That is in milliseconds. Allowed
+   value is within a 0:10 range.
+
rcutree.jiffies_till_first_fqs= [KNL]
Set delay from grace-period initialization to
first attempt to force quiescent states.
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2c9cf4df942c..742152d6b952 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -186,6 +186,17 @@ module_param(rcu_unlock_delay, int, 0444);
 static int rcu_min_cached_objs = 5;
 module_param(rcu_min_cached_objs, int, 0444);
 
+// A page shrinker can ask for freeing extra pages to get them
+// available for other needs in a system. Usually it happens
+// under low memory condition, in that case we should hold on
+// a bit with page cache filling.
+//
+// Default value is 5 seconds. That is long enough to reduce
+// an interfering and racing with a shrinker where the cache
+// is drained.
+static int rcu_delay_page_cache_fill_msec = 5000;
+module_param(rcu_delay_page_cache_fill_msec, int, 0444);
+
 /* Retrieve RCU kthreads priority for rcutorture */
 int rcu_get_gp_kthreads_prio(void)
 {
@@ -3144,6 +3155,7 @@ struct kfree_rcu_cpu_work {
  * Even though it is lockless an access has to be protected by the
  * per-cpu lock.
  * @page_cache_work: A work to refill the cache when it is empty
+ * @backoff_page_cache_fill: Delay a cache filling
  * @work_in_progress: Indicates that page_cache_work is running
  * @hrtimer: A hrtimer for scheduling a page_cache_work
  * @nr_bkv_objs: number of allocated objects at @bkvcache.
@@ -3163,7 +3175,8 @@ struct kfree_rcu_cpu {
bool initialized;
int count;
 
-   struct work_struct page_cache_work;
+   struct delayed_work page_cache_work;
+   atomic_t backoff_page_cache_fill;
atomic_t work_in_progress;
struct hrtimer hrtimer;
 
@@ -3229,6 +3242,26 @@ put_cached_bnode(struct kfree_rcu_cpu *krcp,
 
 }
 
+static int
+drain_page_cache(struct kfree_rcu_cpu *krcp)
+{
+   unsigned long flags;
+   struct llist_node *page_list, *pos, *n;
+   int freed = 0;
+
+   raw_spin_lock_irqsave(&krcp->lock, flags);
+   page_list = llist_del_all(&krcp->bkvcache);
+   krcp->nr_bkv_objs = 0;
+   raw_spin_unlock_irqrestore(&krcp->lock, flags);
+
+   llist_for_each_safe(pos, n, page_list) {
+   free_page((unsigned long)pos);
+   freed++;
+   }
+
+   return freed;
+}
+
 /*
  * This function is invoked in workqueue context after a grace period.
  * It frees all the objects queued on ->bhead_free or ->head_free.
@@ -3419,7 +3452,7 @@ schedule_page_work_fn(struct hrtimer *t)
struct kfree_rcu_cpu *krcp =
container_of(t, struct kfree_rcu_cpu, hrtimer);
 
-   queue_work(system_highpri_wq, &krcp->page_cache_work);
+   queue_delayed_work(system_highpri_wq, &krcp->page_cache_work, 0);
return HRTIMER_NORESTART;
 }
 
@@ -3428,12 +3461,16 @@ static void fill_page_cache_func(struct work_struct *work)
struct kvfree_rcu_bulk_data *bnode;
struct kfree_rcu_cpu *krcp =
container_of(work, struct kfree_rcu_cpu,
-   page_cache_work);
+   page_cache_work.work);
unsigned long flags;
+   int nr_pages;
bool pushed;
int i;
 
-   for (i = 0; i < rcu_min_cached_objs; i++) {
+   nr_pages = atomic_read(&krcp->backoff_page_cache_fill) ?
+   1 : rcu_min_cached_objs;
+
+   for (i = 0; i < nr_pages; i++) {
bno

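To illustrate the backoff mechanism this patch introduces, here is a minimal sketch under assumed names (the real code uses krcp->backoff_page_cache_fill, krcp->page_cache_work and the rcu_delay_page_cache_fill_msec parameter): once the shrinker signals memory pressure, the next cache refill is queued with a multi-second delay and fills only a single page.

#include <linux/workqueue.h>
#include <linux/atomic.h>
#include <linux/jiffies.h>

static int refill_delay_msec = 5000;	/* plays the role of rcu_delay_page_cache_fill_msec */
static atomic_t backoff_refill = ATOMIC_INIT(0);
static struct delayed_work refill_work;

static void refill_fn(struct work_struct *work)
{
	/* Refill fewer pages while under pressure, then drop the backoff. */
	int nr_pages = atomic_read(&backoff_refill) ? 1 : 5;

	/* ... allocate nr_pages pages into the cache here ... */
	(void)nr_pages;
	atomic_set(&backoff_refill, 0);
}

/* Shrinker path: the cache was just drained, slow down the next refill. */
static void note_memory_pressure(void)
{
	atomic_set(&backoff_refill, 1);
}

/* Producer path: queue a refill, delayed if a shrinker recently complained. */
static void kick_refill(void)
{
	unsigned long delay = atomic_read(&backoff_refill) ?
			msecs_to_jiffies(refill_delay_msec) : 0;

	queue_delayed_work(system_wq, &refill_work, delay);
}

static void refill_init(void)
{
	INIT_DELAYED_WORK(&refill_work, refill_fn);
}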
[PATCH v2 0/5] kvfree_rcu() miscellaneous fixes

2021-04-15 Thread Uladzislau Rezki (Sony)
This is a v2 of a small series. See the changelog below:

V1 -> V2:
- document the rcu_delay_page_cache_fill_msec parameter;
- drop the "kvfree_rcu: introduce "flags" variable" patch;
- reword commit messages;
- in the patch [1], do not use READ_ONCE() in
  get_cached_bnode()/put_cached_bnode(), since access there is
  protected by the lock;
- Capitalize the word following the ":" in commit messages.

Uladzislau Rezki (Sony) (4):
[1]  kvfree_rcu: Use [READ/WRITE]_ONCE() macros to access to nr_bkv_objs
[2]  kvfree_rcu: Add a bulk-list check when a scheduler is run
[3]  kvfree_rcu: Update "monitor_todo" once a batch is started
[4]  kvfree_rcu: Use kfree_rcu_monitor() instead of open-coded variant

Zhang Qiang (1):
[5]  kvfree_rcu: Release a page cache under memory pressure

 .../admin-guide/kernel-parameters.txt |  5 +
 kernel/rcu/tree.c | 92 +++
 2 files changed, 77 insertions(+), 20 deletions(-)

-- 
2.20.1



Re: [tip: core/rcu] softirq: Don't try waking ksoftirqd before it has been spawned

2021-04-15 Thread Uladzislau Rezki
> 
> Another approach is to move the spawning of ksoftirqd earlier.  This
> still leaves a window of vulnerability, but the window is smaller, and
> thus the probablity of something needing to happen there is smaller.
> Uladzislau sent out a patch that did this some weeks back.
> 
See below the patch that is in question, just in case:


commit f4cd768e341486655c8c196e1f2b48a4463541f3
Author: Paul E. McKenney 
Date:   Fri Feb 12 16:41:05 2021 -0800

softirq: Don't try waking ksoftirqd before it has been spawned

If there is heavy softirq activity, the softirq system will attempt
to awaken ksoftirqd and will stop the traditional back-of-interrupt
softirq processing.  This is all well and good, but only if the
ksoftirqd kthreads already exist, which is not the case during early
boot, in which case the system hangs.

One reproducer is as follows:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 --configs "TREE03" --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --bootargs "threadirqs=1" --trust-make

This commit therefore moves the spawning of the ksoftirqd kthreads
earlier in boot.  With this change, the above test passes.

Reported-by: Sebastian Andrzej Siewior 
Reported-by: Uladzislau Rezki 
Inspired-by: Uladzislau Rezki 
Signed-off-by: Paul E. McKenney 

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index bb8ff90..283a02d 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -592,6 +592,8 @@ static inline struct task_struct *this_cpu_ksoftirqd(void)
return this_cpu_read(ksoftirqd);
 }

+int spawn_ksoftirqd(void);
+
 /* Tasklets --- multithreaded analogue of BHs.

This API is deprecated. Please consider using threaded IRQs instead:
diff --git a/init/main.c b/init/main.c
index c68d784..99835bb 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1512,6 +1512,7 @@ static noinline void __init kernel_init_freeable(void)

init_mm_internals();

+   spawn_ksoftirqd();
rcu_init_tasks_generic();
do_pre_smp_initcalls();
lockup_detector_init();
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 9d71046..45d50d4 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -724,7 +724,7 @@ static struct smp_hotplug_thread softirq_threads = {
.thread_comm= "ksoftirqd/%u",
 };

-static __init int spawn_ksoftirqd(void)
+__init int spawn_ksoftirqd(void)
 {
cpuhp_setup_state_nocalls(CPUHP_SOFTIRQ_DEAD, "softirq:dead", NULL,
  takeover_tasklets);
@@ -732,7 +732,6 @@ static __init int spawn_ksoftirqd(void)
BUG_ON(smpboot_register_percpu_thread(&softirq_threads));

return 0;
}
-early_initcall(spawn_ksoftirqd);

 /*
  * [ These __weak aliases are kept in a separate compilation unit, so that


Thanks.

--
Vlad Rezki


[PATCH 6/6] kvfree_rcu: use kfree_rcu_monitor() instead of open-coded variant

2021-04-14 Thread Uladzislau Rezki (Sony)
To queue a new batch we already have the kfree_rcu_monitor() function,
which checks the KRC_MONITOR_TODO bit and invokes
kfree_rcu_drain_unlock() to start a new batch after a grace period.
Get rid of the open-coded variant in the shrinker path by switching
it to that function.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/tree.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 012030cbe55e..14e9220198eb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3680,7 +3680,6 @@ static unsigned long
 kfree_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
int cpu, freed = 0;
-   unsigned long flags;
 
for_each_possible_cpu(cpu) {
int count;
@@ -3688,12 +3687,7 @@ kfree_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 
count = krcp->count;
count += drain_page_cache(krcp);
-
-   raw_spin_lock_irqsave(&krcp->lock, flags);
-   if (test_bit(KRC_MONITOR_TODO, &krcp->flags))
-   kfree_rcu_drain_unlock(krcp, flags);
-   else
-   raw_spin_unlock_irqrestore(&krcp->lock, flags);
+   kfree_rcu_monitor(&krcp->monitor_work.work);
 
sc->nr_to_scan -= count;
freed += count;
-- 
2.20.1



[PATCH 5/6] kvfree_rcu: clear KRC_MONITOR_TODO bit once a batch is started

2021-04-14 Thread Uladzislau Rezki (Sony)
Before attempting to start a new batch, the KRC_MONITOR_TODO bit
is cleared, and it is set back when a previous RCU batch is still
in progress.

Instead, clear the KRC_MONITOR_TODO bit only when a new batch has
been successfully queued; otherwise it stays active anyway, so
there is no reason to set it back. Please note that checking and
setting this bit is protected by the krcp->lock spinlock.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/tree.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index da3605067cc1..012030cbe55e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3427,16 +3427,14 @@ static inline void kfree_rcu_drain_unlock(struct kfree_rcu_cpu *krcp,
  unsigned long flags)
 {
// Attempt to start a new batch.
-   clear_bit(KRC_MONITOR_TODO, &krcp->flags);
-
if (queue_kfree_rcu_work(krcp)) {
// Success! Our job is done here.
+   clear_bit(KRC_MONITOR_TODO, &krcp->flags);
raw_spin_unlock_irqrestore(&krcp->lock, flags);
return;
}
 
// Previous RCU batch still in progress, try again later.
-   set_bit(KRC_MONITOR_TODO, &krcp->flags);
schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
raw_spin_unlock_irqrestore(&krcp->lock, flags);
 }
-- 
2.20.1



[PATCH 4/6] kvfree_rcu: add a bulk-list check when a scheduler is run

2021-04-14 Thread Uladzislau Rezki (Sony)
RCU_SCHEDULER_RUNNING is set once scheduling is available. That
signal is used to check for and queue a "monitor work" item in
order to reclaim any objects freed during the boot-up phase.

We need it because the main path of the kvfree_rcu() call cannot
queue the work until the scheduler is up and running.

Currently that helper checks only "krcp->head" to figure out
whether there are outstanding objects to be released, but that is
only one channel. After adding the bulk interface there are two
extra channels that have to be checked as well: "krcp->bkvhead[0]"
and "krcp->bkvhead[1]". So queue the "monitor work" if _any_ of
the corresponding channels is not empty.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/tree.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 31ee820c3d9e..da3605067cc1 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3723,13 +3723,11 @@ void __init kfree_rcu_scheduler_running(void)
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
 
raw_spin_lock_irqsave(&krcp->lock, flags);
-   if (!krcp->head || test_and_set_bit(KRC_MONITOR_TODO, &krcp->flags)) {
-   raw_spin_unlock_irqrestore(>lock, flags);
-   continue;
+   if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head ||
+       !test_and_set_bit(KRC_MONITOR_TODO, &krcp->flags)) {
+   schedule_delayed_work_on(cpu, &krcp->monitor_work,
+   KFREE_DRAIN_JIFFIES);
}
-
-   schedule_delayed_work_on(cpu, &krcp->monitor_work,
-KFREE_DRAIN_JIFFIES);
raw_spin_unlock_irqrestore(&krcp->lock, flags);
}
 }
-- 
2.20.1



[PATCH 3/6] kvfree_rcu: introduce "flags" variable

2021-04-14 Thread Uladzislau Rezki (Sony)
We have a few extra variables within the kfree_rcu_cpu structure
that act as control variables and behave as regular booleans.
Instead, we can pack them into a single variable and define bit
descriptions, each representing an individual boolean state.

This reduces the size of the per-cpu kfree_rcu_cpu structure.
The flags variable is accessed using atomic bit operations.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/tree.c | 61 ---
 1 file changed, 36 insertions(+), 25 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 1b0289fa1cdd..31ee820c3d9e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3139,6 +3139,24 @@ struct kfree_rcu_cpu_work {
struct kfree_rcu_cpu *krcp;
 };
 
+// The per-cpu kfree_rcu_cpu structure was initialized.
+// It is set only once when a system is up and running.
+#define KRC_INITIALIZED0x1
+
+// Indicates that a page_cache_work has been initialized
+// and is about to be queued for execution. The flag is
+// cleared on exit of the worker function.
+#define KRC_CACHE_WORK_RUN 0x2
+
+// A page shrinker can ask for freeing extra pages to get
+// them available for other needs in a system. Usually it
+// happens under low memory condition, in that case hold
+// on a bit with page cache filling.
+#define KRC_DELAY_CACHE_FILL   0x4
+
+// Tracks whether a "monitor_work" delayed work is pending
+#define KRC_MONITOR_TODO   0x8
+
 /**
  * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace period
  * @head: List of kfree_rcu() objects not yet waiting for a grace period
@@ -3146,17 +3164,14 @@ struct kfree_rcu_cpu_work {
  * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
  * @lock: Synchronize access to this structure
  * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
- * @monitor_todo: Tracks whether a @monitor_work delayed work is pending
- * @initialized: The @rcu_work fields have been initialized
  * @count: Number of objects for which GP not started
+ * @flags: Atomic flags which describe different states
  * @bkvcache:
  * A simple cache list that contains objects for reuse purpose.
  * In order to save some per-cpu space the list is singular.
  * Even though it is lockless an access has to be protected by the
  * per-cpu lock.
  * @page_cache_work: A work to refill the cache when it is empty
- * @backoff_page_cache_fill: Delay a cache filling
- * @work_in_progress: Indicates that page_cache_work is running
  * @hrtimer: A hrtimer for scheduling a page_cache_work
  * @nr_bkv_objs: number of allocated objects at @bkvcache.
  *
@@ -3171,13 +3186,10 @@ struct kfree_rcu_cpu {
struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
raw_spinlock_t lock;
struct delayed_work monitor_work;
-   bool monitor_todo;
-   bool initialized;
int count;
 
+   unsigned long flags;
struct delayed_work page_cache_work;
-   atomic_t backoff_page_cache_fill;
-   atomic_t work_in_progress;
struct hrtimer hrtimer;
 
struct llist_head bkvcache;
@@ -3415,7 +3427,8 @@ static inline void kfree_rcu_drain_unlock(struct kfree_rcu_cpu *krcp,
  unsigned long flags)
 {
// Attempt to start a new batch.
-   krcp->monitor_todo = false;
+   clear_bit(KRC_MONITOR_TODO, &krcp->flags);
+
if (queue_kfree_rcu_work(krcp)) {
// Success! Our job is done here.
raw_spin_unlock_irqrestore(&krcp->lock, flags);
@@ -3423,7 +3436,7 @@ static inline void kfree_rcu_drain_unlock(struct kfree_rcu_cpu *krcp,
}
 
// Previous RCU batch still in progress, try again later.
-   krcp->monitor_todo = true;
+   set_bit(KRC_MONITOR_TODO, &krcp->flags);
schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
raw_spin_unlock_irqrestore(&krcp->lock, flags);
 }
@@ -3439,7 +3452,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
 monitor_work.work);
 
raw_spin_lock_irqsave(&krcp->lock, flags);
-   if (krcp->monitor_todo)
+   if (test_bit(KRC_MONITOR_TODO, &krcp->flags))
kfree_rcu_drain_unlock(krcp, flags);
else
raw_spin_unlock_irqrestore(&krcp->lock, flags);
@@ -3466,7 +3479,7 @@ static void fill_page_cache_func(struct work_struct *work)
bool pushed;
int i;
 
-   nr_pages = atomic_read(&krcp->backoff_page_cache_fill) ?
+   nr_pages = test_bit(KRC_DELAY_CACHE_FILL, &krcp->flags) ?
1 : rcu_min_cached_objs;
 
for (i = 0; i < nr_pages; i++) {
@@ -3485,16 +3498,16 @@ static void fill_page_cache_func(struct work_struct *work)
}
}
 
-   atomic_set(&krcp->work_in_progress, 0);
-   atomic_set(&krcp->backoff_page_cache_fill, 0);
+   clear_bit(KRC_CACHE_WORK_RUN, &krcp->flags);
+

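A small illustrative sketch, not taken from the patch, of the atomic bit-operation pattern described above. Note that set_bit()/clear_bit()/test_and_set_bit() take a bit number rather than a mask as their first argument; all names below are hypothetical.

#include <linux/bitops.h>
#include <linux/types.h>

/* Bit numbers (not masks): the atomic bitops API takes a bit index. */
#define MY_INITIALIZED		0
#define MY_CACHE_WORK_RUN	1
#define MY_MONITOR_TODO		2

struct my_ctrl {
	unsigned long flags;
};

static void my_ctrl_init(struct my_ctrl *c)
{
	c->flags = 0;
	set_bit(MY_INITIALIZED, &c->flags);
}

/* Atomically test-and-set: only one caller wins and gets to queue the work. */
static bool my_try_queue_monitor(struct my_ctrl *c)
{
	return !test_and_set_bit(MY_MONITOR_TODO, &c->flags);
}

static void my_work_done(struct my_ctrl *c)
{
	if (test_bit(MY_CACHE_WORK_RUN, &c->flags))
		clear_bit(MY_CACHE_WORK_RUN, &c->flags);
}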
[PATCH 2/6] kvfree_rcu: use [READ/WRITE]_ONCE() macros to access to nr_bkv_objs

2021-04-14 Thread Uladzislau Rezki (Sony)
nr_bkv_objs is the counter of objects in the page cache, and
accessing it currently requires taking the lock. Switch to the
READ_ONCE()/WRITE_ONCE() macros so that the counter can also be
sampled atomically without the lock; the shrinker is one such
lockless user.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/tree.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8b74edcd11d4..1b0289fa1cdd 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3220,10 +3220,10 @@ krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
 static inline struct kvfree_rcu_bulk_data *
 get_cached_bnode(struct kfree_rcu_cpu *krcp)
 {
-   if (!krcp->nr_bkv_objs)
+   if (!READ_ONCE(krcp->nr_bkv_objs))
return NULL;
 
-   krcp->nr_bkv_objs--;
+   WRITE_ONCE(krcp->nr_bkv_objs, krcp->nr_bkv_objs - 1);
return (struct kvfree_rcu_bulk_data *)
llist_del_first(&krcp->bkvcache);
 }
@@ -3233,13 +3233,12 @@ put_cached_bnode(struct kfree_rcu_cpu *krcp,
struct kvfree_rcu_bulk_data *bnode)
 {
// Check the limit.
-   if (krcp->nr_bkv_objs >= rcu_min_cached_objs)
+   if (READ_ONCE(krcp->nr_bkv_objs) >= rcu_min_cached_objs)
return false;
 
llist_add((struct llist_node *) bnode, &krcp->bkvcache);
-   krcp->nr_bkv_objs++;
+   WRITE_ONCE(krcp->nr_bkv_objs, krcp->nr_bkv_objs + 1);
return true;
-
 }
 
 static int
@@ -3251,7 +3250,7 @@ drain_page_cache(struct kfree_rcu_cpu *krcp)
 
raw_spin_lock_irqsave(&krcp->lock, flags);
page_list = llist_del_all(&krcp->bkvcache);
-   krcp->nr_bkv_objs = 0;
+   WRITE_ONCE(krcp->nr_bkv_objs, 0);
raw_spin_unlock_irqrestore(&krcp->lock, flags);
 
llist_for_each_safe(pos, n, page_list) {
@@ -3655,18 +3654,13 @@ kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 {
int cpu;
unsigned long count = 0;
-   unsigned long flags;
 
/* Snapshot count of all CPUs */
for_each_possible_cpu(cpu) {
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
 
count += READ_ONCE(krcp->count);
-
-   raw_spin_lock_irqsave(&krcp->lock, flags);
-   count += krcp->nr_bkv_objs;
-   raw_spin_unlock_irqrestore(&krcp->lock, flags);
-
+   count += READ_ONCE(krcp->nr_bkv_objs);
atomic_set(&krcp->backoff_page_cache_fill, 1);
}
 
-- 
2.20.1



[PATCH 1/6] kvfree_rcu: Release a page cache under memory pressure

2021-04-14 Thread Uladzislau Rezki (Sony)
From: Zhang Qiang 

Add a drain_page_cache() function to drain the per-cpu page cache.
The reason behind it is that a system can run into a low-memory
condition, in which case the page shrinker asks its users to free
their caches in order to make extra memory available for other
needs in the system.

When a system hits such a condition, the page cache is drained on
all CPUs, and subsequent page-cache refill work is delayed, by
default with a 5-second interval, until the memory pressure
disappears. If needed, the delay can be changed via the
rcu_delay_page_cache_fill_msec module parameter.

Co-developed-by: Uladzislau Rezki (Sony) 
Signed-off-by: Uladzislau Rezki (Sony) 
Signed-off-by: Zqiang 
---
 kernel/rcu/tree.c | 70 +--
 1 file changed, 61 insertions(+), 9 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2c9cf4df942c..8b74edcd11d4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -186,6 +186,17 @@ module_param(rcu_unlock_delay, int, 0444);
 static int rcu_min_cached_objs = 5;
 module_param(rcu_min_cached_objs, int, 0444);
 
+// A page shrinker can ask for freeing extra pages to get them
+// available for other needs in a system. Usually it happens
+// under low memory condition, in that case we should hold on
+// a bit with page cache filling.
+//
+// Default value is 5 seconds. That is long enough to reduce
+// an interfering and racing with a shrinker where the cache
+// is drained.
+static int rcu_delay_page_cache_fill_msec = 5000;
+module_param(rcu_delay_page_cache_fill_msec, int, 0444);
+
 /* Retrieve RCU kthreads priority for rcutorture */
 int rcu_get_gp_kthreads_prio(void)
 {
@@ -3144,6 +3155,7 @@ struct kfree_rcu_cpu_work {
  * Even though it is lockless an access has to be protected by the
  * per-cpu lock.
  * @page_cache_work: A work to refill the cache when it is empty
+ * @backoff_page_cache_fill: Delay a cache filling
  * @work_in_progress: Indicates that page_cache_work is running
  * @hrtimer: A hrtimer for scheduling a page_cache_work
  * @nr_bkv_objs: number of allocated objects at @bkvcache.
@@ -3163,7 +3175,8 @@ struct kfree_rcu_cpu {
bool initialized;
int count;
 
-   struct work_struct page_cache_work;
+   struct delayed_work page_cache_work;
+   atomic_t backoff_page_cache_fill;
atomic_t work_in_progress;
struct hrtimer hrtimer;
 
@@ -3229,6 +3242,26 @@ put_cached_bnode(struct kfree_rcu_cpu *krcp,
 
 }
 
+static int
+drain_page_cache(struct kfree_rcu_cpu *krcp)
+{
+   unsigned long flags;
+   struct llist_node *page_list, *pos, *n;
+   int freed = 0;
+
+   raw_spin_lock_irqsave(&krcp->lock, flags);
+   page_list = llist_del_all(&krcp->bkvcache);
+   krcp->nr_bkv_objs = 0;
+   raw_spin_unlock_irqrestore(&krcp->lock, flags);
+
+   llist_for_each_safe(pos, n, page_list) {
+   free_page((unsigned long)pos);
+   freed++;
+   }
+
+   return freed;
+}
+
 /*
  * This function is invoked in workqueue context after a grace period.
  * It frees all the objects queued on ->bhead_free or ->head_free.
@@ -3419,7 +3452,7 @@ schedule_page_work_fn(struct hrtimer *t)
struct kfree_rcu_cpu *krcp =
container_of(t, struct kfree_rcu_cpu, hrtimer);
 
-   queue_work(system_highpri_wq, &krcp->page_cache_work);
+   queue_delayed_work(system_highpri_wq, &krcp->page_cache_work, 0);
return HRTIMER_NORESTART;
 }
 
@@ -3428,12 +3461,16 @@ static void fill_page_cache_func(struct work_struct *work)
struct kvfree_rcu_bulk_data *bnode;
struct kfree_rcu_cpu *krcp =
container_of(work, struct kfree_rcu_cpu,
-   page_cache_work);
+   page_cache_work.work);
unsigned long flags;
+   int nr_pages;
bool pushed;
int i;
 
-   for (i = 0; i < rcu_min_cached_objs; i++) {
+   nr_pages = atomic_read(&krcp->backoff_page_cache_fill) ?
+   1 : rcu_min_cached_objs;
+
+   for (i = 0; i < nr_pages; i++) {
bnode = (struct kvfree_rcu_bulk_data *)
__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
 
@@ -3450,6 +3487,7 @@ static void fill_page_cache_func(struct work_struct *work)
}
 
atomic_set(&krcp->work_in_progress, 0);
+   atomic_set(&krcp->backoff_page_cache_fill, 0);
 }
 
 static void
@@ -3457,10 +3495,15 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
 {
if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
!atomic_xchg(&krcp->work_in_progress, 1)) {
-   hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC,
-   HRTIMER_MODE_REL);
-   krcp->hrtimer.function = schedule_page_work_fn;
-   hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
+   if (atomic_read(&krcp->backoff_page_cache_fill)) {
+   queue_delayed_wo

Re: [tip: core/rcu] softirq: Don't try waking ksoftirqd before it has been spawned

2021-04-14 Thread Uladzislau Rezki
On Wed, Apr 14, 2021 at 09:13:22AM +0200, Sebastian Andrzej Siewior wrote:
> On 2021-04-12 11:36:45 [-0700], Paul E. McKenney wrote:
> > > Color me confused. I did not follow the discussion around this
> > > completely, but wasn't it agreed on that this rcu torture muck can wait
> > > until the threads are brought up?
> > 
> > Yes, we can cause rcutorture to wait.  But in this case, rcutorture
> > is just the messenger, and making it wait would simply be ignoring
> > the message.  The message is that someone could invoke any number of
> > things that wait on a softirq handler's invocation during the interval
> > before ksoftirqd has been spawned.
> 
> My memory on this is that the only user, that required this early
> behaviour, was kprobe which was recently changed to not need it anymore.
> Which makes the test as the only user that remains. Therefore I thought
> that this test will be moved to later position (when ksoftirqd is up and
> running) and that there is no more requirement for RCU to be completely
> up that early in the boot process.
> 
> Did I miss anything?
> 
Seems not. Let me wrap it up a bit, though I may miss something:

1) Initially we had an issue with booting RISC-V because of:

36dadef23fcc ("kprobes: Init kprobes in early_initcall")

i.e. a developer decided to move the initialization of kprobes to
the early_initcall() phase. Since kprobes uses synchronize_rcu_tasks(),
the system did not boot, because RCU-tasks was set up only at the
core_initcall() step, which happens later in that chain.

To address that issue, we decided to move the RCU-tasks setup
to before early_initcall(), and it worked well:

https://lore.kernel.org/lkml/20210218083636.ga2...@pc638.lan/T/

2) After that fix you reported another issue: if the kernel is run
with "threadirqs=1", it did not boot either, because ksoftirqd does
not exist at that point, so our early RCU self-test did not pass.

3) Due to (2), Masami Hiramatsu proposed fixing kprobes by delaying
kprobe optimization, which also addressed the initial issue:

https://lore.kernel.org/lkml/20210219112357.ga34...@pc638.lan/T/

At the same time Paul made another patch:

softirq: Don't try waking ksoftirqd before it has been spawned

It allows us to keep the RCU-tasks initialization before even
early_initcall(), where it is now, and lets our RCU self-test
complete without any hang.

--
Vlad Rezki


[tip: core/rcu] kvfree_rcu: Directly allocate page for single-argument case

2021-04-11 Thread tip-bot2 for Uladzislau Rezki (Sony)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 148e3731d124079a036b3acf780f3d35c1b9c0aa
Gitweb:
https://git.kernel.org/tip/148e3731d124079a036b3acf780f3d35c1b9c0aa
Author:Uladzislau Rezki (Sony) 
AuthorDate:Wed, 20 Jan 2021 17:21:46 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:18:07 -08:00

kvfree_rcu: Directly allocate page for single-argument case

Single-argument kvfree_rcu() must be invoked from sleepable contexts,
so we can directly allocate pages.  Furthermore, the fallback in case
of page-allocation failure is the high-latency synchronize_rcu(), so it
makes sense to do these page allocations from the fastpath, and even to
permit limited sleeping within the allocator.

This commit therefore allocates if needed on the fastpath using
GFP_KERNEL|__GFP_RETRY_MAYFAIL.  This also has the beneficial effect
of leaving kvfree_rcu()'s per-CPU caches to the double-argument variant
of kvfree_rcu(), given that the double-argument variant cannot directly
invoke the allocator.

[ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
Signed-off-by: Uladzislau Rezki (Sony) 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 42 ++
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index da6f521..1f8c980 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3493,37 +3493,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
}
 }
 
+// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
+// state specified by flags.  If can_alloc is true, the caller must
+// be schedulable and not be holding any locks or mutexes that might be
+// acquired by the memory allocator or anything that it might invoke.
+// Returns true if ptr was successfully recorded, else the caller must
+// use a fallback.
 static inline bool
-kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
+add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
+   unsigned long *flags, void *ptr, bool can_alloc)
 {
struct kvfree_rcu_bulk_data *bnode;
int idx;
 
-   if (unlikely(!krcp->initialized))
+   *krcp = krc_this_cpu_lock(flags);
+   if (unlikely(!(*krcp)->initialized))
return false;
 
-   lockdep_assert_held(&krcp->lock);
idx = !!is_vmalloc_addr(ptr);
 
/* Check if a new block is required. */
-   if (!krcp->bkvhead[idx] ||
-   krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
-   bnode = get_cached_bnode(krcp);
-   /* Switch to emergency path. */
+   if (!(*krcp)->bkvhead[idx] ||
+   (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
+   bnode = get_cached_bnode(*krcp);
+   if (!bnode && can_alloc) {
+   krc_this_cpu_unlock(*krcp, *flags);
+   bnode = (struct kvfree_rcu_bulk_data *)
+   __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
+   *krcp = krc_this_cpu_lock(flags);
+   }
+
if (!bnode)
return false;
 
/* Initialize the new block. */
bnode->nr_records = 0;
-   bnode->next = krcp->bkvhead[idx];
+   bnode->next = (*krcp)->bkvhead[idx];
 
/* Attach it to the head. */
-   krcp->bkvhead[idx] = bnode;
+   (*krcp)->bkvhead[idx] = bnode;
}
 
/* Finally insert. */
-   krcp->bkvhead[idx]->records
-   [krcp->bkvhead[idx]->nr_records++] = ptr;
+   (*krcp)->bkvhead[idx]->records
+   [(*krcp)->bkvhead[idx]->nr_records++] = ptr;
 
return true;
 }
@@ -3561,8 +3574,6 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
ptr = (unsigned long *) func;
}
 
-   krcp = krc_this_cpu_lock(&flags);
-
// Queue the object but don't yet schedule the batch.
if (debug_rcu_head_queue(ptr)) {
// Probable double kfree_rcu(), just leak.
@@ -3570,12 +3581,11 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
  __func__, head);
 
// Mark as success and leave.
-   success = true;
-   goto unlock_return;
+   return;
}
 
kasan_record_aux_stack(ptr);
-   success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
+   success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head);
if (!success) {
run_page_cache_worker(krcp);
 

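For reference, a short sketch of the two kvfree_rcu() call forms this commit message contrasts, as they exist in the kernels discussed in this thread. The structure and function names are illustrative only.

#include <linux/slab.h>
#include <linux/rcupdate.h>

struct foo {
	int data;
	struct rcu_head rcu;	/* needed only for the double-argument form */
};

/* Double-argument form: usable from atomic context; falls back to the
 * object's own rcu_head if no cached page is available. */
static void release_foo(struct foo *fp)
{
	kvfree_rcu(fp, rcu);
}

/* Single-argument ("headless") form: the caller must be able to sleep,
 * which is what lets this path call the page allocator directly. */
static void release_headless(void *p)
{
	kvfree_rcu(p);
}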

[tip: core/rcu] kvfree_rcu: Replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY

2021-04-11 Thread tip-bot2 for Uladzislau Rezki (Sony)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 3e7ce7a187fc6aaa9fda1310a2b8da8770342ff7
Gitweb:
https://git.kernel.org/tip/3e7ce7a187fc6aaa9fda1310a2b8da8770342ff7
Author:Uladzislau Rezki (Sony) 
AuthorDate:Fri, 29 Jan 2021 17:16:03 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:18:07 -08:00

kvfree_rcu: Replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY

__GFP_RETRY_MAYFAIL can spend quite a bit of time reclaiming, and this
can be wasted effort given that there is a fallback code path in case
memory allocation fails.

__GFP_NORETRY does perform some light-weight reclaim, but it will fail
under OOM conditions, allowing the fallback to be taken as an alternative
to hard-OOMing the system.

There is a four-way tradeoff that must be balanced:
1) Minimize use of the fallback path;
2) Avoid full-up OOM;
3) Do a light-weight allocation request;
4) Avoid dipping into the emergency reserves.

Signed-off-by: Uladzislau Rezki (Sony) 
Acked-by: Michal Hocko 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7ee83f3..0ecc1fb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3517,8 +3517,20 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
bnode = get_cached_bnode(*krcp);
if (!bnode && can_alloc) {
krc_this_cpu_unlock(*krcp, *flags);
+
+   // __GFP_NORETRY - allows a light-weight direct reclaim
+   // what is OK from minimizing of fallback hitting point of
+   // view. Apart of that it forbids any OOM invoking what is
+   // also beneficial since we are about to release memory soon.
+   //
+   // __GFP_NOMEMALLOC - prevents from consuming of all the
+   // memory reserves. Please note we have a fallback path.
+   //
+   // __GFP_NOWARN - it is supposed that an allocation can
+   // be failed under low memory or high memory pressure
+   // scenarios.
bnode = (struct kvfree_rcu_bulk_data *)
-   __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
+   __get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
*krcp = krc_this_cpu_lock(flags);
}
 


[tip: core/rcu] kvfree_rcu: Use same set of GFP flags as does single-argument

2021-04-11 Thread tip-bot2 for Uladzislau Rezki (Sony)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: ee6ddf58475cce8a3d3697614679cd8cb4a6f583
Gitweb:
https://git.kernel.org/tip/ee6ddf58475cce8a3d3697614679cd8cb4a6f583
Author:Uladzislau Rezki (Sony) 
AuthorDate:Fri, 29 Jan 2021 21:05:05 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:18:07 -08:00

kvfree_rcu: Use same set of GFP flags as does single-argument

Running an rcuscale stress-suite can lead to "Out of memory" of a
system. This can happen under high memory pressure with a small amount
of physical memory.

For example, a KVM test configuration with 64 CPUs and 512 megabytes
can result in OOM when running rcuscale with below parameters:

../kvm.sh --torture rcuscale --allcpus --duration 10 --kconfig CONFIG_NR_CPUS=64 \
--bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 \
  rcuscale.kfree_loops=1 torture.disable_onoff_at_boot" --trust-make


[   12.054448] kworker/1:1H invoked oom-killer: gfp_mask=0x2cc0(GFP_KERNEL|__GFP_NOWARN), order=0, oom_score_adj=0
[   12.055303] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510
[   12.055416] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
[   12.056485] Workqueue: events_highpri fill_page_cache_func
[   12.056485] Call Trace:
[   12.056485]  dump_stack+0x57/0x6a
[   12.056485]  dump_header+0x4c/0x30a
[   12.056485]  ? del_timer_sync+0x20/0x30
[   12.056485]  out_of_memory.cold.47+0xa/0x7e
[   12.056485]  __alloc_pages_slowpath.constprop.123+0x82f/0xc00
[   12.056485]  __alloc_pages_nodemask+0x289/0x2c0
[   12.056485]  __get_free_pages+0x8/0x30
[   12.056485]  fill_page_cache_func+0x39/0xb0
[   12.056485]  process_one_work+0x1ed/0x3b0
[   12.056485]  ? process_one_work+0x3b0/0x3b0
[   12.060485]  worker_thread+0x28/0x3c0
[   12.060485]  ? process_one_work+0x3b0/0x3b0
[   12.060485]  kthread+0x138/0x160
[   12.060485]  ? kthread_park+0x80/0x80
[   12.060485]  ret_from_fork+0x22/0x30
[   12.062156] Mem-Info:
[   12.062350] active_anon:0 inactive_anon:0 isolated_anon:0
[   12.062350]  active_file:0 inactive_file:0 isolated_file:0
[   12.062350]  unevictable:0 dirty:0 writeback:0
[   12.062350]  slab_reclaimable:2797 slab_unreclaimable:80920
[   12.062350]  mapped:1 shmem:2 pagetables:8 bounce:0
[   12.062350]  free:10488 free_pcp:1227 free_cma:0
...
[   12.101610] Out of memory and no killable processes...
[   12.102042] Kernel panic - not syncing: System is deadlocked on memory
[   12.102583] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510
[   12.102600] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014


Because kvfree_rcu() has a fallback path, memory allocation failure is
not the end of the world.  Furthermore, the added overhead of aggressive
GFP settings must be balanced against the overhead of the fallback path,
which is a cache miss for double-argument kvfree_rcu() and a call to
synchronize_rcu() for single-argument kvfree_rcu().  The current choice
of GFP_KERNEL|__GFP_NOWARN can result in longer latencies than a call
to synchronize_rcu(), so less-tenacious GFP flags would be helpful.

Here is the tradeoff that must be balanced:
a) Minimize use of the fallback path,
b) Avoid pushing the system into OOM,
c) Bound allocation latency to that of synchronize_rcu(), and
d) Leave the emergency reserves to use cases lacking fallbacks.

This commit therefore changes GFP flags from GFP_KERNEL|__GFP_NOWARN to
GFP_KERNEL|__GFP_NORETRY|__GFP_NOMEMALLOC|__GFP_NOWARN.  This combination
leaves the emergency reserves alone and can initiate reclaim, but will
not invoke the OOM killer.

Signed-off-by: Uladzislau Rezki (Sony) 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0ecc1fb..4120d4b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3463,7 +3463,7 @@ static void fill_page_cache_func(struct work_struct *work)
 
for (i = 0; i < rcu_min_cached_objs; i++) {
bnode = (struct kvfree_rcu_bulk_data *)
-   __get_free_page(GFP_KERNEL | __GFP_NOWARN);
+   __get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
 
if (bnode) {
raw_spin_lock_irqsave(&krcp->lock, flags);


[tip: core/rcu] rcuscale: Add kfree_rcu() single-argument scale test

2021-04-11 Thread tip-bot2 for Uladzislau Rezki (Sony)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 686fe1bf6bcce3ce9fc03c9d9035c643c320ca46
Gitweb:
https://git.kernel.org/tip/686fe1bf6bcce3ce9fc03c9d9035c643c320ca46
Author:Uladzislau Rezki (Sony) 
AuthorDate:Wed, 17 Feb 2021 19:51:10 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:18:07 -08:00

rcuscale: Add kfree_rcu() single-argument scale test

The single-argument variant of kfree_rcu() is currently not
tested by any member of the rcutoture test suite.  This
commit therefore adds rcuscale code to test it.  This
testing is controlled by two new boolean module parameters,
kfree_rcu_test_single and kfree_rcu_test_double.  If one
is set and the other not, only the corresponding variant
is tested, otherwise both are tested, with the variant to
be tested determined randomly on each invocation.

Both of these module parameters are initialized to false,
so setting either to true will test only that variant.

Suggested-by: Paul E. McKenney 
Signed-off-by: Uladzislau Rezki (Sony) 
Signed-off-by: Paul E. McKenney 
---
 Documentation/admin-guide/kernel-parameters.txt | 12 
 kernel/rcu/rcuscale.c   | 15 ++-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0454572..84fce41 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4259,6 +4259,18 @@
rcuscale.kfree_rcu_test= [KNL]
Set to measure performance of kfree_rcu() flooding.
 
+   rcuscale.kfree_rcu_test_double= [KNL]
+   Test the double-argument variant of kfree_rcu().
+   If this parameter has the same value as
+   rcuscale.kfree_rcu_test_single, both the single-
+   and double-argument variants are tested.
+
+   rcuscale.kfree_rcu_test_single= [KNL]
+   Test the single-argument variant of kfree_rcu().
+   If this parameter has the same value as
+   rcuscale.kfree_rcu_test_double, both the single-
+   and double-argument variants are tested.
+
rcuscale.kfree_nthreads= [KNL]
The number of threads running loops of kfree_rcu().
 
diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index 06491d5..dca51fe 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -625,6 +625,8 @@ rcu_scale_shutdown(void *arg)
 torture_param(int, kfree_nthreads, -1, "Number of threads running loops of kfree_rcu().");
 torture_param(int, kfree_alloc_num, 8000, "Number of allocations and frees done in an iteration.");
 torture_param(int, kfree_loops, 10, "Number of loops doing kfree_alloc_num allocations and frees.");
+torture_param(bool, kfree_rcu_test_double, false, "Do we run a kfree_rcu() double-argument scale test?");
+torture_param(bool, kfree_rcu_test_single, false, "Do we run a kfree_rcu() single-argument scale test?");
 
 static struct task_struct **kfree_reader_tasks;
 static int kfree_nrealthreads;
@@ -644,10 +646,13 @@ kfree_scale_thread(void *arg)
struct kfree_obj *alloc_ptr;
u64 start_time, end_time;
long long mem_begin, mem_during = 0;
+   bool kfree_rcu_test_both;
+   DEFINE_TORTURE_RANDOM(tr);
 
VERBOSE_SCALEOUT_STRING("kfree_scale_thread task started");
set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
set_user_nice(current, MAX_NICE);
+   kfree_rcu_test_both = (kfree_rcu_test_single == kfree_rcu_test_double);
 
start_time = ktime_get_mono_fast_ns();
 
@@ -670,7 +675,15 @@ kfree_scale_thread(void *arg)
if (!alloc_ptr)
return -ENOMEM;
 
-   kfree_rcu(alloc_ptr, rh);
+   // By default kfree_rcu_test_single and kfree_rcu_test_double are
+   // initialized to false. If both have the same value (false or true)
+   // both are randomly tested, otherwise only the one with value true
+   // is tested.
+   if ((kfree_rcu_test_single && !kfree_rcu_test_double) ||
+       (kfree_rcu_test_both && torture_random(&tr) & 0x800))
+   kfree_rcu(alloc_ptr);
+   else
+   kfree_rcu(alloc_ptr, rh);
}
 
cond_resched();


[PATCH-next 1/1] lib/test_vmalloc.c: extend max value of nr_threads parameter

2021-04-06 Thread Uladzislau Rezki (Sony)
Currently the maximum number of workers the user can create during
the test is capped at 1024. For some big systems that might not be
enough. Since this is a test module, we can give testers more
flexibility.

Increase that limit to USHRT_MAX, which corresponds to 65535.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 lib/test_vmalloc.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
index d337985e4c5e..01e9543de566 100644
--- a/lib/test_vmalloc.c
+++ b/lib/test_vmalloc.c
@@ -24,7 +24,7 @@
MODULE_PARM_DESC(name, msg) \
 
 __param(int, nr_threads, 0,
-   "Number of workers to perform tests(min: 1 max: 1024)");
+   "Number of workers to perform tests(min: 1 max: USHRT_MAX)");
 
 __param(bool, sequential_test_order, false,
"Use sequential stress tests order");
@@ -469,13 +469,13 @@ init_test_configurtion(void)
 {
/*
 * A maximum number of workers is defined as hard-coded
-* value and set to 1024. We add such gap just in case
-* and for potential heavy stressing.
+* value and set to USHRT_MAX. We add such gap just in
+* case and for potential heavy stressing.
 */
-   nr_threads = clamp(nr_threads, 1, 1024);
+   nr_threads = clamp(nr_threads, 1, (int) USHRT_MAX);
 
/* Allocate the space for test instances. */
-   tdriver = kcalloc(nr_threads, sizeof(*tdriver), GFP_KERNEL);
+   tdriver = kvcalloc(nr_threads, sizeof(*tdriver), GFP_KERNEL);
if (tdriver == NULL)
return -1;
 
@@ -555,7 +555,7 @@ static void do_concurrent_test(void)
i, t->stop - t->start);
}
 
-   kfree(tdriver);
+   kvfree(tdriver);
 }
 
 static int vmalloc_test_init(void)
-- 
2.20.1

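The kcalloc() to kvcalloc() switch in this patch is what makes the larger limit practical: with up to USHRT_MAX workers the per-worker array can reach several megabytes. Below is a minimal sketch of that allocation pattern, using an illustrative stand-in for the driver's internal structure rather than the real struct test_driver.

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/sched.h>

/* Illustrative stand-in for the driver's per-worker bookkeeping record. */
struct worker_slot {
	struct task_struct *task;
	unsigned long start;
	unsigned long stop;
};

static struct worker_slot *alloc_slots(int nr_threads)
{
	/*
	 * kvcalloc() tries a kmalloc()-style allocation first and falls back
	 * to vmalloc() for large requests, so nr_threads up to USHRT_MAX does
	 * not depend on finding large physically contiguous regions.
	 */
	return kvcalloc(nr_threads, sizeof(struct worker_slot), GFP_KERNEL);
}

static void free_slots(struct worker_slot *slots)
{
	kvfree(slots);	/* pairs with kvcalloc()/kvmalloc() */
}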


Re: [PATCH-next 2/5] lib/test_vmalloc.c: add a new 'nr_threads' parameter

2021-04-06 Thread Uladzislau Rezki
On Mon, Apr 05, 2021 at 07:39:20PM -0700, Andrew Morton wrote:
> On Sat, 3 Apr 2021 14:31:43 +0200 Uladzislau Rezki  wrote:
> 
> > > 
> > > We may need to replaced that kcalloc() with kmvalloc() though...
> > >
> > Yep. If we limit to USHRT_MAX, the maximum amount of memory for
> > internal data would be ~12MB. Something like below:
> > 
> > diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
> > index d337985e4c5e..a5103e3461bf 100644
> > --- a/lib/test_vmalloc.c
> > +++ b/lib/test_vmalloc.c
> > @@ -24,7 +24,7 @@
> > MODULE_PARM_DESC(name, msg) \
> > 
> >  __param(int, nr_threads, 0,
> > -   "Number of workers to perform tests(min: 1 max: 1024)");
> > +   "Number of workers to perform tests(min: 1 max: 65536)");
> > 
> >  __param(bool, sequential_test_order, false,
> > "Use sequential stress tests order");
> > @@ -469,13 +469,13 @@ init_test_configurtion(void)
> >  {
> > /*
> >  * A maximum number of workers is defined as hard-coded
> > -* value and set to 1024. We add such gap just in case
> > +* value and set to 65536. We add such gap just in case
> >  * and for potential heavy stressing.
> >  */
> > -   nr_threads = clamp(nr_threads, 1, 1024);
> > +   nr_threads = clamp(nr_threads, 1, 65536);
> > 
> > /* Allocate the space for test instances. */
> > -   tdriver = kcalloc(nr_threads, sizeof(*tdriver), GFP_KERNEL);
> > +   tdriver = kvcalloc(nr_threads, sizeof(*tdriver), GFP_KERNEL);
> > if (tdriver == NULL)
> > return -1;
> > 
> > @@ -555,7 +555,7 @@ static void do_concurrent_test(void)
> > i, t->stop - t->start);
> > }
> > 
> > -   kfree(tdriver);
> > +   kvfree(tdriver);
> >  }
> > 
> >  static int vmalloc_test_init(void)
> > 
> > Does it sound reasonable for you?
> 
> I think so.  It's a test thing so let's give testers more flexibility,
> remembering that they don't need as much protection from their own
> mistakes.
> 
OK. I will send one extra patch on top then.

--
Vlad Rezki


Re: [PATCH-next 2/5] lib/test_vmalloc.c: add a new 'nr_threads' parameter

2021-04-03 Thread Uladzislau Rezki
> On Fri,  2 Apr 2021 22:22:34 +0200 "Uladzislau Rezki (Sony)" 
>  wrote:
> 
> > By using this parameter we can specify how many workers are
> > created to perform vmalloc tests. By default it is one CPU.
> > The maximum value is set to 1024.
> > 
> > As a result of this change a 'single_cpu_test' one becomes
> > obsolete, therefore it is no longer needed.
> > 
> 
> Why limit to 1024?  Maybe testers want more - what's the downside to
> permitting that?
>
I was thinking mainly about a tester issuing an enormous number of
kthreads, so that a system is not able to handle it. Therefore I
clamped that value to 1024.

On the other hand, we can grant wider permissions; in that case a
user should think more carefully about what is passed. For example,
we can limit the maximum value to USHRT_MAX, which is 65535.

> 
> We may need to replaced that kcalloc() with kmvalloc() though...
>
Yep. If we limit to USHRT_MAX, the maximum amount of memory for
internal data would be ~12MB. Something like below:

diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
index d337985e4c5e..a5103e3461bf 100644
--- a/lib/test_vmalloc.c
+++ b/lib/test_vmalloc.c
@@ -24,7 +24,7 @@
MODULE_PARM_DESC(name, msg) \

 __param(int, nr_threads, 0,
-   "Number of workers to perform tests(min: 1 max: 1024)");
+   "Number of workers to perform tests(min: 1 max: 65536)");

 __param(bool, sequential_test_order, false,
"Use sequential stress tests order");
@@ -469,13 +469,13 @@ init_test_configurtion(void)
 {
/*
 * A maximum number of workers is defined as hard-coded
-* value and set to 1024. We add such gap just in case
+* value and set to 65536. We add such gap just in case
 * and for potential heavy stressing.
 */
-   nr_threads = clamp(nr_threads, 1, 1024);
+   nr_threads = clamp(nr_threads, 1, 65536);

/* Allocate the space for test instances. */
-   tdriver = kcalloc(nr_threads, sizeof(*tdriver), GFP_KERNEL);
+   tdriver = kvcalloc(nr_threads, sizeof(*tdriver), GFP_KERNEL);
if (tdriver == NULL)
return -1;

@@ -555,7 +555,7 @@ static void do_concurrent_test(void)
i, t->stop - t->start);
}

-   kfree(tdriver);
+   kvfree(tdriver);
 }

 static int vmalloc_test_init(void)

Does it sound reasonable for you?

--
Vlad Rezki


Re: [PATCH-next 5/5] mm/vmalloc: remove an empty line

2021-04-02 Thread Uladzislau Rezki
> On Sat, Apr 3, 2021 at 1:53 AM Uladzislau Rezki (Sony)  
> wrote:
> >
> > Signed-off-by: Uladzislau Rezki (Sony) 
> 
> How about merging it with patch [4/5] ?
> 
I had that concern in mind. Yes, we can squash it into [4/5].
If there are other comments I will rework the whole series; if not,
we can ask Andrew to merge it.

Thank you.

--
Vlad Rezki


[PATCH-next 5/5] mm/vmalloc: remove an empty line

2021-04-02 Thread Uladzislau Rezki (Sony)
Signed-off-by: Uladzislau Rezki (Sony) 
---
 mm/vmalloc.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 093c7e034ca2..1e643280cbcf 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1503,7 +1503,6 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
va->va_end = addr + size;
va->vm = NULL;
 
-
spin_lock(&vmap_area_lock);
insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
spin_unlock(&vmap_area_lock);
-- 
2.20.1



[PATCH-next 4/5] mm/vmalloc: refactor the preloading logic

2021-04-02 Thread Uladzislau Rezki (Sony)
Instead of keeping an open-coded style, move the code related to
preloading into a separate function. Therefore introduce the
preload_this_cpu_lock() routine, which preloads the current CPU
with one extra vmap_area object.

There is no functional change as a result of this patch.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 mm/vmalloc.c | 60 +++-
 1 file changed, 27 insertions(+), 33 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 8b564f91a610..093c7e034ca2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1430,6 +1430,29 @@ static void free_vmap_area(struct vmap_area *va)
spin_unlock(&free_vmap_area_lock);
 }
 
+static inline void
+preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
+{
+   struct vmap_area *va = NULL;
+
+   /*
+* Preload this CPU with one extra vmap_area object. It is used
+* when fit type of free area is NE_FIT_TYPE. It guarantees that
+* a CPU that does an allocation is preloaded.
+*
+* We do it in non-atomic context, thus it allows us to use more
+* permissive allocation masks to be more stable under low memory
+* condition and high memory pressure.
+*/
+   if (!this_cpu_read(ne_fit_preload_node))
+   va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
+
+   spin_lock(lock);
+
+   if (va && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, va))
+   kmem_cache_free(vmap_area_cachep, va);
+}
+
 /*
  * Allocate a region of KVA of the specified size and alignment, within the
  * vstart and vend.
@@ -1439,7 +1462,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long 
size,
unsigned long vstart, unsigned long vend,
int node, gfp_t gfp_mask)
 {
-   struct vmap_area *va, *pva;
+   struct vmap_area *va;
unsigned long addr;
int purged = 0;
int ret;
@@ -1465,43 +1488,14 @@ static struct vmap_area *alloc_vmap_area(unsigned long 
size,
kmemleak_scan_area(>rb_node, SIZE_MAX, gfp_mask);
 
 retry:
-   /*
-* Preload this CPU with one extra vmap_area object. It is used
-* when fit type of free area is NE_FIT_TYPE. Please note, it
-* does not guarantee that an allocation occurs on a CPU that
-* is preloaded, instead we minimize the case when it is not.
-* It can happen because of cpu migration, because there is a
-* race until the below spinlock is taken.
-*
-* The preload is done in non-atomic context, thus it allows us
-* to use more permissive allocation masks to be more stable under
-* low memory condition and high memory pressure. In rare case,
-* if not preloaded, GFP_NOWAIT is used.
-*
-* Set "pva" to NULL here, because of "retry" path.
-*/
-   pva = NULL;
-
-   if (!this_cpu_read(ne_fit_preload_node))
-   /*
-* Even if it fails we do not really care about that.
-* Just proceed as it is. If needed "overflow" path
-* will refill the cache we allocate from.
-*/
-   pva = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
-
-   spin_lock(_vmap_area_lock);
-
-   if (pva && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva))
-   kmem_cache_free(vmap_area_cachep, pva);
+   preload_this_cpu_lock(_vmap_area_lock, gfp_mask, node);
+   addr = __alloc_vmap_area(size, align, vstart, vend);
+   spin_unlock(_vmap_area_lock);
 
/*
 * If an allocation fails, the "vend" address is
 * returned. Therefore trigger the overflow path.
 */
-   addr = __alloc_vmap_area(size, align, vstart, vend);
-   spin_unlock(_vmap_area_lock);
-
if (unlikely(addr == vend))
goto overflow;
 
-- 
2.20.1



[PATCH-next 3/5] vm/test_vmalloc.sh: adapt for updated driver interface

2021-04-02 Thread Uladzislau Rezki (Sony)
The 'single_cpu_test' parameter is odd and it does not exist anymore.
Instead, an 'nr_threads' parameter has been introduced. If it is not
set, the behaviour matches the former parameter.

Therefore update the "stress mode" according to this change, specifying
a number of workers equal to the number of CPUs. Also update the help
message output to reflect the new interface.

CC: linux-kselft...@vger.kernel.org
CC: Shuah Khan 
Signed-off-by: Uladzislau Rezki (Sony) 
---
 tools/testing/selftests/vm/test_vmalloc.sh | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/vm/test_vmalloc.sh 
b/tools/testing/selftests/vm/test_vmalloc.sh
index 06d2bb109f06..d73b846736f1 100755
--- a/tools/testing/selftests/vm/test_vmalloc.sh
+++ b/tools/testing/selftests/vm/test_vmalloc.sh
@@ -11,6 +11,7 @@
 
 TEST_NAME="vmalloc"
 DRIVER="test_${TEST_NAME}"
+NUM_CPUS=`grep -c ^processor /proc/cpuinfo`
 
 # 1 if fails
 exitcode=1
@@ -22,9 +23,9 @@ ksft_skip=4
 # Static templates for performance, stressing and smoke tests.
 # Also it is possible to pass any supported parameters manualy.
 #
-PERF_PARAM="single_cpu_test=1 sequential_test_order=1 test_repeat_count=3"
-SMOKE_PARAM="single_cpu_test=1 test_loop_count=1 test_repeat_count=10"
-STRESS_PARAM="test_repeat_count=20"
+PERF_PARAM="sequential_test_order=1 test_repeat_count=3"
+SMOKE_PARAM="test_loop_count=1 test_repeat_count=10"
+STRESS_PARAM="nr_threads=$NUM_CPUS test_repeat_count=20"
 
 check_test_requirements()
 {
@@ -58,8 +59,8 @@ run_perfformance_check()
 
 run_stability_check()
 {
-   echo "Run stability tests. In order to stress vmalloc subsystem we run"
-   echo "all available test cases on all available CPUs simultaneously."
+   echo "Run stability tests. In order to stress vmalloc subsystem all"
+   echo "available test cases are run by NUM_CPUS workers simultaneously."
echo "It will take time, so be patient."
 
modprobe $DRIVER $STRESS_PARAM > /dev/null 2>&1
@@ -92,17 +93,17 @@ usage()
echo "# Shows help message"
echo "./${DRIVER}.sh"
echo
-   echo "# Runs 1 test(id_1), repeats it 5 times on all online CPUs"
-   echo "./${DRIVER}.sh run_test_mask=1 test_repeat_count=5"
+   echo "# Runs 1 test(id_1), repeats it 5 times by NUM_CPUS workers"
+   echo "./${DRIVER}.sh nr_threads=$NUM_CPUS run_test_mask=1 
test_repeat_count=5"
echo
echo -n "# Runs 4 tests(id_1|id_2|id_4|id_16) on one CPU with "
echo "sequential order"
-   echo -n "./${DRIVER}.sh single_cpu_test=1 sequential_test_order=1 "
+   echo -n "./${DRIVER}.sh sequential_test_order=1 "
echo "run_test_mask=23"
echo
-   echo -n "# Runs all tests on all online CPUs, shuffled order, repeats "
+   echo -n "# Runs all tests by NUM_CPUS workers, shuffled order, repeats "
echo "20 times"
-   echo "./${DRIVER}.sh test_repeat_count=20"
+   echo "./${DRIVER}.sh nr_threads=$NUM_CPUS test_repeat_count=20"
echo
echo "# Performance analysis"
echo "./${DRIVER}.sh performance"
-- 
2.20.1



[PATCH-next 2/5] lib/test_vmalloc.c: add a new 'nr_threads' parameter

2021-04-02 Thread Uladzislau Rezki (Sony)
By using this parameter we can specify how many workers are created to
perform the vmalloc tests. By default a single worker is used. The
maximum value is capped at 1024.

As a result of this change the 'single_cpu_test' parameter becomes
obsolete, therefore it is no longer needed.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 lib/test_vmalloc.c | 88 +-
 1 file changed, 40 insertions(+), 48 deletions(-)

diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
index 4eb6abdaa74e..d337985e4c5e 100644
--- a/lib/test_vmalloc.c
+++ b/lib/test_vmalloc.c
@@ -23,8 +23,8 @@
module_param(name, type, 0444); \
MODULE_PARM_DESC(name, msg) \
 
-__param(bool, single_cpu_test, false,
-   "Use single first online CPU to run tests");
+__param(int, nr_threads, 0,
+   "Number of workers to perform tests(min: 1 max: 1024)");
 
 __param(bool, sequential_test_order, false,
"Use sequential stress tests order");
@@ -50,13 +50,6 @@ __param(int, run_test_mask, INT_MAX,
/* Add a new test case description here. */
 );
 
-/*
- * Depends on single_cpu_test parameter. If it is true, then
- * use first online CPU to trigger a test on, otherwise go with
- * all online CPUs.
- */
-static cpumask_t cpus_run_test_mask = CPU_MASK_NONE;
-
 /*
  * Read write semaphore for synchronization of setup
  * phase that is done in main thread and workers.
@@ -386,16 +379,13 @@ struct test_case_data {
u64 time;
 };
 
-/* Split it to get rid of: WARNING: line over 80 characters */
-static struct test_case_data
-   per_cpu_test_data[NR_CPUS][ARRAY_SIZE(test_case_array)];
-
 static struct test_driver {
struct task_struct *task;
+   struct test_case_data data[ARRAY_SIZE(test_case_array)];
+
unsigned long start;
unsigned long stop;
-   int cpu;
-} per_cpu_test_driver[NR_CPUS];
+} *tdriver;
 
 static void shuffle_array(int *arr, int n)
 {
@@ -423,9 +413,6 @@ static int test_func(void *private)
ktime_t kt;
u64 delta;
 
-   if (set_cpus_allowed_ptr(current, cpumask_of(t->cpu)) < 0)
-   pr_err("Failed to set affinity to %d CPU\n", t->cpu);
-
for (i = 0; i < ARRAY_SIZE(test_case_array); i++)
random_array[i] = i;
 
@@ -450,9 +437,9 @@ static int test_func(void *private)
kt = ktime_get();
for (j = 0; j < test_repeat_count; j++) {
if (!test_case_array[index].test_func())
-   per_cpu_test_data[t->cpu][index].test_passed++;
+   t->data[index].test_passed++;
else
-   per_cpu_test_data[t->cpu][index].test_failed++;
+   t->data[index].test_failed++;
}
 
/*
@@ -461,7 +448,7 @@ static int test_func(void *private)
delta = (u64) ktime_us_delta(ktime_get(), kt);
do_div(delta, (u32) test_repeat_count);
 
-   per_cpu_test_data[t->cpu][index].time = delta;
+   t->data[index].time = delta;
}
t->stop = get_cycles();
 
@@ -477,53 +464,56 @@ static int test_func(void *private)
return 0;
 }
 
-static void
+static int
 init_test_configurtion(void)
 {
/*
-* Reset all data of all CPUs.
+* A maximum number of workers is defined as hard-coded
+* value and set to 1024. We add such gap just in case
+* and for potential heavy stressing.
 */
-   memset(per_cpu_test_data, 0, sizeof(per_cpu_test_data));
+   nr_threads = clamp(nr_threads, 1, 1024);
 
-   if (single_cpu_test)
-   cpumask_set_cpu(cpumask_first(cpu_online_mask),
-   _run_test_mask);
-   else
-   cpumask_and(_run_test_mask, cpu_online_mask,
-   cpu_online_mask);
+   /* Allocate the space for test instances. */
+   tdriver = kcalloc(nr_threads, sizeof(*tdriver), GFP_KERNEL);
+   if (tdriver == NULL)
+   return -1;
 
if (test_repeat_count <= 0)
test_repeat_count = 1;
 
if (test_loop_count <= 0)
test_loop_count = 1;
+
+   return 0;
 }
 
 static void do_concurrent_test(void)
 {
-   int cpu, ret;
+   int i, ret;
 
/*
 * Set some basic configurations plus sanity check.
 */
-   init_test_configurtion();
+   ret = init_test_configurtion();
+   if (ret < 0)
+   return;
 
/*
 * Put on hold all workers.
 */
down_write(_for_test_rwsem);
 
-   for_each_cpu(cpu, _run_test_mask) {
-   struct test_driver *t = _cpu_test_driver[cpu];
+   for (i = 0; i < nr_threads; i++) {
+   struct test_driver *t = [i];
 
-   t->cpu = cpu;
-  

[PATCH-next 1/5] lib/test_vmalloc.c: remove two kvfree_rcu() tests

2021-04-02 Thread Uladzislau Rezki (Sony)
Remove two test cases related to kvfree_rcu() and SLAB. They are
considered redundant now, because similar test functionality has
recently been introduced in the "rcuscale" RCU test suite.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 lib/test_vmalloc.c | 40 
 1 file changed, 40 deletions(-)

diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
index 5cf2fe9aab9e..4eb6abdaa74e 100644
--- a/lib/test_vmalloc.c
+++ b/lib/test_vmalloc.c
@@ -47,8 +47,6 @@ __param(int, run_test_mask, INT_MAX,
"\t\tid: 128,  name: pcpu_alloc_test\n"
"\t\tid: 256,  name: kvfree_rcu_1_arg_vmalloc_test\n"
"\t\tid: 512,  name: kvfree_rcu_2_arg_vmalloc_test\n"
-   "\t\tid: 1024, name: kvfree_rcu_1_arg_slab_test\n"
-   "\t\tid: 2048, name: kvfree_rcu_2_arg_slab_test\n"
/* Add a new test case description here. */
 );
 
@@ -363,42 +361,6 @@ kvfree_rcu_2_arg_vmalloc_test(void)
return 0;
 }
 
-static int
-kvfree_rcu_1_arg_slab_test(void)
-{
-   struct test_kvfree_rcu *p;
-   int i;
-
-   for (i = 0; i < test_loop_count; i++) {
-   p = kmalloc(sizeof(*p), GFP_KERNEL);
-   if (!p)
-   return -1;
-
-   p->array[0] = 'a';
-   kvfree_rcu(p);
-   }
-
-   return 0;
-}
-
-static int
-kvfree_rcu_2_arg_slab_test(void)
-{
-   struct test_kvfree_rcu *p;
-   int i;
-
-   for (i = 0; i < test_loop_count; i++) {
-   p = kmalloc(sizeof(*p), GFP_KERNEL);
-   if (!p)
-   return -1;
-
-   p->array[0] = 'a';
-   kvfree_rcu(p, rcu);
-   }
-
-   return 0;
-}
-
 struct test_case_desc {
const char *test_name;
int (*test_func)(void);
@@ -415,8 +377,6 @@ static struct test_case_desc test_case_array[] = {
{ "pcpu_alloc_test", pcpu_alloc_test },
{ "kvfree_rcu_1_arg_vmalloc_test", kvfree_rcu_1_arg_vmalloc_test },
{ "kvfree_rcu_2_arg_vmalloc_test", kvfree_rcu_2_arg_vmalloc_test },
-   { "kvfree_rcu_1_arg_slab_test", kvfree_rcu_1_arg_slab_test },
-   { "kvfree_rcu_2_arg_slab_test", kvfree_rcu_2_arg_slab_test },
/* Add a new test case here. */
 };
 
-- 
2.20.1



Re: [PATCH][next] mm/vmalloc: Fix read of pointer area after it has been free'd

2021-03-29 Thread Uladzislau Rezki
> On Mon, Mar 29, 2021 at 08:14:53PM +0200, Uladzislau Rezki wrote:
> > On Mon, Mar 29, 2021 at 07:40:29PM +0200, Uladzislau Rezki wrote:
> > > On Mon, Mar 29, 2021 at 06:14:34PM +0100, Matthew Wilcox wrote:
> > > > On Mon, Mar 29, 2021 at 06:07:30PM +0100, Colin King wrote:
> > > > > From: Colin Ian King 
> > > > > 
> > > > > Currently the memory pointed to by area is being freed by the
> > > > > free_vm_area call and then area->nr_pages is referencing the
> > > > > free'd object. Fix this swapping the order of the warn_alloc
> > > > > message and the free.
> > > > > 
> > > > > Addresses-Coverity: ("Read from pointer after free")
> > > > > Fixes: 014ccf9b888d ("mm/vmalloc: improve allocation failure error 
> > > > > messages")
> > > > 
> > > > i don't have this git sha.  if this is -next, the sha ids aren't stable
> > > > and shouldn't be referenced in commit logs, because these fixes should
> > > > just be squashed into the not-yet-upstream commits.
> > > > 
> > > > > Signed-off-by: Colin Ian King 
> > > > > ---
> > > > >  mm/vmalloc.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index b73e4e715e0d..7936405749e4 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > > @@ -2790,11 +2790,11 @@ static void *__vmalloc_area_node(struct 
> > > > > vm_struct *area, gfp_t gfp_mask,
> > > > >   }
> > > > >  
> > > > >   if (!pages) {
> > > > > - free_vm_area(area);
> > > > >   warn_alloc(gfp_mask, NULL,
> > > > >  "vmalloc size %lu allocation failure: "
> > > > >  "page array size %lu allocation failed",
> > > > >  area->nr_pages * PAGE_SIZE, array_size);
> > > > > + free_vm_area(area);
> > > > >   return NULL;
> > > > 
> > > > this fix looks right to me.
> > > > 
> > > That is from the linux-next. Same to me.
> > > 
> > > Reviewed-by: Uladzislau Rezki (Sony) 
> > > 
> > > --
> > > Vlad Rezki
> > Is the linux-next(next-20210329) broken?
> > 
> Please ignore my previous email. That was due to my local "stashed" change.
> 
Hello, Andrew.

Could you please squash the patch below with the one in question?
Or should I send it out as a separate patch?

From 6d1c221fec4718094c6e825e3879a76ad70dba93 Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)" 
Date: Mon, 29 Mar 2021 21:12:47 +0200
Subject: [PATCH] mm/vmalloc: print correct vmalloc allocation size

On entry, area->nr_pages is not set yet and is zero, thus when the
allocation of the page array fails the vmalloc size is not reflected
correctly in the error message.

Replace area->nr_pages with nr_small_pages.

Fixes: 014ccf9b888d ("mm/vmalloc: improve allocation failure error messages")
Signed-off-by: Uladzislau Rezki (Sony) 
---
 mm/vmalloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b73e4e715e0d..8b564f91a610 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2794,7 +2794,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, 
gfp_t gfp_mask,
warn_alloc(gfp_mask, NULL,
   "vmalloc size %lu allocation failure: "
   "page array size %lu allocation failed",
-  area->nr_pages * PAGE_SIZE, array_size);
+  nr_small_pages * PAGE_SIZE, array_size);
return NULL;
}
 
-- 
2.20.1

--
Vlad Rezki


Re: [PATCH][next] mm/vmalloc: Fix read of pointer area after it has been free'd

2021-03-29 Thread Uladzislau Rezki
On Mon, Mar 29, 2021 at 08:14:53PM +0200, Uladzislau Rezki wrote:
> On Mon, Mar 29, 2021 at 07:40:29PM +0200, Uladzislau Rezki wrote:
> > On Mon, Mar 29, 2021 at 06:14:34PM +0100, Matthew Wilcox wrote:
> > > On Mon, Mar 29, 2021 at 06:07:30PM +0100, Colin King wrote:
> > > > From: Colin Ian King 
> > > > 
> > > > Currently the memory pointed to by area is being freed by the
> > > > free_vm_area call and then area->nr_pages is referencing the
> > > > free'd object. Fix this swapping the order of the warn_alloc
> > > > message and the free.
> > > > 
> > > > Addresses-Coverity: ("Read from pointer after free")
> > > > Fixes: 014ccf9b888d ("mm/vmalloc: improve allocation failure error 
> > > > messages")
> > > 
> > > i don't have this git sha.  if this is -next, the sha ids aren't stable
> > > and shouldn't be referenced in commit logs, because these fixes should
> > > just be squashed into the not-yet-upstream commits.
> > > 
> > > > Signed-off-by: Colin Ian King 
> > > > ---
> > > >  mm/vmalloc.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index b73e4e715e0d..7936405749e4 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -2790,11 +2790,11 @@ static void *__vmalloc_area_node(struct 
> > > > vm_struct *area, gfp_t gfp_mask,
> > > > }
> > > >  
> > > > if (!pages) {
> > > > -   free_vm_area(area);
> > > > warn_alloc(gfp_mask, NULL,
> > > >        "vmalloc size %lu allocation failure: "
> > > >"page array size %lu allocation failed",
> > > >area->nr_pages * PAGE_SIZE, array_size);
> > > > +   free_vm_area(area);
> > > > return NULL;
> > > 
> > > this fix looks right to me.
> > > 
> > That is from the linux-next. Same to me.
> > 
> > Reviewed-by: Uladzislau Rezki (Sony) 
> > 
> > --
> > Vlad Rezki
> Is the linux-next(next-20210329) broken?
> 
Please ignore my previous email. That was due to my local "stashed" change.

--
Vlad Rezki


Re: [PATCH][next] mm/vmalloc: Fix read of pointer area after it has been free'd

2021-03-29 Thread Uladzislau Rezki
On Mon, Mar 29, 2021 at 07:40:29PM +0200, Uladzislau Rezki wrote:
> On Mon, Mar 29, 2021 at 06:14:34PM +0100, Matthew Wilcox wrote:
> > On Mon, Mar 29, 2021 at 06:07:30PM +0100, Colin King wrote:
> > > From: Colin Ian King 
> > > 
> > > Currently the memory pointed to by area is being freed by the
> > > free_vm_area call and then area->nr_pages is referencing the
> > > free'd object. Fix this swapping the order of the warn_alloc
> > > message and the free.
> > > 
> > > Addresses-Coverity: ("Read from pointer after free")
> > > Fixes: 014ccf9b888d ("mm/vmalloc: improve allocation failure error 
> > > messages")
> > 
> > i don't have this git sha.  if this is -next, the sha ids aren't stable
> > and shouldn't be referenced in commit logs, because these fixes should
> > just be squashed into the not-yet-upstream commits.
> > 
> > > Signed-off-by: Colin Ian King 
> > > ---
> > >  mm/vmalloc.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index b73e4e715e0d..7936405749e4 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2790,11 +2790,11 @@ static void *__vmalloc_area_node(struct vm_struct 
> > > *area, gfp_t gfp_mask,
> > >   }
> > >  
> > >   if (!pages) {
> > > - free_vm_area(area);
> > >   warn_alloc(gfp_mask, NULL,
> > >  "vmalloc size %lu allocation failure: "
> > >  "page array size %lu allocation failed",
> > >  area->nr_pages * PAGE_SIZE, array_size);
> > > + free_vm_area(area);
> > >   return NULL;
> > 
> > this fix looks right to me.
> > 
> That is from the linux-next. Same to me.
> 
> Reviewed-by: Uladzislau Rezki (Sony) 
> 
> --
> Vlad Rezki
Is linux-next (next-20210329) broken?

@pc638:~/data/raid0/coding/linux-next.git$ git branch 
  master
  next-20210225
  next-20210326
* next-20210329
urezki@pc638:~/data/raid0/coding/linux-next.git$ ../run_linux.sh 
./arch/x86_64/boot/bzImage
File ‘quantal-trinity-x86_64.cgz’ already there; not retrieving.

early console in setup code
Probing EDD (edd=off to disable)... ok
[0.00] Linux version 5.12.0-rc4-next-20210329+ (urezki@pc638) (gcc 
(Debian 8.3.0-6) 8.3.0, GNU ld (GNU Binutils for Debian) 2.31.1) #497 SMP 
PREEMPT Mon Mar 29 19:59:25 CEST 2021
[0.00] Command line: root=/dev/ram0 hung_task_panic=1 debug apic=debug 
sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 net.ifnames=0 
printk.devkmsg=on panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic 
load_ramdisk=2 prompt_ramdisk=0 drbd.minor_count=8 systemd.log_level=err 
ignore_loglevel console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 
vga=normal rw rcuperf.shutdown=0 watchdog_thresh=60 run_self_test=1 threadirqs
[0.00] x86/fpu: x87 FPU will use FXSAVE
[0.00] BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0xbffd] usable
[0.00] BIOS-e820: [mem 0xbffe-0xbfff] reserved
[0.00] BIOS-e820: [mem 0xfeffc000-0xfeff] reserved
[0.00] BIOS-e820: [mem 0xfffc-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00023fff] usable
[0.00] printk: debug: ignoring loglevel setting.
[0.00] printk: bootconsole [earlyser0] enabled
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.8 present.
[0.00] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 
04/01/2014
[0.00] Hypervisor detected: KVM
[0.00] kvm-clock: Using msrs 4b564d01 and 4b564d00
[0.00] kvm-clock: cpu 0, msr 1e96be001, primary cpu clock
[0.00] kvm-clock: using sched offset of 1302025291 cycles
[0.000492] clocksource: kvm-clock: mask: 0x max_cycles: 
0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[0.001870] tsc: Detected 3700.134 MHz processor
[0.002721] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.003285] e820: remove [mem 0x000a-0x000f] usable
[0.028860] AGP: No AGP bridge found
[0.029382] last_pfn = 0x24 max_arch_pfn = 0x4
[0.029868] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
Memory KASLR using RDTSC...
[0.030657] last_pfn = 0xbffe0 max_arch_pfn = 

Re: [PATCH][next] mm/vmalloc: Fix read of pointer area after it has been free'd

2021-03-29 Thread Uladzislau Rezki
On Mon, Mar 29, 2021 at 06:14:34PM +0100, Matthew Wilcox wrote:
> On Mon, Mar 29, 2021 at 06:07:30PM +0100, Colin King wrote:
> > From: Colin Ian King 
> > 
> > Currently the memory pointed to by area is being freed by the
> > free_vm_area call and then area->nr_pages is referencing the
> > free'd object. Fix this swapping the order of the warn_alloc
> > message and the free.
> > 
> > Addresses-Coverity: ("Read from pointer after free")
> > Fixes: 014ccf9b888d ("mm/vmalloc: improve allocation failure error 
> > messages")
> 
> i don't have this git sha.  if this is -next, the sha ids aren't stable
> and shouldn't be referenced in commit logs, because these fixes should
> just be squashed into the not-yet-upstream commits.
> 
> > Signed-off-by: Colin Ian King 
> > ---
> >  mm/vmalloc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index b73e4e715e0d..7936405749e4 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2790,11 +2790,11 @@ static void *__vmalloc_area_node(struct vm_struct 
> > *area, gfp_t gfp_mask,
> > }
> >  
> > if (!pages) {
> > -   free_vm_area(area);
> > warn_alloc(gfp_mask, NULL,
> >"vmalloc size %lu allocation failure: "
> >"page array size %lu allocation failed",
> >    area->nr_pages * PAGE_SIZE, array_size);
> > +   free_vm_area(area);
> > return NULL;
> 
> this fix looks right to me.
> 
That sha is from linux-next. The fix looks right to me too.

Reviewed-by: Uladzislau Rezki (Sony) 

--
Vlad Rezki


Re: [PATCH 0/9 v6] Introduce a bulk order-0 page allocator with two in-tree users

2021-03-25 Thread Uladzislau Rezki
On Thu, Mar 25, 2021 at 02:26:24PM +, Mel Gorman wrote:
> On Thu, Mar 25, 2021 at 03:06:57PM +0100, Uladzislau Rezki wrote:
> > > On Thu, Mar 25, 2021 at 12:50:01PM +, Matthew Wilcox wrote:
> > > > On Thu, Mar 25, 2021 at 11:42:19AM +, Mel Gorman wrote:
> > > > > This series introduces a bulk order-0 page allocator with sunrpc and
> > > > > the network page pool being the first users. The implementation is not
> > > > > efficient as semantics needed to be ironed out first. If no other 
> > > > > semantic
> > > > > changes are needed, it can be made more efficient.  Despite that, this
> > > > > is a performance-related for users that require multiple pages for an
> > > > > operation without multiple round-trips to the page allocator. Quoting
> > > > > the last patch for the high-speed networking use-case
> > > > > 
> > > > > Kernel  XDP stats   CPU pps   
> > > > > Delta
> > > > > BaselineXDP-RX CPU  total   3,771,046   
> > > > > n/a
> > > > > ListXDP-RX CPU  total   3,940,242
> > > > > +4.49%
> > > > > Array   XDP-RX CPU  total   4,249,224   
> > > > > +12.68%
> > > > > 
> > > > > From the SUNRPC traces of svc_alloc_arg()
> > > > > 
> > > > >   Single page: 25.007 us per call over 532,571 calls
> > > > >   Bulk list:6.258 us per call over 517,034 calls
> > > > >   Bulk array:   4.590 us per call over 517,442 calls
> > > > > 
> > > > > Both potential users in this series are corner cases (NFS and 
> > > > > high-speed
> > > > > networks) so it is unlikely that most users will see any benefit in 
> > > > > the
> > > > > short term. Other potential other users are batch allocations for page
> > > > > cache readahead, fault around and SLUB allocations when high-order 
> > > > > pages
> > > > > are unavailable. It's unknown how much benefit would be seen by 
> > > > > converting
> > > > > multiple page allocation calls to a single batch or what difference 
> > > > > it may
> > > > > make to headline performance.
> > > > 
> > > > We have a third user, vmalloc(), with a 16% perf improvement.  I know 
> > > > the
> > > > email says 21% but that includes the 5% improvement from switching to
> > > > kvmalloc() to allocate area->pages.
> > > > 
> > > > https://lore.kernel.org/linux-mm/20210323133948.ga10...@pc638.lan/
> > > > 
> > > 
> > > That's fairly promising. Assuming the bulk allocator gets merged, it would
> > > make sense to add vmalloc on top. That's for bringing it to my attention
> > > because it's far more relevant than my imaginary potential use cases.
> > > 
> > For the vmalloc we should be able to allocating on a specific NUMA node,
> > at least the current interface takes it into account. As far as i see
> > the current interface allocate on a current node:
> > 
> > static inline unsigned long
> > alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page 
> > **page_array)
> > {
> > return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, 
> > page_array);
> > }
> > 
> > Or am i missing something?
> > 
> 
> No, you're not missing anything. Options would be to add a helper similar
> alloc_pages_node or to directly call __alloc_pages_bulk specifying a node
> and using GFP_THISNODE. prepare_alloc_pages() should pick the correct
> zonelist containing only the required node.
> 
IMHO, a helper, something like a *_node() variant, would be reasonable.
I see that many functions in "mm" have their own variants which
explicitly add a "_node()" suffix to signal to users that they are
NUMA-aware calls.
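
Just as an illustrative sketch (the name alloc_pages_bulk_array_node()
is only a suggestion here), it could mirror the existing
alloc_pages_bulk_array() wrapper but forward an explicit node id
instead of numa_mem_id():

static inline unsigned long
alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages,
			    struct page **page_array)
{
	/* Same as alloc_pages_bulk_array(), but NUMA-aware. */
	return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
}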

As for __alloc_pages_bulk(), I got it.

Thanks!

--
Vlad Rezki


Re: [PATCH 0/9 v6] Introduce a bulk order-0 page allocator with two in-tree users

2021-03-25 Thread Uladzislau Rezki
On Thu, Mar 25, 2021 at 02:09:27PM +, Matthew Wilcox wrote:
> On Thu, Mar 25, 2021 at 03:06:57PM +0100, Uladzislau Rezki wrote:
> > For the vmalloc we should be able to allocating on a specific NUMA node,
> > at least the current interface takes it into account. As far as i see
> > the current interface allocate on a current node:
> > 
> > static inline unsigned long
> > alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page 
> > **page_array)
> > {
> > return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, 
> > page_array);
> > }
> > 
> > Or am i missing something?
> 
> You can call __alloc_pages_bulk() directly; there's no need to indirect
> through alloc_pages_bulk_array().
>
OK. It is accessible then.

--
Vlad Rezki


Re: [PATCH 0/9 v6] Introduce a bulk order-0 page allocator with two in-tree users

2021-03-25 Thread Uladzislau Rezki
> On Thu, Mar 25, 2021 at 12:50:01PM +, Matthew Wilcox wrote:
> > On Thu, Mar 25, 2021 at 11:42:19AM +, Mel Gorman wrote:
> > > This series introduces a bulk order-0 page allocator with sunrpc and
> > > the network page pool being the first users. The implementation is not
> > > efficient as semantics needed to be ironed out first. If no other semantic
> > > changes are needed, it can be made more efficient.  Despite that, this
> > > is a performance-related for users that require multiple pages for an
> > > operation without multiple round-trips to the page allocator. Quoting
> > > the last patch for the high-speed networking use-case
> > > 
> > > Kernel  XDP stats   CPU pps   Delta
> > > BaselineXDP-RX CPU  total   3,771,046   n/a
> > > ListXDP-RX CPU  total   3,940,242+4.49%
> > > Array   XDP-RX CPU  total   4,249,224   +12.68%
> > > 
> > > From the SUNRPC traces of svc_alloc_arg()
> > > 
> > >   Single page: 25.007 us per call over 532,571 calls
> > >   Bulk list:6.258 us per call over 517,034 calls
> > >   Bulk array:   4.590 us per call over 517,442 calls
> > > 
> > > Both potential users in this series are corner cases (NFS and high-speed
> > > networks) so it is unlikely that most users will see any benefit in the
> > > short term. Other potential other users are batch allocations for page
> > > cache readahead, fault around and SLUB allocations when high-order pages
> > > are unavailable. It's unknown how much benefit would be seen by converting
> > > multiple page allocation calls to a single batch or what difference it may
> > > make to headline performance.
> > 
> > We have a third user, vmalloc(), with a 16% perf improvement.  I know the
> > email says 21% but that includes the 5% improvement from switching to
> > kvmalloc() to allocate area->pages.
> > 
> > https://lore.kernel.org/linux-mm/20210323133948.ga10...@pc638.lan/
> > 
> 
> That's fairly promising. Assuming the bulk allocator gets merged, it would
> make sense to add vmalloc on top. That's for bringing it to my attention
> because it's far more relevant than my imaginary potential use cases.
> 
For vmalloc we should be able to allocate on a specific NUMA node; at
least the current vmalloc interface takes a node into account. As far as
I can see, the bulk interface allocates on the current node:

static inline unsigned long
alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page 
**page_array)
{
return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, 
page_array);
}

Or am i missing something?

--
Vlad Rezki


Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-24 Thread Uladzislau Rezki
On Tue, Mar 23, 2021 at 09:39:24PM +0100, Uladzislau Rezki wrote:
> > On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> > > On Mon, Mar 22, 2021 at 11:03:11PM +, Matthew Wilcox wrote:
> > > > I suspect the vast majority of the time is spent calling 
> > > > alloc_pages_node()
> > > > 1024 times.  Have you looked at Mel's patch to do ... well, exactly what
> > > > vmalloc() wants?
> > > > 
> > > 
> > >  - __vmalloc_node_range
> > > - 45.25% __alloc_pages_nodemask
> > >- 37.59% get_page_from_freelist
> > [...]
> > >   - 44.61% 0xc047348d
> > >  - __vunmap
> > > - 35.56% free_unref_page
> > 
> > Hmm!  I hadn't been thinking about the free side of things.
> > Does this make a difference?
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 4f5f8c907897..61d5b769fea0 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int 
> > deallocate_pages)
> > vm_remove_mappings(area, deallocate_pages);
> >  
> > if (deallocate_pages) {
> > -   int i;
> > -
> > -   for (i = 0; i < area->nr_pages; i++) {
> > -   struct page *page = area->pages[i];
> > -
> > -   BUG_ON(!page);
> > -   __free_pages(page, 0);
> > -   }
> > +   release_pages(area->pages, area->nr_pages);
> > atomic_long_sub(area->nr_pages, _vmalloc_pages);
> > -
> > kvfree(area->pages);
> > }
> >
> Same test. 4MB allocation on a single CPU:
> 
> default: loops: 100 avg: 93601889 usec
> patch:   loops: 100 avg: 98217904 usec
> 
> 
> - __vunmap
>- 41.17% free_unref_page
>   - 28.42% free_pcppages_bulk
>  - 6.38% __mod_zone_page_state
>   4.79% check_preemption_disabled
>2.63% __list_del_entry_valid
>2.63% __list_add_valid
>   - 7.50% free_unref_page_commit
>2.15% check_preemption_disabled
>2.01% __list_add_valid
> 2.31% free_unref_page_prepare.part.86
> 0.70% free_pcp_prepare
> 
> 
> 
> - __vunmap
>- 45.36% release_pages
>   - 37.70% free_unref_page_list
>  - 24.70% free_pcppages_bulk
> - 5.42% __mod_zone_page_state
>  4.23% check_preemption_disabled
>   2.31% __list_add_valid
>   2.07% __list_del_entry_valid
>  - 7.58% free_unref_page_commit
>   2.47% check_preemption_disabled
>   1.75% __list_add_valid
>3.43% free_unref_page_prepare.part.86
>   - 2.39% mem_cgroup_uncharge_list
>uncharge_page
> 
> 
> It is obvious that the default version is slightly better. It requires
> less things to be done comparing with release_pages() variant.
> 
> > 
> > release_pages does a bunch of checks that are unnecessary ... we could
> > probably just do:
> > 
> > LIST_HEAD(pages_to_free);
> > 
> > for (i = 0; i < area->nr_pages; i++) {
> > struct page *page = area->pages[i];
> > if (put_page_testzero(page))
> > list_add(>lru, _to_free);
> > }
> > free_unref_page_list(_to_free);
> > 
> > but let's see if the provided interface gets us the performance we want.
> >  
> I will test it tomorrow. From the first glance it looks like a more light 
> version :)
> 
Here we go:


diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4f5f8c907897..349024768ba6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2254,6 +2254,7 @@ static void vm_remove_mappings(struct vm_struct *area, 
int deallocate_pages)
 static void __vunmap(const void *addr, int deallocate_pages)
 {
struct vm_struct *area;
+   LIST_HEAD(pages_to_free);
 
if (!addr)
return;
@@ -2282,11 +2283,12 @@ static void __vunmap(const void *addr, int 
deallocate_pages)
for (i = 0; i < area->nr_pages; i++) {
struct page *page = area->pages[i];
 
-   BUG_ON(!page);
-  

Re: [PATCH] mm: vmalloc: Prevent use after free in _vm_unmap_aliases

2021-03-24 Thread Uladzislau Rezki
> 
> On 3/18/2021 10:29 PM, Uladzislau Rezki wrote:
> > On Thu, Mar 18, 2021 at 03:38:25PM +0530, vji...@codeaurora.org wrote:
> >> From: Vijayanand Jitta 
> >>
> >> A potential use after free can occur in _vm_unmap_aliases
> >> where an already freed vmap_area could be accessed, Consider
> >> the following scenario:
> >>
> >> Process 1  Process 2
> >>
> >> __vm_unmap_aliases __vm_unmap_aliases
> >>purge_fragmented_blocks_allcpus rcu_read_lock()
> >>rcu_read_lock()
> >>list_del_rcu(>free_list)
> >>
> >> list_for_each_entry_rcu(vb .. )
> >>__purge_vmap_area_lazy
> >>kmem_cache_free(va)
> >>
> >> va_start = vb->va->va_start
> > Or maybe we should switch to kfree_rcu() instead of kmem_cache_free()?
> > 
> > --
> > Vlad Rezki
> > 
> 
> Thanks for suggestion.
> 
> I see free_vmap_area_lock (spinlock) is taken in __purge_vmap_area_lazy
> while it loops through list and calls kmem_cache_free on va's. So, looks
> like we can't replace it with kfree_rcu as it might cause scheduling
> within atomic context.
> 
The double-argument form of kfree_rcu() is safe to use from atomic
contexts; it does not use any sleeping primitives, so kmem_cache_free()
can be replaced with it.
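
Just to illustrate why the double-argument form is atomic-safe, here is
a generic sketch (not the vmap_area change itself; "struct foo" and its
"rcu" member are made up for the example):

struct foo {
	struct rcu_head rcu;	/* embedded, so no allocation at call time */
	int data;
};

static void drop_foo(struct foo *p)
{
	kfree_rcu(p, rcu);	/* only queues an RCU callback, never sleeps */
}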

On the other hand, I see that the per-cpu KVA allocator is the only RCU
user here, and your change fixes it. Feel free to use:

Reviewed-by: Uladzislau Rezki (Sony) 

Thanks.

--
Vlad Rezki


Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-23 Thread Uladzislau Rezki
On Tue, Mar 23, 2021 at 02:07:22PM +, Matthew Wilcox wrote:
> On Tue, Mar 23, 2021 at 02:39:48PM +0100, Uladzislau Rezki wrote:
> > On Tue, Mar 23, 2021 at 12:39:13PM +, Matthew Wilcox wrote:
> > > On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> > > > On Mon, Mar 22, 2021 at 11:03:11PM +, Matthew Wilcox wrote:
> > > > > I suspect the vast majority of the time is spent calling 
> > > > > alloc_pages_node()
> > > > > 1024 times.  Have you looked at Mel's patch to do ... well, exactly 
> > > > > what
> > > > > vmalloc() wants?
> > > > > 
> > > > 
> > > >  - __vmalloc_node_range
> > > > - 45.25% __alloc_pages_nodemask
> > > >- 37.59% get_page_from_freelist
> > > [...]
> > > >   - 44.61% 0xc047348d
> > > >  - __vunmap
> > > > - 35.56% free_unref_page
> > > 
> > > Hmm!  I hadn't been thinking about the free side of things.
> > > Does this make a difference?
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 4f5f8c907897..61d5b769fea0 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int 
> > > deallocate_pages)
> > >   vm_remove_mappings(area, deallocate_pages);
> > >  
> > >   if (deallocate_pages) {
> > > - int i;
> > > -
> > > - for (i = 0; i < area->nr_pages; i++) {
> > > - struct page *page = area->pages[i];
> > > -
> > > - BUG_ON(!page);
> > > - __free_pages(page, 0);
> > > - }
> > > + release_pages(area->pages, area->nr_pages);
> > >   atomic_long_sub(area->nr_pages, _vmalloc_pages);
> > > -
> > >   kvfree(area->pages);
> > >   }
> > > 
> > Will check it today!
> > 
> > > release_pages does a bunch of checks that are unnecessary ... we could
> > > probably just do:
> > > 
> > >   LIST_HEAD(pages_to_free);
> > > 
> > >   for (i = 0; i < area->nr_pages; i++) {
> > >   struct page *page = area->pages[i];
> > >   if (put_page_testzero(page))
> > >   list_add(>lru, _to_free);
> > >   }
> > >   free_unref_page_list(_to_free);
> > > 
> > > but let's see if the provided interface gets us the performance we want.
> > >  
> > > > Reviewed-by: Uladzislau Rezki (Sony) 
> > > > 
> > > > Thanks!
> > > 
> > > Thank you!
> > You are welcome. A small nit:
> > 
> >   CC  mm/vmalloc.o
> > mm/vmalloc.c: In function ‘__vmalloc_area_node’:
> > mm/vmalloc.c:2492:14: warning: passing argument 4 of ‘kvmalloc_node_caller’ 
> > makes integer from pointer without a cast [-Wint-conversion]
> >   area->caller);
> >   ^~~~
> > In file included from mm/vmalloc.c:12:
> > ./include/linux/mm.h:782:7: note: expected ‘long unsigned int’ but argument 
> > is of type ‘const void *’
> >  void *kvmalloc_node_caller(size_t size, gfp_t flags, int node,
> 
> Oh, thank you!  I confused myself by changing the type halfway through.
> vmalloc() uses void * to match __builtin_return_address while most
> of the rest of the kernel uses unsigned long to match _RET_IP_.
> I'll submit another patch to convert vmalloc to use _RET_IP_.
> 
Thanks!

> > As for the bulk-array interface. I have checked the:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git 
> > mm-bulk-rebase-v6r2
> > 
> > applied the patch that is in question + below one:
> > 
> > 
> > @@ -2503,25 +2498,13 @@ static void *__vmalloc_area_node(struct vm_struct 
> > *area, gfp_t gfp_mask,
> > area->pages = pages;
> > area->nr_pages = nr_pages;
> >  
> > -   for (i = 0; i < area->nr_pages; i++) {
> > -   struct page *page;
> > -
> > -   if (node == NUMA_NO_NODE)
> > -   page = alloc_page(gfp_mask);
> > -   else
> > -   page = alloc_pages_node(node, gfp_mask, 0);
> > -
> > -   if (unlikely(!page)) {
> > -   /* Successfully allocated i pages, free t

Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-23 Thread Uladzislau Rezki
> On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> > On Mon, Mar 22, 2021 at 11:03:11PM +, Matthew Wilcox wrote:
> > > I suspect the vast majority of the time is spent calling 
> > > alloc_pages_node()
> > > 1024 times.  Have you looked at Mel's patch to do ... well, exactly what
> > > vmalloc() wants?
> > > 
> > 
> >  - __vmalloc_node_range
> > - 45.25% __alloc_pages_nodemask
> >- 37.59% get_page_from_freelist
> [...]
> >   - 44.61% 0xc047348d
> >  - __vunmap
> > - 35.56% free_unref_page
> 
> Hmm!  I hadn't been thinking about the free side of things.
> Does this make a difference?
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 4f5f8c907897..61d5b769fea0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int 
> deallocate_pages)
>   vm_remove_mappings(area, deallocate_pages);
>  
>   if (deallocate_pages) {
> - int i;
> -
> - for (i = 0; i < area->nr_pages; i++) {
> - struct page *page = area->pages[i];
> -
> - BUG_ON(!page);
> - __free_pages(page, 0);
> - }
> + release_pages(area->pages, area->nr_pages);
>   atomic_long_sub(area->nr_pages, _vmalloc_pages);
> -
>   kvfree(area->pages);
>   }
>
Same test. 4MB allocation on a single CPU:

default: loops: 100 avg: 93601889 usec
patch:   loops: 100 avg: 98217904 usec


- __vunmap
   - 41.17% free_unref_page
  - 28.42% free_pcppages_bulk
 - 6.38% __mod_zone_page_state
  4.79% check_preemption_disabled
   2.63% __list_del_entry_valid
   2.63% __list_add_valid
  - 7.50% free_unref_page_commit
   2.15% check_preemption_disabled
   2.01% __list_add_valid
2.31% free_unref_page_prepare.part.86
0.70% free_pcp_prepare



- __vunmap
   - 45.36% release_pages
  - 37.70% free_unref_page_list
 - 24.70% free_pcppages_bulk
- 5.42% __mod_zone_page_state
 4.23% check_preemption_disabled
  2.31% __list_add_valid
  2.07% __list_del_entry_valid
 - 7.58% free_unref_page_commit
  2.47% check_preemption_disabled
  1.75% __list_add_valid
   3.43% free_unref_page_prepare.part.86
  - 2.39% mem_cgroup_uncharge_list
   uncharge_page


It is obvious that the default version is slightly better. It requires
fewer steps compared with the release_pages() variant.

> 
> release_pages does a bunch of checks that are unnecessary ... we could
> probably just do:
> 
>   LIST_HEAD(pages_to_free);
> 
>   for (i = 0; i < area->nr_pages; i++) {
>   struct page *page = area->pages[i];
>   if (put_page_testzero(page))
>   list_add(>lru, _to_free);
>   }
>   free_unref_page_list(_to_free);
> 
> but let's see if the provided interface gets us the performance we want.
>  
I will test it tomorrow. At first glance it looks like a lighter-weight
version :)

--
Vlad Rezki


Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-23 Thread Uladzislau Rezki
On Tue, Mar 23, 2021 at 12:39:13PM +, Matthew Wilcox wrote:
> On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> > On Mon, Mar 22, 2021 at 11:03:11PM +, Matthew Wilcox wrote:
> > > I suspect the vast majority of the time is spent calling 
> > > alloc_pages_node()
> > > 1024 times.  Have you looked at Mel's patch to do ... well, exactly what
> > > vmalloc() wants?
> > > 
> > 
> >  - __vmalloc_node_range
> > - 45.25% __alloc_pages_nodemask
> >- 37.59% get_page_from_freelist
> [...]
> >   - 44.61% 0xc047348d
> >  - __vunmap
> > - 35.56% free_unref_page
> 
> Hmm!  I hadn't been thinking about the free side of things.
> Does this make a difference?
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 4f5f8c907897..61d5b769fea0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int 
> deallocate_pages)
>   vm_remove_mappings(area, deallocate_pages);
>  
>   if (deallocate_pages) {
> - int i;
> -
> - for (i = 0; i < area->nr_pages; i++) {
> - struct page *page = area->pages[i];
> -
> - BUG_ON(!page);
> - __free_pages(page, 0);
> - }
> + release_pages(area->pages, area->nr_pages);
>   atomic_long_sub(area->nr_pages, _vmalloc_pages);
> -
>   kvfree(area->pages);
>   }
> 
Will check it today!

> release_pages does a bunch of checks that are unnecessary ... we could
> probably just do:
> 
>   LIST_HEAD(pages_to_free);
> 
>   for (i = 0; i < area->nr_pages; i++) {
>   struct page *page = area->pages[i];
>   if (put_page_testzero(page))
>   list_add(>lru, _to_free);
>   }
>   free_unref_page_list(_to_free);
> 
> but let's see if the provided interface gets us the performance we want.
>  
> > Reviewed-by: Uladzislau Rezki (Sony) 
> > 
> > Thanks!
> 
> Thank you!
You are welcome. A small nit:

  CC  mm/vmalloc.o
mm/vmalloc.c: In function ‘__vmalloc_area_node’:
mm/vmalloc.c:2492:14: warning: passing argument 4 of ‘kvmalloc_node_caller’ 
makes integer from pointer without a cast [-Wint-conversion]
  area->caller);
  ^~~~
In file included from mm/vmalloc.c:12:
./include/linux/mm.h:782:7: note: expected ‘long unsigned int’ but argument is 
of type ‘const void *’
 void *kvmalloc_node_caller(size_t size, gfp_t flags, int node,


diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 8a202ba263f6..ee6fa44983bc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2489,7 +2489,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, 
gfp_t gfp_mask,
 
/* Please note that the recursion is strictly bounded. */
pages = kvmalloc_node_caller(array_size, nested_gfp, node,
-area->caller);
+   (unsigned long) area->caller);
if (!pages) {
free_vm_area(area);
return NULL;


As for the bulk-array interface, I have checked the branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r2

and applied the patch in question plus the one below:


@@ -2503,25 +2498,13 @@ static void *__vmalloc_area_node(struct vm_struct 
*area, gfp_t gfp_mask,
area->pages = pages;
area->nr_pages = nr_pages;
 
-   for (i = 0; i < area->nr_pages; i++) {
-   struct page *page;
-
-   if (node == NUMA_NO_NODE)
-   page = alloc_page(gfp_mask);
-   else
-   page = alloc_pages_node(node, gfp_mask, 0);
-
-   if (unlikely(!page)) {
-   /* Successfully allocated i pages, free them in 
__vfree() */
-   area->nr_pages = i;
-   atomic_long_add(area->nr_pages, _vmalloc_pages);
-   goto fail;
-   }
-   area->pages[i] = page;
-   if (gfpflags_allow_blocking(gfp_mask))
-   cond_resched();
+   ret = alloc_pages_bulk_array(gfp_mask, area->nr_pages, area->pages);
+   if (ret == nr_pages)
+   atomic_long_add(area->nr_pages, _vmalloc_pages);
+   else {
+   area->nr_pages = ret;
+   goto fail;
}
-   atomic_long_add(area->nr_pages, _vmalloc_pages);


patch (kvmalloc + bulk-array): single CPU, 4MB allocation, 100 avg: 70639437 usec
default:                       single CPU, 4MB allocation, 100 avg: 89218654 usec

and now we get ~21% delta. That is very good :)

--
Vlad Rezki


Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-23 Thread Uladzislau Rezki
On Mon, Mar 22, 2021 at 11:03:11PM +, Matthew Wilcox wrote:
> On Mon, Mar 22, 2021 at 11:36:19PM +0100, Uladzislau Rezki wrote:
> > On Mon, Mar 22, 2021 at 07:38:20PM +, Matthew Wilcox (Oracle) wrote:
> > > If we're trying to allocate 4MB of memory, the table will be 8KiB in size
> > > (1024 pointers * 8 bytes per pointer), which can usually be satisfied
> > > by a kmalloc (which is significantly faster).  Instead of changing this
> > > open-coded implementation, just use kvmalloc().
> > > 
> > > Signed-off-by: Matthew Wilcox (Oracle) 
> > > ---
> > >  mm/vmalloc.c | 7 +--
> > >  1 file changed, 1 insertion(+), 6 deletions(-)
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 96444d64129a..32b640a84250 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2802,13 +2802,8 @@ static void *__vmalloc_area_node(struct vm_struct 
> > > *area, gfp_t gfp_mask,
> > >   gfp_mask |= __GFP_HIGHMEM;
> > >  
> > >   /* Please note that the recursion is strictly bounded. */
> > > - if (array_size > PAGE_SIZE) {
> > > - pages = __vmalloc_node(array_size, 1, nested_gfp, node,
> > > + pages = kvmalloc_node_caller(array_size, nested_gfp, node,
> > >   area->caller);
> > > - } else {
> > > - pages = kmalloc_node(array_size, nested_gfp, node);
> > > - }
> > > -
> > >   if (!pages) {
> > >   free_vm_area(area);
> > >   return NULL;
> > > -- 
> > > 2.30.2
> > Makes sense to me. Though i expected a bigger difference:
> > 
> > # patch
> > single CPU, 4MB allocation, loops: 100 avg: 85293854 usec
> > 
> > # default
> > single CPU, 4MB allocation, loops: 100 avg: 89275857 usec
> 
> Well, 4.5% isn't something to leave on the table ... but yeah, I was
> expecting more in the 10-20% range.  It may be more significant if
> there's contention on the spinlocks (like if this crazy ksmbd is calling
> vmalloc(4MB) on multiple nodes simultaneously).
> 
Yep, simultaneous allocations may well show even bigger improvements
because of lock contention. Anyway, there is an advantage in switching
to the slab allocator: 5% is also a win :)

>
> I suspect the vast majority of the time is spent calling alloc_pages_node()
> 1024 times.  Have you looked at Mel's patch to do ... well, exactly what
> vmalloc() wants?
> 

-   97.37%     0.00%  vmalloc_test/0   [kernel.vmlinux]  [k] ret_from_fork
        ret_from_fork
        kthread
      - 0xc047373b
         - 52.67% 0xc047349f
              __vmalloc_node
            - __vmalloc_node_range
               - 45.25% __alloc_pages_nodemask
                  - 37.59% get_page_from_freelist
                       4.34% __list_del_entry_valid
                       3.67% __list_add_valid
                       1.52% prep_new_page

Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-22 Thread Uladzislau Rezki
On Mon, Mar 22, 2021 at 07:38:20PM +, Matthew Wilcox (Oracle) wrote:
> If we're trying to allocate 4MB of memory, the table will be 8KiB in size
> (1024 pointers * 8 bytes per pointer), which can usually be satisfied
> by a kmalloc (which is significantly faster).  Instead of changing this
> open-coded implementation, just use kvmalloc().
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/vmalloc.c | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 96444d64129a..32b640a84250 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2802,13 +2802,8 @@ static void *__vmalloc_area_node(struct vm_struct 
> *area, gfp_t gfp_mask,
>   gfp_mask |= __GFP_HIGHMEM;
>  
>   /* Please note that the recursion is strictly bounded. */
> - if (array_size > PAGE_SIZE) {
> - pages = __vmalloc_node(array_size, 1, nested_gfp, node,
> + pages = kvmalloc_node_caller(array_size, nested_gfp, node,
>   area->caller);
> - } else {
> - pages = kmalloc_node(array_size, nested_gfp, node);
> - }
> -
>   if (!pages) {
>   free_vm_area(area);
>   return NULL;
> -- 
> 2.30.2
Makes sense to me, though I expected a bigger difference:

# patch
single CPU, 4MB allocation, loops: 100 avg: 85293854 usec

# default
single CPU, 4MB allocation, loops: 100 avg: 89275857 usec

One question: should we care much about fragmentation? I mean, with the
patch, allocations bigger than 2MB will request a page array from the
slab that is bigger than PAGE_SIZE (see the arithmetic below).
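
For reference, a rough sketch of the arithmetic behind that 2MB boundary
(assuming 4KB pages and 8-byte page pointers):

    2MB / 4KB = 512 pages  ->  512 * 8 bytes = 4KB (== PAGE_SIZE)

so only requests above 2MB push the pages[] array past PAGE_SIZE and
into a higher-order slab allocation.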

Thanks!

--
Vlad Rezki


Re: [PATCH v2 1/1] kvfree_rcu: Release a page cache under memory pressure

2021-03-18 Thread Uladzislau Rezki
On Tue, Mar 16, 2021 at 02:01:25PM -0700, Paul E. McKenney wrote:
> On Tue, Mar 16, 2021 at 09:42:07PM +0100, Uladzislau Rezki wrote:
> > > On Wed, Mar 10, 2021 at 09:07:57PM +0100, Uladzislau Rezki (Sony) wrote:
> > > > From: Zhang Qiang 
> > > > 
> > > > Add a drain_page_cache() function to drain a per-cpu page cache.
> > > > The reason behind of it is a system can run into a low memory
> > > > condition, in that case a page shrinker can ask for its users
> > > > to free their caches in order to get extra memory available for
> > > > other needs in a system.
> > > > 
> > > > When a system hits such condition, a page cache is drained for
> > > > all CPUs in a system. Apart of that a page cache work is delayed
> > > > with 5 seconds interval until a memory pressure disappears.
> > > 
> > > Does this capture it?
> > > 
> > It would be good to have kind of clear interface saying that:
> > 
> > - low memory condition starts;
> > - it is over, watermarks were fixed.
> > 
> > but i do not see it. Therefore 5 seconds back-off has been chosen
> > to make a cache refilling to be less aggressive. Suppose 5 seconds
> > is not enough, in that case the work will attempt to allocate some
> > pages using less permissive parameters. What means that if we are
> > still in a low memory condition a refilling will probably fail and
> > next job will be invoked in 5 seconds one more time.
> 
> I would like such an interface as well, but from what I hear it is
> easier to ask for than to provide.  :-/
> 
> > > 
> > > 
> > > Add a drain_page_cache() function that drains the specified per-cpu
> > > page cache.  This function is invoked on each CPU when the system
> > > enters a low-memory state, that is, when the shrinker invokes
> > > kfree_rcu_shrink_scan().  Thus, when the system is low on memory,
> > > kvfree_rcu() starts taking its slow paths.
> > > 
> > > In addition, the first subsequent attempt to refill the caches is
> > > delayed for five seconds.
> > > 
> > > 
> > > 
> > > A few questions below.
> > > 
> > >   Thanx, Paul
> > > 
> > > > Co-developed-by: Uladzislau Rezki (Sony) 
> > > > Signed-off-by: Uladzislau Rezki (Sony) 
> > > > Signed-off-by: Zqiang 
> > > > ---
> > > >  kernel/rcu/tree.c | 59 ---
> > > >  1 file changed, 51 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 2c9cf4df942c..46b8a98ca077 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -3163,7 +3163,7 @@ struct kfree_rcu_cpu {
> > > > bool initialized;
> > > > int count;
> > > >  
> > > > -   struct work_struct page_cache_work;
> > > > +   struct delayed_work page_cache_work;
> > > > atomic_t work_in_progress;
> > > > struct hrtimer hrtimer;
> > > >  
> > > > @@ -3175,6 +3175,17 @@ static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) 
> > > > = {
> > > > .lock = __RAW_SPIN_LOCK_UNLOCKED(krc.lock),
> > > >  };
> > > >  
> > > > +// A page shrinker can ask for freeing extra pages
> > > > +// to get them available for other needs in a system.
> > > > +// Usually it happens under low memory condition, in
> > > > +// that case hold on a bit with page cache filling.
> > > > +static unsigned long backoff_page_cache_fill;
> > > > +
> > > > +// 5 seconds delay. That is long enough to reduce
> > > > +// an interfering and racing with a shrinker where
> > > > +// the cache is drained.
> > > > +#define PAGE_CACHE_FILL_DELAY (5 * HZ)
> > > > +
> > > >  static __always_inline void
> > > >  debug_rcu_bhead_unqueue(struct kvfree_rcu_bulk_data *bhead)
> > > >  {
> > > > @@ -3229,6 +3240,26 @@ put_cached_bnode(struct kfree_rcu_cpu *krcp,
> > > >  
> > > >  }
> > > >  
> > > > +static int
> > > > +drain_page_cache(struct kfree_rcu_cpu *krcp)
> > > > +{
> > 

Re: [PATCH] mm: vmalloc: Prevent use after free in _vm_unmap_aliases

2021-03-18 Thread Uladzislau Rezki
On Thu, Mar 18, 2021 at 03:38:25PM +0530, vji...@codeaurora.org wrote:
> From: Vijayanand Jitta 
> 
> A potential use after free can occur in _vm_unmap_aliases
> where an already freed vmap_area could be accessed, Consider
> the following scenario:
> 
> Process 1 Process 2
> 
> __vm_unmap_aliases__vm_unmap_aliases
>   purge_fragmented_blocks_allcpus rcu_read_lock()
>   rcu_read_lock()
>   list_del_rcu(>free_list)
>   
> list_for_each_entry_rcu(vb .. )
>   __purge_vmap_area_lazy
>   kmem_cache_free(va)
>   
> va_start = vb->va->va_start
Or maybe we should switch to kfree_rcu() instead of kmem_cache_free()?

--
Vlad Rezki


Re: [PATCH v2 1/1] kvfree_rcu: Release a page cache under memory pressure

2021-03-16 Thread Uladzislau Rezki
> On Wed, Mar 10, 2021 at 09:07:57PM +0100, Uladzislau Rezki (Sony) wrote:
> > From: Zhang Qiang 
> > 
> > Add a drain_page_cache() function to drain a per-cpu page cache.
> > The reason behind of it is a system can run into a low memory
> > condition, in that case a page shrinker can ask for its users
> > to free their caches in order to get extra memory available for
> > other needs in a system.
> > 
> > When a system hits such condition, a page cache is drained for
> > all CPUs in a system. Apart of that a page cache work is delayed
> > with 5 seconds interval until a memory pressure disappears.
> 
> Does this capture it?
> 
It would be good to have some kind of clear interface saying that:

- a low-memory condition has started;
- it is over, the watermarks have been restored.

but I do not see one. Therefore a 5-second back-off has been chosen to
make the cache refilling less aggressive. Suppose 5 seconds is not
enough; in that case the work will attempt to allocate some pages using
less permissive parameters. That means that if we are still in a
low-memory condition the refilling will probably fail and the next job
will be invoked in 5 seconds once more.
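
A minimal sketch of that back-off idea (illustrative only; these helper
names are not from the patch): the shrinker records a hold-off
timestamp, and the refill work checks it before allocating pages.

static unsigned long backoff_until;	/* in jiffies */

/* Called from the shrinker when the system is low on memory. */
static void note_low_memory(void)
{
	WRITE_ONCE(backoff_until, jiffies + 5 * HZ);
}

/* Checked by the page-cache refill work before refilling the cache. */
static bool hold_off_refill(void)
{
	return time_before(jiffies, READ_ONCE(backoff_until));
}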

> 
> 
> Add a drain_page_cache() function that drains the specified per-cpu
> page cache.  This function is invoked on each CPU when the system
> enters a low-memory state, that is, when the shrinker invokes
> kfree_rcu_shrink_scan().  Thus, when the system is low on memory,
> kvfree_rcu() starts taking its slow paths.
> 
> In addition, the first subsequent attempt to refill the caches is
> delayed for five seconds.
> 
> 
> 
> A few questions below.
> 
>       Thanx, Paul
> 
> > Co-developed-by: Uladzislau Rezki (Sony) 
> > Signed-off-by: Uladzislau Rezki (Sony) 
> > Signed-off-by: Zqiang 
> > ---
> >  kernel/rcu/tree.c | 59 ---
> >  1 file changed, 51 insertions(+), 8 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 2c9cf4df942c..46b8a98ca077 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3163,7 +3163,7 @@ struct kfree_rcu_cpu {
> > bool initialized;
> > int count;
> >  
> > -   struct work_struct page_cache_work;
> > +   struct delayed_work page_cache_work;
> > atomic_t work_in_progress;
> > struct hrtimer hrtimer;
> >  
> > @@ -3175,6 +3175,17 @@ static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
> > .lock = __RAW_SPIN_LOCK_UNLOCKED(krc.lock),
> >  };
> >  
> > +// A page shrinker can ask for freeing extra pages
> > +// to get them available for other needs in a system.
> > +// Usually it happens under low memory condition, in
> > +// that case hold on a bit with page cache filling.
> > +static unsigned long backoff_page_cache_fill;
> > +
> > +// 5 seconds delay. That is long enough to reduce
> > +// an interfering and racing with a shrinker where
> > +// the cache is drained.
> > +#define PAGE_CACHE_FILL_DELAY (5 * HZ)
> > +
> >  static __always_inline void
> >  debug_rcu_bhead_unqueue(struct kvfree_rcu_bulk_data *bhead)
> >  {
> > @@ -3229,6 +3240,26 @@ put_cached_bnode(struct kfree_rcu_cpu *krcp,
> >  
> >  }
> >  
> > +static int
> > +drain_page_cache(struct kfree_rcu_cpu *krcp)
> > +{
> > +   unsigned long flags;
> > +   struct llist_node *page_list, *pos, *n;
> > +   int freed = 0;
> > +
> > +   raw_spin_lock_irqsave(&krcp->lock, flags);
> > +   page_list = llist_del_all(&krcp->bkvcache);
> > +   krcp->nr_bkv_objs = 0;
> > +   raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > +
> > +   llist_for_each_safe(pos, n, page_list) {
> > +   free_page((unsigned long)pos);
> > +   freed++;
> > +   }
> > +
> > +   return freed;
> > +}
> > +
> >  /*
> >   * This function is invoked in workqueue context after a grace period.
> >   * It frees all the objects queued on ->bhead_free or ->head_free.
> > @@ -3419,7 +3450,7 @@ schedule_page_work_fn(struct hrtimer *t)
> > struct kfree_rcu_cpu *krcp =
> > container_of(t, struct kfree_rcu_cpu, hrtimer);
> >  
> > -   queue_work(system_highpri_wq, &krcp->page_cache_work);
> > +   queue_delayed_work(system_highpri_wq, &krcp->page_cache_work, 0);
> > return HRTIMER_NORESTART;
> >  }
> >  
> > @@ -3428,7 +3459,7 @

Re: [PATCH v4] mm/vmalloc: randomize vmalloc() allocations

2021-03-16 Thread Uladzislau Rezki
On Tue, Mar 16, 2021 at 10:01:46AM +0200, Topi Miettinen wrote:
> On 15.3.2021 19.47, Uladzislau Rezki wrote:
> > On Mon, Mar 15, 2021 at 09:16:26AM -0700, Kees Cook wrote:
> > > On Mon, Mar 15, 2021 at 01:24:10PM +0100, Uladzislau Rezki wrote:
> > > > On Mon, Mar 15, 2021 at 11:04:42AM +0200, Topi Miettinen wrote:
> > > > > What's the problem with that? It seems to me that nothing relies on 
> > > > > specific
> > > > > addresses of the chunks, so it should be possible to randomize these 
> > > > > too.
> > > > > Also the alignment is honored.
> > > > > 
> > > > My concern are:
> > > > 
> > > > - it is not a vmalloc allocator;
> > > > - per-cpu allocator allocates chunks, thus it might be it happens only 
> > > > once. It does not allocate it often;
> > > 
> > > That's actually the reason to randomize it: if it always ends up in the
> > > same place at every boot, it becomes a stable target for attackers.
> > > 
> > Probably we can randomize the base address only once, when the pcpu
> > allocator allocates its first chunk during boot.
> > 
> > > > - changing it will likely introduce issues you are not aware of;
> > > > - it is not supposed to be interacting with vmalloc allocator. Read the
> > > >comment under pcpu_get_vm_areas();
> > > > 
> > > > Therefore i propose just not touch it.
> > > 
> > > How about splitting it from this patch instead? Then it can get separate
> > > testing, etc.
> > > 
> > It should be split as well as tested.
> 
> Would you prefer another kernel option `randomize_percpu_allocator=1`, or
> would it be OK to make it a flag in `randomize_vmalloc`, like
> `randomize_vmalloc=3`? Maybe the latter would not be compatible with static
> branches.
> 
I think it is better to have a separate option, because there are two
different allocators.
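
If it does become a separate option, it could be a boot parameter of its
own, e.g. (purely hypothetical sketch; the name and the kstrtobool-based
parsing are assumptions, not code from any posted patch):

	static bool randomize_percpu_allocator __ro_after_init;

	static int __init set_randomize_percpu_allocator(char *str)
	{
		/* Parses "randomize_percpu_allocator=0/1" from the command line. */
		return kstrtobool(str, &randomize_percpu_allocator);
	}
	early_param("randomize_percpu_allocator", set_randomize_percpu_allocator);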

--
Vlad Rezki


Re: [PATCH v4] mm/vmalloc: randomize vmalloc() allocations

2021-03-15 Thread Uladzislau Rezki
On Mon, Mar 15, 2021 at 06:23:37PM +0200, Topi Miettinen wrote:
> On 15.3.2021 17.35, Uladzislau Rezki wrote:
> > > On 14.3.2021 19.23, Uladzislau Rezki wrote:
> > > > Also, using vmaloc test driver i can trigger a kernel BUG:
> > > > 
> > > > 
> > > > [   24.627577] kernel BUG at mm/vmalloc.c:1272!
> > > 
> > > It seems that most tests indeed fail. Perhaps the vmalloc subsystem isn't
> > > very robust in face of fragmented virtual memory. What could be done to 
> > > fix
> > > that?
> > > 
> > Your patch is broken in context of checking "vend" when you try to
> > allocate next time after first attempt. Passed "vend" is different
> > there comparing what is checked later to figure out if an allocation
> > failed or not:
> > 
> > 
> >  if (unlikely(addr == vend))
> >  goto overflow;
> > 
> 
> 
> Thanks, I'll fix that.
> 
> > 
> > > 
> > > In this patch, I could retry __alloc_vmap_area() with the whole region 
> > > after
> > > failure of both [random, vend] and [vstart, random] but I'm not sure that
> > > would help much. Worth a try of course.
> > > 
> > There is no need in your second [vstart, random]. If a first bigger range
> > has not been successful, the smaller one will never be success anyway. The
> > best way to go here is to repeat with real [vsart:vend], if it still fails
> > on a real range, then it will not be possible to accomplish an allocation
> > request with given parameters.
> > 
> > > 
> > > By the way, some of the tests in test_vmalloc.c don't check for vmalloc()
> > > failure, for example in full_fit_alloc_test().
> > > 
> > Where?
> 
> Something like this:
> 
> diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
> index 5cf2fe9aab9e..27e5db9a96b4 100644
> --- a/lib/test_vmalloc.c
> +++ b/lib/test_vmalloc.c
> @@ -182,9 +182,14 @@ static int long_busy_list_alloc_test(void)
> if (!ptr)
> return rv;
> 
> -   for (i = 0; i < 15000; i++)
> +   for (i = 0; i < 15000; i++) {
> ptr[i] = vmalloc(1 * PAGE_SIZE);
> 
> +   if (!ptr[i])
> +   goto leave;
> +   }
> +
>
Hmm. That loop only creates a long list of allocated areas before the
test runs. For example, if one allocation among the 15 000 fails, that
index is left NULL, and later on, after the "leave" label, vfree()
simply skips the NULL entries.

Whether we end up with 15 000 extra elements or 10 000 does not really
matter; it is a corner case that probably never happens. Yes, you can
simulate such a precondition, but then the regular vmalloc() calls will
likely fail as well, so the final results would be skewed anyway.
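
Concretely, the existing cleanup stays correct because vfree(NULL) is a
no-op (illustrative only):

	/* Unset slots are simply skipped by vfree(). */
	for (i = 0; i < 15000; i++)
		vfree(ptr[i]);	/* vfree(NULL) does nothing */

	vfree(ptr);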

> +
> for (i = 0; i < test_loop_count; i++) {
> ptr_1 = vmalloc(100 * PAGE_SIZE);
> if (!ptr_1)
> @@ -236,7 +241,11 @@ static int full_fit_alloc_test(void)
> 
> for (i = 0; i < junk_length; i++) {
> ptr[i] = vmalloc(1 * PAGE_SIZE);
> +   if (!ptr[i])
> +   goto error;
> junk_ptr[i] = vmalloc(1 * PAGE_SIZE);
> +   if (!junk_ptr[i])
> +   goto error;
> }
> 
> for (i = 0; i < junk_length; i++)
> @@ -256,8 +265,10 @@ static int full_fit_alloc_test(void)
> rv = 0;
> 
>  error:
> -   for (i = 0; i < junk_length; i++)
> +   for (i = 0; i < junk_length; i++) {
> vfree(ptr[i]);
> +   vfree(junk_ptr[i]);
> +   }
> 
> vfree(ptr);
> vfree(junk_ptr);
> 
Same here.

--
Vlad Rezki


Re: [PATCH v4] mm/vmalloc: randomize vmalloc() allocations

2021-03-15 Thread Uladzislau Rezki
On Mon, Mar 15, 2021 at 09:16:26AM -0700, Kees Cook wrote:
> On Mon, Mar 15, 2021 at 01:24:10PM +0100, Uladzislau Rezki wrote:
> > On Mon, Mar 15, 2021 at 11:04:42AM +0200, Topi Miettinen wrote:
> > > What's the problem with that? It seems to me that nothing relies on 
> > > specific
> > > addresses of the chunks, so it should be possible to randomize these too.
> > > Also the alignment is honored.
> > > 
> > My concern are:
> > 
> > - it is not a vmalloc allocator;
> > - per-cpu allocator allocates chunks, thus it might be it happens only 
> > once. It does not allocate it often;
> 
> That's actually the reason to randomize it: if it always ends up in the
> same place at every boot, it becomes a stable target for attackers.
> 
Probably we can randomize the base address only once, when the pcpu
allocator allocates its first chunk during boot.

> > - changing it will likely introduce issues you are not aware of;
> > - it is not supposed to be interacting with vmalloc allocator. Read the
> >   comment under pcpu_get_vm_areas();
> > 
> > Therefore i propose just not touch it.
> 
> How about splitting it from this patch instead? Then it can get separate
> testing, etc.
> 
It should be split as well as tested.

--
Vlad Rezki


Re: [PATCH v4] mm/vmalloc: randomize vmalloc() allocations

2021-03-15 Thread Uladzislau Rezki
> On 14.3.2021 19.23, Uladzislau Rezki wrote:
> > Also, using vmaloc test driver i can trigger a kernel BUG:
> > 
> > 
> > [   24.627577] kernel BUG at mm/vmalloc.c:1272!
> 
> It seems that most tests indeed fail. Perhaps the vmalloc subsystem isn't
> very robust in face of fragmented virtual memory. What could be done to fix
> that?
> 
Your patch is broken with respect to the "vend" check when you retry
the allocation after the first attempt. The "vend" passed in there is
different from the one checked later to figure out whether the
allocation failed or not:


if (unlikely(addr == vend))
goto overflow;


>
> In this patch, I could retry __alloc_vmap_area() with the whole region after
> failure of both [random, vend] and [vstart, random] but I'm not sure that
> would help much. Worth a try of course.
> 
There is no need for your second [vstart, random] attempt. If the first,
bigger range was not successful, the smaller one will never succeed
anyway. The best way to go here is to repeat with the real [vstart:vend];
if it still fails on the real range, then the allocation request cannot
be accomplished with the given parameters.
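
Roughly like this (a sketch only; pick_random_base() is made up, and it
assumes __alloc_vmap_area() keeps its size/align/vstart/vend form and
returns vend on failure):

	static unsigned long
	alloc_vmap_area_randomized(unsigned long size, unsigned long align,
				   unsigned long vstart, unsigned long vend)
	{
		/* Hypothetical helper picking a random base inside [vstart:vend). */
		unsigned long random_start = pick_random_base(vstart, vend);
		unsigned long addr;

		/* First attempt: the randomized sub-range [random_start:vend). */
		addr = __alloc_vmap_area(size, align, random_start, vend);

		/* Last attempt: the real, full [vstart:vend) range. */
		if (unlikely(addr == vend))
			addr = __alloc_vmap_area(size, align, vstart, vend);

		/* addr == vend here means the request cannot be satisfied at all. */
		return addr;
	}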

>
> By the way, some of the tests in test_vmalloc.c don't check for vmalloc()
> failure, for example in full_fit_alloc_test().
> 
Where?

--
Vlad Rezki


Re: [PATCH v4] mm/vmalloc: randomize vmalloc() allocations

2021-03-15 Thread Uladzislau Rezki
On Mon, Mar 15, 2021 at 11:04:42AM +0200, Topi Miettinen wrote:
> On 14.3.2021 19.23, Uladzislau Rezki wrote:
> > > Memory mappings inside kernel allocated with vmalloc() are in
> > > predictable order and packed tightly toward the low addresses, except
> > > for per-cpu areas which start from top of the vmalloc area. With
> > > new kernel boot parameter 'randomize_vmalloc=1', the entire area is
> > > used randomly to make the allocations less predictable and harder to
> > > guess for attackers. Also module and BPF code locations get randomized
> > > (within their dedicated and rather small area though) and if
> > > CONFIG_VMAP_STACK is enabled, also kernel thread stack locations.
> > > 
> > > On 32 bit systems this may cause problems due to increased VM
> > > fragmentation if the address space gets crowded.
> > > 
> > > On all systems, it will reduce performance and increase memory and
> > > cache usage due to less efficient use of page tables and inability to
> > > merge adjacent VMAs with compatible attributes. On x86_64 with 5 level
> > > page tables, in the worst case, additional page table entries of up to
> > > 4 pages are created for each mapping, so with small mappings there's
> > > considerable penalty.
> > > 
> > > Without randomize_vmalloc=1:
> > > $ grep -v kernel_clone /proc/vmallocinfo
> > > 0xc900-0xc9009000   36864 
> > > irq_init_percpu_irqstack+0x176/0x1c0 vmap
> > > 0xc9009000-0xc900b0008192 
> > > acpi_os_map_iomem+0x2ac/0x2d0 phys=0x1ffe1000 ioremap
> > > 0xc900c000-0xc900f000   12288 
> > > acpi_os_map_iomem+0x2ac/0x2d0 phys=0x1ffe ioremap
> > > 0xc900f000-0xc90110008192 hpet_enable+0x31/0x4a4 
> > > phys=0xfed0 ioremap
> > > 0xc9011000-0xc90130008192 
> > > gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> > > 0xc9013000-0xc90150008192 
> > > gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> > > 0xc9015000-0xc90170008192 
> > > gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> > > 0xc9021000-0xc90230008192 
> > > gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> > > 0xc9023000-0xc90250008192 
> > > acpi_os_map_iomem+0x2ac/0x2d0 phys=0xfed0 ioremap
> > > 0xc9025000-0xc90270008192 memremap+0x19c/0x280 
> > > phys=0x000f5000 ioremap
> > > 0xc9031000-0xc9036000   20480 
> > > pcpu_create_chunk+0xe8/0x260 pages=4 vmalloc
> > > 0xc9043000-0xc9047000   16384 n_tty_open+0x11/0xe0 
> > > pages=3 vmalloc
> > > 0xc9211000-0xc9232000  135168 
> > > crypto_scomp_init_tfm+0xc6/0xf0 pages=32 vmalloc
> > > 0xc9232000-0xc9253000  135168 
> > > crypto_scomp_init_tfm+0x67/0xf0 pages=32 vmalloc
> > > 0xc95a9000-0xc95ba000   69632 
> > > pcpu_create_chunk+0x7b/0x260 pages=16 vmalloc
> > > 0xc95ba000-0xc95cc000   73728 
> > > pcpu_create_chunk+0xb2/0x260 pages=17 vmalloc
> > > 0xe8c0-0xe8e0 2097152 
> > > pcpu_get_vm_areas+0x0/0x2290 vmalloc
> > > 
> > > With randomize_vmalloc=1, the allocations are randomized:
> > > $ grep -v kernel_clone /proc/vmallocinfo
> > > 0xc9759d443000-0xc9759d4450008192 hpet_enable+0x31/0x4a4 
> > > phys=0xfed0 ioremap
> > > 0xccf1e9f66000-0xccf1e9f680008192 
> > > gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> > > 0xcd2fc02a4000-0xcd2fc02a60008192 
> > > gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> > > 0xcdaefb898000-0xcdaefb89b000   12288 
> > > acpi_os_map_iomem+0x2ac/0x2d0 phys=0x1ffe ioremap
> > > 0xcef8074c3000-0xcef8074cc000   36864 
> > > irq_init_percpu_irqstack+0x176/0x1c0 vmap
> > > 0xcf725ca2e000-0xcf725ca4f000  135168 
> > > crypto_scomp_init_tfm+0xc6/0xf0 pages=32 vmalloc
> > > 0xd0efb25e1000-0xd0efb25f2000   69632 
> > > pcpu_create_chunk+0x7b/0x260 pages=16 vmalloc
> > > 0xd27054678000-0xd2705467c000   16384 n_tty_open+0x11/0xe0 
> > > pages=3 vmalloc
> > > 0xd2adf716e000-0xd2adf718   73728 
> > > pcpu_create_chunk+0xb2/0x260 pages=17 vmalloc
> > > 0xd4ba5fb6b000-0xd4ba5fb6d0008192 
> > 

Re: [PATCH v4] mm/vmalloc: randomize vmalloc() allocations

2021-03-14 Thread Uladzislau Rezki
> Memory mappings inside kernel allocated with vmalloc() are in
> predictable order and packed tightly toward the low addresses, except
> for per-cpu areas which start from top of the vmalloc area. With
> new kernel boot parameter 'randomize_vmalloc=1', the entire area is
> used randomly to make the allocations less predictable and harder to
> guess for attackers. Also module and BPF code locations get randomized
> (within their dedicated and rather small area though) and if
> CONFIG_VMAP_STACK is enabled, also kernel thread stack locations.
> 
> On 32 bit systems this may cause problems due to increased VM
> fragmentation if the address space gets crowded.
> 
> On all systems, it will reduce performance and increase memory and
> cache usage due to less efficient use of page tables and inability to
> merge adjacent VMAs with compatible attributes. On x86_64 with 5 level
> page tables, in the worst case, additional page table entries of up to
> 4 pages are created for each mapping, so with small mappings there's
> considerable penalty.
> 
> Without randomize_vmalloc=1:
> $ grep -v kernel_clone /proc/vmallocinfo
> 0xc900-0xc9009000   36864 
> irq_init_percpu_irqstack+0x176/0x1c0 vmap
> 0xc9009000-0xc900b0008192 acpi_os_map_iomem+0x2ac/0x2d0 
> phys=0x1ffe1000 ioremap
> 0xc900c000-0xc900f000   12288 acpi_os_map_iomem+0x2ac/0x2d0 
> phys=0x1ffe ioremap
> 0xc900f000-0xc90110008192 hpet_enable+0x31/0x4a4 
> phys=0xfed0 ioremap
> 0xc9011000-0xc90130008192 gen_pool_add_owner+0x49/0x130 
> pages=1 vmalloc
> 0xc9013000-0xc90150008192 gen_pool_add_owner+0x49/0x130 
> pages=1 vmalloc
> 0xc9015000-0xc90170008192 gen_pool_add_owner+0x49/0x130 
> pages=1 vmalloc
> 0xc9021000-0xc90230008192 gen_pool_add_owner+0x49/0x130 
> pages=1 vmalloc
> 0xc9023000-0xc90250008192 acpi_os_map_iomem+0x2ac/0x2d0 
> phys=0xfed0 ioremap
> 0xc9025000-0xc90270008192 memremap+0x19c/0x280 
> phys=0x000f5000 ioremap
> 0xc9031000-0xc9036000   20480 pcpu_create_chunk+0xe8/0x260 
> pages=4 vmalloc
> 0xc9043000-0xc9047000   16384 n_tty_open+0x11/0xe0 pages=3 
> vmalloc
> 0xc9211000-0xc9232000  135168 crypto_scomp_init_tfm+0xc6/0xf0 
> pages=32 vmalloc
> 0xc9232000-0xc9253000  135168 crypto_scomp_init_tfm+0x67/0xf0 
> pages=32 vmalloc
> 0xc95a9000-0xc95ba000   69632 pcpu_create_chunk+0x7b/0x260 
> pages=16 vmalloc
> 0xc95ba000-0xc95cc000   73728 pcpu_create_chunk+0xb2/0x260 
> pages=17 vmalloc
> 0xe8c0-0xe8e0 2097152 pcpu_get_vm_areas+0x0/0x2290 
> vmalloc
> 
> With randomize_vmalloc=1, the allocations are randomized:
> $ grep -v kernel_clone /proc/vmallocinfo
> 0xc9759d443000-0xc9759d4450008192 hpet_enable+0x31/0x4a4 
> phys=0xfed0 ioremap
> 0xccf1e9f66000-0xccf1e9f680008192 gen_pool_add_owner+0x49/0x130 
> pages=1 vmalloc
> 0xcd2fc02a4000-0xcd2fc02a60008192 gen_pool_add_owner+0x49/0x130 
> pages=1 vmalloc
> 0xcdaefb898000-0xcdaefb89b000   12288 acpi_os_map_iomem+0x2ac/0x2d0 
> phys=0x1ffe ioremap
> 0xcef8074c3000-0xcef8074cc000   36864 
> irq_init_percpu_irqstack+0x176/0x1c0 vmap
> 0xcf725ca2e000-0xcf725ca4f000  135168 crypto_scomp_init_tfm+0xc6/0xf0 
> pages=32 vmalloc
> 0xd0efb25e1000-0xd0efb25f2000   69632 pcpu_create_chunk+0x7b/0x260 
> pages=16 vmalloc
> 0xd27054678000-0xd2705467c000   16384 n_tty_open+0x11/0xe0 pages=3 
> vmalloc
> 0xd2adf716e000-0xd2adf718   73728 pcpu_create_chunk+0xb2/0x260 
> pages=17 vmalloc
> 0xd4ba5fb6b000-0xd4ba5fb6d0008192 acpi_os_map_iomem+0x2ac/0x2d0 
> phys=0x1ffe1000 ioremap
> 0xded126192000-0xded1261940008192 memremap+0x19c/0x280 
> phys=0x000f5000 ioremap
> 0xe01a4dbcd000-0xe01a4dbcf0008192 gen_pool_add_owner+0x49/0x130 
> pages=1 vmalloc
> 0xe4b649952000-0xe4b6499540008192 acpi_os_map_iomem+0x2ac/0x2d0 
> phys=0xfed0 ioremap
> 0xe71ed592a000-0xe71ed592c0008192 gen_pool_add_owner+0x49/0x130 
> pages=1 vmalloc
> 0xe7dc5824f000-0xe7dc5827  135168 crypto_scomp_init_tfm+0x67/0xf0 
> pages=32 vmalloc
> 0xe8f4f980-0xe8f4f9a0 2097152 pcpu_get_vm_areas+0x0/0x2290 
> vmalloc
> 0xe8f4f9a19000-0xe8f4f9a1e000   20480 pcpu_create_chunk+0xe8/0x260 
> pages=4 vmalloc
> 
> With CONFIG_VMAP_STACK, also kernel thread stacks are placed in
> vmalloc area and therefore they also get randomized (only one example
> line from /proc/vmallocinfo shown for brevity):
> 
> unrandomized:
> 0xc9018000-0xc9021000   36864 kernel_clone+0xf9/0x560 pages=8 
> vmalloc
> 
> randomized:
> 0xcb57611a8000-0xcb57611b1000   36864 

[PATCH v2 1/1] kvfree_rcu: Release a page cache under memory pressure

2021-03-10 Thread Uladzislau Rezki (Sony)
From: Zhang Qiang 

Add a drain_page_cache() function to drain a per-cpu page cache.
The reason behind it is that a system can run into a low-memory
condition; in that case a page shrinker can ask its users to free
their caches in order to make extra memory available for other
needs in the system.

When a system hits such a condition, the page cache is drained for
all CPUs in the system. Apart from that, the page-cache refill work
is delayed by a 5-second interval until the memory pressure
disappears.

Co-developed-by: Uladzislau Rezki (Sony) 
Signed-off-by: Uladzislau Rezki (Sony) 
Signed-off-by: Zqiang 
---
 kernel/rcu/tree.c | 59 ---
 1 file changed, 51 insertions(+), 8 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2c9cf4df942c..46b8a98ca077 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3163,7 +3163,7 @@ struct kfree_rcu_cpu {
bool initialized;
int count;
 
-   struct work_struct page_cache_work;
+   struct delayed_work page_cache_work;
atomic_t work_in_progress;
struct hrtimer hrtimer;
 
@@ -3175,6 +3175,17 @@ static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
.lock = __RAW_SPIN_LOCK_UNLOCKED(krc.lock),
 };
 
+// A page shrinker can ask for freeing extra pages
+// to get them available for other needs in a system.
+// Usually it happens under low memory condition, in
+// that case hold on a bit with page cache filling.
+static unsigned long backoff_page_cache_fill;
+
+// 5 seconds delay. That is long enough to reduce
+// an interfering and racing with a shrinker where
+// the cache is drained.
+#define PAGE_CACHE_FILL_DELAY (5 * HZ)
+
 static __always_inline void
 debug_rcu_bhead_unqueue(struct kvfree_rcu_bulk_data *bhead)
 {
@@ -3229,6 +3240,26 @@ put_cached_bnode(struct kfree_rcu_cpu *krcp,
 
 }
 
+static int
+drain_page_cache(struct kfree_rcu_cpu *krcp)
+{
+   unsigned long flags;
+   struct llist_node *page_list, *pos, *n;
+   int freed = 0;
+
+   raw_spin_lock_irqsave(&krcp->lock, flags);
+   page_list = llist_del_all(&krcp->bkvcache);
+   krcp->nr_bkv_objs = 0;
+   raw_spin_unlock_irqrestore(&krcp->lock, flags);
+
+   llist_for_each_safe(pos, n, page_list) {
+   free_page((unsigned long)pos);
+   freed++;
+   }
+
+   return freed;
+}
+
 /*
  * This function is invoked in workqueue context after a grace period.
  * It frees all the objects queued on ->bhead_free or ->head_free.
@@ -3419,7 +3450,7 @@ schedule_page_work_fn(struct hrtimer *t)
struct kfree_rcu_cpu *krcp =
container_of(t, struct kfree_rcu_cpu, hrtimer);
 
-   queue_work(system_highpri_wq, &krcp->page_cache_work);
+   queue_delayed_work(system_highpri_wq, &krcp->page_cache_work, 0);
return HRTIMER_NORESTART;
 }
 
@@ -3428,7 +3459,7 @@ static void fill_page_cache_func(struct work_struct *work)
struct kvfree_rcu_bulk_data *bnode;
struct kfree_rcu_cpu *krcp =
container_of(work, struct kfree_rcu_cpu,
-   page_cache_work);
+   page_cache_work.work);
unsigned long flags;
bool pushed;
int i;
@@ -3457,10 +3488,14 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
 {
if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
!atomic_xchg(&krcp->work_in_progress, 1)) {
-   hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC,
-   HRTIMER_MODE_REL);
-   krcp->hrtimer.function = schedule_page_work_fn;
-   hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
+   if (xchg(&backoff_page_cache_fill, 0UL)) {
+   queue_delayed_work(system_wq,
+   &krcp->page_cache_work, PAGE_CACHE_FILL_DELAY);
+   } else {
+   hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+   krcp->hrtimer.function = schedule_page_work_fn;
+   hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
+   }
}
 }
 
@@ -3612,14 +3647,20 @@ kfree_rcu_shrink_count(struct shrinker *shrink, struct 
shrink_control *sc)
 {
int cpu;
unsigned long count = 0;
+   unsigned long flags;
 
/* Snapshot count of all CPUs */
for_each_possible_cpu(cpu) {
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
 
count += READ_ONCE(krcp->count);
+
+   raw_spin_lock_irqsave(&krcp->lock, flags);
+   count += krcp->nr_bkv_objs;
+   raw_spin_unlock_irqrestore(&krcp->lock, flags);
}
 
+   WRITE_ONCE(backoff_page_cache_fill, 1);
return count;
 }
 
@@ -3634,6 +3675,8 @@ kfree_rcu_shrink_scan(struct shrinker *shrink, struct 
shrink_control *sc)
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
 
count = krcp->count;
+   count +

[PATCH 2/2] kvfree_rcu: convert a page cache to lock-free variant

2021-03-08 Thread Uladzislau Rezki (Sony)
Implement access to the page cache as a lock-free variant. This is
done because there are extra places where access is required, so
making it lock-less removes any lock contention.

For example, we have a shrinker path as well as a reclaim kthread.
In both cases the current CPU can access a remote per-cpu page
cache, which would otherwise require taking a lock to protect it.

The "rcuscale" performance test suite can detect this and shows a
slight improvement:

../kvm.sh --memory 16G --torture rcuscale --allcpus --duration 10 \
--kconfig CONFIG_NR_CPUS=64 --bootargs "rcuscale.kfree_rcu_test=1 \
rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 rcuscale.kfree_loops=1 \
rcuscale.kfree_rcu_test_double=1 torture.disable_onoff_at_boot" --trust-make

100 iterations, checking total time taken by all kfree'ers in ns.:

default: AVG: 10968415107.5 MIN: 10668412500 MAX: 11312145160
patch:   AVG: 10787596486.1 MIN: 10397559880 MAX: 11214901050

Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/tree.c | 91 +--
 1 file changed, 56 insertions(+), 35 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9c8cfb01e9a6..4f04664d5ac0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3167,8 +3167,9 @@ struct kfree_rcu_cpu {
atomic_t work_in_progress;
struct hrtimer hrtimer;
 
+   // lock-free cache.
struct llist_head bkvcache;
-   int nr_bkv_objs;
+   atomic_t nr_bkv_objs;
 };
 
 static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
@@ -3215,49 +3216,79 @@ krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, 
unsigned long flags)
raw_spin_unlock_irqrestore(&krcp->lock, flags);
 }
 
+/*
+ * Increment 'v' if 'v' is below 'thresh'. Returns true if we
+ * succeeded, false if 'v' + 1 would exceed 'thresh'.
+ *
+ * Decrement 'v' if 'v' is above 'thresh'. Returns true if we
+ * succeeded, false if 'v' - 1 would fall below 'thresh'.
+ */
+static inline bool
+atomic_test_inc_dec(atomic_t *v, unsigned int thresh, bool inc)
+{
+   unsigned int cur = atomic_read(v);
+   unsigned int old;
+
+   for (;;) {
+   if (inc) {
+   if (cur >= thresh)
+   return false;
+   } else {
+   if (cur <= thresh)
+   return false;
+   }
+
+   old = atomic_cmpxchg(v, cur, inc ? (cur + 1):(cur - 1));
+   if (old == cur)
+   break;
+
+   cur = old;
+   }
+
+   return true;
+}
+
 static inline struct kvfree_rcu_bulk_data *
 get_cached_bnode(struct kfree_rcu_cpu *krcp)
 {
-   if (!krcp->nr_bkv_objs)
-   return NULL;
+   struct kvfree_rcu_bulk_data *bnode = NULL;
 
-   krcp->nr_bkv_objs--;
-   return (struct kvfree_rcu_bulk_data *)
-   llist_del_first(&krcp->bkvcache);
+   if (atomic_test_inc_dec(&krcp->nr_bkv_objs, 0, false))
+   bnode = (struct kvfree_rcu_bulk_data *)
+   llist_del_first(&krcp->bkvcache);
+
+   return bnode;
 }
 
 static inline bool
 put_cached_bnode(struct kfree_rcu_cpu *krcp,
struct kvfree_rcu_bulk_data *bnode)
 {
-   // Check the limit.
-   if (krcp->nr_bkv_objs >= rcu_min_cached_objs)
-   return false;
-
-   llist_add((struct llist_node *) bnode, &krcp->bkvcache);
-   krcp->nr_bkv_objs++;
-   return true;
+   if (atomic_test_inc_dec(&krcp->nr_bkv_objs, rcu_min_cached_objs, true)) {
+   llist_add((struct llist_node *) bnode, &krcp->bkvcache);
+   return true;
+   }
 
+   return false;
 }
 
 static int
 drain_page_cache(struct kfree_rcu_cpu *krcp)
 {
-   unsigned long flags;
-   struct llist_node *page_list, *pos, *n;
-   int freed = 0;
+   struct kvfree_rcu_bulk_data *bnode;
+   int num_pages, i;
 
-   raw_spin_lock_irqsave(&krcp->lock, flags);
-   page_list = llist_del_all(&krcp->bkvcache);
-   krcp->nr_bkv_objs = 0;
-   raw_spin_unlock_irqrestore(&krcp->lock, flags);
+   num_pages = atomic_read(&krcp->nr_bkv_objs);
+
+   for (i = 0; i < num_pages; i++) {
+   bnode = get_cached_bnode(krcp);
+   if (!bnode)
+   break;
 
-   llist_for_each_safe(pos, n, page_list) {
-   free_page((unsigned long)pos);
-   freed++;
+   free_page((unsigned long) bnode);
}
 
-   return freed;
+   return i;
 }
 
 /*
@@ -3314,10 +3345,8 @@ static void kfree_rcu_work(struct work_struct *work)
}
rcu_lock_release(&rcu_callback_map);
 
-   raw_spin_lock_irqsave(&krcp->lock, flags);
if (put_cached_bnode(krcp, bkvhead[i]))
bkvhead[i] = NULL;
-   raw_spin_unlock_irqrestore(&krcp->lock, flags

[PATCH 1/2] kvfree_rcu: Release a page cache under memory pressure

2021-03-08 Thread Uladzislau Rezki (Sony)
From: Zhang Qiang 

Add a drain_page_cache() function to drain a per-cpu page cache.
The reason behind it is that a system can run into a low-memory
condition; in that case a page shrinker can ask its users to free
their caches in order to make extra memory available for other
needs in the system.

When a system hits such a condition, the page cache is drained for
all CPUs in the system. Apart from that, the page-cache refill work
is delayed by a 5-second interval until the memory pressure
disappears.

Co-developed-by: Uladzislau Rezki (Sony) 
Signed-off-by: Uladzislau Rezki (Sony) 
Signed-off-by: Zqiang 
---
 kernel/rcu/tree.c | 59 ---
 1 file changed, 51 insertions(+), 8 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2c9cf4df942c..9c8cfb01e9a6 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3163,7 +3163,7 @@ struct kfree_rcu_cpu {
bool initialized;
int count;
 
-   struct work_struct page_cache_work;
+   struct delayed_work page_cache_work;
atomic_t work_in_progress;
struct hrtimer hrtimer;
 
@@ -3175,6 +3175,17 @@ static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
.lock = __RAW_SPIN_LOCK_UNLOCKED(krc.lock),
 };
 
+// A page shrinker can ask for freeing extra pages
+// to get them available for other needs in a system.
+// Usually it happens under low memory condition, in
+// that case hold on a bit with page cache filling.
+static bool backoff_page_cache_fill;
+
+// 5 seconds delay. That is long enough to reduce
+// an interfering and racing with a shrinker where
+// the cache is drained.
+#define PAGE_CACHE_FILL_DELAY (5 * HZ)
+
 static __always_inline void
 debug_rcu_bhead_unqueue(struct kvfree_rcu_bulk_data *bhead)
 {
@@ -3229,6 +3240,26 @@ put_cached_bnode(struct kfree_rcu_cpu *krcp,
 
 }
 
+static int
+drain_page_cache(struct kfree_rcu_cpu *krcp)
+{
+   unsigned long flags;
+   struct llist_node *page_list, *pos, *n;
+   int freed = 0;
+
+   raw_spin_lock_irqsave(&krcp->lock, flags);
+   page_list = llist_del_all(&krcp->bkvcache);
+   krcp->nr_bkv_objs = 0;
+   raw_spin_unlock_irqrestore(&krcp->lock, flags);
+
+   llist_for_each_safe(pos, n, page_list) {
+   free_page((unsigned long)pos);
+   freed++;
+   }
+
+   return freed;
+}
+
 /*
  * This function is invoked in workqueue context after a grace period.
  * It frees all the objects queued on ->bhead_free or ->head_free.
@@ -3419,7 +3450,7 @@ schedule_page_work_fn(struct hrtimer *t)
struct kfree_rcu_cpu *krcp =
container_of(t, struct kfree_rcu_cpu, hrtimer);
 
-   queue_work(system_highpri_wq, &krcp->page_cache_work);
+   queue_delayed_work(system_highpri_wq, &krcp->page_cache_work, 0);
return HRTIMER_NORESTART;
 }
 
@@ -3428,7 +3459,7 @@ static void fill_page_cache_func(struct work_struct *work)
struct kvfree_rcu_bulk_data *bnode;
struct kfree_rcu_cpu *krcp =
container_of(work, struct kfree_rcu_cpu,
-   page_cache_work);
+   page_cache_work.work);
unsigned long flags;
bool pushed;
int i;
@@ -3457,10 +3488,14 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
 {
if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
!atomic_xchg(&krcp->work_in_progress, 1)) {
-   hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC,
-   HRTIMER_MODE_REL);
-   krcp->hrtimer.function = schedule_page_work_fn;
-   hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
+   if (xchg(&backoff_page_cache_fill, false)) {
+   queue_delayed_work(system_wq,
+   &krcp->page_cache_work, PAGE_CACHE_FILL_DELAY);
+   } else {
+   hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+   krcp->hrtimer.function = schedule_page_work_fn;
+   hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
+   }
}
 }
 
@@ -3612,14 +3647,20 @@ kfree_rcu_shrink_count(struct shrinker *shrink, struct 
shrink_control *sc)
 {
int cpu;
unsigned long count = 0;
+   unsigned long flags;
 
/* Snapshot count of all CPUs */
for_each_possible_cpu(cpu) {
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
 
count += READ_ONCE(krcp->count);
+
+   raw_spin_lock_irqsave(&krcp->lock, flags);
+   count += krcp->nr_bkv_objs;
+   raw_spin_unlock_irqrestore(&krcp->lock, flags);
}
 
+   WRITE_ONCE(backoff_page_cache_fill, true);
return count;
 }
 
@@ -3634,6 +3675,8 @@ kfree_rcu_shrink_scan(struct shrinker *shrink, struct 
shrink_control *sc)
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
 
count = krcp->count;
+   count += drain

Re: [PATCH] kprobes: Fix to delay the kprobes jump optimization

2021-02-22 Thread Uladzislau Rezki
On Mon, Feb 22, 2021 at 10:16:08AM -0800, Paul E. McKenney wrote:
> On Mon, Feb 22, 2021 at 06:16:05PM +0100, Uladzislau Rezki wrote:
> > On Mon, Feb 22, 2021 at 07:09:03AM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 22, 2021 at 01:54:31PM +0100, Uladzislau Rezki wrote:
> > > > On Mon, Feb 22, 2021 at 11:21:04AM +0100, Sebastian Andrzej Siewior 
> > > > wrote:
> > > > > On 2021-02-19 10:33:36 [-0800], Paul E. McKenney wrote:
> > > > > > For definiteness, here is the first part of the change, posted 
> > > > > > earlier.
> > > > > > The commit log needs to be updated.  I will post the change that 
> > > > > > keeps
> > > > > > the tick going as a reply to this email.
> > > > > …
> > > > > > diff --git a/kernel/softirq.c b/kernel/softirq.c
> > > > > > index 9d71046..ba78e63 100644
> > > > > > --- a/kernel/softirq.c
> > > > > > +++ b/kernel/softirq.c
> > > > > > @@ -209,7 +209,7 @@ static inline void invoke_softirq(void)
> > > > > > if (ksoftirqd_running(local_softirq_pending()))
> > > > > > return;
> > > > > >  
> > > > > > -   if (!force_irqthreads) {
> > > > > > +   if (!force_irqthreads || !__this_cpu_read(ksoftirqd)) {
> > > > > >  #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
> > > > > > /*
> > > > > >  * We can safely execute softirq on the current stack if
> > > > > > @@ -358,8 +358,8 @@ asmlinkage __visible void __softirq_entry 
> > > > > > __do_softirq(void)
> > > > > >  
> > > > > > pending = local_softirq_pending();
> > > > > > if (pending) {
> > > > > > -   if (time_before(jiffies, end) && !need_resched() &&
> > > > > > -   --max_restart)
> > > > > > +   if (!__this_cpu_read(ksoftirqd) ||
> > > > > > +   (time_before(jiffies, end) && !need_resched() && 
> > > > > > --max_restart))
> > > > > > goto restart;
> > > > > 
> > > > > This is hunk shouldn't be needed. The reason for it is probably that 
> > > > > the
> > > > > following wakeup_softirqd() would avoid further invoke_softirq()
> > > > > performing the actual softirq work. It would leave early due to
> > > > > ksoftirqd_running(). Unless I'm wrong, any raise_softirq() invocation
> > > > > outside of an interrupt would do the same. 
> > > 
> > > And it does pass the rcutorture test without that hunk:
> > > 
> > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 
> > > --configs "TREE03" --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y 
> > > CONFIG_PROVE_LOCKING=y" --bootargs "threadirqs=1" --trust-make
> > > 
> > Yep. I have tested that patch also. It works for me as well. So
> > technically i do not see any issues from the first glance but of
> > course it should be reviewed by the softirq people to hear their
> > opinion.
> > 
> > IRQs are enabled, so it can be handled from an IRQ tail until
> > ksoftirqd threads are spawned.
> 
> And if I add "CONFIG_NO_HZ_IDLE=y CONFIG_HZ_PERIODIC=n" it still works,
> even if I revert my changes to rcu_needs_cpu().  Should I rely on this
> working globally?  ;-)
> 
There might be corner cases which we are not aware of so far. On the
other hand, what the patch does is simulate the !threadirqs behaviour
during early boot, and in that case we know that handling softirqs from
the real-IRQ tail works :)

--
Vlad Rezki


Re: [PATCH] kprobes: Fix to delay the kprobes jump optimization

2021-02-22 Thread Uladzislau Rezki
On Mon, Feb 22, 2021 at 07:09:03AM -0800, Paul E. McKenney wrote:
> On Mon, Feb 22, 2021 at 01:54:31PM +0100, Uladzislau Rezki wrote:
> > On Mon, Feb 22, 2021 at 11:21:04AM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2021-02-19 10:33:36 [-0800], Paul E. McKenney wrote:
> > > > For definiteness, here is the first part of the change, posted earlier.
> > > > The commit log needs to be updated.  I will post the change that keeps
> > > > the tick going as a reply to this email.
> > > …
> > > > diff --git a/kernel/softirq.c b/kernel/softirq.c
> > > > index 9d71046..ba78e63 100644
> > > > --- a/kernel/softirq.c
> > > > +++ b/kernel/softirq.c
> > > > @@ -209,7 +209,7 @@ static inline void invoke_softirq(void)
> > > > if (ksoftirqd_running(local_softirq_pending()))
> > > > return;
> > > >  
> > > > -   if (!force_irqthreads) {
> > > > +   if (!force_irqthreads || !__this_cpu_read(ksoftirqd)) {
> > > >  #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
> > > > /*
> > > >  * We can safely execute softirq on the current stack if
> > > > @@ -358,8 +358,8 @@ asmlinkage __visible void __softirq_entry 
> > > > __do_softirq(void)
> > > >  
> > > > pending = local_softirq_pending();
> > > > if (pending) {
> > > > -   if (time_before(jiffies, end) && !need_resched() &&
> > > > -   --max_restart)
> > > > +   if (!__this_cpu_read(ksoftirqd) ||
> > > > +   (time_before(jiffies, end) && !need_resched() && 
> > > > --max_restart))
> > > > goto restart;
> > > 
> > > This is hunk shouldn't be needed. The reason for it is probably that the
> > > following wakeup_softirqd() would avoid further invoke_softirq()
> > > performing the actual softirq work. It would leave early due to
> > > ksoftirqd_running(). Unless I'm wrong, any raise_softirq() invocation
> > > outside of an interrupt would do the same. 
> 
> And it does pass the rcutorture test without that hunk:
> 
> tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 
> --configs "TREE03" --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y 
> CONFIG_PROVE_LOCKING=y" --bootargs "threadirqs=1" --trust-make
> 
Yep. I have tested that patch also, and it works for me as well. So
technically I do not see any issues at first glance, but of course it
should be reviewed by the softirq people to hear their opinion.

IRQs are enabled, so softirqs can be handled from an IRQ tail until
the ksoftirqd threads are spawned.

> > > I would like PeterZ / tglx to comment on this one. Basically I'm not
> > > sure if it is okay to expect softirqs beeing served and waited on that
> > > early in the boot.
> 
> It would be good to get other eyes on this.
> 
> I do agree that "don't wait on softirq handlers until after completion
> of all early_initcall() handlers" is a nice simple rule, but debugging
> violations of it is not so simple.  Adding warnings to ease debugging
> of violations of this rule is quite a bit more complex than is either of
> the methods of making the rule unnecessary, at least from what I can see
> at this point.  The complexity of the warnings is exactly what Sebastian
> pointed out earlier, that it is currently legal to raise_softirq() but
> not to wait on the resulting handlers.  But even waiting is OK if that
> waiting does not delay the boot sequence.  But if the boot kthread waits
> on the kthread that does the waiting, it is once again not OK.
> 
> So am I missing something subtle here?
>
I agree here. Seems like we are on the same page in understanding :)

> > The ksoftirqd threads get spawned during early_initcall() phase. Why not
> > just spawn them one step earlier what is totally safe? I mean before
> > do_pre_smp_initcalls() that calls early callbacks.
> > 
> > +   spawn_ksoftirqd();
> > rcu_init_tasks_generic();
> > do_pre_smp_initcalls();
> > 
> > With such change the spawning will not be depended on linker/compiler
> > i.e. when and in which order an early_initcall(spawn_ksoftirqd) callback
> > is executed.
> 
> We both posted patches similar to this, so I am not opposed.  One caveat,
> though, namely that this narrows the window quite a bit but does not
> entirely close it.  But it does allow the early_initcall()s to wait on
> softirq handlers.
> 
Yep, that was the intention: at least to provide such functionality for
the early callbacks. What happens before that (in init/main.c) is pretty
controllable.

--
Vlad Rezki


Re: [PATCH] kprobes: Fix to delay the kprobes jump optimization

2021-02-22 Thread Uladzislau Rezki
On Mon, Feb 22, 2021 at 11:21:04AM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-02-19 10:33:36 [-0800], Paul E. McKenney wrote:
> > For definiteness, here is the first part of the change, posted earlier.
> > The commit log needs to be updated.  I will post the change that keeps
> > the tick going as a reply to this email.
> …
> > diff --git a/kernel/softirq.c b/kernel/softirq.c
> > index 9d71046..ba78e63 100644
> > --- a/kernel/softirq.c
> > +++ b/kernel/softirq.c
> > @@ -209,7 +209,7 @@ static inline void invoke_softirq(void)
> > if (ksoftirqd_running(local_softirq_pending()))
> > return;
> >  
> > -   if (!force_irqthreads) {
> > +   if (!force_irqthreads || !__this_cpu_read(ksoftirqd)) {
> >  #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
> > /*
> >  * We can safely execute softirq on the current stack if
> > @@ -358,8 +358,8 @@ asmlinkage __visible void __softirq_entry 
> > __do_softirq(void)
> >  
> > pending = local_softirq_pending();
> > if (pending) {
> > -   if (time_before(jiffies, end) && !need_resched() &&
> > -   --max_restart)
> > +   if (!__this_cpu_read(ksoftirqd) ||
> > +   (time_before(jiffies, end) && !need_resched() && 
> > --max_restart))
> > goto restart;
> 
> This is hunk shouldn't be needed. The reason for it is probably that the
> following wakeup_softirqd() would avoid further invoke_softirq()
> performing the actual softirq work. It would leave early due to
> ksoftirqd_running(). Unless I'm wrong, any raise_softirq() invocation
> outside of an interrupt would do the same. 
> 
> I would like PeterZ / tglx to comment on this one. Basically I'm not
> sure if it is okay to expect softirqs beeing served and waited on that
> early in the boot.
> 
The ksoftirqd threads get spawned during the early_initcall() phase. Why
not just spawn them one step earlier, which is totally safe? I mean before
do_pre_smp_initcalls(), which invokes the early callbacks:

+   spawn_ksoftirqd();
rcu_init_tasks_generic();
do_pre_smp_initcalls();

With such a change the spawning will not depend on the linker/compiler,
i.e. on when and in which order the early_initcall(spawn_ksoftirqd)
callback is executed.

--
Vlad Rezki


Re: [PATCH] kprobes: Fix to delay the kprobes jump optimization

2021-02-19 Thread Uladzislau Rezki
On Fri, Feb 19, 2021 at 12:23:57PM +0100, Uladzislau Rezki wrote:
> On Fri, Feb 19, 2021 at 12:17:38PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2021-02-19 12:13:01 [+0100], Uladzislau Rezki wrote:
> > > I or Paul will ask for a test once it is settled down :) Looks like
> > > it is, so we should fix for v5.12.
> > 
> > Okay. Since Paul asked for powerpc test on v5.11-rc I wanted check if
> > parts of it are also -stable material.
> > 
> OK, i see. It will be broken starting from v5.12-rc unless we fix it.
> 
Sorry, it has been broken since the 5.11 kernel already; I messed it up.

--
Vlad Rezki


Re: [PATCH] kprobes: Fix to delay the kprobes jump optimization

2021-02-19 Thread Uladzislau Rezki
On Fri, Feb 19, 2021 at 12:17:38PM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-02-19 12:13:01 [+0100], Uladzislau Rezki wrote:
> > I or Paul will ask for a test once it is settled down :) Looks like
> > it is, so we should fix for v5.12.
> 
> Okay. Since Paul asked for powerpc test on v5.11-rc I wanted check if
> parts of it are also -stable material.
> 
OK, I see. It will be broken starting from v5.12-rc unless we fix it.

--
Vlad Rezki


Re: [PATCH] kprobes: Fix to delay the kprobes jump optimization

2021-02-19 Thread Uladzislau Rezki
On Fri, Feb 19, 2021 at 11:57:10AM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-02-19 11:49:58 [+0100], Uladzislau Rezki wrote:
> > If above fix works, we can initialize rcu_init_tasks_generic() from the
> > core_initcall() including selftst. It means that such initialization can
> > be done later:
> 
> Good. Please let me know once there is something for me to test.
> Do I assume correctly that the self-test, I stumbled upon, is v5.12
> material?
> 
I or Paul will ask for a test once it has settled down :) Looks like
it has, so we should fix it for v5.12.

--
Vlad Rezki


Re: [PATCH] kprobes: Fix to delay the kprobes jump optimization

2021-02-19 Thread Uladzislau Rezki
On Fri, Feb 19, 2021 at 09:17:55AM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-02-18 07:15:54 [-0800], Paul E. McKenney wrote:
> > Thank you, but the original report of a problem was from Sebastian
> > and the connection to softirq was Uladzislau.  So could you please
> > add these before (or even in place of) my Reported-by?
> > 
> > Reported-by: Sebastian Andrzej Siewior 
> > Reported-by: Uladzislau Rezki 
> > 
> > Other than that, looks good!
> 
> Perfect. I'm kind of lost here, nevertheless ;) Does this mean that the
> RCU selftest can now be delayed?
> 
If the above fix works, we can initialize rcu_init_tasks_generic() from
core_initcall(), including the self-test. It means that such
initialization can be done later:

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 5cc6deaa5df2..ae7d0cdfa9bd 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -88,12 +88,6 @@ void rcu_sched_clock_irq(int user);
 void rcu_report_dead(unsigned int cpu);
 void rcutree_migrate_callbacks(int cpu);
 
-#ifdef CONFIG_TASKS_RCU_GENERIC
-void rcu_init_tasks_generic(void);
-#else
-static inline void rcu_init_tasks_generic(void) { }
-#endif
-
 #ifdef CONFIG_RCU_STALL_COMMON
 void rcu_sysrq_start(void);
 void rcu_sysrq_end(void);
diff --git a/init/main.c b/init/main.c
index c68d784376ca..3024c4db17a9 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1512,7 +1512,6 @@ static noinline void __init kernel_init_freeable(void)
 
init_mm_internals();
 
-   rcu_init_tasks_generic();
do_pre_smp_initcalls();
lockup_detector_init();
 
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 17c8ebe131af..2797f9a042f4 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -966,11 +966,6 @@ static void rcu_tasks_trace_pregp_step(void)
 static void rcu_tasks_trace_pertask(struct task_struct *t,
struct list_head *hop)
 {
-   // During early boot when there is only the one boot CPU, there
-   // is no idle task for the other CPUs. Just return.
-   if (unlikely(t == NULL))
-   return;
-
WRITE_ONCE(t->trc_reader_special.b.need_qs, false);
WRITE_ONCE(t->trc_reader_checked, false);
t->trc_ipi_to_cpu = -1;
@@ -1300,7 +1295,7 @@ late_initcall(rcu_tasks_verify_self_tests);
 static void rcu_tasks_initiate_self_tests(void) { }
 #endif /* #else #ifdef CONFIG_PROVE_RCU */
 
-void __init rcu_init_tasks_generic(void)
+static void __init rcu_init_tasks_generic(void)
 {
 #ifdef CONFIG_TASKS_RCU
rcu_spawn_tasks_kthread();
@@ -1318,6 +1313,7 @@ void __init rcu_init_tasks_generic(void)
rcu_tasks_initiate_self_tests();
 }
 
+core_initcall(rcu_init_tasks_generic);
 #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
 static inline void rcu_tasks_bootup_oddness(void) {}
 void show_rcu_tasks_gp_kthreads(void) {}


--
Vlad Rezki


Re: [PATCH 2/2] rcu-tasks: add RCU-tasks self tests

2021-02-18 Thread Uladzislau Rezki
On Thu, Feb 18, 2021 at 02:03:07PM +0900, Masami Hiramatsu wrote:
> On Wed, 17 Feb 2021 10:17:38 -0800
> "Paul E. McKenney"  wrote:
> 
> > > > 1.  Spawn ksoftirqd earlier.
> > > > 
> > > > 2.  Suppress attempts to awaken ksoftirqd before it exists,
> > > > forcing all ksoftirq execution on the back of interrupts.
> > > > 
> > > > Uladzislau and I each produced patches for #1, and I produced a patch
> > > > for #2.
> > > > 
> > > > The only other option I know of is to push the call to init_kprobes()
> > > > later in the boot sequence, perhaps to its original subsys_initcall(),
> > > > or maybe only as late as core_initcall().  I added Masami and Steve on
> > > > CC for their thoughts on this.
> > > > 
> > > > Is there some other proper fix that I am missing?
> > > 
> > > Oh, I missed that the synchronize_rcu_tasks() will be involved the kprobes
> > > in early stage. Does the problem only exist in the synchronize_rcu_tasks()
> > > instead of synchronize_rcu()? If so I can just stop optimizer in early 
> > > stage
> > > because I just want to enable kprobes in early stage, but not optprobes.
> > > 
> > > Does the following patch help?
> > 
> > It does look to me like it would!  I clearly should have asked you about
> > this a couple of months ago.  ;-)
> > 
> > The proof of the pudding would be whether the powerpc guys can apply
> > this to v5.10-rc7 and have their kernel come up without hanging at boot.
> 
> Who could I ask for testing this patch, Uladzislau?
> I think the test machine which enough slow or the kernel has much initcall
> to run optimization thread while booting.
> In my environment, I could not reproduce that issue because the optimizer
> was sheduled after some tick passed. At that point, ksoftirqd has already
> been initialized.
> 
From my end I did some simulation and had a look at your change. The
patch works on my setup. I see that the optimization of kprobes is
deferred and can be initiated only from the subsys_initcall() phase, so
the sequence should be correct for v5.10-rc7:

1. ksoftirqd is set up in early_initcall();
2. rcu_spawn_tasks_*() are set up in core_initcall();
3. the optimization of kprobes is invoked from subsys_initcall().

For a real test on powerpc you can ask Daniel Axtens for help.

--
Vlad Rezki


[PATCH v3 1/1] rcuscale: add kfree_rcu() single-argument scale test

2021-02-17 Thread Uladzislau Rezki (Sony)
To stress and test the single-argument variant of the kfree_rcu()
call, we should have special coverage for it. We used to have it
in the test suite related to vmalloc stressing, but rcuscale is
the correct place for RCU-related things.

Therefore introduce two torture_param() variables, one for a
single-argument scale test and another for a double-argument
scale test.

By default kfree_rcu_test_single and kfree_rcu_test_double are
initialized to false. If both have the same value (false or true)
both are randomly tested, otherwise only the one with value true
is tested. The value of this is that it allows testing of both
options with one test.
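
For example, both parameters can be exercised from the rcutorture
scripts; a usage sketch modelled on the kvm.sh invocations shown
elsewhere in this archive (only the two new parameters are specific to
this patch):

	tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus \
		--duration 10 --bootargs "rcuscale.kfree_rcu_test=1 \
		rcuscale.kfree_rcu_test_single=1" --trust-make

Leaving both parameters at their default (false) keeps the randomized
mix of single- and double-argument kfree_rcu() calls.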

Suggested-by: Paul E. McKenney 
Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/rcuscale.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index 06491d5530db..0fb540e2b22b 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -625,6 +625,8 @@ rcu_scale_shutdown(void *arg)
 torture_param(int, kfree_nthreads, -1, "Number of threads running loops of 
kfree_rcu().");
 torture_param(int, kfree_alloc_num, 8000, "Number of allocations and frees 
done in an iteration.");
 torture_param(int, kfree_loops, 10, "Number of loops doing kfree_alloc_num 
allocations and frees.");
+torture_param(bool, kfree_rcu_test_single, false, "Do we run a kfree_rcu() 
single-argument scale test?");
+torture_param(bool, kfree_rcu_test_double, false, "Do we run a kfree_rcu() 
double-argument scale test?");
 
 static struct task_struct **kfree_reader_tasks;
 static int kfree_nrealthreads;
@@ -644,10 +646,13 @@ kfree_scale_thread(void *arg)
struct kfree_obj *alloc_ptr;
u64 start_time, end_time;
long long mem_begin, mem_during = 0;
+   bool kfree_rcu_test_both;
+   DEFINE_TORTURE_RANDOM(tr);
 
VERBOSE_SCALEOUT_STRING("kfree_scale_thread task started");
set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
set_user_nice(current, MAX_NICE);
+   kfree_rcu_test_both = (kfree_rcu_test_single == kfree_rcu_test_double);
 
start_time = ktime_get_mono_fast_ns();
 
@@ -670,7 +675,15 @@ kfree_scale_thread(void *arg)
if (!alloc_ptr)
return -ENOMEM;
 
-   kfree_rcu(alloc_ptr, rh);
+   // By default kfree_rcu_test_single and kfree_rcu_test_double are
+   // initialized to false. If both have the same value (false or true)
+   // both are randomly tested, otherwise only the one with value true
+   // is tested.
+   if ((kfree_rcu_test_single && !kfree_rcu_test_double) ||
+   (kfree_rcu_test_both && torture_random(&tr) & 0x800))
+   kfree_rcu(alloc_ptr);
+   else
+   kfree_rcu(alloc_ptr, rh);
}
 
cond_resched();
-- 
2.20.1



Re: [PATCH 1/2] rcuscale: add kfree_rcu() single-argument scale test

2021-02-17 Thread Uladzislau Rezki
On Tue, Feb 16, 2021 at 09:35:02AM -0800, Paul E. McKenney wrote:
> On Mon, Feb 15, 2021 at 05:27:05PM +0100, Uladzislau Rezki wrote:
> > On Tue, Feb 09, 2021 at 05:00:52PM -0800, Paul E. McKenney wrote:
> > > On Tue, Feb 09, 2021 at 09:13:43PM +0100, Uladzislau Rezki wrote:
> > > > On Thu, Feb 04, 2021 at 01:46:48PM -0800, Paul E. McKenney wrote:
> > > > > On Fri, Jan 29, 2021 at 09:05:04PM +0100, Uladzislau Rezki (Sony) 
> > > > > wrote:
> > > > > > To stress and test a single argument of kfree_rcu() call, we
> > > > > > should to have a special coverage for it. We used to have it
> > > > > > in the test-suite related to vmalloc stressing. The reason is
> > > > > > the rcuscale is a correct place for RCU related things.
> > > > > > 
> > > > > > Signed-off-by: Uladzislau Rezki (Sony) 
> > > > > 
> > > > > This is a great addition, but it would be even better if there was
> > > > > a way to say "test both in one run".  One way to do this is to have
> > > > > torture_param() variables for both kfree_rcu_test_single and (say)
> > > > > kfree_rcu_test_double, both bool and both initialized to false.  If 
> > > > > both
> > > > > have the same value (false or true) both are tested, otherwise only
> > > > > the one with value true is tested.  The value of this is that it 
> > > > > allows
> > > > > testing of both options with one test.
> > > > > 
> > > > Make sense to me :)
> > > > 
> > > > >From ba083a543a123455455c81230b7b5a9aa2a9cb7f Mon Sep 17 00:00:00 2001
> > > > From: "Uladzislau Rezki (Sony)" 
> > > > Date: Fri, 29 Jan 2021 19:51:27 +0100
> > > > Subject: [PATCH v2 1/1] rcuscale: add kfree_rcu() single-argument scale 
> > > > test
> > > > 
> > > > To stress and test a single argument of kfree_rcu() call, we
> > > > should to have a special coverage for it. We used to have it
> > > > in the test-suite related to vmalloc stressing. The reason is
> > > > the rcuscale is a correct place for RCU related things.
> > > > 
> > > > Therefore introduce two torture_param() variables, one is for
> > > > single-argument scale test and another one for double-argument
> > > > scale test.
> > > > 
> > > > By default kfree_rcu_test_single and kfree_rcu_test_double are
> > > > initialized to false. If both have the same value (false or true)
> > > > both are tested in one run, otherwise only the one with value
> > > > true is tested. The value of this is that it allows testing of
> > > > both options with one test.
> > > > 
> > > > Signed-off-by: Uladzislau Rezki (Sony) 
> > > > ---
> > > >  kernel/rcu/rcuscale.c | 33 -
> > > >  1 file changed, 28 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
> > > > index 06491d5530db..0cde5c17f06c 100644
> > > > --- a/kernel/rcu/rcuscale.c
> > > > +++ b/kernel/rcu/rcuscale.c
> > > > @@ -625,6 +625,8 @@ rcu_scale_shutdown(void *arg)
> > > >  torture_param(int, kfree_nthreads, -1, "Number of threads running 
> > > > loops of kfree_rcu().");
> > > >  torture_param(int, kfree_alloc_num, 8000, "Number of allocations and 
> > > > frees done in an iteration.");
> > > >  torture_param(int, kfree_loops, 10, "Number of loops doing 
> > > > kfree_alloc_num allocations and frees.");
> > > > +torture_param(int, kfree_rcu_test_single, 0, "Do we run a kfree_rcu() 
> > > > single-argument scale test?");
> > > > +torture_param(int, kfree_rcu_test_double, 0, "Do we run a kfree_rcu() 
> > > > double-argument scale test?");
> > > 
> > > Good!  But why int instead of bool?
> > > 
> > > >  static struct task_struct **kfree_reader_tasks;
> > > >  static int kfree_nrealthreads;
> > > > @@ -641,7 +643,7 @@ kfree_scale_thread(void *arg)
> > > >  {
> > > > int i, loop = 0;
> > > > long me = (long)arg;
> > > > -   struct kfree_obj *alloc_ptr;
> > > > +   struct kfree_obj *alloc_ptr[2];
> > > 
> > > You lost me on this one...
> > > 
>

Re: [PATCH 1/2] rcuscale: add kfree_rcu() single-argument scale test

2021-02-15 Thread Uladzislau Rezki
On Tue, Feb 09, 2021 at 05:00:52PM -0800, Paul E. McKenney wrote:
> On Tue, Feb 09, 2021 at 09:13:43PM +0100, Uladzislau Rezki wrote:
> > On Thu, Feb 04, 2021 at 01:46:48PM -0800, Paul E. McKenney wrote:
> > > On Fri, Jan 29, 2021 at 09:05:04PM +0100, Uladzislau Rezki (Sony) wrote:
> > > > To stress and test a single argument of kfree_rcu() call, we
> > > > should to have a special coverage for it. We used to have it
> > > > in the test-suite related to vmalloc stressing. The reason is
> > > > the rcuscale is a correct place for RCU related things.
> > > > 
> > > > Signed-off-by: Uladzislau Rezki (Sony) 
> > > 
> > > This is a great addition, but it would be even better if there was
> > > a way to say "test both in one run".  One way to do this is to have
> > > torture_param() variables for both kfree_rcu_test_single and (say)
> > > kfree_rcu_test_double, both bool and both initialized to false.  If both
> > > have the same value (false or true) both are tested, otherwise only
> > > the one with value true is tested.  The value of this is that it allows
> > > testing of both options with one test.
> > > 
> > Make sense to me :)
> > 
> > >From ba083a543a123455455c81230b7b5a9aa2a9cb7f Mon Sep 17 00:00:00 2001
> > From: "Uladzislau Rezki (Sony)" 
> > Date: Fri, 29 Jan 2021 19:51:27 +0100
> > Subject: [PATCH v2 1/1] rcuscale: add kfree_rcu() single-argument scale test
> > 
> > To stress and test a single argument of kfree_rcu() call, we
> > should to have a special coverage for it. We used to have it
> > in the test-suite related to vmalloc stressing. The reason is
> > the rcuscale is a correct place for RCU related things.
> > 
> > Therefore introduce two torture_param() variables, one is for
> > single-argument scale test and another one for double-argument
> > scale test.
> > 
> > By default kfree_rcu_test_single and kfree_rcu_test_double are
> > initialized to false. If both have the same value (false or true)
> > both are tested in one run, otherwise only the one with value
> > true is tested. The value of this is that it allows testing of
> > both options with one test.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) 
> > ---
> >  kernel/rcu/rcuscale.c | 33 -
> >  1 file changed, 28 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
> > index 06491d5530db..0cde5c17f06c 100644
> > --- a/kernel/rcu/rcuscale.c
> > +++ b/kernel/rcu/rcuscale.c
> > @@ -625,6 +625,8 @@ rcu_scale_shutdown(void *arg)
> >  torture_param(int, kfree_nthreads, -1, "Number of threads running loops of 
> > kfree_rcu().");
> >  torture_param(int, kfree_alloc_num, 8000, "Number of allocations and frees 
> > done in an iteration.");
> >  torture_param(int, kfree_loops, 10, "Number of loops doing kfree_alloc_num 
> > allocations and frees.");
> > +torture_param(int, kfree_rcu_test_single, 0, "Do we run a kfree_rcu() 
> > single-argument scale test?");
> > +torture_param(int, kfree_rcu_test_double, 0, "Do we run a kfree_rcu() 
> > double-argument scale test?");
> 
> Good!  But why int instead of bool?
> 
> >  static struct task_struct **kfree_reader_tasks;
> >  static int kfree_nrealthreads;
> > @@ -641,7 +643,7 @@ kfree_scale_thread(void *arg)
> >  {
> > int i, loop = 0;
> > long me = (long)arg;
> > -   struct kfree_obj *alloc_ptr;
> > +   struct kfree_obj *alloc_ptr[2];
> 
> You lost me on this one...
> 
> > u64 start_time, end_time;
> > long long mem_begin, mem_during = 0;
> >  
> > @@ -665,12 +667,33 @@ kfree_scale_thread(void *arg)
> > mem_during = (mem_during + si_mem_available()) / 2;
> > }
> >  
> > +   // By default kfree_rcu_test_single and kfree_rcu_test_double 
> > are
> > +   // initialized to false. If both have the same value (false or 
> > true)
> > +   // both are tested in one run, otherwise only the one with value
> > +   // true is tested.
> > for (i = 0; i < kfree_alloc_num; i++) {
> > -   alloc_ptr = kmalloc(kfree_mult * sizeof(struct 
> > kfree_obj), GFP_KERNEL);
> > -   if (!alloc_ptr)
> > -   return -ENOMEM;
> > +   alloc_ptr[0] = kmalloc(kfree_mult * sizeof(struct 
> > kfree_obj), 

[tip: core/rcu] rcu: Introduce kfree_rcu() single-argument macro

2021-02-15 Thread tip-bot2 for Uladzislau Rezki (Sony)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 5130b8fd06901c1b3a4bd0d0f5c5ea99b2b0a6f0
Gitweb:
https://git.kernel.org/tip/5130b8fd06901c1b3a4bd0d0f5c5ea99b2b0a6f0
Author:Uladzislau Rezki (Sony) 
AuthorDate:Fri, 20 Nov 2020 12:49:16 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 04 Jan 2021 13:42:04 -08:00

rcu: Introduce kfree_rcu() single-argument macro

There is a kvfree_rcu() single argument macro that handles pointers
returned by kvmalloc(). Even though it also handles pointer returned by
kmalloc(), readability suffers.

This commit therefore updates the kfree_rcu() macro to explicitly pair
with kmalloc(), thus improving readability.
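
For illustration, the two invocation forms being unified here look roughly
like this (a minimal sketch only; the struct name and field are hypothetical):

	struct foo {
		struct rcu_head rh;
		int data;
	};

	struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

	/* Either the double-argument form, naming the embedded rcu_head: */
	kfree_rcu(p, rh);

	/* ...or the single-argument form, which after this change maps to kvfree_rcu(ptr): */
	kfree_rcu(p);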

Signed-off-by: Uladzislau Rezki (Sony) 
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcupdate.h | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index de08264..b95373e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -851,8 +851,9 @@ static inline notrace void 
rcu_read_unlock_sched_notrace(void)
 
 /**
  * kfree_rcu() - kfree an object after a grace period.
- * @ptr:   pointer to kfree
- * @rhf:   the name of the struct rcu_head within the type of @ptr.
+ * @ptr: pointer to kfree for both single- and double-argument invocations.
+ * @rhf: the name of the struct rcu_head within the type of @ptr,
+ *   but only for double-argument invocations.
  *
  * Many rcu callbacks functions just call kfree() on the base structure.
  * These functions are trivial, but their size adds up, and furthermore
@@ -875,13 +876,7 @@ static inline notrace void 
rcu_read_unlock_sched_notrace(void)
  * The BUILD_BUG_ON check must not involve any function calls, hence the
  * checks are done in macros here.
  */
-#define kfree_rcu(ptr, rhf)\
-do {   \
-   typeof (ptr) ___p = (ptr);  \
-   \
-   if (___p)   \
-   __kvfree_rcu(&((___p)->rhf), offsetof(typeof(*(ptr)), rhf)); \
-} while (0)
+#define kfree_rcu kvfree_rcu
 
 /**
  * kvfree_rcu() - kvfree an object after a grace period.
@@ -913,7 +908,14 @@ do {   
\
kvfree_rcu_arg_2, kvfree_rcu_arg_1)(__VA_ARGS__)
 
 #define KVFREE_GET_MACRO(_1, _2, NAME, ...) NAME
-#define kvfree_rcu_arg_2(ptr, rhf) kfree_rcu(ptr, rhf)
+#define kvfree_rcu_arg_2(ptr, rhf) \
+do {   \
+   typeof (ptr) ___p = (ptr);  \
+   \
+   if (___p)   \
+   __kvfree_rcu(&((___p)->rhf), offsetof(typeof(*(ptr)), rhf)); \
+} while (0)
+
 #define kvfree_rcu_arg_1(ptr)  \
 do {   \
typeof(ptr) ___p = (ptr);   \


[tip: core/rcu] rcu: Eliminate the __kvfree_rcu() macro

2021-02-15 Thread tip-bot2 for Uladzislau Rezki (Sony)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 5ea5d1ed572cb5ac173674fe770252253d2d9e27
Gitweb:
https://git.kernel.org/tip/5ea5d1ed572cb5ac173674fe770252253d2d9e27
Author:Uladzislau Rezki (Sony) 
AuthorDate:Fri, 20 Nov 2020 12:49:17 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 04 Jan 2021 13:42:04 -08:00

rcu: Eliminate the __kvfree_rcu() macro

This commit open-codes the __kvfree_rcu() macro, thus saving a
few lines of code and improving readability.

Signed-off-by: Uladzislau Rezki (Sony) 
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcupdate.h | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index b95373e..f1576cd 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -840,15 +840,6 @@ static inline notrace void 
rcu_read_unlock_sched_notrace(void)
  */
 #define __is_kvfree_rcu_offset(offset) ((offset) < 4096)
 
-/*
- * Helper macro for kfree_rcu() to prevent argument-expansion eyestrain.
- */
-#define __kvfree_rcu(head, offset) \
-   do { \
-   BUILD_BUG_ON(!__is_kvfree_rcu_offset(offset)); \
-   kvfree_call_rcu(head, (rcu_callback_t)(unsigned long)(offset)); \
-   } while (0)
-
 /**
  * kfree_rcu() - kfree an object after a grace period.
  * @ptr: pointer to kfree for both single- and double-argument invocations.
@@ -866,7 +857,7 @@ static inline notrace void 
rcu_read_unlock_sched_notrace(void)
  * Because the functions are not allowed in the low-order 4096 bytes of
  * kernel virtual memory, offsets up to 4095 bytes can be accommodated.
  * If the offset is larger than 4095 bytes, a compile-time error will
- * be generated in __kvfree_rcu(). If this error is triggered, you can
+ * be generated in kvfree_rcu_arg_2(). If this error is triggered, you can
  * either fall back to use of call_rcu() or rearrange the structure to
  * position the rcu_head structure into the first 4096 bytes.
  *
@@ -912,8 +903,11 @@ static inline notrace void 
rcu_read_unlock_sched_notrace(void)
 do {   \
typeof (ptr) ___p = (ptr);  \
\
-   if (___p)   \
-   __kvfree_rcu(&((___p)->rhf), offsetof(typeof(*(ptr)), rhf)); \
+   if (___p) { \
+   BUILD_BUG_ON(!__is_kvfree_rcu_offset(offsetof(typeof(*(ptr)), rhf))); \
+   kvfree_call_rcu(&((___p)->rhf), (rcu_callback_t)(unsigned long) \
+   (offsetof(typeof(*(ptr)), rhf))); \
+   }   \
 } while (0)
 
 #define kvfree_rcu_arg_1(ptr)  \


Re: [PATCH v2] mm/vmalloc: randomize vmalloc() allocations

2021-02-15 Thread Uladzislau Rezki
On Sat, Feb 13, 2021 at 03:43:39PM +0200, Topi Miettinen wrote:
> On 13.2.2021 13.55, Uladzislau Rezki wrote:
> > > Hello,
> > > 
> > > Is there a chance of getting this reviewed and maybe even merged, please?
> > > 
> > > -Topi
> > > 
> > I can review it and help with it. But before that i would like to
> > clarify if such "randomization" is something that you can not leave?
> 
> This happens to interest me and I don't mind the performance loss since I
> think there's also an improvement in security. I suppose (perhaps wrongly)
> that others may also be interested in such features. For example, also
> `nosmt` can take away a big part of CPU processing capability.
>
OK. I was wondering whether this is intended for production systems or
for specific projects where such a feature is in high demand.

>
> Does this
> answer your question, I'm not sure what you mean with leaving? I hope you
> would not want me to go away and leave?
>
No-no, that was a typo :) Sorry for that. I just wanted to figure out
who really needs it.

> > For example, on a 32-bit system the vmalloc space is limited, so such
> > randomization can slow it down and will make allocations fail much more
> > often, thus requiring retries with a different offset.
> 
> I would not use `randomize_vmalloc=1` on a 32 bit systems, because in
> addition to slow down, the address space could become so fragmented that
> large allocations may not fit anymore. Perhaps the documentation should warn
> about this more clearly. I haven't tried this on a 32 bit system though and
> there the VM layout is very different.
> 
For 32-bit systems that would introduce many issues, not limited to
fragmentation.

> __alloc_vm_area() scans the vmalloc space starting from a random address up
> to end of the area. If this fails, the scan is restarted from the bottom of
> the area up to this random address. Thus the entire area is scanned.
> 
> > Second, there is a dedicated space or region for modules. Using various
> > offsets can waste that memory and thus can lead to module loading failures.
> 
> The allocations for modules (or BPF code) are also randomized within their
> dedicated space. I don't think other allocations should affect module space.
> Within this module space, fragmentation may also be possible because there's
> only 1,5GB available. The largest allocation on my system seems to be 11M at
> the moment, others are 1M or below and most are 8k. The possibility of an
> allocation failing probably depends on the fill ratio. In practice haven't
> seen problems with this.
> 
I think it depends on how many modules your system loads. If it is a big
system, such fragmentation and waste of module space may lead to module
loading failures.

> It would be possible to have finer control, for example
> `randomize_vmalloc=3` (1 = general vmalloc, 2 = modules, bitwise ORed) or
> `randomize_vmalloc=general,modules`.
> 
> I experimented by trying to change how the modules are compiled
> (-mcmodel=medium or -mcmodel=large) so that they could be located in the
> normal vmalloc space, but instead I found a bug in the compiler (-mfentry
> produces incorrect code for -mcmodel=large, now fixed).
> 
> > On the other side there is a per-cpu allocator. Interfering with it
> > will also increase the rate of failures.
> 
> I didn't notice the per-cpu allocator before. I'm probably missing
> something, but it seems to be used for a different purpose (for allocating
> the vmap_area structure objects instead of the address space range), so
> where do you see interference?
> 


   A                                        B
 <----                                  ---->
<-------------------------------------------->
|           vmalloc address space            |
<-------------------------------------------->


A - vmalloc allocations;
B - the percpu allocator.

--
Vlad Rezki


Re: [PATCH 2/2] rcu-tasks: add RCU-tasks self tests

2021-02-13 Thread Uladzislau Rezki
On Sat, Feb 13, 2021 at 08:45:54AM -0800, Paul E. McKenney wrote:
> On Sat, Feb 13, 2021 at 12:30:30PM +0100, Uladzislau Rezki wrote:
> > On Fri, Feb 12, 2021 at 04:43:28PM -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 12, 2021 at 04:37:09PM -0800, Paul E. McKenney wrote:
> > > > On Fri, Feb 12, 2021 at 03:48:51PM -0800, Paul E. McKenney wrote:
> > > > > On Fri, Feb 12, 2021 at 10:12:07PM +0100, Uladzislau Rezki wrote:
> > > > > > On Fri, Feb 12, 2021 at 08:20:59PM +0100, Sebastian Andrzej Siewior 
> > > > > > wrote:
> > > > > > > On 2020-12-09 21:27:32 [+0100], Uladzislau Rezki (Sony) wrote:
> > > > > > > > Add self tests for checking of RCU-tasks API functionality.
> > > > > > > > It covers:
> > > > > > > > - wait API functions;
> > > > > > > > - invoking/completion call_rcu_tasks*().
> > > > > > > > 
> > > > > > > > Self-tests are run when CONFIG_PROVE_RCU kernel parameter is 
> > > > > > > > set.
> > > > > > > 
> > > > > > > I just bisected to this commit. By booting with `threadirqs' I 
> > > > > > > end up
> > > > > > > with:
> > > > > > > [0.176533] Running RCU-tasks wait API self tests
> > > > > > > 
> > > > > > > No stall warning or so.
> > > > > > > It boots again with:
> > > > > > > 
> > > > > > > diff --git a/init/main.c b/init/main.c
> > > > > > > --- a/init/main.c
> > > > > > > +++ b/init/main.c
> > > > > > > @@ -1489,6 +1489,7 @@ void __init console_on_rootfs(void)
> > > > > > >   fput(file);
> > > > > > >  }
> > > > > > >  
> > > > > > > +void rcu_tasks_initiate_self_tests(void);
> > > > > > >  static noinline void __init kernel_init_freeable(void)
> > > > > > >  {
> > > > > > >   /*
> > > > > > > @@ -1514,6 +1515,7 @@ static noinline void __init 
> > > > > > > kernel_init_freeable(void)
> > > > > > >  
> > > > > > >   rcu_init_tasks_generic();
> > > > > > >   do_pre_smp_initcalls();
> > > > > > > + rcu_tasks_initiate_self_tests();
> > > > > > >   lockup_detector_init();
> > > > > > >  
> > > > > > >   smp_init();
> > > > > > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > > > > > > --- a/kernel/rcu/tasks.h
> > > > > > > +++ b/kernel/rcu/tasks.h
> > > > > > > @@ -1266,7 +1266,7 @@ static void test_rcu_tasks_callback(struct 
> > > > > > > rcu_head *rhp)
> > > > > > >   rttd->notrun = true;
> > > > > > >  }
> > > > > > >  
> > > > > > > -static void rcu_tasks_initiate_self_tests(void)
> > > > > > > +void rcu_tasks_initiate_self_tests(void)
> > > > > > >  {
> > > > > > >   pr_info("Running RCU-tasks wait API self tests\n");
> > > > > > >  #ifdef CONFIG_TASKS_RCU
> > > > > > > @@ -1322,7 +1322,6 @@ void __init rcu_init_tasks_generic(void)
> > > > > > >  #endif
> > > > > > >  
> > > > > > >   // Run the self-tests.
> > > > > > > - rcu_tasks_initiate_self_tests();
> > > > > > >  }
> > > > > > >  
> > > > > > >  #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
> > > > > > > 
> > > > > > > > Signed-off-by: Uladzislau Rezki (Sony) 
> > > > > 
> > > > > Apologies for the hassle!  My testing clearly missed this combination
> > > > > of CONFIG_PROVE_RCU=y and threadirqs=1.  :-(
> > > > > 
> > > > > But at least I can easily reproduce this hang as follows:
> > > > > 
> > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 
> > > > > --configs "TREE03" --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y 
> > > > > CONFIG_PROVE_LOCKING=y" --bootargs "threadirqs=1" --trust-make
> > > > > 
> > > > > Sadly, I cannot take your pat

Re: [PATCH v2] mm/vmalloc: randomize vmalloc() allocations

2021-02-13 Thread Uladzislau Rezki
> Hello,
> 
> Is there a chance of getting this reviewed and maybe even merged, please?
> 
> -Topi
> 
I can review it and help with it. But before that i would like to
clarify if such "randomization" is something that you can not leave?

For example, on a 32-bit system the vmalloc space is limited, so such
randomization can slow it down and will make allocations fail much more
often, thus requiring retries with a different offset.

Second, there is a dedicated space or region for modules. Using various
offsets can waste that memory and thus can lead to module loading failures.

On the other side there is a per-cpu allocator. Interfering with it
will also increase the rate of failures.

--
Vlad Rezki

> > Memory mappings inside kernel allocated with vmalloc() are in
> > predictable order and packed tightly toward the low addresses. With
> > new kernel boot parameter 'randomize_vmalloc=1', the entire area is
> > used randomly to make the allocations less predictable and harder to
> > guess for attackers. Also module and BPF code locations get randomized
> > (within their dedicated and rather small area though) and if
> > CONFIG_VMAP_STACK is enabled, also kernel thread stack locations.
> > 
> > On 32 bit systems this may cause problems due to increased VM
> > fragmentation if the address space gets crowded.
> > 
> > On all systems, it will reduce performance and increase memory and
> > cache usage due to less efficient use of page tables and inability to
> > merge adjacent VMAs with compatible attributes. On x86_64 with 5 level
> > page tables, in the worst case, additional page table entries of up to
> > 4 pages are created for each mapping, so with small mappings there's
> > considerable penalty.
> > 
> > Without randomize_vmalloc=1:
> > $ cat /proc/vmallocinfo
> > 0xc900-0xc90020008192 acpi_os_map_iomem+0x29e/0x2c0 
> > phys=0x3ffe1000 ioremap
> > 0xc9002000-0xc9005000   12288 acpi_os_map_iomem+0x29e/0x2c0 
> > phys=0x3ffe ioremap
> > 0xc9005000-0xc90070008192 hpet_enable+0x36/0x4a9 
> > phys=0xfed0 ioremap
> > 0xc9007000-0xc90090008192 gen_pool_add_owner+0x49/0x130 
> > pages=1 vmalloc
> > 0xc9009000-0xc900b0008192 gen_pool_add_owner+0x49/0x130 
> > pages=1 vmalloc
> > 0xc900b000-0xc900d0008192 gen_pool_add_owner+0x49/0x130 
> > pages=1 vmalloc
> > 0xc900d000-0xc900f0008192 gen_pool_add_owner+0x49/0x130 
> > pages=1 vmalloc
> > 0xc9011000-0xc9015000   16384 n_tty_open+0x16/0xe0 pages=3 
> > vmalloc
> > 0xc93de000-0xc93e8192 acpi_os_map_iomem+0x29e/0x2c0 
> > phys=0xfed0 ioremap
> > 0xc93e-0xc93e20008192 memremap+0x1a1/0x280 
> > phys=0x000f5000 ioremap
> > 0xc93e2000-0xc93f3000   69632 pcpu_create_chunk+0x80/0x2c0 
> > pages=16 vmalloc
> > 0xc93f3000-0xc9405000   73728 pcpu_create_chunk+0xb7/0x2c0 
> > pages=17 vmalloc
> > 0xc9405000-0xc940a000   20480 pcpu_create_chunk+0xed/0x2c0 
> > pages=4 vmalloc
> > 0xe8c0-0xe8e0 2097152 pcpu_get_vm_areas+0x0/0x1a40 
> > vmalloc
> > 
> > With randomize_vmalloc=1, the allocations are randomized:
> > $ cat /proc/vmallocinfo
> > 0xca3a36442000-0xca3a36447000   20480 pcpu_create_chunk+0xed/0x2c0 
> > pages=4 vmalloc
> > 0xca63034d6000-0xca63034d9000   12288 acpi_os_map_iomem+0x29e/0x2c0 
> > phys=0x3ffe ioremap
> > 0xcce23d32e000-0xcce23d338192 memremap+0x1a1/0x280 
> > phys=0x000f5000 ioremap
> > 0xcfb9f0e22000-0xcfb9f0e240008192 hpet_enable+0x36/0x4a9 
> > phys=0xfed0 ioremap
> > 0xd1df23e9e000-0xd1df23eb   73728 pcpu_create_chunk+0xb7/0x2c0 
> > pages=17 vmalloc
> > 0xd690c299-0xd690c29920008192 acpi_os_map_iomem+0x29e/0x2c0 
> > phys=0x3ffe1000 ioremap
> > 0xd8460c718000-0xd8460c71c000   16384 n_tty_open+0x16/0xe0 pages=3 
> > vmalloc
> > 0xd89aba709000-0xd89aba70b0008192 gen_pool_add_owner+0x49/0x130 
> > pages=1 vmalloc
> > 0xe0ca3f2ed000-0xe0ca3f2ef0008192 acpi_os_map_iomem+0x29e/0x2c0 
> > phys=0xfed0 ioremap
> > 0xe3ba44802000-0xe3ba448040008192 gen_pool_add_owner+0x49/0x130 
> > pages=1 vmalloc
> > 0xe4524b2a2000-0xe4524b2a40008192 gen_pool_add_owner+0x49/0x130 
> > pages=1 vmalloc
> > 0xe61372b2e000-0xe61372b38192 gen_pool_add_owner+0x49/0x130 
> > pages=1 vmalloc
> > 0xe704d2f7c000-0xe704d2f8d000   69632 pcpu_create_chunk+0x80/0x2c0 
> > pages=16 vmalloc
> > 0xe8c0-0xe8e0 2097152 pcpu_get_vm_areas+0x0/0x1a40 
> > vmalloc
> > 
> > With CONFIG_VMAP_STACK, also kernel thread stacks are placed in
> > vmalloc area and therefore they also get randomized (only one example
> > line from /proc/vmallocinfo shown for brevity):
> 

Re: [PATCH 2/2] rcu-tasks: add RCU-tasks self tests

2021-02-13 Thread Uladzislau Rezki
On Fri, Feb 12, 2021 at 04:43:28PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 12, 2021 at 04:37:09PM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 12, 2021 at 03:48:51PM -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 12, 2021 at 10:12:07PM +0100, Uladzislau Rezki wrote:
> > > > On Fri, Feb 12, 2021 at 08:20:59PM +0100, Sebastian Andrzej Siewior 
> > > > wrote:
> > > > > On 2020-12-09 21:27:32 [+0100], Uladzislau Rezki (Sony) wrote:
> > > > > > Add self tests for checking of RCU-tasks API functionality.
> > > > > > It covers:
> > > > > > - wait API functions;
> > > > > > - invoking/completion call_rcu_tasks*().
> > > > > > 
> > > > > > Self-tests are run when CONFIG_PROVE_RCU kernel parameter is set.
> > > > > 
> > > > > I just bisected to this commit. By booting with `threadirqs' I end up
> > > > > with:
> > > > > [0.176533] Running RCU-tasks wait API self tests
> > > > > 
> > > > > No stall warning or so.
> > > > > It boots again with:
> > > > > 
> > > > > diff --git a/init/main.c b/init/main.c
> > > > > --- a/init/main.c
> > > > > +++ b/init/main.c
> > > > > @@ -1489,6 +1489,7 @@ void __init console_on_rootfs(void)
> > > > >   fput(file);
> > > > >  }
> > > > >  
> > > > > +void rcu_tasks_initiate_self_tests(void);
> > > > >  static noinline void __init kernel_init_freeable(void)
> > > > >  {
> > > > >   /*
> > > > > @@ -1514,6 +1515,7 @@ static noinline void __init 
> > > > > kernel_init_freeable(void)
> > > > >  
> > > > >   rcu_init_tasks_generic();
> > > > >   do_pre_smp_initcalls();
> > > > > + rcu_tasks_initiate_self_tests();
> > > > >   lockup_detector_init();
> > > > >  
> > > > >   smp_init();
> > > > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > > > > --- a/kernel/rcu/tasks.h
> > > > > +++ b/kernel/rcu/tasks.h
> > > > > @@ -1266,7 +1266,7 @@ static void test_rcu_tasks_callback(struct 
> > > > > rcu_head *rhp)
> > > > >   rttd->notrun = true;
> > > > >  }
> > > > >  
> > > > > -static void rcu_tasks_initiate_self_tests(void)
> > > > > +void rcu_tasks_initiate_self_tests(void)
> > > > >  {
> > > > >   pr_info("Running RCU-tasks wait API self tests\n");
> > > > >  #ifdef CONFIG_TASKS_RCU
> > > > > @@ -1322,7 +1322,6 @@ void __init rcu_init_tasks_generic(void)
> > > > >  #endif
> > > > >  
> > > > >   // Run the self-tests.
> > > > > - rcu_tasks_initiate_self_tests();
> > > > >  }
> > > > >  
> > > > >  #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
> > > > > 
> > > > > > Signed-off-by: Uladzislau Rezki (Sony) 
> > > 
> > > Apologies for the hassle!  My testing clearly missed this combination
> > > of CONFIG_PROVE_RCU=y and threadirqs=1.  :-(
> > > 
> > > But at least I can easily reproduce this hang as follows:
> > > 
> > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 
> > > --configs "TREE03" --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y 
> > > CONFIG_PROVE_LOCKING=y" --bootargs "threadirqs=1" --trust-make
> > > 
> > > Sadly, I cannot take your patch because that simply papers over the
> > > fact that early boot use of synchronize_rcu_tasks() is broken in this
> > > particular configuration, which will likely eventually bite others now
> > > that init_kprobes() has been moved earlier in boot:
> > > 
> > > 1b04fa990026 ("rcu-tasks: Move RCU-tasks initialization to before 
> > > early_initcall()")
> > > Link: https://lore.kernel.org/rcu/87eekfh80a@dja-thinkpad.axtens.net/
> > > Fixes: 36dadef23fcc ("kprobes: Init kprobes in early_initcall")
> > > 
> > > > > Sebastian
> > > > >
> > > > We should be able to use call_rcu_tasks() in the *initcall() callbacks.
> > > > The problem is that, ksoftirqd threads are not spawned by the time when
> > > > an rcu_init_tasks_generic() is invoked:

Re: [PATCH 2/2] rcu-tasks: add RCU-tasks self tests

2021-02-12 Thread Uladzislau Rezki
On Fri, Feb 12, 2021 at 08:20:59PM +0100, Sebastian Andrzej Siewior wrote:
> On 2020-12-09 21:27:32 [+0100], Uladzislau Rezki (Sony) wrote:
> > Add self tests for checking of RCU-tasks API functionality.
> > It covers:
> > - wait API functions;
> > - invoking/completion call_rcu_tasks*().
> > 
> > Self-tests are run when CONFIG_PROVE_RCU kernel parameter is set.
> 
> I just bisected to this commit. By booting with `threadirqs' I end up
> with:
> [0.176533] Running RCU-tasks wait API self tests
> 
> No stall warning or so.
> It boots again with:
> 
> diff --git a/init/main.c b/init/main.c
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1489,6 +1489,7 @@ void __init console_on_rootfs(void)
>   fput(file);
>  }
>  
> +void rcu_tasks_initiate_self_tests(void);
>  static noinline void __init kernel_init_freeable(void)
>  {
>   /*
> @@ -1514,6 +1515,7 @@ static noinline void __init kernel_init_freeable(void)
>  
>   rcu_init_tasks_generic();
>   do_pre_smp_initcalls();
> + rcu_tasks_initiate_self_tests();
>   lockup_detector_init();
>  
>   smp_init();
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -1266,7 +1266,7 @@ static void test_rcu_tasks_callback(struct rcu_head 
> *rhp)
>   rttd->notrun = true;
>  }
>  
> -static void rcu_tasks_initiate_self_tests(void)
> +void rcu_tasks_initiate_self_tests(void)
>  {
>   pr_info("Running RCU-tasks wait API self tests\n");
>  #ifdef CONFIG_TASKS_RCU
> @@ -1322,7 +1322,6 @@ void __init rcu_init_tasks_generic(void)
>  #endif
>  
>   // Run the self-tests.
> - rcu_tasks_initiate_self_tests();
>  }
>  
>  #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
> 
> > Signed-off-by: Uladzislau Rezki (Sony) 
> 
> Sebastian
>
We should be able to use call_rcu_tasks() in the *initcall() callbacks.
The problem is that, ksoftirqd threads are not spawned by the time when
an rcu_init_tasks_generic() is invoked:

diff --git a/init/main.c b/init/main.c
index c68d784376ca..e6106bb12b2d 100644
--- a/init/main.c
+++ b/init/main.c
@@ -954,7 +954,6 @@ asmlinkage __visible void __init __no_sanitize_address 
start_kernel(void)
rcu_init_nohz();
init_timers();
hrtimers_init();
-   softirq_init();
timekeeping_init();
 
/*
@@ -1512,6 +1511,7 @@ static noinline void __init kernel_init_freeable(void)
 
init_mm_internals();
 
+   softirq_init();
rcu_init_tasks_generic();
do_pre_smp_initcalls();
lockup_detector_init();
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 9d71046ea247..cafa55c496d0 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -630,6 +630,7 @@ void __init softirq_init(void)
			&per_cpu(tasklet_hi_vec, cpu).head;
}
 
+   spawn_ksoftirqd();
open_softirq(TASKLET_SOFTIRQ, tasklet_action);
open_softirq(HI_SOFTIRQ, tasklet_hi_action);
 }
@@ -732,7 +733,6 @@ static __init int spawn_ksoftirqd(void)
 
return 0;
 }
-early_initcall(spawn_ksoftirqd);
 
 /*
  * [ These __weak aliases are kept in a separate compilation unit, so that

Any thoughts?

--
Vlad Rezki


Re: [PATCH v5] kvfree_rcu: Release page cache under memory pressure

2021-02-12 Thread Uladzislau Rezki
> From: Zqiang 
> 
> Add free per-cpu existing krcp's page cache operation in shrink callback
> function, and also during shrink period, simple delay schedule fill page
> work, to avoid refill page while free krcp page cache.
> 
> Signed-off-by: Zqiang 
> Co-developed-by: Uladzislau Rezki (Sony) 
> ---
>  v1->v4:
>  During the test a page shrinker is pretty active because of the low memory
>  condition. The callback drains the cache whereas the kvfree_rcu() part
>  refills it right away, creating a kind of vicious circle.
>  Following Vlad Rezki's suggestion, to avoid this, schedule a periodic
>  delayed work with HZ, which is easy to do.
>  v4->v5:
>  change commit message and use xchg replace WRITE_ONCE()
> 
>  kernel/rcu/tree.c | 49 +++
>  1 file changed, 41 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index c1ae1e52f638..f1fba23f5036 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3139,7 +3139,7 @@ struct kfree_rcu_cpu {
>   bool initialized;
>   int count;
>  
> - struct work_struct page_cache_work;
> + struct delayed_work page_cache_work;
>   atomic_t work_in_progress;
>   struct hrtimer hrtimer;
>  
> @@ -3395,7 +3395,7 @@ schedule_page_work_fn(struct hrtimer *t)
>   struct kfree_rcu_cpu *krcp =
>   container_of(t, struct kfree_rcu_cpu, hrtimer);
>  
> - queue_work(system_highpri_wq, &krcp->page_cache_work);
> + queue_delayed_work(system_highpri_wq, &krcp->page_cache_work, 0);
>   return HRTIMER_NORESTART;
>  }
>  
> @@ -3404,7 +3404,7 @@ static void fill_page_cache_func(struct work_struct 
> *work)
>   struct kvfree_rcu_bulk_data *bnode;
>   struct kfree_rcu_cpu *krcp =
>   container_of(work, struct kfree_rcu_cpu,
> - page_cache_work);
> + page_cache_work.work);
>   unsigned long flags;
>   bool pushed;
>   int i;
> @@ -3428,15 +3428,21 @@ static void fill_page_cache_func(struct work_struct 
> *work)
>   atomic_set(&krcp->work_in_progress, 0);
>  }
>  
> +static atomic_t backoff_page_cache_fill = ATOMIC_INIT(0);
> +
Should we initialize a static atomic_t? It is zero by default.
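I.e. a plain definition should already be enough (just a sketch):

	static atomic_t backoff_page_cache_fill;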

>  static void
>  run_page_cache_worker(struct kfree_rcu_cpu *krcp)
>  {
>   if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
>   !atomic_xchg(&krcp->work_in_progress, 1)) {
> - hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC,
> - HRTIMER_MODE_REL);
> - krcp->hrtimer.function = schedule_page_work_fn;
> - hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
> + if (atomic_xchg(&backoff_page_cache_fill, 0)) {
> + queue_delayed_work(system_highpri_wq, &krcp->page_cache_work, HZ);
system_wq? It is not so critical; anyway, the job is rearmed with a 1-second
interval.
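I.e. something like this should be enough (sketch only):

	queue_delayed_work(system_wq, &krcp->page_cache_work, HZ);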

> + } else {
> + hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC,
> + HRTIMER_MODE_REL);
> + krcp->hrtimer.function = schedule_page_work_fn;
> + hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
> + }
>   }
>  }
>  
> @@ -3571,19 +3577,44 @@ void kvfree_call_rcu(struct rcu_head *head, 
> rcu_callback_t func)
>  }
>  EXPORT_SYMBOL_GPL(kvfree_call_rcu);
>  
> +static int free_krc_page_cache(struct kfree_rcu_cpu *krcp)
> +{
> + unsigned long flags;
> + struct llist_node *page_list, *pos, *n;
> + int freed = 0;
> +
> + raw_spin_lock_irqsave(&krcp->lock, flags);
> + page_list = llist_del_all(&krcp->bkvcache);
> + krcp->nr_bkv_objs = 0;
> + raw_spin_unlock_irqrestore(&krcp->lock, flags);
> +
> + llist_for_each_safe(pos, n, page_list) {
> + free_page((unsigned long)pos);
> + freed++;
> + }
> +
> + return freed;
> +}
> +
>  static unsigned long
>  kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>  {
>   int cpu;
>   unsigned long count = 0;
> + unsigned long flags;
>  
>   /* Snapshot count of all CPUs */
>   for_each_possible_cpu(cpu) {
>   struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
>  
>   count += READ_ONCE(krcp->count);
> +
> + raw_spin_lock_irqsave(&krcp->lock, flags);
> + count += krcp->nr_bkv_objs;
> + raw_spin_unlock_irqrestore(&krcp->lock, flags);
>   }
>  
> + atomic_set(&backoff_page_cache_fill, 1);
>   return count;
>  }
>  
> @@ -3598,6 +3629,8 @@ kfree_

[tip: core/rcu] rcu-tasks: Add RCU-tasks self tests

2021-02-12 Thread tip-bot2 for Uladzislau Rezki (Sony)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: bfba7ed084f8ab0269a5a1d2f51b07865456c334
Gitweb:
https://git.kernel.org/tip/bfba7ed084f8ab0269a5a1d2f51b07865456c334
Author:Uladzislau Rezki (Sony) 
AuthorDate:Wed, 09 Dec 2020 21:27:32 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 04 Jan 2021 15:54:49 -08:00

rcu-tasks: Add RCU-tasks self tests

This commit adds self tests for early-boot use of RCU-tasks grace periods.
It tests all three variants (Rude, Tasks, and Tasks Trace) and covers
both synchronous (e.g., synchronize_rcu_tasks()) and asynchronous (e.g.,
call_rcu_tasks()) grace-period APIs.

Self-tests are run only in kernels built with CONFIG_PROVE_RCU=y.

Signed-off-by: Uladzislau Rezki (Sony) 
[ paulmck: Handle CONFIG_PROVE_RCU=n and identify test cases' callbacks. ]
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tasks.h | 79 +-
 1 file changed, 79 insertions(+)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 73bbe79..74767d3 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1231,6 +1231,82 @@ void show_rcu_tasks_gp_kthreads(void)
 }
 #endif /* #ifndef CONFIG_TINY_RCU */
 
+#ifdef CONFIG_PROVE_RCU
+struct rcu_tasks_test_desc {
+   struct rcu_head rh;
+   const char *name;
+   bool notrun;
+};
+
+static struct rcu_tasks_test_desc tests[] = {
+   {
+   .name = "call_rcu_tasks()",
+   /* If not defined, the test is skipped. */
+   .notrun = !IS_ENABLED(CONFIG_TASKS_RCU),
+   },
+   {
+   .name = "call_rcu_tasks_rude()",
+   /* If not defined, the test is skipped. */
+   .notrun = !IS_ENABLED(CONFIG_TASKS_RUDE_RCU),
+   },
+   {
+   .name = "call_rcu_tasks_trace()",
+   /* If not defined, the test is skipped. */
+   .notrun = !IS_ENABLED(CONFIG_TASKS_TRACE_RCU)
+   }
+};
+
+static void test_rcu_tasks_callback(struct rcu_head *rhp)
+{
+   struct rcu_tasks_test_desc *rttd =
+   container_of(rhp, struct rcu_tasks_test_desc, rh);
+
+   pr_info("Callback from %s invoked.\n", rttd->name);
+
+   rttd->notrun = true;
+}
+
+static void rcu_tasks_initiate_self_tests(void)
+{
+   pr_info("Running RCU-tasks wait API self tests\n");
+#ifdef CONFIG_TASKS_RCU
+   synchronize_rcu_tasks();
+   call_rcu_tasks(&tests[0].rh, test_rcu_tasks_callback);
+#endif
+
+#ifdef CONFIG_TASKS_RUDE_RCU
+   synchronize_rcu_tasks_rude();
+   call_rcu_tasks_rude(&tests[1].rh, test_rcu_tasks_callback);
+#endif
+
+#ifdef CONFIG_TASKS_TRACE_RCU
+   synchronize_rcu_tasks_trace();
+   call_rcu_tasks_trace(&tests[2].rh, test_rcu_tasks_callback);
+#endif
+}
+
+static int rcu_tasks_verify_self_tests(void)
+{
+   int ret = 0;
+   int i;
+
+   for (i = 0; i < ARRAY_SIZE(tests); i++) {
+   if (!tests[i].notrun) { // still hanging.
+   pr_err("%s has been failed.\n", tests[i].name);
+   ret = -1;
+   }
+   }
+
+   if (ret)
+   WARN_ON(1);
+
+   return ret;
+}
+late_initcall(rcu_tasks_verify_self_tests);
+#else /* #ifdef CONFIG_PROVE_RCU */
+static void rcu_tasks_initiate_self_tests(void) { }
+#endif /* #else #ifdef CONFIG_PROVE_RCU */
+
 void __init rcu_init_tasks_generic(void)
 {
 #ifdef CONFIG_TASKS_RCU
@@ -1244,6 +1320,9 @@ void __init rcu_init_tasks_generic(void)
 #ifdef CONFIG_TASKS_TRACE_RCU
rcu_spawn_tasks_trace_kthread();
 #endif
+
+   // Run the self-tests.
+   rcu_tasks_initiate_self_tests();
 }
 
 #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */


Re: [PATCH v2] mm/vmalloc: use rb_tree instead of list for vread() lookups

2021-02-09 Thread Uladzislau Rezki
> vread() has been linearly searching vmap_area_list for looking up
> vmalloc areas to read from. These same areas are also tracked by
> a rb_tree (vmap_area_root) which offers logarithmic lookup.
> 
> This patch modifies vread() to use the rb_tree structure instead
> of the list and the speedup for heavy /proc/kcore readers can
> be pretty significant. Below are the wall clock measurements of
> a Python application that leverages the drgn debugging library
> to read and interpret data read from /proc/kcore.
> 
> Before the patch:
> -
> $ time sudo sdb -e 'dbuf | head 3000 | wc'
> (unsigned long)3000
> 
> real  0m22.446s
> user  0m2.321s
> sys   0m20.690s
> -
> 
> With the patch:
> -
> $ time sudo sdb -e 'dbuf | head 3000 | wc'
> (unsigned long)3000
> 
> real  0m2.104s
> user  0m2.043s
> sys   0m0.921s
> -
> 
> Signed-off-by: Serapheim Dimitropoulos 
> ---
> Changed in v2:
> 
> - Use __find_vmap_area() for initial lookup but keep iteration via
>   va->list.
> 
>  mm/vmalloc.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 49ab9b6c001d..eb133d000394 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2860,7 +2860,10 @@ long vread(char *buf, char *addr, unsigned long count)
>   count = -(unsigned long) addr;
>  
>   spin_lock(&vmap_area_lock);
> - list_for_each_entry(va, &vmap_area_list, list) {
> + va = __find_vmap_area((unsigned long)addr);
> + if (!va)
> + goto finished;
> + list_for_each_entry_from(va, &vmap_area_list, list) {
>   if (!count)
>   break;
>  
> -- 
> 2.17.1
> 
Much better :)

Reviewed-by: Uladzislau Rezki (Sony) 

--
Vlad Rezki


Re: [PATCH 1/2] rcuscale: add kfree_rcu() single-argument scale test

2021-02-09 Thread Uladzislau Rezki
On Thu, Feb 04, 2021 at 01:46:48PM -0800, Paul E. McKenney wrote:
> On Fri, Jan 29, 2021 at 09:05:04PM +0100, Uladzislau Rezki (Sony) wrote:
> > To stress and test the single-argument kfree_rcu() call, we
> > should have special coverage for it. We used to have it in the
> > test suite related to vmalloc stressing. The reason is that
> > rcuscale is the correct place for RCU-related things.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) 
> 
> This is a great addition, but it would be even better if there was
> a way to say "test both in one run".  One way to do this is to have
> torture_param() variables for both kfree_rcu_test_single and (say)
> kfree_rcu_test_double, both bool and both initialized to false.  If both
> have the same value (false or true) both are tested, otherwise only
> the one with value true is tested.  The value of this is that it allows
> testing of both options with one test.
> 
Make sense to me :)
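
I.e. the selection rule boils down to something like this (rough sketch,
not the actual code):

	bool test_single = (kfree_rcu_test_single == kfree_rcu_test_double) || kfree_rcu_test_single;
	bool test_double = (kfree_rcu_test_single == kfree_rcu_test_double) || kfree_rcu_test_double;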

>From ba083a543a123455455c81230b7b5a9aa2a9cb7f Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)" 
Date: Fri, 29 Jan 2021 19:51:27 +0100
Subject: [PATCH v2 1/1] rcuscale: add kfree_rcu() single-argument scale test

To stress and test the single-argument kfree_rcu() call, we
should have special coverage for it. We used to have it in the
test suite related to vmalloc stressing. The reason is that
rcuscale is the correct place for RCU-related things.

Therefore introduce two torture_param() variables, one for the
single-argument scale test and another for the double-argument
scale test.

By default kfree_rcu_test_single and kfree_rcu_test_double are
initialized to false. If both have the same value (false or true)
both are tested in one run, otherwise only the one with value
true is tested. The value of this is that it allows testing of
both options with one test.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/rcuscale.c | 33 -
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index 06491d5530db..0cde5c17f06c 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -625,6 +625,8 @@ rcu_scale_shutdown(void *arg)
 torture_param(int, kfree_nthreads, -1, "Number of threads running loops of kfree_rcu().");
 torture_param(int, kfree_alloc_num, 8000, "Number of allocations and frees done in an iteration.");
 torture_param(int, kfree_loops, 10, "Number of loops doing kfree_alloc_num allocations and frees.");
+torture_param(int, kfree_rcu_test_single, 0, "Do we run a kfree_rcu() single-argument scale test?");
+torture_param(int, kfree_rcu_test_double, 0, "Do we run a kfree_rcu() double-argument scale test?");
 
 static struct task_struct **kfree_reader_tasks;
 static int kfree_nrealthreads;
@@ -641,7 +643,7 @@ kfree_scale_thread(void *arg)
 {
int i, loop = 0;
long me = (long)arg;
-   struct kfree_obj *alloc_ptr;
+   struct kfree_obj *alloc_ptr[2];
u64 start_time, end_time;
long long mem_begin, mem_during = 0;
 
@@ -665,12 +667,33 @@ kfree_scale_thread(void *arg)
mem_during = (mem_during + si_mem_available()) / 2;
}
 
+   // By default kfree_rcu_test_single and kfree_rcu_test_double are
+   // initialized to false. If both have the same value (false or true)
+   // both are tested in one run, otherwise only the one with value
+   // true is tested.
for (i = 0; i < kfree_alloc_num; i++) {
-   alloc_ptr = kmalloc(kfree_mult * sizeof(struct kfree_obj), GFP_KERNEL);
-   if (!alloc_ptr)
-   return -ENOMEM;
+   alloc_ptr[0] = kmalloc(kfree_mult * sizeof(struct kfree_obj), GFP_KERNEL);
+   alloc_ptr[1] = (kfree_rcu_test_single == kfree_rcu_test_double) ?
+   kmalloc(kfree_mult * sizeof(struct kfree_obj), GFP_KERNEL) : NULL;
+
+   // 0 ptr. is freed either over single or double argument.
+   if (alloc_ptr[0]) {
+   if (kfree_rcu_test_single == kfree_rcu_test_double ||
+   kfree_rcu_test_single) {
+   kfree_rcu(alloc_ptr[0]);
+   } else {
+   kfree_rcu(alloc_ptr[0], rh);
+   }
+   }
+
+   // 1 ptr. is always freed over double argument.
+   if (alloc_ptr[1])
+   kfree_rcu(alloc_ptr[1], rh);
 
-   kfree_rcu(alloc_ptr, rh);
+   if (!alloc_ptr[0] ||
+   (kfree_rcu_t

Re: [PATCH] mm/vmalloc: use rb_tree instead of list for vread() lookups

2021-02-08 Thread Uladzislau Rezki
On Mon, Feb 08, 2021 at 03:53:03PM +, Serapheim Dimitropoulos wrote:
> vread() has been linearly searching vmap_area_list for looking up
> vmalloc areas to read from. These same areas are also tracked by
> a rb_tree (vmap_area_root) which offers logarithmic lookup.
> 
> This patch modifies vread() to use the rb_tree structure instead
> of the list and the speedup for heavy /proc/kcore readers can
> be pretty significant. Below are the wall clock measurements of
> a Python application that leverages the drgn debugging library
> to read and interpret data read from /proc/kcore.
> 
> Before the patch:
> -
> $ time sudo sdb -e 'dbuf | head 2500 | wc'
> (unsigned long)2500
> 
> real  0m21.128s
> user  0m2.321s
> sys   0m19.227s
> -
> 
> With the patch:
> -
> $ time sudo sdb -e 'dbuf | head 2500 | wc'
> (unsigned long)2500
> 
> real  0m1.870s
> user  0m1.628s
> sys   0m0.660s
> -
> 
> Signed-off-by: Serapheim Dimitropoulos 
> ---
>  mm/vmalloc.c | 19 ---
>  1 file changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 49ab9b6c001d..86343b879938 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2851,6 +2851,7 @@ long vread(char *buf, char *addr, unsigned long count)
>  {
>   struct vmap_area *va;
>   struct vm_struct *vm;
> + struct rb_node *node;
>   char *vaddr, *buf_start = buf;
>   unsigned long buflen = count;
>   unsigned long n;
> @@ -2860,17 +2861,15 @@ long vread(char *buf, char *addr, unsigned long count)
>   count = -(unsigned long) addr;
>  
>   spin_lock(&vmap_area_lock);
> - list_for_each_entry(va, &vmap_area_list, list) {
> - if (!count)
> - break;
> -
> + va = __find_vmap_area((unsigned long)addr);
> + if (!va)
> + goto finished;
> + while (count) {
>   if (!va->vm)
> - continue;
> + goto next_node;
>  
>   vm = va->vm;
>   vaddr = (char *) vm->addr;
> - if (addr >= vaddr + get_vm_area_size(vm))
> - continue;
>   while (addr < vaddr) {
>   if (count == 0)
>   goto finished;
> @@ -2889,6 +2888,12 @@ long vread(char *buf, char *addr, unsigned long count)
>   buf += n;
>   addr += n;
>   count -= n;
> +
> +next_node:
> + node = rb_next(&va->rb_node);
> + if (!node)
> + break;
> + va = rb_entry(node, struct vmap_area, rb_node);
>
You can also improve it. Instead of rb_next() you can directly access the
"next" element via "va->list", making it an O(1) step.

--
Vlad Rezki


Re: [PATCH v4] kvfree_rcu: Release page cache under memory pressure

2021-02-08 Thread Uladzislau Rezki
Hello, Zqiang.

Thank you for your v4!

Some small nits see below:

> From: Zqiang 
> 
> Add free per-cpu existing krcp's page cache operation, when
> the system is under memory pressure.
> 
> Signed-off-by: Zqiang 
> Co-developed-by: Uladzislau Rezki (Sony) 
> ---
>  v1->v2->v3->v4:
>  During the test a page shrinker is pretty active, because of low memory
>  condition. callback drains it whereas kvfree_rcu() part refill it right
>  away making kind of vicious circle.
>  Through Vlad Rezki suggestion, to avoid this, schedule a periodic delayed
>  work with HZ, and it's easy to do that.
> 
I think the commit message should be improved. Please add a clear
description of how it works, i.e. its connection with the shrinker, etc.

>  kernel/rcu/tree.c | 50 +++
>  1 file changed, 42 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index c1ae1e52f638..f3b772eef468 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3139,7 +3139,7 @@ struct kfree_rcu_cpu {
>   bool initialized;
>   int count;
>  
> - struct work_struct page_cache_work;
> + struct delayed_work page_cache_work;
>   atomic_t work_in_progress;
>   struct hrtimer hrtimer;
>  
> @@ -3395,7 +3395,7 @@ schedule_page_work_fn(struct hrtimer *t)
>   struct kfree_rcu_cpu *krcp =
>   container_of(t, struct kfree_rcu_cpu, hrtimer);
>  
> - queue_work(system_highpri_wq, &krcp->page_cache_work);
> + queue_delayed_work(system_highpri_wq, &krcp->page_cache_work, 0);
>   return HRTIMER_NORESTART;
>  }
>  
> @@ -3404,7 +3404,7 @@ static void fill_page_cache_func(struct work_struct 
> *work)
>   struct kvfree_rcu_bulk_data *bnode;
>   struct kfree_rcu_cpu *krcp =
>   container_of(work, struct kfree_rcu_cpu,
> - page_cache_work);
> + page_cache_work.work);
>   unsigned long flags;
>   bool pushed;
>   int i;
> @@ -3428,15 +3428,22 @@ static void fill_page_cache_func(struct work_struct 
> *work)
>   atomic_set(&krcp->work_in_progress, 0);
>  }
>  
> +static bool backoff_page_cache_fill;
> +
>  static void
>  run_page_cache_worker(struct kfree_rcu_cpu *krcp)
>  {
>   if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
>   !atomic_xchg(&krcp->work_in_progress, 1)) {
> - hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC,
> - HRTIMER_MODE_REL);
> - krcp->hrtimer.function = schedule_page_work_fn;
> - hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
> + if (READ_ONCE(backoff_page_cache_fill)) {
Can we just use xchg() directly inside the "if" statement? That way we can
get rid of the WRITE_ONCE() below. This is not considered a "hot"
path, so it should not be an issue.
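I.e. something along these lines (untested sketch, assuming xchg() is fine
on the bool here):

	if (xchg(&backoff_page_cache_fill, false))
		queue_delayed_work(system_highpri_wq,
				   &krcp->page_cache_work, HZ);
	else
		/* arm the hrtimer as before */
		hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);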

> + queue_delayed_work(system_highpri_wq, &krcp->page_cache_work, HZ);
> + WRITE_ONCE(backoff_page_cache_fill, false);
> + } else {
> + hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC,
> + HRTIMER_MODE_REL);
> + krcp->hrtimer.function = schedule_page_work_fn;
> + hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
> + }
>   }
>  }

Thank you!

--
Vlad Rezki


Re: [PATCH 2/2] kvfree_rcu: Use same set of flags as for single-argument

2021-02-08 Thread Uladzislau Rezki
On Thu, Feb 04, 2021 at 02:04:27PM -0800, Paul E. McKenney wrote:
> On Fri, Jan 29, 2021 at 09:05:05PM +0100, Uladzislau Rezki (Sony) wrote:
> > Running an rcuscale stress-suite can lead to "Out of memory"
> > of a system. This can happen under high memory pressure with
> > a small amount of physical memory.
> > 
> > For example a KVM test configuration with 64 CPUs and 512 megabytes
> > can lead to out of memory after running rcuscale with the below parameters:
> > 
> > ../kvm.sh --torture rcuscale --allcpus --duration 10 --kconfig 
> > CONFIG_NR_CPUS=64 \
> > --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 
> > rcuscale.holdoff=20 \
> >   rcuscale.kfree_loops=1 torture.disable_onoff_at_boot" --trust-make
> > 
> > 
> > [   12.054448] kworker/1:1H invoked oom-killer: 
> > gfp_mask=0x2cc0(GFP_KERNEL|__GFP_NOWARN), order=0, oom_score_adj=0
> > [   12.055303] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ 
> > #510
> > [   12.055416] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> > 1.12.0-1 04/01/2014
> > [   12.056485] Workqueue: events_highpri fill_page_cache_func
> > [   12.056485] Call Trace:
> > [   12.056485]  dump_stack+0x57/0x6a
> > [   12.056485]  dump_header+0x4c/0x30a
> > [   12.056485]  ? del_timer_sync+0x20/0x30
> > [   12.056485]  out_of_memory.cold.47+0xa/0x7e
> > [   12.056485]  __alloc_pages_slowpath.constprop.123+0x82f/0xc00
> > [   12.056485]  __alloc_pages_nodemask+0x289/0x2c0
> > [   12.056485]  __get_free_pages+0x8/0x30
> > [   12.056485]  fill_page_cache_func+0x39/0xb0
> > [   12.056485]  process_one_work+0x1ed/0x3b0
> > [   12.056485]  ? process_one_work+0x3b0/0x3b0
> > [   12.060485]  worker_thread+0x28/0x3c0
> > [   12.060485]  ? process_one_work+0x3b0/0x3b0
> > [   12.060485]  kthread+0x138/0x160
> > [   12.060485]  ? kthread_park+0x80/0x80
> > [   12.060485]  ret_from_fork+0x22/0x30
> > [   12.062156] Mem-Info:
> > [   12.062350] active_anon:0 inactive_anon:0 isolated_anon:0
> > [   12.062350]  active_file:0 inactive_file:0 isolated_file:0
> > [   12.062350]  unevictable:0 dirty:0 writeback:0
> > [   12.062350]  slab_reclaimable:2797 slab_unreclaimable:80920
> > [   12.062350]  mapped:1 shmem:2 pagetables:8 bounce:0
> > [   12.062350]  free:10488 free_pcp:1227 free_cma:0
> > ...
> > [   12.101610] Out of memory and no killable processes...
> > [   12.102042] Kernel panic - not syncing: System is deadlocked on memory
> > [   12.102583] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ 
> > #510
> > [   12.102600] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> > 1.12.0-1 04/01/2014
> > 
> > 
> > Having a fallback mechanism we should not go with "GFP_KERNEL | 
> > __GFP_NOWARN"
> > that implies a "hard" page request involving OOM killer. Replace such set 
> > with
> > the same as the one used for a single argument.
> > 
> > Thus it will follow the same rules:
> > a) minimize fallback hitting;
> > b) avoid invoking the OOM killer;
> > c) do a lightweight page request;
> > d) avoid dipping into the emergency reserves.
> > 
> > With this change an rcuscale and the parameters which are in question
> > never runs into "Kernel panic".
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) 
> 
> I did have some misgivings about this one, but after a closer look at
> the GFP flags you suggest along with offlist discussions it looks like
> what needs to happen.  So thank you for persisting!  ;-)
> 
> I did the usual wordsmithing as shown below, so please check to make
> sure that I did not mess anything up.
> 
Looks good to me, I mean the rewording of the commit message :)

--
Vlad Rezki


Re: [PATCH] mm/list_lru.c: remove kvfree_rcu_local()

2021-02-08 Thread Uladzislau Rezki
> The list_lru file used to have local kvfree_rcu() which was renamed by
> commit e0feed08ab41 ("mm/list_lru.c: Rename kvfree_rcu() to local
> variant") to introduce the globally visible kvfree_rcu(). Now we have
> global kvfree_rcu(), so remove the local kvfree_rcu_local() and just
> use the global one.
> 
> Signed-off-by: Shakeel Butt 
> ---
>  mm/list_lru.c | 12 ++--
>  1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index fe230081690b..6f067b6b935f 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -373,21 +373,13 @@ static void memcg_destroy_list_lru_node(struct 
> list_lru_node *nlru)
>   struct list_lru_memcg *memcg_lrus;
>   /*
>* This is called when shrinker has already been unregistered,
> -  * and nobody can use it. So, there is no need to use 
> kvfree_rcu_local().
> +  * and nobody can use it. So, there is no need to use kvfree_rcu().
>*/
>   memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
>   __memcg_destroy_list_lru_node(memcg_lrus, 0, memcg_nr_cache_ids);
>   kvfree(memcg_lrus);
>  }
>  
> -static void kvfree_rcu_local(struct rcu_head *head)
> -{
> - struct list_lru_memcg *mlru;
> -
> - mlru = container_of(head, struct list_lru_memcg, rcu);
> - kvfree(mlru);
> -}
> -
>  static int memcg_update_list_lru_node(struct list_lru_node *nlru,
> int old_size, int new_size)
>  {
> @@ -419,7 +411,7 @@ static int memcg_update_list_lru_node(struct 
> list_lru_node *nlru,
>   rcu_assign_pointer(nlru->memcg_lrus, new);
>   spin_unlock_irq(&nlru->lock);
>  
> - call_rcu(&old->rcu, kvfree_rcu_local);
> + kvfree_rcu(old, rcu);
>   return 0;
>  }
>  
> -- 
> 2.30.0.478.g8a0d178c01-goog
>
Reviewed-by: Uladzislau Rezki 

--
Vlad Rezki


Re: [PATCH 1/2] rcuscale: add kfree_rcu() single-argument scale test

2021-02-04 Thread Uladzislau Rezki
Hello, Paul.

> To stress and test the single-argument kfree_rcu() call, we
> should have special coverage for it. We used to have it in the
> test suite related to vmalloc stressing. The reason is that
> rcuscale is the correct place for RCU-related things.
> 
> Signed-off-by: Uladzislau Rezki (Sony) 
> ---
>  kernel/rcu/rcuscale.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
> index 06491d5530db..e17745a155f9 100644
> --- a/kernel/rcu/rcuscale.c
> +++ b/kernel/rcu/rcuscale.c
> @@ -94,6 +94,7 @@ torture_param(bool, shutdown, RCUSCALE_SHUTDOWN,
>  torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
>  torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to 
> disable");
>  torture_param(int, kfree_rcu_test, 0, "Do we run a kfree_rcu() scale test?");
> +torture_param(int, kfree_rcu_test_single, 0, "Do we run a kfree_rcu() 
> single-argument scale test?");
>  torture_param(int, kfree_mult, 1, "Multiple of kfree_obj size to allocate.");
>  
>  static char *scale_type = "rcu";
> @@ -667,10 +668,14 @@ kfree_scale_thread(void *arg)
>  
>   for (i = 0; i < kfree_alloc_num; i++) {
>   alloc_ptr = kmalloc(kfree_mult * sizeof(struct 
> kfree_obj), GFP_KERNEL);
> +
>   if (!alloc_ptr)
>   return -ENOMEM;
>  
> - kfree_rcu(alloc_ptr, rh);
> + if (kfree_rcu_test_single)
> + kfree_rcu(alloc_ptr);
> + else
> + kfree_rcu(alloc_ptr, rh);
>   }
>  
>   cond_resched();
> -- 
> 2.20.1
>
What about this change? Do you have any concerns or comments?

--
Vlad Rezki


Re: Re: [PATCH v3] kvfree_rcu: Release page cache under memory pressure

2021-02-04 Thread Uladzislau Rezki
> From: Uladzislau Rezki 
> Sent: February 2, 2021 3:57
> To: Zhang, Qiang
> Cc: ure...@gmail.com; paul...@kernel.org; j...@joelfernandes.org; 
> r...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3] kvfree_rcu: Release page cache under memory pressure
> 
> [Please note: This e-mail is from an EXTERNAL e-mail address]
> 
> Hello, Zqiang.
> 
> > From: Zqiang 
> >
> > Add free per-cpu existing krcp's page cache operation, when
> > the system is under memory pressure.
> >
> > Signed-off-by: Zqiang 
> > ---
> >  kernel/rcu/tree.c | 26 ++
> >  1 file changed, 26 insertions(+)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index c1ae1e52f638..644b0f3c7b9f 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3571,17 +3571,41 @@ void kvfree_call_rcu(struct rcu_head *head, 
> > rcu_callback_t func)
> >  }
> >  EXPORT_SYMBOL_GPL(kvfree_call_rcu);
> >
> > +static int free_krc_page_cache(struct kfree_rcu_cpu *krcp)
> > +{
> > + unsigned long flags;
> > + struct llist_node *page_list, *pos, *n;
> > + int freed = 0;
> > +
> > + raw_spin_lock_irqsave(>lock, flags);
> > + page_list = llist_del_all(>bkvcache);
> > + krcp->nr_bkv_objs = 0;
> > + raw_spin_unlock_irqrestore(>lock, flags);
> > +
> > + llist_for_each_safe(pos, n, page_list) {
> > + free_page((unsigned long)pos);
> > + freed++;
> > + }
> > +
> > + return freed;
> > +}
> > +
> >  static unsigned long
> >  kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> >  {
> >   int cpu;
> >   unsigned long count = 0;
> > + unsigned long flags;
> >
> >   /* Snapshot count of all CPUs */
> >   for_each_possible_cpu(cpu) {
> >   struct kfree_rcu_cpu *krcp = per_cpu_ptr(, cpu);
> >
> >   count += READ_ONCE(krcp->count);
> > +
> > + raw_spin_lock_irqsave(>lock, flags);
> > + count += krcp->nr_bkv_objs;
> > + raw_spin_unlock_irqrestore(>lock, flags);
> >   }
> >
> >   return count;
> > @@ -3598,6 +3622,8 @@ kfree_rcu_shrink_scan(struct shrinker *shrink, struct 
> > shrink_control *sc)
> >   struct kfree_rcu_cpu *krcp = per_cpu_ptr(, cpu);
> >
> >   count = krcp->count;
> > + count += free_krc_page_cache(krcp);
> > +
> >   raw_spin_lock_irqsave(>lock, flags);
> >   if (krcp->monitor_todo)
> >   kfree_rcu_drain_unlock(krcp, flags);
> > --
> > 2.17.1
> >>
> >Thank you for your patch!
> >
> >I spent some time to see how the patch behaves under low memory condition.
> >To simulate it, i used "rcuscale" tool with below parameters:
> >
> >../rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 10 
> >--kconfig >CONFIG_NR_CPUS=64 \
> >--bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 
> >>rcuscale.holdoff=20 rcuscale.kfree_loops=1 \
> >torture.disable_onoff_at_boot" --trust-make
> >
> >64 CPUs + 512 MB of memory. In general, my test system was running on edge
> >hitting an out of memory sometimes, but could be considered as stable in
> >regards to a test completion and taken time, so both were pretty solid.
> >
> >You can find a comparison on a plot, that can be downloaded following
> >a link: wget 
> >>ftp://vps418301.ovh.net/incoming/release_page_cache_under_low_memory.png
> >
> >In short, i see that a patched version can lead to longer test completion,
> >whereas the default variant is stable on almost all runs. After some analysis
> >and further digging i came to conclusion that a shrinker 
> >free_krc_page_cache()
> >concurs with run_page_cache_worker(krcp) running from kvfree_rcu() context.
> >
> >i.e. During the test a page shrinker is pretty active, because of low memory
> >condition. Our callback drains it whereas kvfree_rcu() part refill it right
> >away making kind of vicious circle.
> >
> >So, a run_page_cache_worker() should be backoff for some time when a system
> >runs into a low memory condition or high pressure:
> >
> >diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >index 7077d73fcb53..446723b9646b 100644
> >--- a/kernel/rcu/tree.c
> >+++ b/kernel/rcu/tree.c
> >@@ -3163,7 +3163,7 @

Re: [PATCH v3] kvfree_rcu: Release page cache under memory pressure

2021-02-01 Thread Uladzislau Rezki
Hello, Zqiang.

> From: Zqiang 
> 
> Add free per-cpu existing krcp's page cache operation, when
> the system is under memory pressure.
> 
> Signed-off-by: Zqiang 
> ---
>  kernel/rcu/tree.c | 26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index c1ae1e52f638..644b0f3c7b9f 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3571,17 +3571,41 @@ void kvfree_call_rcu(struct rcu_head *head, 
> rcu_callback_t func)
>  }
>  EXPORT_SYMBOL_GPL(kvfree_call_rcu);
>  
> +static int free_krc_page_cache(struct kfree_rcu_cpu *krcp)
> +{
> + unsigned long flags;
> + struct llist_node *page_list, *pos, *n;
> + int freed = 0;
> +
> + raw_spin_lock_irqsave(&krcp->lock, flags);
> + page_list = llist_del_all(&krcp->bkvcache);
> + krcp->nr_bkv_objs = 0;
> + raw_spin_unlock_irqrestore(&krcp->lock, flags);
> +
> + llist_for_each_safe(pos, n, page_list) {
> + free_page((unsigned long)pos);
> + freed++;
> + }
> +
> + return freed;
> +}
> +
>  static unsigned long
>  kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>  {
>   int cpu;
>   unsigned long count = 0;
> + unsigned long flags;
>  
>   /* Snapshot count of all CPUs */
>   for_each_possible_cpu(cpu) {
>   struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
>  
>   count += READ_ONCE(krcp->count);
> +
> + raw_spin_lock_irqsave(&krcp->lock, flags);
> + count += krcp->nr_bkv_objs;
> + raw_spin_unlock_irqrestore(&krcp->lock, flags);
>   }
>  
>   return count;
> @@ -3598,6 +3622,8 @@ kfree_rcu_shrink_scan(struct shrinker *shrink, struct 
> shrink_control *sc)
>   struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
>  
>   count = krcp->count;
> + count += free_krc_page_cache(krcp);
> +
>   raw_spin_lock_irqsave(&krcp->lock, flags);
>   if (krcp->monitor_todo)
>   kfree_rcu_drain_unlock(krcp, flags);
> -- 
> 2.17.1
> 
Thank you for your patch!

I spent some time looking at how the patch behaves under low-memory conditions.
To simulate them, I used the "rcuscale" tool with the parameters below:

../rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 10 --kconfig 
CONFIG_NR_CPUS=64 \
--bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 
rcuscale.holdoff=20 rcuscale.kfree_loops=1 \
torture.disable_onoff_at_boot" --trust-make

64 CPUs + 512 MB of memory. In general, my test system was running on the edge,
sometimes hitting out-of-memory, but it could be considered stable with respect
to test completion and elapsed time, so both were pretty solid.

You can find a comparison plot, which can be downloaded via the following
link: wget ftp://vps418301.ovh.net/incoming/release_page_cache_under_low_memory.png

In short, I see that the patched version can lead to longer test completion times,
whereas the default variant is stable on almost all runs. After some analysis and
further digging I came to the conclusion that the shrinker callback,
free_krc_page_cache(), competes with run_page_cache_worker(krcp) running from the
kvfree_rcu() context.

I.e. during the test the page shrinker is pretty active because of the low-memory
condition. Our callback drains the cache, whereas the kvfree_rcu() side refills it
right away, creating a kind of vicious circle.

So run_page_cache_worker() should back off for some time when the system runs into
a low-memory condition or high memory pressure:

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7077d73fcb53..446723b9646b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3163,7 +3163,7 @@ struct kfree_rcu_cpu {
bool initialized;
int count;
 
-   struct work_struct page_cache_work;
+   struct delayed_work page_cache_work;
atomic_t work_in_progress;
struct hrtimer hrtimer;
 
@@ -3419,7 +3419,7 @@ schedule_page_work_fn(struct hrtimer *t)
struct kfree_rcu_cpu *krcp =
container_of(t, struct kfree_rcu_cpu, hrtimer);
 
-   queue_work(system_highpri_wq, &krcp->page_cache_work);
+   queue_delayed_work(system_highpri_wq, &krcp->page_cache_work, 0);
return HRTIMER_NORESTART;
 }
 
@@ -3428,7 +3428,7 @@ static void fill_page_cache_func(struct work_struct *work)
struct kvfree_rcu_bulk_data *bnode;
struct kfree_rcu_cpu *krcp =
container_of(work, struct kfree_rcu_cpu,
-   page_cache_work);
+   page_cache_work.work);
unsigned long flags;
bool pushed;
int i;
@@ -3452,15 +3452,22 @@ static void fill_page_cache_func(struct work_struct 
*work)
atomic_set(&krcp->work_in_progress, 0);
 }
 
+static bool backoff_page_cache_fill;
+
 static void
 run_page_cache_worker(struct kfree_rcu_cpu *krcp)
 {
if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
!atomic_xchg(&krcp->work_in_progress, 1)) {
-   
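The hunk above is where such a back-off would be wired in. As a rough,
self-contained sketch of the idea (illustrative only: the
"backoff_page_cache_fill" flag comes from the hunk above, while the 5*HZ delay,
the use of system_highpri_wq and the omission of the hrtimer indirection used
by the existing code are assumptions of this example):

/*
 * Illustrative sketch: when the shrinker has recently drained the per-CPU
 * page cache (signalled via backoff_page_cache_fill), delay the next refill
 * instead of queueing it immediately.
 */
static void
run_page_cache_worker(struct kfree_rcu_cpu *krcp)
{
	unsigned long delay = 0;

	if (READ_ONCE(backoff_page_cache_fill)) {
		delay = 5 * HZ;	/* arbitrary back-off period for this example */
		WRITE_ONCE(backoff_page_cache_fill, false);
	}

	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
	    !atomic_xchg(&krcp->work_in_progress, 1))
		queue_delayed_work(system_highpri_wq,
				   &krcp->page_cache_work, delay);
}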

Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-02-01 Thread Uladzislau Rezki
On Mon, Feb 01, 2021 at 12:47:55PM +0100, Michal Hocko wrote:
> On Fri 29-01-21 17:35:31, Uladzislau Rezki wrote:
> > On Fri, Jan 29, 2021 at 09:56:29AM +0100, Michal Hocko wrote:
> > > On Thu 28-01-21 19:02:37, Uladzislau Rezki wrote:
> > > [...]
> > > > >From 0bdb8ca1ae62088790e0a452c4acec3821e06989 Mon Sep 17 00:00:00 2001
> > > > From: "Uladzislau Rezki (Sony)" 
> > > > Date: Wed, 20 Jan 2021 17:21:46 +0100
> > > > Subject: [PATCH v2 1/1] kvfree_rcu: Directly allocate page for 
> > > > single-argument
> > > >  case
> > > > 
> > > > Single-argument kvfree_rcu() must be invoked from sleepable contexts,
> > > > so we can directly allocate pages.  Furthermmore, the fallback in case
> > > > of page-allocation failure is the high-latency synchronize_rcu(), so it
> > > > makes sense to do these page allocations from the fastpath, and even to
> > > > permit limited sleeping within the allocator.
> > > > 
> > > > This commit therefore allocates if needed on the fastpath using
> > > > GFP_KERNEL|__GFP_NORETRY.
> > > 
> > > Yes, __GFP_NORETRY as a lightweight allocation mode should be fine. It
> > > is more robust than __GFP_NOWAIT on memory usage spikes.  The caller is
> > > prepared to handle the failure which is likely much less disruptive than
> > > OOM or potentially heavy reclaim __GFP_RETRY_MAYFAIL.
> > > 
> > > I cannot give you ack as I am not familiar with the code but this makes
> > > sense to me.
> > > 
> > No problem, i can separate it. We can have a patch on top of what we have so
> > far. The patch only modifies the gfp_mask passed to __get_free_pages():
> > 
> > >From ec2feaa9b7f55f73b3b17e9ac372151c1aab5ae0 Mon Sep 17 00:00:00 2001
> > From: "Uladzislau Rezki (Sony)" 
> > Date: Fri, 29 Jan 2021 17:16:03 +0100
> > Subject: [PATCH 1/1] kvfree_rcu: replace __GFP_RETRY_MAYFAIL by 
> > __GFP_NORETRY
> > 
> > __GFP_RETRY_MAYFAIL is a bit heavy from reclaim process of view,
> > therefore a time consuming. That is not optional and there is
> > no need in doing it so hard, because we have a fallback path.
> > 
> > __GFP_NORETRY in its turn can perform some light-weight reclaim
> > and it rather fails under high memory pressure or low memory
> > condition.
> > 
> > In general there are four simple criterias we we would like to
> > achieve:
> > a) minimize a fallback hitting;
> > b) avoid of OOM invoking;
> > c) do a light-wait page request;
> > d) avoid of dipping into the emergency reserves.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) 
> 
> Looks good to me. Feel free to add
> Acked-by: Michal Hocko 
> 
Appreciate it!

--
Vlad Rezki


Re: Re: [PATCH v2] kvfree_rcu: Release page cache under memory pressure

2021-01-30 Thread Uladzislau Rezki
On Sat, Jan 30, 2021 at 06:47:31AM +, Zhang, Qiang wrote:
> 
> 
> ____
> From: Uladzislau Rezki 
> Sent: January 29, 2021 22:19
> To: Zhang, Qiang
> Cc: ure...@gmail.com; paul...@kernel.org; j...@joelfernandes.org; 
> r...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v2] kvfree_rcu: Release page cache under memory pressure
> 
> [Please note: This e-mail is from an EXTERNAL e-mail address]
> 
> On Fri, Jan 29, 2021 at 04:04:42PM +0800, qiang.zh...@windriver.com wrote:
> > From: Zqiang 
> >
> > Add free per-cpu existing krcp's page cache operation, when
> > the system is under memory pressure.
> >
> > Signed-off-by: Zqiang 
> > ---
> >  kernel/rcu/tree.c | 25 +
> >  1 file changed, 25 insertions(+)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index c1ae1e52f638..ec098910d80b 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3571,17 +3571,40 @@ void kvfree_call_rcu(struct rcu_head *head, 
> > rcu_callback_t func)
> >  }
> >  EXPORT_SYMBOL_GPL(kvfree_call_rcu);
> >
> > +static int free_krc_page_cache(struct kfree_rcu_cpu *krcp)
> > +{
> > + unsigned long flags;
> > + struct kvfree_rcu_bulk_data *bnode;
> > + int i;
> > +
> > + for (i = 0; i < rcu_min_cached_objs; i++) {
> > + raw_spin_lock_irqsave(&krcp->lock, flags);
> >I am not sure why we should disable IRQs. I think it can be >avoided.
> 
> Suppose in multi CPU system, the kfree_rcu_shrink_scan function is runing on 
> CPU2,
> and we just traverse to CPU2, and then call free_krc_page_cache function,
> if not disable irq, a interrupt may be occurs on CPU2 after the CPU2 
> corresponds to krcp variable 's lock be acquired,  if the interrupt or 
> softirq handler function to call kvfree_rcu function, in this function , 
> acquire CPU2 corresponds to krcp variable 's lock , will happen deadlock.
> Or in single CPU scenario.
> 
Right. Deadlock scenario. It went away from my head during writing that :)

Thanks!

--
Vlad Rezki


[PATCH 2/2] kvfree_rcu: Use same set of flags as for single-argument

2021-01-29 Thread Uladzislau Rezki (Sony)
Running the rcuscale stress-suite can drive a system out of memory.
This can happen under high memory pressure with a small amount of
physical memory.

For example, a KVM test configuration with 64 CPUs and 512 megabytes
of memory can run out of memory after running rcuscale with the
parameters below:

../kvm.sh --torture rcuscale --allcpus --duration 10 --kconfig 
CONFIG_NR_CPUS=64 \
--bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 
rcuscale.holdoff=20 \
  rcuscale.kfree_loops=1 torture.disable_onoff_at_boot" --trust-make


[   12.054448] kworker/1:1H invoked oom-killer: 
gfp_mask=0x2cc0(GFP_KERNEL|__GFP_NOWARN), order=0, oom_score_adj=0
[   12.055303] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510
[   12.055416] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.12.0-1 04/01/2014
[   12.056485] Workqueue: events_highpri fill_page_cache_func
[   12.056485] Call Trace:
[   12.056485]  dump_stack+0x57/0x6a
[   12.056485]  dump_header+0x4c/0x30a
[   12.056485]  ? del_timer_sync+0x20/0x30
[   12.056485]  out_of_memory.cold.47+0xa/0x7e
[   12.056485]  __alloc_pages_slowpath.constprop.123+0x82f/0xc00
[   12.056485]  __alloc_pages_nodemask+0x289/0x2c0
[   12.056485]  __get_free_pages+0x8/0x30
[   12.056485]  fill_page_cache_func+0x39/0xb0
[   12.056485]  process_one_work+0x1ed/0x3b0
[   12.056485]  ? process_one_work+0x3b0/0x3b0
[   12.060485]  worker_thread+0x28/0x3c0
[   12.060485]  ? process_one_work+0x3b0/0x3b0
[   12.060485]  kthread+0x138/0x160
[   12.060485]  ? kthread_park+0x80/0x80
[   12.060485]  ret_from_fork+0x22/0x30
[   12.062156] Mem-Info:
[   12.062350] active_anon:0 inactive_anon:0 isolated_anon:0
[   12.062350]  active_file:0 inactive_file:0 isolated_file:0
[   12.062350]  unevictable:0 dirty:0 writeback:0
[   12.062350]  slab_reclaimable:2797 slab_unreclaimable:80920
[   12.062350]  mapped:1 shmem:2 pagetables:8 bounce:0
[   12.062350]  free:10488 free_pcp:1227 free_cma:0
...
[   12.101610] Out of memory and no killable processes...
[   12.102042] Kernel panic - not syncing: System is deadlocked on memory
[   12.102583] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510
[   12.102600] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.12.0-1 04/01/2014


Since we have a fallback mechanism, we should not go with
"GFP_KERNEL | __GFP_NOWARN", which implies a "hard" page request that can
involve the OOM killer. Replace that set of flags with the same one used
for the single-argument case.

Thus it will follow the same rules:
a) minimize hitting the fallback path;
b) avoid invoking the OOM killer;
c) issue a light-weight page request;
d) avoid dipping into the emergency reserves.

With this change, rcuscale with the parameters in question never runs
into a "Kernel panic".

Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 1e862120db9e..2c9cf4df942c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3435,7 +3435,7 @@ static void fill_page_cache_func(struct work_struct *work)
 
for (i = 0; i < rcu_min_cached_objs; i++) {
bnode = (struct kvfree_rcu_bulk_data *)
-   __get_free_page(GFP_KERNEL | __GFP_NOWARN);
+   __get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
 
if (bnode) {
raw_spin_lock_irqsave(&krcp->lock, flags);
-- 
2.20.1



[PATCH 1/2] rcuscale: add kfree_rcu() single-argument scale test

2021-01-29 Thread Uladzislau Rezki (Sony)
To stress and test the single-argument variant of the kfree_rcu() call,
we should have special coverage for it. We used to have it in the
test-suite related to vmalloc stressing. The reason for the move is that
rcuscale is the correct place for RCU-related things.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/rcuscale.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index 06491d5530db..e17745a155f9 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -94,6 +94,7 @@ torture_param(bool, shutdown, RCUSCALE_SHUTDOWN,
 torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
 torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to 
disable");
 torture_param(int, kfree_rcu_test, 0, "Do we run a kfree_rcu() scale test?");
+torture_param(int, kfree_rcu_test_single, 0, "Do we run a kfree_rcu() 
single-argument scale test?");
 torture_param(int, kfree_mult, 1, "Multiple of kfree_obj size to allocate.");
 
 static char *scale_type = "rcu";
@@ -667,10 +668,14 @@ kfree_scale_thread(void *arg)
 
for (i = 0; i < kfree_alloc_num; i++) {
alloc_ptr = kmalloc(kfree_mult * sizeof(struct 
kfree_obj), GFP_KERNEL);
+
if (!alloc_ptr)
return -ENOMEM;
 
-   kfree_rcu(alloc_ptr, rh);
+   if (kfree_rcu_test_single)
+   kfree_rcu(alloc_ptr);
+   else
+   kfree_rcu(alloc_ptr, rh);
}
 
cond_resched();
-- 
2.20.1
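As a usage note (an assumption based on the torture_param() added above, not
something spelled out in this message), the single-argument coverage would
presumably be enabled via the rcuscale boot parameters in the same way as the
existing kfree_rcu_test option, for example:

../rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 10 \
    --kconfig CONFIG_NR_CPUS=64 \
    --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_rcu_test_single=1 \
    rcuscale.kfree_nthreads=16 torture.disable_onoff_at_boot" --trust-make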



Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-29 Thread Uladzislau Rezki
On Fri, Jan 29, 2021 at 09:56:29AM +0100, Michal Hocko wrote:
> On Thu 28-01-21 19:02:37, Uladzislau Rezki wrote:
> [...]
> > >From 0bdb8ca1ae62088790e0a452c4acec3821e06989 Mon Sep 17 00:00:00 2001
> > From: "Uladzislau Rezki (Sony)" 
> > Date: Wed, 20 Jan 2021 17:21:46 +0100
> > Subject: [PATCH v2 1/1] kvfree_rcu: Directly allocate page for 
> > single-argument
> >  case
> > 
> > Single-argument kvfree_rcu() must be invoked from sleepable contexts,
> > so we can directly allocate pages.  Furthermmore, the fallback in case
> > of page-allocation failure is the high-latency synchronize_rcu(), so it
> > makes sense to do these page allocations from the fastpath, and even to
> > permit limited sleeping within the allocator.
> > 
> > This commit therefore allocates if needed on the fastpath using
> > GFP_KERNEL|__GFP_NORETRY.
> 
> Yes, __GFP_NORETRY as a lightweight allocation mode should be fine. It
> is more robust than __GFP_NOWAIT on memory usage spikes.  The caller is
> prepared to handle the failure which is likely much less disruptive than
> OOM or potentially heavy reclaim __GFP_RETRY_MAYFAIL.
> 
> I cannot give you ack as I am not familiar with the code but this makes
> sense to me.
> 
No problem, I can separate it. We can have a patch on top of what we have so
far. The patch only modifies the gfp_mask passed to __get_free_pages():

>From ec2feaa9b7f55f73b3b17e9ac372151c1aab5ae0 Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)" 
Date: Fri, 29 Jan 2021 17:16:03 +0100
Subject: [PATCH 1/1] kvfree_rcu: replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY

__GFP_RETRY_MAYFAIL is rather heavy from the reclaim-process point of
view, and therefore time consuming. That is not optimal, and there is
no need to try that hard, because we have a fallback path.

__GFP_NORETRY, in its turn, can perform some light-weight reclaim and
it rather fails under high memory pressure or in low-memory conditions.

In general there are four simple criteria we would like to achieve:
a) minimize hitting the fallback path;
b) avoid invoking the OOM killer;
c) issue a light-weight page request;
d) avoid dipping into the emergency reserves.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/tree.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 70ddc339e0b7..1e862120db9e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3489,8 +3489,20 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
bnode = get_cached_bnode(*krcp);
if (!bnode && can_alloc) {
krc_this_cpu_unlock(*krcp, *flags);
+
+   // __GFP_NORETRY - allows a light-weight direct reclaim
+   // what is OK from minimizing of fallback hitting point of
+   // view. Apart of that it forbids any OOM invoking what is
+   // also beneficial since we are about to release memory soon.
+   //
+   // __GFP_NOMEMALLOC - prevents from consuming of all the
+   // memory reserves. Please note we have a fallback path.
+   //
+   // __GFP_NOWARN - it is supposed that an allocation can
+   // be failed under low memory or high memory pressure
+   // scenarios.
bnode = (struct kvfree_rcu_bulk_data *)
-   __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
+   __get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
*krcp = krc_this_cpu_lock(flags);
}
 
-- 
2.20.1

--
Vlad Rezki


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-28 Thread Uladzislau Rezki
On Thu, Jan 28, 2021 at 04:30:17PM +0100, Uladzislau Rezki wrote:
> On Thu, Jan 28, 2021 at 04:17:01PM +0100, Michal Hocko wrote:
> > On Thu 28-01-21 16:11:52, Uladzislau Rezki wrote:
> > > On Mon, Jan 25, 2021 at 05:25:59PM +0100, Uladzislau Rezki wrote:
> > > > On Mon, Jan 25, 2021 at 04:39:43PM +0100, Michal Hocko wrote:
> > > > > On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > > > > > > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > > > > > > For a single argument we can directly request a page from a 
> > > > > > > > caller
> > > > > > > > context when a "carry page block" is run out of free spots. 
> > > > > > > > Instead
> > > > > > > > of hitting a slow path we can request an extra page by demand 
> > > > > > > > and
> > > > > > > > proceed with a fast path.
> > > > > > > > 
> > > > > > > > A single-argument kvfree_rcu() must be invoked in sleepable 
> > > > > > > > contexts,
> > > > > > > > and that its fallback is the relatively high latency 
> > > > > > > > synchronize_rcu().
> > > > > > > > Single-argument kvfree_rcu() therefore uses 
> > > > > > > > GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > > > > > > to allow limited sleeping within the memory allocator.
> > > > > > > 
> > > > > > > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the 
> > > > > > > most heavy
> > > > > > > way to allocate without triggering the OOM killer. Is this really 
> > > > > > > what
> > > > > > > you need/want? Is __GFP_NORETRY too weak?
> > > > > > > 
> > > > > > Hm... We agreed to proceed with limited lightwait memory direct 
> > > > > > reclaim.
> > > > > > Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> > > > > > point: https://www.spinics.net/lists/rcu/msg02856.html
> > > > > > 
> > > > > > 
> > > > > > So I'm inclined to suggest __GFP_NORETRY as a starting point, 
> > > > > > and make
> > > > > > further decisions based on instrumentation of the success rates 
> > > > > > of
> > > > > > these opportunistic allocations.
> > > > > > 
> > > > > 
> > > > > I completely agree with Johannes here.
> > > > > 
> > > > > > but for some reason, i can't find a tail or head of it, we 
> > > > > > introduced
> > > > > > __GFP_RETRY_MAYFAIL what is a heavy one from a time consuming point 
> > > > > > of view.
> > > > > > What we would like to avoid.
> > > > > 
> > > > > Not that I object to this use but I think it would be much better to 
> > > > > use
> > > > > it based on actual data. Going along with it right away might become a
> > > > > future burden to make any changes in this aspect later on due to lack 
> > > > > of 
> > > > > exact reasoning. General rule of thumb for __GFP_RETRY_MAYFAIL is 
> > > > > really
> > > > > try as hard as it can get without being really disruptive (like OOM
> > > > > killing something). And your wording didn't really give me that
> > > > > impression.
> > > > > 
> > > > Initially i proposed just to go with GFP_NOWAIT flag. But later on there
> > > > was a discussion about a fallback path, that uses synchronize_rcu() can 
> > > > be
> > > > slow, thus minimizing its hitting would be great. So, here we go with a
> > > > trade off.
> > > > 
> > > > Doing it hard as __GFP_RETRY_MAYFAIL can do, is not worth(IMHO), but to 
> > > > have some
> > > > light-wait requests would be acceptable. That is why __GFP_NORETRY was 
> > > > proposed.
> > > > 
> > > > There were simple criterias we discussed which we would like to achieve:
> > > > 
> > > > a) minimize a fallback hitting;
> > > > b) avoid of OOM involving;
> > > > c) avoid of dipping into the emergency reserves. See kvfree_rcu: Use 
> > > > __GFP_NOMEMALLOC for single-argument kvfree_rcu()
&

Re: [PATCH 2/3] kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu()

2021-01-28 Thread Uladzislau Rezki
On Wed, Jan 20, 2021 at 05:21:47PM +0100, Uladzislau Rezki (Sony) wrote:
> From: "Paul E. McKenney" 
> 
> This commit applies the __GFP_NOMEMALLOC gfp flag to memory allocations
> carried out by the single-argument variant of kvfree_rcu(), thus avoiding
> this can-sleep code path from dipping into the emergency reserves.
> 
> Acked-by: Michal Hocko 
> Suggested-by: Michal Hocko 
> Signed-off-by: Paul E. McKenney 
> Signed-off-by: Uladzislau Rezki (Sony) 
> ---
>  kernel/rcu/tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 2014fb22644d..454809514c91 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3491,7 +3491,7 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
>   if (!bnode && can_alloc) {
>   krc_this_cpu_unlock(*krcp, *flags);
>   bnode = (struct kvfree_rcu_bulk_data *)
> - __get_free_page(GFP_KERNEL | 
> __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> + __get_free_page(GFP_KERNEL | 
> __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
>   *krcp = krc_this_cpu_lock(flags);
>   }
>  
> -- 
> 2.20.1
> 
Please see a V2 below:

V1 -> V2:
- rebase on [PATCH v2 1/1] kvfree_rcu: Directly allocate page for 
single-argument
- add a comment about __GFP_NOMEMALLOC usage.


>From 1427698cdbdced53d9b5eee60aa5d22bc223056d Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney" 
Date: Wed, 20 Jan 2021 17:21:47 +0100
Subject: [PATCH v2 1/1] kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument
 kvfree_rcu()

This commit applies the __GFP_NOMEMALLOC gfp flag to memory allocations
carried out by the single-argument variant of kvfree_rcu(), thus avoiding
this can-sleep code path from dipping into the emergency reserves.

Acked-by: Michal Hocko 
Suggested-by: Michal Hocko 
Signed-off-by: Paul E. McKenney 
Signed-off-by: Uladzislau Rezki (Sony) 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e450c17a06b2..e7b705155c92 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3496,11 +3496,14 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
// view. Apart of that it forbids any OOM invoking what is
// also beneficial since we are about to release memory soon.
//
+   // __GFP_NOMEMALLOC - prevents from consuming of all the
+   // memory reserves. Please note we have a fallback path.
+   //
// __GFP_NOWARN - it is supposed that an allocation can
// be failed under low memory or high memory pressure
// scenarios.
bnode = (struct kvfree_rcu_bulk_data *)
-   __get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
+   __get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
*krcp = krc_this_cpu_lock(flags);
}
 
-- 
2.20.1


Thanks!

--
Vlad Rezki


Re: [PATCH] kvfree_rcu: Release page cache under memory pressure

2021-01-28 Thread Uladzislau Rezki
Hello, Zqiang.

See below some nits:

> 
> Add free per-cpu existing krcp's page cache operation, when
> the system is under memory pressure.
> 
> Signed-off-by: Zqiang 
> ---
>  kernel/rcu/tree.c | 26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index c1ae1e52f638..4e1c14b12bdd 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3571,17 +3571,41 @@ void kvfree_call_rcu(struct rcu_head *head, 
> rcu_callback_t func)
>  }
>  EXPORT_SYMBOL_GPL(kvfree_call_rcu);
>  
> +static inline int free_krc_page_cache(struct kfree_rcu_cpu *krcp)
Do we need it "inlined"?

> +{
> + unsigned long flags;
> + struct kvfree_rcu_bulk_data *bnode;
> + int i, num = 0;
> +
> + for (i = 0; i < rcu_min_cached_objs; i++) {
> + raw_spin_lock_irqsave(&krcp->lock, flags);
> + bnode = get_cached_bnode(krcp);
> + raw_spin_unlock_irqrestore(&krcp->lock, flags);
> + if (!bnode)
> + break;
> + free_page((unsigned long)bnode);
> + num++;
> + }
> +
> + return num;
Get rid of "num" and return i instead?

> +}
> +
>  static unsigned long
>  kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>  {
>   int cpu;
>   unsigned long count = 0;
> + unsigned long flags;
>  
>   /* Snapshot count of all CPUs */
>   for_each_possible_cpu(cpu) {
>   struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
>  
>   count += READ_ONCE(krcp->count);
> +
> + raw_spin_lock_irqsave(&krcp->lock, flags);
> + count += krcp->nr_bkv_objs;
> + raw_spin_unlock_irqrestore(&krcp->lock, flags);
>   }
>  
>   return count;
> @@ -3604,6 +3628,8 @@ kfree_rcu_shrink_scan(struct shrinker *shrink, struct 
> shrink_control *sc)
>   else
>   raw_spin_unlock_irqrestore(>lock, flags);
>  
> + count += free_krc_page_cache(krcp);
Move it up, right after "count = krcp->count;", so that "count" is set
in one place, which is more readable and clearer?
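Roughly, the two suggestions above amount to something like the following
(an illustrative sketch only, not a tested patch):

static int free_krc_page_cache(struct kfree_rcu_cpu *krcp)
{
	unsigned long flags;
	struct kvfree_rcu_bulk_data *bnode;
	int i;

	for (i = 0; i < rcu_min_cached_objs; i++) {
		raw_spin_lock_irqsave(&krcp->lock, flags);
		bnode = get_cached_bnode(krcp);
		raw_spin_unlock_irqrestore(&krcp->lock, flags);

		if (!bnode)
			break;

		free_page((unsigned long)bnode);
	}

	/* Number of pages actually released. */
	return i;
}

and, on the kfree_rcu_shrink_scan() side, setting "count" in one place:

	count = krcp->count;
	count += free_krc_page_cache(krcp);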

Thank you!

--
Vlad Rezki


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-28 Thread Uladzislau Rezki
On Thu, Jan 28, 2021 at 04:17:01PM +0100, Michal Hocko wrote:
> On Thu 28-01-21 16:11:52, Uladzislau Rezki wrote:
> > On Mon, Jan 25, 2021 at 05:25:59PM +0100, Uladzislau Rezki wrote:
> > > On Mon, Jan 25, 2021 at 04:39:43PM +0100, Michal Hocko wrote:
> > > > On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > > > > > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > > > > > For a single argument we can directly request a page from a caller
> > > > > > > context when a "carry page block" is run out of free spots. 
> > > > > > > Instead
> > > > > > > of hitting a slow path we can request an extra page by demand and
> > > > > > > proceed with a fast path.
> > > > > > > 
> > > > > > > A single-argument kvfree_rcu() must be invoked in sleepable 
> > > > > > > contexts,
> > > > > > > and that its fallback is the relatively high latency 
> > > > > > > synchronize_rcu().
> > > > > > > Single-argument kvfree_rcu() therefore uses 
> > > > > > > GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > > > > > to allow limited sleeping within the memory allocator.
> > > > > > 
> > > > > > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the most 
> > > > > > heavy
> > > > > > way to allocate without triggering the OOM killer. Is this really 
> > > > > > what
> > > > > > you need/want? Is __GFP_NORETRY too weak?
> > > > > > 
> > > > > Hm... We agreed to proceed with limited lightwait memory direct 
> > > > > reclaim.
> > > > > Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> > > > > point: https://www.spinics.net/lists/rcu/msg02856.html
> > > > > 
> > > > > 
> > > > > So I'm inclined to suggest __GFP_NORETRY as a starting point, and 
> > > > > make
> > > > > further decisions based on instrumentation of the success rates of
> > > > > these opportunistic allocations.
> > > > > 
> > > > 
> > > > I completely agree with Johannes here.
> > > > 
> > > > > but for some reason, i can't find a tail or head of it, we introduced
> > > > > __GFP_RETRY_MAYFAIL what is a heavy one from a time consuming point 
> > > > > of view.
> > > > > What we would like to avoid.
> > > > 
> > > > Not that I object to this use but I think it would be much better to use
> > > > it based on actual data. Going along with it right away might become a
> > > > future burden to make any changes in this aspect later on due to lack 
> > > > of 
> > > > exact reasoning. General rule of thumb for __GFP_RETRY_MAYFAIL is really
> > > > try as hard as it can get without being really disruptive (like OOM
> > > > killing something). And your wording didn't really give me that
> > > > impression.
> > > > 
> > > Initially i proposed just to go with GFP_NOWAIT flag. But later on there
> > > was a discussion about a fallback path, that uses synchronize_rcu() can be
> > > slow, thus minimizing its hitting would be great. So, here we go with a
> > > trade off.
> > > 
> > > Doing it hard as __GFP_RETRY_MAYFAIL can do, is not worth(IMHO), but to 
> > > have some
> > > light-wait requests would be acceptable. That is why __GFP_NORETRY was 
> > > proposed.
> > > 
> > > There were simple criterias we discussed which we would like to achieve:
> > > 
> > > a) minimize a fallback hitting;
> > > b) avoid of OOM involving;
> > > c) avoid of dipping into the emergency reserves. See kvfree_rcu: Use 
> > > __GFP_NOMEMALLOC for single-argument kvfree_rcu()
> > > 
> > One question here. Since the code that triggers a page request can be
> > directly invoked from reclaim context as well as outside of it. We had
> > a concern about if any recursion is possible, but what i see it is safe.
> > The context that does it can not enter it twice:
> > 
> > 
> > /* Avoid recursion of direct reclaim */
> > if (current->flags & PF_MEMALLOC)
> > goto nopage;
> > 
> 
> Yes this is a recursion protection.
> 
> > What about any deadlocking in regards to below following flags?
> > 
> > GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN
> 
> and __GFP_NOMEMALLOC will make sure that the allocation will not consume
> all the memory reserves. The later should be clarified in one of your
> patches I have acked IIRC.
>
Yep, it is clarified and reflected in another patch you ACKed.

Thanks!

--
Vlad Rezki


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-28 Thread Uladzislau Rezki
On Mon, Jan 25, 2021 at 05:25:59PM +0100, Uladzislau Rezki wrote:
> On Mon, Jan 25, 2021 at 04:39:43PM +0100, Michal Hocko wrote:
> > On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > > > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > > > For a single argument we can directly request a page from a caller
> > > > > context when a "carry page block" is run out of free spots. Instead
> > > > > of hitting a slow path we can request an extra page by demand and
> > > > > proceed with a fast path.
> > > > > 
> > > > > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > > > > and that its fallback is the relatively high latency 
> > > > > synchronize_rcu().
> > > > > Single-argument kvfree_rcu() therefore uses 
> > > > > GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > > > to allow limited sleeping within the memory allocator.
> > > > 
> > > > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the most heavy
> > > > way to allocate without triggering the OOM killer. Is this really what
> > > > you need/want? Is __GFP_NORETRY too weak?
> > > > 
> > > Hm... We agreed to proceed with limited lightwait memory direct reclaim.
> > > Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> > > point: https://www.spinics.net/lists/rcu/msg02856.html
> > > 
> > > 
> > > So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> > > further decisions based on instrumentation of the success rates of
> > > these opportunistic allocations.
> > > 
> > 
> > I completely agree with Johannes here.
> > 
> > > but for some reason, i can't find a tail or head of it, we introduced
> > > __GFP_RETRY_MAYFAIL what is a heavy one from a time consuming point of 
> > > view.
> > > What we would like to avoid.
> > 
> > Not that I object to this use but I think it would be much better to use
> > it based on actual data. Going along with it right away might become a
> > future burden to make any changes in this aspect later on due to lack of 
> > exact reasoning. General rule of thumb for __GFP_RETRY_MAYFAIL is really
> > try as hard as it can get without being really disruptive (like OOM
> > killing something). And your wording didn't really give me that
> > impression.
> > 
> Initially i proposed just to go with GFP_NOWAIT flag. But later on there
> was a discussion about a fallback path, that uses synchronize_rcu() can be
> slow, thus minimizing its hitting would be great. So, here we go with a
> trade off.
> 
> Doing it hard as __GFP_RETRY_MAYFAIL can do, is not worth(IMHO), but to have 
> some
> light-wait requests would be acceptable. That is why __GFP_NORETRY was 
> proposed.
> 
> There were simple criterias we discussed which we would like to achieve:
> 
> a) minimize a fallback hitting;
> b) avoid of OOM involving;
> c) avoid of dipping into the emergency reserves. See kvfree_rcu: Use 
> __GFP_NOMEMALLOC for single-argument kvfree_rcu()
> 
One question here. The code that triggers a page request can be invoked
directly from the reclaim context as well as outside of it. We had a
concern about whether any recursion is possible, but from what I can see
it is safe. The context that does it cannot enter it twice:


/* Avoid recursion of direct reclaim */
if (current->flags & PF_MEMALLOC)
goto nopage;


What about any deadlocking with regard to the following flags?

GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN

Thanks!

--
Vlad Rezki


Re: Re: Re: Re: [PATCH] rcu: Release per-cpu krcp page cache when CPU going offline

2021-01-27 Thread Uladzislau Rezki
On Wed, Jan 27, 2021 at 09:00:27AM +, Zhang, Qiang wrote:
> 
> 
> ____
> From: Uladzislau Rezki 
> Sent: January 26, 2021 22:07
> To: Zhang, Qiang
> Cc: Uladzislau Rezki; Paul E. McKenney; r...@vger.kernel.org; 
> linux-kernel@vger.kernel.org
> Subject: Re: Re: Re: [PATCH] rcu: Release per-cpu krcp page cache when CPU going 
> offline
> 
> >
> > On Fri, Jan 22, 2021 at 01:44:36AM +, Zhang, Qiang wrote:
> > >
> > >
> > > 
> > > From: Uladzislau Rezki 
> > > Sent: January 22, 2021 4:26
> > > To: Zhang, Qiang
> > > Cc: Paul E. McKenney; r...@vger.kernel.org; linux-kernel@vger.kernel.org; 
> > > ure...@gmail.com
> > > Subject: Re: [PATCH] rcu: Release per-cpu krcp page cache when CPU going 
> > > offline
> > > >Hello, Qiang,
> > >
> > > > On Thu, Jan 21, 2021 at 02:49:49PM +0800, qiang.zh...@windriver.com 
> > > > wrote:
> > > > > From: Zqiang 
> > > > >
> > > > > If CPUs go offline, the corresponding krcp's page cache can
> > > > > not be use util the CPU come back online, or maybe the CPU
> > > > > will never go online again, this commit therefore free krcp's
> > > > > page cache when CPUs go offline.
> > > > >
> > > > > Signed-off-by: Zqiang 
> > > >
> > > >Do you consider it as an issue? We have 5 pages per CPU, that is 20480 
> > > >bytes.
> > > >
> > >
> > > Hello Rezki
> > >
> > > In a multi CPUs system, more than one CPUs may be offline, there are more 
> > > than 5 pages,  and these offline CPUs may never go online again  or  in 
> > > the process of CPUs online, there are errors, which lead to the failure 
> > > of online, these scenarios will lead to the per-cpu krc page cache will 
> > > never be released.
> > >
> > >Thanks for your answer. I was thinking more about if you knew some 
> > >>platforms
> > >which suffer from such extra page usage when CPU goes offline. Any >issues
> > >your platforms or devices run into because of that.
> > >
> > >So i understand that if CPU goes offline the 5 pages associated with it 
> > >>are
> > >unused until it goes online back.
> >
> >  I agree with you, But I still want to talk about what I think
> >
> >  My understanding is that when the CPU is offline,  the pages is not
> >  accessible,  beacuse we don't know when this CPU will
> >  go online again, so we best to return these page to the buddy system,
> >  when the CPU goes online again, we can allocate page from the buddy
> >  system to fill krcp's page cache.  maybe you may think that this memory
> >  is small and don't need to.
> >
> >BTW, we can release the caches via shrinker path instead, what is more makes
> >sense to me. We already have a callback, that frees pages when a page 
> >allocator
> >asks for it. I think in that case it would be fair to return it to the buddy
> >system. It happens under low memory condition
> 
>   I agree. it can be done in shrink callback, can release the currently 
> existing per-cpu 
>   page cache.
>   
Would you mind sending a patch? If you need some input, I am happy
to participate.

Thanks!

--
Vlad Rezki


Re: Re: Re: [PATCH] rcu: Release per-cpu krcp page cache when CPU going offline

2021-01-26 Thread Uladzislau Rezki
> 
> From: Uladzislau Rezki 
> Sent: January 22, 2021 22:31
> To: Zhang, Qiang
> Cc: Uladzislau Rezki; Paul E. McKenney; r...@vger.kernel.org; 
> linux-kernel@vger.kernel.org
> Subject: Re: Re: [PATCH] rcu: Release per-cpu krcp page cache when CPU going 
> offline
> 
> On Fri, Jan 22, 2021 at 01:44:36AM +, Zhang, Qiang wrote:
> >
> >
> > ________
> > From: Uladzislau Rezki 
> > Sent: January 22, 2021 4:26
> > To: Zhang, Qiang
> > Cc: Paul E. McKenney; r...@vger.kernel.org; linux-kernel@vger.kernel.org; 
> > ure...@gmail.com
> > Subject: Re: [PATCH] rcu: Release per-cpu krcp page cache when CPU going offline
> > >Hello, Qiang,
> >
> > > On Thu, Jan 21, 2021 at 02:49:49PM +0800, qiang.zh...@windriver.com wrote:
> > > > From: Zqiang 
> > > >
> > > > If CPUs go offline, the corresponding krcp's page cache can
> > > > not be use util the CPU come back online, or maybe the CPU
> > > > will never go online again, this commit therefore free krcp's
> > > > page cache when CPUs go offline.
> > > >
> > > > Signed-off-by: Zqiang 
> > >
> > >Do you consider it as an issue? We have 5 pages per CPU, that is 20480 
> > >bytes.
> > >
> >
> > Hello Rezki
> >
> > In a multi CPUs system, more than one CPUs may be offline, there are more 
> > than 5 pages,  and these offline CPUs may never go online again  or  in the 
> > process of CPUs online, there are errors, which lead to the failure of 
> > online, these scenarios will lead to the per-cpu krc page cache will never 
> > be released.
> >
> >Thanks for your answer. I was thinking more about if you knew some >platforms
> >which suffer from such extra page usage when CPU goes offline. Any >issues
> >your platforms or devices run into because of that.
> >
> >So i understand that if CPU goes offline the 5 pages associated with it >are
> >unused until it goes online back.
> 
>  I agree with you, But I still want to talk about what I think
> 
>  My understanding is that when the CPU is offline,  the pages is not 
>  accessible,  beacuse we don't know when this CPU will 
>  go online again, so we best to return these page to the buddy system,
>  when the CPU goes online again, we can allocate page from the buddy 
>  system to fill krcp's page cache.  maybe you may think that this memory 
>  is small and don't need to. 
>  
BTW, we can release the caches via the shrinker path instead, which makes more
sense to me. We already have a callback that frees pages when the page allocator
asks for it. I think in that case it would be fair to return them to the buddy
system. That happens under low-memory conditions, or it can be triggered manually
to flush system caches:

echo 3 > /proc/sys/vm/drop_caches
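For reference, the callback pair mentioned above is hooked up roughly like this
(a simplified sketch based on the kfree_rcu_shrink_*() functions discussed in
this thread; the batch/seeks values are just the usual defaults):

static struct shrinker kfree_rcu_shrinker = {
	.count_objects = kfree_rcu_shrink_count,
	.scan_objects = kfree_rcu_shrink_scan,
	.batch = 0,
	.seeks = DEFAULT_SEEKS,
};

/* Registered once at init time, e.g. from kfree_rcu_batch_init(). */
register_shrinker(&kfree_rcu_shrinker);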

What do you think?

--
Vlad Rezki


Re: Re: Re: Re: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()

2021-01-26 Thread Uladzislau Rezki
On Tue, Jan 26, 2021 at 09:33:40AM +, Zhang, Qiang wrote:
> 
> 
> ____
> From: Uladzislau Rezki 
> Sent: January 25, 2021 21:49
> To: Zhang, Qiang
> Cc: Uladzislau Rezki; LKML; RCU; Paul E . McKenney; Michael Ellerman; Andrew 
> Morton; Daniel Axtens; Frederic Weisbecker; Neeraj Upadhyay; Joel Fernandes; 
> Peter Zijlstra; Michal Hocko; Thomas Gleixner; Theodore Y . Ts'o; Sebastian 
> Andrzej Siewior; Oleksiy Avramchenko
> Subject: Re: Re: Re: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
> 
> > >Hello, Zhang.
> >
> > > >____
> > > >From: Uladzislau Rezki (Sony) 
> > > >Sent: January 21, 2021 0:21
> > > >To: LKML; RCU; Paul E . McKenney; Michael Ellerman
> > > >Cc: Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj Upadhyay; 
> > > >Joel Fernandes; Peter Zijlstra; Michal Hocko; Thomas Gleixner; Theodore 
> > > >Y . Ts'o; Sebastian Andrzej Siewior; Uladzislau Rezki; Oleksiy 
> > > >Avramchenko
> > > >Subject: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
> > > >
> > > >Since the page is obtained in a fully preemptible context, dropping
> > > >the lock can lead to migration onto another CPU. As a result a prev.
> > > >bnode of that CPU may be underutilised, because a decision has been
> > > >made for a CPU that was run out of free slots to store a pointer.
> > > >
> > > >migrate_disable/enable() are now independent of RT, use it in order
> > > >to prevent any migration during a page request for a specific CPU it
> > > >is requested for.
> > >
> > >
> > > Hello Rezki
> > >
> > > The critical migrate_disable/enable() area is not allowed to block, under 
> > > RT and non RT.
> > > There is such a description in preempt.h
> > >
> > >
> > > * Notes on the implementation.
> > >  *
> > >  * The implementation is particularly tricky since existing code patterns
> > >  * dictate neither migrate_disable() nor migrate_enable() is allowed to 
> > > block.
> > >  * This means that it cannot use cpus_read_lock() to serialize against 
> > > hotplug,
> > >  * nor can it easily migrate itself into a pending affinity mask change on
> > >  * migrate_enable().
> > >
> > >How i interpret it is migrate_enable()/migrate_disable() are not allowed to
> > >use any blocking primitives, such as rwsem/mutexes/etc. in order to mark a
> > >current context as non-migratable.
> > >
> > >void migrate_disable(void)
> > >{
> > > struct task_struct *p = current;
> > >
> > > if (p->migration_disabled) {
> > >  p->migration_disabled++;
> > >  return;
> > > }
> >
> > > preempt_disable();
> > > this_rq()->nr_pinned++;
> > > p->migration_disabled = 1;
> > > preempt_enable();
> > >}
> > >
> > >It does nothing that prevents you from doing schedule() or even wait for 
> > >any
> > >event(mutex slow path behaviour), when the process is removed from the 
> > >run-queue.
> > >I mean after the migrate_disable() is invoked. Or i miss something?
> >
> > Hello Rezki
> >
> > Sorry, there's something wrong with the previous description.
> > There are the following scenarios
> >
> > Due to migrate_disable will increase  this_rq()->nr_pinned , after that
> > if get_free_page be blocked, and this time, CPU going offline,
> > the sched_cpu_wait_empty() be called in per-cpu "cpuhp/%d" task,
> > and be blocked.
> >
> >But after the migrate_disable() is invoked a CPU can not be brought down.
> >If there are pinned tasks a "hotplug path" will be blocked on 
> >balance_hotplug_wait()
> >call.
> 
> > blocked:
> > sched_cpu_wait_empty()
> > {
> >   struct rq *rq = this_rq();
> >rcuwait_wait_event(&rq->hotplug_wait,
> >rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
> >TASK_UNINTERRUPTIBLE);
> > }
> >
> >Exactly.
> 
> > wakeup:
> > balance_push()
> > {
> > if (is_per_cpu_kthread(push_task) || 
> > is_migration_disabled(push_task)) {
> >
> > if (!rq->nr_running && !rq_has_pinned_tasks(rq) &&
> > rcuwait_active(&rq->hotplug_wait)) {
> > raw_sp

Re: Re: Re: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()

2021-01-25 Thread Uladzislau Rezki
> 
> 
> From: Uladzislau Rezki 
> Sent: January 25, 2021 5:57
> To: Zhang, Qiang
> Cc: Uladzislau Rezki (Sony); LKML; RCU; Paul E . McKenney; Michael Ellerman; 
> Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj Upadhyay; Joel 
> Fernandes; Peter Zijlstra; Michal Hocko; Thomas Gleixner; Theodore Y . Ts'o; 
> Sebastian Andrzej Siewior; Oleksiy Avramchenko
> Subject: Re: Re: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
> 
> >Hello, Zhang.
> 
> > >____
> > >From: Uladzislau Rezki (Sony) 
> > >Sent: January 21, 2021 0:21
> > >To: LKML; RCU; Paul E . McKenney; Michael Ellerman
> > >Cc: Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj Upadhyay; 
> > >Joel Fernandes; Peter Zijlstra; Michal Hocko; Thomas Gleixner; Theodore Y 
> > >. Ts'o; Sebastian Andrzej Siewior; Uladzislau Rezki; Oleksiy Avramchenko
> > >Subject: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
> > >
> > >Since the page is obtained in a fully preemptible context, dropping
> > >the lock can lead to migration onto another CPU. As a result a prev.
> > >bnode of that CPU may be underutilised, because a decision has been
> > >made for a CPU that was run out of free slots to store a pointer.
> > >
> > >migrate_disable/enable() are now independent of RT, use it in order
> > >to prevent any migration during a page request for a specific CPU it
> > >is requested for.
> >
> >
> > Hello Rezki
> >
> > The critical migrate_disable/enable() area is not allowed to block, under 
> > RT and non RT.
> > There is such a description in preempt.h
> >
> >
> > * Notes on the implementation.
> >  *
> >  * The implementation is particularly tricky since existing code patterns
> >  * dictate neither migrate_disable() nor migrate_enable() is allowed to 
> > block.
> >  * This means that it cannot use cpus_read_lock() to serialize against 
> > hotplug,
> >  * nor can it easily migrate itself into a pending affinity mask change on
> >  * migrate_enable().
> >
> >How i interpret it is migrate_enable()/migrate_disable() are not allowed to
> >use any blocking primitives, such as rwsem/mutexes/etc. in order to mark a
> >current context as non-migratable.
> >
> >void migrate_disable(void)
> >{
> > struct task_struct *p = current;
> >
> > if (p->migration_disabled) {
> >  p->migration_disabled++;
> >  return;
> > }
> 
> > preempt_disable();
> > this_rq()->nr_pinned++;
> > p->migration_disabled = 1;
> > preempt_enable();
> >}
> >
> >It does nothing that prevents you from doing schedule() or even wait for any
> >event(mutex slow path behaviour), when the process is removed from the 
> >run-queue.
> >I mean after the migrate_disable() is invoked. Or i miss something?
> 
> Hello Rezki
> 
> Sorry, there's something wrong with the previous description.
> There are the following scenarios
> 
> Due to migrate_disable will increase  this_rq()->nr_pinned , after that
> if get_free_page be blocked, and this time, CPU going offline,
> the sched_cpu_wait_empty() be called in per-cpu "cpuhp/%d" task,
> and be blocked.
> 
But after migrate_disable() is invoked, the CPU cannot be brought down.
If there are pinned tasks, the "hotplug path" will be blocked on the
balance_hotplug_wait() call.

> blocked:
> sched_cpu_wait_empty()
> {
>   struct rq *rq = this_rq();
>rcuwait_wait_event(&rq->hotplug_wait,
>rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
>TASK_UNINTERRUPTIBLE);
> }
>
Exactly.

> wakeup:
> balance_push()
> {
> if (is_per_cpu_kthread(push_task) || 
> is_migration_disabled(push_task)) {
>   
> if (!rq->nr_running && !rq_has_pinned_tasks(rq) &&
> rcuwait_active(&rq->hotplug_wait)) {
> raw_spin_unlock(&rq->lock);
> rcuwait_wake_up(&rq->hotplug_wait);
> raw_spin_lock(&rq->lock);
> }
> return;
> }
> }
> 
> One of the conditions for this function to wake up is "rq->nr_pinned  == 0"
> that is to say between migrate_disable/enable, if blocked will defect CPU 
> going
> offline longer blocking time.
> 
Indeed, the hotplug time is affected. For example in case of waiting for
a mutex to be released, an owner will wakeup wait

Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-25 Thread Uladzislau Rezki
On Mon, Jan 25, 2021 at 04:39:43PM +0100, Michal Hocko wrote:
> On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > > For a single argument we can directly request a page from a caller
> > > > context when a "carry page block" is run out of free spots. Instead
> > > > of hitting a slow path we can request an extra page by demand and
> > > > proceed with a fast path.
> > > > 
> > > > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > > > and that its fallback is the relatively high latency synchronize_rcu().
> > > > Single-argument kvfree_rcu() therefore uses 
> > > > GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > > to allow limited sleeping within the memory allocator.
> > > 
> > > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the most heavy
> > > way to allocate without triggering the OOM killer. Is this really what
> > > you need/want? Is __GFP_NORETRY too weak?
> > > 
> > Hm... We agreed to proceed with limited lightwait memory direct reclaim.
> > Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> > point: https://www.spinics.net/lists/rcu/msg02856.html
> > 
> > 
> > So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> > further decisions based on instrumentation of the success rates of
> > these opportunistic allocations.
> > 
> 
> I completely agree with Johannes here.
> 
> > but for some reason, i can't find a tail or head of it, we introduced
> > __GFP_RETRY_MAYFAIL what is a heavy one from a time consuming point of view.
> > What we would like to avoid.
> 
> Not that I object to this use but I think it would be much better to use
> it based on actual data. Going along with it right away might become a
> future burden to make any changes in this aspect later on due to lack of 
> exact reasoning. General rule of thumb for __GFP_RETRY_MAYFAIL is really
> try as hard as it can get without being really disruptive (like OOM
> killing something). And your wording didn't really give me that
> impression.
> 
Initially I proposed just going with the GFP_NOWAIT flag. But later on there
was a discussion that the fallback path, which uses synchronize_rcu(), can be
slow, thus minimizing how often it is hit would be great. So here we go with
a trade-off.

Trying as hard as __GFP_RETRY_MAYFAIL does is not worth it (IMHO), but having
some light-weight requests would be acceptable. That is why __GFP_NORETRY was
proposed.

There were simple criteria we discussed which we would like to achieve:

a) minimize hitting the fallback path;
b) avoid involving the OOM killer;
c) avoid dipping into the emergency reserves. See "kvfree_rcu: Use
   __GFP_NOMEMALLOC for single-argument kvfree_rcu()".
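In code terms, the combination being discussed boils down to the allocation
request below (this mirrors the call as it appears in the patches quoted
elsewhere in this thread):

	bnode = (struct kvfree_rcu_bulk_data *)
		__get_free_page(GFP_KERNEL | __GFP_NORETRY |
				__GFP_NOMEMALLOC | __GFP_NOWARN);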

--
Vlad Rezki


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-25 Thread Uladzislau Rezki
> On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > For a single argument we can directly request a page from a caller
> > context when a "carry page block" is run out of free spots. Instead
> > of hitting a slow path we can request an extra page by demand and
> > proceed with a fast path.
> > 
> > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > and that its fallback is the relatively high latency synchronize_rcu().
> > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > to allow limited sleeping within the memory allocator.
> 
> __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the most heavy
> way to allocate without triggering the OOM killer. Is this really what
> you need/want? Is __GFP_NORETRY too weak?
> 
Hm... We agreed to proceed with limited, light-weight direct memory reclaim.
Johannes Weiner proposed going with the __GFP_NORETRY flag as a starting
point: https://www.spinics.net/lists/rcu/msg02856.html


So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
further decisions based on instrumentation of the success rates of
these opportunistic allocations.


But for some reason (I cannot find head or tail of it) we introduced
__GFP_RETRY_MAYFAIL, which is a heavy one from a time-consumption point of
view. That is what we would like to avoid.

I tend to say that it was a typo.

Thank you for pointing to it!

--
Vlad Rezki


Re: Re: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()

2021-01-24 Thread Uladzislau Rezki
Hello, Zhang.

> >
> >From: Uladzislau Rezki (Sony) 
> >Sent: January 21, 2021 0:21
> >To: LKML; RCU; Paul E . McKenney; Michael Ellerman
> >Cc: Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj Upadhyay; 
> >Joel Fernandes; Peter Zijlstra; Michal Hocko; Thomas Gleixner; Theodore Y . 
> >Ts'o; Sebastian Andrzej Siewior; Uladzislau Rezki; Oleksiy Avramchenko
> >Subject: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
> >
> >Since the page is obtained in a fully preemptible context, dropping
> >the lock can lead to migration onto another CPU. As a result a prev.
> >bnode of that CPU may be underutilised, because a decision has been
> >made for a CPU that was run out of free slots to store a pointer.
> >
> >migrate_disable/enable() are now independent of RT, use it in order
> >to prevent any migration during a page request for a specific CPU it
> >is requested for.
> 
> 
> Hello Rezki
> 
> The critical migrate_disable/enable() area is not allowed to block, under RT 
> and non RT.  
> There is such a description in preempt.h 
> 
> 
> * Notes on the implementation.
>  *
>  * The implementation is particularly tricky since existing code patterns
>  * dictate neither migrate_disable() nor migrate_enable() is allowed to block.
>  * This means that it cannot use cpus_read_lock() to serialize against 
> hotplug,
>  * nor can it easily migrate itself into a pending affinity mask change on
>  * migrate_enable().
> 
How I interpret it is that migrate_disable()/migrate_enable() themselves are
not allowed to use any blocking primitives, such as rwsems/mutexes/etc., in
order to mark the current context as non-migratable.

void migrate_disable(void)
{
 struct task_struct *p = current;

 if (p->migration_disabled) {
  p->migration_disabled++;
  return;
 }

 preempt_disable();
 this_rq()->nr_pinned++;
 p->migration_disabled = 1;
 preempt_enable();
}

It does nothing to prevent you from calling schedule() or even waiting for an
event (mutex slow-path behaviour), where the process is removed from the
run-queue. I mean after migrate_disable() has been invoked. Or am I missing
something?
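In other words, nothing stops a caller from doing something like this inside
the region (a contrived illustration; "some_mutex" is made up for the example):

	migrate_disable();

	/* Perfectly legal: may sleep, taking the task off the run-queue. */
	mutex_lock(&some_mutex);
	/* ... */
	mutex_unlock(&some_mutex);

	migrate_enable();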

>
> How about the following changes:
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index e7a226abff0d..2aa19537ac7c 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3488,12 +3488,10 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> (*krcp)->bkvhead[idx]->nr_records == 
> KVFREE_BULK_MAX_ENTR) {
> bnode = get_cached_bnode(*krcp);
> if (!bnode && can_alloc) {
> -   migrate_disable();
> krc_this_cpu_unlock(*krcp, *flags);
> bnode = (struct kvfree_rcu_bulk_data *)
> __get_free_page(GFP_KERNEL | 
> __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> -   *krcp = krc_this_cpu_lock(flags);
> -   migrate_enable();
> +   raw_spin_lock_irqsave(&(*krcp)->lock, *flags);
>
Hm... Taking the former lock can lead to a pointer leaking; I mean the CPU
associated with "krcp" might go offline during the page-request process, so
queuing occurs on an off-lined CPU. Apart from that, acquiring the former lock
still does not solve:

- CPU1 in process of page allocation;
- CPU1 gets migrated to CPU2;
- another task running on CPU1 also allocate a page;
- both bnodes are added to krcp associated with CPU1.

I agree that such a scenario will probably never happen, or I would say it can
be considered a corner case. We can drop the:

[PATCH 3/3] kvfree_rcu: use migrate_disable/enable()

and live with the fact that an allocated bnode can be queued to another CPU, so
its previous "bnode" can be underutilized, which can also be considered a corner
case. According to my tests, it is hard to achieve:

Running kvfree_rcu() simultaneously in a tight loop, 1 000 000 
allocations/freeing:

- 64 CPUs and 64 threads showed 1 migration;
- 64 CPUs and 128 threads showed 0 migrations;
- 64 CPUs and 32 threads showed 0 migration. 

Thoughts?

Thank you for your comments!

--
Vlad Rezki

