Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-02-03 Thread Paul E. McKenney
On Mon, Feb 01, 2021 at 12:47:55PM +0100, Michal Hocko wrote:
> On Fri 29-01-21 17:35:31, Uladzislau Rezki wrote:
> > On Fri, Jan 29, 2021 at 09:56:29AM +0100, Michal Hocko wrote:
> > > On Thu 28-01-21 19:02:37, Uladzislau Rezki wrote:
> > > [...]
> > > > >From 0bdb8ca1ae62088790e0a452c4acec3821e06989 Mon Sep 17 00:00:00 2001
> > > > From: "Uladzislau Rezki (Sony)" 
> > > > Date: Wed, 20 Jan 2021 17:21:46 +0100
> > > > Subject: [PATCH v2 1/1] kvfree_rcu: Directly allocate page for single-argument case
> > > > 
> > > > Single-argument kvfree_rcu() must be invoked from sleepable contexts,
> > > > so we can directly allocate pages.  Furthermore, the fallback in case
> > > > of page-allocation failure is the high-latency synchronize_rcu(), so it
> > > > makes sense to do these page allocations from the fastpath, and even to
> > > > permit limited sleeping within the allocator.
> > > > 
> > > > This commit therefore allocates if needed on the fastpath using
> > > > GFP_KERNEL|__GFP_NORETRY.
> > > 
> > > Yes, __GFP_NORETRY as a lightweight allocation mode should be fine. It
> > > is more robust than __GFP_NOWAIT on memory usage spikes.  The caller is
> > > prepared to handle the failure, which is likely much less disruptive than
> > > the OOM killer or the potentially heavy reclaim of __GFP_RETRY_MAYFAIL.
> > > 
> > > I cannot give you ack as I am not familiar with the code but this makes
> > > sense to me.
> > > 
> > No problem, I can separate it. We can have a patch on top of what we have so
> > far. The patch only modifies the gfp_mask passed to __get_free_pages():
> > 
> > >From ec2feaa9b7f55f73b3b17e9ac372151c1aab5ae0 Mon Sep 17 00:00:00 2001
> > From: "Uladzislau Rezki (Sony)" 
> > Date: Fri, 29 Jan 2021 17:16:03 +0100
> > Subject: [PATCH 1/1] kvfree_rcu: replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY
> > 
> > __GFP_RETRY_MAYFAIL is rather heavy from the reclaim point of
> > view, and therefore time consuming. That is not optimal, and there
> > is no need to try so hard, because we have a fallback path.
> > 
> > __GFP_NORETRY, in its turn, can perform some light-weight reclaim,
> > and it fails early under high memory pressure or low-memory
> > conditions.
> > 
> > In general there are four simple criteria we would like to meet:
> > a) minimize fallback hitting;
> > b) avoid invoking the OOM killer;
> > c) keep the page request light-weight;
> > d) avoid dipping into the emergency reserves.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) 
> 
> Looks good to me. Feel free to add
> Acked-by: Michal Hocko 

Queued, thank you both!

Thanx, Paul

> > ---
> >  kernel/rcu/tree.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 70ddc339e0b7..1e862120db9e 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3489,8 +3489,20 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> > bnode = get_cached_bnode(*krcp);
> > if (!bnode && can_alloc) {
> > krc_this_cpu_unlock(*krcp, *flags);
> > +
> > +		// __GFP_NORETRY - allows a light-weight direct reclaim
> > +		// what is OK from minimizing of fallback hitting point of
> > +		// view. Apart of that it forbids any OOM invoking what is
> > +		// also beneficial since we are about to release memory soon.
> > +		//
> > +		// __GFP_NOMEMALLOC - prevents from consuming of all the
> > +		// memory reserves. Please note we have a fallback path.
> > +		//
> > +		// __GFP_NOWARN - it is supposed that an allocation can
> > +		// be failed under low memory or high memory pressure
> > +		// scenarios.
> >  		bnode = (struct kvfree_rcu_bulk_data *)
> > -			__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > +			__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > *krcp = krc_this_cpu_lock(flags);
> > }
> >  
> > -- 
> > 2.20.1
> > 
> > --
> > Vlad Rezki
> 
> -- 
> Michal Hocko
> SUSE Labs
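
For reference, the allocation policy this thread converges on can be
condensed as the following sketch. The helper name is invented for
illustration; the real code is add_ptr_to_bulk_krc_lock() in
kernel/rcu/tree.c, shown in the diff above.

#include <linux/gfp.h>

/* Sketch only: returns 0 on failure, in which case the caller falls
 * back to the slow path (ultimately synchronize_rcu()). */
static unsigned long alloc_bulk_page_sketch(void)
{
	/*
	 * GFP_KERNEL       - sleeping is allowed; single-argument
	 *                    kvfree_rcu() guarantees a sleepable context.
	 * __GFP_NORETRY    - at most one light-weight direct-reclaim pass,
	 *                    and never the OOM killer.
	 * __GFP_NOMEMALLOC - never dip into the emergency reserves.
	 * __GFP_NOWARN     - failure is expected and handled, so do not
	 *                    warn about it.
	 */
	return __get_free_page(GFP_KERNEL | __GFP_NORETRY |
			       __GFP_NOMEMALLOC | __GFP_NOWARN);
}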


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-02-01 Thread Uladzislau Rezki
On Mon, Feb 01, 2021 at 12:47:55PM +0100, Michal Hocko wrote:
> On Fri 29-01-21 17:35:31, Uladzislau Rezki wrote:
> > On Fri, Jan 29, 2021 at 09:56:29AM +0100, Michal Hocko wrote:
> > > On Thu 28-01-21 19:02:37, Uladzislau Rezki wrote:
> > > [...]
> > > > >From 0bdb8ca1ae62088790e0a452c4acec3821e06989 Mon Sep 17 00:00:00 2001
> > > > From: "Uladzislau Rezki (Sony)" 
> > > > Date: Wed, 20 Jan 2021 17:21:46 +0100
> > > > Subject: [PATCH v2 1/1] kvfree_rcu: Directly allocate page for single-argument case
> > > > 
> > > > Single-argument kvfree_rcu() must be invoked from sleepable contexts,
> > > > so we can directly allocate pages.  Furthermore, the fallback in case
> > > > of page-allocation failure is the high-latency synchronize_rcu(), so it
> > > > makes sense to do these page allocations from the fastpath, and even to
> > > > permit limited sleeping within the allocator.
> > > > 
> > > > This commit therefore allocates if needed on the fastpath using
> > > > GFP_KERNEL|__GFP_NORETRY.
> > > 
> > > Yes, __GFP_NORETRY as a lightweight allocation mode should be fine. It
> > > is more robust than __GFP_NOWAIT on memory usage spikes.  The caller is
> > > prepared to handle the failure, which is likely much less disruptive than
> > > the OOM killer or the potentially heavy reclaim of __GFP_RETRY_MAYFAIL.
> > > 
> > > I cannot give you ack as I am not familiar with the code but this makes
> > > sense to me.
> > > 
> > No problem, I can separate it. We can have a patch on top of what we have so
> > far. The patch only modifies the gfp_mask passed to __get_free_pages():
> > 
> > >From ec2feaa9b7f55f73b3b17e9ac372151c1aab5ae0 Mon Sep 17 00:00:00 2001
> > From: "Uladzislau Rezki (Sony)" 
> > Date: Fri, 29 Jan 2021 17:16:03 +0100
> > Subject: [PATCH 1/1] kvfree_rcu: replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY
> > 
> > __GFP_RETRY_MAYFAIL is rather heavy from the reclaim point of
> > view, and therefore time consuming. That is not optimal, and there
> > is no need to try so hard, because we have a fallback path.
> > 
> > __GFP_NORETRY, in its turn, can perform some light-weight reclaim,
> > and it fails early under high memory pressure or low-memory
> > conditions.
> > 
> > In general there are four simple criteria we would like to meet:
> > a) minimize fallback hitting;
> > b) avoid invoking the OOM killer;
> > c) keep the page request light-weight;
> > d) avoid dipping into the emergency reserves.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) 
> 
> Looks good to me. Feel free to add
> Acked-by: Michal Hocko 
> 
Appreciate it!

--
Vlad Rezki


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-02-01 Thread Michal Hocko
On Fri 29-01-21 17:35:31, Uladzislau Rezki wrote:
> On Fri, Jan 29, 2021 at 09:56:29AM +0100, Michal Hocko wrote:
> > On Thu 28-01-21 19:02:37, Uladzislau Rezki wrote:
> > [...]
> > > >From 0bdb8ca1ae62088790e0a452c4acec3821e06989 Mon Sep 17 00:00:00 2001
> > > From: "Uladzislau Rezki (Sony)" 
> > > Date: Wed, 20 Jan 2021 17:21:46 +0100
> > > Subject: [PATCH v2 1/1] kvfree_rcu: Directly allocate page for single-argument case
> > > 
> > > Single-argument kvfree_rcu() must be invoked from sleepable contexts,
> > > so we can directly allocate pages.  Furthermore, the fallback in case
> > > of page-allocation failure is the high-latency synchronize_rcu(), so it
> > > makes sense to do these page allocations from the fastpath, and even to
> > > permit limited sleeping within the allocator.
> > > 
> > > This commit therefore allocates if needed on the fastpath using
> > > GFP_KERNEL|__GFP_NORETRY.
> > 
> > Yes, __GFP_NORETRY as a lightweight allocation mode should be fine. It
> > is more robust than __GFP_NOWAIT on memory usage spikes.  The caller is
> > prepared to handle the failure, which is likely much less disruptive than
> > the OOM killer or the potentially heavy reclaim of __GFP_RETRY_MAYFAIL.
> > 
> > I cannot give you ack as I am not familiar with the code but this makes
> > sense to me.
> > 
> No problem, I can separate it. We can have a patch on top of what we have so
> far. The patch only modifies the gfp_mask passed to __get_free_pages():
> 
> >From ec2feaa9b7f55f73b3b17e9ac372151c1aab5ae0 Mon Sep 17 00:00:00 2001
> From: "Uladzislau Rezki (Sony)" 
> Date: Fri, 29 Jan 2021 17:16:03 +0100
> Subject: [PATCH 1/1] kvfree_rcu: replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY
> 
> __GFP_RETRY_MAYFAIL is rather heavy from the reclaim point of
> view, and therefore time consuming. That is not optimal, and there
> is no need to try so hard, because we have a fallback path.
> 
> __GFP_NORETRY, in its turn, can perform some light-weight reclaim,
> and it fails early under high memory pressure or low-memory
> conditions.
> 
> In general there are four simple criteria we would like to meet:
> a) minimize fallback hitting;
> b) avoid invoking the OOM killer;
> c) keep the page request light-weight;
> d) avoid dipping into the emergency reserves.
> 
> Signed-off-by: Uladzislau Rezki (Sony) 

Looks good to me. Feel free to add
Acked-by: Michal Hocko 

> ---
>  kernel/rcu/tree.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 70ddc339e0b7..1e862120db9e 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3489,8 +3489,20 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
>   bnode = get_cached_bnode(*krcp);
>   if (!bnode && can_alloc) {
>   krc_this_cpu_unlock(*krcp, *flags);
> +
> +		// __GFP_NORETRY - allows a light-weight direct reclaim
> +		// what is OK from minimizing of fallback hitting point of
> +		// view. Apart of that it forbids any OOM invoking what is
> +		// also beneficial since we are about to release memory soon.
> +		//
> +		// __GFP_NOMEMALLOC - prevents from consuming of all the
> +		// memory reserves. Please note we have a fallback path.
> +		//
> +		// __GFP_NOWARN - it is supposed that an allocation can
> +		// be failed under low memory or high memory pressure
> +		// scenarios.
>  		bnode = (struct kvfree_rcu_bulk_data *)
> -			__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +			__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
>   *krcp = krc_this_cpu_lock(flags);
>   }
>  
> -- 
> 2.20.1
> 
> --
> Vlad Rezki

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-29 Thread Uladzislau Rezki
On Fri, Jan 29, 2021 at 09:56:29AM +0100, Michal Hocko wrote:
> On Thu 28-01-21 19:02:37, Uladzislau Rezki wrote:
> [...]
> > >From 0bdb8ca1ae62088790e0a452c4acec3821e06989 Mon Sep 17 00:00:00 2001
> > From: "Uladzislau Rezki (Sony)" 
> > Date: Wed, 20 Jan 2021 17:21:46 +0100
> > Subject: [PATCH v2 1/1] kvfree_rcu: Directly allocate page for single-argument case
> > 
> > Single-argument kvfree_rcu() must be invoked from sleepable contexts,
> > so we can directly allocate pages.  Furthermore, the fallback in case
> > of page-allocation failure is the high-latency synchronize_rcu(), so it
> > makes sense to do these page allocations from the fastpath, and even to
> > permit limited sleeping within the allocator.
> > 
> > This commit therefore allocates if needed on the fastpath using
> > GFP_KERNEL|__GFP_NORETRY.
> 
> Yes, __GFP_NORETRY as a lightweight allocation mode should be fine. It
> is more robust than __GFP_NOWAIT on memory usage spikes.  The caller is
> prepared to handle the failure, which is likely much less disruptive than
> the OOM killer or the potentially heavy reclaim of __GFP_RETRY_MAYFAIL.
> 
> I cannot give you ack as I am not familiar with the code but this makes
> sense to me.
> 
No problem, I can separate it. We can have a patch on top of what we have so
far. The patch only modifies the gfp_mask passed to __get_free_pages():

>From ec2feaa9b7f55f73b3b17e9ac372151c1aab5ae0 Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)" 
Date: Fri, 29 Jan 2021 17:16:03 +0100
Subject: [PATCH 1/1] kvfree_rcu: replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY

__GFP_RETRY_MAYFAIL is rather heavy from the reclaim point of
view, and therefore time consuming. That is not optimal, and there
is no need to try so hard, because we have a fallback path.

__GFP_NORETRY, in its turn, can perform some light-weight reclaim,
and it fails early under high memory pressure or low-memory
conditions.

In general there are four simple criteria we would like to meet:
a) minimize fallback hitting;
b) avoid invoking the OOM killer;
c) keep the page request light-weight;
d) avoid dipping into the emergency reserves.

Signed-off-by: Uladzislau Rezki (Sony) 
---
 kernel/rcu/tree.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 70ddc339e0b7..1e862120db9e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3489,8 +3489,20 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
bnode = get_cached_bnode(*krcp);
if (!bnode && can_alloc) {
krc_this_cpu_unlock(*krcp, *flags);
+
+		// __GFP_NORETRY - allows a light-weight direct reclaim
+		// what is OK from minimizing of fallback hitting point of
+		// view. Apart of that it forbids any OOM invoking what is
+		// also beneficial since we are about to release memory soon.
+		//
+		// __GFP_NOMEMALLOC - prevents from consuming of all the
+		// memory reserves. Please note we have a fallback path.
+		//
+		// __GFP_NOWARN - it is supposed that an allocation can
+		// be failed under low memory or high memory pressure
+		// scenarios.
 		bnode = (struct kvfree_rcu_bulk_data *)
-			__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
+			__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
*krcp = krc_this_cpu_lock(flags);
}
 
-- 
2.20.1

--
Vlad Rezki


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-28 Thread Uladzislau Rezki
On Thu, Jan 28, 2021 at 04:30:17PM +0100, Uladzislau Rezki wrote:
> On Thu, Jan 28, 2021 at 04:17:01PM +0100, Michal Hocko wrote:
> > On Thu 28-01-21 16:11:52, Uladzislau Rezki wrote:
> > > On Mon, Jan 25, 2021 at 05:25:59PM +0100, Uladzislau Rezki wrote:
> > > > On Mon, Jan 25, 2021 at 04:39:43PM +0100, Michal Hocko wrote:
> > > > > On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > > > > > > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > > > > > > For a single argument we can directly request a page from a caller
> > > > > > > > context when a "carry page block" runs out of free spots. Instead
> > > > > > > > of hitting a slow path we can request an extra page on demand and
> > > > > > > > proceed with a fast path.
> > > > > > > > 
> > > > > > > > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > > > > > > > and its fallback is the relatively high-latency synchronize_rcu().
> > > > > > > > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > > > > > > to allow limited sleeping within the memory allocator.
> > > > > > > 
> > > > > > > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the heaviest
> > > > > > > way to allocate without triggering the OOM killer. Is this really what
> > > > > > > you need/want? Is __GFP_NORETRY too weak?
> > > > > > > 
> > > > > > Hm... We agreed to proceed with limited light-weight memory direct
> > > > > > reclaim.
> > > > > > Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> > > > > > point: https://www.spinics.net/lists/rcu/msg02856.html
> > > > > > 
> > > > > > 
> > > > > > So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> > > > > > further decisions based on instrumentation of the success rates of
> > > > > > these opportunistic allocations.
> > > > > > 
> > > > > 
> > > > > I completely agree with Johannes here.
> > > > > 
> > > > > > but for some reason, I can't find head or tail of it, we introduced
> > > > > > __GFP_RETRY_MAYFAIL, which is a heavy one from a time-consuming point
> > > > > > of view. That is what we would like to avoid.
> > > > > 
> > > > > Not that I object to this use, but I think it would be much better to
> > > > > use it based on actual data. Going along with it right away might become
> > > > > a future burden to making any changes in this aspect later on, due to a
> > > > > lack of exact reasoning. The general rule of thumb for __GFP_RETRY_MAYFAIL
> > > > > is really to try as hard as it can get without being really disruptive
> > > > > (like OOM killing something). And your wording didn't really give me that
> > > > > impression.
> > > > > 
> > > > Initially I proposed just going with the GFP_NOWAIT flag. But later on
> > > > there was a discussion that the fallback path, which uses
> > > > synchronize_rcu(), can be slow, thus minimizing how often it is hit
> > > > would be great. So here we go with a trade-off.
> > > > 
> > > > Trying as hard as __GFP_RETRY_MAYFAIL does is not worth it (IMHO), but
> > > > having some light-weight requests would be acceptable. That is why
> > > > __GFP_NORETRY was proposed.
> > > > 
> > > > There were simple criteria we discussed which we would like to meet:
> > > > 
> > > > a) minimize fallback hitting;
> > > > b) avoid invoking the OOM killer;
> > > > c) avoid dipping into the emergency reserves. See "kvfree_rcu: Use
> > > > __GFP_NOMEMALLOC for single-argument kvfree_rcu()".
> > > > 
> > > One question here. The code that triggers a page request can be invoked
> > > directly from reclaim context as well as outside of it. We had a concern
> > > about whether any recursion is possible, but from what I see it is safe.
> > > The context that does it cannot enter it twice:
> > > 
> > > 
> > > /* Avoid recursion of direct reclaim */
> > > if (current->flags & PF_MEMALLOC)
> > > 	goto nopage;
> > > 
> > 
> > Yes this is a recursion protection.
> > 
> > > What about any deadlocking in regard to the following flags?
> > > 
> > > GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN
> > 
> > and __GFP_NOMEMALLOC will make sure that the allocation will not consume
> > all the memory reserves. The latter should be clarified in one of your
> > patches I have acked, IIRC.
> >
> Yep, it is clarified and reflected in another patch you ACKed.
> 
> Thanks!
> 

Please find below the V2.

Changelog V1 -> V2:
- replace the __GFP_RETRY_MAYFAIL by __GFP_NORETRY;
- add more comments explaining why and which flags are used.


>From 0bdb8ca1ae62088790e0a452c4acec3821e06989 Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)" 
Date: Wed, 20 Jan 2021 17:21:46 +0100
Subject: [PATCH v2 1/1] kvfree_rcu: Directly allocate page for single-argument case

Single-argument 

Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-28 Thread Uladzislau Rezki
On Thu, Jan 28, 2021 at 04:17:01PM +0100, Michal Hocko wrote:
> On Thu 28-01-21 16:11:52, Uladzislau Rezki wrote:
> > On Mon, Jan 25, 2021 at 05:25:59PM +0100, Uladzislau Rezki wrote:
> > > On Mon, Jan 25, 2021 at 04:39:43PM +0100, Michal Hocko wrote:
> > > > On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > > > > > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > > > > > For a single argument we can directly request a page from a caller
> > > > > > > context when a "carry page block" runs out of free spots. Instead
> > > > > > > of hitting a slow path we can request an extra page on demand and
> > > > > > > proceed with a fast path.
> > > > > > > 
> > > > > > > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > > > > > > and its fallback is the relatively high-latency synchronize_rcu().
> > > > > > > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > > > > > to allow limited sleeping within the memory allocator.
> > > > > > 
> > > > > > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the heaviest
> > > > > > way to allocate without triggering the OOM killer. Is this really what
> > > > > > you need/want? Is __GFP_NORETRY too weak?
> > > > > > 
> > > > > Hm... We agreed to proceed with limited light-weight memory direct
> > > > > reclaim.
> > > > > Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> > > > > point: https://www.spinics.net/lists/rcu/msg02856.html
> > > > > 
> > > > > 
> > > > > So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> > > > > further decisions based on instrumentation of the success rates of
> > > > > these opportunistic allocations.
> > > > > 
> > > > 
> > > > I completely agree with Johannes here.
> > > > 
> > > > > but for some reason, I can't find head or tail of it, we introduced
> > > > > __GFP_RETRY_MAYFAIL, which is a heavy one from a time-consuming point
> > > > > of view. That is what we would like to avoid.
> > > > 
> > > > Not that I object to this use, but I think it would be much better to use
> > > > it based on actual data. Going along with it right away might become a
> > > > future burden to making any changes in this aspect later on, due to a lack
> > > > of exact reasoning. The general rule of thumb for __GFP_RETRY_MAYFAIL is
> > > > really to try as hard as it can get without being really disruptive (like
> > > > OOM killing something). And your wording didn't really give me that
> > > > impression.
> > > > 
> > > Initially I proposed just going with the GFP_NOWAIT flag. But later on
> > > there was a discussion that the fallback path, which uses synchronize_rcu(),
> > > can be slow, thus minimizing how often it is hit would be great. So here we
> > > go with a trade-off.
> > > 
> > > Trying as hard as __GFP_RETRY_MAYFAIL does is not worth it (IMHO), but
> > > having some light-weight requests would be acceptable. That is why
> > > __GFP_NORETRY was proposed.
> > > 
> > > There were simple criteria we discussed which we would like to meet:
> > > 
> > > a) minimize fallback hitting;
> > > b) avoid invoking the OOM killer;
> > > c) avoid dipping into the emergency reserves. See "kvfree_rcu: Use
> > > __GFP_NOMEMALLOC for single-argument kvfree_rcu()".
> > > 
> > One question here. The code that triggers a page request can be invoked
> > directly from reclaim context as well as outside of it. We had a concern
> > about whether any recursion is possible, but from what I see it is safe.
> > The context that does it cannot enter it twice:
> > 
> > 
> > /* Avoid recursion of direct reclaim */
> > if (current->flags & PF_MEMALLOC)
> > 	goto nopage;
> > 
> 
> Yes this is a recursion protection.
> 
> > What about any deadlocking in regard to the following flags?
> > 
> > GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN
> 
> and __GFP_NOMEMALLOC will make sure that the allocation will not consume
> all the memory reserves. The latter should be clarified in one of your
> patches I have acked, IIRC.
>
Yep, it is clarified and reflected in another patch you ACKed.

Thanks!

--
Vlad Rezki


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-28 Thread Michal Hocko
On Thu 28-01-21 16:11:52, Uladzislau Rezki wrote:
> On Mon, Jan 25, 2021 at 05:25:59PM +0100, Uladzislau Rezki wrote:
> > On Mon, Jan 25, 2021 at 04:39:43PM +0100, Michal Hocko wrote:
> > > On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > > > > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > > > > For a single argument we can directly request a page from a caller
> > > > > > context when a "carry page block" runs out of free spots. Instead
> > > > > > of hitting a slow path we can request an extra page on demand and
> > > > > > proceed with a fast path.
> > > > > > 
> > > > > > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > > > > > and its fallback is the relatively high-latency synchronize_rcu().
> > > > > > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > > > > to allow limited sleeping within the memory allocator.
> > > > > 
> > > > > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the heaviest
> > > > > way to allocate without triggering the OOM killer. Is this really what
> > > > > you need/want? Is __GFP_NORETRY too weak?
> > > > > 
> > > > Hm... We agreed to proceed with limited light-weight memory direct reclaim.
> > > > Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> > > > point: https://www.spinics.net/lists/rcu/msg02856.html
> > > > 
> > > > 
> > > > So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> > > > further decisions based on instrumentation of the success rates of
> > > > these opportunistic allocations.
> > > > 
> > > 
> > > I completely agree with Johannes here.
> > > 
> > > > but for some reason, I can't find head or tail of it, we introduced
> > > > __GFP_RETRY_MAYFAIL, which is a heavy one from a time-consuming point
> > > > of view. That is what we would like to avoid.
> > > 
> > > Not that I object to this use, but I think it would be much better to use
> > > it based on actual data. Going along with it right away might become a
> > > future burden to making any changes in this aspect later on, due to a lack
> > > of exact reasoning. The general rule of thumb for __GFP_RETRY_MAYFAIL is
> > > really to try as hard as it can get without being really disruptive (like
> > > OOM killing something). And your wording didn't really give me that
> > > impression.
> > > 
> > Initially I proposed just going with the GFP_NOWAIT flag. But later on
> > there was a discussion that the fallback path, which uses synchronize_rcu(),
> > can be slow, thus minimizing how often it is hit would be great. So here we
> > go with a trade-off.
> > 
> > Trying as hard as __GFP_RETRY_MAYFAIL does is not worth it (IMHO), but
> > having some light-weight requests would be acceptable. That is why
> > __GFP_NORETRY was proposed.
> > 
> > There were simple criteria we discussed which we would like to meet:
> > 
> > a) minimize fallback hitting;
> > b) avoid invoking the OOM killer;
> > c) avoid dipping into the emergency reserves. See "kvfree_rcu: Use
> > __GFP_NOMEMALLOC for single-argument kvfree_rcu()".
> > 
> One question here. The code that triggers a page request can be invoked
> directly from reclaim context as well as outside of it. We had a concern
> about whether any recursion is possible, but from what I see it is safe.
> The context that does it cannot enter it twice:
> 
> 
> /* Avoid recursion of direct reclaim */
> if (current->flags & PF_MEMALLOC)
> 	goto nopage;
> 

Yes this is a recursion protection.

> What about any deadlocking in regard to the following flags?
> 
> GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN

and __GFP_NOMEMALLOC will make sure that the allocation will not consume
all the memory reserves. The latter should be clarified in one of your
patches I have acked, IIRC.
-- 
Michal Hocko
SUSE Labs
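
The recursion guard quoted above lives in the page allocator's slowpath.
A condensed sketch of the two protections discussed in this exchange
(not the real mm/page_alloc.c code; the function name is invented):

#include <linux/gfp.h>
#include <linux/sched.h>

static bool may_enter_direct_reclaim_sketch(gfp_t gfp_mask)
{
	/*
	 * A task already performing direct reclaim runs with PF_MEMALLOC
	 * set, so an allocation issued from reclaim context cannot
	 * recurse into reclaim again ("goto nopage" in the real code).
	 */
	if (current->flags & PF_MEMALLOC)
		return false;

	/*
	 * Independently, __GFP_NOMEMALLOC in gfp_mask forbids dipping
	 * into the emergency reserves, even for tasks that would
	 * otherwise be allowed to.
	 */
	return true;
}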


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-28 Thread Uladzislau Rezki
On Mon, Jan 25, 2021 at 05:25:59PM +0100, Uladzislau Rezki wrote:
> On Mon, Jan 25, 2021 at 04:39:43PM +0100, Michal Hocko wrote:
> > On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > > > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > > > For a single argument we can directly request a page from a caller
> > > > > context when a "carry page block" runs out of free spots. Instead
> > > > > of hitting a slow path we can request an extra page on demand and
> > > > > proceed with a fast path.
> > > > > 
> > > > > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > > > > and its fallback is the relatively high-latency synchronize_rcu().
> > > > > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > > > to allow limited sleeping within the memory allocator.
> > > > 
> > > > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the heaviest
> > > > way to allocate without triggering the OOM killer. Is this really what
> > > > you need/want? Is __GFP_NORETRY too weak?
> > > > 
> > > Hm... We agreed to proceed with limited light-weight memory direct reclaim.
> > > Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> > > point: https://www.spinics.net/lists/rcu/msg02856.html
> > > 
> > > 
> > > So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> > > further decisions based on instrumentation of the success rates of
> > > these opportunistic allocations.
> > > 
> > 
> > I completely agree with Johannes here.
> > 
> > > but for some reason, I can't find head or tail of it, we introduced
> > > __GFP_RETRY_MAYFAIL, which is a heavy one from a time-consuming point
> > > of view. That is what we would like to avoid.
> > 
> > Not that I object to this use, but I think it would be much better to use
> > it based on actual data. Going along with it right away might become a
> > future burden to making any changes in this aspect later on, due to a lack
> > of exact reasoning. The general rule of thumb for __GFP_RETRY_MAYFAIL is
> > really to try as hard as it can get without being really disruptive (like
> > OOM killing something). And your wording didn't really give me that
> > impression.
> > 
> Initially I proposed just going with the GFP_NOWAIT flag. But later on
> there was a discussion that the fallback path, which uses synchronize_rcu(),
> can be slow, thus minimizing how often it is hit would be great. So here we
> go with a trade-off.
> 
> Trying as hard as __GFP_RETRY_MAYFAIL does is not worth it (IMHO), but
> having some light-weight requests would be acceptable. That is why
> __GFP_NORETRY was proposed.
> 
> There were simple criteria we discussed which we would like to meet:
> 
> a) minimize fallback hitting;
> b) avoid invoking the OOM killer;
> c) avoid dipping into the emergency reserves. See "kvfree_rcu: Use
> __GFP_NOMEMALLOC for single-argument kvfree_rcu()".
> 
One question here. The code that triggers a page request can be invoked
directly from reclaim context as well as outside of it. We had a concern
about whether any recursion is possible, but from what I see it is safe.
The context that does it cannot enter it twice:


/* Avoid recursion of direct reclaim */
if (current->flags & PF_MEMALLOC)
	goto nopage;


What about any deadlocking in regard to the following flags?

GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN

Thanks!

--
Vlad Rezki


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-25 Thread Uladzislau Rezki
On Mon, Jan 25, 2021 at 04:39:43PM +0100, Michal Hocko wrote:
> On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > > For a single argument we can directly request a page from a caller
> > > > context when a "carry page block" runs out of free spots. Instead
> > > > of hitting a slow path we can request an extra page on demand and
> > > > proceed with a fast path.
> > > > 
> > > > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > > > and its fallback is the relatively high-latency synchronize_rcu().
> > > > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > > to allow limited sleeping within the memory allocator.
> > > 
> > > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the heaviest
> > > way to allocate without triggering the OOM killer. Is this really what
> > > you need/want? Is __GFP_NORETRY too weak?
> > > 
> > Hm... We agreed to proceed with limited light-weight memory direct reclaim.
> > Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> > point: https://www.spinics.net/lists/rcu/msg02856.html
> > 
> > 
> > So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> > further decisions based on instrumentation of the success rates of
> > these opportunistic allocations.
> > 
> 
> I completely agree with Johannes here.
> 
> > but for some reason, I can't find head or tail of it, we introduced
> > __GFP_RETRY_MAYFAIL, which is a heavy one from a time-consuming point of
> > view. That is what we would like to avoid.
> 
> Not that I object to this use, but I think it would be much better to use
> it based on actual data. Going along with it right away might become a
> future burden to making any changes in this aspect later on, due to a lack
> of exact reasoning. The general rule of thumb for __GFP_RETRY_MAYFAIL is
> really to try as hard as it can get without being really disruptive (like
> OOM killing something). And your wording didn't really give me that
> impression.
> 
Initially I proposed just going with the GFP_NOWAIT flag. But later on
there was a discussion that the fallback path, which uses synchronize_rcu(),
can be slow, thus minimizing how often it is hit would be great. So here we
go with a trade-off.

Trying as hard as __GFP_RETRY_MAYFAIL does is not worth it (IMHO), but
having some light-weight requests would be acceptable. That is why
__GFP_NORETRY was proposed.

There were simple criteria we discussed which we would like to meet:

a) minimize fallback hitting;
b) avoid invoking the OOM killer;
c) avoid dipping into the emergency reserves. See "kvfree_rcu: Use
__GFP_NOMEMALLOC for single-argument kvfree_rcu()".

--
Vlad Rezki


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-25 Thread Michal Hocko
On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > For a single argument we can directly request a page from a caller
> > > context when a "carry page block" runs out of free spots. Instead
> > > of hitting a slow path we can request an extra page on demand and
> > > proceed with a fast path.
> > > 
> > > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > > and its fallback is the relatively high-latency synchronize_rcu().
> > > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > to allow limited sleeping within the memory allocator.
> > 
> > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the heaviest
> > way to allocate without triggering the OOM killer. Is this really what
> > you need/want? Is __GFP_NORETRY too weak?
> > 
> Hm... We agreed to proceed with limited light-weight memory direct reclaim.
> Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> point: https://www.spinics.net/lists/rcu/msg02856.html
> 
> 
> So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> further decisions based on instrumentation of the success rates of
> these opportunistic allocations.
> 

I completely agree with Johannes here.

> but for some reason, I can't find head or tail of it, we introduced
> __GFP_RETRY_MAYFAIL, which is a heavy one from a time-consuming point of view.
> That is what we would like to avoid.

Not that I object to this use, but I think it would be much better to use
it based on actual data. Going along with it right away might become a
future burden to making any changes in this aspect later on, due to a lack
of exact reasoning. The general rule of thumb for __GFP_RETRY_MAYFAIL is
really to try as hard as it can get without being really disruptive (like
OOM killing something). And your wording didn't really give me that
impression.

-- 
Michal Hocko
SUSE Labs
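
To make the strength ordering in this exchange concrete, the three
candidate policies can be written down as gfp masks. The macro names
below are invented for illustration; the flags themselves are the real
ones from include/linux/gfp.h:

#include <linux/gfp.h>

/* No reclaim at all: fails as soon as the free lists are short. */
#define KVRCU_GFP_NOWAIT	(GFP_NOWAIT | __GFP_NOWARN)

/* One light direct-reclaim pass, no retries, no OOM killer. */
#define KVRCU_GFP_LIGHT		(GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN)

/* Retry hard, potentially heavy reclaim, but still no OOM killer. */
#define KVRCU_GFP_HEAVY		(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN)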


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-25 Thread Uladzislau Rezki
> On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > For a single argument we can directly request a page from a caller
> > context when a "carry page block" runs out of free spots. Instead
> > of hitting a slow path we can request an extra page on demand and
> > proceed with a fast path.
> > 
> > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > and its fallback is the relatively high-latency synchronize_rcu().
> > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > to allow limited sleeping within the memory allocator.
> 
> __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the heaviest
> way to allocate without triggering the OOM killer. Is this really what
> you need/want? Is __GFP_NORETRY too weak?
> 
Hm... We agreed to proceed with limited light-weight memory direct reclaim.
Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
point: https://www.spinics.net/lists/rcu/msg02856.html


So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
further decisions based on instrumentation of the success rates of
these opportunistic allocations.


but for some reason, I can't find head or tail of it, we introduced
__GFP_RETRY_MAYFAIL, which is a heavy one from a time-consuming point of view.
That is what we would like to avoid.

I tend to say that it was a typo.

Thank you for pointing to it!

--
Vlad Rezki


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-25 Thread Michal Hocko
On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> For a single argument we can directly request a page from a caller
> context when a "carry page block" runs out of free spots. Instead
> of hitting a slow path we can request an extra page on demand and
> proceed with a fast path.
> 
> A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> and its fallback is the relatively high-latency synchronize_rcu().
> Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> to allow limited sleeping within the memory allocator.

__GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the heaviest
way to allocate without triggering the OOM killer. Is this really what
you need/want? Is __GFP_NORETRY too weak?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-22 Thread Uladzislau Rezki
> On 2021-01-21 13:38:34 [+0100], Uladzislau Rezki wrote:
> > __get_free_page() returns "unsigned long" whereas a bnode is a pointer
> > to a kvfree_rcu_bulk_data struct; without a cast the compiler will
> > emit a warning.
> 
> Yes, learned about it, sorry.
> 
> > >> You think that a CPU migration is a bad thing. But why?
> > >>
> > It is not a bad thing. But if it happens we might queue a new bnode
> > to the drain list of another CPU, where the previous bnode may be left
> > underutilized. That is why I use migrate_disable()/enable() to
> > prevent it.
> 
> If you allocate a node for queueing and then you realize that you
> already have one then you either free it or queue it for later.
> Given that all this is a slower path and that this is used on every CPU,
> sooner or later that page will be used, avoiding a later allocation, right?
> 
Freeing it right away is a bit problematic, because we would need to drop
the lock at least one more time, which would introduce more complexity.
Adding it to the cache for later reuse is possible, but would require extra
cache-decay logic.

I think, since it is a corner case and I do not consider it a big issue,
we can just queue it to the drain list, so the previous node can end up
half filled due to migration.

> > If there are some hidden issues with migrate_disable()/enable() or you
> > think it is a bad idea to use it, it would be appreciated if you could
> > describe your view in more detail.
> 
> Just what I mentioned in my previous email:
> - the whole operation isn't cheap but it is more efficient in tree
>   compared to what we used to have in RT.
> - the task can not be freely placed while it could run. So if the task
>   gets preempted, it will have to wait until it can run on the CPU
>   again.
> 
Yep, that is obvious. From a scheduling point of view the task can wait
longer because migration is prohibited. On the other hand, the page is
obtained in a path that is not considered a hot one. One page can serve
~500 pointers.

I do not say that we should keep [PATCH 3/3] kvfree_rcu: use
migrate_disable/enable().

Without it, a migration can occur, which is really rare according to my
tests, therefore I do not have a strong opinion about keeping it. If you
or someone else has concerns, we can drop it.

--
Vlad Rezki
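
For context, the shape of the migrate_disable()-based variant debated
here (patch 3/3) is roughly the following sketch. The function name is
invented; the types and the krc_this_cpu_lock()/unlock() helpers are
the ones from kernel/rcu/tree.c in this series:

static struct kvfree_rcu_bulk_data *
alloc_bnode_pinned_sketch(struct kfree_rcu_cpu **krcp, unsigned long *flags)
{
	struct kvfree_rcu_bulk_data *bnode;

	migrate_disable();			/* stay on this CPU... */
	krc_this_cpu_unlock(*krcp, *flags);	/* ...while the lock is dropped */

	bnode = (struct kvfree_rcu_bulk_data *)
		__get_free_page(GFP_KERNEL | __GFP_NORETRY |
				__GFP_NOMEMALLOC | __GFP_NOWARN);

	*krcp = krc_this_cpu_lock(flags);	/* same CPU, thanks to the pin */
	migrate_enable();
	return bnode;				/* may be NULL */
}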


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-22 Thread Paul E. McKenney
On Fri, Jan 22, 2021 at 12:17:33PM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-01-20 13:54:03 [-0800], Paul E. McKenney wrote:
> > > > +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> > > > +// state specified by flags.  If can_alloc is true, the caller must
> > > > +// be schedulable and not be holding any locks or mutexes that might be
> > > > +// acquired by the memory allocator or anything that it might invoke.
> > > > +// Returns true if ptr was successfully recorded, else the caller must
> > > > +// use a fallback.
> > > 
> > > The whole RCU department is getting swamped by the // comments. Can't we
> > > have proper kernel doc and /* */ style comments like the remaining part
> > > of the kernel?
> > 
> > Because // comments are easier to type and take up less horizontal space.
> 
> As for the typing I could try to sell you 
>   ab // /*
> 
> for your .vimrc, and then // would become /* ;) As for the
> horizontal space, I currently don't have anything in my shop. I'm sorry.

;-)

> > Also, this kvfree_call_rcu_add_ptr_to_bulk() function is local to
> > kvfree_rcu(), and we don't normally docbook-ify such functions.
> 
> I didn't mean to promote using docbook everywhere. For instance, if you
> look at kernel/trace/trace.c, there are no // comments around, just /*
> style, even for things like tracing_selftest_running.
> 
> Basically I was curious if I could learn where this // is coming and if
> I could stop it.

Because they are now allowed and because they make my life easier as
noted above.  Also in-function comment blocks are either one line or two
lines shorter.  Yeah, they look strange at first, but it is not that hard
to get used to them.  After all, I did manage to get used to the /* */
comment style shortly after first encountering it.  ;-)

> > > >  static inline bool
> > > > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > > > +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> > > > +   unsigned long *flags, void *ptr, bool can_alloc)
> > > >  {
> > > > struct kvfree_rcu_bulk_data *bnode;
> > > > int idx;
> > > >  
> > > > -   if (unlikely(!krcp->initialized))
> > > > +   *krcp = krc_this_cpu_lock(flags);
> > > > +   if (unlikely(!(*krcp)->initialized))
> > > > return false;
> > > >  
> > > > -   lockdep_assert_held(&krcp->lock);
> > > > idx = !!is_vmalloc_addr(ptr);
> > > >  
> > > > /* Check if a new block is required. */
> > > > -   if (!krcp->bkvhead[idx] ||
> > > > -   krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > > > -   bnode = get_cached_bnode(krcp);
> > > > -   /* Switch to emergency path. */
> > > > +   if (!(*krcp)->bkvhead[idx] ||
> > > > +   (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > > > +   bnode = get_cached_bnode(*krcp);
> > > > +   if (!bnode && can_alloc) {
> > > > +   krc_this_cpu_unlock(*krcp, *flags);
> > > > +   bnode = (struct kvfree_rcu_bulk_data *)
> > > 
> > > There is no need for this cast.
> > 
> > Without it, gcc version 7.5.0 says:
> > 
> > warning: assignment makes pointer from integer without a cast
> > 
> 
> I'm sorry. I forgot the part where __get_free_page() does not return
> (void *).
> But maybe it should given that free_pages() casts that long back to
> (void *) and __get_free_pages() -> page_address() returns (void *),
> which is then cast to long.

No argument here.  Then again, I am not the one in need of convincing.

There are use cases like this from pte_alloc_one_kernel():

unsigned long page = __get_free_page(GFP_DMA);

But a quick look indicates that they are in the minority.
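
In other words, the return type forces a choice between keeping an
unsigned long local and casting once at the call site; both of the
following are valid (the gfp mask here is arbitrary):

	/* Either keep the unsigned long around... */
	unsigned long page = __get_free_page(GFP_KERNEL);

	/* ...or cast it once at the call site, as the patch does. */
	struct kvfree_rcu_bulk_data *bnode = (struct kvfree_rcu_bulk_data *)
		__get_free_page(GFP_KERNEL);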

> > > > +   __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > > > +   *krcp = krc_this_cpu_lock(flags);
> > > 
> > > so if bnode is NULL you could retry get_cached_bnode() since it might
> > > have been filled (given preemption or CPU migration changed something).
> > > Judging from patch #3 you think that a CPU migration is a bad thing. But
> > > why?
> > 
> > So that the later "(*krcp)->bkvhead[idx] = bnode" assignment associates
> > it with the correct CPU.
> > 
> > Though now that you mention it, couldn't the following happen?
> > 
> > o   Task A on CPU 0 notices that allocation is needed, so it
> > drops the lock, disables migration, and sleeps while
> > allocating.
> > 
> > o   Task B on CPU 0 does the same.
> > 
> > o   The two tasks wake up in some order, and the second one
> > causes trouble at the "(*krcp)->bkvhead[idx] = bnode"
> > assignment.
> 
> Yes it could, good point.
> I would really recommend using migrate_disable() at a minimum and only
> if it is really needed. It is more expensive than preempt_disable() and
> it isn't exactly 

Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-22 Thread Sebastian Andrzej Siewior
On 2021-01-20 13:54:03 [-0800], Paul E. McKenney wrote:
> > > +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> > > +// state specified by flags.  If can_alloc is true, the caller must
> > > +// be schedulable and not be holding any locks or mutexes that might be
> > > +// acquired by the memory allocator or anything that it might invoke.
> > > +// Returns true if ptr was successfully recorded, else the caller must
> > > +// use a fallback.
> > 
> > The whole RCU department is getting swamped by the // comments. Can't we
> > have proper kernel doc and /* */ style comments like the remaining part
> > of the kernel?
> 
> Because // comments are easier to type and take up less horizontal space.

As for the typing I could try to sell you 
  ab // /*

for your .vimrc, and then // would become /* ;) As for the
horizontal space, I currently don't have anything in my shop. I'm sorry.

> Also, this kvfree_call_rcu_add_ptr_to_bulk() function is local to
> kvfree_rcu(), and we don't normally docbook-ify such functions.

I didn't mean to promote using docbook everywhere. For instance, if you
look at kernel/trace/trace.c, there are no // comments around, just /*
style, even for things like tracing_selftest_running.

Basically I was curious if I could learn where this // is coming and if
I could stop it.

> > >  static inline bool
> > > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > > +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> > > + unsigned long *flags, void *ptr, bool can_alloc)
> > >  {
> > >   struct kvfree_rcu_bulk_data *bnode;
> > >   int idx;
> > >  
> > > - if (unlikely(!krcp->initialized))
> > > + *krcp = krc_this_cpu_lock(flags);
> > > + if (unlikely(!(*krcp)->initialized))
> > >   return false;
> > >  
> > > - lockdep_assert_held(&krcp->lock);
> > >   idx = !!is_vmalloc_addr(ptr);
> > >  
> > >   /* Check if a new block is required. */
> > > - if (!krcp->bkvhead[idx] ||
> > > - krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > > - bnode = get_cached_bnode(krcp);
> > > - /* Switch to emergency path. */
> > > + if (!(*krcp)->bkvhead[idx] ||
> > > + (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > > + bnode = get_cached_bnode(*krcp);
> > > + if (!bnode && can_alloc) {
> > > + krc_this_cpu_unlock(*krcp, *flags);
> > > + bnode = (struct kvfree_rcu_bulk_data *)
> > 
> > There is no need for this cast.
> 
> Without it, gcc version 7.5.0 says:
> 
>   warning: assignment makes pointer from integer without a cast
> 

I'm sorry. I forgot the part where __get_free_page() does not return
(void *).
But maybe it should given that free_pages() casts that long back to
(void *) and __get_free_pages() -> page_address() returns (void *),
which is then cast to long.

> > > + __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > > + *krcp = krc_this_cpu_lock(flags);
> > 
> > so if bnode is NULL you could retry get_cached_bnode() since it might
> > have been filled (given preemption or CPU migration changed something).
> > Judging from patch #3 you think that a CPU migration is a bad thing. But
> > why?
> 
> So that the later "(*krcp)->bkvhead[idx] = bnode" assignment associates
> it with the correct CPU.
> 
> Though now that you mention it, couldn't the following happen?
> 
> o Task A on CPU 0 notices that allocation is needed, so it
> drops the lock, disables migration, and sleeps while
>   allocating.
> 
> o Task B on CPU 0 does the same.
> 
> o The two tasks wake up in some order, and the second one
>   causes trouble at the "(*krcp)->bkvhead[idx] = bnode"
>   assignment.

Yes it could, good point.
I would really recommend using migrate_disable() at a minimum and only
if it is really needed. It is more expensive than preempt_disable() and
it isn't exactly good in terms of scheduling since the task is runnable
but restricted to a specific CPU.
If it is unavoidable it is unavoidable but in this case I wouldn't use
migrate_disable() but re-evaluate the situation after the allocation.

> Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
> 
>   Thanx, Paul

Sebastian
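
The "re-evaluate after the allocation" alternative sketched above would
look roughly like this. The function name is invented; the types,
KVFREE_BULK_MAX_ENTR, and the lock helpers are the ones from the series:

static struct kvfree_rcu_bulk_data *
alloc_bnode_reevaluate_sketch(struct kfree_rcu_cpu **krcp,
			      unsigned long *flags, int idx)
{
	struct kvfree_rcu_bulk_data *bnode;

	krc_this_cpu_unlock(*krcp, *flags);

	bnode = (struct kvfree_rcu_bulk_data *)
		__get_free_page(GFP_KERNEL | __GFP_NORETRY |
				__GFP_NOMEMALLOC | __GFP_NOWARN);

	*krcp = krc_this_cpu_lock(flags);	/* possibly a different CPU now */

	/* Re-evaluate: another task may have installed a block meanwhile. */
	if (bnode && (*krcp)->bkvhead[idx] &&
	    (*krcp)->bkvhead[idx]->nr_records < KVFREE_BULK_MAX_ENTR) {
		free_page((unsigned long)bnode);
		bnode = NULL;
	}
	return bnode;
}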


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-22 Thread Sebastian Andrzej Siewior
On 2021-01-21 13:38:34 [+0100], Uladzislau Rezki wrote:
> __get_free_page() returns "unsigned long" whereas a bnode is a pointer
> to a kvfree_rcu_bulk_data struct; without a cast the compiler will
> emit a warning.

Yes, learned about it, sorry.

> >> You think that a CPU migration is a bad thing. But why?
> >>
> It is not a bad thing. But if it happens we might queue a new bnode
> to the drain list of another CPU, where the previous bnode may be left
> underutilized. That is why I use migrate_disable()/enable() to
> prevent it.

If you allocate a node for queueing and then you realize that you
already have one then you either free it or queue it for later.
Given that all this is a slower path and that this is used on every CPU,
sooner or later that page will be used, avoiding a later allocation, right?

> If there are some hidden issues with migrate_disable()/enable() or you
> think it is a bad idea to use it, it would be appreciated if you could
> describe your view in more detail.

Just what I mentioned in my previous email:
- the whole operation isn't cheap but it is more efficient in tree
  compared to what we used to have in RT.
- the task can not be freely placed while it could run. So if the task
  gets preempted, it will have to wait until it can run on the CPU
  again.

Sebastian


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-21 Thread Uladzislau Rezki
On Thu, Jan 21, 2021 at 07:07:40AM -0800, Paul E. McKenney wrote:
> On Thu, Jan 21, 2021 at 02:35:10PM +0100, Uladzislau Rezki wrote:
> > On Wed, Jan 20, 2021 at 01:54:03PM -0800, Paul E. McKenney wrote:
> > > On Wed, Jan 20, 2021 at 08:57:57PM +0100, Sebastian Andrzej Siewior wrote:
> 
> [ . . . ]
> 
> > > > so if bnode is NULL you could retry get_cached_bnode() since it might
> > > > have been filled (given preemption or CPU migration changed something).
> > > > Judging from patch #3 you think that a CPU migration is a bad thing. But
> > > > why?
> > > 
> > > So that the later "(*krcp)->bkvhead[idx] = bnode" assignment associates
> > > it with the correct CPU.
> > > 
> > > Though now that you mention it, couldn't the following happen?
> > > 
> > > o Task A on CPU 0 notices that allocation is needed, so it
> > >   drops the lock, disables migration, and sleeps while
> > >   allocating.
> > > 
> > > o Task B on CPU 0 does the same.
> > > 
> > > o The two tasks wake up in some order, and the second one
> > >   causes trouble at the "(*krcp)->bkvhead[idx] = bnode"
> > >   assignment.
> > > 
> > > Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> > > the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
> > > 
> > Probably I should have mentioned the sequence you described, that two tasks
> > can get a page on the same CPU; I was thinking about it :) Yep, it can happen
> > since we drop the lock and the context is fully preemptible, so another one
> > can trigger kvfree_rcu() ending up at the same place - entering the page
> > allocator.
> > 
> > I spent some time simulating it, but without any luck, therefore I did not
> > reflect this case in the commit message, and thus did not pay much attention
> > to such a scenario.
> > 
> > >
> > > Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> > > the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
> > >
> > Two woken tasks will be serialized, i.e. the assignment is protected by
> > our local lock. We do krc_this_cpu_lock(flags); as a first step, and
> > right after that we restore migration. A migration in that case can
> > occur only when krc_this_cpu_unlock(*krcp, *flags); is invoked.
> > 
> > The scenario you described can happen; in that case a previous bnode
> > in the drain list can be either empty or partly utilized. But, again,
> > I was not able to trigger such a scenario.
> 
> Ah, we did discuss this previously, and yes, the result for a very
> rare race is just underutilization of a page.  With the change below,
> the result of this race is instead needless use of the slowpath.
> 
> > If we should fix it, I think we can go with the "alloc_in_progress"
> > protection below:
> > 
> > 
> > urezki@pc638:~/data/raid0/coding/linux-rcu.git$ git diff
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index cad36074366d..95485ec7267e 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3488,12 +3488,19 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> >  	if (!(*krcp)->bkvhead[idx] ||
> >  		(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> >  		bnode = get_cached_bnode(*krcp);
> > -		if (!bnode && can_alloc) {
> > +		if (!bnode && can_alloc && !(*krcp)->alloc_in_progress) {
> >  			migrate_disable();
> > +
> > +			/* Set it before dropping the lock. */
> > +			(*krcp)->alloc_in_progress = true;
> >  			krc_this_cpu_unlock(*krcp, *flags);
> > +
> >  			bnode = (struct kvfree_rcu_bulk_data *)
> >  				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> >  			*krcp = krc_this_cpu_lock(flags);
> > +
> > +			/* Clear it, the lock was taken back. */
> > +			(*krcp)->alloc_in_progress = false;
> >  			migrate_enable();
> >  		}
> >  
> > urezki@pc638:~/data/raid0/coding/linux-rcu.git$
> > 
> > 
> > in that case a second task will follow the fallback path, bypassing a page
> > request. I can send it as a separate patch if there are no objections.
> 
> I was thinking in terms of something like the following.  Thoughts?
> 
>   Thanx, Paul
> 
> 
> 
> static bool add_ptr_to_bulk_krc_no_space(struct kfree_rcu_cpu *krcp)
> {
>   return !(krcp)->bkvhead[idx] ||
>  (krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR;
> }
>
Agreed, we should have such a wrapper. That way the code becomes more
readable and simpler.

> 
> static inline bool
> add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
>   unsigned long *flags, void *ptr, bool can_alloc)
> {
>   struct kvfree_rcu_bulk_data *bnode;
>   int idx;
> 

Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-21 Thread Paul E. McKenney
On Thu, Jan 21, 2021 at 02:35:10PM +0100, Uladzislau Rezki wrote:
> On Wed, Jan 20, 2021 at 01:54:03PM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 20, 2021 at 08:57:57PM +0100, Sebastian Andrzej Siewior wrote:

[ . . . ]

> > > so if bnode is NULL you could retry get_cached_bnode() since it might
> > > have been filled (given preemption or CPU migration changed something).
> > > Judging from patch #3 you think that a CPU migration is a bad thing. But
> > > why?
> > 
> > So that the later "(*krcp)->bkvhead[idx] = bnode" assignment associates
> > it with the correct CPU.
> > 
> > Though now that you mention it, couldn't the following happen?
> > 
> > o   Task A on CPU 0 notices that allocation is needed, so it
> > drops the lock, disables migration, and sleeps while
> > allocating.
> > 
> > o   Task B on CPU 0 does the same.
> > 
> > o   The two tasks wake up in some order, and the second one
> > causes trouble at the "(*krcp)->bkvhead[idx] = bnode"
> > assignment.
> > 
> > Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> > the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
> > 
> Probably I should have mentioned the sequence you described, that two tasks
> can get a page on the same CPU; I was thinking about it :) Yep, it can happen
> since we drop the lock and the context is fully preemptible, so another task
> can trigger kvfree_rcu() and end up at the same place, entering the page
> allocator.
> 
> I spent some time simulating it, but without any luck; therefore I did not
> reflect this case in the commit message, and thus did not pay much attention
> to such a scenario.
> 
> >
> > Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> > the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
> >
> Two woken tasks will be serialized, i.e. the assignment is protected by
> our local lock. We do krc_this_cpu_lock(flags); as a first step, and only
> after that do we re-enable migration. A migration in that case can occur
> only once krc_this_cpu_unlock(*krcp, *flags); is invoked.
> 
> The scenario you described can happen; in that case a previous bnode
> in the drain list can be either empty or partly utilized. But, again,
> I was not able to trigger such a scenario.

Ah, we did discuss this previously, and yes, the result for a very
rare race is just underutilization of a page.  With the change below,
the result of this race is instead needless use of the slowpath.
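For reference, for a single-argument caller that slowpath amounts to
roughly the following sketch, reusing the names from kvfree_call_rcu()
(the real code has a bit more bookkeeping):

        if (!success && !head) {
                /* Inline emergency path: wait out a grace period, then free. */
                synchronize_rcu();
                kvfree(ptr);
        }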

> If we should fix it, I think we can go with the "alloc_in_progress"
> protection below:
> 
> 
> urezki@pc638:~/data/raid0/coding/linux-rcu.git$ git diff
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index cad36074366d..95485ec7267e 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3488,12 +3488,19 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
>          if (!(*krcp)->bkvhead[idx] ||
>              (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
>                  bnode = get_cached_bnode(*krcp);
> -                if (!bnode && can_alloc) {
> +                if (!bnode && can_alloc && !(*krcp)->alloc_in_progress) {
>                          migrate_disable();
> +
> +                        /* Set it before dropping the lock. */
> +                        (*krcp)->alloc_in_progress = true;
>                          krc_this_cpu_unlock(*krcp, *flags);
> +
>                          bnode = (struct kvfree_rcu_bulk_data *)
>                                  __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
>                          *krcp = krc_this_cpu_lock(flags);
> +
> +                        /* Clear it, the lock was taken back. */
> +                        (*krcp)->alloc_in_progress = false;
>                          migrate_enable();
>                  }
>  
> urezki@pc638:~/data/raid0/coding/linux-rcu.git$
> 
> 
> In that case a second task will follow the fallback path, bypassing a page
> request. I can send it as a separate patch if there are no objections.

I was thinking in terms of something like the following.  Thoughts?

Thanx, Paul



static bool add_ptr_to_bulk_krc_no_space(struct kfree_rcu_cpu *krcp, int idx)
{
        return !krcp->bkvhead[idx] ||
               krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR;
}

static inline bool
add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
        unsigned long *flags, void *ptr, bool can_alloc)
{
        struct kvfree_rcu_bulk_data *bnode;
        int idx;

        *krcp = krc_this_cpu_lock(flags);
        if (unlikely(!(*krcp)->initialized))
                return false;

        idx = !!is_vmalloc_addr(ptr);

        /* Check if a new block is required. */
        if (add_ptr_to_bulk_krc_no_space(*krcp, idx)) {
                bnode = get_cached_bnode(*krcp);
                if (!bnode && can_alloc) {

Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-21 Thread Uladzislau Rezki
On Wed, Jan 20, 2021 at 01:54:03PM -0800, Paul E. McKenney wrote:
> On Wed, Jan 20, 2021 at 08:57:57PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2021-01-20 17:21:46 [+0100], Uladzislau Rezki (Sony) wrote:
> > > For a single argument we can directly request a page from the caller's
> > > context when a "carry page block" runs out of free spots. Instead of
> > > hitting the slow path, we can request an extra page on demand and
> > > proceed with the fast path.
> > > 
> > > A single-argument kvfree_rcu() must be invoked from sleepable contexts,
> > > and its fallback is the relatively high-latency synchronize_rcu().
> > > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > to allow limited sleeping within the memory allocator.
> > > 
> > > [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
> > > Signed-off-by: Uladzislau Rezki (Sony) 
> > > Signed-off-by: Paul E. McKenney 
> > > ---
> > >  kernel/rcu/tree.c | 42 ++++++++++++++++++++++++++----------------
> > >  1 file changed, 26 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index e04e336bee42..2014fb22644d 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3465,37 +3465,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
> > >   }
> > >  }
> > >  
> > > +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> > > +// state specified by flags.  If can_alloc is true, the caller must
> > > +// be schedulable and not be holding any locks or mutexes that might be
> > > +// acquired by the memory allocator or anything that it might invoke.
> > > +// Returns true if ptr was successfully recorded, else the caller must
> > > +// use a fallback.
> > 
> > The whole RCU department is getting swamped by the // comments. Can't we
> > have proper kernel doc and /* */ style comments like the remaining part
> > of the kernel?
> 
> Because // comments are easier to type and take up less horizontal space.
> Also, this kvfree_call_rcu_add_ptr_to_bulk() function is local to
> kvfree_rcu(), and we don't normally docbook-ify such functions.
> 
> > >  static inline bool
> > > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > > +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> > > + unsigned long *flags, void *ptr, bool can_alloc)
> > >  {
> > >   struct kvfree_rcu_bulk_data *bnode;
> > >   int idx;
> > >  
> > > - if (unlikely(!krcp->initialized))
> > > + *krcp = krc_this_cpu_lock(flags);
> > > + if (unlikely(!(*krcp)->initialized))
> > >   return false;
> > >  
> > > -        lockdep_assert_held(&krcp->lock);
> > >          idx = !!is_vmalloc_addr(ptr);
> > >  
> > >          /* Check if a new block is required. */
> > > -        if (!krcp->bkvhead[idx] ||
> > > -            krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > > -                bnode = get_cached_bnode(krcp);
> > > -                /* Switch to emergency path. */
> > > +        if (!(*krcp)->bkvhead[idx] ||
> > > +            (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > > +                bnode = get_cached_bnode(*krcp);
> > > +                if (!bnode && can_alloc) {
> > > +                        krc_this_cpu_unlock(*krcp, *flags);
> > > +                        bnode = (struct kvfree_rcu_bulk_data *)
> > 
> > There is no need for this cast.
> 
> Without it, gcc version 7.5.0 says:
> 
>   warning: assignment makes pointer from integer without a cast
> 
> > > +                                __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > > +                        *krcp = krc_this_cpu_lock(flags);
> > 
> > so if bnode is NULL you could retry get_cached_bnode() since it might
> > have been filled (given preemption or CPU migration changed something).
> > Judging from patch #3 you think that a CPU migration is a bad thing. But
> > why?
> 
> So that the later "(*krcp)->bkvhead[idx] = bnode" assignment associates
> it with the correct CPU.
> 
> Though now that you mention it, couldn't the following happen?
> 
> o Task A on CPU 0 notices that allocation is needed, so it
>   drops the lock, disables migration, and sleeps while
>   allocating.
> 
> o Task B on CPU 0 does the same.
> 
> o The two tasks wake up in some order, and the second one
>   causes trouble at the "(*krcp)->bkvhead[idx] = bnode"
>   assignment.
> 
> Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
> 
Probably I should have mentioned the sequence you described, that two tasks
can get a page on the same CPU; I was thinking about it :) Yep, it can happen
since we drop the lock and the context is fully preemptible, so another task
can trigger kvfree_rcu() and end up at the same place, entering the page
allocator.

I spent some time simulating it, but without any luck; therefore I did not
reflect this case in the commit message, and thus did not pay much attention
to such a scenario.

Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-21 Thread Uladzislau Rezki
On Wed, Jan 20, 2021 at 08:57:57PM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-01-20 17:21:46 [+0100], Uladzislau Rezki (Sony) wrote:
> > For a single argument we can directly request a page from the caller's
> > context when a "carry page block" runs out of free spots. Instead of
> > hitting the slow path, we can request an extra page on demand and
> > proceed with the fast path.
> > 
> > A single-argument kvfree_rcu() must be invoked from sleepable contexts,
> > and its fallback is the relatively high-latency synchronize_rcu().
> > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > to allow limited sleeping within the memory allocator.
> > 
> > [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
> > Signed-off-by: Uladzislau Rezki (Sony) 
> > Signed-off-by: Paul E. McKenney 
> > ---
> >  kernel/rcu/tree.c | 42 ++++++++++++++++++++++++++----------------
> >  1 file changed, 26 insertions(+), 16 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index e04e336bee42..2014fb22644d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3465,37 +3465,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
> > }
> >  }
> >  
> > +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> > +// state specified by flags.  If can_alloc is true, the caller must
> > +// be schedulable and not be holding any locks or mutexes that might be
> > +// acquired by the memory allocator or anything that it might invoke.
> > +// Returns true if ptr was successfully recorded, else the caller must
> > +// use a fallback.
> 
> The whole RCU department is getting swamped by the // comments. Can't we
> have proper kernel doc and /* */ style comments like the remaining part
> of the kernel?
> 
> >  static inline bool
> > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> > +   unsigned long *flags, void *ptr, bool can_alloc)
> >  {
> > struct kvfree_rcu_bulk_data *bnode;
> > int idx;
> >  
> > -   if (unlikely(!krcp->initialized))
> > +   *krcp = krc_this_cpu_lock(flags);
> > +   if (unlikely(!(*krcp)->initialized))
> > return false;
> >  
> > -        lockdep_assert_held(&krcp->lock);
> >          idx = !!is_vmalloc_addr(ptr);
> >  
> >          /* Check if a new block is required. */
> > -        if (!krcp->bkvhead[idx] ||
> > -            krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > -                bnode = get_cached_bnode(krcp);
> > -                /* Switch to emergency path. */
> > +        if (!(*krcp)->bkvhead[idx] ||
> > +            (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > +                bnode = get_cached_bnode(*krcp);
> > +                if (!bnode && can_alloc) {
> > +                        krc_this_cpu_unlock(*krcp, *flags);
> > +                        bnode = (struct kvfree_rcu_bulk_data *)
> 
> There is no need for this cast.
> 
__get_free_page() returns an "unsigned long", whereas bnode is a pointer
to struct kvfree_rcu_bulk_data; without the cast the compiler will emit
a warning.
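A minimal sketch of the mismatch, with a hypothetical local variable just
for illustration:

        unsigned long addr;
        struct kvfree_rcu_bulk_data *bnode;

        addr = __get_free_page(GFP_KERNEL | __GFP_NOWARN);

        /* unsigned long -> pointer needs an explicit cast, otherwise gcc
         * warns about making a pointer from an integer. */
        bnode = (struct kvfree_rcu_bulk_data *) addr;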

> > +                        __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > +                *krcp = krc_this_cpu_lock(flags);
> 
> so if bnode is NULL you could retry get_cached_bnode() since it might
> have been filled (given preemption or CPU migration changed something).
> Judging from patch #3 you think that a CPU migration is a bad thing. But
> why?
> 
I see your point. Indeed we can retry, but honestly I do not see that it
makes a lot of sense; I prefer to keep the logic as simple as possible.
If we run out of free pages (a low-memory condition), there is a fallback
mechanism for exactly that purpose, i.e. it implies that the slow path can
be hit.

>>
>> You think that a CPU migration is a bad thing. But why?
>>
It is not a bad thing. But if it happens, we might queue a new bnode
to the drain list of another CPU, whose previous bnode may be just
underutilized. That is why I use migrate_disable()/enable() to
prevent it.
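Roughly, the bracketing looks like this minimal sketch (error handling
omitted):

        migrate_disable();
        krc_this_cpu_unlock(*krcp, *flags);

        /* May sleep; migrate_disable() guarantees we wake up on the same
         * CPU, so the new page lands in this CPU's drain list. */
        bnode = (struct kvfree_rcu_bulk_data *)
                __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);

        *krcp = krc_this_cpu_lock(flags);
        migrate_enable();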

If there are some hidden issues with migrate_disable()/enable() or you
think it is a bad idea to use it, it would be appreciated if you could
describe your view in more detail.

Thanks for the comments!

--
Vlad Rezki


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-20 Thread Paul E. McKenney
On Wed, Jan 20, 2021 at 08:57:57PM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-01-20 17:21:46 [+0100], Uladzislau Rezki (Sony) wrote:
> > For a single argument we can directly request a page from the caller's
> > context when a "carry page block" runs out of free spots. Instead of
> > hitting the slow path, we can request an extra page on demand and
> > proceed with the fast path.
> > 
> > A single-argument kvfree_rcu() must be invoked from sleepable contexts,
> > and its fallback is the relatively high-latency synchronize_rcu().
> > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > to allow limited sleeping within the memory allocator.
> > 
> > [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
> > Signed-off-by: Uladzislau Rezki (Sony) 
> > Signed-off-by: Paul E. McKenney 
> > ---
> >  kernel/rcu/tree.c | 42 ++++++++++++++++++++++++++----------------
> >  1 file changed, 26 insertions(+), 16 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index e04e336bee42..2014fb22644d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3465,37 +3465,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
> > }
> >  }
> >  
> > +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> > +// state specified by flags.  If can_alloc is true, the caller must
> > +// be schedulable and not be holding any locks or mutexes that might be
> > +// acquired by the memory allocator or anything that it might invoke.
> > +// Returns true if ptr was successfully recorded, else the caller must
> > +// use a fallback.
> 
> The whole RCU department is getting swamped by the // comments. Can't we
> have proper kernel doc and /* */ style comments like the remaining part
> of the kernel?

Because // comments are easier to type and take up less horizontal space.
Also, this kvfree_call_rcu_add_ptr_to_bulk() function is local to
kvfree_rcu(), and we don't normally docbook-ify such functions.

> >  static inline bool
> > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> > +   unsigned long *flags, void *ptr, bool can_alloc)
> >  {
> > struct kvfree_rcu_bulk_data *bnode;
> > int idx;
> >  
> > -   if (unlikely(!krcp->initialized))
> > +   *krcp = krc_this_cpu_lock(flags);
> > +   if (unlikely(!(*krcp)->initialized))
> > return false;
> >  
> > -        lockdep_assert_held(&krcp->lock);
> >          idx = !!is_vmalloc_addr(ptr);
> >  
> >          /* Check if a new block is required. */
> > -        if (!krcp->bkvhead[idx] ||
> > -            krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > -                bnode = get_cached_bnode(krcp);
> > -                /* Switch to emergency path. */
> > +        if (!(*krcp)->bkvhead[idx] ||
> > +            (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > +                bnode = get_cached_bnode(*krcp);
> > +                if (!bnode && can_alloc) {
> > +                        krc_this_cpu_unlock(*krcp, *flags);
> > +                        bnode = (struct kvfree_rcu_bulk_data *)
> 
> There is no need for this cast.

Without it, gcc version 7.5.0 says:

warning: assignment makes pointer from integer without a cast

> > +                                __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > +                        *krcp = krc_this_cpu_lock(flags);
> 
> so if bnode is NULL you could retry get_cached_bnode() since it might
> have been filled (given preemption or CPU migration changed something).
> Judging from patch #3 you think that a CPU migration is a bad thing. But
> why?

So that the later "(*krcp)->bkvhead[idx] = bnode" assignment associates
it with the correct CPU.

Though now that you mention it, couldn't the following happen?

o   Task A on CPU 0 notices that allocation is needed, so it
    drops the lock, disables migration, and sleeps while
    allocating.

o   Task B on CPU 0 does the same.

o   The two tasks wake up in some order, and the second one
causes trouble at the "(*krcp)->bkvhead[idx] = bnode"
assignment.

Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
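For concreteness, such a recheck might look something like the following
sketch, assuming a then-surplus page would simply be freed or cached:

        *krcp = krc_this_cpu_lock(flags);
        migrate_enable();

        /* Another task may have attached a fresh block while we slept;
         * if so, the page we just allocated is surplus. */
        if (bnode && (*krcp)->bkvhead[idx] &&
            (*krcp)->bkvhead[idx]->nr_records < KVFREE_BULK_MAX_ENTR) {
                free_page((unsigned long) bnode);
                bnode = NULL;
        }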

Thanx, Paul


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-20 Thread Sebastian Andrzej Siewior
On 2021-01-20 17:21:46 [+0100], Uladzislau Rezki (Sony) wrote:
> For a single argument we can directly request a page from the caller's
> context when a "carry page block" runs out of free spots. Instead of
> hitting the slow path, we can request an extra page on demand and
> proceed with the fast path.
> 
> A single-argument kvfree_rcu() must be invoked from sleepable contexts,
> and its fallback is the relatively high-latency synchronize_rcu().
> Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> to allow limited sleeping within the memory allocator.
> 
> [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
> Signed-off-by: Uladzislau Rezki (Sony) 
> Signed-off-by: Paul E. McKenney 
> ---
>  kernel/rcu/tree.c | 42 ++++++++++++++++++++++++++----------------
>  1 file changed, 26 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index e04e336bee42..2014fb22644d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3465,37 +3465,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
>   }
>  }
>  
> +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> +// state specified by flags.  If can_alloc is true, the caller must
> +// be schedulable and not be holding any locks or mutexes that might be
> +// acquired by the memory allocator or anything that it might invoke.
> +// Returns true if ptr was successfully recorded, else the caller must
> +// use a fallback.

The whole RCU department is getting swamped by the // comments. Can't we
have proper kernel doc and /* */ style comments like the remaining part
of the kernel?

>  static inline bool
> -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> + unsigned long *flags, void *ptr, bool can_alloc)
>  {
>   struct kvfree_rcu_bulk_data *bnode;
>   int idx;
>  
> - if (unlikely(!krcp->initialized))
> + *krcp = krc_this_cpu_lock(flags);
> + if (unlikely(!(*krcp)->initialized))
>   return false;
>  
> -        lockdep_assert_held(&krcp->lock);
>          idx = !!is_vmalloc_addr(ptr);
>  
>          /* Check if a new block is required. */
> -        if (!krcp->bkvhead[idx] ||
> -            krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> -                bnode = get_cached_bnode(krcp);
> -                /* Switch to emergency path. */
> +        if (!(*krcp)->bkvhead[idx] ||
> +            (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> +                bnode = get_cached_bnode(*krcp);
> +                if (!bnode && can_alloc) {
> +                        krc_this_cpu_unlock(*krcp, *flags);
> +                        bnode = (struct kvfree_rcu_bulk_data *)

There is no need for this cast.

> +                                __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> +                        *krcp = krc_this_cpu_lock(flags);

so if bnode is NULL you could retry get_cached_bnode() since it might
have been filled (given preemption or CPU migration changed something).
Judging from patch #3 you think that a CPU migration is a bad thing. But
why?
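Something like this sketch, just to illustrate:

        bnode = (struct kvfree_rcu_bulk_data *)
                __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
        *krcp = krc_this_cpu_lock(flags);

        /* The per-CPU cache may have been refilled while the lock was
         * dropped, so try it once more before giving up. */
        if (!bnode)
                bnode = get_cached_bnode(*krcp);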

Sebastian


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-20 Thread Paul E. McKenney
On Wed, Jan 20, 2021 at 05:21:46PM +0100, Uladzislau Rezki (Sony) wrote:
> For a single argument we can directly request a page from the caller's
> context when a "carry page block" runs out of free spots. Instead of
> hitting the slow path, we can request an extra page on demand and
> proceed with the fast path.
> 
> A single-argument kvfree_rcu() must be invoked from sleepable contexts,
> and its fallback is the relatively high-latency synchronize_rcu().
> Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> to allow limited sleeping within the memory allocator.
> 
> [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
> Signed-off-by: Uladzislau Rezki (Sony) 
> Signed-off-by: Paul E. McKenney 

Queued all three for review and testing, thank you!

Thanx, Paul

> ---
>  kernel/rcu/tree.c | 42 ++++++++++++++++++++++++++----------------
>  1 file changed, 26 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index e04e336bee42..2014fb22644d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3465,37 +3465,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
>   }
>  }
>  
> +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> +// state specified by flags.  If can_alloc is true, the caller must
> +// be schedulable and not be holding any locks or mutexes that might be
> +// acquired by the memory allocator or anything that it might invoke.
> +// Returns true if ptr was successfully recorded, else the caller must
> +// use a fallback.
>  static inline bool
> -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> + unsigned long *flags, void *ptr, bool can_alloc)
>  {
>   struct kvfree_rcu_bulk_data *bnode;
>   int idx;
>  
> - if (unlikely(!krcp->initialized))
> + *krcp = krc_this_cpu_lock(flags);
> + if (unlikely(!(*krcp)->initialized))
>   return false;
>  
> -        lockdep_assert_held(&krcp->lock);
>          idx = !!is_vmalloc_addr(ptr);
>  
>          /* Check if a new block is required. */
> -        if (!krcp->bkvhead[idx] ||
> -            krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> -                bnode = get_cached_bnode(krcp);
> -                /* Switch to emergency path. */
> +        if (!(*krcp)->bkvhead[idx] ||
> +            (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> +                bnode = get_cached_bnode(*krcp);
> +                if (!bnode && can_alloc) {
> +                        krc_this_cpu_unlock(*krcp, *flags);
> +                        bnode = (struct kvfree_rcu_bulk_data *)
> +                                __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> +                        *krcp = krc_this_cpu_lock(flags);
> +                }
> +
>   if (!bnode)
>   return false;
>  
>   /* Initialize the new block. */
>   bnode->nr_records = 0;
> - bnode->next = krcp->bkvhead[idx];
> + bnode->next = (*krcp)->bkvhead[idx];
>  
>   /* Attach it to the head. */
> - krcp->bkvhead[idx] = bnode;
> + (*krcp)->bkvhead[idx] = bnode;
>   }
>  
>   /* Finally insert. */
> - krcp->bkvhead[idx]->records
> - [krcp->bkvhead[idx]->nr_records++] = ptr;
> + (*krcp)->bkvhead[idx]->records
> + [(*krcp)->bkvhead[idx]->nr_records++] = ptr;
>  
>   return true;
>  }
> @@ -3533,8 +3546,6 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
>   ptr = (unsigned long *) func;
>   }
>  
> -        krcp = krc_this_cpu_lock(&flags);
> -
>   // Queue the object but don't yet schedule the batch.
>   if (debug_rcu_head_queue(ptr)) {
>   // Probable double kfree_rcu(), just leak.
> @@ -3542,12 +3553,11 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
>                  WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
>                            __func__, head);
>  
>   // Mark as success and leave.
> - success = true;
> - goto unlock_return;
> + return;
>   }
>  
>   kasan_record_aux_stack(ptr);
> - success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
> +        success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head);
>   if (!success) {
>   run_page_cache_worker(krcp);
>  
> -- 
> 2.20.1
> 


[PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-20 Thread Uladzislau Rezki (Sony)
For a single argument we can directly request a page from the caller's
context when a "carry page block" runs out of free spots. Instead of
hitting the slow path, we can request an extra page on demand and
proceed with the fast path.

A single-argument kvfree_rcu() must be invoked from sleepable contexts,
and its fallback is the relatively high-latency synchronize_rcu().
Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
to allow limited sleeping within the memory allocator.

[ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
Signed-off-by: Uladzislau Rezki (Sony) 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e04e336bee42..2014fb22644d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3465,37 +3465,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
}
 }
 
+// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
+// state specified by flags.  If can_alloc is true, the caller must
+// be schedulable and not be holding any locks or mutexes that might be
+// acquired by the memory allocator or anything that it might invoke.
+// Returns true if ptr was successfully recorded, else the caller must
+// use a fallback.
 static inline bool
-kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
+add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
+   unsigned long *flags, void *ptr, bool can_alloc)
 {
struct kvfree_rcu_bulk_data *bnode;
int idx;
 
-   if (unlikely(!krcp->initialized))
+   *krcp = krc_this_cpu_lock(flags);
+   if (unlikely(!(*krcp)->initialized))
return false;
 
-        lockdep_assert_held(&krcp->lock);
         idx = !!is_vmalloc_addr(ptr);
 
         /* Check if a new block is required. */
-        if (!krcp->bkvhead[idx] ||
-            krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
-                bnode = get_cached_bnode(krcp);
-                /* Switch to emergency path. */
+        if (!(*krcp)->bkvhead[idx] ||
+            (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
+                bnode = get_cached_bnode(*krcp);
+                if (!bnode && can_alloc) {
+                        krc_this_cpu_unlock(*krcp, *flags);
+                        bnode = (struct kvfree_rcu_bulk_data *)
+                                __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
+                        *krcp = krc_this_cpu_lock(flags);
+                }
+
if (!bnode)
return false;
 
/* Initialize the new block. */
bnode->nr_records = 0;
-   bnode->next = krcp->bkvhead[idx];
+   bnode->next = (*krcp)->bkvhead[idx];
 
/* Attach it to the head. */
-   krcp->bkvhead[idx] = bnode;
+   (*krcp)->bkvhead[idx] = bnode;
}
 
/* Finally insert. */
-   krcp->bkvhead[idx]->records
-   [krcp->bkvhead[idx]->nr_records++] = ptr;
+   (*krcp)->bkvhead[idx]->records
+   [(*krcp)->bkvhead[idx]->nr_records++] = ptr;
 
return true;
 }
@@ -3533,8 +3546,6 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
ptr = (unsigned long *) func;
}
 
-        krcp = krc_this_cpu_lock(&flags);
-
// Queue the object but don't yet schedule the batch.
if (debug_rcu_head_queue(ptr)) {
// Probable double kfree_rcu(), just leak.
@@ -3542,12 +3553,11 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
                 WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
                           __func__, head);
 
// Mark as success and leave.
-   success = true;
-   goto unlock_return;
+   return;
}
 
kasan_record_aux_stack(ptr);
-   success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
+        success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head);
if (!success) {
run_page_cache_worker(krcp);
 
-- 
2.20.1
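
For context, a minimal usage sketch of the two kvfree_rcu() forms this
patch distinguishes; the structure is hypothetical, for illustration only:

        struct foo {
                int data;
                struct rcu_head rh;
        };

        static void free_foo(struct foo *p1, struct foo *p2)
        {
                /* Double-argument form: uses the embedded rcu_head and may
                 * be called from non-sleepable contexts. */
                kvfree_rcu(p1, rh);

                /* Single-argument form: no rcu_head needed, but the caller
                 * must be in a sleepable context, which is what permits the
                 * direct page allocation on the fast path. */
                kvfree_rcu(p2);
        }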