Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-09-06 Thread Tetsuo Handa
On 2018/08/27 16:41, Christian König wrote:
> Am 26.08.2018 um 10:40 schrieb Tetsuo Handa:
>> I'm not following. Why don't we need to do like below (given that
>> nobody except amdgpu_mn_read_lock() holds ->read_lock) because e.g.
>> drm_sched_fence_create() from drm_sched_job_init() from amdgpu_cs_submit()
>> is doing GFP_KERNEL memory allocation with ->lock held for write?
> 
> That's a bug which needs to be fixed separately.
> 
> Allocating memory with GFP_KERNEL while holding a lock which is also taken in
> the reclaim code path is illegal no matter what you do.
> 
> Patches to fix this are already on the appropriate mailing list and will be 
> pushed upstream today.
> 
> Regards,
> Christian.

Commit 4a2de54dc1d7668f ("drm/amdgpu: fix holding mn_lock while allocating
memory") seems to be calling amdgpu_mn_unlock() without amdgpu_mn_lock() when
drm_sched_job_init() failed...
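
To illustrate the ordering issue raised above, here is a minimal userspace
sketch, not the amdgpu code. It assumes the allocation can be moved in front
of the lock, so that a failed allocation returns before the lock is taken and
the error path never reaches an unlock. The names job_init(), submit(),
mn_lock(), mn_unlock() and the lock_depth counter are hypothetical stand-ins
chosen for this example.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a lock: counts how many times it is held. */
static int lock_depth;

static void mn_lock(void)   { lock_depth++; }
static void mn_unlock(void) { lock_depth--; }

/* Hypothetical allocation step that may fail (models a job init). */
static void *job_init(bool fail)
{
	return fail ? NULL : malloc(64);
}

/*
 * Allocate first, lock second. A failed allocation returns before the
 * lock is taken, so the error path never calls mn_unlock().
 */
static int submit(bool fail_alloc)
{
	void *job = job_init(fail_alloc);

	if (!job)
		return -ENOMEM;	/* lock not held: nothing to unlock */

	mn_lock();
	/* ... publish the job; no allocation happens in here ... */
	mn_unlock();

	free(job);
	return 0;
}
```

With this ordering lock_depth returns to zero on both the success and the
failure path; an error path that unlocked without locking would drive it
negative.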



Michal, you are asking me to fix all bugs (including out-of-tree code) and
prevent future bugs just because you want to avoid using a timeout in order to
avoid OOM lockup
( https://marc.info/?i=55a3fb37-3246-73d7-0f45-5835a3f48...@i-love.sakura.ne.jp ).
That is too much to ask, and it is impossible even for you. The more you count
on the OOM reaper, the more we exponentially complicate dependencies and the
more likely we are to stumble over unreviewed/untested code...
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-27 Thread Christian König

Am 26.08.2018 um 10:40 schrieb Tetsuo Handa:
> On 2018/08/24 22:52, Michal Hocko wrote:
>> @@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>>    */
>>   static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>>   {
>> -	if (blockable)
>> -		mutex_lock(&amn->read_lock);
>> -	else if (!mutex_trylock(&amn->read_lock))
>> -		return -EAGAIN;
>> -
>> +	/*
>> +	 * We can take sleepable lock even on !blockable mode because
>> +	 * read_lock is only ever take from this path and the notifier
>> +	 * lock never really sleeps. In fact the only reason why the
>> +	 * later is sleepable is because the notifier itself might sleep
>> +	 * in amdgpu_mn_invalidate_node but blockable mode is handled
>> +	 * before calling into that path.
>> +	 */
>> +	mutex_lock(&amn->read_lock);
>>  	if (atomic_inc_return(&amn->recursion) == 1)
>>  		down_read_non_owner(&amn->lock);
>>  	mutex_unlock(&amn->read_lock);
>
> I'm not following. Why don't we need to do like below (given that
> nobody except amdgpu_mn_read_lock() holds ->read_lock) because e.g.
> drm_sched_fence_create() from drm_sched_job_init() from amdgpu_cs_submit()
> is doing GFP_KERNEL memory allocation with ->lock held for write?


That's a bug which needs to be fixed separately.

Allocating memory with GFP_KERNEL while holding a lock which is also
taken in the reclaim code path is illegal no matter what you do.


Patches to fix this are already on the appropriate mailing list and will 
be pushed upstream today.


Regards,
Christian.



> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index e55508b..e1cb344 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -64,8 +64,6 @@
>    * @node: hash table node to find structure by adev and mn
>    * @lock: rw semaphore protecting the notifier nodes
>    * @objects: interval tree containing amdgpu_mn_nodes
> - * @read_lock: mutex for recursive locking of @lock
> - * @recursion: depth of recursion
>    *
>    * Data for each amdgpu device and process address space.
>    */
> @@ -85,8 +83,6 @@ struct amdgpu_mn {
>  	/* objects protected by lock */
>  	struct rw_semaphore	lock;
>  	struct rb_root_cached	objects;
> -	struct mutex		read_lock;
> -	atomic_t		recursion;
>   };
>
>   /**
> @@ -181,14 +177,9 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>   {
>  	if (blockable)
> -		mutex_lock(&amn->read_lock);
> -	else if (!mutex_trylock(&amn->read_lock))
> +		down_read(&amn->lock);
> +	else if (!down_read_trylock(&amn->lock))
>  		return -EAGAIN;
> -
> -	if (atomic_inc_return(&amn->recursion) == 1)
> -		down_read_non_owner(&amn->lock);
> -	mutex_unlock(&amn->read_lock);
> -
>  	return 0;
>   }
>
> @@ -199,8 +190,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>    */
>   static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
>   {
> -	if (atomic_dec_return(&amn->recursion) == 0)
> -		up_read_non_owner(&amn->lock);
> +	up_read(&amn->lock);
>   }
>
>   /**
> @@ -410,8 +400,6 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
>  	amn->type = type;
>  	amn->mn.ops = &amdgpu_mn_ops[type];
>  	amn->objects = RB_ROOT_CACHED;
> -	mutex_init(&amn->read_lock);
> -	atomic_set(&amn->recursion, 0);
>
>  	r = __mmu_notifier_register(&amn->mn, mm);
>  	if (r)



Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-26 Thread Tetsuo Handa
On 2018/08/24 22:52, Michal Hocko wrote:
> @@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   */
>  static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>  {
> -	if (blockable)
> -		mutex_lock(&amn->read_lock);
> -	else if (!mutex_trylock(&amn->read_lock))
> -		return -EAGAIN;
> -
> +	/*
> +	 * We can take sleepable lock even on !blockable mode because
> +	 * read_lock is only ever take from this path and the notifier
> +	 * lock never really sleeps. In fact the only reason why the
> +	 * later is sleepable is because the notifier itself might sleep
> +	 * in amdgpu_mn_invalidate_node but blockable mode is handled
> +	 * before calling into that path.
> +	 */
> +	mutex_lock(&amn->read_lock);
>  	if (atomic_inc_return(&amn->recursion) == 1)
>  		down_read_non_owner(&amn->lock);
>  	mutex_unlock(&amn->read_lock);
> 

I'm not following. Why don't we need to do like below (given that
nobody except amdgpu_mn_read_lock() holds ->read_lock) because e.g.
drm_sched_fence_create() from drm_sched_job_init() from amdgpu_cs_submit()
is doing GFP_KERNEL memory allocation with ->lock held for write?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b..e1cb344 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -64,8 +64,6 @@
  * @node: hash table node to find structure by adev and mn
  * @lock: rw semaphore protecting the notifier nodes
  * @objects: interval tree containing amdgpu_mn_nodes
- * @read_lock: mutex for recursive locking of @lock
- * @recursion: depth of recursion
  *
  * Data for each amdgpu device and process address space.
  */
@@ -85,8 +83,6 @@ struct amdgpu_mn {
 	/* objects protected by lock */
 	struct rw_semaphore	lock;
 	struct rb_root_cached	objects;
-	struct mutex		read_lock;
-	atomic_t		recursion;
 };
 
 /**
@@ -181,14 +177,9 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
 	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
+		down_read(&amn->lock);
+	else if (!down_read_trylock(&amn->lock))
 		return -EAGAIN;
-
-	if (atomic_inc_return(&amn->recursion) == 1)
-		down_read_non_owner(&amn->lock);
-	mutex_unlock(&amn->read_lock);
-
 	return 0;
 }
 
@@ -199,8 +190,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
  */
 static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
 {
-	if (atomic_dec_return(&amn->recursion) == 0)
-		up_read_non_owner(&amn->lock);
+	up_read(&amn->lock);
 }
 
 /**
@@ -410,8 +400,6 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
 	amn->type = type;
 	amn->mn.ops = &amdgpu_mn_ops[type];
 	amn->objects = RB_ROOT_CACHED;
-	mutex_init(&amn->read_lock);
-	atomic_set(&amn->recursion, 0);
 
 	r = __mmu_notifier_register(&amn->mn, mm);
 	if (r)


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Jerome Glisse
On Fri, Aug 24, 2018 at 06:40:03PM +0200, Michal Hocko wrote:
> On Fri 24-08-18 11:12:40, Jerome Glisse wrote:
> [...]
> > I am fine with Michal patch, i already said so couple month ago first time
> > this discussion did pop up, Michal you can add:
> > 
> > Reviewed-by: Jérôme Glisse 
> 
> So I guess the below is the patch you were talking about?
> 
> From f7ac75277d526dccd011f343818dc6af627af2af Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Fri, 24 Aug 2018 15:32:24 +0200
> Subject: [PATCH] mm, mmu_notifier: be explicit about range invalition
>  non-blocking mode
> 
> If invalidate_range_start is called for !blocking mode then all
> callbacks have to guarantee they will no block/sleep. The same obviously
> applies to invalidate_range_end because this operation pairs with the
> former and they are called from the same context. Make sure this is
> appropriately documented.

In my branch I already updated HMM to be like the other existing users,
i.e. all blocking operations happen in the start callback. But yes, it would
be wise to add such comments.


> 
> Signed-off-by: Michal Hocko 
> ---
>  include/linux/mmu_notifier.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 133ba78820ee..698e371aafe3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
>*
>* If blockable argument is set to false then the callback cannot
>* sleep and has to return with -EAGAIN. 0 should be returned
> -  * otherwise.
> +  * otherwise. Please note that if invalidate_range_start approves
> +  * a non-blocking behavior then the same applies to
> +  * invalidate_range_end.
>*
>*/
>   int (*invalidate_range_start)(struct mmu_notifier *mn,
> -- 
> 2.18.0
> 
> -- 
> Michal Hocko
> SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 11:12:40, Jerome Glisse wrote:
[...]
> I am fine with Michal patch, i already said so couple month ago first time
> this discussion did pop up, Michal you can add:
> 
> Reviewed-by: Jérôme Glisse 

So I guess the below is the patch you were talking about?

From f7ac75277d526dccd011f343818dc6af627af2af Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Fri, 24 Aug 2018 15:32:24 +0200
Subject: [PATCH] mm, mmu_notifier: be explicit about range invalition
 non-blocking mode

If invalidate_range_start is called for !blocking mode then all
callbacks have to guarantee they will no block/sleep. The same obviously
applies to invalidate_range_end because this operation pairs with the
former and they are called from the same context. Make sure this is
appropriately documented.

Signed-off-by: Michal Hocko 
---
 include/linux/mmu_notifier.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 133ba78820ee..698e371aafe3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -153,7 +153,9 @@ struct mmu_notifier_ops {
 	 *
 	 * If blockable argument is set to false then the callback cannot
 	 * sleep and has to return with -EAGAIN. 0 should be returned
-	 * otherwise.
+	 * otherwise. Please note that if invalidate_range_start approves
+	 * a non-blocking behavior then the same applies to
+	 * invalidate_range_end.
 	 *
 	 */
 	int (*invalidate_range_start)(struct mmu_notifier *mn,
-- 
2.18.0

-- 
Michal Hocko
SUSE Labs
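
The contract documented in the patch above can be modelled in a few lines of
plain C. This is a simplified sketch, not the kernel implementation: the
callback type range_start_fn, the two example callbacks and the caller loop
are assumptions made for illustration, and the caller is assumed to invoke
every callback and report the last error it saw.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type modelled on the documented contract. */
typedef int (*range_start_fn)(unsigned long start, unsigned long end,
			      bool blockable);

/* A callback that never sleeps: fine in either mode. */
static int cb_atomic(unsigned long start, unsigned long end, bool blockable)
{
	(void)start; (void)end; (void)blockable;
	return 0;
}

/* A callback that would need to sleep: must bail out when !blockable. */
static int cb_sleepy(unsigned long start, unsigned long end, bool blockable)
{
	(void)start; (void)end;
	if (!blockable)
		return -EAGAIN;
	/* ... sleeping work would go here ... */
	return 0;
}

/* Simplified caller: invoke every callback and report the last error. */
static int invalidate_range_start(const range_start_fn *cbs, size_t n,
				  unsigned long start, unsigned long end,
				  bool blockable)
{
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		int r = cbs[i](start, end, blockable);

		if (r)
			ret = r;
	}
	return ret;
}
```

In this model a single callback that cannot honour !blockable is enough to
make the whole invalidation report -EAGAIN to its caller.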


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 23:52:25, Tetsuo Handa wrote:
> On 2018/08/24 22:32, Michal Hocko wrote:
> > On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> >> I worry that (currently
> >> out-of-tree) users of this API are involving work / recursion.
> > 
> > I do not care in the slightest about out-of-tree modules. They will have to
> > accommodate to the new API. I have no problems to extend the
> > documentation and be explicit about this expectation.
> 
> You don't need to care about out-of-tree modules. But you need to hear from
> mm/hmm.c authors/maintainers when making changes for mmu-notifiers.
> 
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 133ba78820ee..698e371aafe3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
> >  *
> >  * If blockable argument is set to false then the callback cannot
> >  * sleep and has to return with -EAGAIN. 0 should be returned
> > -* otherwise.
> > +* otherwise. Please note that if invalidate_range_start approves
> > +* a non-blocking behavior then the same applies to
> > +* invalidate_range_end.
> 
> Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu
> notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to
> mmu-notifiers users.
> 
>   -* If both of these callbacks cannot block, and invalidate_range
>   -* cannot block, mmu_notifier_ops.flags should have
>   -* MMU_INVALIDATE_DOES_NOT_BLOCK set.
>   +* If blockable argument is set to false then the callback cannot
>   +* sleep and has to return with -EAGAIN. 0 should be returned
>   +* otherwise.
> 
> Even out-of-tree mmu-notifiers users had rights not to accommodate (i.e.
> make changes) immediately by not setting MMU_INVALIDATE_DOES_NOT_BLOCK.
> 
> Now we are in a merge window. And we noticed a possibility that out-of-tree
> mmu-notifiers users might have trouble with making changes immediately in order
> to follow 93065ac753e44438 if expectation for mm/hmm.c changes immediately.
> And you are trying to ignore such possibility by just updating expected behavior
> description instead of giving out-of-tree users a grace period to check and update
> their code.

This is just ridiculous. I have no idea what you are trying to achieve
here but please read through Documentation/process/stable-api-nonsense.rst
before you try to make strong statements again.

I have changed an in-kernel interface. I have gone through all users and
fixed them up. Double checking after me is really appreciated and I
am willing to fix up any fallout. But that is just about it. I do not
care a whit about _any_ out-of-tree drivers when changing the interface.
I am willing to answer any questions regarding this change so developers
of those drivers know how to make their changes properly, but doing so is
completely their business.
 
> >> and keeps "all operations protected by hmm->mirrors_sem held for write are
> >> atomic". This suggests that "some operations protected by hmm->mirrors_sem held
> >> for read will sleep (and in the worst case involves memory allocation
> >> dependency)".
> > 
> > Yes and so what? The clear expectation is that neither of the range
> > notifiers do not sleep in !blocking mode. I really fail to see what you
> > are trying to say.
> 
> I'm saying "Get ACK from Jérôme about mm/hmm.c changes".

HMM is a library layer for other drivers; until those get merged, the same
applies to them as well.
-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Jerome Glisse
On Fri, Aug 24, 2018 at 11:52:25PM +0900, Tetsuo Handa wrote:
> On 2018/08/24 22:32, Michal Hocko wrote:
> > On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> >> I worry that (currently
> >> out-of-tree) users of this API are involving work / recursion.
> > 
> > I do not care in the slightest about out-of-tree modules. They will have to
> > accommodate to the new API. I have no problems to extend the
> > documentation and be explicit about this expectation.
> 
> You don't need to care about out-of-tree modules. But you need to hear from
> mm/hmm.c authors/maintainers when making changes for mmu-notifiers.
> 
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 133ba78820ee..698e371aafe3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
> >  *
> >  * If blockable argument is set to false then the callback cannot
> >  * sleep and has to return with -EAGAIN. 0 should be returned
> > -* otherwise.
> > +* otherwise. Please note that if invalidate_range_start approves
> > +* a non-blocking behavior then the same applies to
> > +* invalidate_range_end.
> 
> Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu
> notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to
> mmu-notifiers users.
> 
>   -* If both of these callbacks cannot block, and invalidate_range
>   -* cannot block, mmu_notifier_ops.flags should have
>   -* MMU_INVALIDATE_DOES_NOT_BLOCK set.
>   +* If blockable argument is set to false then the callback cannot
>   +* sleep and has to return with -EAGAIN. 0 should be returned
>   +* otherwise.
> 
> Even out-of-tree mmu-notifiers users had rights not to accommodate (i.e.
> make changes) immediately by not setting MMU_INVALIDATE_DOES_NOT_BLOCK.
> 
> Now we are in a merge window. And we noticed a possibility that out-of-tree
> mmu-notifiers users might have trouble with making changes immediately in order
> to follow 93065ac753e44438 if expectation for mm/hmm.c changes immediately.
> And you are trying to ignore such possibility by just updating expected behavior
> description instead of giving out-of-tree users a grace period to check and update
> their code.

The intention is that 99% of HMM users will be upstream; as long as they are
not, people shouldn't worry. We have been working on nouveau to use it
for the last year or so. Many bits were added in 4.16, 4.17, 4.18 and I
hope it will all be there in the 4.20/4.21 timeframe.

See my other mail for a list of other users.

> 
> >> and keeps "all operations protected by hmm->mirrors_sem held for write are
> >> atomic". This suggests that "some operations protected by hmm->mirrors_sem held
> >> for read will sleep (and in the worst case involves memory allocation
> >> dependency)".
> > 
> > Yes and so what? The clear expectation is that neither of the range
> > notifiers do not sleep in !blocking mode. I really fail to see what you
> > are trying to say.
> 
> I'm saying "Get ACK from Jérôme about mm/hmm.c changes".

I am fine with Michal's patch; I already said so a couple of months ago, the
first time this discussion popped up. Michal, you can add:

Reviewed-by: Jérôme Glisse 


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Jerome Glisse
On Fri, Aug 24, 2018 at 02:33:41PM +0200, Michal Hocko wrote:
> On Fri 24-08-18 14:18:44, Christian König wrote:
> > Am 24.08.2018 um 14:03 schrieb Michal Hocko:
> > > On Fri 24-08-18 13:57:52, Christian König wrote:
> > > > Am 24.08.2018 um 13:52 schrieb Michal Hocko:
> > > > > On Fri 24-08-18 13:43:16, Christian König wrote:
> > > [...]
> > > > > > That won't work like this there might be multiple
> > > > > > invalidate_range_start()/invalidate_range_end() pairs open at the same time.
> > > > > > E.g. the lock might be taken recursively and that is illegal for a
> > > > > > rw_semaphore.
> > > > > I am not sure I follow. Are you saying that one invalidate_range might
> > > > > trigger another one from the same path?
> > > > No, but what can happen is:
> > > > 
> > > > invalidate_range_start(A,B);
> > > > invalidate_range_start(C,D);
> > > > ...
> > > > invalidate_range_end(C,D);
> > > > invalidate_range_end(A,B);
> > > > 
> > > > Grabbing the read lock twice would be illegal in this case.
> > > I am sorry but I still do not follow. What is the context the two are
> > > called from?
> > 
> > I don't have the slightest idea.
> > 
> > > Can you give me an example. I simply do not see it in the
> > > code, mostly because I am not familiar with it.
> > 
> > I'm neither.
> > 
> > We stumbled over that by pure observation and after discussing the problem
> > with Jerome came up with this solution.
> > 
> > No idea where exactly that case comes from, but I can confirm that it indeed
> > happens.
> 
> Thinking about it some more, I can imagine that a notifier callback which
> performs an allocation might trigger a memory reclaim and that in turn
> might trigger a notifier to be invoked and recurse. But notifier
> shouldn't really allocate memory. They are called from deep MM code
> paths and this would be extremely deadlock prone. Maybe Jerome can come
> up some more realistic scenario. If not then I would propose to simplify
> the locking here. We have lockdep to catch self deadlocks and it is
> always better to handle a specific issue rather than having a code
> without a clear indication how it can recurse.

Multiple concurrent mmu notifiers, for overlapping ranges or not, are
common (each concurrent thread can trigger some). So you might have
multiple invalidate_range_start() calls in flight for the same mm, and they
might complete in a different order (invalidate_range_end()). IIRC this is
what this lock was trying to protect against.

I can't think of a reason for a recursive mmu notifier call right now.
I will ponder and see if I remember something about it.

Cheers,
Jérôme
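
As a rough illustration of why a counter was paired with the lock, the toy
model below lets several invalidate_range_start()/invalidate_range_end() pairs
be open at once while the shared lock is taken only by the first start and
dropped only by the last end. It is a single-threaded sketch under
assumptions, not the amdgpu code: struct model_mn and both helpers are
invented for this example, and the "lock" is just a flag plus a counter.

```c
#include <stdatomic.h>

/* Toy model of a shared lock guarded by a recursion counter. */
struct model_mn {
	atomic_int recursion;	/* number of open start/end pairs */
	int read_locked;	/* 1 while the model lock is held for read */
	int acquisitions;	/* how many times the lock was really taken */
};

/* Called for every range start: only the first one takes the lock. */
static void model_read_lock(struct model_mn *mn)
{
	if (atomic_fetch_add(&mn->recursion, 1) + 1 == 1) {
		mn->read_locked = 1;
		mn->acquisitions++;
	}
}

/* Called for every range end: only the last one drops the lock. */
static void model_read_unlock(struct model_mn *mn)
{
	if (atomic_fetch_sub(&mn->recursion, 1) - 1 == 0)
		mn->read_locked = 0;
}
```

Two overlapping start/end pairs therefore acquire the model lock once, and it
stays held until the second end, regardless of the order in which the pairs
complete.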


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Tetsuo Handa
On 2018/08/24 22:32, Michal Hocko wrote:
> On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
>> I worry that (currently
>> out-of-tree) users of this API are involving work / recursion.
> 
> I do not care in the slightest about out-of-tree modules. They will have to
> accommodate to the new API. I have no problems to extend the
> documentation and be explicit about this expectation.

You don't need to care about out-of-tree modules. But you need to hear from
mm/hmm.c authors/maintainers when making changes for mmu-notifiers.

> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 133ba78820ee..698e371aafe3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
>*
>* If blockable argument is set to false then the callback cannot
>* sleep and has to return with -EAGAIN. 0 should be returned
> -  * otherwise.
> +  * otherwise. Please note that if invalidate_range_start approves
> +  * a non-blocking behavior then the same applies to
> +  * invalidate_range_end.

Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu
notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to
mmu-notifiers users.

-* If both of these callbacks cannot block, and invalidate_range
-* cannot block, mmu_notifier_ops.flags should have
-* MMU_INVALIDATE_DOES_NOT_BLOCK set.
+* If blockable argument is set to false then the callback cannot
+* sleep and has to return with -EAGAIN. 0 should be returned
+* otherwise.

Even out-of-tree mmu-notifiers users had rights not to accommodate (i.e.
make changes) immediately by not setting MMU_INVALIDATE_DOES_NOT_BLOCK.

Now we are in a merge window. And we noticed a possibility that out-of-tree
mmu-notifiers users might have trouble with making changes immediately in order
to follow 93065ac753e44438 if expectation for mm/hmm.c changes immediately.
And you are trying to ignore such possibility by just updating expected behavior
description instead of giving out-of-tree users a grace period to check and update
their code.

>> and keeps "all operations protected by hmm->mirrors_sem held for write are
>> atomic". This suggests that "some operations protected by hmm->mirrors_sem held
>> for read will sleep (and in the worst case involves memory allocation
>> dependency)".
> 
> Yes and so what? The clear expectation is that neither of the range
> notifiers do not sleep in !blocking mode. I really fail to see what you
> are trying to say.

I'm saying "Get ACK from Jérôme about mm/hmm.c changes".



Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Jerome Glisse
On Fri, Aug 24, 2018 at 07:54:19PM +0900, Tetsuo Handa wrote:
> Two more worries for this patch.

[...]

> 
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> >  	up_write(&hmm->mirrors_sem);
> >  }
> > 
> > -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> > +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> >  				       struct mm_struct *mm,
> >  				       unsigned long start,
> > -				       unsigned long end)
> > +				       unsigned long end,
> > +				       bool blockable)
> >  {
> >  	struct hmm *hmm = mm->hmm;
> > 
> >  	VM_BUG_ON(!hmm);
> > 
> >  	atomic_inc(&hmm->sequence);
> > +
> > +	return 0;
> >  }
> > 
> >  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> 
> This assumes that hmm_invalidate_range_end() does not have memory
> allocation dependency. But hmm_invalidate_range() from
> hmm_invalidate_range_end() involves
> 
> 	down_read(&hmm->mirrors_sem);
> 	list_for_each_entry(mirror, &hmm->mirrors, list)
> 		mirror->ops->sync_cpu_device_pagetables(mirror, action,
> 							start, end);
> 	up_read(&hmm->mirrors_sem);
> 
> sequence. What is surprising is that there is no in-tree user who assigns
> sync_cpu_device_pagetables field.
> 
>   $ grep -Fr sync_cpu_device_pagetables *
>   Documentation/vm/hmm.rst: /* sync_cpu_device_pagetables() - synchronize page tables
>   include/linux/hmm.h: * will get callbacks through sync_cpu_device_pagetables() operation (see
>   include/linux/hmm.h:/* sync_cpu_device_pagetables() - synchronize page tables
>   include/linux/hmm.h:void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
>   include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
>   mm/hmm.c:   mirror->ops->sync_cpu_device_pagetables(mirror, action,
> 
> That is, this API seems to be currently used by only out-of-tree users. Since
> we can't check that nobody has memory allocation dependency, I think that
> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for now.

You can see the updates and the users of this here:

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-intel-v00
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau-v01
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-radeon-v00

I am still working on the Mellanox and AMD GPU patchsets.

I will post the HMM changes that adapt to Michal's patch shortly, as
those have been sufficiently tested by now anyway.

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-4.20&id=78785dcb5ba0924c2c5e7be027793f99ebbc39f3
https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-4.20&id=4fc25571dc893f2b278e90cda9e71e139e01de70

Cheers,
Jérôme


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 15:44:03, Christian König wrote:
> Am 24.08.2018 um 15:40 schrieb Michal Hocko:
> > On Fri 24-08-18 15:28:33, Christian König wrote:
> > > Am 24.08.2018 um 15:24 schrieb Michal Hocko:
> > > > On Fri 24-08-18 15:10:08, Christian König wrote:
> > > > > Am 24.08.2018 um 15:01 schrieb Michal Hocko:
> > > > > > On Fri 24-08-18 14:52:26, Christian König wrote:
> > > > > > > Am 24.08.2018 um 14:33 schrieb Michal Hocko:
> > > > > > [...]
> > > > > > > > Thinking about it some more, I can imagine that a notifier callback which
> > > > > > > > performs an allocation might trigger a memory reclaim and that in turn
> > > > > > > > might trigger a notifier to be invoked and recurse. But notifier
> > > > > > > > shouldn't really allocate memory. They are called from deep MM code
> > > > > > > > paths and this would be extremely deadlock prone. Maybe Jerome can come
> > > > > > > > up some more realistic scenario. If not then I would propose to simplify
> > > > > > > > the locking here. We have lockdep to catch self deadlocks and it is
> > > > > > > > always better to handle a specific issue rather than having a code
> > > > > > > > without a clear indication how it can recurse.
> > > > > > > Well I agree that we should probably fix that, but I have some concerns to
> > > > > > > remove the existing workaround.
> > > > > > > 
> > > > > > > See we added that to get rid of a real problem in a customer environment and
> > > > > > > I don't want that to show up again.
> > > > > > It would really help to know more about that case and fix it properly
> > > > > > rather than workaround it like this. Anyway, let me think how to handle
> > > > > > the non-blocking notifier invocation then. I was not able to come up
> > > > > > with anything remotely sane yet.
> > > > > With avoiding allocating memory in the write lock path I don't see an issue
> > > > > any more with that.
> > > > > 
> > > > > All what the write lock path does now is adding items to a linked lists,
> > > > > arrays etc
> > > > Can we change it to non-sleepable lock then?
> > > No, the write side doesn't sleep any more, but the read side does.
> > > 
> > > See amdgpu_mn_invalidate_node() and that is where you actually need to
> > > handle the non-blocking flag correctly.
> > Ohh, right you are. We already handle that by bailing out before calling
> > amdgpu_mn_invalidate_node in !blockable mode.
> 
> Yeah, that is sufficient.
> 
> It could be improved because we have something like 90% chance that
> amdgpu_mn_invalidate_node() actually doesn't need to do anything.
> 
> But I can take care of that when the patch set has landed.
> 
> > So does this looks good to you?
> 
> Yeah, that looks perfect to me. Reviewed-by: Christian König
> 

Cool! Thanks for your guidance and patience with me. Here is the full
patch. Feel free to take it and route per your preference.

From 4e297bf5a55906ee369d70bee9f7beeb3cba74bb Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Fri, 24 Aug 2018 15:45:52 +0200
Subject: [PATCH] drm/amd: clarify amdgpu_mn_read_lock !blocking mode
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Tetsuo has noticed that 93065ac753e4 ("mm, oom: distinguish blockable
mode for mmu notifiers") !blocking case for amdgpu_mn_read_lock is
incomplete because it might sleep on the notifier lock. This is true
but as it turned out from the discussion with Christian this doesn't
really matter.

The amd notifier lock doesn't block in the exclusive mode. It only ever
sleeps with the read lock inside amdgpu_mn_invalidate_node. That one
is not called in !blockable state so while we might sleep on notifier
read_lock this will only be for a short while. The same applies on the
notifier lock.

Therefore remove blockable handling from amdgpu_mn_read_lock and
document it properly.

Noticed-by: Tetsuo Handa 
Reviewed-by: Christian König 
Signed-off-by: Michal Hocko 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..48fa152231be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
-		return -EAGAIN;
-
+	/*
+	 * We can take sleepable lock even on !blockable mode because
+	 * read_lock is only ever taken from this path and the notifier
+	 * lock never really sleeps. In fact the only reason why the
+	 * latter is 

Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Christian König

Am 24.08.2018 um 15:40 schrieb Michal Hocko:
> On Fri 24-08-18 15:28:33, Christian König wrote:
> > Am 24.08.2018 um 15:24 schrieb Michal Hocko:
> > > On Fri 24-08-18 15:10:08, Christian König wrote:
> > > > Am 24.08.2018 um 15:01 schrieb Michal Hocko:
> > > > > On Fri 24-08-18 14:52:26, Christian König wrote:
> > > > > > Am 24.08.2018 um 14:33 schrieb Michal Hocko:
> > > > > [...]
> > > > > > > Thinking about it some more, I can imagine that a notifier callback which
> > > > > > > performs an allocation might trigger a memory reclaim and that in turn
> > > > > > > might trigger a notifier to be invoked and recurse. But notifiers
> > > > > > > shouldn't really allocate memory. They are called from deep MM code
> > > > > > > paths and this would be extremely deadlock prone. Maybe Jerome can come
> > > > > > > up with some more realistic scenario. If not then I would propose to simplify
> > > > > > > the locking here. We have lockdep to catch self deadlocks and it is
> > > > > > > always better to handle a specific issue rather than having code
> > > > > > > without a clear indication of how it can recurse.
> > > > > > Well I agree that we should probably fix that, but I have some concerns about
> > > > > > removing the existing workaround.
> > > > > > 
> > > > > > See, we added that to get rid of a real problem in a customer environment and
> > > > > > I don't want that to show up again.
> > > > > It would really help to know more about that case and fix it properly
> > > > > rather than work around it like this. Anyway, let me think how to handle
> > > > > the non-blocking notifier invocation then. I was not able to come up
> > > > > with anything remotely sane yet.
> > > > Now that we avoid allocating memory in the write lock path, I don't see an
> > > > issue with that any more.
> > > > 
> > > > All that the write lock path does now is add items to linked lists,
> > > > arrays etc.
> > > Can we change it to a non-sleepable lock then?
> > No, the write side doesn't sleep any more, but the read side does.
> > 
> > See amdgpu_mn_invalidate_node() and that is where you actually need to
> > handle the non-blocking flag correctly.
> Ohh, right you are. We already handle that by bailing out before calling
> amdgpu_mn_invalidate_node in !blockable mode.

Yeah, that is sufficient.

It could be improved because we have something like 90% chance that
amdgpu_mn_invalidate_node() actually doesn't need to do anything.

But I can take care of that when the patch set has landed.

> So does this look good to you?

Yeah, that looks perfect to me. Reviewed-by: Christian König

Thanks,
Christian.



> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index e55508b39496..48fa152231be 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   */
>  static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>  {
> -	if (blockable)
> -		mutex_lock(&amn->read_lock);
> -	else if (!mutex_trylock(&amn->read_lock))
> -		return -EAGAIN;
> -
> +	/*
> +	 * We can take sleepable lock even on !blockable mode because
> +	 * read_lock is only ever taken from this path and the notifier
> +	 * lock never really sleeps. In fact the only reason why the
> +	 * latter is sleepable is because the notifier itself might sleep
> +	 * in amdgpu_mn_invalidate_node but blockable mode is handled
> +	 * before calling into that path.
> +	 */
> +	mutex_lock(&amn->read_lock);
>  	if (atomic_inc_return(&amn->recursion) == 1)
>  		down_read_non_owner(&amn->lock);
>  	mutex_unlock(&amn->read_lock);




Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 15:28:33, Christian König wrote:
> Am 24.08.2018 um 15:24 schrieb Michal Hocko:
> > On Fri 24-08-18 15:10:08, Christian König wrote:
> > > Am 24.08.2018 um 15:01 schrieb Michal Hocko:
> > > > On Fri 24-08-18 14:52:26, Christian König wrote:
> > > > > Am 24.08.2018 um 14:33 schrieb Michal Hocko:
> > > > [...]
> > > > > > > Thinking about it some more, I can imagine that a notifier callback
> > > > > > > which performs an allocation might trigger a memory reclaim and that in
> > > > > > > turn might trigger a notifier to be invoked and recurse. But notifiers
> > > > > > > shouldn't really allocate memory. They are called from deep MM code
> > > > > > > paths and this would be extremely deadlock prone. Maybe Jerome can
> > > > > > > come up with some more realistic scenario. If not then I would propose
> > > > > > > to simplify the locking here. We have lockdep to catch self deadlocks
> > > > > > > and it is always better to handle a specific issue rather than having
> > > > > > > code without a clear indication of how it can recurse.
> > > > > > Well I agree that we should probably fix that, but I have some
> > > > > > concerns about removing the existing workaround.
> > > > > > 
> > > > > > See, we added that to get rid of a real problem in a customer
> > > > > > environment and I don't want that to show up again.
> > > > > It would really help to know more about that case and fix it properly
> > > > > rather than work around it like this. Anyway, let me think how to handle
> > > > > the non-blocking notifier invocation then. I was not able to come up
> > > > > with anything remotely sane yet.
> > > > Now that we avoid allocating memory in the write lock path, I don't see an
> > > > issue with that any more.
> > > > 
> > > > All that the write lock path does now is add items to linked lists,
> > > > arrays etc.
> > > Can we change it to a non-sleepable lock then?
> 
> No, the write side doesn't sleep any more, but the read side does.
> 
> See amdgpu_mn_invalidate_node() and that is where you actually need to
> handle the non-blocking flag correctly.

Ohh, right you are. We already handle that by bailing out before calling
amdgpu_mn_invalidate_node in !blockable mode. So does this look good to
you?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..48fa152231be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
-		return -EAGAIN;
-
+	/*
+	 * We can take sleepable lock even on !blockable mode because
+	 * read_lock is only ever taken from this path and the notifier
+	 * lock never really sleeps. In fact the only reason why the
+	 * latter is sleepable is because the notifier itself might sleep
+	 * in amdgpu_mn_invalidate_node but blockable mode is handled
+	 * before calling into that path.
+	 */
+	mutex_lock(&amn->read_lock);
 	if (atomic_inc_return(&amn->recursion) == 1)
 		down_read_non_owner(&amn->lock);
 	mutex_unlock(&amn->read_lock);
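
For readers following along, the bail-out that the new comment relies on can
be modeled in plain C. This is only a userspace sketch under stated
assumptions, not the amdgpu code: struct model_node, model_invalidate_node()
and model_invalidate_range_start() are hypothetical names, and a flat array
stands in for whatever range lookup the driver really uses. It shows nothing
more than the ordering, namely that the !blockable check sits in front of the
call that may sleep.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a tracked range; not the amdgpu data structure. */
struct model_node {
	unsigned long start, end;	/* inclusive range covered by the node */
	int invalidated;		/* how often the sleeping path ran */
};

/* Stands in for a function that may sleep, e.g. while waiting for the GPU. */
static void model_invalidate_node(struct model_node *node)
{
	node->invalidated++;
}

/*
 * Walk the nodes overlapping [start, end]. In !blockable mode return
 * -EAGAIN as soon as an overlap is found, before anything can sleep.
 */
static int model_invalidate_range_start(struct model_node *nodes, size_t n,
					unsigned long start, unsigned long end,
					bool blockable)
{
	assert(start <= end);

	for (size_t i = 0; i < n; i++) {
		if (nodes[i].end < start || nodes[i].start > end)
			continue;		/* no overlap, nothing to do */
		if (!blockable)
			return -EAGAIN;		/* bail out before sleeping */
		model_invalidate_node(&nodes[i]);
	}
	return 0;
}
```

In this model a non-blocking caller gets 0 when no tracked range overlaps and
-EAGAIN otherwise, so the potentially sleeping function is never reached in
!blockable mode.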
-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> On 2018/08/24 20:36, Michal Hocko wrote:
> >> That is, this API seems to be currently used by only out-of-tree users.
> >> Since we can't check that nobody has memory allocation dependency, I think
> >> that hmm_invalidate_range_start() should return -EAGAIN if blockable ==
> >> false for now.
> > 
> > The code expects that the invalidate_range_end doesn't block if
> > invalidate_range_start hasn't blocked. That is the reason why the end
> > callback doesn't have blockable parameter. If this doesn't hold then the
> > whole scheme is just fragile because those two calls should pair.
> > 
> That is what I meant by
> 
>   More worrisome part in that patch is that I don't know whether using
>   trylock if blockable == false at entry is really sufficient.
> 
> Since those two calls should pair, I think that we need to determine whether
> we need to return -EAGAIN at start call by evaluating both calls.

Yes, and I believe I have done that audit. Modulo my misunderstanding of
the code.

> Just as mn_invl_range_start() involves schedule_delayed_work(), which could be
> blocked on memory allocation in an OOM situation,

It doesn't because that code path is not invoked for the !blockable
case.

> I worry that (currently
> out-of-tree) users of this API involve work / recursion.

I do not care in the slightest about out-of-tree modules. They will have to
accommodate the new API. I have no problem extending the documentation
to be explicit about this expectation.

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 133ba78820ee..698e371aafe3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -153,7 +153,9 @@ struct mmu_notifier_ops {
 	 *
 	 * If blockable argument is set to false then the callback cannot
 	 * sleep and has to return with -EAGAIN. 0 should be returned
-	 * otherwise.
+	 * otherwise. Please note that if invalidate_range_start approves
+	 * a non-blocking behavior then the same applies to
+	 * invalidate_range_end.
 	 *
 	 */
 	int (*invalidate_range_start)(struct mmu_notifier *mn,
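
The pairing rule in the proposed comment can be illustrated with a small
userspace model. It is a sketch under stated assumptions rather than the
kernel API: struct model_notifier, its busy flag and the model_* functions are
all hypothetical. The point it tries to show is that the end callback has no
blockable argument, so a start callback that returns 0 in non-blocking mode
also promises that its end callback will not sleep.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified model of a range notifier; not the kernel API. */
struct model_notifier {
	bool busy;	/* true if start would have to sleep to make progress */
	int open_pairs;	/* start calls not yet matched by an end call */
	int (*start)(struct model_notifier *n, bool blockable);
	void (*end)(struct model_notifier *n);
};

static int model_start(struct model_notifier *n, bool blockable)
{
	if (n->busy && !blockable)
		return -EAGAIN;	/* refuse rather than sleep */
	/* A blockable caller would wait here until the notifier is idle. */
	n->open_pairs++;
	return 0;
}

/* No blockable argument: if start succeeded without blocking, end must too. */
static void model_end(struct model_notifier *n)
{
	assert(n->open_pairs > 0);
	n->open_pairs--;
}

/* Caller side: end runs only if start succeeded. */
static int model_invalidate(struct model_notifier *n, bool blockable)
{
	int ret = n->start(n, blockable);

	if (ret)
		return ret;	/* the caller backs off and may retry later */
	/* ... the page table update would happen here ... */
	n->end(n);
	return 0;
}
```

A refused start leaves no open pair behind, so the caller never reaches an
end callback whose blocking behavior it could not have vetted.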


> And hmm_release() says that
> 
> 	/*
> 	 * Drop mirrors_sem so callback can wait on any pending
> 	 * work that might itself trigger mmu_notifier callback
> 	 * and thus would deadlock with us.
> 	 */
> 
> and keeps "all operations protected by hmm->mirrors_sem held for write are
> atomic". This suggests that "some operations protected by hmm->mirrors_sem
> held for read will sleep (and in the worst case involve memory allocation
> dependency)".

Yes and so what? The clear expectation is that neither of the range
notifiers sleeps in !blocking mode. I really fail to see what you
are trying to say.

-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Christian König

Am 24.08.2018 um 15:24 schrieb Michal Hocko:
> On Fri 24-08-18 15:10:08, Christian König wrote:
> > Am 24.08.2018 um 15:01 schrieb Michal Hocko:
> > > On Fri 24-08-18 14:52:26, Christian König wrote:
> > > > Am 24.08.2018 um 14:33 schrieb Michal Hocko:
> > > [...]
> > > > > Thinking about it some more, I can imagine that a notifier callback which
> > > > > performs an allocation might trigger a memory reclaim and that in turn
> > > > > might trigger a notifier to be invoked and recurse. But notifiers
> > > > > shouldn't really allocate memory. They are called from deep MM code
> > > > > paths and this would be extremely deadlock prone. Maybe Jerome can come
> > > > > up with some more realistic scenario. If not then I would propose to simplify
> > > > > the locking here. We have lockdep to catch self deadlocks and it is
> > > > > always better to handle a specific issue rather than having code
> > > > > without a clear indication of how it can recurse.
> > > > Well I agree that we should probably fix that, but I have some concerns about
> > > > removing the existing workaround.
> > > > 
> > > > See, we added that to get rid of a real problem in a customer environment and
> > > > I don't want that to show up again.
> > > It would really help to know more about that case and fix it properly
> > > rather than work around it like this. Anyway, let me think how to handle
> > > the non-blocking notifier invocation then. I was not able to come up
> > > with anything remotely sane yet.
> > Now that we avoid allocating memory in the write lock path, I don't see an
> > issue with that any more.
> > 
> > All that the write lock path does now is add items to linked lists,
> > arrays etc.
> Can we change it to a non-sleepable lock then?

No, the write side doesn't sleep any more, but the read side does.

See amdgpu_mn_invalidate_node() and that is where you actually need to
handle the non-blocking flag correctly.


Christian.


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 15:10:08, Christian König wrote:
> Am 24.08.2018 um 15:01 schrieb Michal Hocko:
> > On Fri 24-08-18 14:52:26, Christian König wrote:
> > > Am 24.08.2018 um 14:33 schrieb Michal Hocko:
> > [...]
> > > > Thinking about it some more, I can imagine that a notifier callback which
> > > > performs an allocation might trigger a memory reclaim and that in turn
> > > > might trigger a notifier to be invoked and recurse. But notifiers
> > > > shouldn't really allocate memory. They are called from deep MM code
> > > > paths and this would be extremely deadlock prone. Maybe Jerome can come
> > > > up with some more realistic scenario. If not then I would propose to simplify
> > > > the locking here. We have lockdep to catch self deadlocks and it is
> > > > always better to handle a specific issue rather than having code
> > > > without a clear indication of how it can recurse.
> > > Well I agree that we should probably fix that, but I have some concerns about
> > > removing the existing workaround.
> > > 
> > > See, we added that to get rid of a real problem in a customer environment
> > > and I don't want that to show up again.
> > It would really help to know more about that case and fix it properly
> > rather than work around it like this. Anyway, let me think how to handle
> > the non-blocking notifier invocation then. I was not able to come up
> > with anything remotely sane yet.
> 
> Now that we avoid allocating memory in the write lock path, I don't see an
> issue with that any more.
> 
> All that the write lock path does now is add items to linked lists,
> arrays etc.

Can we change it to a non-sleepable lock then?
-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Christian König

Am 24.08.2018 um 15:01 schrieb Michal Hocko:
> On Fri 24-08-18 14:52:26, Christian König wrote:
> > Am 24.08.2018 um 14:33 schrieb Michal Hocko:
> [...]
> > > Thinking about it some more, I can imagine that a notifier callback which
> > > performs an allocation might trigger a memory reclaim and that in turn
> > > might trigger a notifier to be invoked and recurse. But notifiers
> > > shouldn't really allocate memory. They are called from deep MM code
> > > paths and this would be extremely deadlock prone. Maybe Jerome can come
> > > up with some more realistic scenario. If not then I would propose to simplify
> > > the locking here. We have lockdep to catch self deadlocks and it is
> > > always better to handle a specific issue rather than having code
> > > without a clear indication of how it can recurse.
> > Well I agree that we should probably fix that, but I have some concerns about
> > removing the existing workaround.
> > 
> > See, we added that to get rid of a real problem in a customer environment and
> > I don't want that to show up again.
> It would really help to know more about that case and fix it properly
> rather than work around it like this. Anyway, let me think how to handle
> the non-blocking notifier invocation then. I was not able to come up
> with anything remotely sane yet.

Now that we avoid allocating memory in the write lock path, I don't see an
issue with that any more.

All that the write lock path does now is add items to linked lists,
arrays etc.

So there is no more blocking involved here and the read lock side should
be able to grab the lock immediately.


Christian.


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Tetsuo Handa
On 2018/08/24 20:36, Michal Hocko wrote:
>> That is, this API seems to be currently used by only out-of-tree users. Since
>> we can't check that nobody has memory allocation dependency, I think that
>> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for
>> now.
> 
> The code expects that the invalidate_range_end doesn't block if
> invalidate_range_start hasn't blocked. That is the reason why the end
> callback doesn't have blockable parameter. If this doesn't hold then the
> whole scheme is just fragile because those two calls should pair.
> 
That is what I meant by

  More worrisome part in that patch is that I don't know whether using
  trylock if blockable == false at entry is really sufficient.

Since those two calls should pair, I think that we need to determine whether
we need to return -EAGAIN at start call by evaluating both calls.

Just as mn_invl_range_start() involves schedule_delayed_work(), which could be
blocked on memory allocation in an OOM situation, I worry that (currently
out-of-tree) users of this API involve work / recursion.
And hmm_release() says that

	/*
	 * Drop mirrors_sem so callback can wait on any pending
	 * work that might itself trigger mmu_notifier callback
	 * and thus would deadlock with us.
	 */

and keeps "all operations protected by hmm->mirrors_sem held for write are
atomic". This suggests that "some operations protected by hmm->mirrors_sem held
for read will sleep (and in the worst case involve memory allocation
dependency)".



Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 14:52:26, Christian König wrote:
> Am 24.08.2018 um 14:33 schrieb Michal Hocko:
[...]
> > Thinking about it some more, I can imagine that a notifier callback which
> > performs an allocation might trigger a memory reclaim and that in turn
> > might trigger a notifier to be invoked and recurse. But notifiers
> > shouldn't really allocate memory. They are called from deep MM code
> > paths and this would be extremely deadlock prone. Maybe Jerome can come
> > up with some more realistic scenario. If not then I would propose to simplify
> > the locking here. We have lockdep to catch self deadlocks and it is
> > always better to handle a specific issue rather than having code
> > without a clear indication of how it can recurse.
> 
> Well I agree that we should probably fix that, but I have some concerns about
> removing the existing workaround.
> 
> See, we added that to get rid of a real problem in a customer environment and
> I don't want that to show up again.

It would really help to know more about that case and fix it properly
rather than work around it like this. Anyway, let me think how to handle
the non-blocking notifier invocation then. I was not able to come up
with anything remotely sane yet.
-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Christian König

Am 24.08.2018 um 14:33 schrieb Michal Hocko:
> On Fri 24-08-18 14:18:44, Christian König wrote:
> > Am 24.08.2018 um 14:03 schrieb Michal Hocko:
> > > On Fri 24-08-18 13:57:52, Christian König wrote:
> > > > Am 24.08.2018 um 13:52 schrieb Michal Hocko:
> > > > > On Fri 24-08-18 13:43:16, Christian König wrote:
> > > [...]
> > > > > > That won't work like this; there might be multiple
> > > > > > invalidate_range_start()/invalidate_range_end() pairs open at the same time.
> > > > > > E.g. the lock might be taken recursively and that is illegal for a
> > > > > > rw_semaphore.
> > > > > I am not sure I follow. Are you saying that one invalidate_range might
> > > > > trigger another one from the same path?
> > > > No, but what can happen is:
> > > > 
> > > > invalidate_range_start(A,B);
> > > > invalidate_range_start(C,D);
> > > > ...
> > > > invalidate_range_end(C,D);
> > > > invalidate_range_end(A,B);
> > > > 
> > > > Grabbing the read lock twice would be illegal in this case.
> > > I am sorry but I still do not follow. What is the context the two are
> > > called from?
> > 
> > I don't have the slightest idea.
> > 
> > > Can you give me an example. I simply do not see it in the
> > > code, mostly because I am not familiar with it.
> > 
> > Me neither.
> > 
> > We stumbled over that by pure observation and after discussing the problem
> > with Jerome came up with this solution.
> > 
> > No idea where exactly that case comes from, but I can confirm that it indeed
> > happens.
> Thinking about it some more, I can imagine that a notifier callback which
> performs an allocation might trigger a memory reclaim and that in turn
> might trigger a notifier to be invoked and recurse. But notifiers
> shouldn't really allocate memory. They are called from deep MM code
> paths and this would be extremely deadlock prone. Maybe Jerome can come
> up with some more realistic scenario. If not then I would propose to simplify
> the locking here. We have lockdep to catch self deadlocks and it is
> always better to handle a specific issue rather than having code
> without a clear indication of how it can recurse.

Well I agree that we should probably fix that, but I have some concerns
about removing the existing workaround.

See, we added that to get rid of a real problem in a customer environment
and I don't want that to show up again.

In the meantime I've sent out a fix to avoid allocating memory while
holding the mn_lock.


Thanks for pointing that out,
Christian.


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 14:18:44, Christian König wrote:
> Am 24.08.2018 um 14:03 schrieb Michal Hocko:
> > On Fri 24-08-18 13:57:52, Christian König wrote:
> > > Am 24.08.2018 um 13:52 schrieb Michal Hocko:
> > > > On Fri 24-08-18 13:43:16, Christian König wrote:
> > [...]
> > > > > That won't work like this; there might be multiple
> > > > > invalidate_range_start()/invalidate_range_end() pairs open at the
> > > > > same time.
> > > > > E.g. the lock might be taken recursively and that is illegal for a
> > > > > rw_semaphore.
> > > > I am not sure I follow. Are you saying that one invalidate_range might
> > > > trigger another one from the same path?
> > > No, but what can happen is:
> > > 
> > > invalidate_range_start(A,B);
> > > invalidate_range_start(C,D);
> > > ...
> > > invalidate_range_end(C,D);
> > > invalidate_range_end(A,B);
> > > 
> > > Grabbing the read lock twice would be illegal in this case.
> > I am sorry but I still do not follow. What is the context the two are
> > called from?
> 
> I don't have the slightest idea.
> 
> > Can you give me an example. I simply do not see it in the
> > code, mostly because I am not familiar with it.
> 
> Me neither.
> 
> We stumbled over that by pure observation and after discussing the problem
> with Jerome came up with this solution.
> 
> No idea where exactly that case comes from, but I can confirm that it indeed
> happens.

Thinking about it some more, I can imagine that a notifier callback which
performs an allocation might trigger a memory reclaim and that in turn
might trigger a notifier to be invoked and recurse. But notifiers
shouldn't really allocate memory. They are called from deep MM code
paths and this would be extremely deadlock prone. Maybe Jerome can come
up with some more realistic scenario. If not then I would propose to simplify
the locking here. We have lockdep to catch self deadlocks and it is
always better to handle a specific issue rather than having code
without a clear indication of how it can recurse.
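
The recursion imagined above can be sketched in userspace. This is an analogy
and not kernel code: a POSIX error-checking mutex stands in, very loosely, for
a non-recursive lock under a checker such as lockdep, and model_notifier() and
model_alloc_triggers_reclaim() are hypothetical names. The notifier takes a
lock, "allocates", the allocation enters "reclaim", and reclaim invokes the
notifier again on the same thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t model_lock;

/* Error-checking mutex: relocking from the owning thread returns EDEADLK. */
static void model_init(void)
{
	pthread_mutexattr_t attr;
	int err;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	err = pthread_mutex_init(&model_lock, &attr);
	assert(err == 0);
	(void)err;
	pthread_mutexattr_destroy(&attr);
}

static int model_notifier(int allocate);

/* Stands in for an allocation that enters reclaim, which runs notifiers. */
static int model_alloc_triggers_reclaim(void)
{
	return model_notifier(0);
}

/* Notifier callback: takes the lock and optionally allocates under it. */
static int model_notifier(int allocate)
{
	int ret = pthread_mutex_lock(&model_lock);

	if (ret)
		return ret;	/* EDEADLK: this thread already holds the lock */
	if (allocate)
		ret = model_alloc_triggers_reclaim();
	pthread_mutex_unlock(&model_lock);
	return ret;
}
```

In this model a callback that does not allocate returns 0, while one that
allocates under the lock reports EDEADLK from the nested invocation instead of
hanging, which is the kind of specific report that is easier to act on than a
recursion counter with no documented cause.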
-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Christian König

Am 24.08.2018 um 14:03 schrieb Michal Hocko:
> On Fri 24-08-18 13:57:52, Christian König wrote:
> > Am 24.08.2018 um 13:52 schrieb Michal Hocko:
> > > On Fri 24-08-18 13:43:16, Christian König wrote:
> [...]
> > > > That won't work like this; there might be multiple
> > > > invalidate_range_start()/invalidate_range_end() pairs open at the same time.
> > > > E.g. the lock might be taken recursively and that is illegal for a
> > > > rw_semaphore.
> > > I am not sure I follow. Are you saying that one invalidate_range might
> > > trigger another one from the same path?
> > No, but what can happen is:
> > 
> > invalidate_range_start(A,B);
> > invalidate_range_start(C,D);
> > ...
> > invalidate_range_end(C,D);
> > invalidate_range_end(A,B);
> > 
> > Grabbing the read lock twice would be illegal in this case.
> I am sorry but I still do not follow. What is the context the two are
> called from?

I don't have the slightest idea.

> Can you give me an example. I simply do not see it in the
> code, mostly because I am not familiar with it.

Me neither.

We stumbled over that by pure observation and after discussing the
problem with Jerome came up with this solution.

No idea where exactly that case comes from, but I can confirm that it
indeed happens.


Regards,
Christian.


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Tetsuo Handa
Two more worries for this patch.



> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   *
>   * @amn: our notifier
>   */
> -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>  {
> -	mutex_lock(&amn->read_lock);
> +	if (blockable)
> +		mutex_lock(&amn->read_lock);
> +	else if (!mutex_trylock(&amn->read_lock))
> +		return -EAGAIN;
> +
>  	if (atomic_inc_return(&amn->recursion) == 1)
>  		down_read_non_owner(&amn->lock);

Why don't we need to use trylock here if blockable == false?
I would like a comment explaining why it is safe to use a blocking lock here.

>  	mutex_unlock(&amn->read_lock);
> +
> +	return 0;
>  }
> 
>  /**
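
To make the question concrete, here is a userspace sketch (POSIX threads, not
kernel code) of a read-lock helper that never sleeps when blockable is false:
both the outer mutex and the shared lock are taken with their try variants and
contention is reported as -EAGAIN. struct model_mn and the model_* functions
are hypothetical stand-ins, the recursion counter is left out for brevity, and
whether amdgpu actually needs the second trylock is exactly what is being
asked.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the notifier state. */
struct model_mn {
	pthread_mutex_t read_lock;	/* serializes readers taking ->lock */
	pthread_rwlock_t lock;		/* the notifier lock itself */
};

static void model_mn_init(struct model_mn *mn)
{
	pthread_mutex_init(&mn->read_lock, NULL);
	pthread_rwlock_init(&mn->lock, NULL);
}

/*
 * Take ->lock for reading. With blockable == false neither step may
 * sleep, so both use the try variant and contention yields -EAGAIN.
 */
static int model_read_lock(struct model_mn *mn, bool blockable)
{
	int ret = 0;

	if (blockable)
		pthread_mutex_lock(&mn->read_lock);
	else if (pthread_mutex_trylock(&mn->read_lock) != 0)
		return -EAGAIN;

	if (blockable)
		pthread_rwlock_rdlock(&mn->lock);
	else if (pthread_rwlock_tryrdlock(&mn->lock) != 0)
		ret = -EAGAIN;

	pthread_mutex_unlock(&mn->read_lock);
	return ret;
}

static void model_read_unlock(struct model_mn *mn)
{
	int err = pthread_rwlock_unlock(&mn->lock);

	assert(err == 0);
	(void)err;
}
```
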



> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  	up_write(&hmm->mirrors_sem);
>  }
> 
> -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
>  				       unsigned long start,
> -				       unsigned long end)
> +				       unsigned long end,
> +				       bool blockable)
>  {
>  	struct hmm *hmm = mm->hmm;
> 
>  	VM_BUG_ON(!hmm);
> 
>  	atomic_inc(&hmm->sequence);
> +
> +	return 0;
>  }
> 
>  static void hmm_invalidate_range_end(struct mmu_notifier *mn,

This assumes that hmm_invalidate_range_end() does not have a memory
allocation dependency. But hmm_invalidate_range(), called from
hmm_invalidate_range_end(), involves the

	down_read(&hmm->mirrors_sem);
	list_for_each_entry(mirror, &hmm->mirrors, list)
		mirror->ops->sync_cpu_device_pagetables(mirror, action,
							start, end);
	up_read(&hmm->mirrors_sem);

sequence. What is surprising is that there is no in-tree user who assigns
the sync_cpu_device_pagetables field.

  $ grep -Fr sync_cpu_device_pagetables *
  Documentation/vm/hmm.rst: /* sync_cpu_device_pagetables() - synchronize page tables
  include/linux/hmm.h: * will get callbacks through sync_cpu_device_pagetables() operation (see
  include/linux/hmm.h:/* sync_cpu_device_pagetables() - synchronize page tables
  include/linux/hmm.h:void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
  include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
  mm/hmm.c:   mirror->ops->sync_cpu_device_pagetables(mirror, action,

That is, this API seems to be currently used by only out-of-tree users. Since
we can't check that nobody has a memory allocation dependency, I think that
hmm_invalidate_range_start() should return -EAGAIN if blockable == false for
now.
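
As a sketch of what that conservative variant could look like (a userspace
model with hypothetical names, not a patch against mm/hmm.c): the start
callback refuses every non-blocking invalidation up front and only bumps the
sequence counter when it is allowed to block.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct hmm; only the sequence counter is modeled. */
struct model_hmm {
	atomic_int sequence;
};

/*
 * Conservative variant: refuse every non-blocking invalidation, because
 * the paired end callback calls into mirror callbacks that cannot be audited.
 */
static int model_hmm_invalidate_range_start(struct model_hmm *hmm,
					    bool blockable)
{
	assert(hmm);

	if (!blockable)
		return -EAGAIN;

	atomic_fetch_add(&hmm->sequence, 1);
	return 0;
}
```

Returning before the counter is touched means a refused start leaves no state
behind for the end callback to undo.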



Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 13:57:52, Christian König wrote:
> Am 24.08.2018 um 13:52 schrieb Michal Hocko:
> > On Fri 24-08-18 13:43:16, Christian König wrote:
[...]
> > > That won't work like this; there might be multiple
> > > invalidate_range_start()/invalidate_range_end() pairs open at the same
> > > time.
> > > E.g. the lock might be taken recursively and that is illegal for a
> > > rw_semaphore.
> > I am not sure I follow. Are you saying that one invalidate_range might
> > trigger another one from the same path?
> 
> No, but what can happen is:
> 
> invalidate_range_start(A,B);
> invalidate_range_start(C,D);
> ...
> invalidate_range_end(C,D);
> invalidate_range_end(A,B);
> 
> Grabbing the read lock twice would be illegal in this case.

I am sorry but I still do not follow. What is the context the two are
called from? Can you give me an example. I simply do not see it in the
code, mostly because I am not familiar with it.
-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Christian König

Am 24.08.2018 um 13:52 schrieb Michal Hocko:

On Fri 24-08-18 13:43:16, Christian König wrote:

Am 24.08.2018 um 13:32 schrieb Michal Hocko:

On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:

Two more worries for this patch.




--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
*
* @amn: our notifier
*/
-static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
+static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
   {
-   mutex_lock(>read_lock);
+   if (blockable)
+   mutex_lock(>read_lock);
+   else if (!mutex_trylock(>read_lock))
+   return -EAGAIN;
+
  if (atomic_inc_return(>recursion) == 1)
  down_read_non_owner(>lock);

Why don't we need to use trylock here if blockable == false ?
Want comment why it is safe to use blocking lock here.

Hmm, I am pretty sure I have checked the code but it was quite confusing
so I might have missed something. Double checking now, it seems that
this read_lock is not used anywhere else and it is not _the_ lock we are
interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
it is taken in exclusive mode for expensive operations.

The write side of the lock is only taken in the command submission IOCTL.

So you actually don't need to change anything here (even the proposed
changes are overkill) since we can't tear down the struct_mm while an IOCTL
is still using.

I am not so sure. We are not in the mm destruction phase yet. This is
mostly about the oom context which might fire right during the IOCTL. If
any of the path which is holding the write lock blocks for unbound
amount of time or even worse allocates a memory then we are screwed. So
we need to back of when blockable = false.


Oh, yeah good point. Haven't thought about that possibility.




Is that correct Christian? If this is correct then we need to update the
locking here. I am struggling to grasp the ref counting part. Why cannot
all readers simply take the lock rather than rely on somebody else to
take it? 1ed3d2567c800 didn't really help me to understand the locking
scheme here so any help would be appreciated.

That won't work like this; there might be multiple
invalidate_range_start()/invalidate_range_end() pairs open at the same time.
E.g. the lock might be taken recursively and that is illegal for a
rw_semaphore.

I am not sure I follow. Are you saying that one invalidate_range might
trigger another one from the same path?


No, but what can happen is:

invalidate_range_start(A,B);
invalidate_range_start(C,D);
...
invalidate_range_end(C,D);
invalidate_range_end(A,B);

Grabbing the read lock twice would be illegal in this case.
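
To illustrate the counting scheme being described, here is a user-space
sketch. It is not the amdgpu code: struct demo_mn, demo_range_start() and
demo_range_end() are hypothetical names, the rw_semaphore is modelled by a
plain counter, and the counter is protected by the mutex on both paths
(the driver uses an atomic instead). The point is only that nested
start/end pairs acquire the read side once, on the first start, and drop
it once, on the last end.

```c
#include <pthread.h>

/* Hypothetical model of the notifier state; not the real struct amdgpu_mn. */
struct demo_mn {
    pthread_mutex_t read_lock;  /* serialises updates to the counters below */
    int recursion;              /* number of open start/end pairs */
    int rwsem_read_holds;       /* how often the modelled rwsem is read-held */
};

/* First opener takes the read side; later openers only bump the count. */
static void demo_range_start(struct demo_mn *mn)
{
    pthread_mutex_lock(&mn->read_lock);
    if (++mn->recursion == 1)
        mn->rwsem_read_holds++;     /* stands in for down_read_non_owner() */
    pthread_mutex_unlock(&mn->read_lock);
}

/* Last closer drops the read side; earlier closers only drop the count. */
static void demo_range_end(struct demo_mn *mn)
{
    pthread_mutex_lock(&mn->read_lock);
    if (--mn->recursion == 0)
        mn->rwsem_read_holds--;     /* stands in for up_read_non_owner() */
    pthread_mutex_unlock(&mn->read_lock);
}
```

With two nested pairs, as in the A,B / C,D sequence above, the modelled
read side is held exactly once for the whole duration.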

Regards,
Christian.


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 13:43:16, Christian König wrote:
> Am 24.08.2018 um 13:32 schrieb Michal Hocko:
> > On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
> > > Two more worries for this patch.
> > > 
> > > 
> > > 
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > > > @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> > > >*
> > > >* @amn: our notifier
> > > >*/
> > > > -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> > > > +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
> > > >   {
> > > > -   mutex_lock(&amn->read_lock);
> > > > +   if (blockable)
> > > > +       mutex_lock(&amn->read_lock);
> > > > +   else if (!mutex_trylock(&amn->read_lock))
> > > > +       return -EAGAIN;
> > > > +
> > > >     if (atomic_inc_return(&amn->recursion) == 1)
> > > >         down_read_non_owner(&amn->lock);
> > > Why don't we need to use trylock here if blockable == false ?
> > > Want comment why it is safe to use blocking lock here.
> > Hmm, I am pretty sure I have checked the code but it was quite confusing
> > so I might have missed something. Double checking now, it seems that
> > this read_lock is not used anywhere else and it is not _the_ lock we are
> > interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
> > it is taken in exclusive mode for expensive operations.
> 
> The write side of the lock is only taken in the command submission IOCTL.
> 
> So you actually don't need to change anything here (even the proposed
> changes are overkill) since we can't tear down the mm_struct while an IOCTL
> is still using it.

I am not so sure. We are not in the mm destruction phase yet. This is
mostly about the oom context which might fire right during the IOCTL. If
any of the paths holding the write lock blocks for an unbounded
amount of time, or even worse allocates memory, then we are screwed. So
we need to back off when blockable == false.
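
A minimal user-space sketch of that back-off rule, assuming a pthread
mutex as a stand-in for any sleepable lock a notifier callback might
take. struct demo_notifier and the demo_notifier_*() helpers are
hypothetical names used only for illustration; the kernel code uses
mutex_trylock() and friends instead.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a driver's notifier state. */
struct demo_notifier {
    pthread_mutex_t lock;   /* models a sleepable lock */
};

/*
 * Take the lock. A blockable caller may sleep until it is available.
 * A non-blockable caller (think: the oom reaper) must never sleep, so a
 * contended lock is reported as -EAGAIN and the caller retries later.
 */
static int demo_notifier_lock(struct demo_notifier *n, bool blockable)
{
    if (blockable) {
        pthread_mutex_lock(&n->lock);
        return 0;
    }
    if (pthread_mutex_trylock(&n->lock) != 0)
        return -EAGAIN;
    return 0;
}

static void demo_notifier_unlock(struct demo_notifier *n)
{
    pthread_mutex_unlock(&n->lock);
}
```

The same rule has to hold for every lock further down the call chain,
not just the first one, which is what the questions about amn->lock are
getting at.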

> > Is that correct Christian? If this is correct then we need to update the
> > locking here. I am struggling to grasp the ref counting part. Why cannot
> > all readers simply take the lock rather than rely on somebody else to
> > take it? 1ed3d2567c800 didn't really help me to understand the locking
> > scheme here so any help would be appreciated.
> 
> That won't work like this; there might be multiple
> invalidate_range_start()/invalidate_range_end() pairs open at the same time.
> E.g. the lock might be taken recursively and that is illegal for a
> rw_semaphore.

I am not sure I follow. Are you saying that one invalidate_range might
trigger another one from the same path?
-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Christian König

Am 24.08.2018 um 13:32 schrieb Michal Hocko:

On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:

Two more worries for this patch.




--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
   *
   * @amn: our notifier
   */
-static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
+static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
  {
-   mutex_lock(&amn->read_lock);
+   if (blockable)
+       mutex_lock(&amn->read_lock);
+   else if (!mutex_trylock(&amn->read_lock))
+       return -EAGAIN;
+
    if (atomic_inc_return(&amn->recursion) == 1)
        down_read_non_owner(&amn->lock);

Why don't we need to use trylock here if blockable == false ?
Want comment why it is safe to use blocking lock here.

Hmm, I am pretty sure I have checked the code but it was quite confusing
so I might have missed something. Double checking now, it seems that
this read_lock is not used anywhere else and it is not _the_ lock we are
interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
it is taken in exclusive mode for expensive operations.


The write side of the lock is only taken in the command submission IOCTL.

So you actually don't need to change anything here (even the proposed
changes are overkill) since we can't tear down the mm_struct while an
IOCTL is still using it.



Is that correct Christian? If this is correct then we need to update the
locking here. I am struggling to grasp the ref counting part. Why cannot
all readers simply take the lock rather than rely on somebody else to
take it? 1ed3d2567c800 didn't really help me to understand the locking
scheme here so any help would be appreciated.


That won't work like this; there might be multiple
invalidate_range_start()/invalidate_range_end() pairs open at the same
time. E.g. the lock might be taken recursively and that is illegal for a
rw_semaphore.


Regards,
Christian.



I am wondering why we cannot do
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..93034178673d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,14 +180,11 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
   */
  static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
  {
-   if (blockable)
-       mutex_lock(&amn->read_lock);
-   else if (!mutex_trylock(&amn->read_lock))
-       return -EAGAIN;
-
-   if (atomic_inc_return(&amn->recursion) == 1)
-       down_read_non_owner(&amn->lock);
-   mutex_unlock(&amn->read_lock);
+   if (!down_read_trylock(&amn->lock)) {
+       if (!blockable)
+           return -EAGAIN;
+       down_read(&amn->lock);
+   }
  
  	return 0;

  }
@@ -199,8 +196,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool 
blockable)
   */
  static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
  {
-   if (atomic_dec_return(&amn->recursion) == 0)
-       up_read_non_owner(&amn->lock);
+   up_read(&amn->lock);
  }
  
  /**






Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
[...]
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, 
> > struct mm_struct *mm)
> > up_write(&hmm->mirrors_sem);
> >  }
> > 
> > -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> > +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> >struct mm_struct *mm,
> >unsigned long start,
> > -  unsigned long end)
> > +  unsigned long end,
> > +  bool blockable)
> >  {
> > struct hmm *hmm = mm->hmm;
> > 
> > VM_BUG_ON(!hmm);
> > 
> > atomic_inc(&hmm->sequence);
> > +
> > +   return 0;
> >  }
> > 
> >  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> 
> This assumes that hmm_invalidate_range_end() does not have memory
> allocation dependency. But hmm_invalidate_range() from
> hmm_invalidate_range_end() involves
> 
> down_read(&hmm->mirrors_sem);
> list_for_each_entry(mirror, &hmm->mirrors, list)
>     mirror->ops->sync_cpu_device_pagetables(mirror, action,
>                                             start, end);
> up_read(&hmm->mirrors_sem);
> 
> sequence. What is surprising is that there is no in-tree user who assigns
> sync_cpu_device_pagetables field.

Yes HMM doesn't have any in tree user yet.

>   $ grep -Fr sync_cpu_device_pagetables *
>   Documentation/vm/hmm.rst: /* sync_cpu_device_pagetables() - synchronize page tables
>   include/linux/hmm.h: * will get callbacks through sync_cpu_device_pagetables() operation (see
>   include/linux/hmm.h:/* sync_cpu_device_pagetables() - synchronize page tables
>   include/linux/hmm.h:void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
>   include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
>   mm/hmm.c:   mirror->ops->sync_cpu_device_pagetables(mirror, action,
> 
> That is, this API seems to be currently used by only out-of-tree users. Since
> we can't check that nobody has memory allocation dependency, I think that
> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for 
> now.

The code expects that the invalidate_range_end doesn't block if
invalidate_range_start hasn't blocked. That is the reason why the end
callback doesn't have blockable parameter. If this doesn't hold then the
whole scheme is just fragile because those two calls should pair.

-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
> Two more worries for this patch.
> 
> 
> 
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> >   *
> >   * @amn: our notifier
> >   */
> > -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> > +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
> >  {
> > -   mutex_lock(&amn->read_lock);
> > +   if (blockable)
> > +       mutex_lock(&amn->read_lock);
> > +   else if (!mutex_trylock(&amn->read_lock))
> > +       return -EAGAIN;
> > +
> >     if (atomic_inc_return(&amn->recursion) == 1)
> >         down_read_non_owner(&amn->lock);
> 
> Why don't we need to use trylock here if blockable == false ?
> Want comment why it is safe to use blocking lock here.

Hmm, I am pretty sure I have checked the code but it was quite confusing
so I might have missed something. Double checking now, it seems that
this read_lock is not used anywhere else and it is not _the_ lock we are
interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
it is taken in exclusive mode for expensive operations.

Is that correct Christian? If this is correct then we need to update the
locking here. I am struggling to grasp the ref counting part. Why cannot
all readers simply take the lock rather than rely on somebody else to
take it? 1ed3d2567c800 didn't really help me to understand the locking
scheme here so any help would be appreciated.

I am wondering why we cannot do
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..93034178673d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,14 +180,11 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-   if (blockable)
-       mutex_lock(&amn->read_lock);
-   else if (!mutex_trylock(&amn->read_lock))
-       return -EAGAIN;
-
-   if (atomic_inc_return(&amn->recursion) == 1)
-       down_read_non_owner(&amn->lock);
-   mutex_unlock(&amn->read_lock);
+   if (!down_read_trylock(&amn->lock)) {
+       if (!blockable)
+           return -EAGAIN;
+       down_read(&amn->lock);
+   }
 
return 0;
 }
@@ -199,8 +196,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool 
blockable)
  */
 static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
 {
-   if (atomic_dec_return(&amn->recursion) == 0)
-       up_read_non_owner(&amn->lock);
+   up_read(&amn->lock);
 }
 
 /**

-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-25 Thread David Rientjes
On Tue, 24 Jul 2018, Michal Hocko wrote:

> oom_reap_task_mm should return false when __oom_reap_task_mm returns
> false. This is what my patch did but it seems this was changed by
> http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch
> so that one should be fixed.
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 104ef4a01a55..88657e018714 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, 
> struct mm_struct *mm)
>   /* failed to reap part of the address space. Try again later */
>   if (!__oom_reap_task_mm(mm)) {
>   up_read(&mm->mmap_sem);
> - return true;
> + return false;
>   }
>  
>   pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, 
> file-rss:%lukB, shmem-rss:%lukB\n",
> 
> 
> On top of that the proposed cleanup looks as follows:
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 88657e018714..4e185a282b3d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -541,8 +541,16 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
>   return ret;
>  }
>  
> +/*
> + * Reaps the address space of the given task.
> + *
> + * Returns true on success and false if none or part of the address space
> + * has been reclaimed and the caller should retry later.
> + */
>  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  {
> + bool ret = true;
> +
>   if (!down_read_trylock(&mm->mmap_sem)) {
>   trace_skip_task_reaping(tsk->pid);
>   return false;
> @@ -555,28 +563,28 @@ static bool oom_reap_task_mm(struct task_struct *tsk, 
> struct mm_struct *mm)
>* down_write();up_write() cycle in exit_mmap().
>*/
>   if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
> - up_read(&mm->mmap_sem);
>   trace_skip_task_reaping(tsk->pid);
> - return true;
> + goto out_unlock;
>   }
>  
>   trace_start_task_reaping(tsk->pid);
>  
>   /* failed to reap part of the address space. Try again later */
> - if (!__oom_reap_task_mm(mm)) {
> - up_read(&mm->mmap_sem);
> - return false;
> - }
> + ret = __oom_reap_task_mm(mm);
> + if (!ret)
> + goto out_finish;
>  
>   pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, 
> file-rss:%lukB, shmem-rss:%lukB\n",
>   task_pid_nr(tsk), tsk->comm,
>   K(get_mm_counter(mm, MM_ANONPAGES)),
>   K(get_mm_counter(mm, MM_FILEPAGES)),
>   K(get_mm_counter(mm, MM_SHMEMPAGES)));
> +out_finish:
> + trace_finish_task_reaping(tsk->pid);
> +out_unlock:
>   up_read(&mm->mmap_sem);
>  
> - trace_finish_task_reaping(tsk->pid);
> - return true;
> + return ret;
>  }
>  
>  #define MAX_OOM_REAP_RETRIES 10

I think we still want to trace when reaping was skipped to know that the 
oom reaper will retry again later.



mm/oom_kill.c: clean up oom_reap_task_mm() fix

indicate reaping has been partially skipped so we can expect future skips 
or another start before finish.

Signed-off-by: David Rientjes 
---
 mm/oom_kill.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -569,10 +569,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, 
struct mm_struct *mm)
 
trace_start_task_reaping(tsk->pid);
 
-   /* failed to reap part of the address space. Try again later */
ret = __oom_reap_task_mm(mm);
-   if (!ret)
+   if (!ret) {
+   /* Failed to reap part of the address space. Try again later */
+   trace_skip_task_reaping(tsk->pid);
goto out_finish;
+   }
 
pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, 
file-rss:%lukB, shmem-rss:%lukB\n",
task_pid_nr(tsk), tsk->comm,


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-25 Thread Michal Hocko
On Tue 24-07-18 12:53:07, Andrew Morton wrote:
[...]
> > On top of that the proposed cleanup looks as follows:
> > 
> 
> Looks good to me.  Seems a bit strange that we omit the pr_info()
> output if the mm was partially reaped - people would still want to know
> this?   Not very important though.

I think that having a single output once we are done is better but I do
not have a strong opinion on this.

Btw. here is the changelog for the cleanup.

"
Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably
 - Undocumented return value.

 - comment "failed to reap part..." is misleading - sounds like it's
   referring to something which happened in the past, is in fact
   referring to something which might happen in the future.

 - fails to call trace_finish_task_reaping() in one case

 - code duplication.

 - Increases mmap_sem hold time a little by moving
   trace_finish_task_reaping() inside the locked region.  So sue me ;)

 - Sharing the finish: path means that the trace event won't
   distinguish between the two sources of finishing.

Add a short explanation for the return value and fix the rest by
reorganizing the function a bit to have unified function exit paths.

Suggested-by: Andrew Morton 
Signed-off-by: Michal Hocko 
"

-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-25 Thread Michal Hocko
On Tue 24-07-18 14:07:49, David Rientjes wrote:
[...]
> mm/oom_kill.c: clean up oom_reap_task_mm() fix
> 
> indicate reaping has been partially skipped so we can expect future skips 
> or another start before finish.

But we are not skipping. This is essentially the same case as mmap_sem
trylock fail. Maybe we can add a bool parameter to trace_finish_task_reaping
to denote partial success?

> Signed-off-by: David Rientjes 
> ---
>  mm/oom_kill.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -569,10 +569,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, 
> struct mm_struct *mm)
>  
>   trace_start_task_reaping(tsk->pid);
>  
> - /* failed to reap part of the address space. Try again later */
>   ret = __oom_reap_task_mm(mm);
> - if (!ret)
> + if (!ret) {
> + /* Failed to reap part of the address space. Try again later */
> + trace_skip_task_reaping(tsk->pid);
>   goto out_finish;
> + }
>  
>   pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, 
> file-rss:%lukB, shmem-rss:%lukB\n",
>   task_pid_nr(tsk), tsk->comm,

-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-24 Thread Andrew Morton
On Tue, 24 Jul 2018 16:17:47 +0200 Michal Hocko  wrote:

> On Fri 20-07-18 17:09:02, Andrew Morton wrote:
> [...]
> > - Undocumented return value.
> > 
> > - comment "failed to reap part..." is misleading - sounds like it's
> >   referring to something which happened in the past, is in fact
> >   referring to something which might happen in the future.
> > 
> > - fails to call trace_finish_task_reaping() in one case
> > 
> > - code duplication.
> > 
> > - Increases mmap_sem hold time a little by moving
> >   trace_finish_task_reaping() inside the locked region.  So sue me ;)
> > 
> > - Sharing the finish: path means that the trace event won't
> >   distinguish between the two sources of finishing.
> > 
> > Please take a look?
> 
> > oom_reap_task_mm should return false when __oom_reap_task_mm returns
> > false. This is what my patch did but it seems this was changed by
> http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch
> so that one should be fixed.
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 104ef4a01a55..88657e018714 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, 
> struct mm_struct *mm)
>   /* failed to reap part of the address space. Try again later */
>   if (!__oom_reap_task_mm(mm)) {
>   up_read(&mm->mmap_sem);
> - return true;
> + return false;
>   }
>  
>   pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, 
> file-rss:%lukB, shmem-rss:%lukB\n",

OK, thanks, I added that.

> 
> On top of that the proposed cleanup looks as follows:
> 

Looks good to me.  Seems a bit strange that we omit the pr_info()
output if the mm was partially reaped - people would still want to know
this?   Not very important though.



Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-24 Thread Michal Hocko
On Fri 20-07-18 17:09:02, Andrew Morton wrote:
[...]
> - Undocumented return value.
> 
> - comment "failed to reap part..." is misleading - sounds like it's
>   referring to something which happened in the past, is in fact
>   referring to something which might happen in the future.
> 
> - fails to call trace_finish_task_reaping() in one case
> 
> - code duplication.
> 
> - Increases mmap_sem hold time a little by moving
>   trace_finish_task_reaping() inside the locked region.  So sue me ;)
> 
> - Sharing the finish: path means that the trace event won't
>   distinguish between the two sources of finishing.
> 
> Please take a look?

oom_reap_task_mm should return false when __oom_reap_task_mm returns
false. This is what my patch did but it seems this was changed by
http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch
so that one should be fixed.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 104ef4a01a55..88657e018714 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, 
struct mm_struct *mm)
    /* failed to reap part of the address space. Try again later */
    if (!__oom_reap_task_mm(mm)) {
        up_read(&mm->mmap_sem);
-       return true;
+       return false;
    }
 
pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, 
file-rss:%lukB, shmem-rss:%lukB\n",


On top of that the proposed cleanup looks as follows:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 88657e018714..4e185a282b3d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -541,8 +541,16 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
return ret;
 }
 
+/*
+ * Reaps the address space of the given task.
+ *
+ * Returns true on success and false if none or part of the address space
+ * has been reclaimed and the caller should retry later.
+ */
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
+   bool ret = true;
+
    if (!down_read_trylock(&mm->mmap_sem)) {
        trace_skip_task_reaping(tsk->pid);
        return false;
@@ -555,28 +563,28 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
     * down_write();up_write() cycle in exit_mmap().
     */
    if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
-       up_read(&mm->mmap_sem);
        trace_skip_task_reaping(tsk->pid);
-       return true;
+       goto out_unlock;
    }
 
    trace_start_task_reaping(tsk->pid);
 
    /* failed to reap part of the address space. Try again later */
-   if (!__oom_reap_task_mm(mm)) {
-       up_read(&mm->mmap_sem);
-       return false;
-   }
+   ret = __oom_reap_task_mm(mm);
+   if (!ret)
+       goto out_finish;
 
    pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
            task_pid_nr(tsk), tsk->comm,
            K(get_mm_counter(mm, MM_ANONPAGES)),
            K(get_mm_counter(mm, MM_FILEPAGES)),
            K(get_mm_counter(mm, MM_SHMEMPAGES)));
+out_finish:
+   trace_finish_task_reaping(tsk->pid);
+out_unlock:
    up_read(&mm->mmap_sem);
 
-   trace_finish_task_reaping(tsk->pid);
-   return true;
+   return ret;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
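
The control flow of the cleanup above can be modelled in user space as
follows. This is only a sketch under stated assumptions: struct
demo_stats and demo_reap() are hypothetical, the trace events and
mmap_sem are replaced by counters, and the three bool arguments stand in
for the trylock result, the MMF_OOM_SKIP test and the return value of
__oom_reap_task_mm(). It shows that every path which took the lock drops
it exactly once and that "finish" is recorded whenever "start" was.

```c
#include <stdbool.h>

/* Hypothetical counters standing in for the trace events and mmap_sem. */
struct demo_stats {
    int skipped;    /* models trace_skip_task_reaping() calls */
    int started;    /* models trace_start_task_reaping() calls */
    int finished;   /* models trace_finish_task_reaping() calls */
    int lock_depth; /* +1 on lock, -1 on unlock; must end at 0 */
};

/* Returns true on success, false if the caller should retry later. */
static bool demo_reap(struct demo_stats *s, bool lock_ok, bool skip_flag,
                      bool reap_ok)
{
    bool ret = true;

    if (!lock_ok) {             /* models a failed down_read_trylock() */
        s->skipped++;
        return false;
    }
    s->lock_depth++;

    if (skip_flag) {            /* models MMF_OOM_SKIP already being set */
        s->skipped++;
        goto out_unlock;
    }

    s->started++;
    ret = reap_ok;              /* models __oom_reap_task_mm() */
    if (!ret)
        goto out_finish;

    /* the success-only report would be printed here */
out_finish:
    s->finished++;
out_unlock:
    s->lock_depth--;
    return ret;
}
```

A partial reap therefore returns false but still records a finish event,
which is the case the follow-up discussion about tracing is concerned
with.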
-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-23 Thread Michal Hocko
On Fri 20-07-18 16:01:25, Andrew Morton wrote:
> On Tue, 17 Jul 2018 10:12:01 +0200 Michal Hocko  wrote:
> 
> > > Any suggestions regarding how the driver developers can test this code
> > > path?  I don't think we presently have a way to fake an oom-killing
> > > event?  Perhaps we should add such a thing, given the problems we're
> > > having with that feature.
> > 
> > The simplest way is to wrap the userspace code which uses these notifiers
> > in a memcg and set the hard limit to hit the oom. This can be done
> > e.g. after the test faults in all the mmu notifier managed memory, by
> > setting the hard limit to something really small. Then we are looking for a
> > proper process tear down.
> 
> Chances are, some of the intended audience don't know how to do this
> and will either have to hunt down a lot of documentation or will just
> not test it.  But we want them to test it, so a little worked step-by-step
> example would help things along please.

I am willing to give more specific steps. Is anybody interested? From my
experience so far this is not something driver developers using mmu
notifiers would be unfamiliar with.

-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-23 Thread Michal Hocko
On Mon 23-07-18 09:11:54, Michal Hocko wrote:
> On Mon 23-07-18 09:03:06, Michal Hocko wrote:
> > On Fri 20-07-18 17:09:02, Andrew Morton wrote:
> > [...]
> > > Please take a look?
> > 
> > Are you OK to have these in a separate patch?
> 
> Btw. I will rebase this patch once oom stuff in linux-next settles. At
> least oom_lock removal from oom_reaper will conflict.

Hmm, I have just checked Andrew's akpm and the patch is already in and
Andrew has resolved the conflict with the oom_lock patch. It just seems
that linux-next (next-20180720) doesn't have the newest mmotm tree.

Anyway, I will go with the incremental cleanup patch per Andrew's
comments as soon as linux-next catches up.

-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-23 Thread Michal Hocko
On Mon 23-07-18 09:03:06, Michal Hocko wrote:
> On Fri 20-07-18 17:09:02, Andrew Morton wrote:
> [...]
> > Please take a look?
> 
> Are you OK to have these in a separate patch?

Btw. I will rebase this patch once oom stuff in linux-next settles. At
least oom_lock removal from oom_reaper will conflict.

-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-23 Thread Michal Hocko
On Fri 20-07-18 17:09:02, Andrew Morton wrote:
[...]
> Please take a look?

Are you OK to have these in a separate patch?

-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-20 Thread Andrew Morton
On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko  wrote:

> From: Michal Hocko 
> 
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks.
> 
> ...
>
> @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, 
> struct mm_struct *mm)
>  
>   trace_start_task_reaping(tsk->pid);
>  
> - __oom_reap_task_mm(mm);
> + /* failed to reap part of the address space. Try again later */
> + if (!__oom_reap_task_mm(mm)) {
> + up_read(&mm->mmap_sem);
> + ret = false;
> + goto unlock_oom;
> + }

This function is starting to look a bit screwy.

: static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
: {
:   if (!down_read_trylock(&mm->mmap_sem)) {
:       trace_skip_task_reaping(tsk->pid);
:       return false;
:   }
: 
:   /*
:    * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
:    * work on the mm anymore. The check for MMF_OOM_SKIP must run
:    * under mmap_sem for reading because it serializes against the
:    * down_write();up_write() cycle in exit_mmap().
:    */
:   if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
:       up_read(&mm->mmap_sem);
:       trace_skip_task_reaping(tsk->pid);
:       return true;
:   }
: 
:   trace_start_task_reaping(tsk->pid);
: 
:   /* failed to reap part of the address space. Try again later */
:   if (!__oom_reap_task_mm(mm)) {
:       up_read(&mm->mmap_sem);
:       return true;
:   }
: 
:   pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
:           task_pid_nr(tsk), tsk->comm,
:           K(get_mm_counter(mm, MM_ANONPAGES)),
:           K(get_mm_counter(mm, MM_FILEPAGES)),
:           K(get_mm_counter(mm, MM_SHMEMPAGES)));
:   up_read(&mm->mmap_sem);
: 
:   trace_finish_task_reaping(tsk->pid);
:   return true;
: }

- Undocumented return value.

- comment "failed to reap part..." is misleading - sounds like it's
  referring to something which happened in the past, is in fact
  referring to something which might happen in the future.

- fails to call trace_finish_task_reaping() in one case

- code duplication.


I'm thinking it wants to be something like this?

: /*
:  * Return true if we successfully acquired (then released) mmap_sem
:  */
: static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
: {
:   if (!down_read_trylock(&mm->mmap_sem)) {
:       trace_skip_task_reaping(tsk->pid);
:       return false;
:   }
: 
:   /*
:    * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
:    * work on the mm anymore. The check for MMF_OOM_SKIP must run
:    * under mmap_sem for reading because it serializes against the
:    * down_write();up_write() cycle in exit_mmap().
:    */
:   if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
:       trace_skip_task_reaping(tsk->pid);
:       goto out;
:   }
: 
:   trace_start_task_reaping(tsk->pid);
: 
:   if (!__oom_reap_task_mm(mm)) {
:       /* Failed to reap part of the address space. Try again later */
:       goto finish;
:   }
: 
:   pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
:           task_pid_nr(tsk), tsk->comm,
:           K(get_mm_counter(mm, MM_ANONPAGES)),
:           K(get_mm_counter(mm, MM_FILEPAGES)),
:           K(get_mm_counter(mm, MM_SHMEMPAGES)));
: finish:
:   trace_finish_task_reaping(tsk->pid);
: out:
:   up_read(&mm->mmap_sem);
:   return true;
: }

- Increases mmap_sem hold time a little by moving
  trace_finish_task_reaping() inside the locked region.  So sue me ;)

- Sharing the finish: path means that the trace event won't
  distinguish between the two sources of finishing.

Please take a look?


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-20 Thread Andrew Morton
On Tue, 17 Jul 2018 10:12:01 +0200 Michal Hocko  wrote:

> > Any suggestions regarding how the driver developers can test this code
> > path?  I don't think we presently have a way to fake an oom-killing
> > event?  Perhaps we should add such a thing, given the problems we're
> > having with that feature.
> 
> The simplest way is to wrap the userspace code which uses these notifiers
> in a memcg and set the hard limit to hit the oom. This can be done
> e.g. after the test faults in all the mmu notifier managed memory, by
> setting the hard limit to something really small. Then we are looking for a
> proper process tear down.

Chances are, some of the intended audience don't know how to do this
and will either have to hunt down a lot of documentation or will just
not test it.  But we want them to test it, so a little worked step-by-step
example would help things along please.


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-19 Thread Michal Hocko
Does anybody see any reasons why this should get into the mmotm tree?
I do not want to rush this in, but if the general feeling is to push it for
the upcoming merge window then I will not object.
-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-17 Thread Michal Hocko
On Mon 16-07-18 16:12:49, Andrew Morton wrote:
> On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko  wrote:
> 
> > From: Michal Hocko 
> > 
> > There are several blockable mmu notifiers which might sleep in
> > mmu_notifier_invalidate_range_start and that is a problem for the
> > oom_reaper because it needs to guarantee forward progress so it cannot
> > depend on any sleepable locks.
> > 
> > Currently we simply back off and mark an oom victim with blockable mmu
> > notifiers as done after a short sleep. That can result in selecting a
> > new oom victim prematurely because the previous one still hasn't torn
> > its memory down yet.
> > 
> > We can do much better though. Even if mmu notifiers use sleepable locks
> > there is no reason to automatically assume those locks are held.
> > Moreover, the majority of notifiers only care about a portion of the address
> > space and there is absolutely zero reason to fail when we are unmapping an
> > unrelated range. Many notifiers do really block and wait for HW which is
> > harder to handle and we have to bail out though.
> > 
> > This patch handles the low-hanging fruit.
> > __mmu_notifier_invalidate_range_start gets a blockable flag, and
> > callbacks are not allowed to sleep if the flag is set to false. This is
> > achieved by using trylock instead of the sleepable lock for most
> > callbacks and continuing as long as we do not block down the call chain.
> 
> I assume device driver developers are wondering "what does this mean
> for me".  As I understand it, the only time they will see
> blockable==false is when their driver is being called in response to an
> out-of-memory condition, yes?  So it is a very rare thing.

Yes, this is the case right now. Maybe we will grow other users in the
future. Those potential users are the reason why I used blockable
rather than oom as the parameter name.

> Any suggestions regarding how the driver developers can test this code
> path?  I don't think we presently have a way to fake an oom-killing
> event?  Perhaps we should add such a thing, given the problems we're
> having with that feature.

The simplest way is to wrap the userspace code which uses these notifiers
into a memcg and set the hard limit low enough to hit the oom. This can be
done e.g. by letting the test fault in all the mmu notifier managed memory
and then setting the hard limit to something really small. Then we are
looking for a proper process tear down.
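
As a rough illustration of those steps (a sketch under stated assumptions,
not a tested recipe): the helper below joins an existing memcg, faults in
memory and then shrinks the hard limit. It assumes a cgroup v2 hierarchy
where the directory passed in already exists with the memory controller
enabled; cgroup.procs and memory.max are the cgroup v2 control files, while
write_str() and oom_self_in_memcg() are names made up for this example. The
malloc'ed buffer only stands in for the mmu notifier managed memory a real
driver test would set up.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write a short string to <dir>/<file>; returns 0 on success, -1 on error. */
static int write_str(const char *dir, const char *file, const char *val)
{
	char path[4096];
	FILE *f;
	int ok;

	assert(dir && file && val);
	if (snprintf(path, sizeof(path), "%s/%s", dir, file) >= (int)sizeof(path))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(val, f) >= 0;
	if (fclose(f) != 0)	/* write errors surface when the buffer is flushed */
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Hypothetical test driver:
 *  1) move the calling process into the memcg at 'cg',
 *  2) fault in 'bytes' of memory (placeholder for driver-managed memory),
 *  3) lower the hard limit to 'tiny_limit' (e.g. "1M").
 * If the kernel OOM-kills this task during step 3, the call never returns.
 */
int oom_self_in_memcg(const char *cg, size_t bytes, const char *tiny_limit)
{
	char pid[32];
	char *buf;

	snprintf(pid, sizeof(pid), "%ld", (long)getpid());
	if (write_str(cg, "cgroup.procs", pid))
		return -1;

	buf = malloc(bytes);
	if (!buf)
		return -1;
	memset(buf, 0xaa, bytes);	/* touch every page so it is really charged */

	/* A real test would register 'buf' with the driver under test here. */
	return write_str(cg, "memory.max", tiny_limit);
}
```

After the task is gone, the kernel log should show whether the oom_reaper
managed to reap it. With swap enabled the kernel may reclaim instead of
killing, so the limit and buffer size probably need tuning per setup.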

> > I think we can improve that even further because there is a common
> > pattern to do a range lookup first and then do something about that.
> > The first part can be done without a sleeping lock in most cases AFAICS.
> > 
> > The oom_reaper end then simply retries if there is at least one notifier
> > which couldn't make any progress in !blockable mode. A retry loop is
> > already implemented to wait for the mmap_sem and this is basically the
> > same thing.
> > 
> > ...
> >
> > +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> > + unsigned long start, unsigned long end)
> > +{
> > +   int ret = 0;
> > +   if (mm_has_notifiers(mm))
> > +       ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> > +
> > +   return ret;
> >  }
> 
> nit,
> 
> {
>   if (mm_has_notifiers(mm))
>       return __mmu_notifier_invalidate_range_start(mm, start, end, false);
>   return 0;
> }
> 
> would suffice.

Sure. Fixed
 
> > 
> > ...
> >
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
> >  * reliably test it.
> >  */
> > mutex_lock(&oom_lock);
> > -   __oom_reap_task_mm(mm);
> > +   (void)__oom_reap_task_mm(mm);
> > mutex_unlock(&oom_lock);
> 
> What does this do?

There is no error to be returned here as the comment above explains
 * Nothing can be holding mm->mmap_sem here and the above call
 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 * __oom_reap_task_mm() will not block.
-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-16 Thread Leon Romanovsky
On Mon, Jul 16, 2018 at 04:12:49PM -0700, Andrew Morton wrote:
> On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko  wrote:
>
> > From: Michal Hocko 
> >
> > There are several blockable mmu notifiers which might sleep in
> > mmu_notifier_invalidate_range_start and that is a problem for the
> > oom_reaper because it needs to guarantee forward progress so it cannot
> > depend on any sleepable locks.
> >
> > Currently we simply back off and mark an oom victim with blockable mmu
> > notifiers as done after a short sleep. That can result in selecting a
> > new oom victim prematurely because the previous one still hasn't torn
> > its memory down yet.
> >
> > We can do much better though. Even if mmu notifiers use sleepable locks
> > there is no reason to automatically assume those locks are held.
> > Moreover, the majority of notifiers only care about a portion of the address
> > space and there is absolutely zero reason to fail when we are unmapping an
> > unrelated range. Many notifiers do really block and wait for HW which is
> > harder to handle and we have to bail out though.
> >
> > This patch handles the low-hanging fruit.
> > __mmu_notifier_invalidate_range_start gets a blockable flag, and
> > callbacks are not allowed to sleep if the flag is set to false. This is
> > achieved by using trylock instead of the sleepable lock for most
> > callbacks and continuing as long as we do not block down the call chain.
>
> I assume device driver developers are wondering "what does this mean
> for me".  As I understand it, the only time they will see
> blockable==false is when their driver is being called in response to an
> out-of-memory condition, yes?  So it is a very rare thing.

I can't speak for everyone, but at least for me (mlx5), it is not a rare event.
I'm seeing OOM very often while running my tests in low-memory VMs.

Thanks

>
> Any suggestions regarding how the driver developers can test this code
> path?  I don't think we presently have a way to fake an oom-killing
> event?  Perhaps we should add such a thing, given the problems we're
> having with that feature.




Re: [Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-16 Thread Andrew Morton
On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko  wrote:

> From: Michal Hocko 
> 
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee forward progress so it cannot
> depend on any sleepable locks.
> 
> Currently we simply back off and mark an oom victim with blockable mmu
> notifiers as done after a short sleep. That can result in selecting a
> new oom victim prematurely because the previous one still hasn't torn
> its memory down yet.
> 
> We can do much better though. Even if mmu notifiers use sleepable locks
> there is no reason to automatically assume those locks are held.
> Moreover, the majority of notifiers only care about a portion of the address
> space and there is absolutely zero reason to fail when we are unmapping an
> unrelated range. Many notifiers do really block and wait for HW which is
> harder to handle and we have to bail out though.
> 
> This patch handles the low-hanging fruit.
> __mmu_notifier_invalidate_range_start gets a blockable flag, and
> callbacks are not allowed to sleep if the flag is set to false. This is
> achieved by using trylock instead of the sleepable lock for most
> callbacks and continuing as long as we do not block down the call chain.

I assume device driver developers are wondering "what does this mean
for me".  As I understand it, the only time they will see
blockable==false is when their driver is being called in response to an
out-of-memory condition, yes?  So it is a very rare thing.

Any suggestions regarding how the driver developers can test this code
path?  I don't think we presently have a way to fake an oom-killing
event?  Perhaps we should add such a thing, given the problems we're
having with that feature.

> I think we can improve that even further because there is a common
> pattern to do a range lookup first and then do something about that.
> The first part can be done without a sleeping lock in most cases AFAICS.
> 
> The oom_reaper end then simply retries if there is at least one notifier
> which couldn't make any progress in !blockable mode. A retry loop is
> already implemented to wait for the mmap_sem and this is basically the
> same thing.
> 
> ...
>
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +   unsigned long start, unsigned long end)
> +{
> + int ret = 0;
> + if (mm_has_notifiers(mm))
> +     ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> +
> + return ret;
>  }

nit,

{
	if (mm_has_notifiers(mm))
		return __mmu_notifier_invalidate_range_start(mm, start, end, false);
	return 0;
}

would suffice.


> 
> ...
>
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
>* reliably test it.
>*/
>   mutex_lock(&oom_lock);
> - __oom_reap_task_mm(mm);
> + (void)__oom_reap_task_mm(mm);
>   mutex_unlock(&oom_lock);

What does this do?

>   set_bit(MMF_OOM_SKIP, &mm->flags);
> 
> ...
>



[Intel-gfx] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-16 Thread Michal Hocko
From: Michal Hocko 

There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee forward progress so it cannot
depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a
new oom victim prematurely because the previous one still hasn't torn
its memory down yet.

We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held.
Moreover, the majority of notifiers only care about a portion of the address
space and there is absolutely zero reason to fail when we are unmapping an
unrelated range. Many notifiers do really block and wait for HW which is
harder to handle and we have to bail out though.

This patch handles the low-hanging fruit. __mmu_notifier_invalidate_range_start
gets a blockable flag, and callbacks are not allowed to sleep if the
flag is set to false. This is achieved by using trylock instead of the
sleepable lock for most callbacks and continuing as long as we do not
block down the call chain.
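
For illustration only (this is not code from the patch): a minimal userspace
sketch of the trylock-or-bail pattern described above. A pthread mutex stands
in for a notifier's sleepable lock, and notifier_lock() and
invalidate_range_start() are names invented for this example. The -EAGAIN
return value is an assumption chosen for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Take 'lock' according to the blockable mode (hypothetical helper).
 * blockable == true:  may sleep until the lock is available.
 * blockable == false: must not sleep, so only try the lock and report
 *                     -EAGAIN if somebody else holds it.
 */
static int notifier_lock(pthread_mutex_t *lock, bool blockable)
{
	assert(lock != NULL);

	if (blockable)
		return -pthread_mutex_lock(lock);
	if (pthread_mutex_trylock(lock) != 0)
		return -EAGAIN;
	return 0;
}

/* Stand-in for a notifier callback: do the work only if the lock was taken. */
int invalidate_range_start(pthread_mutex_t *lock, bool blockable)
{
	int ret = notifier_lock(lock, blockable);

	if (ret)
		return ret;	/* caller (e.g. the reaper) is expected to retry */

	/* ... tear down mappings for the range here ... */

	pthread_mutex_unlock(lock);
	return 0;
}
```

A caller in non-blockable mode treats a non-zero return as "no progress was
possible right now" rather than as a hard failure.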

I think we can improve that even further because there is a common
pattern to do a range lookup first and then do something about that.
The first part can be done without a sleeping lock in most cases AFAICS.

The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.
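
The retry idea can be sketched in userspace as follows. This is an
illustration under assumptions, not the oom_reaper code: MAX_REAP_RETRIES,
the 1ms pause, reap_with_retries() and the example callback are all made up
for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define MAX_REAP_RETRIES 10	/* illustrative bound, not taken from the patch */

typedef bool (*reap_fn)(void *ctx);

/*
 * Call try_reap() until it reports success or the retry budget is used up,
 * pausing briefly between attempts. The number of attempts made is stored
 * in *attempts_out when that pointer is non-NULL.
 */
bool reap_with_retries(reap_fn try_reap, void *ctx, int *attempts_out)
{
	const struct timespec pause = { .tv_sec = 0, .tv_nsec = 1000 * 1000 };
	int attempts = 0;
	bool done = false;

	assert(try_reap != NULL);
	while (attempts < MAX_REAP_RETRIES) {
		attempts++;
		if (try_reap(ctx)) {
			done = true;
			break;
		}
		nanosleep(&pause, NULL);	/* give lock holders a chance to finish */
	}
	if (attempts_out)
		*attempts_out = attempts;
	return done;
}

/* Example callback: fails until the countdown stored in *ctx reaches zero. */
bool reap_after_countdown(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return false;
	}
	return true;
}
```

The same loop covers both failure sources mentioned above: it does not care
whether an attempt failed on the mmap_sem or on a notifier that could not
proceed in !blockable mode.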

Changes since rfc v1
- gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
  on a lock and amdgpu_mn_invalidate_node on an unbounded timeout), so make
  sure we bail out when we have an intersecting range, for starters
- log when a notifier fails, for easier debugging
- back off early in ib_umem_notifier_invalidate_range_start if the
  callback is called
- mn_invl_range_start waits for completion down the unmap_grant_pages
  path, so we have to back off early on overlapping ranges

Cc: "David (ChunMing) Zhou" 
Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Alex Deucher 
Cc: David Airlie 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: Doug Ledford 
Cc: Jason Gunthorpe 
Cc: Mike Marciniszyn 
Cc: Dennis Dalessandro 
Cc: Sudeep Dutt 
Cc: Ashutosh Dixit 
Cc: Dimitri Sivanich 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: "Jérôme Glisse" 
Cc: Andrea Arcangeli 
Cc: Felix Kuehling 
Cc: k...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: amd-...@lists.freedesktop.org
Cc: dri-de...@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-r...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Cc: linux...@kvack.org
Acked-by: Christian König  # AMD notifiers
Acked-by: Leon Romanovsky  # mlx and umem_odp
Reported-by: David Rientjes 
Signed-off-by: Michal Hocko 
---

Hi,
there were no major objections when I sent this as an RFC the last time
[1]. I was hoping for more feedback from the driver side because I am
touching code I have no way to test. On the other hand, the pattern
is quite simple and consistent across all users, so hopefully there
shouldn't be any large surprises.

Any further review would be highly appreciated of course. But is this
something to put into the mm tree now?

[1] http://lkml.kernel.org/r/20180627074421.gf32...@dhcp22.suse.cz


 arch/x86/kvm/x86.c                      |  7 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++-
 drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++--
 drivers/gpu/drm/radeon/radeon_mn.c      | 22 +++--
 drivers/infiniband/core/umem_odp.c      | 33 +++
 drivers/infiniband/hw/hfi1/mmu_rb.c     | 11 ---
 drivers/infiniband/hw/mlx5/odp.c        |  2 +-
 drivers/misc/mic/scif/scif_dma.c        |  7 ++--
 drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++--
 drivers/xen/gntdev.c                    | 44 -
 include/linux/kvm_host.h                |  4 +--
 include/linux/mmu_notifier.h            | 35 +++-
 include/linux/oom.h                     |  2 +-
 include/rdma/ib_umem_odp.h              |  3 +-
 mm/hmm.c                                |  7 ++--
 mm/mmap.c                               |  2 +-
 mm/mmu_notifier.c                       | 19 ---
 mm/oom_kill.c                           | 29 
 virt/kvm/kvm_main.c                     | 15 ++---
 19 files changed, 225 insertions(+), 80 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bcecc325e7e..ac08f5d711be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
 }
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-   unsigned long start, unsigned long end)
+int