Re: [PATCH 0/9] mm: improve OOM mechanism v2

2015-05-23 Thread Tetsuo Handa
Michal Hocko wrote:
> On Thu 30-04-15 18:44:25, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > I mean we should eventually fail all the allocation types but GFP_NOFS
> > > is coming from _carefully_ handled code paths which is an easier starting
> > > point than a random code path in the kernel/drivers. So can we finally
> > > move at least in this direction?
> > 
> > I agree that all the allocation types can fail unless GFP_NOFAIL is given.
> > But I also expect that all the allocation types should not fail unless
> > order > PAGE_ALLOC_COSTLY_ORDER or GFP_NORETRY is given, or the allocating
> > task is chosen as an OOM victim.
> 
> Yeah, let's keep shooting ourselves in the foot and then look for
> workarounds to deal with it...
>  
> > We already experienced with Linux 3.19 what happens when !__GFP_FS
> > allocations fail. out_of_memory() is called by pagefault_out_of_memory()
> > when a 0x2015a (!__GFP_FS) allocation fails.
> 
> I have posted a patch to deal with this
> (http://marc.info/?l=linux-mm&m=142770374521952&w=2). There is no real
> reason to use GFP_NOFS from the page fault context just because the
> mapping _always_ insists on it. The page fault path simply _has_ to be
> GFP_FS safe; we are badly broken otherwise. That patch should go hand in
> hand with the "GFP_NOFS might fail" one. I haven't posted it yet because
> I was waiting for the merge window to close.
> 
Converting page fault allocations from GFP_NOFS to GFP_KERNEL is a different
problem for me. My concern is that failing/stalling GFP_NOFS/GFP_NOIO
allocations is more dangerous than failing GFP_KERNEL allocations.

> > This suggests to me that !__GFP_FS allocations are effectively an OOM
> > killer context. It is not fair to kill the thread which triggered a page
> > fault, for that thread may not be using much memory (unfair from a
> > memory usage point of view), or that thread may be the global init
> > (unfair because killing it takes down the entire system rather than
> > surviving by killing somebody else).
> 
> Why would we kill the faulting process?
> 
We can see that processes are killed by SIGBUS if we allow memory allocations
triggered by page faults to fail, can't we? I didn't catch what your question was.

> > Also, failing GFP_NOFS/GFP_NOIO allocations which are not triggered by
> > a page fault generally causes more damage (e.g. triggering filesystem
> > error actions) than surviving by killing somebody. Therefore, I think we
> > should not hesitate to invoke the OOM killer for !__GFP_FS allocations.
> 
> No, we should fix those places and use proper gfp flags rather than
> pretend that the problem doesn't exist and deal with all the side
> effects.

Do you think we can identify and fix such places and _backport_ the fixes
before customers bother us with unexplained hangups?

As Andrew Morton picked up patches 1 to 7 of this series, I reposted the
timeout-based OOM killing patch at
http://marc.info/?l=linux-mm&m=143239200805478&w=2 .
Please check it and point out what I'm misunderstanding.

> -- 
> Michal Hocko
> SUSE Labs
> 


Re: [PATCH 0/9] mm: improve OOM mechanism v2

2015-05-04 Thread Michal Hocko
On Mon 04-05-15 14:02:10, Johannes Weiner wrote:
> Hi Andrew,
> 
> since patches 8 and 9 are still controversial, would you mind picking
> up just 1-7 for now?  They're cleanups that are nice to have on their own.

Completely agreed.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/9] mm: improve OOM mechanism v2

2015-05-04 Thread Johannes Weiner
Hi Andrew,

since patches 8 and 9 are still controversial, would you mind picking
up just 1-7 for now?  They're cleanups that are nice to have on their own.

Thanks,
Johannes

On Mon, Apr 27, 2015 at 03:05:46PM -0400, Johannes Weiner wrote:
> There is a possible deadlock scenario between the page allocator and
> the OOM killer.  Most allocations currently retry forever inside the
> page allocator, but when the OOM killer is invoked the chosen victim
> might try taking locks held by the allocating task.  This series, on
> top of many cleanups in the allocator & OOM killer, grants such OOM-
> killing allocations access to the system's memory reserves in order
> for them to make progress without relying on their own kill to exit.
> 
> Changes since v1:
> - drop GFP_NOFS deadlock fix (Dave Chinner)
> - drop low-order deadlock fix (Michal Hocko)
> - fix missing oom_lock in sysrq+f (Michal Hocko)
> - fix PAGE_ALLOC_COSTLY retry condition (Michal Hocko)
> - ALLOC_NO_WATERMARKS only for OOM victims, not all killed tasks (Tetsuo Handa)
> - bump OOM wait timeout from 1s to 5s (Vlastimil Babka & Michal Hocko)
> 
>  drivers/staging/android/lowmemorykiller.c |   2 +-
>  drivers/tty/sysrq.c   |   2 +
>  include/linux/oom.h   |  12 +-
>  kernel/exit.c |   2 +-
>  mm/memcontrol.c   |  20 ++--
>  mm/oom_kill.c | 167 +++-
>  mm/page_alloc.c   | 161 ---
>  7 files changed, 137 insertions(+), 229 deletions(-)
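
To make the deadlock scenario above concrete, here is a minimal sketch with
illustrative names (this is not code from the series): the allocating task
holds a lock that the chosen OOM victim needs before it can exit, so each
waits on the other forever.

static DEFINE_MUTEX(shared_lock);

static void allocating_task(void)
{
	void *buf;

	mutex_lock(&shared_lock);
	/*
	 * A low-order GFP_KERNEL allocation loops inside the page
	 * allocator until it succeeds; under OOM it invokes the OOM
	 * killer and keeps retrying while still holding shared_lock.
	 */
	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
	kfree(buf);
	mutex_unlock(&shared_lock);
}

static void chosen_victim(void)
{
	/* The OOM victim must take shared_lock before it can exit... */
	mutex_lock(&shared_lock);	/* blocks behind allocating_task */
	/* ...so it can never free its memory: a deadlock. */
	mutex_unlock(&shared_lock);
}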


Re: [PATCH 0/9] mm: improve OOM mechanism v2

2015-04-30 Thread Michal Hocko
On Thu 30-04-15 18:44:25, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I mean we should eventually fail all the allocation types but GFP_NOFS
> > is coming from _carefully_ handled code paths which is an easier starting
> > point than a random code path in the kernel/drivers. So can we finally
> > move at least in this direction?
> 
> I agree that all the allocation types can fail unless GFP_NOFAIL is given.
> But I also expect that all the allocation types should not fail unless
> order > PAGE_ALLOC_COSTLY_ORDER or GFP_NORETRY is given, or the allocating
> task is chosen as an OOM victim.

Yeah, let's keep shooting ourselves in the foot and then look for
workarounds to deal with it...
 
> We already experienced with Linux 3.19 what happens when !__GFP_FS
> allocations fail. out_of_memory() is called by pagefault_out_of_memory()
> when a 0x2015a (!__GFP_FS) allocation fails.

I have posted a patch to deal with this
(http://marc.info/?l=linux-mm&m=142770374521952&w=2). There is no real
reason to use GFP_NOFS from the page fault context just because the
mapping _always_ insists on it. The page fault path simply _has_ to be
GFP_FS safe; we are badly broken otherwise. That patch should go hand in
hand with the "GFP_NOFS might fail" one. I haven't posted it yet because
I was waiting for the merge window to close.

> This suggests to me that !__GFP_FS allocations are effectively an OOM
> killer context. It is not fair to kill the thread which triggered a page
> fault, for that thread may not be using much memory (unfair from a
> memory usage point of view), or that thread may be the global init
> (unfair because killing it takes down the entire system rather than
> surviving by killing somebody else).

Why would we kill the faulting process?

> Also, failing GFP_NOFS/GFP_NOIO allocations which are not triggered by
> a page fault generally causes more damage (e.g. triggering filesystem
> error actions) than surviving by killing somebody. Therefore, I think we
> should not hesitate to invoke the OOM killer for !__GFP_FS allocations.

No, we should fix those places and use proper gfp flags rather than
pretend that the problem doesn't exist and deal with all the side
effects.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/9] mm: improve OOM mechanism v2

2015-04-30 Thread Tetsuo Handa
Michal Hocko wrote:
> I mean we should eventually fail all the allocation types but GFP_NOFS
> is coming from _carefully_ handled code paths which is an easier starting
> point than a random code path in the kernel/drivers. So can we finally
> move at least in this direction?

I agree that all the allocation types can fail unless GFP_NOFAIL is given.
But I also expect that all the allocation types should not fail unless
order > PAGE_ALLOC_COSTLY_ORDER or GFP_NORETRY is given, or the allocating
task is chosen as an OOM victim.

We already experienced with Linux 3.19 what happens when !__GFP_FS
allocations fail. out_of_memory() is called by pagefault_out_of_memory()
when a 0x2015a (!__GFP_FS) allocation fails. This suggests to me that
!__GFP_FS allocations are effectively an OOM killer context. It is not
fair to kill the thread which triggered a page fault, for that thread may
not be using much memory (unfair from a memory usage point of view), or
that thread may be the global init (unfair because killing it takes down
the entire system rather than surviving by killing somebody else).
Also, failing GFP_NOFS/GFP_NOIO allocations which are not triggered by
a page fault generally causes more damage (e.g. triggering filesystem
error actions) than surviving by killing somebody. Therefore, I think we
should not hesitate to invoke the OOM killer for !__GFP_FS allocations.
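
For reference, 0x2015a decodes as follows with the gfp flag values of that
era (my annotation; worth verifying against include/linux/gfp.h of Linux 3.19):

/*
 * 0x2015a, assuming Linux 3.19 ___GFP_* values:
 *
 *   0x02     __GFP_HIGHMEM
 *   0x08     __GFP_MOVABLE
 *   0x10     __GFP_WAIT
 *   0x40     __GFP_IO
 *   0x100    __GFP_COLD
 *   0x20000  __GFP_HARDWALL
 *
 * i.e. (GFP_HIGHUSER_MOVABLE | __GFP_COLD) & ~__GFP_FS -- a page cache
 * style allocation with __GFP_FS cleared, as seen on the readahead path.
 */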

> > Likewise, there is possibility that such memory reserve is used by threads
> > which the OOM victim is not waiting for, for malloc() + memset() causes
> > __GFP_FS allocations.
> 
> We cannot be certain without complete dependency tracking. This is
> just a heuristic.

Yes, we cannot be certain without complete dependency tracking. And doing
complete dependency tracking is too expensive to implement. Dave is
recommending that we focus on not triggering the OOM killer rather than
on how to handle corner cases in OOM conditions, isn't he? I still believe
that choosing more OOM victims upon timeout (which is a heuristic after
all) and invoking the OOM killer for !__GFP_FS allocations are the
cheapest and least surprising options. This is something like
automatically and periodically pressing SysRq-f on behalf of the system
administrator when the memory allocator cannot recover from a low-memory
situation.
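
A rough sketch of that idea, reusing the out_of_memory() call shown in the
sysrq patch further below (the helper and the interval are assumptions, not
a posted patch):

static void oom_timeout_fn(struct work_struct *unused);
static DECLARE_DELAYED_WORK(oom_timeout_work, oom_timeout_fn);

static void oom_timeout_fn(struct work_struct *unused)
{
	/* oom_victim_still_alive() is an assumed helper. */
	if (oom_victim_still_alive())
		out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
			      GFP_KERNEL, 0, NULL, true); /* like SysRq-f */
	schedule_delayed_work(&oom_timeout_work, 10 * HZ);
}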


Re: [PATCH 0/9] mm: improve OOM mechanism v2

2015-04-29 Thread Michal Hocko
On Thu 30-04-15 02:27:44, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 29-04-15 08:55:06, Johannes Weiner wrote:
> > > What we can do to mitigate this is tie the timeout to the setting of
> > > TIF_MEMDIE so that the wait is not 5s from the point of calling
> > > out_of_memory() but from the point of where TIF_MEMDIE was set.
> > > Subsequent allocations will then go straight to the reserves.
> > 
> > That would deplete the reserves very easily. Shouldn't we rather
> go the other way around? Allow OOM killer context to dive into memory
> > reserves some more (ALLOC_OOM on top of current ALLOC flags and
> > __zone_watermark_ok would allow an additional 1/4 of the reserves) and
> > start waiting for the victim after that reserve is depleted. We would
> > still have some room for TIF_MEMDIE to allocate, the reserves consumption
> > would be throttled somehow and the holders of resources would have some
> > chance to release them and allow the victim to die.
> 
> Does OOM killer context mean memory allocations which can call 
> out_of_memory()?

Yes, that was the idea, because the other contexts will not reclaim any
memory. Even those which invoke out_of_memory() will not kill a new task,
but the one killed task should compensate for the ALLOC_OOM part of the
memory reserves.

> If yes, there is no guarantee that such a memory reserve is used by the
> threads which the OOM victim is waiting for, for they might do only
> !__GFP_FS allocations.

OK, so we are back to GFP_NOFS. Right, those are your main pain point
because you can see i_mutex deadlocks. But really, those allocations
should simply fail, because looping in the allocator and relying on others
to make progress is simply broken.

I thought that Dave was quite explicit that they do not strictly
need nofail behavior of GFP_NOFS but rather a GFP flag which
would allow diving into the reserves some more for specific contexts
(http://marc.info/?l=linux-mm&m=142897087230385&w=2). I also do not
remember him or anybody else saying that _every_ GFP_NOFS allocation
should get access to the reserves automatically.

Dave, could you clarify/confirm, please?

Because we have been going back and forth about GFP_NOFS without any
progress for a very long time, and it seems one class of issues could
already be handled by this change.

I mean we should eventually fail all the allocation types but GFP_NOFS
is coming from _carefully_ handled code paths which is an easier starting
point than a random code path in the kernel/drivers. So can we finally
move at least in this direction?

> Likewise, there is a possibility that such a memory reserve is used by
> threads which the OOM victim is not waiting for, since malloc() +
> memset() causes __GFP_FS allocations.

We cannot be certain without complete dependency tracking. This is
just a heuristic.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/9] mm: improve OOM mechanism v2

2015-04-29 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 29-04-15 08:55:06, Johannes Weiner wrote:
> > What we can do to mitigate this is tie the timeout to the setting of
> > TIF_MEMDIE so that the wait is not 5s from the point of calling
> > out_of_memory() but from the point of where TIF_MEMDIE was set.
> > Subsequent allocations will then go straight to the reserves.
> 
> That would deplete the reserves very easily. Shouldn't we rather
> go the other way around? Allow OOM killer context to dive into memory
> reserves some more (ALLOC_OOM on top of current ALLOC flags and
> __zone_watermark_ok would allow an additional 1/4 of the reserves) and
> start waiting for the victim after that reserve is depleted. We would
> still have some room for TIF_MEMDIE to allocate, the reserves consumption
> would be throttled somehow and the holders of resources would have some
> chance to release them and allow the victim to die.

Does OOM killer context mean memory allocations which can call out_of_memory()?
If yes, there is no guarantee that such a memory reserve is used by the
threads which the OOM victim is waiting for, for they might do only
!__GFP_FS allocations. Likewise, there is a possibility that such a memory
reserve is used by threads which the OOM victim is not waiting for, since
malloc() + memset() causes __GFP_FS allocations.
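
A minimal userspace illustration of that last point (my example, assuming
the usual anonymous-fault path):

#include <stdlib.h>
#include <string.h>

int main(void)
{
	size_t size = 64UL << 20;	/* 64MB of anonymous memory */
	char *p = malloc(size);

	if (!p)
		return 1;
	/*
	 * Each page touched here is faulted in by the kernel with
	 * GFP_HIGHUSER_MOVABLE, which includes __GFP_FS, so a plain
	 * malloc() + memset() does perform __GFP_FS allocations.
	 */
	memset(p, 0, size);
	free(p);
	return 0;
}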


Re: [PATCH 0/9] mm: improve OOM mechanism v2

2015-04-29 Thread Michal Hocko
On Wed 29-04-15 08:55:06, Johannes Weiner wrote:
> On Wed, Apr 29, 2015 at 12:50:37AM +0900, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 28-04-15 19:34:47, Tetsuo Handa wrote:
> > > [...]
> > > > [PATCH 8/9] makes the speed of allocating __GFP_FS pages extremely
> > > > slow (5 seconds / page) because out_of_memory() serialized by the
> > > > oom_lock sleeps for 5 seconds before returning true when the OOM
> > > > victim got stuck. This throttling also slows down !__GFP_FS
> > > > allocations when there is a thread doing a __GFP_FS allocation, for
> > > > __alloc_pages_may_oom() is serialized by the oom_lock regardless of
> > > > gfp_mask.
> > > 
> > > This is indeed unnecessary.
> > > 
> > > > How long will the OOM victim be blocked when the allocating task
> > > > needs to allocate e.g. 1000 !__GFP_FS pages before allowing the OOM
> > > > victim waiting at mutex_lock(&inode->i_mutex) to continue? It will
> > > > be a too-long-to-wait stall which is effectively a deadlock for
> > > > users. I think we should not sleep with the oom_lock held.
> > > 
> > > I do not see why sleeping with oom_lock would be a problem. It simply
> > > doesn't make much sense to try to trigger OOM killer when there is/are
> > > OOM victims still exiting.
> > 
> > Because thread A's memory allocation is deferred by the memory
> > allocations of threads B, C, D..., which are holding (or waiting for)
> > the oom_lock, while the OOM victim is waiting for thread A's
> > allocation. I think that a memory allocator which takes on average 5
> > seconds per allocation is unusable. If we sleep without the oom_lock
> > held, the memory allocator can allocate on average every
> > (5 / number_of_allocating_threads) seconds. Sleeping with the
> > oom_lock held can effectively prevent thread A from making progress.
> 
> I agree with that.  The problem with the sleeping is that it has to be
> long enough to give the OOM victim a fair chance to exit, but short
> enough to not make the page allocator unusable in case there is a
> genuine deadlock.  And you are right, the context blocking the OOM
> victim from exiting might do a whole string of allocations, not just
> one, before releasing any locks.
> 
> What we can do to mitigate this is tie the timeout to the setting of
> TIF_MEMDIE so that the wait is not 5s from the point of calling
> out_of_memory() but from the point of where TIF_MEMDIE was set.
> Subsequent allocations will then go straight to the reserves.

That would deplete the reserves very easily. Shouldn't we rather
go the other way around? Allow OOM killer context to dive into memory
reserves some more (ALLOC_OOM on top of current ALLOC flags and
__zone_watermark_ok would allow an additional 1/4 of the reserves) and
start waiting for the victim after that reserve is depleted. We would
still have some room for TIF_MEMDIE to allocate, the reserves consumption
would be throttled somehow and the holders of resources would have some
chance to release them and allow the victim to die.

If the allocation still fails after the timeout then we should consider
failing the allocation as the next step, or giving NO_WATERMARKS access to
GFP_NOFAIL allocations.
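
A sketch of what that could look like in __zone_watermark_ok(), assuming a
new ALLOC_OOM flag alongside the existing ALLOC_HIGH/ALLOC_HARDER fractions
(ALLOC_OOM did not exist at the time; sketch only):

	long min = mark;

	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;		/* existing: up to 1/2 of the reserve */
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;		/* existing: another 1/4 */
	if (alloc_flags & ALLOC_OOM)	/* assumed new flag */
		min -= min / 4;		/* OOM context digs 1/4 deeper */

	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return false;
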
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/9] mm: improve OOM mechanism v2

2015-04-29 Thread Johannes Weiner
On Wed, Apr 29, 2015 at 12:50:37AM +0900, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 28-04-15 19:34:47, Tetsuo Handa wrote:
> > [...]
> > > [PATCH 8/9] makes the speed of allocating __GFP_FS pages extremely
> > > slow (5 seconds / page) because out_of_memory() serialized by the
> > > oom_lock sleeps for 5 seconds before returning true when the OOM
> > > victim got stuck. This throttling also slows down !__GFP_FS
> > > allocations when there is a thread doing a __GFP_FS allocation, for
> > > __alloc_pages_may_oom() is serialized by the oom_lock regardless of
> > > gfp_mask.
> > 
> > This is indeed unnecessary.
> > 
> > > How long will the OOM victim be blocked when the allocating task
> > > needs to allocate e.g. 1000 !__GFP_FS pages before allowing the OOM
> > > victim waiting at mutex_lock(&inode->i_mutex) to continue? It will
> > > be a too-long-to-wait stall which is effectively a deadlock for
> > > users. I think we should not sleep with the oom_lock held.
> > 
> > I do not see why sleeping with oom_lock would be a problem. It simply
> > doesn't make much sense to try to trigger OOM killer when there is/are
> > OOM victims still exiting.
> 
> Because thread A's memory allocation is deferred by the memory
> allocations of threads B, C, D..., which are holding (or waiting for)
> the oom_lock, while the OOM victim is waiting for thread A's allocation.
> I think that a memory allocator which takes on average 5 seconds per
> allocation is unusable. If we sleep without the oom_lock held, the
> memory allocator can allocate on average every
> (5 / number_of_allocating_threads) seconds. Sleeping with the oom_lock
> held can effectively prevent thread A from making progress.

I agree with that.  The problem with the sleeping is that it has to be
long enough to give the OOM victim a fair chance to exit, but short
enough to not make the page allocator unusable in case there is a
genuine deadlock.  And you are right, the context blocking the OOM
victim from exiting might do a whole string of allocations, not just
one, before releasing any locks.

What we can do to mitigate this is tie the timeout to the setting of
TIF_MEMDIE so that the wait is not 5s from the point of calling
out_of_memory() but from the point of where TIF_MEMDIE was set.
Subsequent allocations will then go straight to the reserves.
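
A sketch of that, assuming a timestamp field recorded when TIF_MEMDIE is set
(the field and helper are illustrative; no such field existed):

void mark_oom_victim(struct task_struct *tsk)
{
	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
		return;
	tsk->memdie_start = jiffies;	/* assumed field */
}

/* The allocator waits relative to this stamp, not per out_of_memory() call. */
static bool oom_victim_timed_out(struct task_struct *victim)
{
	return time_after(jiffies, victim->memdie_start + 5 * HZ);
}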

> > >   (2) oom_kill_process() can determine when to kill the next OOM victim.
> > > 
> > >   (3) oom_scan_process_thread() can take the TIF_MEMDIE timeout into
> > >   account when choosing an OOM victim.
> > 
> > You have heard my opinions about this and I do not plan to repeat them
> > here again.
> 
> Yes, I've heard your opinions. But neither ALLOC_NO_WATERMARKS nor
> WMARK_OOM is a perfect measure for avoiding deadlock. We want to solve
> the "without any extra measures the above situation will result in a
> deadlock" problem. When WMARK_OOM fails to avoid the deadlock, and we
> don't want to fall back to ALLOC_NO_WATERMARKS, I think somehow choosing
> and killing more victims is the only possible measure. Maybe choosing
> the next OOM victim upon reaching WMARK_OOM?

I also think that the argument against moving on to the next victim is
completely missing the tradeoff we are making here.  When the victim
is stuck and we run out of memory reserves, the machine *deadlocks*
while there are still tasks that might be able to release memory.
It's irresponsible not to go after them.  Why *shouldn't* we try?


Re: [PATCH 0/9] mm: improve OOM mechanism v2

2015-04-28 Thread Tetsuo Handa
Michal Hocko wrote:
> On Tue 28-04-15 19:34:47, Tetsuo Handa wrote:
> [...]
> > [PATCH 8/9] makes the speed of allocating __GFP_FS pages extremely
> > slow (5 seconds / page) because out_of_memory() serialized by the
> > oom_lock sleeps for 5 seconds before returning true when the OOM
> > victim got stuck. This throttling also slows down !__GFP_FS
> > allocations when there is a thread doing a __GFP_FS allocation, for
> > __alloc_pages_may_oom() is serialized by the oom_lock regardless of
> > gfp_mask.
> 
> This is indeed unnecessary.
> 
> > How long will the OOM victim be blocked when the allocating task
> > needs to allocate e.g. 1000 !__GFP_FS pages before allowing the OOM
> > victim waiting at mutex_lock(&inode->i_mutex) to continue? It will
> > be a too-long-to-wait stall which is effectively a deadlock for
> > users. I think we should not sleep with the oom_lock held.
> 
> I do not see why sleeping with oom_lock would be a problem. It simply
> doesn't make much sense to try to trigger OOM killer when there is/are
> OOM victims still exiting.

Because thread A's memory allocation is deferred by the memory allocations
of threads B, C, D..., which are holding (or waiting for) the oom_lock,
while the OOM victim is waiting for thread A's allocation. I think that a
memory allocator which takes on average 5 seconds per allocation is
unusable. If we sleep without the oom_lock held, the memory allocator can
allocate on average every (5 / number_of_allocating_threads) seconds.
Sleeping with the oom_lock held can effectively prevent thread A from
making progress.
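
A sketch of the flow being argued for, dropping the lock before the wait
(simplified, not the actual patch; zonelist/gfp_mask/order/nodemask are the
usual allocation parameters):

static bool oom_kill_then_wait(struct zonelist *zonelist, gfp_t gfp_mask,
			       int order, nodemask_t *nodemask)
{
	if (mutex_trylock(&oom_lock)) {
		out_of_memory(zonelist, gfp_mask, order, nodemask, false);
		mutex_unlock(&oom_lock);	/* drop BEFORE sleeping */
	}
	/*
	 * Sleep outside the oom_lock so that other allocating threads
	 * are not serialized behind this wait.
	 */
	schedule_timeout_killable(HZ);
	return true;				/* caller retries */
}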

> > By the way, I came up with an idea (incomplete patch on top of patches up to
> > 7/9 is shown below) while trying to avoid sleeping with the oom_lock held.
> > This patch is meant to ensure that
> > 
> >   (1) blocking_notifier_call_chain(&oom_notify_list) is called after
> >   the OOM killer is disabled, in order to increase the possibility
> >   that memory allocation succeeds.
> 
> How do you guarantee that the notifier doesn't wake up any process and
> break the oom_disable guarantee?

I thought oom_notify_list wakes up only kernel threads. OK, is that the
reason we don't call oom_notify_list after the OOM killer is disabled?

> 
> >   (2) oom_kill_process() can determine when to kill the next OOM victim.
> > 
> >   (3) oom_scan_process_thread() can take the TIF_MEMDIE timeout into
> >   account when choosing an OOM victim.
> 
> You have heard my opinions about this and I do not plan to repeat them
> here again.

Yes, I've heard your opinions. But neither ALLOC_NO_WATERMARKS nor WMARK_OOM
is a perfect measure for avoiding deadlock. We want to solve "Without any
extra measures the above situation will result in a deadlock" problem.
When WMARK_OOM failed to avoid the deadlock, and we don't want to go to
ALLOC_NO_WATERMARKS, I think somehow choosing and killing more victims is
the only possible measure. Maybe choosing next OOM victim upon reaching
WMARK_OOM?


Re: [PATCH 0/9] mm: improve OOM mechanism v2

2015-04-28 Thread Michal Hocko
On Tue 28-04-15 19:34:47, Tetsuo Handa wrote:
[...]
> [PATCH 8/9] makes the speed of allocating __GFP_FS pages extremely slow (5
> seconds / page) because out_of_memory() serialized by the oom_lock sleeps for
> 5 seconds before returning true when the OOM victim got stuck. This throttling
> also slows down !__GFP_FS allocations when there is a thread doing a __GFP_FS
> allocation, for __alloc_pages_may_oom() is serialized by the oom_lock
> regardless of gfp_mask.

This is indeed unnecessary.

> How long will the OOM victim be blocked when the
> allocating task needs to allocate e.g. 1000 !__GFP_FS pages before allowing
> the OOM victim waiting at mutex_lock(&inode->i_mutex) to continue? It will be
> a too-long-to-wait stall which is effectively a deadlock for users. I think
> we should not sleep with the oom_lock held.

I do not see why sleeping with oom_lock would be a problem. It simply
doesn't make much sense to try to trigger OOM killer when there is/are
OOM victims still exiting.

> Also, allowing any !fatal_signal_pending() threads doing __GFP_FS allocations
> (e.g. malloc() + memset()) to dip into the reserves will deplete them when the
> OOM victim is blocked by a thread doing a !__GFP_FS allocation, for
> [PATCH 9/9] does not allow !test_thread_flag(TIF_MEMDIE) threads doing
> !__GFP_FS allocations to access the reserves. Of course, updating [PATCH 9/9]
> like
> 
> -+ if (*did_some_progress)
> -+  alloc_flags |= ALLOC_NO_WATERMARKS;
>   out:
> ++ if (*did_some_progress)
> ++  alloc_flags |= ALLOC_NO_WATERMARKS;
>mutex_unlock(&oom_lock);
> 
> (which means use of "no watermark" without invoking the OOM killer) is
> obviously wrong. I think we should not allow __GFP_FS allocations to
> access the reserves when the OOM victim is blocked.
> 
> By the way, I came up with an idea (incomplete patch on top of patches up to
> 7/9 is shown below) while trying to avoid sleeping with the oom_lock held.
> This patch is meant to ensure that
> 
>   (1) blocking_notifier_call_chain(&oom_notify_list) is called after
>   the OOM killer is disabled, in order to increase the possibility
>   that memory allocation succeeds.

How do you guarantee that the notifier doesn't wake up any process and
break the oom_disable guarantee?

>   (2) oom_kill_process() can determine when to kill the next OOM victim.
> 
>   (3) oom_scan_process_thread() can take the TIF_MEMDIE timeout into
>   account when choosing an OOM victim.

You have heard my opinions about this and I do not plan to repeat them
here again.

[...]
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/9] mm: improve OOM mechanism v2

2015-04-28 Thread Tetsuo Handa
Johannes Weiner wrote:
> There is a possible deadlock scenario between the page allocator and
> the OOM killer.  Most allocations currently retry forever inside the
> page allocator, but when the OOM killer is invoked the chosen victim
> might try taking locks held by the allocating task.  This series, on
> top of many cleanups in the allocator & OOM killer, grants such OOM-
> killing allocations access to the system's memory reserves in order
> for them to make progress without relying on their own kill to exit.

I don't think

  [PATCH 8/9] mm: page_alloc: wait for OOM killer progress before retrying

and

  [PATCH 9/9] mm: page_alloc: memory reserve access for OOM-killing allocations

will work.

[PATCH 8/9] makes the speed of allocating __GFP_FS pages extremely slow (5
seconds / page) because out_of_memory() serialized by the oom_lock sleeps for
5 seconds before returning true when the OOM victim got stuck. This throttling
also slows down !__GFP_FS allocations when there is a thread doing a __GFP_FS
allocation, for __alloc_pages_may_oom() is serialized by the oom_lock
regardless of gfp_mask. How long will the OOM victim be blocked when the
allocating task needs to allocate e.g. 1000 !__GFP_FS pages before allowing
the OOM victim waiting at mutex_lock(&inode->i_mutex) to continue? It will be
a too-long-to-wait stall which is effectively a deadlock for users. I think
we should not sleep with the oom_lock held.

Also, allowing any !fatal_signal_pending() threads doing __GFP_FS allocations
(e.g. malloc() + memset()) to dip into the reserves will deplete them when the
OOM victim is blocked by a thread doing a !__GFP_FS allocation, for
[PATCH 9/9] does not allow !test_thread_flag(TIF_MEMDIE) threads doing
!__GFP_FS allocations to access the reserves. Of course, updating [PATCH 9/9]
like

-+ if (*did_some_progress)
-+  alloc_flags |= ALLOC_NO_WATERMARKS;
  out:
++ if (*did_some_progress)
++  alloc_flags |= ALLOC_NO_WATERMARKS;
   mutex_unlock(&oom_lock);

(which means use of "no watermark" without invoking the OOM killer) is
obviously wrong. I think we should not allow __GFP_FS allocations to
access the reserves when the OOM victim is blocked.



By the way, I came up with an idea (incomplete patch on top of patches up to
7/9 is shown below) while trying to avoid sleeping with the oom_lock held.
This patch is meant to ensure that

  (1) blocking_notifier_call_chain(&oom_notify_list) is called after
  the OOM killer is disabled, in order to increase the possibility
  that memory allocation succeeds.

  (2) oom_kill_process() can determine when to kill the next OOM victim.

  (3) oom_scan_process_thread() can take the TIF_MEMDIE timeout into
  account when choosing an OOM victim.

What do you think?



diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index b20d2c0..843f2cd 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -356,11 +356,9 @@ static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-   mutex_lock(&oom_lock);
if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
   GFP_KERNEL, 0, NULL, true))
pr_info("OOM request ignored because killer is disabled\n");
-   mutex_unlock(&oom_lock);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7deecb7..ed8c53b 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -32,8 +32,6 @@ enum oom_scan_t {
 /* Thread is the potential origin of an oom condition; kill first on oom */
#define OOM_FLAG_ORIGIN ((__force oom_flags_t)0x1)
 
-extern struct mutex oom_lock;
-
 static inline void set_current_oom_origin(void)
 {
current->signal->oom_flags |= OOM_FLAG_ORIGIN;
@@ -49,8 +47,6 @@ static inline bool oom_task_origin(const struct task_struct *p)
return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
 }
 
-extern void mark_oom_victim(struct task_struct *tsk);
-
 extern unsigned long oom_badness(struct task_struct *p,
struct mem_cgroup *memcg, const nodemask_t *nodemask,
unsigned long totalpages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5fd273d..06a7a9f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1530,16 +1530,16 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int points = 0;
struct task_struct *chosen = NULL;
 
-   mutex_lock(&oom_lock);
-
/*
 * If current has a pending SIGKILL or is exiting, then automatically
 * select it.  The goal is to allow it to allocate so that it may
 * quickly exit and free its memory.
 */
if (fatal_signal_pending(current) || task_will_free_mem(current)) {
-   mark_oom_victim(current);
-   goto unlock;
+   get_task_struct(current);
+   oom_kill_process(current, gfp_mask, order, 0, 0, mem