Re: can't oom-kill zap the victim's memory?

2015-10-08 Thread Michal Hocko
On Wed 07-10-15 14:00:16, Oleg Nesterov wrote:
> On 10/07, Tetsuo Handa wrote:
> >
> > Oleg Nesterov wrote:
> > > Anyway. Perhaps it makes sense to abort the for_each_vma() loop if
> > > freed_enough_mem() == T. But it is absolutely not clear to me how we
> > > should define this freed_enough_mem(), so I think we should do this
> > > later.
> >
> > Maybe
> >
> >   bool freed_enough_mem(void) { return !atomic_read(&oom_victims); }
> >
> > if we change to call mark_oom_victim() on all threads which should be
> > killed as OOM victims.
> 
> Well, in this case
> 
>   if (atomic_read(&mm->mm_users) == 1)
>   break;
> 
> makes much more sense. Plus we do not need to change mark_oom_victim().
> 
> Let's discuss this later?

Yes, I do not think this is that important if a kernel thread is going to
reclaim the address space. It will effectively free memory on behalf of
the victim, so a longer scan shouldn't be such a big problem. At least
not for the first implementation.
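
For concreteness, a minimal sketch of what such a zapping pass could look
like (the names and the exact unmap call are illustrative, not a settled
interface; it only touches private anonymous mappings and bails out if it
cannot get mmap_sem):

	static void oom_zap_victim_mm(struct mm_struct *mm)
	{
		struct vm_area_struct *vma;

		if (!down_read_trylock(&mm->mmap_sem))
			return;		/* held for write, try again later */

		for (vma = mm->mmap; vma; vma = vma->vm_next) {
			/* only private anonymous mappings are obviously safe to zap */
			if (is_vm_hugetlb_page(vma) || vma->vm_file ||
			    (vma->vm_flags & VM_SHARED))
				continue;

			zap_page_range(vma, vma->vm_start,
				       vma->vm_end - vma->vm_start, NULL);

			/* optional early exit discussed above */
			if (atomic_read(&mm->mm_users) == 1)
				break;
		}

		up_read(&mm->mmap_sem);
	}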
-- 
Michal Hocko
SUSE Labs


Re: can't oom-kill zap the victim's memory?

2015-10-08 Thread Michal Hocko
On Tue 06-10-15 20:45:02, Oleg Nesterov wrote:
[...]
> And I was going to make V1 which avoids queue_work/kthread and zaps the
> memory in oom_kill_process() context.
> 
> But this can't work because we need to increment ->mm_users to avoid
> the race with exit_mmap/etc. And this means that we need mmput() after
> that, and as we recently discussed it can deadlock if mm_users goes
> to zero, we can't do exit_mmap/etc in oom_kill_process().

Right. I hoped we could rely on mm_count just to pin mm but that is not
sufficient because exit_mmap doesn't rely on mmap_sem so we do not have
any synchronization there. Unfortunate. This means that we indeed have
to do it asynchronously. Maybe we can come up with some trickery but
let's do it later. I do agree that going with a kernel thread for now
would be easier. Sorry about misleading you, I should have realized that
mmput from the oom killing path is dangerous.
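
Roughly, the pattern V1 would have needed, and why it is dangerous
(schematic only):

	/* in oom_kill_process() context: */
	atomic_inc(&mm->mm_users);	/* pin the mm so exit_mmap() cannot run under us */

	/* ... zap the victim's anonymous memory here ... */

	mmput(mm);			/* but if we were the last user, this runs
					 * exit_mmap() and friends right here, in the
					 * OOM killing path -> the deadlock above */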

-- 
Michal Hocko
SUSE Labs


Re: can't oom-kill zap the victim's memory?

2015-10-08 Thread Vlastimil Babka

On 10/07/2015 12:43 PM, Tetsuo Handa wrote:
> Vlastimil Babka wrote:
> > On 5.10.2015 16:44, Michal Hocko wrote:
> > > So I can see basically only a few ways out of this deadlock situation.
> > > Either we face the reality and allow small allocations (without
> > > __GFP_NOFAIL) to fail after all attempts to reclaim memory have failed
> > > (so after even OOM killer hasn't made any progress).
> >
> > Note that small allocations already *can* fail if they are done in the context
> > of a task selected as OOM victim (i.e. TIF_MEMDIE). And yeah I've seen a case
> > when they failed in code that "handled" the allocation failure with a
> > BUG_ON(!page).
>
> Did you hit a race described below?

I don't know, I don't even have direct evidence of TIF_MEMDIE being set,
but OOMs were happening all over the place, and I haven't found another
reason why the allocation would not be too-small-to-fail otherwise.

> http://lkml.kernel.org/r/201508272249.hdh81838.ftqolmffovs...@i-love.sakura.ne.jp
>
> Where was the BUG_ON(!page) ? Maybe it is a candidate for adding __GFP_NOFAIL.

Yes, I suggested so:
http://marc.info/?l=linux-kernel&m=144181523115244&w=2



Re: can't oom-kill zap the victim's memory?

2015-10-07 Thread Oleg Nesterov
On 10/07, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > Anyway. Perhaps it makes sense to abort the for_each_vma() loop if
> > freed_enough_mem() == T. But it is absolutely not clear to me how we
> > should define this freed_enough_mem(), so I think we should do this
> > later.
>
> Maybe
>
>   bool freed_enough_mem(void) { return !atomic_read(&oom_victims); }
>
> if we change to call mark_oom_victim() on all threads which should be
> killed as OOM victims.

Well, in this case

if (atomic_read(&mm->mm_users) == 1)
break;

makes much more sense. Plus we do not need to change mark_oom_victim().

Let's discuss this later?

Oleg.



Re: can't oom-kill zap the victim's memory?

2015-10-07 Thread Tetsuo Handa
Oleg Nesterov wrote:
> > > Hmm. If we already have mmap_sem and started zap_page_range() then
> > > I do not think it makes sense to stop until we free everything we can.
> >
> > Zapping a huge address space can take quite some time
> 
> Yes, and this is another reason we should do this asynchronously.
> 
> > and we really do
> > not have to free it all on behalf of the killer when enough memory is
> > freed to allow for further progress and the rest can be done by the
> > victim. If one batch doesn't seem sufficient then another retry can
> > continue.
> >
> > I do not think that a limited scan would make the implementation more
> > complicated
> 
> But we can't even know how much memory unmap_single_vma() actually frees.
> Even if we could, how can we know we freed enough?
> 
> Anyway. Perhaps it makes sense to abort the for_each_vma() loop if
> freed_enough_mem() == T. But it is absolutely not clear to me how we
> should define this freed_enough_mem(), so I think we should do this
> later.

Maybe

  bool freed_enough_mem(void) { return !atomic_read(&oom_victims); }

if we change to call mark_oom_victim() on all threads which should be
killed as OOM victims.


Re: can't oom-kill zap the victim's memory?

2015-10-07 Thread Tetsuo Handa
Vlastimil Babka wrote:
> On 5.10.2015 16:44, Michal Hocko wrote:
> > So I can see basically only a few ways out of this deadlock situation.
> > Either we face the reality and allow small allocations (without
> > __GFP_NOFAIL) to fail after all attempts to reclaim memory have failed
> > (so after even OOM killer hasn't made any progress).
> 
> Note that small allocations already *can* fail if they are done in the context
> of a task selected as OOM victim (i.e. TIF_MEMDIE). And yeah I've seen a case
> when they failed in code that "handled" the allocation failure with a
> BUG_ON(!page).
> 
Did you hit a race described below?
http://lkml.kernel.org/r/201508272249.hdh81838.ftqolmffovs...@i-love.sakura.ne.jp

Where was the BUG_ON(!page) ? Maybe it is a candidate for adding __GFP_NOFAIL.


Re: can't oom-kill zap the victim's memory?

2015-10-06 Thread Vlastimil Babka
On 5.10.2015 16:44, Michal Hocko wrote:
> So I can see basically only a few ways out of this deadlock situation.
> Either we face the reality and allow small allocations (without
> __GFP_NOFAIL) to fail after all attempts to reclaim memory have failed
> (so after even OOM killer hasn't made any progress).

Note that small allocations already *can* fail if they are done in the context
of a task selected as OOM victim (i.e. TIF_MEMDIE). And yeah I've seen a case
when they failed in code that "handled" the allocation failure with a
BUG_ON(!page).
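
The problematic pattern is roughly this (illustrative, not the exact call
site):

	/* "handling" the failure by crashing: */
	page = alloc_page(GFP_NOFS);
	BUG_ON(!page);

	/* what was suggested instead: make the cannot-fail expectation
	 * explicit so the allocator keeps trying on the caller's behalf */
	page = alloc_page(GFP_NOFS | __GFP_NOFAIL);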


Re: can't oom-kill zap the victim's memory?

2015-10-06 Thread Oleg Nesterov
Damn. I can't believe this, but I still can't make the initial change.
And no, it is not that I hit some technical problems; it's just that I can't
decide what exactly the first step should do to be a) really simple
and b) useful. I am starting to think I'll just update my draft patch
which uses queue_work() and send it tomorrow (yes, tomorrow again ;).

But let me at least answer this email,

On 09/23, Michal Hocko wrote:
>
> On Tue 22-09-15 18:06:08, Oleg Nesterov wrote:
> >
> > OK, let it be a kthread from the very beginning, I won't argue. This
> > is really minor compared to other problems.
>
> I am still not sure how you want to implement that kernel thread but I
> am quite skeptical it would be very much useful because all the current
> allocations which end up in the OOM killer path cannot simply back off
> and drop the locks with the current allocator semantic.  So they will
> be sitting on top of unknown pile of locks whether you do an additional
> reclaim (unmap the anon memory) in the direct OOM context or looping
> in the allocator and waiting for kthread/workqueue to do its work. The
> only argument that I can see is the stack usage but I haven't seen stack
> overflows in the OOM path AFAIR.

Please see below,

> > And note that the caller can held other locks we do not even know about.
> > Most probably we should not deadlock, at least if we only unmap the anon
> > pages, but still this doesn't look safe.
>
> The unmapper cannot fall back to reclaim and/or trigger the OOM so
> we should be indeed very careful and mark the allocation context
> appropriately. I can remember mmu_gather but it is only doing
> opportunistic allocation AFAIR.

And I was going to make V1 which avoids queue_work/kthread and zaps the
memory in oom_kill_process() context.

But this can't work because we need to increment ->mm_users to avoid
the race with exit_mmap/etc. And this means that we need mmput() after
that, and as we recently discussed it can deadlock if mm_users goes
to zero, we can't do exit_mmap/etc in oom_kill_process().

> > Hmm. If we already have mmap_sem and started zap_page_range() then
> > I do not think it makes sense to stop until we free everything we can.
>
> Zapping a huge address space can take quite some time

Yes, and this is another reason we should do this asynchronously.

> and we really do
> not have to free it all on behalf of the killer when enough memory is
> freed to allow for further progress and the rest can be done by the
> victim. If one batch doesn't seem sufficient then another retry can
> continue.
>
> I do not think that a limited scan would make the implementation more
> complicated

But we can't even know how much memory unmap_single_vma() actually frees.
Even if we could, how can we know we freed enough?

Anyway. Perhaps it makes sense to abort the for_each_vma() loop if
freed_enough_mem() == T. But it is absolutely not clear to me how we
should define this freed_enough_mem(), so I think we should do this
later.

> > But. Can't we just remove another ->oom_score_adj check when we try
> > to kill all mm users (the last for_each_process loop). If yes, this
> > all can be simplified.
> >
> > I guess we can't and its a pity. Because it looks simply pointless
> > to not kill all mm users. This just means the select_bad_process()
> > picked the wrong task.
>
> Yes I am not really sure why oom_score_adj is not per-mm and we are
> doing that per signal struct to be honest.

Heh ;) Yes, but I guess it is too late to move it back.

> Maybe we can revisit this...

I hope, but I am not going to try to remove this OOM_SCORE_ADJ_MIN
check now. We just should not zap this mm if we find an OOM-unkillable
user.

Oleg.



Re: can't oom-kill zap the victim's memory?

2015-10-06 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Tue, Oct 6, 2015 at 9:49 AM, Linus Torvalds wrote:
>>
>> The basic fact remains: kernel allocations are so important that
>> rather than fail, you should kill user space. Only kernel allocations
>> that *explicitly* know that they have fallback code should fail, and
>> they should just do the __GFP_NORETRY.

If you have reached the point of killing userspace you might as well
panic the box.  Userspace will recover more cleanly and more quickly.
The oom-killer is like an oops.  Nice for debugging but not something
you want on a production workload.

> To be clear: "big" orders (I forget if the limit is at order-3 or
> order-4) do fail much more aggressively. But no, we do not limit retry
> to just order-0, because even small kmalloc sizes tend to often do
> order-1 or order-2 just because of memory packing issues (ie trying to
> pack into a single page wastes too much memory if the allocation sizes
> don't come out right).

I am not asking that we limit retry to just order-0 pages.  I am asking
that we limit the oom-killer on failure to just order-0 pages.

> So no, order-0 isn't special. 1/2 are rather important too.

That is a justification for retrying.  That is not a justification for
killing the box.

> [ Checking /proc/slabinfo: it looks like several slabs are order-3,
> for things like files_cache, signal_cache and sighand_cache for me at
> least. So I think it's up to order-3 that we basically need to
> consider "we'll need to shrink user space aggressively unless we have
> an explicit fallback for the allocation" ]

What I know is that order-3 is definitely too big.  I had 4G of RAM
free.  I needed 16K to expand the fd table.  The box died.  That is
not good.

We have static checkers now; failure to check and handle errors tends to
be caught.

So yes for the rare case of order-[123] allocations failing we should
return the failure to the caller.  The kernel can handle it.  Userspace
can handle just about anything better than random processes dying.

Eric


Re: can't oom-kill zap the victim's memory?

2015-10-06 Thread Linus Torvalds
On Tue, Oct 6, 2015 at 9:49 AM, Linus Torvalds wrote:
>
> The basic fact remains: kernel allocations are so important that
> rather than fail, you should kill user space. Only kernel allocations
> that *explicitly* know that they have fallback code should fail, and
> they should just do the __GFP_NORETRY.

To be clear: "big" orders (I forget if the limit is at order-3 or
order-4) do fail much more aggressively. But no, we do not limit retry
to just order-0, because even small kmalloc sizes tend to often do
order-1 or order-2 just because of memory packing issues (ie trying to
pack into a single page wastes too much memory if the allocation sizes
don't come out right).

So no, order-0 isn't special. 1/2 are rather important too.

[ Checking /proc/slabinfo: it looks like several slabs are order-3,
for things like files_cache, signal_cache and sighand_cache for me at
least. So I think it's up to order-3 that we basically need to
consider "we'll need to shrink user space aggressively unless we have
an explicit fallback for the allocation" ]

Linus


Re: can't oom-kill zap the victim's memory?

2015-10-06 Thread Linus Torvalds
On Tue, Oct 6, 2015 at 8:55 AM, Eric W. Biederman  wrote:
>
> Not to take away from your point about very small allocations.  However
> assuming allocations larger than a page will always succeed is downright
> dangerous.

We've required retrying for *at least* order-1 allocations, exactly
because things like fork() etc. have wanted them, and:

 - as you say, you can be unlucky even with reasonable amounts of free memory

 - the page-out code is approximate and doesn't guarantee that you get
buddy coalescing

 - just failing after a couple of loops has been known to result in
fork() and similar friends returning -EAGAIN and breaking user space.

Really. Stop this idiocy. We have gone through this before. It's a disaster.

The basic fact remains: kernel allocations are so important that
rather than fail, you should kill user space. Only kernel allocations
that *explicitly* know that they have fallback code should fail, and
they should just do the __GFP_NORETRY.

So the rule ends up being that we retry the memory freeing loop for
small allocations (where "small" is something like "order 2 or less").

So really. If you find some particular case that is painful because it
wants an order-1 or order-2 allocation, then you do this:

 - do the allocation with GFP_NORETRY

 - have a fallback that uses vmalloc or just is able to make the
buffer even smaller.
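
Concretely, that pattern is roughly (a sketch; the size handling is
illustrative):

	void *buf;

	/* try the physically contiguous allocation, but give up quickly
	 * instead of retrying forever or invoking the OOM killer */
	buf = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
	if (!buf)
		buf = vmalloc(size);	/* virtually contiguous fallback */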

But by default we will continue to make small orders retry. As
mentioned, we have tried the alternatives. It doesn't work.

Linus


Re: can't oom-kill zap the victim's memory?

2015-10-06 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Fri, Oct 2, 2015 at 8:36 AM, Michal Hocko  wrote:
>>
>> Have they been reported/fixed? All kernel paths doing an allocation are
>> _supposed_ to check and handle ENOMEM. If they are not then they are
>> buggy and should be fixed.
>
> No. Stop this theoretical idiocy.
>
> We've tried it. I objected before people tried it, and it turns out
> that it was a horrible idea.
>
> Small kernel allocations should basically never fail, because we end
> up needing memory for random things, and if a kmalloc() fails it's
> because some application is using too much memory, and the application
> should be killed. Never should the kernel allocation fail. It really
> is that simple. If we are out of memory, that does not mean that we
> should start failing random kernel things.
>
> So this "people should check for allocation failures" is bullshit.
> It's a computer science myth. It's simply not true in all cases.
>
> Kernel allocators that know that they do large allocations (ie bigger
> than a few pages) need to be able to handle the failure, but not the
> general case. Also, kernel allocators that know they have a good
> fallback (eg they try a large allocation first but can fall back to a
> smaller one) should use __GFP_NORETRY, but again, that does *not* in
> any way mean that general kernel allocations should randomly fail.
>
> So no. The answer is ABSOLUTELY NOT "everybody should check allocation
> failure". Get over it. I refuse to go through that circus again. It's
> stupid.

Not to take away from your point about very small allocations.  However
assuming allocations larger than a page will always succeed is downright
dangerous.  Last time this issue rose up and bit me I sat down and
did the math, and it is ugly.  You have to have 50% of the memory free
to guarantee that an order 1 allocation will succeed.

So quite frankly I think it is only safe to require order 0 allocations
to succeed.  Larger allocations do fail in practice, and it causes real
problems on real workloads when we try and loop forever waiting for
something that will never come.
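
Written out, the pigeon hole bound is (N total pages, order-n request;
this just restates the numbers in the commit message quoted below): each
aligned order-n block of 2^n pages can hold up to 2^n - 1 free pages
without being completely free, so up to

	\mathrm{free}_{\max}(n) = N \cdot \frac{2^n - 1}{2^n}

pages can be free with no order-n block available; one more free page
forces some block to be entirely free. For n = 1, 2, 3 that means
N/2 + 1, 3N/4 + 1 and 7N/8 + 1 free pages are needed for a guarantee,
i.e. just over 50%, 75% and 87.5%.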

My analysis from when it bit me.

commit 96c7a2ff21501691587e1ae969b83cbec8b78e08
Author: Eric W. Biederman 
Date:   Mon Feb 10 14:25:41 2014 -0800

fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem

Recently due to a spike in connections per second memcached on 3
separate boxes triggered the OOM killer from accept.  At the time the
OOM killer was triggered there was 4GB out of 36GB free in zone 1.  The
problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
hold a bitmap, and there was sufficient fragmentation that the largest
page available was 8KiB.

I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
but I do agree that order 3 allocations are very likely to succeed.

There are always pathologies where order > 0 allocations can fail when
there are copious amounts of free memory available.  Using the pigeon
hole principle it is easy to show that it requires 1 page more than 50%
of the pages being free to guarantee an order 1 (8KiB) allocation will
succeed, 1 page more than 75% of the pages being free to guarantee an
order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
the pages being free to guarantee an order 3 allocate will succeed.

A server churning memory with a lot of small requests and replies like
memcached is a common case that, if anything, will skew the odds
against large pages being available.

Therefore let's not give external applications a practical way to kill
Linux server applications, and specify __GFP_NORETRY for the kmalloc in
alloc_fdmem.  Unless I am misreading the code, by the time the code
reaches should_alloc_retry in __alloc_pages_slowpath (where
__GFP_NORETRY becomes significant) we have already tried everything
reasonable to allocate a page and the only thing left to do is wait.  So
not waiting and falling back to vmalloc immediately seems like the
reasonable thing to do even if there wasn't a chance of triggering the
OOM killer.

Eric


Re: can't oom-kill zap the victim's memory?

2015-10-05 Thread Michal Hocko
On Fri 02-10-15 15:01:06, Linus Torvalds wrote:
> On Fri, Oct 2, 2015 at 8:36 AM, Michal Hocko  wrote:
> >
> > Have they been reported/fixed? All kernel paths doing an allocation are
> > _supposed_ to check and handle ENOMEM. If they are not then they are
> > buggy and should be fixed.
> 
> No. Stop this theoretical idiocy.
> 
> We've tried it. I objected before people tried it, and it turns out
> that it was a horrible idea.
> 
> Small kernel allocations should basically never fail, because we end
> up needing memory for random things, and if a kmalloc() fails it's
> because some application is using too much memory, and the application
> should be killed. Never should the kernel allocation fail. It really
> is that simple. If we are out of memory, that does not mean that we
> should start failing random kernel things.

But you do realize that killing a task as a memory reclaim technique is
not 100% reliable, right?

Any task might be blocked in an uninterruptible context (e.g. a mutex)
waiting for completion which depends on the allocation success. The page
allocator (resp. OOM killer) is not aware of these dependencies and I am
really skeptical it will ever be because dependency tracking is way too
expensive. So killing a task doesn't guarantee forward progress.
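
A minimal illustration of such a dependency (purely schematic; i_mutex is
just an example of an uninterruptible lock):

	/* Task A: holds the lock and loops in the allocator forever,
	 * because small allocations do not fail. */
	mutex_lock(&inode->i_mutex);
	page = alloc_page(GFP_KERNEL);

	/* Task B: picked as the OOM victim, but it sleeps uninterruptibly
	 * on the same lock, so it can never exit and free its memory. */
	mutex_lock(&inode->i_mutex);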

So I can see basically only a few ways out of this deadlock situation.
Either we face the reality and allow small allocations (without
__GFP_NOFAIL) to fail after all attempts to reclaim memory have failed
(so after even OOM killer hasn't made any progress).
Or we can start killing other tasks but this might end up in the same
state and the time to resolve the problem might be basically unbounded
(it is trivial to construct loads where hundreds of tasks are bashing
against a single i_mutex and all of them depending on an allocation...).
Or we can panic/reboot the system if the OOM situation cannot be solved
within a selected timeout.

There are other ways to micro-optimize the current implementation by
playing with memory reserves but all that is just postponing the final
disaster and there is still a point of no further progress that we have
to deal with somehow.

> So this "people should check for allocation failures" is bullshit.
> It's a computer science myth. It's simply not true in all cases.

Sure it is not true in _all_ cases. If some paths cannot fail they can
use __GFP_NOFAIL for that purpose. The point is that most allocations
_can_ handle the failure. People are taught to check for allocation
failures. We even have scripts/coccinelle/null/kmerr.cocci which helps
to detect slab allocator users to some degree.

> Kernel allocators that know that they do large allocations (ie bigger
> than a few pages) need to be able to handle the failure, but not the
> general case. Also, kernel allocators that know they have a good
> fallback (eg they try a large allocation first but can fall back to a
> smaller one) should use __GFP_NORETRY, but again, that does *not* in
> any way mean that general kernel allocations should randomly fail.
> 
> So no. The answer is ABSOLUTELY NOT "everybody should check allocation
> failure". Get over it. I refuse to go through that circus again. It's
> stupid.
> 
>  Linus

-- 
Michal Hocko
SUSE Labs


Re: can't oom-kill zap the victim's memory?

2015-10-02 Thread Linus Torvalds
On Fri, Oct 2, 2015 at 8:36 AM, Michal Hocko  wrote:
>
> Have they been reported/fixed? All kernel paths doing an allocation are
> _supposed_ to check and handle ENOMEM. If they are not then they are
> buggy and should be fixed.

No. Stop this theoretical idiocy.

We've tried it. I objected before people tried it, and it turns out
that it was a horrible idea.

Small kernel allocations should basically never fail, because we end
up needing memory for random things, and if a kmalloc() fails it's
because some application is using too much memory, and the application
should be killed. Never should the kernel allocation fail. It really
is that simple. If we are out of memory, that does not mean that we
should start failing random kernel things.
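
In allocator terms, the rule boils down to something like this in the
slowpath (simplified; the exact cutoff is the allocator's costly-order
threshold):

	/* keep looping (reclaim, then the OOM killer) for low-order
	 * requests unless the caller explicitly opted out */
	if (order <= PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NORETRY))
		goto retry;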

So this "people should check for allocation failures" is bullshit.
It's a computer science myth. It's simply not true in all cases.

Kernel allocators that know that they do large allocations (ie bigger
than a few pages) need to be able to handle the failure, but not the
general case. Also, kernel allocators that know they have a good
fallback (eg they try a large allocation first but can fall back to a
smaller one) should use __GFP_NORETRY, but again, that does *not* in
any way mean that general kernel allocations should randomly fail.

So no. The answer is ABSOLUTELY NOT "everybody should check allocation
failure". Get over it. I refuse to go through that circus again. It's
stupid.

 Linus


Re: can't oom-kill zap the victim's memory?

2015-10-02 Thread Tetsuo Handa
Michal Hocko wrote:
> On Mon 28-09-15 15:24:06, David Rientjes wrote:
> > I agree that i_mutex seems to be one of the most common offenders.  
> > However, I'm not sure I understand why holding it while trying to allocate 
> > infinitely for an order-0 allocation is problematic wrt the proposed 
> > kthread. 
> 
> I didn't say it would be problematic. We are talking past each other
> here. All I wanted to say was that a separate kernel oom thread wouldn't
> _help_ with the lock dependencies.
> 
Oops. I misunderstood you as being skeptical about the memory unmapping approach
due to lock dependency. Rather, you are skeptical about the use of a dedicated
kernel thread for the memory unmapping approach.

> > The kthread itself need only take mmap_sem for read.  If all 
> > threads sharing the mm with a victim have been SIGKILL'd, they should get 
> > TIF_MEMDIE set when reclaim fails and be able to allocate so that they can 
> > drop mmap_sem. 
> 
> which is the case if the direct oom context used trylock...
> So just to make it clear. I am not objecting a specialized oom kernel
> thread. It would work as well. I am just not convinced that it is really
> needed because the direct oom context can use trylock and do the same
> work directly.

Well, I think it depends on where we call the memory unmapping code from.

The first candidate is oom_kill_process() because it is a location where
the mm struct to unmap is determined. But since select_bad_process()
aborts upon encountering a TIF_MEMDIE task, we will fail to call memory
unmapping code again if the first down_trylock(&mm->mmap_sem) attempt in
oom_kill_process() failed. (Here I assumed that we allow all OOM victims
to access memory reserves so that subsequent down_trylock(&mm->mmap_sem)
attempts could succeed.)

The second candidate is select_bad_process() because it is a location
where we can call memory unmapping code again upon encountering a
TIF_MEMDIE task.

The third candidate is the caller of out_of_memory(), because it is a location
where we can call memory unmapping code again even when the OOM victims
are blocked. (Our discussion seems to assume that TIF_MEMDIE tasks can
make forward progress and die. But since TIF_MEMDIE tasks might encounter
unkillable locks after returning from allocation (e.g.
http://lkml.kernel.org/r/201509290118.bcj43256.tsfffmolhvo...@i-love.sakura.ne.jp
 ),
it will be safer not to assume that out_of_memory() can always be called.)
So I thought that a dedicated kernel thread makes it easy to call memory
unmapping code periodically, again and again.


Re: can't oom-kill zap the victim's memory?

2015-10-02 Thread Michal Hocko
On Tue 29-09-15 01:18:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > The point I've tried to make is that the oom unmapper running in a detached
> > context (e.g. kernel thread) vs. directly in the oom context doesn't
> > make any difference wrt. locking because the holders of the lock would loop
> > inside the allocator anyway because we do not fail small allocations.
> 
> We tried to allow small allocations to fail. It resulted in unstable system
> with obscure bugs.

Have they been reported/fixed? All kernel paths doing an allocation are
_supposed_ to check and handle ENOMEM. If they are not then they are
buggy and should be fixed.

> We tried to allow small !__GFP_FS allocations to fail. It failed to fail by
> effectively __GFP_NOFAIL allocations.

What do you mean by that? An opencoded __GFP_NOFAIL?
 
> We are now trying to allow zapping OOM victim's mm. Michal is already
> skeptical about this approach due to lock dependency.

I am not sure where this came from. I am all for this approach. It will
not solve the problem completely for sure but it can help in many cases
already.

> We already spent 9 months on this OOM livelock. No silver bullet yet.
> Proposed approaches are too drastic to backport for existing users.
> I think we are out of bullet.

Not at all. We have had this problem basically forever, and we have a lot
of legacy issues to care about. But nobody could reasonably expect this
to be solved in a short time period.

> Until we complete adding/testing __GFP_NORETRY (or __GFP_KILLABLE) to most
> of callsites,

This is simply not doable. There are thousands of allocation sites all
over the kernel.

> timeout based workaround will be the only bullet we can use.

Those are a last resort which only papers over real bugs that should
be fixed. I would agree with your urging if this were something that could
easily happen on a _properly_ configured system. A system which can blow
up into an OOM storm is far from being configured properly. If you have
untrusted users running on your system you had better put them into a
highly restricted environment and limit them as much as possible.

I can completely understand your frustration about the pace of the
progress here, but this is nothing new and we should strive for a long term
vision which would be much less fragile than what we have right now. A
timeout based solution is not a step in that direction.
-- 
Michal Hocko
SUSE Labs


Re: can't oom-kill zap the victim's memory?

2015-10-01 Thread Michal Hocko
On Mon 28-09-15 15:24:06, David Rientjes wrote:
> On Fri, 25 Sep 2015, Michal Hocko wrote:
> 
> > > > I am still not sure how you want to implement that kernel thread but I
> > > > am quite skeptical it would be very much useful because all the current
> > > > allocations which end up in the OOM killer path cannot simply back off
> > > > and drop the locks with the current allocator semantic.  So they will
> > > > be sitting on top of unknown pile of locks whether you do an additional
> > > > reclaim (unmap the anon memory) in the direct OOM context or looping
> > > > in the allocator and waiting for kthread/workqueue to do its work. The
> > > > only argument that I can see is the stack usage but I haven't seen stack
> > > > overflows in the OOM path AFAIR.
> > > > 
> > > 
> > > Which locks are you specifically interested in?
> > 
> > Any locks they were holding before they entered the page allocator (e.g.
> > i_mutex is the easiest one to trigger from the userspace but mmap_sem
> > might be involved as well because we are doing kmalloc(GFP_KERNEL) with
> > mmap_sem held for write). Those would be locked until the page allocator
> > returns, which with the current semantic might be _never_.
> > 
> 
> I agree that i_mutex seems to be one of the most common offenders.  
> However, I'm not sure I understand why holding it while trying to allocate 
> infinitely for an order-0 allocation is problematic wrt the proposed 
> kthread. 

I didn't say it would be problematic. We are talking past each other
here. All I wanted to say was that a separate kernel oom thread wouldn't
_help_ with the lock dependencies.

> The kthread itself need only take mmap_sem for read.  If all 
> threads sharing the mm with a victim have been SIGKILL'd, they should get 
> TIF_MEMDIE set when reclaim fails and be able to allocate so that they can 
> drop mmap_sem. 

which is the case if the direct oom context used trylock...
So just to make it clear. I am not objecting to a specialized oom kernel
thread. It would work as well. I am just not convinced that it is really
needed because the direct oom context can use trylock and do the same
work directly.
-- 
Michal Hocko
SUSE Labs


Re: can't oom-kill zap the victim's memory?

2015-10-01 Thread Tetsuo Handa
David Rientjes wrote:
> On Wed, 30 Sep 2015, Tetsuo Handa wrote:
> 
> > If we choose only 1 OOM victim, the possibility of hitting this memory
> > unmapping livelock is (say) 1%. But if we choose multiple OOM victims, the
> > possibility becomes (almost) 0%. And if we still hit this livelock even
> > after choosing many OOM victims, it is time to call panic().
> > 
> 
> Again, this is a fundamental disagreement between your approach of 
> randomly killing processes hoping that we target one that can make a quick 
> exit vs. my approach where we give threads access to memory reserves after 
> reclaim has failed in an oom livelock so they at least make forward 
> progress.  We're going around in circles.

I don't like that the memory management subsystem shows an expectant attitude
when memory allocation is failing. There are many possible silent hang-up
paths, and my customers' servers might be hitting such paths. But I can't
go stand in front of their servers and capture SysRq. Thus, I want to let the
memory management subsystem try to recover automatically, or at least emit
some diagnostic kernel messages automatically.

> 
> > (Well, do we need to change __alloc_pages_slowpath() that OOM victims do not
> > enter direct reclaim paths in order to avoid being blocked by unkillable fs
> > locks?)
> > 
> 
> OOM victims shouldn't need to enter reclaim, and there have been patches 
> before to abort reclaim if current has a pending SIGKILL,

Yes. shrink_inactive_list() and throttle_direct_reclaim() recognize
fatal_signal_pending() tasks.
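
For reference, the check in shrink_inactive_list() looks roughly like
this (simplified):

	while (unlikely(too_many_isolated(zone, file, sc))) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}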

>   if they have 
> access to memory reserves.

What does this mean?

shrink_inactive_list() and throttle_direct_reclaim() do not check whether
OOM victims have access to memory reserves, do they?

We don't allow access to memory reserves by OOM victims without TIF_MEMDIE.
I think that we should favor kthreads and dying threads over normal threads
at __alloc_pages_slowpath() but there is no response on
http://lkml.kernel.org/r/1442939668-4421-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
 .

> Nothing prevents the victim from already being 
> in reclaim, however, when it is killed.

I think this is problematic because there are unkillable locks in reclaim
paths. The memory management subsystem reports nothing.

> 
> > > Perhaps this is an argument that we need to provide access to memory 
> > > reserves for threads even for !__GFP_WAIT and !__GFP_FS in such 
> > > scenarios, 
> > > but I would wait to make that extension until we see it in practice.
> > 
> > I think that GFP_ATOMIC allocations already access memory reserves via
> > ALLOC_HIGH priority.
> > 
> 
> Yes, that's true.  It doesn't help for GFP_NOFS, however.  It may be 
> possible that GFP_ATOMIC reserves have been depleted or there is a 
> GFP_NOFS allocation that gets stuck looping forever that doesn't get the 
> ability to allocate without watermarks.

Why can't we emit some diagnostic kernel messages automatically?
Memory allocation requests which have not completed within e.g. 30 seconds
deserve a warning message about a possible memory allocation deadlock.

>  I'd wait to see it in practice 
> before making this extension since it relies on scanning the tasklist.
> 

Is this extension something like check_hung_uninterruptible_tasks()?


Re: can't oom-kill zap the victim's memory?

2015-09-30 Thread David Rientjes
On Wed, 30 Sep 2015, Tetsuo Handa wrote:

> If we choose only 1 OOM victim, the possibility of hitting this memory
> unmapping livelock is (say) 1%. But if we choose multiple OOM victims, the
> possibility becomes (almost) 0%. And if we still hit this livelock even
> after choosing many OOM victims, it is time to call panic().
> 

Again, this is a fundamental disagreement between your approach of 
randomly killing processes hoping that we target one that can make a quick 
exit vs. my approach where we give threads access to memory reserves after 
reclaim has failed in an oom livelock so they at least make forward 
progress.  We're going around in circles.

> (Well, do we need to change __alloc_pages_slowpath() that OOM victims do not
> enter direct reclaim paths in order to avoid being blocked by unkillable fs
> locks?)
> 

OOM victims shouldn't need to enter reclaim, and there have been patches 
before to abort reclaim if current has a pending SIGKILL, if they have 
access to memory reserves.  Nothing prevents the victim from already being 
in reclaim, however, when it is killed.

> > Perhaps this is an argument that we need to provide access to memory 
> > reserves for threads even for !__GFP_WAIT and !__GFP_FS in such scenarios, 
> > but I would wait to make that extension until we see it in practice.
> 
> I think that GFP_ATOMIC allocations already access memory reserves via
> ALLOC_HIGH priority.
> 

Yes, that's true.  It doesn't help for GFP_NOFS, however.  It may be 
possible that GFP_ATOMIC reserves have been depleted or there is a 
GFP_NOFS allocation that gets stuck looping forever that doesn't get the 
ability to allocate without watermarks.  I'd wait to see it in practice 
before making this extension since it relies on scanning the tasklist.


Re: can't oom-kill zap the victim's memory?

2015-09-30 Thread Tetsuo Handa
Tetsuo Handa wrote:
> (Well, do we need to change __alloc_pages_slowpath() that OOM victims do not
> enter direct reclaim paths in order to avoid being blocked by unkillable fs
> locks?)

I'm not familiar with how fs writeback manages memory. I feel I'm missing
something. Can somebody please re-check whether my illustrations are really
possible?

If they are really possible, I think we have yet another silent hang-up
sequence. Say, there is one userspace task named P1 and one kernel thread
named KT1.

(1) P1 enters into kernel mode via write() syscall.

(2) P1 allocates memory for buffered write.

(3) P1 dirties memory allocated for buffered write.

(4) P1 leaves kernel mode.

(5) KT1 finds dirtied memory.

(6) KT1 holds fs's unkillable lock for fs writeback.

(7) KT1 tries to allocate memory for fs writeback, but fails to allocate
because the watermark is low. KT1 cannot call out_of_memory() because it is
a !__GFP_FS allocation.

(8) P1 enters into kernel mode.

(9) P1 calls kmalloc(GFP_KERNEL) and is blocked at unkillable lock for fs
writeback held by KT1.

How do we allow KT1 to make forward progress? Are we giving access to
memory reserves (e.g. ALLOC_NO_WATERMARKS priority) to KT1?


Re: can't oom-kill zap the victim's memory?

2015-09-29 Thread Tetsuo Handa
David Rientjes wrote:
> I think both of your illustrations show why it is not helpful to kill 
> additional processes after a time period has elapsed and a victim has 
> failed to exit.  In both of your scenarios, it would require that KT1 be 
> killed to allow forward progress and we know that's not possible.

My illustrations show why it is helpful to kill additional processes after
a time period has elapsed and a victim has failed to exit. We don't need
to kill KT1 if we combine memory unmapping approach and timeout based OOM
killing approach.

Simply choosing more OOM victims (processes which do not share another OOM
victim's mm) based on a timeout does not by itself guarantee that the other
OOM victims can exit. But if timeout based OOM killing is used together with
the memory unmapping approach, the possibility that OOM victims can exit
increases significantly, because the only case where the memory unmapping
approach gets stuck will be when mm->mmap_sem is held for writing (which
should rarely occur).

If we choose only 1 OOM victim, the possibility of hitting this memory
unmapping livelock is (say) 1%. But if we choose multiple OOM victims, the
possibility becomes (almost) 0%. And if we still hit this livelock even
after choosing many OOM victims, it is time to call panic().

(Well, do we need to change __alloc_pages_slowpath() that OOM victims do not
enter direct reclaim paths in order to avoid being blocked by unkillable fs
locks?)

> 
> Perhaps this is an argument that we need to provide access to memory 
> reserves for threads even for !__GFP_WAIT and !__GFP_FS in such scenarios, 
> but I would wait to make that extension until we see it in practice.

I think that GFP_ATOMIC allocations already access memory reserves via
ALLOC_HIGH priority.

> 
> Killing all mm->mmap_sem threads certainly isn't meant to solve all oom 
> killer livelocks, as you show.
> 

Good.

I'm not opposing the memory unmapping approach. I'm just pointing out that
using the memory unmapping approach alone still leaves room for a hang-up.


Re: can't oom-kill zap the victim's memory?

2015-09-29 Thread David Rientjes
On Tue, 29 Sep 2015, Tetsuo Handa wrote:

> Is the story that simple? I think there are factors which can block memory
> allocation done with mmap_sem held for writing.
> 
>   down_write(&mm->mmap_sem);
>   kmalloc(size, GFP_KERNEL);
>   up_write(&mm->mmap_sem);
> 
> can involve locks inside __alloc_pages_slowpath().
> 
> Say, there are three userspace tasks named P1, P2T1, P2T2 and
> one kernel thread named KT1. Only P2T1 and P2T2 shares the same mm.
> KT1 is a kernel thread for fs writeback (maybe kswapd?).
> I think sequence shown below is possible.
> 
> (1) P1 enters into kernel mode via write() syscall.
> 
> (2) P1 allocates memory for buffered write.
> 
> (3) P2T1 enters into kernel mode and calls kmalloc().
> 
> (4) P2T1 arrives at __alloc_pages_may_oom() because there was no
> reclaimable memory. (Memory allocated by P1 is not reclaimable
> as of this moment.)
> 
> (5) P1 dirties memory allocated for buffered write.
> 
> (6) P2T2 enters into kernel mode and calls kmalloc() with
> mmap_sem held for writing.
> 
> (7) KT1 finds dirtied memory.
> 
> (8) KT1 holds fs's unkillable lock for fs writeback.
> 
> (9) P2T2 is blocked at unkillable lock for fs writeback held by KT1.
> 
> (10) P2T1 calls out_of_memory() and the OOM killer chooses P2T1 and sets
>  TIF_MEMDIE on both P2T1 and P2T2.
> 
> (11) P2T2 got TIF_MEMDIE but is blocked at unkillable lock for fs writeback
>  held by KT1.
> 
> (12) KT1 is trying to allocate memory for fs writeback. But since P2T1 and
>  P2T2 cannot release memory because memory unmapping code cannot hold
>  mmap_sem for reading, KT1 waits forever. OOM livelock completed!
> 
> I think sequence shown below is also possible. Say, there are three
> userspace tasks named P1, P2, P3 and one kernel thread named KT1.
> 
> (1) P1 enters into kernel mode via write() syscall.
> 
> (2) P1 allocates memory for buffered write.
> 
> (3) P2 enters into kernel mode and holds mmap_sem for writing.
> 
> (4) P3 enters into kernel mode and calls kmalloc().
> 
> (5) P3 arrives at __alloc_pages_may_oom() because there was no
> reclaimable memory. (Memory allocated by P1 is not reclaimable
> as of this moment.)
> 
> (6) P1 dirties memory allocated for buffered write.
> 
> (7) KT1 finds dirtied memory.
> 
> (8) KT1 holds fs's unkillable lock for fs writeback.
> 
> (9) P2 calls kmalloc() and is blocked at unkillable lock for fs writeback
> held by KT1.
> 
> (10) P3 calls out_of_memory() and the OOM killer chooses P2 and sets
>  TIF_MEMDIE on P2.
> 
> (11) P2 got TIF_MEMDIE but is blocked at unkillable lock for fs writeback
>  held by KT1.
> 
> (12) KT1 is trying to allocate memory for fs writeback. But since P2 cannot
>  release memory because memory unmapping code cannot hold mmap_sem for
>  reading, KT1 waits forever. OOM livelock completed!
> 
> So, allowing all OOM victim threads to use memory reserves does not guarantee
> that a thread which holds mmap_sem for writing can make forward progress.
> 

Thank you for writing this all out, it definitely helps to understand the 
concerns.

This, in my understanding, is the same scenario that requires not only oom 
victims to be able to access memory reserves, but also any thread after an 
oom victim has failed to make a timely exit.

I point out mm->mmap_sem as a special case because we have had fixes in 
the past, such as the special fatal_signal_pending() handling in 
__get_user_pages(), that try to ensure forward progress since we know that 
we need exclusive mm->mmap_sem for the victim to make an exit.
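
That handling is roughly (simplified from __get_user_pages()):

	/* bail out of the fault loop so a killed task can release
	 * mmap_sem instead of continuing to fault pages in */
	if (unlikely(fatal_signal_pending(current)))
		return i ? i : -ERESTARTSYS;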

I think both of your illustrations show why it is not helpful to kill 
additional processes after a time period has elapsed and a victim has 
failed to exit.  In both of your scenarios, it would require that KT1 be 
killed to allow forward progress and we know that's not possible.

Perhaps this is an argument that we need to provide access to memory 
reserves for threads even for !__GFP_WAIT and !__GFP_FS in such scenarios, 
but I would wait to make that extension until we see it in practice.

Killing all mm->mmap_sem threads certainly isn't meant to solve all oom 
killer livelocks, as you show.


Re: can't oom-kill zap the victim's memory?

2015-09-29 Thread Tetsuo Handa
David Rientjes wrote:
> On Fri, 25 Sep 2015, Michal Hocko wrote:
> > > > I am still not sure how you want to implement that kernel thread but I
> > > > am quite skeptical it would be very much useful because all the current
> > > > allocations which end up in the OOM killer path cannot simply back off
> > > > and drop the locks with the current allocator semantic.  So they will
> > > > be sitting on top of unknown pile of locks whether you do an additional
> > > > reclaim (unmap the anon memory) in the direct OOM context or looping
> > > > in the allocator and waiting for kthread/workqueue to do its work. The
> > > > only argument that I can see is the stack usage but I haven't seen stack
> > > > overflows in the OOM path AFAIR.
> > > > 
> > > 
> > > Which locks are you specifically interested in?
> > 
> > Any locks they were holding before they entered the page allocator (e.g.
> > i_mutex is the easiest one to trigger from the userspace but mmap_sem
> > might be involved as well because we are doing kmalloc(GFP_KERNEL) with
> > mmap_sem held for write). Those would be locked until the page allocator
> > returns, which with the current semantic might be _never_.
> > 
> 
> I agree that i_mutex seems to be one of the most common offenders.  
> However, I'm not sure I understand why holding it while trying to allocate 
> infinitely for an order-0 allocation is problematic wrt the proposed 
> kthread.  The kthread itself need only take mmap_sem for read.  If all 
> threads sharing the mm with a victim have been SIGKILL'd, they should get 
> TIF_MEMDIE set when reclaim fails and be able to allocate so that they can 
> drop mmap_sem.  We must ensure that any holder of mmap_sem cannot quickly 
> deplete memory reserves without properly checking for 
> fatal_signal_pending().

Is the story that simple? I think there are factors which can block memory
allocation done with mmap_sem held for writing.

  down_write(&mm->mmap_sem);
  kmalloc(size, GFP_KERNEL);
  up_write(&mm->mmap_sem);

can involve locks inside __alloc_pages_slowpath().

Say, there are three userspace tasks named P1, P2T1, P2T2 and
one kernel thread named KT1. Only P2T1 and P2T2 shares the same mm.
KT1 is a kernel thread for fs writeback (maybe kswapd?).
I think sequence shown below is possible.

(1) P1 enters into kernel mode via write() syscall.

(2) P1 allocates memory for buffered write.

(3) P2T1 enters into kernel mode and calls kmalloc().

(4) P2T1 arrives at __alloc_pages_may_oom() because there was no
reclaimable memory. (Memory allocated by P1 is not reclaimable
as of this moment.)

(5) P1 dirties memory allocated for buffered write.

(6) P2T2 enters into kernel mode and calls kmalloc() with
mmap_sem held for writing.

(7) KT1 finds dirtied memory.

(8) KT1 holds fs's unkillable lock for fs writeback.

(9) P2T2 is blocked at unkillable lock for fs writeback held by KT1.

(10) P2T1 calls out_of_memory() and the OOM killer chooses P2T1 and sets
 TIF_MEMDIE on both P2T1 and P2T2.

(11) P2T2 got TIF_MEMDIE but is blocked at unkillable lock for fs writeback
 held by KT1.

(12) KT1 is trying to allocate memory for fs writeback. But since P2T1 and
 P2T2 cannot release memory because memory unmapping code cannot hold
 mmap_sem for reading, KT1 waits forever. OOM livelock completed!

I think sequence shown below is also possible. Say, there are three
userspace tasks named P1, P2, P3 and one kernel thread named KT1.

(1) P1 enters into kernel mode via write() syscall.

(2) P1 allocates memory for buffered write.

(3) P2 enters into kernel mode and holds mmap_sem for writing.

(4) P3 enters into kernel mode and calls kmalloc().

(5) P3 arrives at __alloc_pages_may_oom() because there was no
reclaimable memory. (Memory allocated by P1 is not reclaimable
as of this moment.)

(6) P1 dirties memory allocated for buffered write.

(7) KT1 finds dirtied memory.

(8) KT1 holds fs's unkillable lock for fs writeback.

(9) P2 calls kmalloc() and is blocked at unkillable lock for fs writeback
held by KT1.

(10) P3 calls out_of_memory() and the OOM killer chooses P2 and sets
 TIF_MEMDIE on P2.

(11) P2 got TIF_MEMDIE but is blocked at unkillable lock for fs writeback
 held by KT1.

(12) KT1 is trying to allocate memory for fs writeback. But since P2 cannot
 release memory because memory unmapping code cannot hold mmap_sem for
 reading, KT1 waits forever. OOM livelock completed!

So, allowing all OOM victim threads to use memory reserves does not guarantee
that a thread which holds mmap_sem for writing can make forward progress.


Re: can't oom-kill zap the victim's memory?

2015-09-28 Thread David Rientjes
On Tue, 29 Sep 2015, Tetsuo Handa wrote:

> > The point I've tried to make is that the oom unmapper running in a detached
> > context (e.g. kernel thread) vs. directly in the oom context doesn't
> > make any difference wrt. locking because the holders of the lock would loop
> > inside the allocator anyway because we do not fail small allocations.
> 
> We tried to allow small allocations to fail. It resulted in unstable system
> with obscure bugs.
> 

These are helpful to identify regardless of the outcome of this 
discussion.  I'm not sure where the best place to report them would be, 
or whether it's even feasible to dig through looking for possibilities, but 
I think it would be interesting to see which callers are relying on 
internal page allocator implementation details to work properly, since it may 
uncover bugs that would occur later if it were changed.


Re: can't oom-kill zap the victim's memory?

2015-09-28 Thread David Rientjes
On Fri, 25 Sep 2015, Michal Hocko wrote:

> > > I am still not sure how you want to implement that kernel thread but I
> > > am quite skeptical it would be very much useful because all the current
> > > allocations which end up in the OOM killer path cannot simply back off
> > > and drop the locks with the current allocator semantic.  So they will
> > > be sitting on top of unknown pile of locks whether you do an additional
> > > reclaim (unmap the anon memory) in the direct OOM context or looping
> > > in the allocator and waiting for kthread/workqueue to do its work. The
> > > only argument that I can see is the stack usage but I haven't seen stack
> > > overflows in the OOM path AFAIR.
> > > 
> > 
> > Which locks are you specifically interested in?
> 
> Any locks they were holding before they entered the page allocator (e.g.
> i_mutex is the easiest one to trigger from the userspace but mmap_sem
> might be involved as well because we are doing kmalloc(GFP_KERNEL) with
> mmap_sem held for write). Those would be locked until the page allocator
> returns, which with the current semantic might be _never_.
> 

I agree that i_mutex seems to be one of the most common offenders.  
However, I'm not sure I understand why holding it while trying to allocate 
infinitely for an order-0 allocation is problematic wrt the proposed 
kthread.  The kthread itself need only take mmap_sem for read.  If all 
threads sharing the mm with a victim have been SIGKILL'd, they should get 
TIF_MEMDIE set when reclaim fails and be able to allocate so that they can 
drop mmap_sem.  We must ensure that any holder of mmap_sem cannot quickly 
deplete memory reserves without properly checking for 
fatal_signal_pending().

> > We have already discussed 
> > the usefulness of killing all threads on the system sharing the same ->mm, 
> > meaning all threads that are either holding or want to hold mm->mmap_sem 
> > will be able to allocate into memory reserves.  Any allocator holding 
> > down_write(&mm->mmap_sem) should be able to allocate and drop its lock.  
> > (Are you concerned about MAP_POPULATE?)
> 
> I am not sure I understand. We would have to fail the request in order
> the context which requested the memory could drop the lock. Are we
> talking about the same thing here?
> 

Not fail the request, they should be able to allocate from memory reserves 
when TIF_MEMDIE gets set.  This would require that threads in all gfp 
contexts are able to get TIF_MEMDIE set without an explicit call to 
out_of_memory() for !__GFP_FS.
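
For reference, "access to memory reserves" boils down to something like the
following (a rough sketch of the allocator's flag handling, not the verbatim
mainline code; the function name is made up):

static int gfp_to_alloc_flags_sketch(gfp_t gfp_mask)
{
        /* start from the normal low-watermark policy */
        int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;

        if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
                /* PF_MEMALLOC tasks and OOM victims may dip into the reserves */
                if ((current->flags & PF_MEMALLOC) ||
                    unlikely(test_thread_flag(TIF_MEMDIE)))
                        alloc_flags |= ALLOC_NO_WATERMARKS;
        }
        return alloc_flags;
}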

> > Heh, it's actually imperative to avoid livelocking based on mm->mmap_sem, 
> > it's the reason the code exists.  Any optimizations to that is certainly 
> > welcome, but we definitely need to send SIGKILL to all threads sharing the 
> > mm to make forward progress, otherwise we are going back to pre-2008 
> > livelocks.
> 
> Yes but mm is not shared between processes most of the time. CLONE_VM
> without CLONE_THREAD is more a corner case yet we have to crawl all the
> task_structs for _each_ OOM killer invocation. Yes this is an extreme
> slow path but still might take quite some unnecessary time.
>  

It must solve the issue you describe, killing other processes that share 
the ->mm, otherwise we have mm->mmap_sem livelock.  We are not concerned 
about iterating over all task_structs in the oom killer as a pain point; 
such users should already be using oom_kill_allocating_task, which is why 
it was introduced.


Re: can't oom-kill zap the victim's memory?

2015-09-28 Thread Tetsuo Handa
Michal Hocko wrote:
> The point I've tried to made is that oom unmapper running in a detached
> context (e.g. kernel thread) vs. directly in the oom context doesn't
> make any difference wrt. lock because the holders of the lock would loop
> inside the allocator anyway because we do not fail small allocations.

We tried to allow small allocations to fail. It resulted in unstable system
with obscure bugs.

We tried to allow small !__GFP_FS allocations to fail. That failed as well,
because such allocations are effectively __GFP_NOFAIL.

We are now trying to allow zapping OOM victim's mm. Michal is already
skeptical about this approach due to lock dependency.

We already spent 9 months on this OOM livelock. No silver bullet yet.
Proposed approaches are too drastic to backport for existing users.
I think we are out of bullets.

Until we complete adding/testing __GFP_NORETRY (or __GFP_KILLABLE) at most
of the callsites, a timeout based workaround will be the only bullet we can use.
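
For illustration, the per-callsite conversion would look something like this
(an arbitrary made-up example, not a patch to a specific callsite): the
allocation opts out of the endless retry loop and the caller handles the
failure itself.

static int example_callsite(size_t len)
{
        void *buf = kmalloc(len, GFP_KERNEL | __GFP_NORETRY);

        if (!buf)
                return -ENOMEM; /* the caller must now cope with failure */
        /* ... use buf ... */
        kfree(buf);
        return 0;
}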

Michal's panic_on_oom_timeout and David's "global access to memory reserves"
will be acceptable for some users if these approaches are used as opt-in.
Likewise, my memdie_task_skip_secs / memdie_task_panic_secs will be
acceptable for those who want to retry a bit more rather than panic on
accidental livelock if this approach is used as opt-in.

Tetsuo Handa wrote:
> Excuse me, but thinking about the CLONE_VM without CLONE_THREAD case...
> Isn't there a possibility of hitting livelocks at
> 
> /*
>  * If current has a pending SIGKILL or is exiting, then automatically
>  * select it.  The goal is to allow it to allocate so that it may
>  * quickly exit and free its memory.
>  *
>  * But don't select if current has already released its mm and cleared
>  * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
>  */
> if (current->mm &&
> (fatal_signal_pending(current) || task_will_free_mem(current))) {
> mark_oom_victim(current);
> return true;
> }
> 
> if the current thread receives SIGKILL just before reaching here, given that
> we don't send SIGKILL to all threads sharing the mm?

Seems that CLONE_VM without CLONE_THREAD is irrelevant here.
We have sequences like

  Do a GFP_KERNEL allocation.
  Hold a lock.
  Do a GFP_NOFS allocation.
  Release a lock.

where an example is seen in VFS operations which receive a pathname from
user space using getname() and then call VFS functions, where the filesystem
code takes locks which can contend with other threads.


diff --git a/fs/namei.c b/fs/namei.c
index d68c21f..d51c333 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4005,6 +4005,8 @@ int vfs_symlink(struct inode *dir, struct dentry *dentry, 
const char *oldname)
if (error)
return error;

+   if (fatal_signal_pending(current))
+   printk(KERN_INFO "Calling symlink with SIGKILL pending\n");
error = dir->i_op->symlink(dir, dentry, oldname);
if (!error)
fsnotify_create(dir, dentry);
@@ -4021,6 +4023,10 @@ SYSCALL_DEFINE3(symlinkat, const char __user *, oldname,
struct path path;
unsigned int lookup_flags = 0;

+   if (!strcmp(current->comm, "a.out")) {
+   printk(KERN_INFO "Sending SIGKILL to current thread\n");
+   do_send_sig_info(SIGKILL, SEND_SIG_FORCED, current, true);
+   }
from = getname(oldname);
if (IS_ERR(from))
return PTR_ERR(from);
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 996481e..2b6faa5 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -240,6 +240,8 @@ xfs_symlink(
if (error)
goto out_trans_cancel;

+   if (fatal_signal_pending(current))
+   printk(KERN_INFO "Calling xfs_ilock() with SIGKILL pending\n");
xfs_ilock(dp, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL |
  XFS_IOLOCK_PARENT | XFS_ILOCK_PARENT);
unlock_dp_on_error = true;


[  119.534976] Sending SIGKILL to current thread
[  119.535898] Calling symlink with SIGKILL pending
[  119.536870] Calling xfs_ilock() with SIGKILL pending

Any program can potentially hit this silent livelock. We can't predict
what locks the OOM victim threads will depend on after TIF_MEMDIE was
set by the OOM killer. Therefore, I think that the fact that TIF_MEMDIE
disables the OOM killer indefinitely is one of the possible causes of silent
hangup troubles.

Michal Hocko wrote:
> I really hate to do "easy" things now just to feel better about
> particular case which will kick us back little bit later. And from my
> own experience I can tell you that a more non-deterministic OOM behavior
> is thing people complain about.

I believe that not waiting indefinitely for a TIF_MEMDIE thread is the first
choice we can propose that people try. From my own experience I can

Re: can't oom-kill zap the victim's memory?

2015-09-25 Thread Tetsuo Handa
Michal Hocko wrote:
> On Thu 24-09-15 14:15:34, David Rientjes wrote:
> > > > Finally. Whatever we do, we need to change oom_kill_process() first,
> > > > and I think we should do this regardless. The "Kill all user processes
> > > > sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> > > > I'll try to make some patches tomorrow if I have time...
> > > 
> > > That would be appreciated. I do not like that part either. At least we
> > > shouldn't go over the whole list when we have a good chance that the mm
> > > is not shared with other processes.
> > > 
> > 
> > Heh, it's actually imperative to avoid livelocking based on mm->mmap_sem, 
> > it's the reason the code exists.  Any optimizations to that is certainly 
> > welcome, but we definitely need to send SIGKILL to all threads sharing the 
> > mm to make forward progress, otherwise we are going back to pre-2008 
> > livelocks.
> 
> Yes but mm is not shared between processes most of the time. CLONE_VM
> without CLONE_THREAD is more a corner case yet we have to crawl all the
> task_structs for _each_ OOM killer invocation. Yes this is an extreme
> slow path but still might take quite some unnecessary time.

Excuse me, but thinking about the CLONE_VM without CLONE_THREAD case...
Isn't there a possibility of hitting livelocks at

/*
 * If current has a pending SIGKILL or is exiting, then automatically
 * select it.  The goal is to allow it to allocate so that it may
 * quickly exit and free its memory.
 *
 * But don't select if current has already released its mm and cleared
 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 */
if (current->mm &&
(fatal_signal_pending(current) || task_will_free_mem(current))) {
mark_oom_victim(current);
return true;
}

if the current thread receives SIGKILL just before reaching here, given that
we don't send SIGKILL to all threads sharing the mm?

Hopefully current thread is not holding inode->i_mutex because reaching here
(i.e. calling out_of_memory()) suggests that we are doing GFP_KERNEL
allocation. But it could be !__GFP_NOFS && __GFP_NOFAIL allocation, or
different locks contended by another thread sharing the mm?

I don't like "That thread will now get access to memory reserves since it
has a pending fatal signal." line in comments for the "Kill all user
processes sharing victim->mm" logic. That thread won't get access to memory
reserves unless that thread can call out_of_memory() (i.e. doing __GFP_FS or
__GFP_NOFAIL allocations). Since I can observe that that thread may be doing
a !__GFP_FS allocation, I think that this comment needs to be updated.
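
To make that concrete, the gate I am talking about looks roughly like this
(paraphrased from __alloc_pages_may_oom(), not copied verbatim; the helper
name is made up):

static bool may_call_oom_killer_sketch(gfp_t gfp_mask)
{
        /*
         * IO-less (!__GFP_FS) requests never invoke out_of_memory() unless
         * they are also __GFP_NOFAIL, so their callers are never marked
         * TIF_MEMDIE on this path and never gain access to the reserves;
         * they just keep looping in the allocator.
         */
        if (!(gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NOFAIL))
                return false;
        return true;
}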


Re: can't oom-kill zap the victim's memory?

2015-09-25 Thread Michal Hocko
On Thu 24-09-15 14:15:34, David Rientjes wrote:
> On Wed, 23 Sep 2015, Michal Hocko wrote:
> 
> > I am still not sure how you want to implement that kernel thread but I
> > am quite skeptical it would be very much useful because all the current
> > allocations which end up in the OOM killer path cannot simply back off
> > and drop the locks with the current allocator semantic.  So they will
> > be sitting on top of unknown pile of locks whether you do an additional
> > reclaim (unmap the anon memory) in the direct OOM context or looping
> > in the allocator and waiting for kthread/workqueue to do its work. The
> > only argument that I can see is the stack usage but I haven't seen stack
> > overflows in the OOM path AFAIR.
> > 
> 
> Which locks are you specifically interested in?

Any locks they were holding before they entered the page allocator (e.g.
i_mutex is the easiest one to trigger from the userspace but mmap_sem
might be involved as well because we are doing kmalloc(GFP_KERNEL) with
mmap_sem held for write). Those would be locked until the page allocator
returns, which with the current semantic might be _never_.

> We have already discussed 
> the usefulness of killing all threads on the system sharing the same ->mm, 
> meaning all threads that are either holding or want to hold mm->mmap_sem 
> will be able to allocate into memory reserves.  Any allocator holding 
> down_write(&mm->mmap_sem) should be able to allocate and drop its lock.  
> (Are you concerned about MAP_POPULATE?)

I am not sure I understand. We would have to fail the request in order
the context which requested the memory could drop the lock. Are we
talking about the same thing here?

The point I've tried to made is that oom unmapper running in a detached
context (e.g. kernel thread) vs. directly in the oom context doesn't
make any difference wrt. lock because the holders of the lock would loop
inside the allocator anyway because we do not fail small allocations.
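
The "we do not fail small allocations" rule is roughly the following (a
paraphrase of the retry decision in the allocator slowpath, not the exact
code; the helper name is made up):

static bool should_retry_alloc_sketch(gfp_t gfp_mask, unsigned int order)
{
        if (gfp_mask & __GFP_NORETRY)
                return false;           /* the caller explicitly opted out */
        if (gfp_mask & __GFP_NOFAIL)
                return true;
        /*
         * Everything up to PAGE_ALLOC_COSTLY_ORDER is retried forever, i.e.
         * it is implicitly __GFP_NOFAIL, which is why lock holders loop in
         * the allocator instead of unwinding with -ENOMEM.
         */
        return order <= PAGE_ALLOC_COSTLY_ORDER;
}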

> > > Finally. Whatever we do, we need to change oom_kill_process() first,
> > > and I think we should do this regardless. The "Kill all user processes
> > > sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> > > I'll try to make some patches tomorrow if I have time...
> > 
> > That would be appreciated. I do not like that part either. At least we
> > shouldn't go over the whole list when we have a good chance that the mm
> > is not shared with other processes.
> > 
> 
> Heh, it's actually imperative to avoid livelocking based on mm->mmap_sem, 
> it's the reason the code exists.  Any optimizations to that is certainly 
> welcome, but we definitely need to send SIGKILL to all threads sharing the 
> mm to make forward progress, otherwise we are going back to pre-2008 
> livelocks.

Yes but mm is not shared between processes most of the time. CLONE_VM
without CLONE_THREAD is more a corner case yet we have to crawl all the
task_structs for _each_ OOM killer invocation. Yes this is an extreme
slow path but still might take quite some unnecessary time.
 
> > Yes I am not really sure why oom_score_adj is not per-mm and we are
> > doing that per signal struct to be honest. It doesn't make much sense as
> > the mm_struct is the primary source of information for the oom victim
> > selection. And the fact that mm might be shared without sharing signals
> > makes it double the reason to have it in mm.
> > 
> > It seems David has already tried that 2ff05b2b4eac ("oom: move oom_adj
> > value from task_struct to mm_struct") but it was later reverted by
> > 0753ba01e126 ("mm: revert "oom: move oom_adj value""). I do not agree
> > with the reasoning there because vfork is documented to have undefined
> > behavior
> > "
> >if the process created by vfork() either modifies any data other
> >than a variable of type pid_t used to store the return value
> >from vfork(), or returns from the function in which vfork() was
> >called, or calls any other function before successfully calling
> >_exit(2) or one of the exec(3) family of functions.
> > "
> > Maybe we can revisit this... It would make the whole semantic much more
> > straightforward. The current situation when you kill a task which might
> > share the mm with OOM unkillable task is clearly suboptimal and
> > confusing.
> > 
> 
> How do you reconcile this with commit 28b83c5193e7 ("oom: move oom_adj 
> value from task_struct to signal_struct")?

If the oom_score_adj is per mm then all the threads and processes which
share the mm would share the same value. So that would naturally extend
per-process to per address space sharing tasks and would be in line with
the above commit.

> We also must appreciate the 
> real-world usecase for an oom disabled process doing fork(), setting 
> /proc/child/oom_score_adj to non-disabled, and exec().

I guess you meant vfork mentioned in 0753ba01e126. I am not sure this
is a valid use of set_oom_adj. As the documentation explicitly states
this

Re: can't oom-kill zap the victim's memory?

2015-09-24 Thread David Rientjes
On Wed, 23 Sep 2015, Michal Hocko wrote:

> I am still not sure how you want to implement that kernel thread but I
> am quite skeptical it would be very much useful because all the current
> allocations which end up in the OOM killer path cannot simply back off
> and drop the locks with the current allocator semantic.  So they will
> be sitting on top of unknown pile of locks whether you do an additional
> reclaim (unmap the anon memory) in the direct OOM context or looping
> in the allocator and waiting for kthread/workqueue to do its work. The
> only argument that I can see is the stack usage but I haven't seen stack
> overflows in the OOM path AFAIR.
> 

Which locks are you specifically interested in?  We have already discussed 
the usefulness of killing all threads on the system sharing the same ->mm, 
meaning all threads that are either holding or want to hold mm->mmap_sem 
will be able to allocate into memory reserves.  Any allocator holding 
down_write(&mm->mmap_sem) should be able to allocate and drop its lock.  
(Are you concerned about MAP_POPULATE?)

> > Finally. Whatever we do, we need to change oom_kill_process() first,
> > and I think we should do this regardless. The "Kill all user processes
> > sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> > I'll try to make some patches tomorrow if I have time...
> 
> That would be appreciated. I do not like that part either. At least we
> shouldn't go over the whole list when we have a good chance that the mm
> is not shared with other processes.
> 

Heh, it's actually imperative to avoid livelocking based on mm->mmap_sem, 
it's the reason the code exists.  Any optimizations to that is certainly 
welcome, but we definitely need to send SIGKILL to all threads sharing the 
mm to make forward progress, otherwise we are going back to pre-2008 
livelocks.

> Yes I am not really sure why oom_score_adj is not per-mm and we are
> doing that per signal struct to be honest. It doesn't make much sense as
> the mm_struct is the primary source of information for the oom victim
> selection. And the fact that mm might be shared without sharing signals
> makes it double the reason to have it in mm.
> 
> It seems David has already tried that 2ff05b2b4eac ("oom: move oom_adj
> value from task_struct to mm_struct") but it was later reverted by
> 0753ba01e126 ("mm: revert "oom: move oom_adj value""). I do not agree
> with the reasoning there because vfork is documented to have undefined
> behavior
> "
>if the process created by vfork() either modifies any data other
>than a variable of type pid_t used to store the return value
>from vfork(), or returns from the function in which vfork() was
>called, or calls any other function before successfully calling
>_exit(2) or one of the exec(3) family of functions.
> "
> Maybe we can revisit this... It would make the whole semantic much more
> straightforward. The current situation when you kill a task which might
> share the mm with OOM unkillable task is clearly suboptimal and
> confusing.
> 

How do you reconcile this with commit 28b83c5193e7 ("oom: move oom_adj 
value from task_struct to signal_struct")?  We also must appreciate the 
real-world usecase for an oom disabled process doing fork(), setting 
/proc/child/oom_score_adj to non-disabled, and exec().
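
Spelled out as a minimal userspace sketch (error handling trimmed, "sleep 60"
merely stands in for the real workload):

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        pid_t pid = fork();     /* parent runs with oom_score_adj == -1000 */

        if (pid == 0) {
                execlp("sleep", "sleep", "60", (char *)NULL);
                _exit(127);
        }

        /* make the child OOM-killable again */
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);
        FILE *f = fopen(path, "w");
        if (f) {
                fprintf(f, "0\n");
                fclose(f);
        }
        waitpid(pid, NULL, 0);
        return 0;
}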


Re: can't oom-kill zap the victim's memory?

2015-09-23 Thread Michal Hocko
On Tue 22-09-15 18:06:08, Oleg Nesterov wrote:
> On 09/21, Michal Hocko wrote:
> >
> > On Mon 21-09-15 17:32:52, Oleg Nesterov wrote:
[...]
> > > We probably need a
> > > dedicated kernel thread, but I still think (although I am not sure) that
> > > initial change can use workqueue. In the likely case system_unbound_wq pool
> > > should have an idle thread, if not - OK, this change won't help in this
> > > case. This is minor.
> >
> > The point is that the implementation should be robust from the very
> > beginning.
> 
> OK, let it be a kthread from the very beginning, I won't argue. This
> is really minor compared to other problems.

I am still not sure how you want to implement that kernel thread but I
am quite skeptical it would be very much useful because all the current
allocations which end up in the OOM killer path cannot simply back off
and drop the locks with the current allocator semantic.  So they will
be sitting on top of unknown pile of locks whether you do an additional
reclaim (unmap the anon memory) in the direct OOM context or looping
in the allocator and waiting for kthread/workqueue to do its work. The
only argument that I can see is the stack usage but I haven't seen stack
overflows in the OOM path AFAIR.

> > > > So I think we probably need to do this in the OOM killer context (with
> > > > try_lock)
> > >
> > > Yes we should try to do this in the OOM killer context, and in this case
> > > (of course) we need trylock. Let me quote my previous email:
> > >
> > >   And we want to avoid using workqueues when the caller can do this
> > >   directly. And in this case we certainly need trylock. But this needs
> > >   some refactoring: we do not want to do this under oom_lock,
> >
> > Why do you think oom_lock would be a big deal?
> 
> I don't really know... This doesn't look sane to me, but perhaps this
> is just because I don't understand this code enough.

Well, one of the purposes of this lock is to throttle all the concurrent
allocators so they do not step on each other's toes, because only one task is
allowed to get killed at a time. So they wouldn't be of any use anyway.

> And note that the caller can held other locks we do not even know about.
> Most probably we should not deadlock, at least if we only unmap the anon
> pages, but still this doesn't look safe.

The unmapper cannot fall back to reclaim and/or trigger the OOM so
we should be indeed very careful and mark the allocation context
appropriately. I can remember mmu_gather but it is only doing
opportunistic allocation AFAIR.

> But I agree, this probably needs more discussion.
> 
> > Address space of the
> > victim might be really large but we can back off after a batch of
> > unmapped pages.
> 
> Hmm. If we already have mmap_sem and started zap_page_range() then
> I do not think it makes sense to stop until we free everything we can.

Zapping a huge address space can take quite some time and we really do
not have to free it all on behalf of the killer when enough memory is
freed to allow for further progress and the rest can be done by the
victim. If one batch doesn't seem sufficient then another retry can
continue.

I do not think that a limited scan would make the implementation more
complicated but I will leave the decision to you of course.
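
Something like the following is what I have in mind (an assumed helper, not
existing code; OOM_ZAP_BATCH_PAGES and oom_zap_batch() are made-up names):
zap only private anonymous mappings and stop after a batch, the victim or a
later pass can free the rest.

#define OOM_ZAP_BATCH_PAGES     (SZ_8M >> PAGE_SHIFT)

static void oom_zap_batch(struct mm_struct *mm)
{
        struct vm_area_struct *vma;
        unsigned long zapped = 0;

        if (!down_read_trylock(&mm->mmap_sem))
                return;         /* a writer holds it, try again later */

        for (vma = mm->mmap; vma && zapped < OOM_ZAP_BATCH_PAGES;
             vma = vma->vm_next) {
                /* only private anonymous memory, nothing file backed or shared */
                if (vma->vm_file || (vma->vm_flags & VM_SHARED))
                        continue;
                zap_page_range(vma, vma->vm_start,
                               vma->vm_end - vma->vm_start, NULL);
                zapped += (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
        }
        up_read(&mm->mmap_sem);
}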

> > I definitely agree with the simplicity for the first iteration. That
> > means only unmap private exclusive pages and release at most few megs of
> > them.
> 
> See above, I am not sure this makes sense. And in any case this will
> complicate the initial changes, not simplify.
> 
> > I am still not sure about some details, e.g. futex sitting in such
> > a memory. Wouldn't threads blow up when they see an unmapped futex page,
> > try to page it in and it would be in an uninitialized state? Maybe this
> > is safe
> 
> But this must be safe.
> 
> We do not care about userspace (assuming that all mm users have a
> pending SIGKILL).
> 
> If this can (say) crash the kernel somehow, then we have a bug which
> should be fixed. Simply because userspace can exploit this bug doing
> MADV_DONTNEED from another thread or CLONE_VM process.

OK, that makes perfect sense. I should have realized that an in-kernel
state for a futex must not be controlled from the userspace. So you are
right and futex shouldn't be a big deal.

> Finally. Whatever we do, we need to change oom_kill_process() first,
> and I think we should do this regardless. The "Kill all user processes
> sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> I'll try to make some patches tomorrow if I have time...

That would be appreciated. I do not like that part either. At least we
shouldn't go over the whole list when we have a good chance that the mm
is not shared with other processes.

> But. Can't we just remove another ->oom_score_adj check when we try
> to kill all mm users (the last for_each_process loop). If yes, this
> all can be simplified.
> 
> I guess we can't and it's a pity. Because it looks simply pointless
> to no

Re: can't oom-kill zap the victim's memory?

2015-09-22 Thread David Rientjes
On Tue, 22 Sep 2015, Oleg Nesterov wrote:

> Finally. Whatever we do, we need to change oom_kill_process() first,
> and I think we should do this regardless. The "Kill all user processes
> sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> I'll try to make some patches tomorrow if I have time...
> 

Killing all processes sharing the ->mm has been done in the past to 
obviously ensure that memory is eventually freed, but also to solve 
mm->mmap_sem livelocks where a thread is holding a contended mutex and 
needs a fatal signal to acquire TIF_MEMDIE if it calls into the oom killer 
and be able to allocate so that it may eventually drop the mutex.

> But. Can't we just remove another ->oom_score_adj check when we try
> to kill all mm users (the last for_each_process loop). If yes, this
> all can be simplified.
> 

For complete correctness, we would avoid killing any process that shares 
memory with an oom disabled thread since the oom killer shall not kill it 
and otherwise we do not free any memory.

> I guess we can't and it's a pity. Because it looks simply pointless
> to not kill all mm users. This just means the select_bad_process()
> picked the wrong task.
> 

This is a side-effect of moving oom scoring to signal_struct from 
mm_struct.  It could be improved separately by flagging mm_structs that 
are unkillable which would also allow for an optimization in 
find_lock_task_mm().

> And while this completely offtopic... why does it take task_lock()
> to protect ->comm? Sure, without task_lock() we can print garbage.
> Is it really that important? I am asking because sometime people
> think that it is not safe to use ->comm lockless, but this is not
> true.
> 

This has come up a couple of times in the past and, from what I recall, 
Andrew has said that we don't actually care: the string will always 
be terminated and, if we race, it does not really matter.  There are other 
places in the kernel where task_lock() isn't used solely to protect 
->comm.  It can be removed from the oom_kill_process() loop checking for 
other potential victims.


Re: can't oom-kill zap the victim's memory?

2015-09-22 Thread Oleg Nesterov
On 09/21, Michal Hocko wrote:
>
> On Mon 21-09-15 17:32:52, Oleg Nesterov wrote:
> > On 09/21, Michal Hocko wrote:
> > >
> > > On Mon 21-09-15 15:44:14, Oleg Nesterov wrote:
> > > [...]
> > > > So yes, in general oom_kill_process() can't call oom_unmap_func() 
> > > > directly.
> > > > That is why the patch uses queue_work(oom_unmap_func). The workqueue 
> > > > thread
> > > > takes mmap_sem and frees the memory allocated by user space.
> > >
> > > OK, this might have been a bit confusing. I didn't mean you cannot use
> > > mmap_sem directly from the workqueue context. You _can_ AFAICS. But I've
> > > mentioned that you _shouldn't_ use workqueue context in the first place
> > > because all the workers might be blocked on locks and new workers cannot
> > > be created due to memory pressure.
> >
> > Yes, yes, and I already tried to comment this part.
>
> OK then we are on the same page, good.

Yes, yes.

> > We probably need a
> > dedicated kernel thread, but I still think (although I am not sure) that
> > initial change can use workqueue. In the likely case system_unbound_wq pool
> > should have an idle thread, if not - OK, this change won't help in this
> > case. This is minor.
>
> The point is that the implementation should be robust from the very
> beginning.

OK, let it be a kthread from the very beginning, I won't argue. This
is really minor compared to other problems.

> > > So I think we probably need to do this in the OOM killer context (with
> > > try_lock)
> >
> > Yes we should try to do this in the OOM killer context, and in this case
> > (of course) we need trylock. Let me quote my previous email:
> >
> > And we want to avoid using workqueues when the caller can do this
> > directly. And in this case we certainly need trylock. But this needs
> > some refactoring: we do not want to do this under oom_lock,
>
> Why do you think oom_lock would be a big deal?

I don't really know... This doesn't look sane to me, but perhaps this
is just because I don't understand this code enough.

And note that the caller can held other locks we do not even know about.
Most probably we should not deadlock, at least if we only unmap the anon
pages, but still this doesn't look safe.

But I agree, this probably needs more discussion.

> Address space of the
> victim might be really large but we can back off after a batch of
> unmapped pages.

Hmm. If we already have mmap_sem and started zap_page_range() then
I do not think it makes sense to stop until we free everything we can.

> I definitely agree with the simplicity for the first iteration. That
> means only unmap private exclusive pages and release at most few megs of
> them.

See above, I am not sure this makes sense. And in any case this will
complicate the initial changes, not simplify.

> I am still not sure about some details, e.g. futex sitting in such
> a memory. Wouldn't threads blow up when they see an unmapped futex page,
> try to page it in and it would be in an uninitialized state? Maybe this
> is safe

But this must be safe.

We do not care about userspace (assuming that all mm users have a
pending SIGKILL).

If this can (say) crash the kernel somehow, then we have a bug which
should be fixed. Simply because userspace can exploit this bug doing
MADV_DONTNEED from another thread or CLONE_VM process.
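
A trivial userspace illustration of that point (the word below merely stands
in for a futex):

#include <stdint.h>
#include <sys/mman.h>

int main(void)
{
        uint32_t *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (page == MAP_FAILED)
                return 1;
        page[0] = 1;                            /* pretend this is a futex word */
        madvise(page, 4096, MADV_DONTNEED);     /* backing page is dropped */
        return page[0];                         /* reads back as 0 now */
}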



Finally. Whatever we do, we need to change oom_kill_process() first,
and I think we should do this regardless. The "Kill all user processes
sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
I'll try to make some patches tomorrow if I have time...

But. Can't we just remove another ->oom_score_adj check when we try
to kill all mm users (the last for_each_process loop). If yes, this
all can be simplified.

I guess we can't and it's a pity. Because it looks simply pointless
to not kill all mm users. This just means the select_bad_process()
picked the wrong task.


Say, vfork(). OK, it is possible that parent is OOM_SCORE_ADJ_MIN and
the child has already updated its oom_score_adj before exec. Now if
we to kill the child we will only upset the parent for no reason, this
won't help to free the memory.



And while this completely offtopic... why does it take task_lock()
to protect ->comm? Sure, without task_lock() we can print garbage.
Is it really that important? I am asking because sometime people
think that it is not safe to use ->comm lockless, but this is not
true.

Oleg.



Re: can't oom-kill zap the victim's memory?

2015-09-22 Thread Oleg Nesterov
On 09/22, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > On 09/22, Tetsuo Handa wrote:
> > >   rcu_read_lock();
> > >   for_each_process_thread(g, p) {
> > >   if (likely(!fatal_signal_pending(p)))
> > >   continue;
> > >   task_lock(p);
> > >   mm = p->mm;
> > >   if (mm && mm->mmap && !mm->mmap_zapped && 
> > > down_read_trylock(&mm->mmap_sem)) {
> >^^^
> >
> > We do not want mm->mmap_zapped, it can't work. We need mm->needs_zap
> > set by oom_kill_process() and cleared after zap_page_range().
> >
> > Because otherwise we can not handle CLONE_VM correctly. Suppose that
> > an innocent process P does vfork() and the child is killed but not
> > exited yet. mm_zapper() can find the child, do zap_page_range(), and
> > surprise its alive parent P which uses the same ->mm.
>
> kill(P's-child, SIGKILL) does not kill P sharing the same ->mm.
> Thus, mm_zapper() can be used for only OOM-kill case

Yes, and only if we know for sure that all tasks which can use
this ->mm were killed.

> and
> test_tsk_thread_flag(p, TIF_MEMDIE) should be used than
> fatal_signal_pending(p).

No. For example, just look at mark_oom_victim() at the start of
out_of_memory().

> > Tetsuo, can't we do something simple which "obviously can't hurt at
> > least" and then discuss the potential improvements?
>
> No problem. I can wait for your version.

All I wanted to say is that this all is a bit more complicated than it
looks at first glance.

Oleg.



Re: can't oom-kill zap the victim's memory?

2015-09-22 Thread Tetsuo Handa
Oleg Nesterov wrote:
> On 09/22, Tetsuo Handa wrote:
> >
> > I imagined a dedicated kernel thread doing something like shown below.
> > (I don't know about mm->mmap management.)
> > mm->mmap_zapped corresponds to MMF_MEMDIE.
> 
> No, it doesn't, please see below.
> 
> > bool has_sigkill_task;
> > wait_queue_head_t kick_mm_zapper;
> 
> OK, if this kthread is kicked by oom this makes more sense, but still
> doesn't look right at least initially.

Yes, I meant this kthread is kicked upon sending SIGKILL. But I forgot that

> 
> Let me repeat, I do think we need MMF_MEMDIE or something like it before
> we do something more clever. And in fact I think this flag makes sense
> regardless.
> 
> > static void mm_zapper(void *unused)
> > {
> > struct task_struct *g, *p;
> > struct mm_struct *mm;
> >
> > sleep:
> > wait_event(kick_mm_zapper, has_sigkill_task);
> > has_sigkill_task = false;
> > restart:
> > rcu_read_lock();
> > for_each_process_thread(g, p) {
> > if (likely(!fatal_signal_pending(p)))
> > continue;
> > task_lock(p);
> > mm = p->mm;
> > if (mm && mm->mmap && !mm->mmap_zapped && 
> > down_read_trylock(&mm->mmap_sem)) {
>^^^
> 
> We do not want mm->mmap_zapped, it can't work. We need mm->needs_zap
> set by oom_kill_process() and cleared after zap_page_range().
> 
> Because otherwise we can not handle CLONE_VM correctly. Suppose that
> an innocent process P does vfork() and the child is killed but not
> exited yet. mm_zapper() can find the child, do zap_page_range(), and
> surprise its alive parent P which uses the same ->mm.

kill(P's-child, SIGKILL) does not kill P sharing the same ->mm.
Thus, mm_zapper() can be used for only OOM-kill case and
test_tsk_thread_flag(p, TIF_MEMDIE) should be used than
fatal_signal_pending(p).

> 
> And if we rely on MMF_MEMDIE or mm->needs_zap or whatever then
> for_each_process_thread() doesn't really make sense. And if we have
> a single MMF_MEMDIE process (likely case) then the unconditional
> _trylock is suboptimal.

I guess the more likely case is that the OOM victim successfully exits
before mm_zapper() finds it.

I thought that a dedicated kernel thread which scans the task list can do
deferred zapping by automatically retrying (in a few seconds interval ?)
when down_read_trylock() failed. 

> 
> Tetsuo, can't we do something simple which "obviously can't hurt at
> least" and then discuss the potential improvements?

No problem. I can wait for your version.

> 
> And yes, yes, the "Kill all user processes sharing victim->mm" logic
> in oom_kill_process() doesn't 100% look right, at least wrt the change
> we discuss.

If we use test_tsk_thread_flag(p, TIF_MEMDIE), we will need to set
TIF_MEMDIE to the victim after sending SIGKILL to all processes sharing
the victim's mm. Well, the likely case that the OOM victim exits before
mm_zapper() finds it becomes not-so-likely case? Then, MMF_MEMDIE is
better than test_tsk_thread_flag(p, TIF_MEMDIE)...

> 
> Oleg.


Re: can't oom-kill zap the victim's memory?

2015-09-22 Thread Oleg Nesterov
On 09/22, Tetsuo Handa wrote:
>
> I imagined a dedicated kernel thread doing something like shown below.
> (I don't know about mm->mmap management.)
> mm->mmap_zapped corresponds to MMF_MEMDIE.

No, it doesn't, please see below.

> bool has_sigkill_task;
> wait_queue_head_t kick_mm_zapper;

OK, if this kthread is kicked by oom this makes more sense, but still
doesn't look right at least initially.

Let me repeat, I do think we need MMF_MEMDIE or something like it before
we do something more clever. And in fact I think this flag makes sense
regardless.

> static void mm_zapper(void *unused)
> {
>   struct task_struct *g, *p;
>   struct mm_struct *mm;
>
> sleep:
> wait_event(kick_mm_zapper, has_sigkill_task);
>   has_sigkill_task = false;
> restart:
>   rcu_read_lock();
>   for_each_process_thread(g, p) {
>   if (likely(!fatal_signal_pending(p)))
>   continue;
>   task_lock(p);
>   mm = p->mm;
>   if (mm && mm->mmap && !mm->mmap_zapped && 
> down_read_trylock(&mm->mmap_sem)) {
   ^^^

We do not want mm->mmap_zapped, it can't work. We need mm->needs_zap
set by oom_kill_process() and cleared after zap_page_range().

Because otherwise we can not handle CLONE_VM correctly. Suppose that
an innocent process P does vfork() and the child is killed but not
exited yet. mm_zapper() can find the child, do zap_page_range(), and
surprise its alive parent P which uses the same ->mm.

And if we rely on MMF_MEMDIE or mm->needs_zap or whatever then
for_each_process_thread() doesn't really make sense. And if we have
a single MMF_MEMDIE process (likely case) then the unconditional
_trylock is suboptimal.

Tetsuo, can't we do something simple which "obviously can't hurt at
least" and then discuss the potential improvements?

And yes, yes, the "Kill all user processes sharing victim->mm" logic
in oom_kill_process() doesn't 100% look right, at least wrt the change
we discuss.

Oleg.



Re: can't oom-kill zap the victim's memory?

2015-09-21 Thread David Rientjes
On Mon, 21 Sep 2015, Oleg Nesterov wrote:

> Yes we should try to do this in the OOM killer context, and in this case
> (of course) we need trylock. Let me quote my previous email:
> 
>   And we want to avoid using workqueues when the caller can do this
>   directly. And in this case we certainly need trylock. But this needs
>   some refactoring: we do not want to do this under oom_lock, otoh it
>   makes sense to do this from mark_oom_victim() if current && killed,
>   and a lot more details.
> 
> and probably this is another reason why do we need MMF_MEMDIE. But again,
> I think the initial change should be simple.
> 

I agree with the direction and I don't think it would be too complex to 
have a dedicated kthread that is kicked when we queue an mm to do 
MADV_DONTNEED behavior, and have that happen only if a trylock in 
oom_kill_process() fails to do it itself for anonymous mappings.  We may 
have different opinions of simplicity.
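
Roughly the shape I have in mind, as a sketch (all names below are made up
for illustration, the list linkage would need a new field in mm_struct, and
how exactly the mm is pinned and released is one of the details that still
needs agreement):

static LIST_HEAD(oom_zap_list);
static DEFINE_SPINLOCK(oom_zap_lock);
static DECLARE_WAIT_QUEUE_HEAD(oom_zap_wait);

/* called from oom_kill_process() when its own down_read_trylock() failed */
static void oom_queue_zap(struct mm_struct *mm)
{
        if (!atomic_inc_not_zero(&mm->mm_users))
                return;                 /* already exiting, nothing left to zap */
        spin_lock(&oom_zap_lock);
        list_add_tail(&mm->oom_zap_link, &oom_zap_list);  /* hypothetical field */
        spin_unlock(&oom_zap_lock);
        wake_up(&oom_zap_wait);
}

static int oom_zapper(void *unused)
{
        for (;;) {
                struct mm_struct *mm;

                wait_event(oom_zap_wait, !list_empty(&oom_zap_list));

                spin_lock(&oom_zap_lock);
                mm = list_first_entry_or_null(&oom_zap_list,
                                              struct mm_struct, oom_zap_link);
                if (mm)
                        list_del(&mm->oom_zap_link);
                spin_unlock(&oom_zap_lock);
                if (!mm)
                        continue;

                if (down_read_trylock(&mm->mmap_sem)) {
                        /* zap the anonymous mappings here, batched or not */
                        up_read(&mm->mmap_sem);
                }
                mmput(mm);
        }
        return 0;
}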


Re: can't oom-kill zap the victim's memory?

2015-09-21 Thread Linus Torvalds
On Mon, Sep 21, 2015 at 6:44 AM, Oleg Nesterov  wrote:
>
> I must have missed something. I can't understand your and Michal's
> concerns.

Heh.  I looked at that patch, and apparently entirely missed the
queue_work() part of the whole patch, thinking it was a direct call.

So never mind.

Linus


Re: can't oom-kill zap the victim's memory?

2015-09-21 Thread Tetsuo Handa
Oleg Nesterov wrote:
> Yes, yes, and I already tried to comment this part. We probably need a
> dedicated kernel thread, but I still think (although I am not sure) that
> initial change can use workqueue. In the likely case system_unbound_wq pool
> should have an idle thread, if not - OK, this change won't help in this
> case. This is minor.
> 
I imagined a dedicated kernel thread doing something like shown below.
(I don't know about mm->mmap management.)
mm->mmap_zapped corresponds to MMF_MEMDIE.
I think this kernel thread can be used for normal kill(pid, SIGKILL) cases.

--
bool has_sigkill_task;
wait_queue_head_t kick_mm_zapper;

static void mm_zapper(void *unused)
{
struct task_struct *g, *p;
struct mm_struct *mm;

sleep:
        wait_event(kick_mm_zapper, has_sigkill_task);
has_sigkill_task = false;
restart:
rcu_read_lock();
for_each_process_thread(g, p) {
if (likely(!fatal_signal_pending(p)))
continue;
task_lock(p);
mm = p->mm;
if (mm && mm->mmap && !mm->mmap_zapped && 
down_read_trylock(&mm->mmap_sem)) {
atomic_inc(&mm->mm_users);
task_unlock(p);
rcu_read_unlock();
if (mm->mmap && !mm->mmap_zapped)
zap_page_range(mm->mmap, 0, TASK_SIZE, NULL);
mm->mmap_zapped = 1;
up_read(&mm->mmap_sem);
mmput(mm);
cond_resched();
goto restart;
}
task_unlock(p);
}
rcu_read_unlock();
goto sleep;
}

kthread_run(mm_zapper, NULL, "mm_zapper");
--


Re: can't oom-kill zap the victim's memory?

2015-09-21 Thread Michal Hocko
On Mon 21-09-15 17:32:52, Oleg Nesterov wrote:
> On 09/21, Michal Hocko wrote:
> >
> > On Mon 21-09-15 15:44:14, Oleg Nesterov wrote:
> > [...]
> > > So yes, in general oom_kill_process() can't call oom_unmap_func() 
> > > directly.
> > > That is why the patch uses queue_work(oom_unmap_func). The workqueue 
> > > thread
> > > takes mmap_sem and frees the memory allocated by user space.
> >
> > OK, this might have been a bit confusing. I didn't mean you cannot use
> > mmap_sem directly from the workqueue context. You _can_ AFAICS. But I've
> > mentioned that you _shouldn't_ use workqueue context in the first place
> > because all the workers might be blocked on locks and new workers cannot
> > be created due to memory pressure.
> 
> Yes, yes, and I already tried to comment this part.

OK then we are on the same page, good.

> We probably need a
> dedicated kernel thread, but I still think (although I am not sure) that
> initial change can use workqueue. In the likely case system_unbound_wq pool
> should have an idle thread, if not - OK, this change won't help in this
> case. This is minor.

The point is that the implementation should be robust from the very
beginning. I am not sure what you mean by the idle thread here but the
rescuer can get stuck in the very same way as other workers. So I think that
we cannot rely on WQ for a real solution here.

> > So I think we probably need to do this in the OOM killer context (with
> > try_lock)
> 
> Yes we should try to do this in the OOM killer context, and in this case
> (of course) we need trylock. Let me quote my previous email:
> 
>   And we want to avoid using workqueues when the caller can do this
>   directly. And in this case we certainly need trylock. But this needs
>   some refactoring: we do not want to do this under oom_lock,

Why do you think oom_lock would be a big deal? Address space of the
victim might be really large but we can back off after a batch of
unmapped pages.

>   otoh it
>   makes sense to do this from mark_oom_victim() if current && killed,
>   and a lot more details.
> 
> and probably this is another reason why do we need MMF_MEMDIE. But again,
> I think the initial change should be simple.

I definitely agree with the simplicity for the first iteration. That
means only unmap private exclusive pages and release at most few megs of
them. I am still not sure about some details, e.g. futex sitting in such
a memory. Wouldn't threads blow up when they see an unmapped futex page,
try to page it in and it would be in an uninitialized state? Maybe this
is safe because they will die anyway but I am not familiar with that
code.
-- 
Michal Hocko
SUSE Labs


Re: can't oom-kill zap the victim's memory?

2015-09-21 Thread Oleg Nesterov
On 09/21, Michal Hocko wrote:
>
> On Mon 21-09-15 15:44:14, Oleg Nesterov wrote:
> [...]
> > So yes, in general oom_kill_process() can't call oom_unmap_func() directly.
> > That is why the patch uses queue_work(oom_unmap_func). The workqueue thread
> > takes mmap_sem and frees the memory allocated by user space.
>
> OK, this might have been a bit confusing. I didn't mean you cannot use
> mmap_sem directly from the workqueue context. You _can_ AFAICS. But I've
> mentioned that you _shouldn't_ use workqueue context in the first place
> because all the workers might be blocked on locks and new workers cannot
> be created due to memory pressure.

Yes, yes, and I already tried to comment this part. We probably need a
dedicated kernel thread, but I still think (although I am not sure) that
initial change can use workqueue. In the likely case system_unbound_wq pool
should have an idle thread, if not - OK, this change won't help in this
case. This is minor.

> So I think we probably need to do this in the OOM killer context (with
> try_lock)

Yes we should try to do this in the OOM killer context, and in this case
(of course) we need trylock. Let me quote my previous email:

And we want to avoid using workqueues when the caller can do this
directly. And in this case we certainly need trylock. But this needs
some refactoring: we do not want to do this under oom_lock, otoh it
makes sense to do this from mark_oom_victim() if current && killed,
and a lot more details.

and probably this is another reason why do we need MMF_MEMDIE. But again,
I think the initial change should be simple.

Oleg.



Re: can't oom-kill zap the victim's memory?

2015-09-21 Thread Michal Hocko
On Mon 21-09-15 15:44:14, Oleg Nesterov wrote:
[...]
> So yes, in general oom_kill_process() can't call oom_unmap_func() directly.
> That is why the patch uses queue_work(oom_unmap_func). The workqueue thread
> takes mmap_sem and frees the memory allocated by user space.

OK, this might have been a bit confusing. I didn't mean you cannot use
mmap_sem directly from the workqueue context. You _can_ AFAICS. But I've
mentioned that you _shouldn't_ use workqueue context in the first place
because all the workers might be blocked on locks and new workers cannot
be created due to memory pressure. This has been demonstrated already
where sysrq+f couldn't trigger OOM killer because the work item to do so
was waiting for a worker which never came...

So I think we probably need to do this in the OOM killer context (with
try_lock) or hand over to a special kernel thread. I am not sure a
special kernel thread is really worth that but maybe it will turn out to
be a better choice.
-- 
Michal Hocko
SUSE Labs


Re: can't oom-kill zap the victim's memory?

2015-09-21 Thread Oleg Nesterov
On 09/20, Raymond Jennings wrote:
>
> On 09/20/15 11:05, Linus Torvalds wrote:
>>
>> which can be called from just about any context (but atomic
>> allocations will never get here, so it can schedule etc).
>
> I think in this case the oom killer should just slap a SIGKILL on the
> task and then back out, and whatever needed the memory should just wait
> patiently for the sacrificial lamb to commit seppuku.

Not sure I understand you correctly, but this is what we currently do.
The only problem is that this doesn't work sometimes.

> Also, I observed that a task in the middle of dumping core doesn't
> respond to signals while it's dumping,

How did you observe this? The coredumping is killable.

Although yes, we have problems here in oom condition. In particular
with CLONE_VM tasks.

Oleg.



Re: can't oom-kill zap the victim's memory?

2015-09-21 Thread Oleg Nesterov
On 09/20, Linus Torvalds wrote:
>
> On Sun, Sep 20, 2015 at 5:56 AM, Oleg Nesterov  wrote:
> >
> > In this case the workqueue thread will block.
>
> What workqueue thread?

I must have missed something. I can't understand your and Michal's
concerns.

>pagefault_out_of_memory ->
>   out_of_memory ->
>  oom_kill_process
>
> as far as I can tell, this can be called by any task. Now, that
> pagefault case should only happen when the page fault comes from user
> space, but we also have
>
>__alloc_pages_slowpath ->
>   __alloc_pages_may_oom ->
>  out_of_memory ->
> oom_kill_process
>
> which can be called from just about any context (but atomic
> allocations will never get here, so it can schedule etc).

So yes, in general oom_kill_process() can't call oom_unmap_func() directly.
That is why the patch uses queue_work(oom_unmap_func). The workqueue thread
takes mmap_sem and frees the memory allocated by user space.

If this can lead to deadlock somehow, then we can hit the same deadlock
when an oom-killed thread calls exit_mm().

> So what's your point?

This can help if the killed process refuses to die and (of course) it
doesn't hold the mmap_sem for writing. Say, it waits for some mutex
held by the task which tries to alloc the memory and triggers oom.

> Explain again just how do you guarantee that you
> can take the mmap_sem.

This is not guaranteed, down_read(mmap_sem) can block forever. But this
means that the (killed) victim never drops mmap_sem / never exits, so
we lose anyway. We have no memory, oom-killer is blocked, etc.

Oleg.



Re: can't oom-kill zap the victim's memory?

2015-09-20 Thread Raymond Jennings

On 09/20/15 11:05, Linus Torvalds wrote:

On Sun, Sep 20, 2015 at 5:56 AM, Oleg Nesterov  wrote:

In this case the workqueue thread will block.

What workqueue thread?

pagefault_out_of_memory ->
   out_of_memory ->
  oom_kill_process

as far as I can tell, this can be called by any task. Now, that
pagefault case should only happen when the page fault comes from user
space, but we also have

__alloc_pages_slowpath ->
   __alloc_pages_may_oom ->
  out_of_memory ->
 oom_kill_process

which can be called from just about any context (but atomic
allocations will never get here, so it can schedule etc).


I think in this case the oom killer should just slap a SIGKILL on the 
task and then back out, and whatever needed the memory should just wait 
patiently for the sacrificial lamb to commit seppuku.


Which, btw, we should IMO encourage ASAP in the context of the lamb by 
having anything potentially locky or semaphory pay attention to whether the 
task in question has a fatal signal pending, and if so, drop everything 
and run like hell so that the task can cough up any locks or semaphores.
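
A concrete form of that, as a sketch (the function is made up, and whether
-EINTR can actually be propagated depends on the call site): use the killable
variants of the locking primitives so a task with SIGKILL pending backs out
instead of sleeping forever.

static int do_something_killable(struct inode *inode)
{
        if (mutex_lock_killable(&inode->i_mutex))
                return -EINTR;  /* a fatal signal is pending, back out */
        /* ... do the real work under i_mutex ... */
        mutex_unlock(&inode->i_mutex);
        return 0;
}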

So what's your point? Explain again just how do you guarantee that you
can take the mmap_sem.

Linus



Also, I observed that a task in the middle of dumping core doesn't 
respond to signals while it's dumping, and I would guess that might be 
the case even if the task receives a SIGKILL from the OOM handler.  Just 
a potential observation.




Re: can't oom-kill zap the victim's memory?

2015-09-20 Thread Linus Torvalds
On Sun, Sep 20, 2015 at 5:56 AM, Oleg Nesterov  wrote:
>
> In this case the workqueue thread will block.

What workqueue thread?

   pagefault_out_of_memory ->
  out_of_memory ->
 oom_kill_process

as far as I can tell, this can be called by any task. Now, that
pagefault case should only happen when the page fault comes from user
space, but we also have

   __alloc_pages_slowpath ->
  __alloc_pages_may_oom ->
 out_of_memory ->
oom_kill_process

which can be called from just about any context (but atomic
allocations will never get here, so it can schedule etc).

So what's your point? Explain again just how do you guarantee that you
can take the mmap_sem.

   Linus


Re: can't oom-kill zap the victim's memory?

2015-09-20 Thread Oleg Nesterov
On 09/20, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > On 09/17, Kyle Walker wrote:
> > >
> > > Currently, the oom killer will attempt to kill a process that is in
> > > TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> > > period of time, such as processes writing to a frozen filesystem during
> > > a lengthy backup operation, this can result in a deadlock condition as
> > > related processes memory access will stall within the page fault
> > > handler.
> >
> > And there are other potential reasons for deadlock.
> >
> > Stupid idea. Can't we help the memory hog to free its memory? This is
> > orthogonal to other improvements we can do.
>
> So, we are trying to release memory without waiting for arriving at
> exit_mm() from do_exit(), right? If it works, it will be a simple and
> small change that will be easy to backport.
>
> The idea is that since fatal_signal_pending() tasks no longer return to
> user space, we can release memory allocated for use by user space, right?

Yes.

> Then, I think that this approach can be applied to not only OOM-kill case
> but also regular kill(pid, SIGKILL) case (i.e. kick from signal_wake_up(1)
> or somewhere?).

I don't think so... but we might want to do this if (say) we are not going
to kill someone else because fatal_signal_pending(current).

> A dedicated kernel thread (not limited to OOM-kill purpose)
> scans for fatal_signal_pending() tasks and releases that task's memory.

Perhaps a dedicated kernel thread makes sense (see other emails),
but I don't think it should scan the killed threads. oom-kill should
kick it.

Anyway, let me repeat there are a lot of details we might want to
discuss. But the initial changes should be as simple as possible, imo.

Oleg.



Re: can't oom-kill zap the victim's memory?

2015-09-20 Thread Tetsuo Handa
Oleg Nesterov wrote:
> On 09/17, Kyle Walker wrote:
> >
> > Currently, the oom killer will attempt to kill a process that is in
> > TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> > period of time, such as processes writing to a frozen filesystem during
> > a lengthy backup operation, this can result in a deadlock condition as
> > related processes memory access will stall within the page fault
> > handler.
> 
> And there are other potential reasons for deadlock.
> 
> Stupid idea. Can't we help the memory hog to free its memory? This is
> orthogonal to other improvements we can do.

So, we are trying to release memory without waiting for the victim to
reach exit_mm() from do_exit(), right? If it works, it will be a simple and
small change that will be easy to backport.

The idea is that since fatal_signal_pending() tasks no longer return to
user space, we can release memory allocated for use by user space, right?

Then, I think that this approach can be applied not only to the OOM-kill case
but also to the regular kill(pid, SIGKILL) case (i.e. kick from signal_wake_up(1)
or somewhere?). A dedicated kernel thread (not limited to OOM-kill purpose)
scans for fatal_signal_pending() tasks and releases that task's memory.
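
A sketch of this scanning variant, with illustrative names throughout (the
thread later prefers having the OOM killer kick a thread directly rather than
scanning); note that the final mmput() from such a context would run
exit_mmap() on behalf of the victim:

#include <linux/kthread.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

/* Sketch of a thread that looks for killed tasks and zaps their memory. */
static int sigkill_reaper(void *unused)
{
	while (!kthread_should_stop()) {
		struct task_struct *p;
		struct mm_struct *mm = NULL;

		/* find one task with a pending SIGKILL that still has an mm */
		rcu_read_lock();
		for_each_process(p) {
			if (!fatal_signal_pending(p))
				continue;
			mm = get_task_mm(p);	/* bumps mm_users; does not sleep */
			if (mm)
				break;
		}
		rcu_read_unlock();

		if (mm) {
			oom_zap_victim_mm(mm);	/* illustrative zap-under-trylock helper */
			mmput(mm);	/* may be the last reference, i.e. exit_mmap() runs here */
		}
		schedule_timeout_interruptible(HZ);
	}
	return 0;
}
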
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: can't oom-kill zap the victim's memory?

2015-09-20 Thread Oleg Nesterov
On 09/19, Michal Hocko wrote:
>
> On Sat 19-09-15 17:03:16, Oleg Nesterov wrote:
> >
> > Stupid idea. Can't we help the memory hog to free its memory? This is
> > orthogonal to other improvements we can do.
> >
> > Please don't tell me the patch below is ugly, incomplete and suboptimal
> > in many ways, I know ;) I am not sure it is even correct. Just to explain
> > what I mean.
>
> Unmapping the memory for the oom victim has been already mentioned as a
> way to improve the OOM killer behavior. Nobody has implemented that yet
> though unfortunately. I have that on my TODO list since we have
> discussed it with Mel at LSF.

OK, good. So perhaps we should try to do this.

>
> > Perhaps oom_unmap_func() should only zap the anonymous vmas... and there
> > are a lot of other details which should be discussed if this can make any
> > sense.
>
> I have just returned from an internal conference so my head is
> completely cabbaged. I will have a look on Monday. From a quick look
> the idea is feasible. You cannot rely on the worker context because
> workqueues might be completely stuck at this stage.

Yes this is true. See another email, probably oom-kill.c needs its own
kthread.

And again, we should actually try to avoid queue_work or queue_kthread_work
in any case. But not in the initial implementation. And the initial implementation
could use workqueues, I think. In the likely case the system_unbound_wq pool
should have an idle thread.

> You also cannot
> take mmap_sem directly because that might be held already so you need
> a try_lock instead.

Still can't understand this part. See other emails, perhaps I missed
something.

> Focusing on anonymous vmas first sounds like a good
> idea to me because that would be simpler I guess.

And safer.

Oleg.
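
A sketch of the anonymous-only (and not mlocked) restriction, assuming a
helper like vma_is_anonymous() is available; the vm_flags set mirrors what
MADV_DONTNEED refuses, and the whole thing is illustrative rather than a
tested patch:

static void oom_zap_anon_only(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	if (!down_read_trylock(&mm->mmap_sem))
		return;

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (!vma_is_anonymous(vma))
			continue;	/* leave file and special mappings alone */
		if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB | VM_PFNMAP))
			continue;	/* the cases madvise_dontneed() refuses, too */
		zap_page_range(vma, vma->vm_start,
			       vma->vm_end - vma->vm_start, NULL);
	}

	up_read(&mm->mmap_sem);
}
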

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: can't oom-kill zap the victim's memory?

2015-09-20 Thread Oleg Nesterov
On 09/20, Michal Hocko wrote:
>
> On Sat 19-09-15 15:24:02, Linus Torvalds wrote:
> > On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov  wrote:
> > > +
> > > +static void oom_unmap_func(struct work_struct *work)
> > > +{
> > > +   struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
> > > +
> > > +   if (!atomic_inc_not_zero(&mm->mm_users))
> > > +   return;
> > > +
> > > +   // If this is not safe we can do use_mm() + unuse_mm()
> > > +   down_read(&mm->mmap_sem);
> >
> > I don't think this is safe.
> >
> > What makes you sure that we might not deadlock on the mmap_sem here?
> > For all we know, the process that is going out of memory is in the
> > middle of a mmap(), and already holds the mmap_sem for writing. No?
> >
> > So at the very least that needs to be a trylock, I think.
>
> Agreed.

Why? See my reply to Linus's email.

Just in case, yes sure the unconditional down_read() is suboptimal, but
this is minor compared to other problems we need to solve.

> > And I'm not
> > sure zap_page_range() is ok with the mmap_sem only held for reading.
> > Normally our rule is that you can *populate* the page tables
> concurrently, but you can't tear them down
>
> Actually mmap_sem for reading should be sufficient because we do not
> alter the layout. Both MADV_DONTNEED and MADV_FREE require read mmap_sem
> for example.

Yes, but see the ->vm_flags check in madvise_dontneed().

Oleg.
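
For reference, the check being pointed at looks roughly like this in kernels
of this era (mm/madvise.c): MADV_DONTNEED simply refuses locked, hugetlb and
PFN-mapped vmas rather than trying to handle them under the read lock:

static long madvise_dontneed(struct vm_area_struct *vma,
			     struct vm_area_struct **prev,
			     unsigned long start, unsigned long end)
{
	*prev = vma;
	if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB | VM_PFNMAP))
		return -EINVAL;

	zap_page_range(vma, start, end - start, NULL);
	return 0;
}
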

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: can't oom-kill zap the victim's memory?

2015-09-20 Thread Oleg Nesterov
On 09/19, Linus Torvalds wrote:
>
> On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov  wrote:
> > +
> > +static void oom_unmap_func(struct work_struct *work)
> > +{
> > +   struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
> > +
> > +   if (!atomic_inc_not_zero(&mm->mm_users))
> > +   return;
> > +
> > +   // If this is not safe we can do use_mm() + unuse_mm()
> > +   down_read(&mm->mmap_sem);
>
> I don't think this is safe.
>
> What makes you sure that we might not deadlock on the mmap_sem here?
> For all we know, the process that is going out of memory is in the
> middle of a mmap(), and already holds the mmap_sem for writing. No?

In this case the workqueue thread will block. But it cannot block
forever. I mean if it can then the killed process will never exit
(exit_mm does down_read) and release its memory, so we lose anyway.

But let me repeat, this patch is obviously not complete/etc.

> So at the very least that needs to be a trylock, I think.

And we want to avoid using workqueues when the caller can do this
directly. And in this case we certainly need trylock. But this needs
some refactoring: we do not want to do this under oom_lock, otoh it
makes sense to do this from mark_oom_victim() if current && killed,
and a lot more details.

The workqueue thread has other reasons for trylock, but probably not
in the initial version of this patch. And perhaps we should use a
dedicated kthread and not use workqueues at all. And yes, a single
"mm_struct *oom_unmap_mm" is ugly, it should be the list of mm's to
unmap, but then at least we need MMF_MEMDIE.
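
A sketch of what such a list-based handoff could look like. MMF_MEMDIE is the
proposed (non-existent) bit in mm->flags, oom_zap_link stands for a new
list_head that would have to be added to mm_struct, oom_zap_wait is whatever
the dedicated zapper thread sleeps on, and the mm is assumed to be pinned via
mm_count by the caller:

static LIST_HEAD(oom_zap_list);
static DEFINE_SPINLOCK(oom_zap_lock);

static void oom_zap_queue(struct mm_struct *mm)
{
	/* MMF_MEMDIE: proposed bit, prevents queueing the same mm twice */
	if (test_and_set_bit(MMF_MEMDIE, &mm->flags)) {
		mmdrop(mm);			/* already queued by an earlier kill */
		return;
	}

	spin_lock(&oom_zap_lock);
	list_add_tail(&mm->oom_zap_link, &oom_zap_list);	/* hypothetical new field */
	spin_unlock(&oom_zap_lock);

	wake_up(&oom_zap_wait);		/* wake whatever thread does the zapping */
}
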

> And I'm not
> sure zap_page_range() is ok with the mmap_sem only held for reading.
> Normally our rule is that you can *populate* the page tables
> concurrently, but you can't tear them down.

Well, according to madvise_need_mmap_write() MADV_DONTNEED does this
under down_read().

But yes, yes, this is probably not right anyway. Say, VM_LOCKED...
That is why I mentioned that perhaps this should only unmap the
anonymous pages. We can probably add zap_details->for_oom hint.



Another question is whether it is safe to abuse the foreign mm this way.
Well, zap_page_range_single() does this, so this is probably safe.
But we can do use_mm().

Oleg.
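
The use_mm() variant would look roughly like the sketch below; the caller must
hold an mm_users reference, and oom_zap_anon_only() is the illustrative helper
from the earlier sketch (i.e. not real kernel code):

static void oom_zap_via_use_mm(struct mm_struct *mm)
{
	use_mm(mm);		/* kernel thread temporarily adopts the victim's mm */
	oom_zap_anon_only(mm);	/* illustrative zap-under-trylock helper */
	unuse_mm(mm);		/* drop back to the previous (lazy) mm */
}
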

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: can't oom-kill zap the victim's memory?

2015-09-20 Thread Michal Hocko
On Sat 19-09-15 15:24:02, Linus Torvalds wrote:
> On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov  wrote:
> > +
> > +static void oom_unmap_func(struct work_struct *work)
> > +{
> > +   struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
> > +
> > +   if (!atomic_inc_not_zero(&mm->mm_users))
> > +   return;
> > +
> > +   // If this is not safe we can do use_mm() + unuse_mm()
> > +   down_read(&mm->mmap_sem);
> 
> I don't think this is safe.
> 
> What makes you sure that we might not deadlock on the mmap_sem here?
> For all we know, the process that is going out of memory is in the
> middle of a mmap(), and already holds the mmap_sem for writing. No?
> 
> So at the very least that needs to be a trylock, I think.

Agreed.

> And I'm not
> sure zap_page_range() is ok with the mmap_sem only held for reading.
> Normally our rule is that you can *populate* the page tables
> concurrently, but you can't tear them down

Actually mmap_sem for reading should be sufficient because we do not
alter the layout. Both MADV_DONTNEED and MADV_FREE require read mmap_sem
for example.
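
That locking rule is encoded in mm/madvise.c, roughly as below in kernels of
this era; the proposed MADV_FREE would join the read-lock cases:

static int madvise_need_mmap_write(int behavior)
{
	switch (behavior) {
	case MADV_REMOVE:
	case MADV_WILLNEED:
	case MADV_DONTNEED:
		return 0;	/* down_read(&mm->mmap_sem) is enough */
	default:
		/* be safe, default to the write lock */
		return 1;
	}
}
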

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: can't oom-kill zap the victim's memory?

2015-09-19 Thread Linus Torvalds
On Sat, Sep 19, 2015 at 4:00 PM, Raymond Jennings  wrote:
>
> Potentially stupid question that others may be asking: Is it legal to return
> EINTR from mmap() to let a SIGKILL from the OOM handler punch the task out
> of the kernel and back to userspace?

Yes. Note that mmap() itself seldom sleeps or allocates much memory
(yeah, there's the vma itself and some minimal stuff), so it's mainly
an issue for things like MAP_POPULATE etc.

The more common situation is things like uninterruptible reads when a
device (or network) is not responding, and we have special support for
"killable" waits that act like normal uninterruptible waits but can be
interrupted by deadly signals, exactly because for those cases we
don't need to worry about things like POSIX return value guarantees
("all or nothing" for file reads) etc.

So you do generally have to write extra code for the "killable sleep".
But it's a good thing to do, if you notice that certain cases aren't
responding well to oom killing because they keep on waiting.

Linus
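
The typical shape of such a conversion, sketched for a hypothetical driver
wait; wait_event_killable() is the real primitive, while my_dev and its fields
are made up for illustration:

#include <linux/errno.h>
#include <linux/types.h>
#include <linux/wait.h>

struct my_dev {
	wait_queue_head_t wq;
	bool done;
};

static int my_dev_wait(struct my_dev *dev)
{
	/* was: wait_event(dev->wq, dev->done); -- cannot be broken by SIGKILL */
	if (wait_event_killable(dev->wq, dev->done))
		return -ERESTARTSYS;	/* fatal signal: unwind so the task can exit and free its memory */
	return 0;
}
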
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: can't oom-kill zap the victim's memory?

2015-09-19 Thread Raymond Jennings

On 09/19/15 15:24, Linus Torvalds wrote:
> On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov  wrote:
> > +
> > +static void oom_unmap_func(struct work_struct *work)
> > +{
> > +   struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
> > +
> > +   if (!atomic_inc_not_zero(&mm->mm_users))
> > +   return;
> > +
> > +   // If this is not safe we can do use_mm() + unuse_mm()
> > +   down_read(&mm->mmap_sem);
> 
> I don't think this is safe.
> 
> What makes you sure that we might not deadlock on the mmap_sem here?
> For all we know, the process that is going out of memory is in the
> middle of a mmap(), and already holds the mmap_sem for writing. No?

Potentially stupid question that others may be asking: Is it legal to
return EINTR from mmap() to let a SIGKILL from the OOM handler punch the
task out of the kernel and back to userspace?

(sorry for the dupe btw, new email client snuck in html and I got bounced)

> So at the very least that needs to be a trylock, I think. And I'm not
> sure zap_page_range() is ok with the mmap_sem only held for reading.
> Normally our rule is that you can *populate* the page tables
> concurrently, but you can't tear them down.
> 
>  Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: can't oom-kill zap the victim's memory?

2015-09-19 Thread Linus Torvalds
On Sat, Sep 19, 2015 at 8:03 AM, Oleg Nesterov  wrote:
> +
> +static void oom_unmap_func(struct work_struct *work)
> +{
> +   struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
> +
> +   if (!atomic_inc_not_zero(&mm->mm_users))
> +   return;
> +
> +   // If this is not safe we can do use_mm() + unuse_mm()
> +   down_read(&mm->mmap_sem);

I don't think this is safe.

What makes you sure that we might not deadlock on the mmap_sem here?
For all we know, the process that is going out of memory is in the
middle of a mmap(), and already holds the mmap_sem for writing. No?

So at the very least that needs to be a trylock, I think. And I'm not
sure zap_page_range() is ok with the mmap_sem only held for reading.
Normally our rule is that you can *populate* the page tables
concurrently, but you can't tear them down.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: can't oom-kill zap the victim's memory?

2015-09-19 Thread Michal Hocko
On Sat 19-09-15 17:03:16, Oleg Nesterov wrote:
> On 09/17, Kyle Walker wrote:
> >
> > Currently, the oom killer will attempt to kill a process that is in
> > TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> > period of time, such as processes writing to a frozen filesystem during
> > a lengthy backup operation, this can result in a deadlock condition as
> > related processes' memory access will stall within the page fault
> > handler.
> 
> And there are other potential reasons for deadlock.
> 
> Stupid idea. Can't we help the memory hog to free its memory? This is
> orthogonal to other improvements we can do.
> 
> Please don't tell me the patch below is ugly, incomplete and suboptimal
> in many ways, I know ;) I am not sure it is even correct. Just to explain
> what I mean.

Unmapping the memory for the oom victim has been already mentioned as a
way to improve the OOM killer behavior. Nobody has implemented that yet
though unfortunately. I have that on my TODO list since we have
discussed it with Mel at LSF.

> Perhaps oom_unmap_func() should only zap the anonymous vmas... and there
> are a lot of other details which should be discussed if this can make any
> sense.

I have just returned from an internal conference so my head is
completely cabbaged. I will have a look on Monday. From a quick look
the idea is feasible. You cannot rely on the worker context because
workqueues might be completely stuck at this stage. You also cannot
take mmap_sem directly because that might be held already so you need
a try_lock instead. Focusing on anonymous vmas first sounds like a good
idea to me because that would be simpler I guess.

> 
> Oleg.
> ---
> 
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -493,6 +493,26 @@ void oom_killer_enable(void)
>   up_write(&oom_sem);
>  }
>  
> +static struct mm_struct *oom_unmap_mm;
> +
> +static void oom_unmap_func(struct work_struct *work)
> +{
> + struct mm_struct *mm = xchg(&oom_unmap_mm, NULL);
> +
> + if (!atomic_inc_not_zero(&mm->mm_users))
> + return;
> +
> + // If this is not safe we can do use_mm() + unuse_mm()
> + down_read(&mm->mmap_sem);
> + if (mm->mmap)
> + zap_page_range(mm->mmap, 0, TASK_SIZE, NULL);
> + up_read(&mm->mmap_sem);
> +
> + mmput(mm);
> + mmdrop(mm);
> +}
> +static DECLARE_WORK(oom_unmap_work, oom_unmap_func);
> +
>  #define K(x) ((x) << (PAGE_SHIFT-10))
>  /*
>   * Must be called while holding a reference to p, which will be released upon
> @@ -570,8 +590,8 @@ void oom_kill_process(struct task_struct *p, gfp_t 
> gfp_mask, int order,
>   victim = p;
>   }
>  
> - /* mm cannot safely be dereferenced after task_unlock(victim) */
>   mm = victim->mm;
> + atomic_inc(&mm->mm_count);
>   mark_tsk_oom_victim(victim);
>   pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
> file-rss:%lukB\n",
>   task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
> @@ -604,6 +624,10 @@ void oom_kill_process(struct task_struct *p, gfp_t 
> gfp_mask, int order,
>   rcu_read_unlock();
>  
>   do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> + if (cmpxchg(&oom_unmap_mm, NULL, mm))
> + mmdrop(mm);
> + else
> + queue_work(system_unbound_wq, &oom_unmap_work);
>   put_task_struct(victim);
>  }
>  #undef K

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: can't oom-kill zap the victim's memory?

2015-09-19 Thread Oleg Nesterov
(off-topic)

On 09/19, Oleg Nesterov wrote:
>
> @@ -570,8 +590,8 @@ void oom_kill_process(struct task_struct *p, gfp_t 
> gfp_mask, int order,
>   victim = p;
>   }
>
> - /* mm cannot safely be dereferenced after task_unlock(victim) */
>   mm = victim->mm;
> + atomic_inc(&mm->mm_count);

Btw, I think we need this change anyway. This is purely theoretical, but
otherwise this task can exit and free its mm_struct right after task_unlock(),
then this mm_struct can be reallocated and used by another task, so we
can't trust the "p->mm == mm" check below.

Oleg.
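
The distinction relied on here: bumping mm_count (paired with mmdrop()) only
keeps the mm_struct object itself from being freed and recycled, while
mm_users (paired with mmput()) keeps the whole address space alive. A small
sketch of the pinning pattern, with an illustrative function name:

static struct mm_struct *oom_pin_victim_mm(struct task_struct *victim)
{
	struct mm_struct *mm;

	task_lock(victim);
	mm = victim->mm;	/* only valid under task_lock() */
	if (mm)
		atomic_inc(&mm->mm_count);	/* pin the struct, not the address space */
	task_unlock(victim);

	return mm;	/* caller must mmdrop(); mmdrop() never runs exit_mmap() */
}
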

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/