Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-24 Thread Tetsuo Handa
Kyle Walker wrote:
> I agree, in lieu of treating TASK_UNINTERRUPTIBLE tasks as unkillable,
> and omitting them from the oom selection process, continuing the
> carnage is likely to result in more unpredictable results. At this
> time, I believe Oleg's solution of zapping the process memory use
> while it sleeps with the fatal signal enroute is ideal.

I cannot help thinking about the worst case.

(1) If memory zapping code successfully reclaimed some memory from
the mm struct used by the OOM victim, what guarantees that the
reclaimed memory is used by OOM victims (and processes which
are blocking OOM victims)?

David's "global access to memory reserves" allows a local unprivileged
user to deplete memory reserves; could allow that user to deplete the
reclaimed memory as well.

I think that my "Favor kthread and dying threads over normal threads"
( 
http://lkml.kernel.org/r/1442939668-4421-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
 )
would allow the reclaimed memory to be used by OOM victims and kernel
threads if the reclaimed memory is added to free list bit by bit
in a way that watermark remains low enough to prevent normal threads
from allocating the reclaimed memory.

But my patch still fails if normal threads are blocking the OOM
victims or unrelated kernel threads consume the reclaimed memory.

(2) If memory zapping code failed to reclaim enough memory from the mm
struct needed for the OOM victim, what mechanism can solve the OOM
stalls?

Some administrator sets /proc/pid/oom_score_adj to -1000 to most of
enterprise processes (e.g. java) and as a consequence only trivial
processes (e.g. grep / sed) are candidates for OOM victims.

Moreover, a local unprivileged user can easily fool the OOM killer using
decoy tasks (which consumes little memory and /proc/pid/oom_score_adj is
set to 999).

(3) If memory zapping code reclaimed no memory due to ->mmap_sem contention,
what mechanism can solve the OOM stalls?

While we don't allocate much memory with ->mmap_sem held for writing,
the task which is holding ->mmap_sem for writing can be chosen as
one of OOM victims. If such task receives SIGKILL but TIF_MEMDIE is not
set, it can form OOM-livelock unless all memory allocations with
->mmap_sem held for writing are __GFP_FS allocations and that task can
reach out_of_memory() (i.e. not blocked by unexpected factors such as
waiting for filesystem's writeback).

After all I think we have to consider what to do if memory zapping code
failed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-24 Thread Tetsuo Handa
Kyle Walker wrote:
> I agree, in lieu of treating TASK_UNINTERRUPTIBLE tasks as unkillable,
> and omitting them from the oom selection process, continuing the
> carnage is likely to result in more unpredictable results. At this
> time, I believe Oleg's solution of zapping the process memory use
> while it sleeps with the fatal signal enroute is ideal.

I cannot help thinking about the worst case.

(1) If memory zapping code successfully reclaimed some memory from
the mm struct used by the OOM victim, what guarantees that the
reclaimed memory is used by OOM victims (and processes which
are blocking OOM victims)?

David's "global access to memory reserves" allows a local unprivileged
user to deplete memory reserves; could allow that user to deplete the
reclaimed memory as well.

I think that my "Favor kthread and dying threads over normal threads"
( 
http://lkml.kernel.org/r/1442939668-4421-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
 )
would allow the reclaimed memory to be used by OOM victims and kernel
threads if the reclaimed memory is added to free list bit by bit
in a way that watermark remains low enough to prevent normal threads
from allocating the reclaimed memory.

But my patch still fails if normal threads are blocking the OOM
victims or unrelated kernel threads consume the reclaimed memory.

(2) If memory zapping code failed to reclaim enough memory from the mm
struct needed for the OOM victim, what mechanism can solve the OOM
stalls?

Some administrator sets /proc/pid/oom_score_adj to -1000 to most of
enterprise processes (e.g. java) and as a consequence only trivial
processes (e.g. grep / sed) are candidates for OOM victims.

Moreover, a local unprivileged user can easily fool the OOM killer using
decoy tasks (which consumes little memory and /proc/pid/oom_score_adj is
set to 999).

(3) If memory zapping code reclaimed no memory due to ->mmap_sem contention,
what mechanism can solve the OOM stalls?

While we don't allocate much memory with ->mmap_sem held for writing,
the task which is holding ->mmap_sem for writing can be chosen as
one of OOM victims. If such task receives SIGKILL but TIF_MEMDIE is not
set, it can form OOM-livelock unless all memory allocations with
->mmap_sem held for writing are __GFP_FS allocations and that task can
reach out_of_memory() (i.e. not blocked by unexpected factors such as
waiting for filesystem's writeback).

After all I think we have to consider what to do if memory zapping code
failed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-23 Thread Kyle Walker
On Tue, Sep 22, 2015 at 7:32 PM, David Rientjes  wrote:
>
> I struggle to understand how the approach of randomly continuing to kill
> more and more processes in the hope that it slows down usage of memory
> reserves or that we get lucky is better.

Thank you to one and all for the feedback.

I agree, in lieu of treating TASK_UNINTERRUPTIBLE tasks as unkillable,
and omitting them from the oom selection process, continuing the
carnage is likely to result in more unpredictable results. At this
time, I believe Oleg's solution of zapping the process memory use
while it sleeps with the fatal signal enroute is ideal.

Kyle Walker
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-23 Thread Kyle Walker
On Tue, Sep 22, 2015 at 7:32 PM, David Rientjes  wrote:
>
> I struggle to understand how the approach of randomly continuing to kill
> more and more processes in the hope that it slows down usage of memory
> reserves or that we get lucky is better.

Thank you to one and all for the feedback.

I agree, in lieu of treating TASK_UNINTERRUPTIBLE tasks as unkillable,
and omitting them from the oom selection process, continuing the
carnage is likely to result in more unpredictable results. At this
time, I believe Oleg's solution of zapping the process memory use
while it sleeps with the fatal signal enroute is ideal.

Kyle Walker
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-22 Thread David Rientjes
On Tue, 22 Sep 2015, Tetsuo Handa wrote:

> David Rientjes wrote:
> > Your proposal, which I mostly agree with, tries to kill additional 
> > processes so that they allocate and drop the lock that the original victim 
> > depends on.  My approach, from 
> > http://marc.info/?l=linux-kernel=144010444913702, is the same, but 
> > without the killing.  It's unecessary to kill every process on the system 
> > that is depending on the same lock, and we can't know which processes are 
> > stalling on that lock and which are not.
> 
> Would you try your approach with below program?
> (My reproducers are tested on XFS on a VM with 4 CPUs / 2048MB RAM.)
> 
> -- oom-depleter3.c start --
> #define _GNU_SOURCE
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> static int zero_fd = EOF;
> static char *buf = NULL;
> static unsigned long size = 0;
> 
> static int dummy(void *unused)
> {
>   static char buffer[4096] = { };
>   int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
>   while (write(fd, buffer, sizeof(buffer) == sizeof(buffer)) &&
>  fsync(fd) == 0);
>   return 0;
> }
> 
> static int trigger(void *unused)
> {
>   read(zero_fd, buf, size); /* Will cause OOM due to overcommit */
>   return 0;
> }
> 
> int main(int argc, char *argv[])
> {
> unsigned long i;
>   zero_fd = open("/dev/zero", O_RDONLY);
>   for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
>   char *cp = realloc(buf, size);
>   if (!cp) {
>   size >>= 1;
>   break;
>   }
>   buf = cp;
>   }
>   /*
>* Create many child threads in order to enlarge time lag between
>* the OOM killer sets TIF_MEMDIE to thread group leader and
>* the OOM killer sends SIGKILL to that thread.
>*/
>   for (i = 0; i < 1000; i++) {
>   clone(dummy, malloc(1024) + 1024, CLONE_SIGHAND | CLONE_VM,
> NULL);
>   }
>   /* Let a child thread trigger the OOM killer. */
>   clone(trigger, malloc(4096)+ 4096, CLONE_SIGHAND | CLONE_VM, NULL);
>   /* Deplete all memory reserve using the time lag. */
>   for (i = size; i; i -= 4096)
>   buf[i - 1] = 1;
>   return * (char *) NULL; /* Kill all threads. */
> }
> -- oom-depleter3.c end --
> 
> uptime > 350 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-1.txt.xz
> shows that the memory reserves completely depleted and
> uptime > 42 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-2.txt.xz
> shows that the memory reserves was not used at all.
> Is this result what you expected?
> 

What are the results when the kernel isn't patched at all?  The trade-off 
being made is that we want to attempt to make forward progress when there 
is an excessive stall in an oom victim making its exit rather than 
livelock the system forever waiting for memory that can never be 
allocated.

I struggle to understand how the approach of randomly continuing to kill 
more and more processes in the hope that it slows down usage of memory 
reserves or that we get lucky is better.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-22 Thread David Rientjes
On Tue, 22 Sep 2015, Tetsuo Handa wrote:

> David Rientjes wrote:
> > Your proposal, which I mostly agree with, tries to kill additional 
> > processes so that they allocate and drop the lock that the original victim 
> > depends on.  My approach, from 
> > http://marc.info/?l=linux-kernel=144010444913702, is the same, but 
> > without the killing.  It's unecessary to kill every process on the system 
> > that is depending on the same lock, and we can't know which processes are 
> > stalling on that lock and which are not.
> 
> Would you try your approach with below program?
> (My reproducers are tested on XFS on a VM with 4 CPUs / 2048MB RAM.)
> 
> -- oom-depleter3.c start --
> #define _GNU_SOURCE
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> static int zero_fd = EOF;
> static char *buf = NULL;
> static unsigned long size = 0;
> 
> static int dummy(void *unused)
> {
>   static char buffer[4096] = { };
>   int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
>   while (write(fd, buffer, sizeof(buffer) == sizeof(buffer)) &&
>  fsync(fd) == 0);
>   return 0;
> }
> 
> static int trigger(void *unused)
> {
>   read(zero_fd, buf, size); /* Will cause OOM due to overcommit */
>   return 0;
> }
> 
> int main(int argc, char *argv[])
> {
> unsigned long i;
>   zero_fd = open("/dev/zero", O_RDONLY);
>   for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
>   char *cp = realloc(buf, size);
>   if (!cp) {
>   size >>= 1;
>   break;
>   }
>   buf = cp;
>   }
>   /*
>* Create many child threads in order to enlarge time lag between
>* the OOM killer sets TIF_MEMDIE to thread group leader and
>* the OOM killer sends SIGKILL to that thread.
>*/
>   for (i = 0; i < 1000; i++) {
>   clone(dummy, malloc(1024) + 1024, CLONE_SIGHAND | CLONE_VM,
> NULL);
>   }
>   /* Let a child thread trigger the OOM killer. */
>   clone(trigger, malloc(4096)+ 4096, CLONE_SIGHAND | CLONE_VM, NULL);
>   /* Deplete all memory reserve using the time lag. */
>   for (i = size; i; i -= 4096)
>   buf[i - 1] = 1;
>   return * (char *) NULL; /* Kill all threads. */
> }
> -- oom-depleter3.c end --
> 
> uptime > 350 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-1.txt.xz
> shows that the memory reserves completely depleted and
> uptime > 42 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-2.txt.xz
> shows that the memory reserves was not used at all.
> Is this result what you expected?
> 

What are the results when the kernel isn't patched at all?  The trade-off 
being made is that we want to attempt to make forward progress when there 
is an excessive stall in an oom victim making its exit rather than 
livelock the system forever waiting for memory that can never be 
allocated.

I struggle to understand how the approach of randomly continuing to kill 
more and more processes in the hope that it slows down usage of memory 
reserves or that we get lucky is better.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-21 Thread Tetsuo Handa
David Rientjes wrote:
> Your proposal, which I mostly agree with, tries to kill additional 
> processes so that they allocate and drop the lock that the original victim 
> depends on.  My approach, from 
> http://marc.info/?l=linux-kernel=144010444913702, is the same, but 
> without the killing.  It's unecessary to kill every process on the system 
> that is depending on the same lock, and we can't know which processes are 
> stalling on that lock and which are not.

Would you try your approach with below program?
(My reproducers are tested on XFS on a VM with 4 CPUs / 2048MB RAM.)

-- oom-depleter3.c start --
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static int zero_fd = EOF;
static char *buf = NULL;
static unsigned long size = 0;

static int dummy(void *unused)
{
static char buffer[4096] = { };
int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
while (write(fd, buffer, sizeof(buffer) == sizeof(buffer)) &&
   fsync(fd) == 0);
return 0;
}

static int trigger(void *unused)
{
read(zero_fd, buf, size); /* Will cause OOM due to overcommit */
return 0;
}

int main(int argc, char *argv[])
{
unsigned long i;
zero_fd = open("/dev/zero", O_RDONLY);
for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
char *cp = realloc(buf, size);
if (!cp) {
size >>= 1;
break;
}
buf = cp;
}
/*
 * Create many child threads in order to enlarge time lag between
 * the OOM killer sets TIF_MEMDIE to thread group leader and
 * the OOM killer sends SIGKILL to that thread.
 */
for (i = 0; i < 1000; i++) {
clone(dummy, malloc(1024) + 1024, CLONE_SIGHAND | CLONE_VM,
  NULL);
}
/* Let a child thread trigger the OOM killer. */
clone(trigger, malloc(4096)+ 4096, CLONE_SIGHAND | CLONE_VM, NULL);
/* Deplete all memory reserve using the time lag. */
for (i = size; i; i -= 4096)
buf[i - 1] = 1;
return * (char *) NULL; /* Kill all threads. */
}
-- oom-depleter3.c end --

uptime > 350 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-1.txt.xz
shows that the memory reserves completely depleted and
uptime > 42 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-2.txt.xz
shows that the memory reserves was not used at all.
Is this result what you expected?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-21 Thread David Rientjes
On Sat, 19 Sep 2015, Tetsuo Handa wrote:

> I think that use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying
> cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the
> OOM victim task as soon as possible, but it turned out that it will not
> work if there is invisible lock dependency. Therefore, why not to give up
> "there should be only up to 1 TIF_MEMDIE task" rule?
> 

I don't see the connection between TIF_MEMDIE and ALLOC_NO_WATERMARKS 
being problematic.  It is simply the mechanism by which we give oom killed 
processes access to memory reserves if they need it.  I believe you are 
referring only to the oom killer stalling when it finds an oom victim.

> What this patch (and many others posted in various forms many times over
> past years) does is to give up "there should be only up to 1 TIF_MEMDIE
> task" rule. I think that we need to tolerate more than 1 TIF_MEMDIE tasks
> and somehow manage in a way memory reserves will not deplete.
> 

Your proposal, which I mostly agree with, tries to kill additional 
processes so that they allocate and drop the lock that the original victim 
depends on.  My approach, from 
http://marc.info/?l=linux-kernel=144010444913702, is the same, but 
without the killing.  It's unecessary to kill every process on the system 
that is depending on the same lock, and we can't know which processes are 
stalling on that lock and which are not.

I think it's much easier to simply identify such a situation where a 
process has not exited in a timely manner and then provide processes 
access to memory reserves without being killed.  We hope that the victim 
will have queued its mutex_lock() and allocators that are holding the lock 
will drop it after successfully utilizing memory reserves.

We can mitigate immediate depletion of memory reserves by requiring all 
allocators to reclaim (or compact) and calling the oom killer to identify 
the timeout before granting access to memory reserves for a single 
allocation before schedule_timeout_killable(1) and returning.

I don't know of any alternative solutions where we can guarantee that 
memory reserves cannot be depleted unless memory reserves are 100% of 
memory.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-21 Thread David Rientjes
On Fri, 18 Sep 2015, Christoph Lameter wrote:

> Subject: Allow multiple kills from the OOM killer
> 
> The OOM killer currently aborts if it finds a process that already is having
> access to the reserve memory pool for exit processing. This is done so that
> the reserves are not overcommitted but on the other hand this also allows
> only one process being oom killed at the time. That process may be stuck
> in D state.
> 
> Signed-off-by: Christoph Lameter 
> 
> Index: linux/mm/oom_kill.c
> ===
> --- linux.orig/mm/oom_kill.c  2015-09-18 11:58:52.963946782 -0500
> +++ linux/mm/oom_kill.c   2015-09-18 11:59:42.010684778 -0500
> @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
>* This task already has access to memory reserves and is being killed.
>* Don't allow any other task to have access to the reserves.
>*/
> - if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> - if (oc->order != -1)
> - return OOM_SCAN_ABORT;
> - }
> + if (test_tsk_thread_flag(task, TIF_MEMDIE))
> + return OOM_SCAN_CONTINUE;
> +
>   if (!task->mm)
>   return OOM_SCAN_CONTINUE;
> 

If this would result in the newly chosen process being guaranteed to exit, 
this would be fine.  Unfortunately, no such guarantee is possible.  If a 
thread is holding a contended mutex that the victim(s) require, this 
serial oom killer could eventually panic the system if that thread is 
OOM_DISABLE.

The solution that we have merged internally is described at 
http://marc.info/?l=linux-kernel=144010444913702 -- we provide access to 
memory reserves to processes that find a stalled exit in the oom killer so 
that they may allocate.  It comes along with a test module that takes a 
contended mutex and ensures that forward progress is made as long as 
memory reserves are not depleted.  We can't actually guarantee that memory 
reserves won't be depleted, but we (1) hope that nobody is actually 
allocating a lot of memory before dropping a mutex and (2) want to avoid 
the alternative which is a system livelock.

This will address situations such as

allocator   oom victim
-   --
mutex_lock(lock)
alloc_pages(GFP_KERNEL)
mutex_lock(lock)
mutex_unlock(lock)
handle SIGKILL

since this otherwise results in a livelock without a solution such as 
mine since the GFP_KERNEL allocation stalls forever waiting for the oom 
victim to acquire the mutex and exit.  This also works if the allocator is 
OOM_DISABLE.

This won't handle other situations where the victim gets wedged in D state 
and is not allocating memory, but this is by far the more common 
occurrence that we have dealt with.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-21 Thread David Rientjes
On Sat, 19 Sep 2015, Michal Hocko wrote:

> Nack to this. TASK_UNINTERRUPTIBLE should be time constrained/bounded
> state. Using it as an oom victim criteria makes the victim selection
> less deterministic which is undesirable. As much as I am aware of
> potential issues with the current implementation, making the behavior
> more random doesn't really help.
> 

Agreed, we can't avoid killing a process simply because it is in D state, 
this isn't an indication that the process will not be able to exit and in 
the worst case could panic the system if all other processes cannot be oom 
killed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-21 Thread Tetsuo Handa
David Rientjes wrote:
> Your proposal, which I mostly agree with, tries to kill additional 
> processes so that they allocate and drop the lock that the original victim 
> depends on.  My approach, from 
> http://marc.info/?l=linux-kernel=144010444913702, is the same, but 
> without the killing.  It's unecessary to kill every process on the system 
> that is depending on the same lock, and we can't know which processes are 
> stalling on that lock and which are not.

Would you try your approach with below program?
(My reproducers are tested on XFS on a VM with 4 CPUs / 2048MB RAM.)

-- oom-depleter3.c start --
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static int zero_fd = EOF;
static char *buf = NULL;
static unsigned long size = 0;

static int dummy(void *unused)
{
static char buffer[4096] = { };
int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
while (write(fd, buffer, sizeof(buffer) == sizeof(buffer)) &&
   fsync(fd) == 0);
return 0;
}

static int trigger(void *unused)
{
read(zero_fd, buf, size); /* Will cause OOM due to overcommit */
return 0;
}

int main(int argc, char *argv[])
{
unsigned long i;
zero_fd = open("/dev/zero", O_RDONLY);
for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
char *cp = realloc(buf, size);
if (!cp) {
size >>= 1;
break;
}
buf = cp;
}
/*
 * Create many child threads in order to enlarge time lag between
 * the OOM killer sets TIF_MEMDIE to thread group leader and
 * the OOM killer sends SIGKILL to that thread.
 */
for (i = 0; i < 1000; i++) {
clone(dummy, malloc(1024) + 1024, CLONE_SIGHAND | CLONE_VM,
  NULL);
}
/* Let a child thread trigger the OOM killer. */
clone(trigger, malloc(4096)+ 4096, CLONE_SIGHAND | CLONE_VM, NULL);
/* Deplete all memory reserve using the time lag. */
for (i = size; i; i -= 4096)
buf[i - 1] = 1;
return * (char *) NULL; /* Kill all threads. */
}
-- oom-depleter3.c end --

uptime > 350 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-1.txt.xz
shows that the memory reserves completely depleted and
uptime > 42 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-2.txt.xz
shows that the memory reserves was not used at all.
Is this result what you expected?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-21 Thread David Rientjes
On Sat, 19 Sep 2015, Tetsuo Handa wrote:

> I think that use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying
> cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the
> OOM victim task as soon as possible, but it turned out that it will not
> work if there is invisible lock dependency. Therefore, why not to give up
> "there should be only up to 1 TIF_MEMDIE task" rule?
> 

I don't see the connection between TIF_MEMDIE and ALLOC_NO_WATERMARKS 
being problematic.  It is simply the mechanism by which we give oom killed 
processes access to memory reserves if they need it.  I believe you are 
referring only to the oom killer stalling when it finds an oom victim.

> What this patch (and many others posted in various forms many times over
> past years) does is to give up "there should be only up to 1 TIF_MEMDIE
> task" rule. I think that we need to tolerate more than 1 TIF_MEMDIE tasks
> and somehow manage in a way memory reserves will not deplete.
> 

Your proposal, which I mostly agree with, tries to kill additional 
processes so that they allocate and drop the lock that the original victim 
depends on.  My approach, from 
http://marc.info/?l=linux-kernel=144010444913702, is the same, but 
without the killing.  It's unecessary to kill every process on the system 
that is depending on the same lock, and we can't know which processes are 
stalling on that lock and which are not.

I think it's much easier to simply identify such a situation where a 
process has not exited in a timely manner and then provide processes 
access to memory reserves without being killed.  We hope that the victim 
will have queued its mutex_lock() and allocators that are holding the lock 
will drop it after successfully utilizing memory reserves.

We can mitigate immediate depletion of memory reserves by requiring all 
allocators to reclaim (or compact) and calling the oom killer to identify 
the timeout before granting access to memory reserves for a single 
allocation before schedule_timeout_killable(1) and returning.

I don't know of any alternative solutions where we can guarantee that 
memory reserves cannot be depleted unless memory reserves are 100% of 
memory.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-21 Thread David Rientjes
On Sat, 19 Sep 2015, Michal Hocko wrote:

> Nack to this. TASK_UNINTERRUPTIBLE should be time constrained/bounded
> state. Using it as an oom victim criteria makes the victim selection
> less deterministic which is undesirable. As much as I am aware of
> potential issues with the current implementation, making the behavior
> more random doesn't really help.
> 

Agreed, we can't avoid killing a process simply because it is in D state, 
this isn't an indication that the process will not be able to exit and in 
the worst case could panic the system if all other processes cannot be oom 
killed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-21 Thread David Rientjes
On Fri, 18 Sep 2015, Christoph Lameter wrote:

> Subject: Allow multiple kills from the OOM killer
> 
> The OOM killer currently aborts if it finds a process that already is having
> access to the reserve memory pool for exit processing. This is done so that
> the reserves are not overcommitted but on the other hand this also allows
> only one process being oom killed at the time. That process may be stuck
> in D state.
> 
> Signed-off-by: Christoph Lameter 
> 
> Index: linux/mm/oom_kill.c
> ===
> --- linux.orig/mm/oom_kill.c  2015-09-18 11:58:52.963946782 -0500
> +++ linux/mm/oom_kill.c   2015-09-18 11:59:42.010684778 -0500
> @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
>* This task already has access to memory reserves and is being killed.
>* Don't allow any other task to have access to the reserves.
>*/
> - if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> - if (oc->order != -1)
> - return OOM_SCAN_ABORT;
> - }
> + if (test_tsk_thread_flag(task, TIF_MEMDIE))
> + return OOM_SCAN_CONTINUE;
> +
>   if (!task->mm)
>   return OOM_SCAN_CONTINUE;
> 

If this would result in the newly chosen process being guaranteed to exit, 
this would be fine.  Unfortunately, no such guarantee is possible.  If a 
thread is holding a contended mutex that the victim(s) require, this 
serial oom killer could eventually panic the system if that thread is 
OOM_DISABLE.

The solution that we have merged internally is described at 
http://marc.info/?l=linux-kernel=144010444913702 -- we provide access to 
memory reserves to processes that find a stalled exit in the oom killer so 
that they may allocate.  It comes along with a test module that takes a 
contended mutex and ensures that forward progress is made as long as 
memory reserves are not depleted.  We can't actually guarantee that memory 
reserves won't be depleted, but we (1) hope that nobody is actually 
allocating a lot of memory before dropping a mutex and (2) want to avoid 
the alternative which is a system livelock.

This will address situations such as

allocator   oom victim
-   --
mutex_lock(lock)
alloc_pages(GFP_KERNEL)
mutex_lock(lock)
mutex_unlock(lock)
handle SIGKILL

since this otherwise results in a livelock without a solution such as 
mine since the GFP_KERNEL allocation stalls forever waiting for the oom 
victim to acquire the mutex and exit.  This also works if the allocator is 
OOM_DISABLE.

This won't handle other situations where the victim gets wedged in D state 
and is not allocating memory, but this is by far the more common 
occurrence that we have dealt with.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-19 Thread Michal Hocko
On Sat 19-09-15 23:33:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > This has been posted in various forms many times over past years. I
> > still do not think this is a right approach of dealing with the problem.
> 
> I do not think "GFP_NOFS can fail" patch is a right approach because
> that patch easily causes messages like below.
> 
>   Buffer I/O error on dev sda1, logical block 34661831, lost async page write
>   XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
>   XFS: possible memory allocation deadlock in xfs_buf_allocate_memory 
> (mode:0x250)
>   XFS: possible memory allocation deadlock in kmem_zone_alloc (mode:0x8250)

These messages just tell you that the allocation fails repeatedly. Have
a look and check the code. They are basically opencoded NOFAIL
allocations. They haven't been converted to actually tell the MM layer
that they cannot fail because Dave said they have a long term plan to
change this code and basically implement different failing strategies.

> Adding __GFP_NOFAIL will hide these messages but OOM stall remains anyway.
> 
> I believe choosing more OOM victims is the only way which can solve OOM 
> stalls.

I am very well aware of your position and all the attempts to tweak
different code paths to actually pass your corner case. I, however, care
for the longer term goals more. And I believe that the page allocator
and the reclaim should strive for being less deadlock prone in the
first place.  That includes a more natural semantic and non-failing
default semantic is really error prone IMHO. We have been through this
discussion many times already and I've tried to express this is a long
term goal with incremental steps.
I really hate to do "easy" things now just to feel better about
particular case which will kick us back little bit later. And from my
own experience I can tell you that a more non-deterministic OOM behavior
is thing people complain about.

> > You can quickly deplete memory reserves this way without making further
> > progress (I am afraid you can even trigger this from userspace without
> > having big privileges) so even administrator will have no way to
> > intervene.
> 
> I think that use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying
> cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the
> OOM victim task as soon as possible, but it turned out that it will not
> work if there is invisible lock dependency.

Of course. This is a heurstic and as such it cannot ever work in 100%
situations. And it is not the first heuristic we have for the OOM
killer. The last time this has been all rewritten was because the OOM
killer was too unreliable/non-deterministic. Reports have decreased
considerable since then.

> Therefore, why not to give up
> "there should be only up to 1 TIF_MEMDIE task" rule?

This has been explained several times. There is no guaranteed this would
help and _your_ own usecase shows how you can end up with such a long
lock dependency chains that you can easily eat up the whole memory
reserves before you can make any progress.

I do agree that a hand break mechanism is really desirable for those who
really care.

> What this patch (and many others posted in various forms many times over
> past years) does is to give up "there should be only up to 1 TIF_MEMDIE
> task" rule. I think that we need to tolerate more than 1 TIF_MEMDIE tasks
> and somehow manage in a way memory reserves will not deplete.

But those two goes against each other.

[...]

> If you still want to keep "there should be only up to 1 TIF_MEMDIE task"
> rule, what alternative do you have? (I do not like panic_on_oom_timeout
> because it is more data-lossy approach than choosing next OOM victim.)

I am not married to 1 TIF_MEMDIE task thing. I just think that there is
still a lot of room for other improvements. The original issue which
triggered this discussion again is a good example. I completely miss why
a writer has to be unkillable when the fs is frozen. There are others
which are more complicated of course. Including the whole class
represented by GFP_NOFS allocations as you have noted. But we still have
a room for improvements even in the reclaim. It has been suggested quite
some time ago that the memory mapped by the OOM victim might be
unmapped. Basically what Oleg is proposing in other email. I didn't get
to read his email yet properly but that should certainly help to reduce
the problem space.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-19 Thread Oleg Nesterov
On 09/19, Michal Hocko wrote:
>
> This has been posted in various forms many times over past years. I
> still do not think this is a right approach of dealing with the problem.

Agreed. But still I think it makes sense to try to kill another task
if the victim refuse to die. Yes, the details are not clear to me.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-19 Thread Tetsuo Handa
Michal Hocko wrote:
> This has been posted in various forms many times over past years. I
> still do not think this is a right approach of dealing with the problem.

I do not think "GFP_NOFS can fail" patch is a right approach because
that patch easily causes messages like below.

  Buffer I/O error on dev sda1, logical block 34661831, lost async page write
  XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
  XFS: possible memory allocation deadlock in xfs_buf_allocate_memory 
(mode:0x250)
  XFS: possible memory allocation deadlock in kmem_zone_alloc (mode:0x8250)

Adding __GFP_NOFAIL will hide these messages but OOM stall remains anyway.

I believe choosing more OOM victims is the only way which can solve OOM stalls.

> You can quickly deplete memory reserves this way without making further
> progress (I am afraid you can even trigger this from userspace without
> having big privileges) so even administrator will have no way to
> intervene.

I think that use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying
cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the
OOM victim task as soon as possible, but it turned out that it will not
work if there is invisible lock dependency. Therefore, why not to give up
"there should be only up to 1 TIF_MEMDIE task" rule?

What this patch (and many others posted in various forms many times over
past years) does is to give up "there should be only up to 1 TIF_MEMDIE
task" rule. I think that we need to tolerate more than 1 TIF_MEMDIE tasks
and somehow manage in a way memory reserves will not deplete.

In my proposal which favors all fatal_signal_pending() tasks evenly
( 
http://lkml.kernel.org/r/201509102318.ghg18789.ohmslfjoqfo...@i-love.sakura.ne.jp
 )
suggests that the OOM victim task unlikely needs all of memory reserves.
In other words, the OOM victim task can likely make forward progress
if some amount of memory reserves are allowed (compared to normal tasks
waiting for memory).

So, I think that getting rid of "ALLOC_NO_WATERMARKS via TIF_MEMDIE" rule
and replace test_thread_flag(TIF_MEMDIE) with fatal_signal_pending(current)
will handle many cases if fatal_signal_pending() tasks are allowed to access
some amount of memory reserves. And my proposal which chooses next OOM
victim upon timeout will handle the remaining cases without depleting
memory reserves.

If you still want to keep "there should be only up to 1 TIF_MEMDIE task"
rule, what alternative do you have? (I do not like panic_on_oom_timeout
because it is more data-lossy approach than choosing next OOM victim.)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-19 Thread Michal Hocko
On Fri 18-09-15 12:00:59, Christoph Lameter wrote:
[...]
> Subject: Allow multiple kills from the OOM killer
> 
> The OOM killer currently aborts if it finds a process that already is having
> access to the reserve memory pool for exit processing. This is done so that
> the reserves are not overcommitted but on the other hand this also allows
> only one process being oom killed at the time. That process may be stuck
> in D state.

This has been posted in various forms many times over past years. I
still do not think this is a right approach of dealing with the problem.
You can quickly deplete memory reserves this way without making further
progress (I am afraid you can even trigger this from userspace without
having big privileges) so even administrator will have no way to
intervene.

> Signed-off-by: Christoph Lameter 
> 
> Index: linux/mm/oom_kill.c
> ===
> --- linux.orig/mm/oom_kill.c  2015-09-18 11:58:52.963946782 -0500
> +++ linux/mm/oom_kill.c   2015-09-18 11:59:42.010684778 -0500
> @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
>* This task already has access to memory reserves and is being killed.
>* Don't allow any other task to have access to the reserves.
>*/
> - if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> - if (oc->order != -1)
> - return OOM_SCAN_ABORT;
> - }
> + if (test_tsk_thread_flag(task, TIF_MEMDIE))
> + return OOM_SCAN_CONTINUE;
> +
>   if (!task->mm)
>   return OOM_SCAN_CONTINUE;

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-19 Thread Michal Hocko
On Fri 18-09-15 10:41:09, Christoph Lameter wrote:
[...]
>   if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> - if (oc->order != -1)
> - return OOM_SCAN_ABORT;
> + if (unlikely(frozen(task)))
> + __thaw_task(task);

TIF_MEMDIE processes will get thawed automatically and then cannot be
frozen again. Have a look at mark_oom_victim.

>   }
>   if (!task->mm)
>   return OOM_SCAN_CONTINUE;

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-19 Thread Michal Hocko
On Thu 17-09-15 13:59:43, Kyle Walker wrote:
> Currently, the oom killer will attempt to kill a process that is in
> TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> period of time, such as processes writing to a frozen filesystem during
> a lengthy backup operation, this can result in a deadlock condition as
> related processes memory access will stall within the page fault
> handler.

I am not familiar with the fs freezing code so I might be missing
something important here. __sb_start_write waits for the frozen fs by
wait_event which is really UN sleep. Why cannot we sleep here in IN
sleep and return with EINTR when interrupted? I would consider this
a better behavior not only because of OOM because having unkillable
tasks in general is undesirable. AFAIU the fs might be frozen for ever
and admin cannot do anything about the pending processes.

> Within oom_unkillable_task(), check for processes in
> TASK_UNINTERRUPTIBLE (TASK_KILLABLE omitted). The oom killer will
> move on to another task.

Nack to this. TASK_UNINTERRUPTIBLE should be time constrained/bounded
state. Using it as an oom victim criteria makes the victim selection
less deterministic which is undesirable. As much as I am aware of
potential issues with the current implementation, making the behavior
more random doesn't really help.

> Signed-off-by: Kyle Walker 
> ---
>  mm/oom_kill.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ecc0bc..66f03f8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -131,6 +131,10 @@ static bool oom_unkillable_task(struct task_struct *p,
>   if (memcg && !task_in_mem_cgroup(p, memcg))
>   return true;
>  
> + /* Uninterruptible tasks should not be killed unless in TASK_WAKEKILL */
> + if (p->state == TASK_UNINTERRUPTIBLE)
> + return true;
> +
>   /* p may not have freeable memory in nodemask */
>   if (!has_intersects_mems_allowed(p, nodemask))
>   return true;
> -- 
> 2.4.3

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-19 Thread Michal Hocko
On Fri 18-09-15 12:00:59, Christoph Lameter wrote:
[...]
> Subject: Allow multiple kills from the OOM killer
> 
> The OOM killer currently aborts if it finds a process that already is having
> access to the reserve memory pool for exit processing. This is done so that
> the reserves are not overcommitted but on the other hand this also allows
> only one process being oom killed at the time. That process may be stuck
> in D state.

This has been posted in various forms many times over past years. I
still do not think this is a right approach of dealing with the problem.
You can quickly deplete memory reserves this way without making further
progress (I am afraid you can even trigger this from userspace without
having big privileges) so even administrator will have no way to
intervene.

> Signed-off-by: Christoph Lameter 
> 
> Index: linux/mm/oom_kill.c
> ===
> --- linux.orig/mm/oom_kill.c  2015-09-18 11:58:52.963946782 -0500
> +++ linux/mm/oom_kill.c   2015-09-18 11:59:42.010684778 -0500
> @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
>* This task already has access to memory reserves and is being killed.
>* Don't allow any other task to have access to the reserves.
>*/
> - if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> - if (oc->order != -1)
> - return OOM_SCAN_ABORT;
> - }
> + if (test_tsk_thread_flag(task, TIF_MEMDIE))
> + return OOM_SCAN_CONTINUE;
> +
>   if (!task->mm)
>   return OOM_SCAN_CONTINUE;

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-19 Thread Michal Hocko
On Fri 18-09-15 10:41:09, Christoph Lameter wrote:
[...]
>   if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> - if (oc->order != -1)
> - return OOM_SCAN_ABORT;
> + if (unlikely(frozen(task)))
> + __thaw_task(task);

TIF_MEMDIE processes will get thawed automatically and then cannot be
frozen again. Have a look at mark_oom_victim.

>   }
>   if (!task->mm)
>   return OOM_SCAN_CONTINUE;

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-19 Thread Michal Hocko
On Thu 17-09-15 13:59:43, Kyle Walker wrote:
> Currently, the oom killer will attempt to kill a process that is in
> TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> period of time, such as processes writing to a frozen filesystem during
> a lengthy backup operation, this can result in a deadlock condition as
> related processes memory access will stall within the page fault
> handler.

I am not familiar with the fs freezing code so I might be missing
something important here. __sb_start_write waits for the frozen fs by
wait_event which is really UN sleep. Why cannot we sleep here in IN
sleep and return with EINTR when interrupted? I would consider this
a better behavior not only because of OOM because having unkillable
tasks in general is undesirable. AFAIU the fs might be frozen for ever
and admin cannot do anything about the pending processes.

> Within oom_unkillable_task(), check for processes in
> TASK_UNINTERRUPTIBLE (TASK_KILLABLE omitted). The oom killer will
> move on to another task.

Nack to this. TASK_UNINTERRUPTIBLE should be time constrained/bounded
state. Using it as an oom victim criteria makes the victim selection
less deterministic which is undesirable. As much as I am aware of
potential issues with the current implementation, making the behavior
more random doesn't really help.

> Signed-off-by: Kyle Walker 
> ---
>  mm/oom_kill.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ecc0bc..66f03f8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -131,6 +131,10 @@ static bool oom_unkillable_task(struct task_struct *p,
>   if (memcg && !task_in_mem_cgroup(p, memcg))
>   return true;
>  
> + /* Uninterruptible tasks should not be killed unless in TASK_WAKEKILL */
> + if (p->state == TASK_UNINTERRUPTIBLE)
> + return true;
> +
>   /* p may not have freeable memory in nodemask */
>   if (!has_intersects_mems_allowed(p, nodemask))
>   return true;
> -- 
> 2.4.3

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-19 Thread Oleg Nesterov
On 09/19, Michal Hocko wrote:
>
> This has been posted in various forms many times over past years. I
> still do not think this is a right approach of dealing with the problem.

Agreed. But still I think it makes sense to try to kill another task
if the victim refuse to die. Yes, the details are not clear to me.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-19 Thread Tetsuo Handa
Michal Hocko wrote:
> This has been posted in various forms many times over past years. I
> still do not think this is a right approach of dealing with the problem.

I do not think "GFP_NOFS can fail" patch is a right approach because
that patch easily causes messages like below.

  Buffer I/O error on dev sda1, logical block 34661831, lost async page write
  XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
  XFS: possible memory allocation deadlock in xfs_buf_allocate_memory 
(mode:0x250)
  XFS: possible memory allocation deadlock in kmem_zone_alloc (mode:0x8250)

Adding __GFP_NOFAIL will hide these messages but OOM stall remains anyway.

I believe choosing more OOM victims is the only way which can solve OOM stalls.

> You can quickly deplete memory reserves this way without making further
> progress (I am afraid you can even trigger this from userspace without
> having big privileges) so even administrator will have no way to
> intervene.

I think that use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying
cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the
OOM victim task as soon as possible, but it turned out that it will not
work if there is invisible lock dependency. Therefore, why not to give up
"there should be only up to 1 TIF_MEMDIE task" rule?

What this patch (and many others posted in various forms many times over
past years) does is to give up "there should be only up to 1 TIF_MEMDIE
task" rule. I think that we need to tolerate more than 1 TIF_MEMDIE tasks
and somehow manage in a way memory reserves will not deplete.

In my proposal which favors all fatal_signal_pending() tasks evenly
( 
http://lkml.kernel.org/r/201509102318.ghg18789.ohmslfjoqfo...@i-love.sakura.ne.jp
 )
suggests that the OOM victim task unlikely needs all of memory reserves.
In other words, the OOM victim task can likely make forward progress
if some amount of memory reserves are allowed (compared to normal tasks
waiting for memory).

So, I think that getting rid of "ALLOC_NO_WATERMARKS via TIF_MEMDIE" rule
and replace test_thread_flag(TIF_MEMDIE) with fatal_signal_pending(current)
will handle many cases if fatal_signal_pending() tasks are allowed to access
some amount of memory reserves. And my proposal which chooses next OOM
victim upon timeout will handle the remaining cases without depleting
memory reserves.

If you still want to keep "there should be only up to 1 TIF_MEMDIE task"
rule, what alternative do you have? (I do not like panic_on_oom_timeout
because it is more data-lossy approach than choosing next OOM victim.)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-19 Thread Michal Hocko
On Sat 19-09-15 23:33:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > This has been posted in various forms many times over past years. I
> > still do not think this is a right approach of dealing with the problem.
> 
> I do not think "GFP_NOFS can fail" patch is a right approach because
> that patch easily causes messages like below.
> 
>   Buffer I/O error on dev sda1, logical block 34661831, lost async page write
>   XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
>   XFS: possible memory allocation deadlock in xfs_buf_allocate_memory 
> (mode:0x250)
>   XFS: possible memory allocation deadlock in kmem_zone_alloc (mode:0x8250)

These messages just tell you that the allocation fails repeatedly. Have
a look and check the code. They are basically opencoded NOFAIL
allocations. They haven't been converted to actually tell the MM layer
that they cannot fail because Dave said they have a long term plan to
change this code and basically implement different failing strategies.

> Adding __GFP_NOFAIL will hide these messages but OOM stall remains anyway.
> 
> I believe choosing more OOM victims is the only way which can solve OOM 
> stalls.

I am very well aware of your position and all the attempts to tweak
different code paths to actually pass your corner case. I, however, care
for the longer term goals more. And I believe that the page allocator
and the reclaim should strive for being less deadlock prone in the
first place.  That includes a more natural semantic and non-failing
default semantic is really error prone IMHO. We have been through this
discussion many times already and I've tried to express this is a long
term goal with incremental steps.
I really hate to do "easy" things now just to feel better about
particular case which will kick us back little bit later. And from my
own experience I can tell you that a more non-deterministic OOM behavior
is thing people complain about.

> > You can quickly deplete memory reserves this way without making further
> > progress (I am afraid you can even trigger this from userspace without
> > having big privileges) so even administrator will have no way to
> > intervene.
> 
> I think that use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying
> cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the
> OOM victim task as soon as possible, but it turned out that it will not
> work if there is invisible lock dependency.

Of course. This is a heurstic and as such it cannot ever work in 100%
situations. And it is not the first heuristic we have for the OOM
killer. The last time this has been all rewritten was because the OOM
killer was too unreliable/non-deterministic. Reports have decreased
considerable since then.

> Therefore, why not to give up
> "there should be only up to 1 TIF_MEMDIE task" rule?

This has been explained several times. There is no guaranteed this would
help and _your_ own usecase shows how you can end up with such a long
lock dependency chains that you can easily eat up the whole memory
reserves before you can make any progress.

I do agree that a hand break mechanism is really desirable for those who
really care.

> What this patch (and many others posted in various forms many times over
> past years) does is to give up "there should be only up to 1 TIF_MEMDIE
> task" rule. I think that we need to tolerate more than 1 TIF_MEMDIE tasks
> and somehow manage in a way memory reserves will not deplete.

But those two goes against each other.

[...]

> If you still want to keep "there should be only up to 1 TIF_MEMDIE task"
> rule, what alternative do you have? (I do not like panic_on_oom_timeout
> because it is more data-lossy approach than choosing next OOM victim.)

I am not married to 1 TIF_MEMDIE task thing. I just think that there is
still a lot of room for other improvements. The original issue which
triggered this discussion again is a good example. I completely miss why
a writer has to be unkillable when the fs is frozen. There are others
which are more complicated of course. Including the whole class
represented by GFP_NOFS allocations as you have noted. But we still have
a room for improvements even in the reclaim. It has been suggested quite
some time ago that the memory mapped by the OOM victim might be
unmapped. Basically what Oleg is proposing in other email. I didn't get
to read his email yet properly but that should certainly help to reduce
the problem space.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Christoph Lameter
On Fri, 18 Sep 2015, Kyle Walker wrote:

> I do like the idea of not stalling completely in an oom just because the
> first attempt didn't go so well. Is there any possibility of simply having
> our cake and eating it too? Specifically, omitting TASK_UNINTERRUPTIBLE
> tasks
> as low-hanging fruit and allowing the oom to continue in the event that the
> first attempt stalls?

TASK_UNINTERRUPTIBLE tasks should not be sleeping that long and they
*should react* in a reasonable timeframe. There is an alternative API for
those cases that cannot. Typically this is a write that is stalling. If we
kill the process then its pointless to wait on the write to complete. See

https://lwn.net/Articles/288056/

http://www.ibm.com/developerworks/library/l-task-killable/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Christoph Lameter
On Fri, 18 Sep 2015, Oleg Nesterov wrote:

> And btw. Yes, this is a bit off-topic, but I think another change make
> sense too. We should report the fact we are going to kill another task
> because the previous victim refuse to die, and print its stack trace.

What happens is that the previous victim did not enter exit processing. If
it would then it would be excluded by other checks. The first victim never
reacted and never started using the memory resources available for
exiting. Thats why I thought it maybe safe to go this way.

An issue could result from another process being terminated and the first
victim finally reacting to the signal and also beginning termination. Then
we have contention on the reserves.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Oleg Nesterov
On 09/18, Christoph Lameter wrote:
>
> --- linux.orig/mm/oom_kill.c  2015-09-18 11:58:52.963946782 -0500
> +++ linux/mm/oom_kill.c   2015-09-18 11:59:42.010684778 -0500
> @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
>* This task already has access to memory reserves and is being killed.
>* Don't allow any other task to have access to the reserves.
>*/
> - if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> - if (oc->order != -1)
> - return OOM_SCAN_ABORT;
> - }
> + if (test_tsk_thread_flag(task, TIF_MEMDIE))
> + return OOM_SCAN_CONTINUE;
> +

Well, I can't really comment. Hopefully we will see more comments from
those who understand oom-killer.

But I still think this is not enough, and we need some (configurable?)
timeout before we pick another victim...


And btw. Yes, this is a bit off-topic, but I think another change make
sense too. We should report the fact we are going to kill another task
because the previous victim refuse to die, and print its stack trace.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Christoph Lameter
On Fri, 18 Sep 2015, Oleg Nesterov wrote:

> To simplify the discussion lets ignore PF_FROZEN, this is another issue.

Ok.

Subject: Allow multiple kills from the OOM killer

The OOM killer currently aborts if it finds a process that already is having
access to the reserve memory pool for exit processing. This is done so that
the reserves are not overcommitted but on the other hand this also allows
only one process being oom killed at the time. That process may be stuck
in D state.

Signed-off-by: Christoph Lameter 

Index: linux/mm/oom_kill.c
===
--- linux.orig/mm/oom_kill.c2015-09-18 11:58:52.963946782 -0500
+++ linux/mm/oom_kill.c 2015-09-18 11:59:42.010684778 -0500
@@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
 * This task already has access to memory reserves and is being killed.
 * Don't allow any other task to have access to the reserves.
 */
-   if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-   if (oc->order != -1)
-   return OOM_SCAN_ABORT;
-   }
+   if (test_tsk_thread_flag(task, TIF_MEMDIE))
+   return OOM_SCAN_CONTINUE;
+
if (!task->mm)
return OOM_SCAN_CONTINUE;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Oleg Nesterov
On 09/19, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > To simplify the discussion lets ignore PF_FROZEN, this is another issue.
> >
> > I am not sure this change is enough, we need to ensure that
> > select_bad_process() won't pick the same task (or its sub-thread) again.
>
> SysRq-f is sometimes unusable because it continues choosing the same thread.
> oom_kill_process() should not choose a thread which already has TIF_MEMDIE.

So I was right, this is really not enough...

> I think we need to rewrite oom_kill_process().

Heh. I can only ack the intent and wish you good luck ;)

> > And perhaps something like
> >
> > wait_event_timeout(oom_victims_wait, !oom_victims,
> > configurable_timeout);
> >
> > before select_bad_process() makes sense?
>
> I think you should not sleep for long with oom_lock mutex held.
> http://marc.info/?l=linux-mm=143031212312459

Yes, yes, sure, I didn't mean we should wait under oom_lock.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Tetsuo Handa
Oleg Nesterov wrote:
> To simplify the discussion lets ignore PF_FROZEN, this is another issue.
> 
> I am not sure this change is enough, we need to ensure that
> select_bad_process() won't pick the same task (or its sub-thread) again.

SysRq-f is sometimes unusable because it continues choosing the same thread.
oom_kill_process() should not choose a thread which already has TIF_MEMDIE.
I think we need to rewrite oom_kill_process().

> 
> And perhaps something like
> 
>   wait_event_timeout(oom_victims_wait, !oom_victims,
>   configurable_timeout);
> 
> before select_bad_process() makes sense?

I think you should not sleep for long with oom_lock mutex held.
http://marc.info/?l=linux-mm=143031212312459
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Oleg Nesterov
On 09/18, Christoph Lameter wrote:
>
> > But yes, such a deadlock is possible. I would really like to see the 
> > comments
> > from maintainers. In particular, I seem to recall that someone suggested to
> > try to kill another !TIF_MEMDIE process after timeout, perhaps this is what
> > we should actually do...
>
> Well yes here is a patch that kills another memdie process but there is
> some risk with such an approach of overusing the reserves.

Yes, I understand it is not that simple. And probably this is all I can
understand ;)

> --- linux.orig/mm/oom_kill.c  2015-09-18 10:38:29.601963726 -0500
> +++ linux/mm/oom_kill.c   2015-09-18 10:39:55.911699017 -0500
> @@ -265,8 +265,8 @@ enum oom_scan_t oom_scan_process_thread(
>* Don't allow any other task to have access to the reserves.
>*/
>   if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> - if (oc->order != -1)
> - return OOM_SCAN_ABORT;
> + if (unlikely(frozen(task)))
> + __thaw_task(task);

To simplify the discussion lets ignore PF_FROZEN, this is another issue.

I am not sure this change is enough, we need to ensure that
select_bad_process() won't pick the same task (or its sub-thread) again.

And perhaps something like

wait_event_timeout(oom_victims_wait, !oom_victims,
configurable_timeout);

before select_bad_process() makes sense?

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Christoph Lameter
> But yes, such a deadlock is possible. I would really like to see the comments
> from maintainers. In particular, I seem to recall that someone suggested to
> try to kill another !TIF_MEMDIE process after timeout, perhaps this is what
> we should actually do...

Well yes here is a patch that kills another memdie process but there is
some risk with such an approach of overusing the reserves.


Subject: Allow multiple kills from the OOM killer

The OOM killer currently aborts if it finds a process that already is having
access to the reserve memory pool for exit processing. This is done so that
the reserves are not overcommitted but on the other hand this also allows
only one process being oom killed at the time. That process may be stuck
in D state.

The patch simply removes the aborting of the scan so that other processes
may be killed if one is stuck in D state.

Signed-off-by: Christoph Lameter 

Index: linux/mm/oom_kill.c
===
--- linux.orig/mm/oom_kill.c2015-09-18 10:38:29.601963726 -0500
+++ linux/mm/oom_kill.c 2015-09-18 10:39:55.911699017 -0500
@@ -265,8 +265,8 @@ enum oom_scan_t oom_scan_process_thread(
 * Don't allow any other task to have access to the reserves.
 */
if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-   if (oc->order != -1)
-   return OOM_SCAN_ABORT;
+   if (unlikely(frozen(task)))
+   __thaw_task(task);
}
if (!task->mm)
return OOM_SCAN_CONTINUE;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Oleg Nesterov
On 09/18, Christoph Lameter wrote:
>
> --- linux.orig/mm/oom_kill.c  2015-09-18 11:58:52.963946782 -0500
> +++ linux/mm/oom_kill.c   2015-09-18 11:59:42.010684778 -0500
> @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
>* This task already has access to memory reserves and is being killed.
>* Don't allow any other task to have access to the reserves.
>*/
> - if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> - if (oc->order != -1)
> - return OOM_SCAN_ABORT;
> - }
> + if (test_tsk_thread_flag(task, TIF_MEMDIE))
> + return OOM_SCAN_CONTINUE;
> +

Well, I can't really comment. Hopefully we will see more comments from
those who understand oom-killer.

But I still think this is not enough, and we need some (configurable?)
timeout before we pick another victim...


And btw. Yes, this is a bit off-topic, but I think another change make
sense too. We should report the fact we are going to kill another task
because the previous victim refuse to die, and print its stack trace.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Christoph Lameter
On Fri, 18 Sep 2015, Oleg Nesterov wrote:

> And btw. Yes, this is a bit off-topic, but I think another change make
> sense too. We should report the fact we are going to kill another task
> because the previous victim refuse to die, and print its stack trace.

What happens is that the previous victim did not enter exit processing. If
it would then it would be excluded by other checks. The first victim never
reacted and never started using the memory resources available for
exiting. Thats why I thought it maybe safe to go this way.

An issue could result from another process being terminated and the first
victim finally reacting to the signal and also beginning termination. Then
we have contention on the reserves.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Christoph Lameter
On Fri, 18 Sep 2015, Kyle Walker wrote:

> I do like the idea of not stalling completely in an oom just because the
> first attempt didn't go so well. Is there any possibility of simply having
> our cake and eating it too? Specifically, omitting TASK_UNINTERRUPTIBLE
> tasks
> as low-hanging fruit and allowing the oom to continue in the event that the
> first attempt stalls?

TASK_UNINTERRUPTIBLE tasks should not be sleeping that long and they
*should react* in a reasonable timeframe. There is an alternative API for
those cases that cannot. Typically this is a write that is stalling. If we
kill the process then its pointless to wait on the write to complete. See

https://lwn.net/Articles/288056/

http://www.ibm.com/developerworks/library/l-task-killable/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Tetsuo Handa
Oleg Nesterov wrote:
> To simplify the discussion lets ignore PF_FROZEN, this is another issue.
> 
> I am not sure this change is enough, we need to ensure that
> select_bad_process() won't pick the same task (or its sub-thread) again.

SysRq-f is sometimes unusable because it continues choosing the same thread.
oom_kill_process() should not choose a thread which already has TIF_MEMDIE.
I think we need to rewrite oom_kill_process().

> 
> And perhaps something like
> 
>   wait_event_timeout(oom_victims_wait, !oom_victims,
>   configurable_timeout);
> 
> before select_bad_process() makes sense?

I think you should not sleep for long with oom_lock mutex held.
http://marc.info/?l=linux-mm=143031212312459
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Oleg Nesterov
On 09/19, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > To simplify the discussion lets ignore PF_FROZEN, this is another issue.
> >
> > I am not sure this change is enough, we need to ensure that
> > select_bad_process() won't pick the same task (or its sub-thread) again.
>
> SysRq-f is sometimes unusable because it continues choosing the same thread.
> oom_kill_process() should not choose a thread which already has TIF_MEMDIE.

So I was right, this is really not enough...

> I think we need to rewrite oom_kill_process().

Heh. I can only ack the intent and wish you good luck ;)

> > And perhaps something like
> >
> > wait_event_timeout(oom_victims_wait, !oom_victims,
> > configurable_timeout);
> >
> > before select_bad_process() makes sense?
>
> I think you should not sleep for long with oom_lock mutex held.
> http://marc.info/?l=linux-mm=143031212312459

Yes, yes, sure, I didn't mean we should wait under oom_lock.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Christoph Lameter
On Fri, 18 Sep 2015, Oleg Nesterov wrote:

> To simplify the discussion lets ignore PF_FROZEN, this is another issue.

Ok.

Subject: Allow multiple kills from the OOM killer

The OOM killer currently aborts if it finds a process that already is having
access to the reserve memory pool for exit processing. This is done so that
the reserves are not overcommitted but on the other hand this also allows
only one process being oom killed at the time. That process may be stuck
in D state.

Signed-off-by: Christoph Lameter 

Index: linux/mm/oom_kill.c
===
--- linux.orig/mm/oom_kill.c2015-09-18 11:58:52.963946782 -0500
+++ linux/mm/oom_kill.c 2015-09-18 11:59:42.010684778 -0500
@@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
 * This task already has access to memory reserves and is being killed.
 * Don't allow any other task to have access to the reserves.
 */
-   if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-   if (oc->order != -1)
-   return OOM_SCAN_ABORT;
-   }
+   if (test_tsk_thread_flag(task, TIF_MEMDIE))
+   return OOM_SCAN_CONTINUE;
+
if (!task->mm)
return OOM_SCAN_CONTINUE;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Oleg Nesterov
On 09/18, Christoph Lameter wrote:
>
> > But yes, such a deadlock is possible. I would really like to see the 
> > comments
> > from maintainers. In particular, I seem to recall that someone suggested to
> > try to kill another !TIF_MEMDIE process after timeout, perhaps this is what
> > we should actually do...
>
> Well yes here is a patch that kills another memdie process but there is
> some risk with such an approach of overusing the reserves.

Yes, I understand it is not that simple. And probably this is all I can
understand ;)

> --- linux.orig/mm/oom_kill.c  2015-09-18 10:38:29.601963726 -0500
> +++ linux/mm/oom_kill.c   2015-09-18 10:39:55.911699017 -0500
> @@ -265,8 +265,8 @@ enum oom_scan_t oom_scan_process_thread(
>* Don't allow any other task to have access to the reserves.
>*/
>   if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> - if (oc->order != -1)
> - return OOM_SCAN_ABORT;
> + if (unlikely(frozen(task)))
> + __thaw_task(task);

To simplify the discussion lets ignore PF_FROZEN, this is another issue.

I am not sure this change is enough, we need to ensure that
select_bad_process() won't pick the same task (or its sub-thread) again.

And perhaps something like

wait_event_timeout(oom_victims_wait, !oom_victims,
configurable_timeout);

before select_bad_process() makes sense?

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-18 Thread Christoph Lameter
> But yes, such a deadlock is possible. I would really like to see the comments
> from maintainers. In particular, I seem to recall that someone suggested to
> try to kill another !TIF_MEMDIE process after timeout, perhaps this is what
> we should actually do...

Well yes here is a patch that kills another memdie process but there is
some risk with such an approach of overusing the reserves.


Subject: Allow multiple kills from the OOM killer

The OOM killer currently aborts if it finds a process that already is having
access to the reserve memory pool for exit processing. This is done so that
the reserves are not overcommitted but on the other hand this also allows
only one process being oom killed at the time. That process may be stuck
in D state.

The patch simply removes the aborting of the scan so that other processes
may be killed if one is stuck in D state.

Signed-off-by: Christoph Lameter 

Index: linux/mm/oom_kill.c
===
--- linux.orig/mm/oom_kill.c2015-09-18 10:38:29.601963726 -0500
+++ linux/mm/oom_kill.c 2015-09-18 10:39:55.911699017 -0500
@@ -265,8 +265,8 @@ enum oom_scan_t oom_scan_process_thread(
 * Don't allow any other task to have access to the reserves.
 */
if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-   if (oc->order != -1)
-   return OOM_SCAN_ABORT;
+   if (unlikely(frozen(task)))
+   __thaw_task(task);
}
if (!task->mm)
return OOM_SCAN_CONTINUE;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-17 Thread Oleg Nesterov
Add cc's.

On 09/17, Kyle Walker wrote:
>
> Currently, the oom killer will attempt to kill a process that is in
> TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> period of time, such as processes writing to a frozen filesystem during
> a lengthy backup operation, this can result in a deadlock condition as
> related processes memory access will stall within the page fault
> handler.
>
> Within oom_unkillable_task(), check for processes in
> TASK_UNINTERRUPTIBLE (TASK_KILLABLE omitted). The oom killer will
> move on to another task.
>
> Signed-off-by: Kyle Walker 
> ---
>  mm/oom_kill.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ecc0bc..66f03f8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -131,6 +131,10 @@ static bool oom_unkillable_task(struct task_struct *p,
>   if (memcg && !task_in_mem_cgroup(p, memcg))
>   return true;
>
> + /* Uninterruptible tasks should not be killed unless in TASK_WAKEKILL */
> + if (p->state == TASK_UNINTERRUPTIBLE)
> + return true;
> +

So we can skip a memory hog which, say, does mutex_lock(). And this can't
help if this task is multithreaded, unless all its sub-threads are in "D"
state too oom killer will pick another thread with the same ->mm. Plus
other problems.

But yes, such a deadlock is possible. I would really like to see the comments
from maintainers. In particular, I seem to recall that someone suggested to
try to kill another !TIF_MEMDIE process after timeout, perhaps this is what
we should actually do...

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

2015-09-17 Thread Oleg Nesterov
Add cc's.

On 09/17, Kyle Walker wrote:
>
> Currently, the oom killer will attempt to kill a process that is in
> TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> period of time, such as processes writing to a frozen filesystem during
> a lengthy backup operation, this can result in a deadlock condition as
> related processes memory access will stall within the page fault
> handler.
>
> Within oom_unkillable_task(), check for processes in
> TASK_UNINTERRUPTIBLE (TASK_KILLABLE omitted). The oom killer will
> move on to another task.
>
> Signed-off-by: Kyle Walker 
> ---
>  mm/oom_kill.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ecc0bc..66f03f8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -131,6 +131,10 @@ static bool oom_unkillable_task(struct task_struct *p,
>   if (memcg && !task_in_mem_cgroup(p, memcg))
>   return true;
>
> + /* Uninterruptible tasks should not be killed unless in TASK_WAKEKILL */
> + if (p->state == TASK_UNINTERRUPTIBLE)
> + return true;
> +

So we can skip a memory hog which, say, does mutex_lock(). And this can't
help if this task is multithreaded, unless all its sub-threads are in "D"
state too oom killer will pick another thread with the same ->mm. Plus
other problems.

But yes, such a deadlock is possible. I would really like to see the comments
from maintainers. In particular, I seem to recall that someone suggested to
try to kill another !TIF_MEMDIE process after timeout, perhaps this is what
we should actually do...

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/