Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
Kyle Walker wrote:
> I agree, in lieu of treating TASK_UNINTERRUPTIBLE tasks as unkillable,
> and omitting them from the oom selection process, continuing the
> carnage is likely to result in more unpredictable results. At this
> time, I believe Oleg's solution of zapping the process memory use
> while it sleeps with the fatal signal en route is ideal.

I cannot help thinking about the worst case.

(1) If the memory zapping code successfully reclaimed some memory from the
mm struct used by the OOM victim, what guarantees that the reclaimed
memory is used by OOM victims (and by processes which are blocking OOM
victims)?

David's "global access to memory reserves" allows a local unprivileged
user to deplete memory reserves; it could allow that user to deplete the
reclaimed memory as well. I think that my "Favor kthread and dying threads
over normal threads" patch
( http://lkml.kernel.org/r/1442939668-4421-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp )
would allow the reclaimed memory to be used by OOM victims and kernel
threads, provided the reclaimed memory is added to the free list bit by
bit so that the watermark remains low enough to prevent normal threads
from allocating the reclaimed memory. But my patch still fails if normal
threads are blocking the OOM victims, or if unrelated kernel threads
consume the reclaimed memory.

(2) If the memory zapping code failed to reclaim enough memory from the mm
struct used by the OOM victim, what mechanism can solve the OOM stalls?

Some administrators set /proc/pid/oom_score_adj to -1000 for most of
their enterprise processes (e.g. java), and as a consequence only trivial
processes (e.g. grep / sed) are candidates for OOM victims. Moreover, a
local unprivileged user can easily fool the OOM killer using decoy tasks
(which consume little memory and whose /proc/pid/oom_score_adj is set to
999).

(3) If the memory zapping code reclaimed no memory due to ->mmap_sem
contention, what mechanism can solve the OOM stalls?

While we don't allocate much memory with ->mmap_sem held for writing, the
task which is holding ->mmap_sem for writing can be chosen as one of the
OOM victims. If such a task receives SIGKILL but TIF_MEMDIE is not set, it
can form an OOM livelock unless all memory allocations done with
->mmap_sem held for writing are __GFP_FS allocations and that task can
reach out_of_memory() (i.e. it is not blocked by unexpected factors such
as waiting for the filesystem's writeback).

After all, I think we have to consider what to do if the memory zapping
code fails.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Tue, Sep 22, 2015 at 7:32 PM, David Rientjes wrote:
>
> I struggle to understand how the approach of randomly continuing to kill
> more and more processes in the hope that it slows down usage of memory
> reserves or that we get lucky is better.

Thank you to one and all for the feedback. I agree: in lieu of treating
TASK_UNINTERRUPTIBLE tasks as unkillable and omitting them from the oom
selection process, continuing the carnage is likely to result in more
unpredictable results. At this time, I believe Oleg's solution of zapping
the process memory use while it sleeps with the fatal signal en route is
ideal.

Kyle Walker
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Tue, 22 Sep 2015, Tetsuo Handa wrote:
> David Rientjes wrote:
> > Your proposal, which I mostly agree with, tries to kill additional
> > processes so that they allocate and drop the lock that the original victim
> > depends on. My approach, from
> > http://marc.info/?l=linux-kernel=144010444913702, is the same, but
> > without the killing. It's unnecessary to kill every process on the system
> > that is depending on the same lock, and we can't know which processes are
> > stalling on that lock and which are not.
>
> Would you try your approach with the below program?
> (My reproducers are tested on XFS on a VM with 4 CPUs / 2048MB RAM.)
>
> -- oom-depleter3.c start --
> #define _GNU_SOURCE
> #include <sched.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/types.h>
>
> static int zero_fd = EOF;
> static char *buf = NULL;
> static unsigned long size = 0;
>
> static int dummy(void *unused)
> {
> 	static char buffer[4096] = { };
> 	int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
> 	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer) &&
> 	       fsync(fd) == 0);
> 	return 0;
> }
>
> static int trigger(void *unused)
> {
> 	read(zero_fd, buf, size); /* Will cause OOM due to overcommit */
> 	return 0;
> }
>
> int main(int argc, char *argv[])
> {
> 	unsigned long i;
> 	zero_fd = open("/dev/zero", O_RDONLY);
> 	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
> 		char *cp = realloc(buf, size);
> 		if (!cp) {
> 			size >>= 1;
> 			break;
> 		}
> 		buf = cp;
> 	}
> 	/*
> 	 * Create many child threads in order to enlarge the time lag between
> 	 * the OOM killer setting TIF_MEMDIE on the thread group leader and
> 	 * the OOM killer sending SIGKILL to that thread.
> 	 */
> 	for (i = 0; i < 1000; i++) {
> 		clone(dummy, malloc(1024) + 1024, CLONE_SIGHAND | CLONE_VM,
> 		      NULL);
> 	}
> 	/* Let a child thread trigger the OOM killer. */
> 	clone(trigger, malloc(4096) + 4096, CLONE_SIGHAND | CLONE_VM, NULL);
> 	/* Deplete all memory reserves using the time lag. */
> 	for (i = size; i; i -= 4096)
> 		buf[i - 1] = 1;
> 	return * (char *) NULL; /* Kill all threads. */
> }
> -- oom-depleter3.c end --
>
> uptime > 350 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-1.txt.xz
> shows that the memory reserves were completely depleted, and
> uptime > 42 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-2.txt.xz
> shows that the memory reserves were not used at all.
> Is this result what you expected?

What are the results when the kernel isn't patched at all?

The trade-off being made is that we want to attempt to make forward
progress when there is an excessive stall in an oom victim making its
exit, rather than livelock the system forever waiting for memory that can
never be allocated. I struggle to understand how the approach of randomly
continuing to kill more and more processes, in the hope that it slows down
usage of memory reserves or that we get lucky, is better.
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
David Rientjes wrote:
> Your proposal, which I mostly agree with, tries to kill additional
> processes so that they allocate and drop the lock that the original victim
> depends on. My approach, from
> http://marc.info/?l=linux-kernel=144010444913702, is the same, but
> without the killing. It's unnecessary to kill every process on the system
> that is depending on the same lock, and we can't know which processes are
> stalling on that lock and which are not.

Would you try your approach with the below program?
(My reproducers are tested on XFS on a VM with 4 CPUs / 2048MB RAM.)

-- oom-depleter3.c start --
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>

static int zero_fd = EOF;
static char *buf = NULL;
static unsigned long size = 0;

static int dummy(void *unused)
{
	static char buffer[4096] = { };
	int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer) &&
	       fsync(fd) == 0);
	return 0;
}

static int trigger(void *unused)
{
	read(zero_fd, buf, size); /* Will cause OOM due to overcommit */
	return 0;
}

int main(int argc, char *argv[])
{
	unsigned long i;
	zero_fd = open("/dev/zero", O_RDONLY);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	/*
	 * Create many child threads in order to enlarge the time lag between
	 * the OOM killer setting TIF_MEMDIE on the thread group leader and
	 * the OOM killer sending SIGKILL to that thread.
	 */
	for (i = 0; i < 1000; i++) {
		clone(dummy, malloc(1024) + 1024, CLONE_SIGHAND | CLONE_VM,
		      NULL);
	}
	/* Let a child thread trigger the OOM killer. */
	clone(trigger, malloc(4096) + 4096, CLONE_SIGHAND | CLONE_VM, NULL);
	/* Deplete all memory reserves using the time lag. */
	for (i = size; i; i -= 4096)
		buf[i - 1] = 1;
	return * (char *) NULL; /* Kill all threads. */
}
-- oom-depleter3.c end --

uptime > 350 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-1.txt.xz
shows that the memory reserves were completely depleted, and
uptime > 42 of http://I-love.SAKURA.ne.jp/tmp/serial-20150922-2.txt.xz
shows that the memory reserves were not used at all.
Is this result what you expected?
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Sat, 19 Sep 2015, Tetsuo Handa wrote:
> I think that use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying
> cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the
> OOM victim task as soon as possible, but it turned out that it will not
> work if there is an invisible lock dependency. Therefore, why not give up
> the "there should be only up to 1 TIF_MEMDIE task" rule?
>
I don't see the connection between TIF_MEMDIE and ALLOC_NO_WATERMARKS
being problematic. It is simply the mechanism by which we give oom-killed
processes access to memory reserves if they need it. I believe you are
referring only to the oom killer stalling when it finds an oom victim.

> What this patch (and many others posted in various forms many times over
> past years) does is give up the "there should be only up to 1 TIF_MEMDIE
> task" rule. I think that we need to tolerate more than 1 TIF_MEMDIE task
> and somehow manage in a way that memory reserves will not deplete.
>
Your proposal, which I mostly agree with, tries to kill additional
processes so that they allocate and drop the lock that the original victim
depends on. My approach, from
http://marc.info/?l=linux-kernel=144010444913702, is the same, but without
the killing. It's unnecessary to kill every process on the system that is
depending on the same lock, and we can't know which processes are stalling
on that lock and which are not.

I think it's much easier to simply identify the situation where a process
has not exited in a timely manner, and then provide processes access to
memory reserves without being killed. We hope that the victim will have
queued its mutex_lock() and that allocators which are holding the lock
will drop it after successfully utilizing memory reserves.

We can mitigate immediate depletion of memory reserves by requiring all
allocators to reclaim (or compact), and to call the oom killer to identify
the timeout, before granting access to memory reserves for a single
allocation, before schedule_timeout_killable(1) and returning. I don't
know of any alternative solutions where we can guarantee that memory
reserves cannot be depleted, unless memory reserves are 100% of memory.
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Fri, 18 Sep 2015, Christoph Lameter wrote:
> Subject: Allow multiple kills from the OOM killer
>
> The OOM killer currently aborts if it finds a process that already has
> access to the reserve memory pool for exit processing. This is done so
> that the reserves are not overcommitted, but on the other hand this also
> allows only one process to be oom killed at a time. That process may be
> stuck in D state.
>
> Signed-off-by: Christoph Lameter
>
> Index: linux/mm/oom_kill.c
> ===================================================================
> --- linux.orig/mm/oom_kill.c	2015-09-18 11:58:52.963946782 -0500
> +++ linux/mm/oom_kill.c	2015-09-18 11:59:42.010684778 -0500
> @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
>  	 * This task already has access to memory reserves and is being killed.
>  	 * Don't allow any other task to have access to the reserves.
>  	 */
> -	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> -		if (oc->order != -1)
> -			return OOM_SCAN_ABORT;
> -	}
> +	if (test_tsk_thread_flag(task, TIF_MEMDIE))
> +		return OOM_SCAN_CONTINUE;
> +
>  	if (!task->mm)
>  		return OOM_SCAN_CONTINUE;
>
If this would result in the newly chosen process being guaranteed to exit,
this would be fine. Unfortunately, no such guarantee is possible. If a
thread is holding a contended mutex that the victim(s) require, this
serial oom killer could eventually panic the system if that thread is
OOM_DISABLE.

The solution that we have merged internally is described at
http://marc.info/?l=linux-kernel=144010444913702 -- we provide access to
memory reserves to processes that find a stalled exit in the oom killer so
that they may allocate. It comes along with a test module that takes a
contended mutex and ensures that forward progress is made as long as
memory reserves are not depleted. We can't actually guarantee that memory
reserves won't be depleted, but we (1) hope that nobody is actually
allocating a lot of memory before dropping a mutex and (2) want to avoid
the alternative, which is a system livelock.

This will address situations such as

	allocator			oom victim
	---------			----------
	mutex_lock(lock)
	alloc_pages(GFP_KERNEL)
					mutex_lock(lock)
	mutex_unlock(lock)
					handle SIGKILL

since this otherwise results in a livelock without a solution such as
mine: the GFP_KERNEL allocation stalls forever waiting for the oom victim
to acquire the mutex and exit. This also works if the allocator is
OOM_DISABLE. This won't handle other situations where the victim gets
wedged in D state and is not allocating memory, but that is by far the
less common occurrence than the one we have dealt with.
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Sat, 19 Sep 2015, Michal Hocko wrote:
> Nack to this. TASK_UNINTERRUPTIBLE should be a time-constrained/bounded
> state. Using it as an oom victim criterion makes the victim selection
> less deterministic, which is undesirable. As much as I am aware of
> potential issues with the current implementation, making the behavior
> more random doesn't really help.
>
Agreed. We can't avoid killing a process simply because it is in D state;
this isn't an indication that the process will not be able to exit, and in
the worst case we could panic the system if all other processes cannot be
oom killed.
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Sat 19-09-15 23:33:07, Tetsuo Handa wrote: > Michal Hocko wrote: > > This has been posted in various forms many times over past years. I > > still do not think this is a right approach of dealing with the problem. > > I do not think "GFP_NOFS can fail" patch is a right approach because > that patch easily causes messages like below. > > Buffer I/O error on dev sda1, logical block 34661831, lost async page write > XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250) > XFS: possible memory allocation deadlock in xfs_buf_allocate_memory > (mode:0x250) > XFS: possible memory allocation deadlock in kmem_zone_alloc (mode:0x8250) These messages just tell you that the allocation fails repeatedly. Have a look and check the code. They are basically opencoded NOFAIL allocations. They haven't been converted to actually tell the MM layer that they cannot fail because Dave said they have a long term plan to change this code and basically implement different failing strategies. > Adding __GFP_NOFAIL will hide these messages but OOM stall remains anyway. > > I believe choosing more OOM victims is the only way which can solve OOM > stalls. I am very well aware of your position and all the attempts to tweak different code paths to actually pass your corner case. I, however, care for the longer term goals more. And I believe that the page allocator and the reclaim should strive for being less deadlock prone in the first place. That includes a more natural semantic and non-failing default semantic is really error prone IMHO. We have been through this discussion many times already and I've tried to express this is a long term goal with incremental steps. I really hate to do "easy" things now just to feel better about particular case which will kick us back little bit later. And from my own experience I can tell you that a more non-deterministic OOM behavior is thing people complain about. 
> > You can quickly deplete memory reserves this way without making further
> > progress (I am afraid you can even trigger this from userspace without
> > having big privileges) so even administrator will have no way to
> > intervene.
>
> I think that use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying
> cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the
> OOM victim task as soon as possible, but it turned out that it will not
> work if there is an invisible lock dependency.

Of course. This is a heuristic and as such it cannot ever work in 100% of
situations. And it is not the first heuristic we have for the OOM killer.
The last time this has all been rewritten was because the OOM killer was
too unreliable/non-deterministic. Reports have decreased considerably
since then.

> Therefore, why not give up the
> "there should be only up to 1 TIF_MEMDIE task" rule?

This has been explained several times. There is no guarantee this would
help, and _your_ own usecase shows how you can end up with such long lock
dependency chains that you can easily eat up the whole memory reserves
before you can make any progress. I do agree that a hand brake mechanism
is really desirable for those who really care.

> What this patch (and many others posted in various forms many times over
> past years) does is to give up "there should be only up to 1 TIF_MEMDIE
> task" rule. I think that we need to tolerate more than 1 TIF_MEMDIE tasks
> and somehow manage in a way memory reserves will not deplete.

But those two go against each other.

[...]

> If you still want to keep "there should be only up to 1 TIF_MEMDIE task"
> rule, what alternative do you have? (I do not like panic_on_oom_timeout
> because it is more data-lossy approach than choosing next OOM victim.)

I am not married to the 1 TIF_MEMDIE task thing. I just think that there
is still a lot of room for other improvements. The original issue which
triggered this discussion again is a good example.
I completely miss why a writer has to be unkillable when the fs is frozen.
There are others which are more complicated of course, including the whole
class represented by GFP_NOFS allocations as you have noted. But we still
have room for improvements even in the reclaim. It has been suggested
quite some time ago that the memory mapped by the OOM victim might be
unmapped; basically what Oleg is proposing in another email. I didn't get
to read his email properly yet, but that should certainly help to reduce
the problem space.
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On 09/19, Michal Hocko wrote:
>
> This has been posted in various forms many times over past years. I
> still do not think this is a right approach of dealing with the problem.

Agreed. But I still think it makes sense to try to kill another task if
the victim refuses to die. Yes, the details are not clear to me.

Oleg.
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
Michal Hocko wrote:
> This has been posted in various forms many times over past years. I
> still do not think this is a right approach of dealing with the problem.

I do not think the "GFP_NOFS can fail" patch is a right approach because
that patch easily causes messages like below.

Buffer I/O error on dev sda1, logical block 34661831, lost async page write
XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
XFS: possible memory allocation deadlock in kmem_zone_alloc (mode:0x8250)

Adding __GFP_NOFAIL will hide these messages but the OOM stall remains
anyway. I believe choosing more OOM victims is the only way which can
solve OOM stalls.

> You can quickly deplete memory reserves this way without making further
> progress (I am afraid you can even trigger this from userspace without
> having big privileges) so even administrator will have no way to
> intervene.

I think that use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying
cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the
OOM victim task as soon as possible, but it turned out that it will not
work if there is an invisible lock dependency. Therefore, why not give up
the "there should be only up to 1 TIF_MEMDIE task" rule?

What this patch (and many others posted in various forms many times over
past years) does is to give up that rule. I think that we need to tolerate
more than 1 TIF_MEMDIE task and somehow manage in a way that memory
reserves will not deplete. My proposal which favors all
fatal_signal_pending() tasks evenly
( http://lkml.kernel.org/r/201509102318.ghg18789.ohmslfjoqfo...@i-love.sakura.ne.jp )
suggests that the OOM victim task unlikely needs all of the memory
reserves. In other words, the OOM victim task can likely make forward
progress if some amount of memory reserves is allowed (compared to normal
tasks waiting for memory).
So, I think that getting rid of the "ALLOC_NO_WATERMARKS via TIF_MEMDIE"
rule and replacing test_thread_flag(TIF_MEMDIE) with
fatal_signal_pending(current) will handle many cases, provided that
fatal_signal_pending() tasks are allowed to access some amount of memory
reserves. And my proposal which chooses the next OOM victim upon timeout
will handle the remaining cases without depleting memory reserves.

If you still want to keep the "there should be only up to 1 TIF_MEMDIE
task" rule, what alternative do you have? (I do not like
panic_on_oom_timeout because it is a more data-lossy approach than
choosing the next OOM victim.)
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Fri 18-09-15 12:00:59, Christoph Lameter wrote:
[...]
> Subject: Allow multiple kills from the OOM killer
>
> The OOM killer currently aborts if it finds a process that already has
> access to the reserve memory pool for exit processing. This is done so
> that the reserves are not overcommitted, but on the other hand this also
> allows only one process to be oom killed at a time. That process may be
> stuck in D state.

This has been posted in various forms many times over past years. I still
do not think this is a right approach of dealing with the problem. You can
quickly deplete memory reserves this way without making further progress
(I am afraid you can even trigger this from userspace without having big
privileges) so even the administrator will have no way to intervene.

> Signed-off-by: Christoph Lameter
>
> Index: linux/mm/oom_kill.c
> ===
> --- linux.orig/mm/oom_kill.c	2015-09-18 11:58:52.963946782 -0500
> +++ linux/mm/oom_kill.c	2015-09-18 11:59:42.010684778 -0500
> @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
>  	 * This task already has access to memory reserves and is being killed.
>  	 * Don't allow any other task to have access to the reserves.
>  	 */
> -	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> -		if (oc->order != -1)
> -			return OOM_SCAN_ABORT;
> -	}
> +	if (test_tsk_thread_flag(task, TIF_MEMDIE))
> +		return OOM_SCAN_CONTINUE;
> +
>  	if (!task->mm)
>  		return OOM_SCAN_CONTINUE;
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Fri 18-09-15 10:41:09, Christoph Lameter wrote:
[...]
> 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> -		if (oc->order != -1)
> -			return OOM_SCAN_ABORT;
> +		if (unlikely(frozen(task)))
> +			__thaw_task(task);

TIF_MEMDIE processes will get thawed automatically and then cannot be
frozen again. Have a look at mark_oom_victim.

> 	}
> 	if (!task->mm)
> 		return OOM_SCAN_CONTINUE;
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Thu 17-09-15 13:59:43, Kyle Walker wrote:
> Currently, the oom killer will attempt to kill a process that is in
> TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> period of time, such as processes writing to a frozen filesystem during
> a lengthy backup operation, this can result in a deadlock condition as
> related processes' memory access will stall within the page fault
> handler.

I am not familiar with the fs freezing code so I might be missing
something important here. __sb_start_write waits for the frozen fs by
wait_event, which is really an uninterruptible (UN) sleep. Why cannot we
sleep here in an interruptible (IN) sleep and return with EINTR when
interrupted? I would consider this a better behavior, not only because of
OOM, because having unkillable tasks is undesirable in general. AFAIU the
fs might be frozen forever and the admin cannot do anything about the
pending processes.

> Within oom_unkillable_task(), check for processes in
> TASK_UNINTERRUPTIBLE (TASK_KILLABLE omitted). The oom killer will
> move on to another task.

Nack to this. TASK_UNINTERRUPTIBLE should be a time constrained/bounded
state. Using it as an oom victim criterion makes the victim selection less
deterministic, which is undesirable. As much as I am aware of potential
issues with the current implementation, making the behavior more random
doesn't really help.
> Signed-off-by: Kyle Walker
> ---
>  mm/oom_kill.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ecc0bc..66f03f8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -131,6 +131,10 @@ static bool oom_unkillable_task(struct task_struct *p,
>  	if (memcg && !task_in_mem_cgroup(p, memcg))
>  		return true;
>
> +	/* Uninterruptible tasks should not be killed unless in TASK_WAKEKILL */
> +	if (p->state == TASK_UNINTERRUPTIBLE)
> +		return true;
> +
>  	/* p may not have freeable memory in nodemask */
>  	if (!has_intersects_mems_allowed(p, nodemask))
>  		return true;
> --
> 2.4.3
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Fri, 18 Sep 2015, Kyle Walker wrote:
> I do like the idea of not stalling completely in an oom just because the
> first attempt didn't go so well. Is there any possibility of simply
> having our cake and eating it too? Specifically, omitting
> TASK_UNINTERRUPTIBLE tasks as low-hanging fruit and allowing the oom to
> continue in the event that the first attempt stalls?

TASK_UNINTERRUPTIBLE tasks should not be sleeping that long and they
*should react* in a reasonable timeframe. There is an alternative API for
those cases that cannot. Typically this is a write that is stalling. If we
kill the process then it's pointless to wait on the write to complete.

See
https://lwn.net/Articles/288056/
http://www.ibm.com/developerworks/library/l-task-killable/
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Fri, 18 Sep 2015, Oleg Nesterov wrote:
> And btw. Yes, this is a bit off-topic, but I think another change makes
> sense too. We should report the fact that we are going to kill another
> task because the previous victim refuses to die, and print its stack
> trace.

What happens is that the previous victim did not enter exit processing. If
it had, it would be excluded by other checks. The first victim never
reacted and never started using the memory resources available for
exiting. That's why I thought it may be safe to go this way.

An issue could result from another process being terminated and the first
victim finally reacting to the signal and also beginning termination. Then
we have contention on the reserves.
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On 09/18, Christoph Lameter wrote:
>
> --- linux.orig/mm/oom_kill.c	2015-09-18 11:58:52.963946782 -0500
> +++ linux/mm/oom_kill.c	2015-09-18 11:59:42.010684778 -0500
> @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
>  	 * This task already has access to memory reserves and is being killed.
>  	 * Don't allow any other task to have access to the reserves.
>  	 */
> -	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> -		if (oc->order != -1)
> -			return OOM_SCAN_ABORT;
> -	}
> +	if (test_tsk_thread_flag(task, TIF_MEMDIE))
> +		return OOM_SCAN_CONTINUE;
> +

Well, I can't really comment. Hopefully we will see more comments from
those who understand the oom-killer. But I still think this is not enough,
and we need some (configurable?) timeout before we pick another victim...

And btw. Yes, this is a bit off-topic, but I think another change makes
sense too. We should report the fact that we are going to kill another
task because the previous victim refuses to die, and print its stack
trace.

Oleg.
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Fri, 18 Sep 2015, Oleg Nesterov wrote:
> To simplify the discussion lets ignore PF_FROZEN, this is another issue.

Ok.

Subject: Allow multiple kills from the OOM killer

The OOM killer currently aborts if it finds a process that already has
access to the reserve memory pool for exit processing. This is done so
that the reserves are not overcommitted, but on the other hand this also
allows only one process to be oom killed at a time. That process may be
stuck in D state.

Signed-off-by: Christoph Lameter

Index: linux/mm/oom_kill.c
===
--- linux.orig/mm/oom_kill.c	2015-09-18 11:58:52.963946782 -0500
+++ linux/mm/oom_kill.c	2015-09-18 11:59:42.010684778 -0500
@@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves.
 	 */
-	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (oc->order != -1)
-			return OOM_SCAN_ABORT;
-	}
+	if (test_tsk_thread_flag(task, TIF_MEMDIE))
+		return OOM_SCAN_CONTINUE;
+
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On 09/19, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > To simplify the discussion lets ignore PF_FROZEN, this is another issue.
> >
> > I am not sure this change is enough, we need to ensure that
> > select_bad_process() won't pick the same task (or its sub-thread) again.
>
> SysRq-f is sometimes unusable because it continues choosing the same
> thread. oom_kill_process() should not choose a thread which already has
> TIF_MEMDIE.

So I was right, this is really not enough...

> I think we need to rewrite oom_kill_process().

Heh. I can only ack the intent and wish you good luck ;)

> > And perhaps something like
> >
> > 	wait_event_timeout(oom_victims_wait, !oom_victims,
> > 			   configurable_timeout);
> >
> > before select_bad_process() makes sense?
>
> I think you should not sleep for long with oom_lock mutex held.
> http://marc.info/?l=linux-mm=143031212312459

Yes, yes, sure, I didn't mean we should wait under oom_lock.

Oleg.
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
Oleg Nesterov wrote:
> To simplify the discussion lets ignore PF_FROZEN, this is another issue.
>
> I am not sure this change is enough, we need to ensure that
> select_bad_process() won't pick the same task (or its sub-thread) again.

SysRq-f is sometimes unusable because it continues choosing the same
thread. oom_kill_process() should not choose a thread which already has
TIF_MEMDIE. I think we need to rewrite oom_kill_process().

> And perhaps something like
>
> 	wait_event_timeout(oom_victims_wait, !oom_victims,
> 			   configurable_timeout);
>
> before select_bad_process() makes sense?

I think you should not sleep for long with oom_lock mutex held.
http://marc.info/?l=linux-mm=143031212312459
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On 09/18, Christoph Lameter wrote:
>
> > But yes, such a deadlock is possible. I would really like to see the
> > comments from maintainers. In particular, I seem to recall that someone
> > suggested to try to kill another !TIF_MEMDIE process after timeout,
> > perhaps this is what we should actually do...
>
> Well yes here is a patch that kills another memdie process but there is
> some risk with such an approach of overusing the reserves.

Yes, I understand it is not that simple. And probably this is all I can
understand ;)

> --- linux.orig/mm/oom_kill.c	2015-09-18 10:38:29.601963726 -0500
> +++ linux/mm/oom_kill.c	2015-09-18 10:39:55.911699017 -0500
> @@ -265,8 +265,8 @@ enum oom_scan_t oom_scan_process_thread(
>  	 * Don't allow any other task to have access to the reserves.
>  	 */
>  	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> -		if (oc->order != -1)
> -			return OOM_SCAN_ABORT;
> +		if (unlikely(frozen(task)))
> +			__thaw_task(task);

To simplify the discussion lets ignore PF_FROZEN, this is another issue.

I am not sure this change is enough, we need to ensure that
select_bad_process() won't pick the same task (or its sub-thread) again.

And perhaps something like

	wait_event_timeout(oom_victims_wait, !oom_victims,
			   configurable_timeout);

before select_bad_process() makes sense?

Oleg.
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
> But yes, such a deadlock is possible. I would really like to see the comments
> from maintainers. In particular, I seem to recall that someone suggested to
> try to kill another !TIF_MEMDIE process after timeout, perhaps this is what
> we should actually do...

Well yes here is a patch that kills another memdie process but there is
some risk with such an approach of overusing the reserves.

Subject: Allow multiple kills from the OOM killer

The OOM killer currently aborts if it finds a process that already is
having access to the reserve memory pool for exit processing. This is
done so that the reserves are not overcommitted but on the other hand
this also allows only one process being oom killed at the time. That
process may be stuck in D state.

The patch simply removes the aborting of the scan so that other
processes may be killed if one is stuck in D state.

Signed-off-by: Christoph Lameter

Index: linux/mm/oom_kill.c
===================================================================
--- linux.orig/mm/oom_kill.c	2015-09-18 10:38:29.601963726 -0500
+++ linux/mm/oom_kill.c	2015-09-18 10:39:55.911699017 -0500
@@ -265,8 +265,8 @@ enum oom_scan_t oom_scan_process_thread(
 	 * Don't allow any other task to have access to the reserves.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (oc->order != -1)
-			return OOM_SCAN_ABORT;
+		if (unlikely(frozen(task)))
+			__thaw_task(task);
 	}
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On 09/18, Christoph Lameter wrote:
>
> --- linux.orig/mm/oom_kill.c	2015-09-18 11:58:52.963946782 -0500
> +++ linux/mm/oom_kill.c	2015-09-18 11:59:42.010684778 -0500
> @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
>  	 * This task already has access to memory reserves and is being killed.
>  	 * Don't allow any other task to have access to the reserves.
>  	 */
> -	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> -		if (oc->order != -1)
> -			return OOM_SCAN_ABORT;
> -	}
> +	if (test_tsk_thread_flag(task, TIF_MEMDIE))
> +		return OOM_SCAN_CONTINUE;
> +

Well, I can't really comment. Hopefully we will see more comments from
those who understand oom-killer. But I still think this is not enough,
and we need some (configurable?) timeout before we pick another victim...

And btw, yes, this is a bit off-topic, but I think another change makes
sense too. We should report the fact that we are going to kill another
task because the previous victim refuses to die, and print its stack trace.

Oleg.
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Fri, 18 Sep 2015, Oleg Nesterov wrote:

> And btw, yes, this is a bit off-topic, but I think another change makes
> sense too. We should report the fact that we are going to kill another
> task because the previous victim refuses to die, and print its stack trace.

What happens is that the previous victim did not enter exit processing.
If it had, it would be excluded by other checks. The first victim never
reacted and never started using the memory resources available for
exiting. That's why I thought it may be safe to go this way.

An issue could result from another process being terminated and the
first victim finally reacting to the signal and also beginning
termination. Then we have contention on the reserves.
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Fri, 18 Sep 2015, Kyle Walker wrote:

> I do like the idea of not stalling completely in an oom just because the
> first attempt didn't go so well. Is there any possibility of simply having
> our cake and eating it too? Specifically, omitting TASK_UNINTERRUPTIBLE
> tasks as low-hanging fruit and allowing the oom to continue in the event
> that the first attempt stalls?

TASK_UNINTERRUPTIBLE tasks should not be sleeping that long and they
*should react* in a reasonable timeframe. There is an alternative API for
those cases that cannot. Typically this is a write that is stalling. If we
kill the process then it's pointless to wait on the write to complete.

See

https://lwn.net/Articles/288056/
http://www.ibm.com/developerworks/library/l-task-killable/
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
Oleg Nesterov wrote:
> To simplify the discussion lets ignore PF_FROZEN, this is another issue.
>
> I am not sure this change is enough, we need to ensure that
> select_bad_process() won't pick the same task (or its sub-thread) again.

SysRq-f is sometimes unusable because it continues choosing the same thread.
oom_kill_process() should not choose a thread which already has TIF_MEMDIE.
I think we need to rewrite oom_kill_process().

> And perhaps something like
>
> 	wait_event_timeout(oom_victims_wait, !oom_victims,
> 			   configurable_timeout);
>
> before select_bad_process() makes sense?

I think you should not sleep for long with the oom_lock mutex held.
http://marc.info/?l=linux-mm=143031212312459
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On 09/19, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > To simplify the discussion lets ignore PF_FROZEN, this is another issue.
> >
> > I am not sure this change is enough, we need to ensure that
> > select_bad_process() won't pick the same task (or its sub-thread) again.
>
> SysRq-f is sometimes unusable because it continues choosing the same thread.
> oom_kill_process() should not choose a thread which already has TIF_MEMDIE.

So I was right, this is really not enough...

> I think we need to rewrite oom_kill_process().

Heh. I can only ack the intent and wish you good luck ;)

> > And perhaps something like
> >
> > 	wait_event_timeout(oom_victims_wait, !oom_victims,
> > 			   configurable_timeout);
> >
> > before select_bad_process() makes sense?
>
> I think you should not sleep for long with the oom_lock mutex held.
> http://marc.info/?l=linux-mm=143031212312459

Yes, yes, sure, I didn't mean we should wait under oom_lock.

Oleg.
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
On Fri, 18 Sep 2015, Oleg Nesterov wrote:

> To simplify the discussion lets ignore PF_FROZEN, this is another issue.

Ok.

Subject: Allow multiple kills from the OOM killer

The OOM killer currently aborts if it finds a process that already is
having access to the reserve memory pool for exit processing. This is
done so that the reserves are not overcommitted but on the other hand
this also allows only one process being oom killed at the time. That
process may be stuck in D state.

Signed-off-by: Christoph Lameter

Index: linux/mm/oom_kill.c
===================================================================
--- linux.orig/mm/oom_kill.c	2015-09-18 11:58:52.963946782 -0500
+++ linux/mm/oom_kill.c	2015-09-18 11:59:42.010684778 -0500
@@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves.
 	 */
-	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (oc->order != -1)
-			return OOM_SCAN_ABORT;
-	}
+	if (test_tsk_thread_flag(task, TIF_MEMDIE))
+		return OOM_SCAN_CONTINUE;
+
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;
Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks
Add cc's.

On 09/17, Kyle Walker wrote:
>
> Currently, the oom killer will attempt to kill a process that is in
> TASK_UNINTERRUPTIBLE state. For tasks in this state for an exceptional
> period of time, such as processes writing to a frozen filesystem during
> a lengthy backup operation, this can result in a deadlock condition as
> related processes memory access will stall within the page fault
> handler.
>
> Within oom_unkillable_task(), check for processes in
> TASK_UNINTERRUPTIBLE (TASK_KILLABLE omitted). The oom killer will
> move on to another task.
>
> Signed-off-by: Kyle Walker
> ---
>  mm/oom_kill.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ecc0bc..66f03f8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -131,6 +131,10 @@ static bool oom_unkillable_task(struct task_struct *p,
>  	if (memcg && !task_in_mem_cgroup(p, memcg))
>  		return true;
>
> +	/* Uninterruptible tasks should not be killed unless in TASK_WAKEKILL */
> +	if (p->state == TASK_UNINTERRUPTIBLE)
> +		return true;
> +

So we can skip a memory hog which, say, does mutex_lock(). And this can't
help if this task is multithreaded: unless all its sub-threads are in "D"
state too, the oom killer will pick another thread with the same ->mm.
Plus other problems.

But yes, such a deadlock is possible. I would really like to see the
comments from maintainers. In particular, I seem to recall that someone
suggested to try to kill another !TIF_MEMDIE process after timeout,
perhaps this is what we should actually do...

Oleg.