Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
With the full changelog and the vmstat update for reference.

---
From 9492966a552751e6d7a63e9aafb87e35992b840a Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 11 Nov 2015 16:45:53 +0100
Subject: [PATCH] mm, vmstat: Allow WQ concurrency to discover memory reclaim
 doesn't make any progress

Tetsuo Handa has reported that the system might basically livelock in an
OOM condition without triggering the OOM killer. The issue is caused by an
internal dependency of the direct reclaim on vmstat counter updates (via
zone_reclaimable) which are performed from the workqueue context. If all
the current workers get assigned to an allocation request, though, they
will be looping inside the allocator trying to reclaim memory, but
zone_reclaimable can see stalled numbers, so it will consider a zone
reclaimable even though it has been scanned way too much. The WQ
concurrency logic will not consider this situation a congested workqueue,
because it relies on the fact that a worker would have to sleep in such a
situation. This also means that it doesn't try to spawn new workers or
invoke the rescuer thread if one is assigned to the queue.

In order to fix this issue we need to do two things. First, we have to let
the WQ concurrency code know that we are in trouble, so we have to do a
short sleep. In order to avoid reintroducing the issues handled by
0e093d99763e ("writeback: do not sleep on the congestion queue if there
are no congested BDIs or if significant congestion is not being
encountered in the current zone"), we limit the sleep to worker threads,
which are the ones of interest anyway. The second thing to do is to create
a dedicated workqueue for vmstat and mark it WQ_MEM_RECLAIM to note that
it participates in reclaim and to have a spare worker thread for it.
Reported-by: Tetsuo Handa
Signed-off-by: Michal Hocko
---
 mm/backing-dev.c | 19 ++++++++++++++++---
 mm/vmstat.c      |  6 ++++--
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8ed2ffd963c5..7340353f8aea 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,8 +957,9 @@ EXPORT_SYMBOL(congestion_wait);
 * jiffies for either a BDI to exit congestion of the given @sync queue
 * or a write to complete.
 *
- * In the absence of zone congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
+ * In the absence of zone congestion, a short sleep or a cond_resched is
+ * performed to yield the processor and to allow other subsystems to make
+ * a forward progress.
 *
 * The return value is 0 if the sleep is for the full timeout. Otherwise,
 * it is the number of jiffies that were still remaining when the function
@@ -978,7 +979,19 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
	 */
	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
-		cond_resched();
+
+		/*
+		 * Memory allocation/reclaim might be called from a WQ
+		 * context and the current implementation of the WQ
+		 * concurrency control doesn't recognize that a particular
+		 * WQ is congested if the worker thread is looping without
+		 * ever sleeping. Therefore we have to do a short sleep
+		 * here rather than calling cond_resched().
+		 */
+		if (current->flags & PF_WQ_WORKER)
+			schedule_timeout(1);
+		else
+			cond_resched();

		/* In case we scheduled, work out time remaining */
		ret = timeout - (jiffies - start);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 45dcbcb5c594..0975da8e3432 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1381,6 +1381,7 @@ static const struct file_operations proc_vmstat_file_operations = {
 #endif /* CONFIG_PROC_FS */

 #ifdef CONFIG_SMP
+static struct workqueue_struct *vmstat_wq;
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 static cpumask_var_t cpu_stat_off;
@@ -1393,7 +1394,7 @@ static void vmstat_update(struct work_struct *w)
		 * to occur in the future. Keep on running the
		 * update worker thread.
		 */
-		schedule_delayed_work_on(smp_processor_id(),
+		queue_delayed_work_on(smp_processor_id(), vmstat_wq,
			this_cpu_ptr(&vmstat_work),
			round_jiffies_relative(sysctl_stat_interval));
	} else {
@@ -1462,7 +1463,7 @@ static void vmstat_shepherd(struct work_struct *w)
		if (need_update(cpu) &&
		    cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
-			schedule_delayed_work_on(cpu,
+			queue_delayed_work_on(cpu, vmstat_wq,
				&per_cpu(vmstat_work, cpu), 0);

	put_online_cpus();
@@ -1551,6 +1552,7 @@ static int
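The worker-vs-ordinary-task branch added above can be illustrated with a userspace analogy. This is only a sketch of the idea, not kernel code: `yield_or_short_sleep()` and the `is_worker` flag are invented names standing in for `current->flags & PF_WQ_WORKER`, `schedule_timeout(1)` and `cond_resched()`. The point is that a worker-like thread must genuinely sleep, so that a concurrency manager which only notices sleeping threads can detect the stall, while everyone else just yields the CPU as before.

#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <time.h>

/*
 * Userspace analogy of the wait_iff_congested() hunk: a worker-like
 * thread does a real (short) sleep; an ordinary task merely yields
 * the processor without ever going to sleep.
 */
static void yield_or_short_sleep(bool is_worker)
{
	if (is_worker) {
		/* analogue of schedule_timeout(1): sleep ~1ms */
		struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 * 1000 };
		nanosleep(&ts, NULL);
	} else {
		/* analogue of cond_resched(): yield without sleeping */
		sched_yield();
	}
}

int main(void)
{
	struct timespec a, b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	yield_or_short_sleep(true);
	clock_gettime(CLOCK_MONOTONIC, &b);
	/* the worker path must have actually slept for at least ~1ms */
	assert((b.tv_sec - a.tv_sec) * 1000000000L +
	       (b.tv_nsec - a.tv_nsec) >= 1000 * 1000);

	yield_or_short_sleep(false); /* non-worker path: plain yield */
	return 0;
}

A one-jiffy sleep is the shortest interval the scheduler can account for, which is why the patch uses `schedule_timeout(1)` rather than a longer `congestion_wait()`-style delay.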
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu 05-11-15 19:16:48, Tejun Heo wrote:
> Hello,
>
> On Thu, Nov 05, 2015 at 11:45:42AM -0600, Christoph Lameter wrote:
> > Sorry but we need work queue processing for vmstat counters that is
>
> I made this analogy before but this is similar to looping with
> preemption off. If anything on workqueue stays RUNNING w/o making
> forward progress, it's buggy. I'd venture to say any code which busy
> loops without making forward progress in the time scale noticeable to
> human beings is borderline buggy too.

Well, the caller asked for memory but the request cannot succeed. Due to
the memory allocator semantics we cannot fail the request, so we have to
loop. If we had an event to wait for, we would do so, of course.

Now wrt. a small sleep: we used to do that and called
congestion_wait(HZ/50) before the retry. This proved to cause stalls
during high memory pressure; see 0e093d99763e ("writeback: do not sleep
on the congestion queue if there are no congested BDIs or if significant
congestion is not being encountered in the current zone"). I do not
really remember what CONFIG_HZ was in those reports, but it is quite
possible it was 250. So there is a risk of (partially) reintroducing
those stalls with the patch from Tetsuo
(http://lkml.kernel.org/r/201510251952.cef04109.osotlfhfvfj...@i-love.sakura.ne.jp).

If we really have to do a short sleep, though, then I would suggest
sticking it into wait_iff_congested rather than spreading it into more
places, and reducing it to worker threads only. This should be much
safer. Thoughts?

---
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8ed2ffd963c5..7340353f8aea 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,8 +957,9 @@ EXPORT_SYMBOL(congestion_wait);
 * jiffies for either a BDI to exit congestion of the given @sync queue
 * or a write to complete.
 *
- * In the absence of zone congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
+ * In the absence of zone congestion, a short sleep or a cond_resched is
+ * performed to yield the processor and to allow other subsystems to make
+ * a forward progress.
 *
 * The return value is 0 if the sleep is for the full timeout. Otherwise,
 * it is the number of jiffies that were still remaining when the function
@@ -978,7 +979,19 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
	 */
	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
-		cond_resched();
+
+		/*
+		 * Memory allocation/reclaim might be called from a WQ
+		 * context and the current implementation of the WQ
+		 * concurrency control doesn't recognize that a particular
+		 * WQ is congested if the worker thread is looping without
+		 * ever sleeping. Therefore we have to do a short sleep
+		 * here rather than calling cond_resched().
+		 */
+		if (current->flags & PF_WQ_WORKER)
+			schedule_timeout(1);
+		else
+			cond_resched();

		/* In case we scheduled, work out time remaining */
		ret = timeout - (jiffies - start);
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello,

On Thu, Nov 05, 2015 at 11:45:42AM -0600, Christoph Lameter wrote:
> Sorry but we need work queue processing for vmstat counters that is

I made this analogy before but this is similar to looping with preemption
off. If anything on a workqueue stays RUNNING w/o making forward
progress, it's buggy. I'd venture to say any code which busy loops
without making forward progress on a time scale noticeable to human
beings is borderline buggy too. If things need to be retried on that time
scale, putting a short sleep between trials is a sensible thing to do.
There's no point in occupying the cpu and burning cycles without making
forward progress. These things actually matter. The freezer used to burn
cycles this way and was really good at burning off the last remaining
battery reserve during emergency hibernation if freezing took some amount
of time.

It is true that, as it currently stands, this is error-prone because
workqueue can't detect these conditions and warn about them. The same
goes for workqueues which sit in the memory reclaim path but forget
WQ_MEM_RECLAIM. I'm going to add lockup detection, similar to how
softlockup works, but that's a different issue, so please update the
code.

Thanks.

--
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 5 Nov 2015, Tetsuo Handa wrote:
> memory allocation. By allowing workqueue items to be processed (by using
> short sleep), some task might release memory when workqueue item is
> processed.
>
> Therefore, not only to keep vmstat counters up to date, but also to
> avoid wasting CPU cycles, I prefer a short sleep.

Sorry, but we need workqueue processing for vmstat counters that is
independent of other submitted requests that may block. Adding points
where we sleep / schedule everywhere to do this is not the right
approach.
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Michal Hocko wrote:
> As already pointed out I really detest a short sleep and would prefer
> a way to tell WQ what we really need. vmstat is not the only user. OOM
> sysrq will need this special treatment as well. While the
> zone_reclaimable can be fixed in an easy patch
> (http://lkml.kernel.org/r/201510212126.JIF90648.HOOFJVFQLMStOF%40I-love.SAKURA.ne.jp)
> which is perfectly suited for the stable backport, OOM sysrq resp. any
> sysrq which runs from the WQ context should be as robust as possible and
> shouldn't rely on all the code running from WQ context to issue a sleep
> to get unstuck. So I definitely support something like this patch.

I still prefer a short sleep, from a different perspective. I tested the
above patch with the patch below applied

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0499ff..54bedd8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2992,6 +2992,53 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
 }

+static atomic_t stall_tasks;
+
+static int kmallocwd(void *unused)
+{
+	struct task_struct *g, *p;
+	unsigned int sigkill_pending;
+	unsigned int memdie_pending;
+	unsigned int stalling_tasks;
+
+ not_stalling: /* Healthy case. */
+	schedule_timeout_interruptible(HZ);
+	if (likely(!atomic_read(&stall_tasks)))
+		goto not_stalling;
+ maybe_stalling: /* Maybe something is wrong. Let's check. */
+	/* Count stalling tasks, dying and victim tasks. */
+	sigkill_pending = 0;
+	memdie_pending = 0;
+	stalling_tasks = atomic_read(&stall_tasks);
+	preempt_disable();
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (test_tsk_thread_flag(p, TIF_MEMDIE))
+			memdie_pending++;
+		if (fatal_signal_pending(p))
+			sigkill_pending++;
+	}
+	rcu_read_unlock();
+	preempt_enable();
+	pr_warn("MemAlloc-Info: %u stalling task, %u dying task, %u victim task.\n",
+		stalling_tasks, sigkill_pending, memdie_pending);
+	show_workqueue_state();
+	schedule_timeout_interruptible(10 * HZ);
+	if (atomic_read(&stall_tasks))
+		goto maybe_stalling;
+	goto not_stalling;
+	return 0; /* To suppress "no return statement" compiler warning. */
+}
+
+static int __init start_kmallocwd(void)
+{
+	struct task_struct *task = kthread_run(kmallocwd, NULL,
+					       "kmallocwd");
+	BUG_ON(IS_ERR(task));
+	return 0;
+}
+late_initcall(start_kmallocwd);
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
						struct alloc_context *ac)
@@ -3004,6 +3051,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
	enum migrate_mode migration_mode = MIGRATE_ASYNC;
	bool deferred_compaction = false;
	int contended_compaction = COMPACT_CONTENDED_NONE;
+	unsigned long start = jiffies;
+	bool stall_counted = false;

	/*
	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3095,6 +3144,11 @@ retry:
	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
		goto nopage;

+	if (!stall_counted && time_after(jiffies, start + 10 * HZ)) {
+		atomic_inc(&stall_tasks);
+		stall_counted = true;
+	}
+
	/*
	 * Try direct compaction. The first pass is asynchronous. Subsequent
	 * attempts after direct reclaim are synchronous
@@ -3188,6 +3242,8 @@ noretry:
 nopage:
	warn_alloc_failed(gfp_mask, order, NULL);
 got_pg:
+	if (stall_counted)
+		atomic_dec(&stall_tasks);
	return page;
 }

using a crazy stressing program. (Not a TIF_MEMDIE stall.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>

static void child(void)
{
	char *buf = NULL;
	unsigned long size = 0;
	const int fd = open("/dev/zero", O_RDONLY);

	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);

		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	read(fd, buf, size); /* Will cause OOM due to overcommit */
}

int main(int argc, char *argv[])
{
	if (argc > 1) {
		int i;
		char buffer[4096];

		for (i = 0; i < 1000; i++) {
			if (fork() == 0) {
				sleep(20);
				memset(buffer, 0, sizeof(buffer));
				_exit(0);
			}
		}
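The kmallocwd pattern in the patch above — stalled allocators bump a shared counter, and a watchdog thread periodically samples it and reports — can be sketched in userspace with pthreads. This is a rough analogy only: `slow_path()` and the timing constants are invented for illustration, while the kernel version additionally walks the task list and calls show_workqueue_state().

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

/* Counter of threads currently stuck in a slow path, analogous to
 * the stall_tasks atomic in the kmallocwd patch. */
static atomic_int stall_tasks;
static atomic_int reports;	/* how many times the watchdog fired */
static atomic_int stop;

/* Watchdog: stay quiet while nothing stalls, report while something does. */
static void *kmallocwd(void *unused)
{
	(void)unused;
	while (!atomic_load(&stop)) {
		usleep(10 * 1000);	/* analogue of the HZ-based sleep */
		int n = atomic_load(&stall_tasks);
		if (n) {
			fprintf(stderr, "MemAlloc-Info: %d stalling task(s)\n", n);
			atomic_fetch_add(&reports, 1);
		}
	}
	return NULL;
}

/* A "slow allocation": mark ourselves stalled, do slow work, unmark. */
static void slow_path(void)
{
	atomic_fetch_add(&stall_tasks, 1);
	usleep(100 * 1000);
	atomic_fetch_sub(&stall_tasks, 1);
}

int main(void)
{
	pthread_t wd;

	pthread_create(&wd, NULL, kmallocwd, NULL);
	slow_path();
	atomic_store(&stop, 1);
	pthread_join(wd, NULL);
	assert(atomic_load(&reports) >= 1);	/* the stall was noticed */
	assert(atomic_load(&stall_tasks) == 0);	/* counter balanced */
	return 0;
}

The design point mirrored here is that the watchdog never blocks the stalling tasks themselves; it only observes a counter they maintain, so it keeps working even when the allocator side makes no progress at all.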
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, Tetsuo.

On Tue, Nov 03, 2015 at 11:32:06AM +0900, Tetsuo Handa wrote:
> Tejun Heo wrote:
> > If
> > the possibility of sysrq getting stuck behind concurrency management
> > is an issue, queueing them on an unbound or highpri workqueue should
> > be good enough.
>
> Regarding SysRq-f, we could do like below. Though I think that converting
> the OOM killer into a dedicated kernel thread would allow more things to do
> (e.g. Oleg's memory zapping code, my timeout based next victim selection).

I'm not sure doing anything to sysrq-f is warranted. If workqueue can't
make forward progress due to memory exhaustion, OOM will be triggered
anyway. Getting stuck behind concurrency management isn't that different
a failure mode from getting stuck behind a busy loop with preemption off.
We should just plug them at the source. If necessary, what we can do is
add a stall watchdog (which can probably be combined with the usual
watchdog) so that it can better point out the culprit.

Thanks.

--
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Tejun Heo wrote:
> If
> the possibility of sysrq getting stuck behind concurrency management
> is an issue, queueing them on an unbound or highpri workqueue should
> be good enough.

Regarding SysRq-f, we could do like below. Though I think that converting
the OOM killer into a dedicated kernel thread would allow more things to
do (e.g. Oleg's memory zapping code, my timeout based next victim
selection).

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 5381a72..46b951aa 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -47,6 +47,7 @@
 #include <...>
 #include <...>
 #include <...>
+#include <linux/kthread.h>
 #include <...>
 #include <...>
@@ -351,27 +352,35 @@ static struct sysrq_key_op sysrq_term_op = {
	.enable_mask	= SYSRQ_ENABLE_SIGNAL,
 };

-static void moom_callback(struct work_struct *ignored)
+static DECLARE_WAIT_QUEUE_HEAD(moom_wait);
+
+static int moom_callback(void *unused)
 {
	const gfp_t gfp_mask = GFP_KERNEL;
-	struct oom_control oc = {
-		.zonelist = node_zonelist(first_memory_node, gfp_mask),
-		.nodemask = NULL,
-		.gfp_mask = gfp_mask,
-		.order = -1,
-	};
-
-	mutex_lock(&oom_lock);
-	if (!out_of_memory(&oc))
-		pr_info("OOM request ignored because killer is disabled\n");
-	mutex_unlock(&oom_lock);
+	DEFINE_WAIT(wait);
+
+	while (1) {
+		struct oom_control oc = {
+			.zonelist = node_zonelist(first_memory_node, gfp_mask),
+			.nodemask = NULL,
+			.gfp_mask = gfp_mask,
+			.order = -1,
+		};
+
+		prepare_to_wait(&moom_wait, &wait, TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&moom_wait, &wait);
+		mutex_lock(&oom_lock);
+		if (!out_of_memory(&oc))
+			pr_info("OOM request ignored because killer is disabled\n");
+		mutex_unlock(&oom_lock);
+	}
+	return 0;
 }

-static DECLARE_WORK(moom_work, moom_callback);
-
 static void sysrq_handle_moom(int key)
 {
-	schedule_work(&moom_work);
+	wake_up(&moom_wait);
 }

 static struct sysrq_key_op sysrq_moom_op = {
	.handler	= sysrq_handle_moom,
@@ -1116,6 +1125,9 @@ static inline void sysrq_init_procfs(void)

 static int __init sysrq_init(void)
 {
+	struct task_struct *task = kthread_run(moom_callback, NULL,
+					       "manual_oom");
+	BUG_ON(IS_ERR(task));
	sysrq_init_procfs();

	if (sysrq_on())
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Mon, Nov 02, 2015 at 04:01:37PM +0100, Michal Hocko wrote:
...
> which is perfectly suited for the stable backport, OOM sysrq resp. any
> sysrq which runs from the WQ context should be as robust as possible and
> shouldn't rely on all the code running from WQ context to issue a sleep
> to get unstuck. So I definitely support something like this patch.

Well, sysrq wouldn't run successfully either on a cpu which is busy looping with preemption off. I don't think this calls for a new flag to modify workqueue behavior, especially given that missing such a flag would lead to the same kind of lockup. It's a shitty solution.

If the possibility of sysrq getting stuck behind concurrency management is an issue, queueing them on an unbound or highpri workqueue should be good enough.

Thanks.

-- 
tejun
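Tejun's suggestion above could look roughly like the sketch below. This is an illustration only, not code from the thread: the workqueue name `sysrq_wq` and the init helper are invented, and WQ_HIGHPRI would be the alternative flag per the discussion.

```c
/*
 * Hypothetical sketch of "queueing them on an unbound or highpri
 * workqueue". Unbound workers are not subject to the per-cpu
 * concurrency management, so a worker busy-looping in the allocator
 * on some CPU cannot strand this work item behind it.
 */
static struct workqueue_struct *sysrq_wq;	/* assumed name */
static DECLARE_WORK(moom_work, moom_callback);

static int __init sysrq_wq_init(void)
{
	sysrq_wq = alloc_workqueue("sysrq", WQ_UNBOUND, 0);
	return sysrq_wq ? 0 : -ENOMEM;
}

static void sysrq_handle_moom(int key)
{
	/* instead of schedule_work(), which uses the managed system wq */
	queue_work(sysrq_wq, &moom_work);
}
```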
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 13:26:49, Tejun Heo wrote:
> Hello,
>
> So, something like the following. Just compile tested but this is
> essentially partial revert of 3270476a6c0c ("workqueue: reimplement
> WQ_HIGHPRI using a separate worker_pool") - resurrecting the old
> WQ_HIGHPRI implementation under WQ_IMMEDIATE, so we know this works.
> If for some reason, it gets decided against simply adding one jiffy
> sleep, please let me know. I'll verify the operation and post a
> proper patch. That said, given that this prolly needs -stable
> backport and vmstat is likely to be the only user (busy loops are
> really rare in the kernel after all), I think the better approach
> would be reinstating the short sleep.

As already pointed out, I really detest a short sleep and would prefer a way to tell the WQ what we really need. vmstat is not the only user. OOM sysrq will need this special treatment as well. While zone_reclaimable can be fixed by an easy patch (http://lkml.kernel.org/r/201510212126.JIF90648.HOOFJVFQLMStOF%40I-love.SAKURA.ne.jp) which is perfectly suited for the stable backport, OOM sysrq, resp. any sysrq which runs from the WQ context, should be as robust as possible and shouldn't rely on all the code running from WQ context to issue a sleep to get unstuck. So I definitely support something like this patch.

I am still not sure whether other WQ_MEM_RECLAIM users need this flag as well because I am not familiar with their implementations, but at least vmstat and sysrq should use it, and it should be safe to do so without risk of breaking anything AFAICS.

Thanks!
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Tue 27-10-15 19:55:06, Tejun Heo wrote:
> On Tue, Oct 27, 2015 at 10:22:31AM +0100, Michal Hocko wrote:
> ...
> > stable kernels without causing any other regressions. 2) is the way
> > to move forward for next kernels and we should really think whether
> > WQ_MEM_RECLAIM should imply also WQ_HIGHPRI by default. If there is a
> > general consensus that there are legitimate WQ_MEM_RECLAIM users which
> > can do without the other flag then I am perfectly OK to use it for
> > vmstat and oom sysrq dedicated workqueues.
>
> I don't think flagging these things is a good approach. These are too
> easy to miss. If this is a problem which needs to be solved, which
> I'm not convinced it is at this point, the right thing to do would be
> doing stall detection and kicking the next work item automatically.

To be honest, I do not really care whether this gets "fixed" in the stall detection code or by making WQ_MEM_RECLAIM flag a special behavior implicitly. All I would like to see is a guarantee that such workqueues are not staying behind just because all current workers are in the allocator. Adding artificial schedule_timeouts in the allocator is a fragile way to work around the issue.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Tue, Oct 27, 2015 at 08:07:38PM +0900, Tetsuo Handa wrote:
> Can't we have a waitqueue like
> http://lkml.kernel.org/r/201510142121.ide86954.sovoffqofmj...@i-love.sakura.ne.jp
> ?

There's no reason to complicate it. It wouldn't buy anything meaningful. Can we please stop trying to solve a non-existent problem?

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Michal Hocko wrote:
> > On Fri, Oct 23, 2015 at 01:11:45PM +0200, Michal Hocko wrote:
> > > > The problem here is not lack
> > > > of execution resource but concurrency management misunderstanding the
> > > > situation.
> > >
> > > And this sounds like a bug to me.
> >
> > I don't know. It can be argued either way, the other direction being a
> > kernel thread going RUNNING non-stop is buggy. Given how this has
> > been a complete non-issue for all the years, I'm not sure how useful
> > plugging this is.
>
> Well, I guess we haven't noticed because this is a pathological case. It
> also triggers OOM livelocks which were not reported in the past either.
> You do not reach this state normally unless you really _want_ to kill your
> machine

I don't think we can say this is a pathological case. Customers' servers might have hit this state. We have no code for warning about this state.

> And vmstat is not the only instance. E.g. sysrq oom trigger is known
> to stay behind in similar cases. It should be changed to a dedicated
> WQ_MEM_RECLAIM wq and it would require runnable item guarantee as well.

Well, this seems to be the cause of SysRq-f being unresponsive...
http://lkml.kernel.org/r/201411231349.cag78628.vfqfotosfjm...@i-love.sakura.ne.jp

Picking up from http://lkml.kernel.org/r/201506112212.jag26531.flsvfmoqjot...@i-love.sakura.ne.jp
--
[  515.536393] Showing busy workqueues and worker pools:
[  515.538185] workqueue events: flags=0x0
[  515.539758]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=8/256
[  515.541872]     pending: vmpressure_work_fn, console_callback, vmstat_update, flush_to_ldisc, push_to_pool, moom_callback, sysrq_reinject_alt_sysrq, fb_deferred_io_work
[  515.546684] workqueue events_power_efficient: flags=0x80
[  515.548589]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=2/256
[  515.550829]     pending: neigh_periodic_work, check_lifetime
[  515.552884] workqueue events_freezable_power_: flags=0x84
[  515.554742]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  515.556846]     in-flight: 3837:disk_events_workfn
[  515.558665] workqueue writeback: flags=0x4e
[  515.560291]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[  515.562271]     in-flight: 3812:bdi_writeback_workfn bdi_writeback_workfn
[  515.564544] workqueue xfs-data/sda1: flags=0xc
[  515.566265]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
[  515.568359]     in-flight: 374(RESCUER):xfs_end_io, 3759:xfs_end_io, 26:xfs_end_io, 3836:xfs_end_io
[  515.571018]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[  515.573113]     in-flight: 179:xfs_end_io
[  515.574782] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=4 idle: 3790 237 3820
[  515.577230] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=5 manager: 219
[  515.579488] pool 16: cpus=0-7 flags=0x4 nice=0 workers=3 idle: 356 357
--
We want immediate execution guarantee for not only vmstat_update and moom_callback but also vmstat_shepherd and console_callback?

> > > Don't we have some IO related paths which would suffer from the same
> > > problem. I haven't checked all the WQ_MEM_RECLAIM users but from the
> > > name I would expect they _do_ participate in the reclaim and so they
> > > should be able to make a progress.
> > > Now if your new IMMEDIATE flag will
> >
> > Seriously, nobody goes full-on RUNNING.
>
> Looping with cond_resched seems like general pattern in the kernel when
> there is no clear source to wait for. We have io_schedule when we know
> we should wait for IO (in case of congestion) but this is not necessarily
> the case - as you can see here. What should we wait for? A short nap
> without actually waiting on anything sounds like a dirty workaround to
> me.

Can't we have a waitqueue like http://lkml.kernel.org/r/201510142121.ide86954.sovoffqofmj...@i-love.sakura.ne.jp ?
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Tue, Oct 27, 2015 at 10:22:31AM +0100, Michal Hocko wrote:
...
> stable kernels without causing any other regressions. 2) is the way
> to move forward for next kernels and we should really think whether
> WQ_MEM_RECLAIM should imply also WQ_HIGHPRI by default. If there is a
> general consensus that there are legitimate WQ_MEM_RECLAIM users which
> can do without the other flag then I am perfectly OK to use it for
> vmstat and oom sysrq dedicated workqueues.

I don't think flagging these things is a good approach. These are too easy to miss. If this is a problem which needs to be solved, which I'm not convinced it is at this point, the right thing to do would be doing stall detection and kicking the next work item automatically.

Thanks.

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, Michal.

On Tue, Oct 27, 2015 at 10:16:03AM +0100, Michal Hocko wrote:
> > Seriously, nobody goes full-on RUNNING.
>
> Looping with cond_resched seems like general pattern in the kernel when
> there is no clear source to wait for. We have io_schedule when we know
> we should wait for IO (in case of congestion) but this is not necessarily
> the case - as you can see here. What should we wait for? A short nap
> without actually waiting on anything sounds like a dirty workaround to
> me.

It's one thing to do cond_resched() in long loops to avoid long priority inversions and another to indefinitely loop without making any difference.

> > > guarantee that then I would argue that it should be implicit for
> > > WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would
> > > be a counter argument for doing that?
> >
> > Not serving any actual purpose and degrading execution behavior.
>
> I dunno, I am not familiar with WQ internals to see the risks but to me
> it sounds like WQ_MEM_RECLAIM gives an incorrect impression of safety
> wrt. memory pressure and as demonstrated it doesn't do that. Even if you

It generally does. This is an extremely rare corner case where an infinite loop w/o forward progress is introduced w/o the user being outright buggy.

> consider cond_resched behavior of the page allocator as bug we should be
> able to handle this gracefully.

We can argue this back and forth forever but we'll either need to special-case it (be it a short sleep or a special flag) or implement a rather complex detection logic which is dubious in its practical usefulness. It's a trade-off, and given the circumstances adding a short sleep looks like a reasonable one to me. If this is more common, we definitely wanna go for automatic detection.

Thanks.

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Sun 25-10-15 19:52:59, Tetsuo Handa wrote:
[...]
> Three approaches are proposed for fixing this silent livelock problem.
>
> (1) Use zone_page_state_snapshot() instead of zone_page_state()
>     when doing zone_reclaimable() checks. This approach is clear,
>     straightforward and easy to backport. So far I cannot reproduce
>     this livelock using this change. But there might be more locations
>     which should use zone_page_state_snapshot().
>
> (2) Use a dedicated workqueue for vmstat_update item which is guaranteed
>     to be processed immediately. So far I cannot reproduce this livelock
>     using a dedicated workqueue created with WQ_MEM_RECLAIM|WQ_HIGHPRI
>     (patch proposed by Christoph Lameter). But according to Tejun Heo,
>     if we want to guarantee that nobody can reproduce this livelock, we
>     need to modify workqueue API because commit 3270476a6c0c ("workqueue:
>     reimplement WQ_HIGHPRI using a separate worker_pool") which went to
>     Linux 3.6 lost the guarantee.
>
> (3) Use a !TASK_RUNNING sleep inside page allocator side. This approach
>     is easy to backport. So far I cannot reproduce this livelock using
>     this approach. And I think that nobody can reproduce this livelock
>     because this changes the page allocator to obey the workqueue's
>     expectations. Even if we leave this livelock problem aside, not
>     entering into !TASK_RUNNING state for too long is an exclusive
>     occupation of workqueue which will make other items in the workqueue
>     needlessly deferred. We don't need to defer other items which do not
>     invoke a __GFP_WAIT allocation.
>
> This patch does approach (3), by inserting an uninterruptible sleep into
> page allocator side before retrying, in order to make sure that other
> workqueue items (especially vmstat_update item) are given a chance to be
> processed.
>
> Although a different problem, by using approach (3), we can alleviate
> needlessly burning CPU cycles even when we hit OOM-killer livelock problem
> (hang up after the OOM-killer messages are printed because the OOM victim
> cannot terminate due to dependency).

I really dislike this approach. Waiting without having an event to wait for is just too ugly. I think 1) is easiest to backport to stable kernels without causing any other regressions. 2) is the way to move forward for next kernels and we should really think whether WQ_MEM_RECLAIM should imply also WQ_HIGHPRI by default. If there is a general consensus that there are legitimate WQ_MEM_RECLAIM users which can do without the other flag then I am perfectly OK to use it for vmstat and oom sysrq dedicated workqueues.

> Signed-off-by: Tetsuo Handa
[...]
-- 
Michal Hocko
SUSE Labs
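For reference, approach (2) above would amount to something like the following sketch for mm/vmstat.c. The workqueue name, helper name, and exact flags are assumptions drawn from the discussion (Christoph Lameter's WQ_MEM_RECLAIM|WQ_HIGHPRI proposal), not a merged patch.

```c
/*
 * Hypothetical sketch of approach (2): give vmstat_update its own
 * rescuer-backed workqueue so it cannot be starved behind workers
 * that are busy-looping in the page allocator.
 */
static struct workqueue_struct *vmstat_wq;	/* assumed name */

static int __init setup_vmstat_wq(void)
{
	/* WQ_MEM_RECLAIM guarantees a rescuer thread; WQ_HIGHPRI is the
	 * part whose immediate-execution guarantee is disputed in this
	 * thread since 3270476a6c0c */
	vmstat_wq = alloc_workqueue("vmstat",
				    WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
	return vmstat_wq ? 0 : -ENOMEM;
}

/* callers would then queue the per-cpu item on vmstat_wq instead of
 * the shared system workqueue, e.g.:
 *	queue_delayed_work_on(cpu, vmstat_wq,
 *			      &per_cpu(vmstat_work, cpu),
 *			      round_jiffies_relative(sysctl_stat_interval));
 */
```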
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Sat 24-10-15 03:21:09, Tejun Heo wrote:
> Hello,
>
> On Fri, Oct 23, 2015 at 01:11:45PM +0200, Michal Hocko wrote:
> > > The problem here is not lack
> > > of execution resource but concurrency management misunderstanding the
> > > situation.
> >
> > And this sounds like a bug to me.
>
> I don't know. It can be argued either way, the other direction being a
> kernel thread going RUNNING non-stop is buggy. Given how this has
> been a complete non-issue for all the years, I'm not sure how useful
> plugging this is.

Well, I guess we haven't noticed because this is a pathological case. It also triggers OOM livelocks which were not reported in the past either. You do not reach this state normally unless you really _want_ to kill your machine.

And vmstat is not the only instance. E.g. sysrq oom trigger is known to stay behind in similar cases. It should be changed to a dedicated WQ_MEM_RECLAIM wq and it would require a runnable item guarantee as well.

> > Don't we have some IO related paths which would suffer from the same
> > problem. I haven't checked all the WQ_MEM_RECLAIM users but from the
> > name I would expect they _do_ participate in the reclaim and so they
> > should be able to make a progress. Now if your new IMMEDIATE flag will
>
> Seriously, nobody goes full-on RUNNING.

Looping with cond_resched seems like a general pattern in the kernel when there is no clear source to wait for. We have io_schedule when we know we should wait for IO (in case of congestion) but this is not necessarily the case - as you can see here. What should we wait for? A short nap without actually waiting on anything sounds like a dirty workaround to me.

> > guarantee that then I would argue that it should be implicit for
> > WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would
> > be a counter argument for doing that?
>
> Not serving any actual purpose and degrading execution behavior.

I dunno, I am not familiar with WQ internals to see the risks, but to me it sounds like WQ_MEM_RECLAIM gives an incorrect impression of safety wrt. memory pressure and as demonstrated it doesn't do that. Even if you consider the cond_resched behavior of the page allocator as a bug we should be able to handle this gracefully.
-- 
Michal Hocko
SUSE Labs
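The disagreement above comes down to two loop shapes, sketched schematically below. The helper `reclaim_made_progress()` is invented for illustration; the point is only the scheduling state of the looping worker.

```c
/* (a) the pattern under discussion: cond_resched() never leaves
 * TASK_RUNNING, so workqueue concurrency management considers the
 * worker busy and will not start the next work item (e.g. the
 * pending vmstat_update) on that pool */
while (!reclaim_made_progress())	/* hypothetical helper */
	cond_resched();

/* (b) the proposed workaround: a one-jiffy !TASK_RUNNING sleep lets
 * the per-pool idle accounting fire and another worker run, at the
 * cost of an artificial delay in the allocator */
while (!reclaim_made_progress())
	schedule_timeout_uninterruptible(1);
```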
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, Michal. On Tue, Oct 27, 2015 at 10:16:03AM +0100, Michal Hocko wrote: > > Seriously, nobody goes full-on RUNNING. > > Looping with cond_resched seems like general pattern in the kernel when > there is no clear source to wait for. We have io_schedule when we know > we should wait for IO (in case of congestion) but this is not necessarily > the case - as you can see here. What should we wait for? A short nap > without actually waiting on anything sounds like a dirty workaround to > me. It's one thing to do cond_resched() in long loops to avoid long priority inversions and another to indefinitely loop without making any difference. > > > guarantee that then I would argue that it should be implicit for > > > WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would > > > be a counter argument for doing that? > > > > Not serving any actual purpose and degrading execution behavior. > > I dunno, I am not familiar with WQ internals to see the risks but to me > it sounds like WQ_MEM_RECLAIM gives an incorrect impression of safety > wrt. memory pressure and as demonstrated it doesn't do that. Even if you It generally does. This is an extremely rare corner case where infinite loop w/o forward progress is introduce w/o the user being outright buggy. > consider cond_resched behavior of the page allocator as bug we should be > able to handle this gracefully. We can argue this back and forth forever but we'll either need to special case it (be it short sleep or a special flag) or implement a rather complex detection logic which will likely involve some level of complexity and is dubious in its practical usefulness. It's a trade-off and given the circumstances adding short sleep looks like a reasonable one to me. If this is more common, we definitely wanna go for automatic detection. Thanks. 
-- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Michal Hocko wrote: > > On Fri, Oct 23, 2015 at 01:11:45PM +0200, Michal Hocko wrote: > > > > The problem here is not lack > > > > of execution resource but concurrency management misunderstanding the > > > > situation. > > > > > > And this sounds like a bug to me. > > > > I don't know. It can be argued either way, the other direction being that a > > kernel thread going RUNNING non-stop is buggy. Given how this has > > been a complete non-issue for all the years, I'm not sure how useful > > plugging this is. > > Well, I guess we haven't noticed because this is a pathological case. It > also triggers OOM livelocks which were not reported in the past either. > You do not reach this state normally unless you really _want_ to kill your > machine I don't think we can say this is a pathological case. Customers' servers might have hit this state. We have no code for warning about this state. > > And vmstat is not the only instance. E.g. sysrq oom trigger is known > to stay behind in similar cases. It should be changed to a dedicated > WQ_MEM_RECLAIM wq and it would require runnable item guarantee as well. > Well, this seems to be the cause of SysRq-f being unresponsive...
http://lkml.kernel.org/r/201411231349.cag78628.vfqfotosfjm...@i-love.sakura.ne.jp

Picking up from http://lkml.kernel.org/r/201506112212.jag26531.flsvfmoqjot...@i-love.sakura.ne.jp

--
[ 515.536393] Showing busy workqueues and worker pools:
[ 515.538185] workqueue events: flags=0x0
[ 515.539758]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=8/256
[ 515.541872]     pending: vmpressure_work_fn, console_callback, vmstat_update, flush_to_ldisc, push_to_pool, moom_callback, sysrq_reinject_alt_sysrq, fb_deferred_io_work
[ 515.546684] workqueue events_power_efficient: flags=0x80
[ 515.548589]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=2/256
[ 515.550829]     pending: neigh_periodic_work, check_lifetime
[ 515.552884] workqueue events_freezable_power_: flags=0x84
[ 515.554742]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 515.556846]     in-flight: 3837:disk_events_workfn
[ 515.558665] workqueue writeback: flags=0x4e
[ 515.560291]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[ 515.562271]     in-flight: 3812:bdi_writeback_workfn bdi_writeback_workfn
[ 515.564544] workqueue xfs-data/sda1: flags=0xc
[ 515.566265]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
[ 515.568359]     in-flight: 374(RESCUER):xfs_end_io, 3759:xfs_end_io, 26:xfs_end_io, 3836:xfs_end_io
[ 515.571018]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 515.573113]     in-flight: 179:xfs_end_io
[ 515.574782] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=4 idle: 3790 237 3820
[ 515.577230] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=5 manager: 219
[ 515.579488] pool 16: cpus=0-7 flags=0x4 nice=0 workers=3 idle: 356 357
--

We want immediate execution guarantee for not only vmstat_update and moom_callback but also vmstat_shepherd and console_callback? > > > Don't we have some IO related paths which would suffer from the same > > > problem. I haven't checked all the WQ_MEM_RECLAIM users but from the > > > name I would expect they _do_ participate in the reclaim and so they > > > should be able to make a progress.
Now if your new IMMEDIATE flag will > > > > Seriously, nobody goes full-on RUNNING. > > Looping with cond_resched seems like general pattern in the kernel when > there is no clear source to wait for. We have io_schedule when we know > we should wait for IO (in case of congestion) but this is not necessarily > the case - as you can see here. What should we wait for? A short nap > without actually waiting on anything sounds like a dirty workaround to > me. Can't we have a waitqueue like http://lkml.kernel.org/r/201510142121.ide86954.sovoffqofmj...@i-love.sakura.ne.jp ?
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Tue, Oct 27, 2015 at 08:07:38PM +0900, Tetsuo Handa wrote: > Can't we have a waitqueue like > http://lkml.kernel.org/r/201510142121.ide86954.sovoffqofmj...@i-love.sakura.ne.jp > ? There's no reason to complicate it. It wouldn't buy anything meaningful. Can we please stop trying to solve a non-existent problem? -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Tue 27-10-15 19:55:06, Tejun Heo wrote: > On Tue, Oct 27, 2015 at 10:22:31AM +0100, Michal Hocko wrote: > ... > > stable kernels without causing any other regressions. 2) is the way > > to move forward for next kernels and we should really think whether > > WQ_MEM_RECLAIM should imply also WQ_HIGHPRI by default. If there is a > > general consensus that there are legitimate WQ_MEM_RECLAIM users which > > can do without the other flag then I am perfectly OK to use it for > > vmstat and oom sysrq dedicated workqueues. > > I don't think flagging these things is a good approach. These are too > easy to miss. If this is a problem which needs to be solved, which > I'm not convinced it is at this point, the right thing to do would be > doing stall detection and kicking the next work item automatically. To be honest, I do not really care whether this gets "fixed" in the stall detection code or by making WQ_MEM_RECLAIM flag a special behavior implicitly. All I would like to see is a guarantee that such workqueues are not staying behind just because all current workers are in the allocator. Adding artificial schedule_timeouts in the allocator is a fragile way to work around the issue. -- Michal Hocko SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, On Sun, Oct 25, 2015 at 07:52:59PM +0900, Tetsuo Handa wrote: ... > This means that any kernel code which invokes a __GFP_WAIT allocation > might fail to do (4) when invoked via workqueue, regardless of flags > passed to alloc_workqueue()? Sounds that way and yeah (3) should technically be okay and that's why HIGHPRI was implemented the way it was at the beginning; however, in practice, this is the first time it's noticeable in all the years. I think it comes down to the fact that there just aren't many places which need such looping behavior and even in those places it's often very undesirable to busy-loop while not making forward-progress (and if forward-progress is being made, it won't be indefinite). > I think that inserting a short sleep into page allocator is better > because the vmstat_update fix will not require workqueue tweaks if > we sleep inside page allocator. Also, from the point of view of > protecting page allocator from going unresponsive when hundreds of tasks > started busy-waiting at __alloc_pages_slowpath() because we can observe > that XXX value in the "MemAlloc-Info: XXX stalling task," line grows > when we are unable to make forward progress. This looks good to me too; however, it still needs a dedicated workqueue with WQ_MEM_RECLAIM set. That deadlock probably is very unlikely as the side effect of vmstat failing to execute due to worker exhaustion is more memory reclaim but it still is theoretically possible and it could just be that it happens at low enough frequency that it hasn't been reported yet. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Tejun Heo wrote: > If this is an actual problem, a better approach would be something > which detects the stall condition and kicks off the next work item but > if we do that I think I'd still trigger a warning there. I don't > know. Don't go busy waiting in kernel. Busy waiting in the kernel refers to several cases. (1) Wait for something with interrupts disabled. (2) Wait for something with interrupts enabled but without calling cond_resched() etc. (3) Wait for something with interrupts enabled and with calling cond_resched() etc. (4) Wait for something with interrupts enabled and with calling schedule_timeout() etc. Kernel code tries to minimize (1). Kernel code does (2) if it is not allowed to sleep. But kernel code is allowed to do (3) if it is allowed to sleep, as long as cond_resched() is sometimes called. And currently the page allocator does (3). But kernel code invoked via workqueue is expected to do (4) rather than (3). This means that any kernel code which invokes a __GFP_WAIT allocation might fail to do (4) when invoked via workqueue, regardless of flags passed to alloc_workqueue()? Michal Hocko wrote: > On Fri 23-10-15 06:42:43, Tetsuo Handa wrote: > > Tejun Heo wrote: > > > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote: > > > > I am confused. What makes rescuer to not run? Nothing seems to be > > > > hogging CPUs, we are just out of workers which are looping in the > > > > allocator but that is preemptible context. > > > > > > It's concurrency management. Workqueue thinks that the pool is making > > > positive forward progress and doesn't schedule anything else for > > > execution while that work item is burning cpu cycles. > > > > Then, isn't below change easier to backport which will also alleviate > > needlessly burning CPU cycles? > > This is quite obscure. If the vmstat_update fix needs workqueue tweaks > as well then I would vote for your original patch which is clear, > straightforward and easy to backport.
I think that inserting a short sleep into the page allocator is better because the vmstat_update fix will not require workqueue tweaks if we sleep inside the page allocator. It also helps from the point of view of protecting the page allocator from going unresponsive when hundreds of tasks start busy-waiting at __alloc_pages_slowpath(), because we can observe that the XXX value in the "MemAlloc-Info: XXX stalling task," line grows when we are unable to make forward progress.

>From a2f34850c26b5bb124d44983f5a2020b51249d53 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa
Date: Sun, 25 Oct 2015 19:42:15 +0900
Subject: [PATCH] mm,page_alloc: Insert an uninterruptible sleep before retrying.

Since "struct zone"->vm_stat[] is an array of atomic_long_t, an effort to reduce the frequency of updating values in vm_stat[] is made by using the per-CPU variables "struct per_cpu_pageset"->vm_stat_diff[]. Values in vm_stat_diff[] are merged into vm_stat[] periodically using the vmstat_update workqueue item (struct delayed_work vmstat_work).

When a task attempts to allocate memory and reaches the direct reclaim path, shrink_zones() checks whether there are reclaimable pages by calling zone_reclaimable(). zone_reclaimable() makes its decision based on values in vm_stat[] by calling zone_page_state(). This is usually fine because values in vm_stat_diff[] are expected to be merged into vm_stat[] shortly. But the workqueue and the page allocator have different assumptions.

(A) The workqueue defers processing of other items unless the currently in-flight item enters the !TASK_RUNNING state.

(B) The page allocator never enters the !TASK_RUNNING state if there is nothing to reclaim. (The page allocator calls cond_resched() via wait_iff_congested(), but cond_resched() does not make the task enter the !TASK_RUNNING state.)

Therefore, if a workqueue item which is processed before the vmstat_update item got stuck inside a memory allocation request, values in vm_stat_diff[] cannot be merged into vm_stat[].
As a result, zone_reclaimable() continues using outdated vm_stat[] values and the task doing the direct reclaim path thinks that there are still reclaimable pages and therefore continues looping. The consequence is a silent livelock (hang up without any kernel messages) because the OOM killer will not be invoked. We can hit such a livelock by e.g. a disk_events_workfn workqueue item doing memory allocation from bio_copy_kern().

[ 255.054205] kworker/3:1 R running task045 2 0x0008
[ 255.056063] Workqueue: events_freezable_power_ disk_events_workfn
[ 255.057715] 88007f805680 88007c55f6d0 8116463d 88007c55f758
[ 255.059705] 88007f82b870 88007c55f6e0 811646be 88007c55f710
[ 255.061694] 811bdaf0 88007f82b870 0400
[ 255.063690] Call Trace:
[ 255.064664] [] ?
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, Tetsuo. On Fri, Oct 23, 2015 at 09:25:11PM +0900, Tetsuo Handa wrote: > WQ_MEM_RECLAIM only guarantees that a "struct task_struct" is preallocated > in order to avoid failing to allocate it on demand due to a GFP_KERNEL > allocation? Is this correct? > > WQ_CPU_INTENSIVE only guarantees that work items don't participate in > concurrency management in order to avoid failing to wake up a "struct > task_struct" which will process the work items? Is this correct? CPU_INTENSIVE avoids the tail end of concurrency management. The previous HIGHPRI or the posted IMMEDIATE avoids the head end. > Is Michal's question "does it make sense to use WQ_MEM_RECLAIM without > WQ_CPU_INTENSIVE"? In other words, any "struct task_struct" which calls > rescuer_thread() must imply WQ_CPU_INTENSIVE in order to avoid failing to > wake up due to being participated in concurrency management? If this is an actual problem, a better approach would be something which detects the stall condition and kicks off the next work item but if we do that I think I'd still trigger a warning there. I don't know. Don't go busy waiting in kernel. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, On Fri, Oct 23, 2015 at 01:11:45PM +0200, Michal Hocko wrote: > > The problem here is not lack > > of execution resource but concurrency management misunderstanding the > > situation. > > And this sounds like a bug to me. I don't know. It can be argued either way, the other direction being that a kernel thread going RUNNING non-stop is buggy. Given how this has been a complete non-issue for all the years, I'm not sure how useful plugging this is. > Don't we have some IO related paths which would suffer from the same > problem. I haven't checked all the WQ_MEM_RECLAIM users but from the > name I would expect they _do_ participate in the reclaim and so they > should be able to make a progress. Now if your new IMMEDIATE flag will Seriously, nobody goes full-on RUNNING. > guarantee that then I would argue that it should be implicit for > WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would > be a counter argument for doing that? Not serving any actual purpose and degrading execution behavior. Thanks. -- tejun
Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On Fri, 23 Oct 2015, Sergey Senozhatsky wrote: > by the way, tick_nohz_stop_sched_tick() receives cpu from > __tick_nohz_idle_enter(). > do you want to pass it to quiet_vmstat()? No this is quite wrong at this point. quiet_vmstat() needs to be called from the cpu going into idle state.
Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On (10/23/15 09:12), Christoph Lameter wrote: [..] > > > + if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off)) > > > + cancel_delayed_work(this_cpu_ptr(&vmstat_work)); > > > > shouldn't preemption be disabled for smp_processor_id() here? > > Preemption is disabled when quiet_vmstat() is called. > cond_resched()

[ 29.607725] BUG: sleeping function called from invalid context at mm/vmstat.c:487
[ 29.607729] in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/7
[ 29.607731] no locks held by swapper/7/0.
[ 29.607732] irq event stamp: 48932
[ 29.607733] hardirqs last enabled at (48931): [] _raw_spin_unlock_irq+0x2c/0x37
[ 29.607739] hardirqs last disabled at (48932): [] tick_nohz_idle_enter+0x3c/0x5f
[ 29.607743] softirqs last enabled at (48924): [] __do_softirq+0x2bb/0x3a9
[ 29.607747] softirqs last disabled at (48893): [] irq_exit+0x41/0x95
[ 29.607752] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.3.0-rc6-next-20151022-dbg-3-g01184ff-dirty #261
[ 29.607754] 88041dae7da0 811dd4f3 88041dacd100
[ 29.607756] 88041dae7dc8 8105f144 8169f800
[ 29.607759] 0007 88041dae7e70 811040b1 0002
[ 29.607761] Call Trace:
[ 29.607767] [] dump_stack+0x4b/0x63
[ 29.607770] [] ___might_sleep+0x1e7/0x1ee
[ 29.607773] [] refresh_cpu_vm_stats+0x8b/0xb5
[ 29.607776] [] quiet_vmstat+0x3a/0x41
[ 29.607778] [] __tick_nohz_idle_enter+0x292/0x410
[ 29.607781] [] tick_nohz_idle_enter+0x57/0x5f
[ 29.607784] [] cpu_startup_entry+0x36/0x330
[ 29.607788] [] start_secondary+0xf3/0xf6

by the way, tick_nohz_stop_sched_tick() receives cpu from __tick_nohz_idle_enter(). do you want to pass it to quiet_vmstat()?

 	if (!ts->tick_stopped) {
 		nohz_balance_enter_idle(cpu);
-		quiet_vmstat();
+		quiet_vmstat(cpu);
 		calc_load_enter_idle();

-ss
Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On Fri, 23 Oct 2015, Sergey Senozhatsky wrote: > On (10/23/15 06:43), Christoph Lameter wrote: > > Is this ok? > > kernel/sched/loadavg.c: In function ‘calc_load_enter_idle’: > kernel/sched/loadavg.c:195:2: error: implicit declaration of function > ‘quiet_vmstat’ [-Werror=implicit-function-declaration] > quiet_vmstat(); > ^ Oww... Not good to do that in the scheduler. Ok, new patch follows that does the call from tick_nohz_stop_sched_tick(). Hopefully that is the right location to call quiet_vmstat(). > > + if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off)) > > + cancel_delayed_work(this_cpu_ptr(&vmstat_work)); > > shouldn't preemption be disabled for smp_processor_id() here? Preemption is disabled when quiet_vmstat() is called.

Subject: Fix vmstat: make vmstat_updater deferrable again and shut down on idle V2

V1->V2 - Call vmstat_quiet from tick_nohz_stop_sched_tick() instead.

Currently the vmstat updater is not deferrable as a result of commit ba4877b9ca51f80b5d30f304a46762f0509e1635. This in turn can cause multiple interruptions of the applications because the vmstat updater may run at different times than tick processing. No good.

Make vmstat_update deferrable again and provide a function that shuts down the vmstat updater when we go idle by folding the differentials. Shut it down from the load average calculation logic introduced by nohz.

Note that the shepherd thread will continue scanning the differentials from another processor and will reenable the vmstat workers if it detects any changes.

Fixes: ba4877b9ca51f80b5d30f304a46762f0509e1635 (do not use deferrable delay)
Signed-off-by: Christoph Lameter

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1395,6 +1395,20 @@ static void vmstat_update(struct work_st
 }
 
 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
@@ -1426,7 +1440,7 @@ static bool need_update(int cpu)
  */
 static void vmstat_shepherd(struct work_struct *w);
 
-static DECLARE_DELAYED_WORK(shepherd, vmstat_shepherd);
+static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);
 
 static void vmstat_shepherd(struct work_struct *w)
 {
Index: linux/include/linux/vmstat.h
===================================================================
--- linux.orig/include/linux/vmstat.h
+++ linux/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }
 static inline void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) { }
 
Index: linux/kernel/time/tick-sched.c
===================================================================
--- linux.orig/kernel/time/tick-sched.c
+++ linux/kernel/time/tick-sched.c
@@ -667,6 +667,7 @@ static ktime_t tick_nohz_stop_sched_tick
 	 */
 	if (!ts->tick_stopped) {
 		nohz_balance_enter_idle(cpu);
+		quiet_vmstat();
 		calc_load_enter_idle();
 
 		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Michal Hocko wrote: > On Fri 23-10-15 19:36:30, Tejun Heo wrote: > > Hello, Michal. > > > > On Fri, Oct 23, 2015 at 10:33:16AM +0200, Michal Hocko wrote: > > > Ohh, OK I can see wq_worker_sleeping now. I've missed your point in > > > other email, sorry about that. But now I am wondering whether this > > > is an intended behavior. The documentation says: > > > > This is. > > > > > WQ_MEM_RECLAIM > > > > > > All wq which might be used in the memory reclaim paths _MUST_ > > > have this flag set. The wq is guaranteed to have at least one > > > execution context regardless of memory pressure. > > > > > > Which doesn't seem to be true currently, right? Now I can see your patch > > > > It is true. > > > > > to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users > > > could do without WQ_IMMEDIATE? I mean all current workers might be > > > looping in the page allocator and it seems possible that WQ_MEM_RECLAIM > > > work items might be waiting behind them so they cannot help to relieve > > > the memory pressure. This doesn't sound right to me. Or I am completely > > > confused and still fail to understand what is WQ_MEM_RECLAIM supposed to > > > be used for. > > > > It guarantees that there always is enough execution resource to > > execute a work item from that workqueue. > > OK, strictly speaking the rescuer is there but it is kind of pointless > if it doesn't fire up and do a work. > > > The problem here is not lack > > of execution resource but concurrency management misunderstanding the > > situation. > > And this sounds like a bug to me. > > > This also can be fixed by teaching concurrency management > > to be a bit smarter - e.g. 
if a work item is burning a lot of CPU > > cycles continuously or pool hasn't finished a work item over a certain > > amount of time, automatically ignore the in-flight work item for the > > purpose of concurrency management; however, this sort of inter-work > > item busy waits are so extremely rare and undesirable that I'm not > > sure the added complexity would be worthwhile. > > Don't we have some IO related paths which would suffer from the same > problem. I haven't checked all the WQ_MEM_RECLAIM users but from the > name I would expect they _do_ participate in the reclaim and so they > should be able to make a progress. Now if your new IMMEDIATE flag will > guarantee that then I would argue that it should be implicit for > WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would > be a counter argument for doing that? WQ_MEM_RECLAIM only guarantees that a "struct task_struct" is preallocated in order to avoid failing to allocate it on demand due to a GFP_KERNEL allocation? Is this correct? WQ_CPU_INTENSIVE only guarantees that work items don't participate in concurrency management in order to avoid failing to wake up a "struct task_struct" which will process the work items? Is this correct? Is Michal's question "does it make sense to use WQ_MEM_RECLAIM without WQ_CPU_INTENSIVE"? In other words, any "struct task_struct" which calls rescuer_thread() must imply WQ_CPU_INTENSIVE in order to avoid failing to wake up due to participating in concurrency management?
Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On (10/23/15 06:43), Christoph Lameter wrote: > Is this ok? kernel/sched/loadavg.c: In function ‘calc_load_enter_idle’: kernel/sched/loadavg.c:195:2: error: implicit declaration of function ‘quiet_vmstat’ [-Werror=implicit-function-declaration] quiet_vmstat(); ^ > Subject: Fix vmstat: make vmstat_updater deferrable again and shut down on > idle > > Currently the vmstat updater is not deferrable as a result of commit > ba4877b9ca51f80b5d30f304a46762f0509e1635. This in turn can cause multiple > interruptions of the applications because the vmstat updater may run at > different times than tick processing. No good. > > Make vmstate_update deferrable again and provide a function that > shuts down the vmstat updater when we go idle by folding the differentials. > Shut it down from the load average calculation logic introduced by nohz. > > Note that the shepherd thread will continue scanning the differentials > from another processor and will reenable the vmstat workers if it > detects any changes. > > Fixes: ba4877b9ca51f80b5d30f304a46762f0509e1635 (do not use deferrable delay) > Signed-off-by: Christoph Lameter > > Index: linux/mm/vmstat.c > === > --- linux.orig/mm/vmstat.c > +++ linux/mm/vmstat.c > @@ -1395,6 +1395,20 @@ static void vmstat_update(struct work_st > } > > /* > + * Switch off vmstat processing and then fold all the remaining differentials > + * until the diffs stay at zero. The function is used by NOHZ and can only be > + * invoked when tick processing is not active. > + */ > +void quiet_vmstat(void) > +{ > + do { > + if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off)) > + cancel_delayed_work(this_cpu_ptr(_work)); shouldn't preemption be disable for smp_processor_id() here? -ss -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On Fri, 23 Oct 2015, Michal Hocko wrote:

> On Thu 22-10-15 10:33:20, Christoph Lameter wrote:
> > Ok that also makes me rethink commit
> > ba4877b9ca51f80b5d30f304a46762f0509e1635 which seems to be a similar fix
> > this time related to idle mode not updating the counters.
> >
> > Could we fix that by folding the counters before going to idle mode?
>
> This would work as well.

Is this ok?

Subject: Fix vmstat: make vmstat_updater deferrable again and shut down on idle

Currently the vmstat updater is not deferrable as a result of commit
ba4877b9ca51f80b5d30f304a46762f0509e1635. This in turn can cause multiple
interruptions of the applications because the vmstat updater may run at
different times than tick processing. No good.

Make vmstat_update deferrable again and provide a function that
shuts down the vmstat updater when we go idle by folding the differentials.
Shut it down from the load average calculation logic introduced by nohz.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Fixes: ba4877b9ca51f80b5d30f304a46762f0509e1635 (do not use deferrable delay)
Signed-off-by: Christoph Lameter

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1395,6 +1395,20 @@ static void vmstat_update(struct work_st
 }

 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
@@ -1426,7 +1440,7 @@ static bool need_update(int cpu)
  */
 static void vmstat_shepherd(struct work_struct *w);

-static DECLARE_DELAYED_WORK(shepherd, vmstat_shepherd);
+static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);

 static void vmstat_shepherd(struct work_struct *w)
 {
Index: linux/include/linux/vmstat.h
===================================================================
--- linux.orig/include/linux/vmstat.h
+++ linux/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);

+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);

@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }

 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
Index: linux/kernel/sched/loadavg.c
===================================================================
--- linux.orig/kernel/sched/loadavg.c
+++ linux/kernel/sched/loadavg.c
@@ -191,6 +191,8 @@ void calc_load_enter_idle(void)
 		atomic_long_add(delta, &calc_load_idle[idx]);
 	}
+
+	/* Fold the current vmstat counters and disable vmstat updater */
+	quiet_vmstat();
 }

 void calc_load_exit_idle(void)
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 19:36:30, Tejun Heo wrote: > Hello, Michal. > > On Fri, Oct 23, 2015 at 10:33:16AM +0200, Michal Hocko wrote: > > Ohh, OK I can see wq_worker_sleeping now. I've missed your point in > > other email, sorry about that. But now I am wondering whether this > > is an intended behavior. The documentation says: > > This is. > > > WQ_MEM_RECLAIM > > > > All wq which might be used in the memory reclaim paths _MUST_ > > have this flag set. The wq is guaranteed to have at least one > > execution context regardless of memory pressure. > > > > Which doesn't seem to be true currently, right? Now I can see your patch > > It is true. > > > to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users > > could do without WQ_IMMEDIATE? I mean all current workers might be > > looping in the page allocator and it seems possible that WQ_MEM_RECLAIM > > work items might be waiting behind them so they cannot help to relieve > > the memory pressure. This doesn't sound right to me. Or I am completely > > confused and still fail to understand what is WQ_MEM_RECLAIM supposed to > > be used for. > > It guarantees that there always is enough execution resource to > execute a work item from that workqueue. OK, strictly speaking the rescuer is there but it is kind of pointless if it doesn't fire up and do a work. > The problem here is not lack > of execution resource but concurrency management misunderstanding the > situation. And this sounds like a bug to me. > This also can be fixed by teaching concurrency management > to be a bit smarter - e.g. if a work item is burning a lot of CPU > cycles continuously or pool hasn't finished a work item over a certain > amount of time, automatically ignore the in-flight work item for the > purpose of concurrency management; however, this sort of inter-work > item busy waits are so extremely rare and undesirable that I'm not > sure the added complexity would be worthwhile. 
Don't we have some IO related paths which would suffer from the same problem. I haven't checked all the WQ_MEM_RECLAIM users but from the name I would expect they _do_ participate in the reclaim and so they should be able to make a progress. Now if your new IMMEDIATE flag will guarantee that then I would argue that it should be implicit for WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would be a counter argument for doing that? -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri, Oct 23, 2015 at 10:36:12AM +0200, Michal Hocko wrote:
> If WQ_MEM_RECLAIM can really guarantee one worker as described in the
> documentation then I agree that fixing vmstat is a better fix. But that
> doesn't seem to be the case currently.

It does.

--
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, Michal.

On Fri, Oct 23, 2015 at 10:33:16AM +0200, Michal Hocko wrote:
> Ohh, OK I can see wq_worker_sleeping now. I've missed your point in
> other email, sorry about that. But now I am wondering whether this
> is an intended behavior. The documentation says:

This is.

> WQ_MEM_RECLAIM
>
>	All wq which might be used in the memory reclaim paths _MUST_
>	have this flag set. The wq is guaranteed to have at least one
>	execution context regardless of memory pressure.
>
> Which doesn't seem to be true currently, right? Now I can see your patch

It is true.

> to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users
> could do without WQ_IMMEDIATE? I mean all current workers might be
> looping in the page allocator and it seems possible that WQ_MEM_RECLAIM
> work items might be waiting behind them so they cannot help to relieve
> the memory pressure. This doesn't sound right to me. Or I am completely
> confused and still fail to understand what is WQ_MEM_RECLAIM supposed to
> be used for.

It guarantees that there always is enough execution resource to
execute a work item from that workqueue. The problem here is not lack
of execution resource but concurrency management misunderstanding the
situation. This also can be fixed by teaching concurrency management
to be a bit smarter - e.g. if a work item is burning a lot of CPU
cycles continuously or pool hasn't finished a work item over a certain
amount of time, automatically ignore the in-flight work item for the
purpose of concurrency management; however, this sort of inter-work
item busy waits are so extremely rare and undesirable that I'm not
sure the added complexity would be worthwhile.

Thanks.

--
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu 22-10-15 10:33:20, Christoph Lameter wrote:
> Ok that also makes me rethink commit
> ba4877b9ca51f80b5d30f304a46762f0509e1635 which seems to be a similar fix
> this time related to idle mode not updating the counters.
>
> Could we fix that by folding the counters before going to idle mode?

This would work as well.

--
Michal Hocko
SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 06:42:43, Tetsuo Handa wrote:
> Tejun Heo wrote:
> > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote:
> > > I am confused. What makes rescuer to not run? Nothing seems to be
> > > hogging CPUs, we are just out of workers which are looping in the
> > > allocator but that is preemptible context.
> >
> > It's concurrency management. Workqueue thinks that the pool is making
> > positive forward progress and doesn't schedule anything else for
> > execution while that work item is burning cpu cycles.
>
> Then, isn't the below change easier to backport, which will also alleviate
> needlessly burning CPU cycles?

This is quite obscure. If the vmstat_update fix needs workqueue tweaks as
well then I would vote for your original patch which is clear,
straightforward and easy to backport.

If WQ_MEM_RECLAIM can really guarantee one worker as described in the
documentation then I agree that fixing vmstat is a better fix. But that
doesn't seem to be the case currently.

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3385,6 +3385,7 @@ retry:
> 	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
> 		/* Wait for some write requests to complete then retry */
> 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
> +		schedule_timeout_uninterruptible(1);
> 		goto retry;
> 	}

--
Michal Hocko
SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 03:42:26, Tejun Heo wrote: > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote: > > I am confused. What makes rescuer to not run? Nothing seems to be > > hogging CPUs, we are just out of workers which are loopin in the > > allocator but that is preemptible context. > > It's concurrency management. Workqueue thinks that the pool is making > positive forward progress and doesn't schedule anything else for > execution while that work item is burning cpu cycles. Ohh, OK I can see wq_worker_sleeping now. I've missed your point in other email, sorry about that. But now I am wondering whether this is an intended behavior. The documentation says: WQ_MEM_RECLAIM All wq which might be used in the memory reclaim paths _MUST_ have this flag set. The wq is guaranteed to have at least one execution context regardless of memory pressure. Which doesn't seem to be true currently, right? Now I can see your patch to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users could do without WQ_IMMEDIATE? I mean all current workers might be looping in the page allocator and it seems possible that WQ_MEM_RECLAIM work items might be waiting behind them so they cannot help to relieve the memory pressure. This doesn't sound right to me. Or I am completely confused and still fail to understand what is WQ_MEM_RECLAIM supposed to be used for. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello,

On Fri, Oct 23, 2015 at 01:11:45PM +0200, Michal Hocko wrote:
> > The problem here is not lack
> > of execution resource but concurrency management misunderstanding the
> > situation.
>
> And this sounds like a bug to me.

I don't know. It can be argued either way, the other direction being that
a kernel thread going RUNNING non-stop is buggy. Given how this has been a
complete non-issue for all the years, I'm not sure how useful plugging
this is.

> Don't we have some IO related paths which would suffer from the same
> problem. I haven't checked all the WQ_MEM_RECLAIM users but from the
> name I would expect they _do_ participate in the reclaim and so they
> should be able to make a progress. Now if your new IMMEDIATE flag will

Seriously, nobody goes full-on RUNNING.

> guarantee that then I would argue that it should be implicit for
> WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would
> be a counter argument for doing that?

Not serving any actual purpose and degrading execution behavior.

Thanks.

--
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, Tetsuo.

On Fri, Oct 23, 2015 at 09:25:11PM +0900, Tetsuo Handa wrote:
> WQ_MEM_RECLAIM only guarantees that a "struct task_struct" is preallocated
> in order to avoid failing to allocate it on demand due to a GFP_KERNEL
> allocation? Is this correct?
>
> WQ_CPU_INTENSIVE only guarantees that work items don't participate in
> concurrency management in order to avoid failing to wake up a "struct
> task_struct" which will process the work items? Is this correct?

CPU_INTENSIVE avoids the tail end of concurrency management. The
previous HIGHPRI or the posted IMMEDIATE avoids the head end.

> Is Michal's question "does it make sense to use WQ_MEM_RECLAIM without
> WQ_CPU_INTENSIVE"? In other words, any "struct task_struct" which calls
> rescuer_thread() must imply WQ_CPU_INTENSIVE in order to avoid failing to
> wake up due to being participated in concurrency management?

If this is an actual problem, a better approach would be something which
detects the stall condition and kicks off the next work item but if we do
that I think I'd still trigger a warning there. I don't know. Don't go
busy waiting in kernel.

Thanks.

--
tejun
Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On (10/23/15 09:12), Christoph Lameter wrote:
[..]
> > > +	if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
> > > +		cancel_delayed_work(this_cpu_ptr(&vmstat_work));
> >
> > shouldn't preemption be disable for smp_processor_id() here?
>
> Preemption is disabled when quiet_vmstat() is called.

cond_resched()

[   29.607725] BUG: sleeping function called from invalid context at mm/vmstat.c:487
[   29.607729] in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/7
[   29.607731] no locks held by swapper/7/0.
[   29.607732] irq event stamp: 48932
[   29.607733] hardirqs last enabled at (48931): [] _raw_spin_unlock_irq+0x2c/0x37
[   29.607739] hardirqs last disabled at (48932): [] tick_nohz_idle_enter+0x3c/0x5f
[   29.607743] softirqs last enabled at (48924): [] __do_softirq+0x2bb/0x3a9
[   29.607747] softirqs last disabled at (48893): [] irq_exit+0x41/0x95
[   29.607752] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.3.0-rc6-next-20151022-dbg-3-g01184ff-dirty #261
[   29.607754]  88041dae7da0 811dd4f3 88041dacd100
[   29.607756]  88041dae7dc8 8105f144 8169f800
[   29.607759]  0007 88041dae7e70 811040b1 0002
[   29.607761] Call Trace:
[   29.607767]  [] dump_stack+0x4b/0x63
[   29.607770]  [] ___might_sleep+0x1e7/0x1ee
[   29.607773]  [] refresh_cpu_vm_stats+0x8b/0xb5
[   29.607776]  [] quiet_vmstat+0x3a/0x41
[   29.607778]  [] __tick_nohz_idle_enter+0x292/0x410
[   29.607781]  [] tick_nohz_idle_enter+0x57/0x5f
[   29.607784]  [] cpu_startup_entry+0x36/0x330
[   29.607788]  [] start_secondary+0xf3/0xf6

by the way, tick_nohz_stop_sched_tick() receives cpu from
__tick_nohz_idle_enter(). do you want to pass it to quiet_vmstat()?

	if (!ts->tick_stopped) {
		nohz_balance_enter_idle(cpu);
-		quiet_vmstat();
+		quiet_vmstat(cpu);
		calc_load_enter_idle();

-ss
Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On Fri, 23 Oct 2015, Sergey Senozhatsky wrote:
> by the way, tick_nohz_stop_sched_tick() receives cpu from
> __tick_nohz_idle_enter().
> do you want to pass it to quiet_vmstat()?

No, this is quite wrong at this point. quiet_vmstat() needs to be called
from the cpu going into idle state.
Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On Fri, 23 Oct 2015, Sergey Senozhatsky wrote:

> On (10/23/15 06:43), Christoph Lameter wrote:
> > Is this ok?
>
> kernel/sched/loadavg.c: In function ‘calc_load_enter_idle’:
> kernel/sched/loadavg.c:195:2: error: implicit declaration of function
> ‘quiet_vmstat’ [-Werror=implicit-function-declaration]
>   quiet_vmstat();
>   ^

Oww... Not good to do that in the scheduler. Ok, new patch follows that
does the call from tick_nohz_stop_sched_tick(). Hopefully that is the
right location to call quiet_vmstat().

> > +	if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
> > +		cancel_delayed_work(this_cpu_ptr(&vmstat_work));
>
> shouldn't preemption be disable for smp_processor_id() here?

Preemption is disabled when quiet_vmstat() is called.

Subject: Fix vmstat: make vmstat_updater deferrable again and shut down on idle V2

V1->V2
- Call quiet_vmstat() from tick_nohz_stop_sched_tick() instead.

Currently the vmstat updater is not deferrable as a result of commit
ba4877b9ca51f80b5d30f304a46762f0509e1635. This in turn can cause multiple
interruptions of the applications because the vmstat updater may run at
different times than tick processing. No good.

Make vmstat_update deferrable again and provide a function that
shuts down the vmstat updater when we go idle by folding the differentials.
Shut it down from the load average calculation logic introduced by nohz.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Fixes: ba4877b9ca51f80b5d30f304a46762f0509e1635 (do not use deferrable delay)
Signed-off-by: Christoph Lameter

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1395,6 +1395,20 @@ static void vmstat_update(struct work_st
 }

 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
@@ -1426,7 +1440,7 @@ static bool need_update(int cpu)
  */
 static void vmstat_shepherd(struct work_struct *w);

-static DECLARE_DELAYED_WORK(shepherd, vmstat_shepherd);
+static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);

 static void vmstat_shepherd(struct work_struct *w)
 {
Index: linux/include/linux/vmstat.h
===================================================================
--- linux.orig/include/linux/vmstat.h
+++ linux/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);

+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);

@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }

 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
Index: linux/kernel/time/tick-sched.c
===================================================================
--- linux.orig/kernel/time/tick-sched.c
+++ linux/kernel/time/tick-sched.c
@@ -667,6 +667,7 @@ static ktime_t tick_nohz_stop_sched_tick
 	 */
 	if (!ts->tick_stopped) {
 		nohz_balance_enter_idle(cpu);
+		quiet_vmstat();
 		calc_load_enter_idle();

 		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable()checks
On Fri, Oct 23, 2015 at 10:36:12AM +0200, Michal Hocko wrote: > If WQ_MEM_RECLAIM can really guarantee one worker as described in the > documentation then I agree that fixing vmstat is a better fix. But that > doesn't seem to be the case currently. It does. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 19:36:30, Tejun Heo wrote: > Hello, Michal. > > On Fri, Oct 23, 2015 at 10:33:16AM +0200, Michal Hocko wrote: > > Ohh, OK I can see wq_worker_sleeping now. I've missed your point in > > other email, sorry about that. But now I am wondering whether this > > is an intended behavior. The documentation says: > > This is. > > > WQ_MEM_RECLAIM > > > > All wq which might be used in the memory reclaim paths _MUST_ > > have this flag set. The wq is guaranteed to have at least one > > execution context regardless of memory pressure. > > > > Which doesn't seem to be true currently, right? Now I can see your patch > > It is true. > > > to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users > > could do without WQ_IMMEDIATE? I mean all current workers might be > > looping in the page allocator and it seems possible that WQ_MEM_RECLAIM > > work items might be waiting behind them so they cannot help to relieve > > the memory pressure. This doesn't sound right to me. Or I am completely > > confused and still fail to understand what is WQ_MEM_RECLAIM supposed to > > be used for. > > It guarantees that there always is enough execution resource to > execute a work item from that workqueue. OK, strictly speaking the rescuer is there but it is kind of pointless if it doesn't fire up and do a work. > The problem here is not lack > of execution resource but concurrency management misunderstanding the > situation. And this sounds like a bug to me. > This also can be fixed by teaching concurrency management > to be a bit smarter - e.g. if a work item is burning a lot of CPU > cycles continuously or pool hasn't finished a work item over a certain > amount of time, automatically ignore the in-flight work item for the > purpose of concurrency management; however, this sort of inter-work > item busy waits are so extremely rare and undesirable that I'm not > sure the added complexity would be worthwhile. 
Don't we have some IO-related paths which would suffer from the same problem? I haven't checked all the WQ_MEM_RECLAIM users but from the name I would expect they _do_ participate in the reclaim and so they should be able to make progress. Now if your new IMMEDIATE flag will guarantee that then I would argue that it should be implicit for WQ_MEM_RECLAIM, otherwise we always risk a similar situation. What would be a counter-argument for doing that? -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, Michal. On Fri, Oct 23, 2015 at 10:33:16AM +0200, Michal Hocko wrote: > Ohh, OK I can see wq_worker_sleeping now. I've missed your point in > other email, sorry about that. But now I am wondering whether this > is an intended behavior. The documentation says: This is. > WQ_MEM_RECLAIM > > All wq which might be used in the memory reclaim paths _MUST_ > have this flag set. The wq is guaranteed to have at least one > execution context regardless of memory pressure. > > Which doesn't seem to be true currently, right? Now I can see your patch It is true. > to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users > could do without WQ_IMMEDIATE? I mean all current workers might be > looping in the page allocator and it seems possible that WQ_MEM_RECLAIM > work items might be waiting behind them so they cannot help to relieve > the memory pressure. This doesn't sound right to me. Or I am completely > confused and still fail to understand what is WQ_MEM_RECLAIM supposed to > be used for. It guarantees that there always is enough execution resource to execute a work item from that workqueue. The problem here is not lack of execution resource but concurrency management misunderstanding the situation. This also can be fixed by teaching concurrency management to be a bit smarter - e.g. if a work item is burning a lot of CPU cycles continuously or pool hasn't finished a work item over a certain amount of time, automatically ignore the in-flight work item for the purpose of concurrency management; however, this sort of inter-work item busy waits are so extremely rare and undesirable that I'm not sure the added complexity would be worthwhile. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Michal Hocko wrote: > On Fri 23-10-15 19:36:30, Tejun Heo wrote: > > Hello, Michal. > > > > On Fri, Oct 23, 2015 at 10:33:16AM +0200, Michal Hocko wrote: > > > Ohh, OK I can see wq_worker_sleeping now. I've missed your point in > > > other email, sorry about that. But now I am wondering whether this > > > is an intended behavior. The documentation says: > > > > This is. > > > > > WQ_MEM_RECLAIM > > > > > > All wq which might be used in the memory reclaim paths _MUST_ > > > have this flag set. The wq is guaranteed to have at least one > > > execution context regardless of memory pressure. > > > > > > Which doesn't seem to be true currently, right? Now I can see your patch > > > > It is true. > > > > > to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users > > > could do without WQ_IMMEDIATE? I mean all current workers might be > > > looping in the page allocator and it seems possible that WQ_MEM_RECLAIM > > > work items might be waiting behind them so they cannot help to relieve > > > the memory pressure. This doesn't sound right to me. Or I am completely > > > confused and still fail to understand what is WQ_MEM_RECLAIM supposed to > > > be used for. > > > > It guarantees that there always is enough execution resource to > > execute a work item from that workqueue. > > OK, strictly speaking the rescuer is there but it is kind of pointless > if it doesn't fire up and do a work. > > > The problem here is not lack > > of execution resource but concurrency management misunderstanding the > > situation. > > And this sounds like a bug to me. > > > This also can be fixed by teaching concurrency management > > to be a bit smarter - e.g. 
if a work item is burning a lot of CPU > > cycles continuously or pool hasn't finished a work item over a certain > > amount of time, automatically ignore the in-flight work item for the > > purpose of concurrency management; however, this sort of inter-work > > item busy waits are so extremely rare and undesirable that I'm not > > sure the added complexity would be worthwhile. > > Don't we have some IO-related paths which would suffer from the same > problem? I haven't checked all the WQ_MEM_RECLAIM users but from the > name I would expect they _do_ participate in the reclaim and so they > should be able to make progress. Now if your new IMMEDIATE flag will > guarantee that then I would argue that it should be implicit for > WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would > be a counter-argument for doing that? WQ_MEM_RECLAIM only guarantees that a "struct task_struct" is preallocated in order to avoid failing to allocate it on demand due to a GFP_KERNEL allocation? Is this correct? WQ_CPU_INTENSIVE only guarantees that work items don't participate in concurrency management in order to avoid failing to wake up a "struct task_struct" which will process the work items? Is this correct? Is Michal's question "does it make sense to use WQ_MEM_RECLAIM without WQ_CPU_INTENSIVE"? In other words, any "struct task_struct" which calls rescuer_thread() must imply WQ_CPU_INTENSIVE in order to avoid failing to wake up due to participating in concurrency management?
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 03:42:26, Tejun Heo wrote: > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote: > > I am confused. What makes the rescuer not run? Nothing seems to be > > hogging CPUs, we are just out of workers which are looping in the > > allocator but that is preemptible context. > > It's concurrency management. Workqueue thinks that the pool is making > positive forward progress and doesn't schedule anything else for > execution while that work item is burning cpu cycles. Ohh, OK I can see wq_worker_sleeping now. I've missed your point in other email, sorry about that. But now I am wondering whether this is an intended behavior. The documentation says: WQ_MEM_RECLAIM All wq which might be used in the memory reclaim paths _MUST_ have this flag set. The wq is guaranteed to have at least one execution context regardless of memory pressure. Which doesn't seem to be true currently, right? Now I can see your patch to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users could do without WQ_IMMEDIATE? I mean all current workers might be looping in the page allocator and it seems possible that WQ_MEM_RECLAIM work items might be waiting behind them so they cannot help to relieve the memory pressure. This doesn't sound right to me. Or I am completely confused and still fail to understand what is WQ_MEM_RECLAIM supposed to be used for. -- Michal Hocko SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 06:42:43, Tetsuo Handa wrote: > Tejun Heo wrote: > > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote: > > > I am confused. What makes the rescuer not run? Nothing seems to be > > > hogging CPUs, we are just out of workers which are looping in the > > > allocator but that is preemptible context. > > > > It's concurrency management. Workqueue thinks that the pool is making > > positive forward progress and doesn't schedule anything else for > > execution while that work item is burning cpu cycles. > > Then, isn't below change easier to backport which will also alleviate > needlessly burning CPU cycles? This is quite obscure. If the vmstat_update fix needs workqueue tweaks as well then I would vote for your original patch which is clear, straightforward and easy to backport. If WQ_MEM_RECLAIM can really guarantee one worker as described in the documentation then I agree that fixing vmstat is a better fix. But that doesn't seem to be the case currently. > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3385,6 +3385,7 @@ retry: > ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) { > /* Wait for some write requests to complete then retry */ > wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50); > + schedule_timeout_uninterruptible(1); > goto retry; > } -- Michal Hocko SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu 22-10-15 10:33:20, Christoph Lameter wrote: > Ok that also makes me rethink commit > ba4877b9ca51f80b5d30f304a46762f0509e1635 which seems to be a similar fix > this time related to idle mode not updating the counters. > > Could we fix that by folding the counters before going to idle mode? This would work as well. -- Michal Hocko SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, So, something like the following. Just compile tested but this is essentially partial revert of 3270476a6c0c ("workqueue: reimplement WQ_HIGHPRI using a separate worker_pool") - resurrecting the old WQ_HIGHPRI implementation under WQ_IMMEDIATE, so we know this works. If for some reason, it gets decided against simply adding one jiffy sleep, please let me know. I'll verify the operation and post a proper patch. That said, given that this prolly needs -stable backport and vmstat is likely to be the only user (busy loops are really rare in the kernel after all), I think the better approach would be reinstating the short sleep. Thanks. --- include/linux/workqueue.h | 7 ++--- kernel/workqueue.c | 63 +++--- 2 files changed, 63 insertions(+), 7 deletions(-) --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -278,9 +278,10 @@ enum { WQ_UNBOUND = 1 << 1, /* not bound to any cpu */ WQ_FREEZABLE = 1 << 2, /* freeze during suspend */ WQ_MEM_RECLAIM = 1 << 3, /* may be used for memory reclaim */ - WQ_HIGHPRI = 1 << 4, /* high priority */ - WQ_CPU_INTENSIVE = 1 << 5, /* cpu intensive workqueue */ - WQ_SYSFS = 1 << 6, /* visible in sysfs, see wq_sysfs_register() */ + WQ_IMMEDIATE = 1 << 4, /* bypass concurrency management */ + WQ_HIGHPRI = 1 << 5, /* high priority */ + WQ_CPU_INTENSIVE = 1 << 6, /* cpu intensive workqueue */ + WQ_SYSFS = 1 << 7, /* visible in sysfs, see wq_sysfs_register() */ /* * Per-cpu workqueues are generally preferred because they tend to --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -68,6 +68,7 @@ enum { * attach_mutex to avoid changing binding state while * worker_attach_to_pool() is in progress. 
*/ + POOL_IMMEDIATE_PENDING = 1 << 0, /* WQ_IMMEDIATE items on queue */ POOL_DISASSOCIATED = 1 << 2, /* cpu can't serve workers */ /* worker flags */ @@ -731,7 +732,8 @@ static bool work_is_canceling(struct wor static bool __need_more_worker(struct worker_pool *pool) { - return !atomic_read(&pool->nr_running); + return !atomic_read(&pool->nr_running) || + (pool->flags & POOL_IMMEDIATE_PENDING); } /* @@ -757,7 +759,8 @@ static bool may_start_working(struct wor static bool keep_working(struct worker_pool *pool) { return !list_empty(&pool->worklist) && - atomic_read(&pool->nr_running) <= 1; + (atomic_read(&pool->nr_running) <= 1 || + (pool->flags & POOL_IMMEDIATE_PENDING)); } /* Do we need a new worker? Called from manager. */ @@ -1021,6 +1024,42 @@ static void move_linked_works(struct wor } /** + * pwq_determine_ins_pos - find insertion position + * @pwq: pwq a work is being queued for + * + * A work for @pwq is about to be queued on @pwq->pool, determine insertion + * position for the work. If @pwq is for IMMEDIATE wq, the work item is + * queued at the head of the queue but in FIFO order with respect to other + * IMMEDIATE work items; otherwise, at the end of the queue. This function + * also sets POOL_IMMEDIATE_PENDING flag to hint @pool that there are + * IMMEDIATE works pending. + * + * CONTEXT: + * spin_lock_irq(gcwq->lock). + * + * RETURNS: + * Pointer to insertion position. 
+ */ +static struct list_head *pwq_determine_ins_pos(struct pool_workqueue *pwq) +{ + struct worker_pool *pool = pwq->pool; + struct work_struct *twork; + + if (likely(!(pwq->wq->flags & WQ_IMMEDIATE))) + return &pool->worklist; + + list_for_each_entry(twork, &pool->worklist, entry) { + struct pool_workqueue *tpwq = get_work_pwq(twork); + + if (!(tpwq->wq->flags & WQ_IMMEDIATE)) + break; + } + + pool->flags |= POOL_IMMEDIATE_PENDING; + return &twork->entry; +} + +/** * get_pwq - get an extra reference on the specified pool_workqueue * @pwq: pool_workqueue to get * @@ -1081,9 +1120,10 @@ static void put_pwq_unlocked(struct pool static void pwq_activate_delayed_work(struct work_struct *work) { struct pool_workqueue *pwq = get_work_pwq(work); + struct list_head *pos = pwq_determine_ins_pos(pwq); trace_workqueue_activate_work(work); - move_linked_works(work, &pwq->pool->worklist, NULL); + move_linked_works(work, pos, NULL); __clear_bit(WORK_STRUCT_DELAYED_BIT, work_data_bits(work)); pwq->nr_active++; } @@ -1384,7 +1424,7 @@ retry: if (likely(pwq->nr_active < pwq->max_active)) { trace_workqueue_activate_work(work); pwq->nr_active++; - worklist = &pwq->pool->worklist; + worklist = pwq_determine_ins_pos(pwq); } else { work_flags |= WORK_STRUCT_DELAYED; worklist =
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, On Fri, Oct 23, 2015 at 06:42:43AM +0900, Tetsuo Handa wrote: > Then, isn't below change easier to backport which will also alleviate > needlessly burning CPU cycles? > > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3385,6 +3385,7 @@ retry: > ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) { > /* Wait for some write requests to complete then retry */ > wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50); > + schedule_timeout_uninterruptible(1); > goto retry; > } Yeah, that works too. It should still be put on a dedicated wq with MEM_RECLAIM tho. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Tejun Heo wrote: > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote: > > I am confused. What makes the rescuer not run? Nothing seems to be > > hogging CPUs, we are just out of workers which are looping in the > > allocator but that is preemptible context. > > It's concurrency management. Workqueue thinks that the pool is making > positive forward progress and doesn't schedule anything else for > execution while that work item is burning cpu cycles. Then, isn't below change easier to backport which will also alleviate needlessly burning CPU cycles? --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3385,6 +3385,7 @@ retry: ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) { /* Wait for some write requests to complete then retry */ wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50); + schedule_timeout_uninterruptible(1); goto retry; }
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote: > I am confused. What makes the rescuer not run? Nothing seems to be > hogging CPUs, we are just out of workers which are looping in the > allocator but that is preemptible context. It's concurrency management. Workqueue thinks that the pool is making positive forward progress and doesn't schedule anything else for execution while that work item is burning cpu cycles. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 00:37:03, Tejun Heo wrote: > On Thu, Oct 22, 2015 at 05:35:59PM +0200, Michal Hocko wrote: > > But that shouldn't happen because the allocation path does cond_resched > > even when nothing is really reclaimable (e.g. wait_iff_congested from > > __alloc_pages_slowpath). > > cond_resched() isn't enough. The work item should go !RUNNING, not > just yielding. I am confused. What makes the rescuer not run? Nothing seems to be hogging CPUs, we are just out of workers which are looping in the allocator but that is preemptible context. -- Michal Hocko SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 05:35:59PM +0200, Michal Hocko wrote: > But that shouldn't happen because the allocation path does cond_resched > even when nothing is really reclaimable (e.g. wait_iff_congested from > __alloc_pages_slowpath). cond_resched() isn't enough. The work item should go !RUNNING, not just yielding. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 00:15:28, Tejun Heo wrote: > On Thu, Oct 22, 2015 at 05:06:23PM +0200, Michal Hocko wrote: > > Do I get it right that if vmstat_update has its own workqueue with > > WQ_MEM_RECLAIM then there is a _guarantee_ that the rescuer will always > > be able to process vmstat_update work from the requested CPU? > > Yeah. Thanks for the confirmation. > > That should be sufficient because vmstat_update doesn't sleep on > > allocation. I agree that this would be a more appropriate fix. > > The problem seems to be reclaim path busy looping waiting for > vmstat_update and workqueue thinking that the work item must be making > forward-progress and thus not starting the next work item. But that shouldn't happen because the allocation path does cond_resched even when nothing is really reclaimable (e.g. wait_iff_congested from __alloc_pages_slowpath). -- Michal Hocko SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Ok that also makes me rethink commit ba4877b9ca51f80b5d30f304a46762f0509e1635 which seems to be a similar fix this time related to idle mode not updating the counters. Could we fix that by folding the counters before going to idle mode? That fix seems to now create 2 separate application interruptions because the vmstat update is not deferred anymore to occur with other events.
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, On Thu, Oct 22, 2015 at 09:41:11AM -0500, Christoph Lameter wrote: > > If this is actually a legit busy-waiting cyclic dependency, just let > > me know. > > There is no dependency of the vmstat updater on anything. > They can run anytime. If there is a dependency then it's created by the > kworker subsystem itself. Sure, the other direction is from workqueue concurrency detection. I was asking whether a work item can busy-wait on vmstat_update work item cuz that's what confuses workqueue. Looking at the original dump, the pool has two idle workers indicating that the workqueue wasn't short of execution resources and it really looks like that work item was live-locking the pool. I'll go ahead and add WQ_IMMEDIATE. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 05:06:23PM +0200, Michal Hocko wrote: > Do I get it right that if vmstat_update has its own workqueue with > WQ_MEM_RECLAIM then there is a _guarantee_ that the rescuer will always > be able to process vmstat_update work from the requested CPU? Yeah. > That should be sufficient because vmstat_update doesn't sleep on > allocation. I agree that this would be a more appropriate fix. The problem seems to be reclaim path busy looping waiting for vmstat_update and workqueue thinking that the work item must be making forward-progress and thus not starting the next work item. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu 22-10-15 23:09:44, Tejun Heo wrote: > On Thu, Oct 22, 2015 at 08:39:11AM -0500, Christoph Lameter wrote: > > On Thu, 22 Oct 2015, Tetsuo Handa wrote: > > > > > The problem would be that the "struct task_struct" to execute > > > vmstat_update > > > job does not exist, and will not be able to create one on demand because > > > we > > > are stuck at __GFP_WAIT allocation. Therefore adding a dedicated kernel > > > thread for vmstat_update job would work. But ... > > > > Yuck. Can someone please get this major screwup out of the work queue > > subsystem? Tejun? > > Hmmm? Just use a dedicated workqueue with WQ_MEM_RECLAIM. Do I get it right that if vmstat_update has its own workqueue with WQ_MEM_RECLAIM then there is a _guarantee_ that the rescuer will always be able to process vmstat_update work from the requested CPU? That should be sufficient because vmstat_update doesn't sleep on allocation. I agree that this would be a more appropriate fix. -- Michal Hocko SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tejun Heo wrote: > The only way to hang the execution for a work item w/ WQ_MEM_RECLAIM > is to create a cyclic dependency on another work item and keep that > work item busy wait. Workqueue thinks that work item is making > progress as it's running and doesn't schedule the next one. > > (I was misremembering here) HIGHPRI originally was implemented > head-queueing on the same pool followed by immediate execution, so > could get around cases where this could happen, but that got lost > while converting it to a separate pool. I can introduce another flag > to bypass concurrency management if necessary (it's kinda trivial) but > busy-waiting cyclic dependency is a pretty unusual thing. > > If this is actually a legit busy-waiting cyclic dependency, just let > me know. There is no dependency of the vmstat updater on anything. They can run anytime. If there is a dependency then it's created by the kworker subsystem itself.
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 09:25:49AM -0500, Christoph Lameter wrote: > On Thu, 22 Oct 2015, Tejun Heo wrote: > > > On Thu, Oct 22, 2015 at 09:23:54AM -0500, Christoph Lameter wrote: > > > I guess we need that otherwise vm statistics are not updated while worker > > > threads are blocking on memory reclaim. > > > > And the blocking one is just constantly running? > > I was told that there is just one task struct so additional work queue > items cannot be processed while waiting? lol, no, what it tries to do is trying to keep the number of RUNNING workers at minimum so that minimum number of workers can be used and work items are executed back-to-back on the same workers. The moment a work item blocks, the next worker kicks in and starts executing the next work item in line. The only way to hang the execution for a work item w/ WQ_MEM_RECLAIM is to create a cyclic dependency on another work item and keep that work item busy wait. Workqueue thinks that work item is making progress as it's running and doesn't schedule the next one. (I was misremembering here) HIGHPRI originally was implemented head-queueing on the same pool followed by immediate execution, so could get around cases where this could happen, but that got lost while converting it to a separate pool. I can introduce another flag to bypass concurrency management if necessary (it's kinda trivial) but busy-waiting cyclic dependency is a pretty unusual thing. If this is actually a legit busy-waiting cyclic dependency, just let me know. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tejun Heo wrote: > On Thu, Oct 22, 2015 at 09:23:54AM -0500, Christoph Lameter wrote: > > I guess we need that otherwise vm statistics are not updated while worker > > threads are blocking on memory reclaim. > > And the blocking one is just constantly running? I was told that there is just one task struct so additional work queue items cannot be processed while waiting?
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 09:23:54AM -0500, Christoph Lameter wrote: > I guess we need that otherwise vm statistics are not updated while worker > threads are blocking on memory reclaim. And the blocking one is just constantly running? -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tejun Heo wrote: > > Hmmm? Just use a dedicated workqueue with WQ_MEM_RECLAIM. If > > concurrency management is a problem and there's something live-locking > > for that work item (really?), WQ_CPU_INTENSIVE escapes it. If this is > > a common occurrence that it makes sense to give vmstat higher > > priority, set WQ_HIGHPRI. > > Oooh, HIGHPRI + CPU_INTENSIVE immediate scheduling guarantee got lost > while converting HIGHPRI to a separate pool but guaranteeing immediate > scheduling for CPU_INTENSIVE is trivial. If vmstat requires that, > please let me know. I guess we need that otherwise vm statistics are not updated while worker threads are blocking on memory reclaim.
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tejun Heo wrote: > > Yuck. Can someone please get this major screwup out of the work queue > > subsystem? Tejun? > > Hmmm? Just use a dedicated workqueue with WQ_MEM_RECLAIM. If > concurrency management is a problem and there's something live-locking > for that work item (really?), WQ_CPU_INTENSIVE escapes it. If this is > a common occurrence that it makes sense to give vmstat higher > priority, set WQ_HIGHPRI. I did. Check the thread. The result was that other tasks were still blocking the thread. Ok, I did not use HIGHPRI; here is a newer version: From: Christoph Lameter Subject: vmstat: Create our own workqueue V2 V1->V2: - Add a couple of workqueue flags that may fix things. Seems that vmstat needs its own workqueue now since the general workqueue mechanism has been *enhanced* which means that the vmstat_updates cannot run reliably but are being blocked by work requests doing memory allocation. Which causes vmstat to be unable to keep the counters up to date. Bad. Fix this by creating our own workqueue. Signed-off-by: Christoph Lameter Index: linux/mm/vmstat.c === --- linux.orig/mm/vmstat.c +++ linux/mm/vmstat.c @@ -1382,6 +1382,8 @@ static const struct file_operations proc #endif /* CONFIG_PROC_FS */ #ifdef CONFIG_SMP +static struct workqueue_struct *vmstat_wq; + static DEFINE_PER_CPU(struct delayed_work, vmstat_work); int sysctl_stat_interval __read_mostly = HZ; static cpumask_var_t cpu_stat_off; @@ -1394,7 +1396,7 @@ static void vmstat_update(struct work_st * to occur in the future. Keep on running the * update worker thread. 
*/ - schedule_delayed_work_on(smp_processor_id(), + queue_delayed_work_on(smp_processor_id(), vmstat_wq, this_cpu_ptr(&vmstat_work), round_jiffies_relative(sysctl_stat_interval)); } else { @@ -1463,7 +1465,7 @@ static void vmstat_shepherd(struct work_ if (need_update(cpu) && cpumask_test_and_clear_cpu(cpu, cpu_stat_off)) - schedule_delayed_work_on(cpu, + queue_delayed_work_on(cpu, vmstat_wq, &per_cpu(vmstat_work, cpu), 0); put_online_cpus(); @@ -1552,6 +1554,12 @@ static int __init setup_vmstat(void) start_shepherd_timer(); cpu_notifier_register_done(); + vmstat_wq = alloc_workqueue("vmstat", + WQ_FREEZABLE | + WQ_SYSFS | + WQ_MEM_RECLAIM | + WQ_HIGHPRI | + WQ_CPU_INTENSIVE, 0); #endif #ifdef CONFIG_PROC_FS proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 11:09:44PM +0900, Tejun Heo wrote:
> On Thu, Oct 22, 2015 at 08:39:11AM -0500, Christoph Lameter wrote:
> > On Thu, 22 Oct 2015, Tetsuo Handa wrote:
> > > The problem would be that the "struct task_struct" to execute vmstat_update
> > > job does not exist, and will not be able to create one on demand because we
> > > are stuck at __GFP_WAIT allocation. Therefore adding a dedicated kernel
> > > thread for vmstat_update job would work. But ...
> >
> > Yuck. Can someone please get this major screwup out of the work queue
> > subsystem? Tejun?
>
> Hmmm? Just use a dedicated workqueue with WQ_MEM_RECLAIM. If
> concurrency management is a problem and there's something live-locking
> for that work item (really?), WQ_CPU_INTENSIVE escapes it. If this is
> a common occurrence that it makes sense to give vmstat higher
> priority, set WQ_HIGHPRI.

Oooh, HIGHPRI + CPU_INTENSIVE immediate scheduling guarantee got lost
while converting HIGHPRI to a separate pool but guaranteeing immediate
scheduling for CPU_INTENSIVE is trivial. If vmstat requires that,
please let me know.

Thanks.

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 08:39:11AM -0500, Christoph Lameter wrote:
> On Thu, 22 Oct 2015, Tetsuo Handa wrote:
> > The problem would be that the "struct task_struct" to execute vmstat_update
> > job does not exist, and will not be able to create one on demand because we
> > are stuck at __GFP_WAIT allocation. Therefore adding a dedicated kernel
> > thread for vmstat_update job would work. But ...
>
> Yuck. Can someone please get this major screwup out of the work queue
> subsystem? Tejun?

Hmmm? Just use a dedicated workqueue with WQ_MEM_RECLAIM. If
concurrency management is a problem and there's something live-locking
for that work item (really?), WQ_CPU_INTENSIVE escapes it. If this is
a common occurrence that it makes sense to give vmstat higher
priority, set WQ_HIGHPRI.

Thanks.

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tetsuo Handa wrote:
> The problem would be that the "struct task_struct" to execute vmstat_update
> job does not exist, and will not be able to create one on demand because we
> are stuck at __GFP_WAIT allocation. Therefore adding a dedicated kernel
> thread for vmstat_update job would work. But ...

Yuck. Can someone please get this major screwup out of the work queue
subsystem? Tejun?
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Christoph Lameter wrote:
> On Wed, 21 Oct 2015, Michal Hocko wrote:
> > I am not sure how to achieve that. Requiring non-sleeping worker would
> > work out but do we have enough users to add such an API?
> >
> > I would rather see vmstat using dedicated kernel thread(s) for this
> > purpose. We have discussed that in the past but it hasn't led anywhere.
>
> How about this one? I really would like to have the vm statistics work as
> designed and apparently they no longer work right with the existing
> workqueue mechanism.

No, it won't help. Adding a dedicated workqueue for the vmstat_update job
merely moves that job from the "events" workqueue to the "vmstat"
workqueue. The "vmstat" workqueue still appears in the list of busy
workqueues. The problem is that all workqueues are assigned the same CPU
(cpus=2 in the example below, tested on a 4-CPU VM) and therefore only one
job is in the in-flight state. All other jobs wait in the pending list for
the in-flight job to complete, while the in-flight job is blocked at a
memory allocation.

The problem would be that the "struct task_struct" to execute the
vmstat_update job does not exist, and we will not be able to create one on
demand because we are stuck at a __GFP_WAIT allocation. Therefore adding a
dedicated kernel thread for the vmstat_update job would work. But ...
[ 133.132322] Showing busy workqueues and worker pools:
[ 133.133878] workqueue events: flags=0x0
[ 133.135215]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[ 133.137076]     pending: vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx]
[ 133.139075] workqueue events_freezable_power_: flags=0x84
[ 133.140745]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[ 133.142638]     in-flight: 20:disk_events_workfn
[ 133.144199]     pending: disk_events_workfn
[ 133.145699] workqueue vmstat: flags=0xc
[ 133.147055]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 133.148910]     pending: vmstat_update
[ 133.150354] pool 4: cpus=2 node=0 flags=0x0 nice=0 workers=4 idle: 43 189 183
[ 133.174523] DMA32 zone_reclaimable: reclaim:2(30186,30162,2) free:11163(25154,-20) min:11163 pages_scanned:0(30158,0) prio:12
[ 133.177264] DMA32 zone_reclaimable: reclaim:2(30189,30165,2) free:11163(25157,-20) min:11163 pages_scanned:0(30161,0) prio:11
[ 133.180139] DMA32 zone_reclaimable: reclaim:2(30191,30167,2) free:11163(25159,-20) min:11163 pages_scanned:0(30163,0) prio:10
[ 133.182847] DMA32 zone_reclaimable: reclaim:2(30194,30170,2) free:11163(25162,-20) min:11163 pages_scanned:0(30166,0) prio:9
[ 133.207048] DMA32 zone_reclaimable: reclaim:2(30219,30195,2) free:11163(25187,-20) min:11163 pages_scanned:0(30191,0) prio:8
[ 133.209770] DMA32 zone_reclaimable: reclaim:2(30221,30197,2) free:11163(25189,-20) min:11163 pages_scanned:0(30193,0) prio:7
[ 133.212470] DMA32 zone_reclaimable: reclaim:2(30224,30200,2) free:11163(25192,-20) min:11163 pages_scanned:0(30196,0) prio:6
[ 133.215149] DMA32 zone_reclaimable: reclaim:2(30227,30203,2) free:11163(25195,-20) min:11163 pages_scanned:0(30199,0) prio:5
[ 133.239013] DMA32 zone_reclaimable: reclaim:2(30251,30227,2) free:11163(25219,-20) min:11163 pages_scanned:0(30223,0) prio:4
[ 133.241688] DMA32 zone_reclaimable: reclaim:2(30253,30229,2) free:11163(25221,-20) min:11163 pages_scanned:0(30225,0) prio:3
[ 133.244332] DMA32 zone_reclaimable: reclaim:2(30256,30232,2) free:11163(25224,-20) min:11163 pages_scanned:0(30228,0) prio:2
[ 133.246919] DMA32 zone_reclaimable: reclaim:2(30258,30234,2) free:11163(25226,-20) min:11163 pages_scanned:0(30230,0) prio:1
[ 133.270967] DMA32 zone_reclaimable: reclaim:2(30283,30259,2) free:11163(25251,-20) min:11163 pages_scanned:0(30255,0) prio:0
[ 133.273587] DMA32 zone_reclaimable: reclaim:2(30285,30261,2) free:11163(25253,-20) min:11163 pages_scanned:0(30257,0) prio:12
[ 133.276224] DMA32 zone_reclaimable: reclaim:2(30287,30263,2) free:11163(25255,-20) min:11163 pages_scanned:0(30259,0) prio:11
[ 133.278852] DMA32 zone_reclaimable: reclaim:2(30290,30266,2) free:11163(25258,-20) min:11163 pages_scanned:0(30262,0) prio:10
[ 133.302964] DMA32 zone_reclaimable: reclaim:2(30315,30291,2) free:11163(25283,-20) min:11163 pages_scanned:0(30287,0) prio:9
[ 133.305518] DMA32 zone_reclaimable: reclaim:2(30317,30293,2) free:11163(25285,-20) min:11163 pages_scanned:0(30289,0) prio:8
[ 133.308095] DMA32 zone_reclaimable: reclaim:2(30319,30295,2) free:11163(25287,-20) min:11163 pages_scanned:0(30291,0) prio:7
[ 133.310683] DMA32 zone_reclaimable: reclaim:2(30322,30298,2) free:11163(25290,-20) min:11163 pages_scanned:0(30294,0) prio:6
[ 133.334904] DMA32 zone_reclaimable: reclaim:2(30347,30323,2) free:11163(25315,-20) min:11163 pages_scanned:0(30319,0) prio:5
[ 133.337590] DMA32 zone_reclaimable: reclaim:2(30349,30325,2) free:11163(25317,-20) min:11163 pages_scanned:0(30321,0) prio:4
[ 133.340147] DMA32 zone_reclaimable: reclaim:2(30351,30327,2)
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote:
> I am confused. What makes rescuer to not run? Nothing seems to be
> hogging CPUs, we are just out of workers which are looping in the
> allocator but that is preemptible context.

It's concurrency management. Workqueue thinks that the pool is making
positive forward progress and doesn't schedule anything else for
execution while that work item is burning cpu cycles.

Thanks.

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Tejun Heo wrote:
> On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote:
> > I am confused. What makes rescuer to not run? Nothing seems to be
> > hogging CPUs, we are just out of workers which are looping in the
> > allocator but that is preemptible context.
>
> It's concurrency management. Workqueue thinks that the pool is making
> positive forward progress and doesn't schedule anything else for
> execution while that work item is burning cpu cycles.

Then, isn't the below change easier to backport, which will also alleviate
needlessly burning CPU cycles?

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3385,6 +3385,7 @@ retry:
	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
		/* Wait for some write requests to complete then retry */
		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
+		schedule_timeout_uninterruptible(1);
		goto retry;
	}
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello,

On Fri, Oct 23, 2015 at 06:42:43AM +0900, Tetsuo Handa wrote:
> Then, isn't below change easier to backport which will also alleviate
> needlessly burning CPU cycles?
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3385,6 +3385,7 @@ retry:
> 	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
> 		/* Wait for some write requests to complete then retry */
> 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
> +		schedule_timeout_uninterruptible(1);
> 		goto retry;
> 	}

Yeah, that works too. It should still be put on a dedicated wq with
MEM_RECLAIM tho.

Thanks.

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tejun Heo wrote:
> > Hmmm? Just use a dedicated workqueue with WQ_MEM_RECLAIM. If
> > concurrency management is a problem and there's something live-locking
> > for that work item (really?), WQ_CPU_INTENSIVE escapes it. If this is
> > a common occurrence that it makes sense to give vmstat higher
> > priority, set WQ_HIGHPRI.
>
> Oooh, HIGHPRI + CPU_INTENSIVE immediate scheduling guarantee got lost
> while converting HIGHPRI to a separate pool but guaranteeing immediate
> scheduling for CPU_INTENSIVE is trivial. If vmstat requires that,
> please let me know.

I guess we need that, otherwise vm statistics are not updated while worker
threads are blocking on memory reclaim.
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 09:23:54AM -0500, Christoph Lameter wrote:
> I guess we need that otherwise vm statistics are not updated while worker
> threads are blocking on memory reclaim.

And the blocking one is just constantly running?

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tejun Heo wrote:
> On Thu, Oct 22, 2015 at 09:23:54AM -0500, Christoph Lameter wrote:
> > I guess we need that otherwise vm statistics are not updated while worker
> > threads are blocking on memory reclaim.
>
> And the blocking one is just constantly running?

I was told that there is just one task struct so additional work queue
items cannot be processed while waiting?
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tejun Heo wrote:
> The only way to hang the execution for a work item w/ WQ_MEM_RECLAIM
> is to create a cyclic dependency on another work item and keep that
> work item busy wait. Workqueue thinks that work item is making
> progress as it's running and doesn't schedule the next one.
>
> (I was misremembering here) HIGHPRI originally was implemented
> head-queueing on the same pool followed by immediate execution, so
> could get around cases where this could happen, but that got lost
> while converting it to a separate pool. I can introduce another flag
> to bypass concurrency management if necessary (it's kinda trivial) but
> busy-waiting cyclic dependency is a pretty unusual thing.
>
> If this is actually a legit busy-waiting cyclic dependency, just let
> me know.

There is no dependency of the vmstat updater on anything. It can run
anytime. If there is a dependency then it's created by the kworker
subsystem itself.
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 09:25:49AM -0500, Christoph Lameter wrote:
> On Thu, 22 Oct 2015, Tejun Heo wrote:
> > On Thu, Oct 22, 2015 at 09:23:54AM -0500, Christoph Lameter wrote:
> > > I guess we need that otherwise vm statistics are not updated while worker
> > > threads are blocking on memory reclaim.
> >
> > And the blocking one is just constantly running?
>
> I was told that there is just one task struct so additional work queue
> items cannot be processed while waiting?

lol, no, what it tries to do is to keep the number of RUNNING workers at
a minimum so that the minimum number of workers can be used and work
items are executed back-to-back on the same workers. The moment a work
item blocks, the next worker kicks in and starts executing the next work
item in line.

The only way to hang the execution for a work item w/ WQ_MEM_RECLAIM is
to create a cyclic dependency on another work item and keep that work
item busy waiting. Workqueue thinks that work item is making progress as
it's running and doesn't schedule the next one.

(I was misremembering here) HIGHPRI originally was implemented as
head-queueing on the same pool followed by immediate execution, so it
could get around cases where this could happen, but that got lost while
converting it to a separate pool. I can introduce another flag to bypass
concurrency management if necessary (it's kinda trivial) but a
busy-waiting cyclic dependency is a pretty unusual thing.

If this is actually a legit busy-waiting cyclic dependency, just let me
know.

Thanks.

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello,

So, something like the following. Just compile tested, but this is
essentially a partial revert of 3270476a6c0c ("workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool") - resurrecting the old
WQ_HIGHPRI implementation under WQ_IMMEDIATE, so we know this works.

If for some reason it gets decided against simply adding a one jiffy
sleep, please let me know. I'll verify the operation and post a proper
patch. That said, given that this prolly needs -stable backport and
vmstat is likely to be the only user (busy loops are really rare in the
kernel after all), I think the better approach would be reinstating the
short sleep.

Thanks.

---
 include/linux/workqueue.h |  7 ++---
 kernel/workqueue.c        | 63 +++---
 2 files changed, 63 insertions(+), 7 deletions(-)

--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -278,9 +278,10 @@ enum {
	WQ_UNBOUND		= 1 << 1, /* not bound to any cpu */
	WQ_FREEZABLE		= 1 << 2, /* freeze during suspend */
	WQ_MEM_RECLAIM		= 1 << 3, /* may be used for memory reclaim */
-	WQ_HIGHPRI		= 1 << 4, /* high priority */
-	WQ_CPU_INTENSIVE	= 1 << 5, /* cpu intensive workqueue */
-	WQ_SYSFS		= 1 << 6, /* visible in sysfs, see wq_sysfs_register() */
+	WQ_IMMEDIATE		= 1 << 4, /* bypass concurrency management */
+	WQ_HIGHPRI		= 1 << 5, /* high priority */
+	WQ_CPU_INTENSIVE	= 1 << 6, /* cpu intensive workqueue */
+	WQ_SYSFS		= 1 << 7, /* visible in sysfs, see wq_sysfs_register() */

	/*
	 * Per-cpu workqueues are generally preferred because they tend to
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -68,6 +68,7 @@ enum {
	 * attach_mutex to avoid changing binding state while
	 * worker_attach_to_pool() is in progress.
	 */
+	POOL_IMMEDIATE_PENDING	= 1 << 0, /* WQ_IMMEDIATE items on queue */
	POOL_DISASSOCIATED	= 1 << 2, /* cpu can't serve workers */

	/* worker flags */
@@ -731,7 +732,8 @@ static bool work_is_canceling(struct wor

 static bool __need_more_worker(struct worker_pool *pool)
 {
-	return !atomic_read(&pool->nr_running);
+	return !atomic_read(&pool->nr_running) ||
+		(pool->flags & POOL_IMMEDIATE_PENDING);
 }

 /*
@@ -757,7 +759,8 @@ static bool may_start_working(struct wor
 static bool keep_working(struct worker_pool *pool)
 {
	return !list_empty(&pool->worklist) &&
-		atomic_read(&pool->nr_running) <= 1;
+		(atomic_read(&pool->nr_running) <= 1 ||
+		 (pool->flags & POOL_IMMEDIATE_PENDING));
 }

 /* Do we need a new worker? Called from manager. */
@@ -1021,6 +1024,42 @@ static void move_linked_works(struct wor
 }

 /**
+ * pwq_determine_ins_pos - find insertion position
+ * @pwq: pwq a work is being queued for
+ *
+ * A work for @pwq is about to be queued on @pwq->pool, determine insertion
+ * position for the work.  If @pwq is for an IMMEDIATE wq, the work item is
+ * queued at the head of the queue but in FIFO order with respect to other
+ * IMMEDIATE work items; otherwise, at the end of the queue.  This function
+ * also sets POOL_IMMEDIATE_PENDING flag to hint @pool that there are
+ * IMMEDIATE works pending.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to insertion position.
+ */
+static struct list_head *pwq_determine_ins_pos(struct pool_workqueue *pwq)
+{
+	struct worker_pool *pool = pwq->pool;
+	struct work_struct *twork;
+
+	if (likely(!(pwq->wq->flags & WQ_IMMEDIATE)))
+		return &pool->worklist;
+
+	list_for_each_entry(twork, &pool->worklist, entry) {
+		struct pool_workqueue *tpwq = get_work_pwq(twork);
+
+		if (!(tpwq->wq->flags & WQ_IMMEDIATE))
+			break;
+	}
+
+	pool->flags |= POOL_IMMEDIATE_PENDING;
+	return &twork->entry;
+}
+
+/**
  * get_pwq - get an extra reference on the specified pool_workqueue
  * @pwq: pool_workqueue to get
  *
@@ -1081,9 +1120,10 @@ static void put_pwq_unlocked(struct pool
 static void pwq_activate_delayed_work(struct work_struct *work)
 {
	struct pool_workqueue *pwq = get_work_pwq(work);
+	struct list_head *pos = pwq_determine_ins_pos(pwq);

	trace_workqueue_activate_work(work);
-	move_linked_works(work, &pwq->pool->worklist, NULL);
+	move_linked_works(work, pos, NULL);
	__clear_bit(WORK_STRUCT_DELAYED_BIT, work_data_bits(work));
	pwq->nr_active++;
 }
@@ -1384,7 +1424,7 @@ retry:
	if (likely(pwq->nr_active < pwq->max_active)) {
		trace_workqueue_activate_work(work);
		pwq->nr_active++;
-		worklist = &pwq->pool->worklist;
+		worklist = pwq_determine_ins_pos(pwq);
	} else {
		work_flags |= WORK_STRUCT_DELAYED;
		worklist =
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello,

On Thu, Oct 22, 2015 at 09:41:11AM -0500, Christoph Lameter wrote:
> > If this is actually a legit busy-waiting cyclic dependency, just let
> > me know.
>
> There is no dependency of the vmstat updater on anything.
> They can run anytime. If there is a dependency then its created by the
> kworker subsystem itself.

Sure, the other direction is from workqueue concurrency detection. I was
asking whether a work item can busy-wait on the vmstat_update work item,
cuz that's what confuses workqueue. Looking at the original dump, the
pool has idle workers, indicating that the workqueue wasn't short of
execution resources, and it really looks like that work item was
live-locking the pool. I'll go ahead and add WQ_IMMEDIATE.

Thanks.

-- 
tejun