Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
With the full changelog and the vmstat update for reference.

---
From 9492966a552751e6d7a63e9aafb87e35992b840a Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 11 Nov 2015 16:45:53 +0100
Subject: [PATCH] mm, vmstat: Allow WQ concurrency to discover memory reclaim
 doesn't make any progress

Tetsuo Handa has reported that the system might basically livelock in an
OOM condition without triggering the OOM killer. The issue is caused by an
internal dependency of the direct reclaim on vmstat counter updates (via
zone_reclaimable) which are performed from the workqueue context. If all
the current workers get assigned to an allocation request, though, they
will be looping inside the allocator trying to reclaim memory, but
zone_reclaimable can see stalled numbers, so it will consider a zone
reclaimable even though it has been scanned way too much. The WQ
concurrency logic will not consider this situation a congested workqueue,
because it relies on the fact that a worker would have to sleep in such a
situation. This also means that it doesn't try to spawn new workers or
invoke the rescuer thread if one is assigned to the queue.

In order to fix this issue we need to do two things. First, we have to let
the WQ concurrency code know that we are in trouble, so we have to do a
short sleep. In order to avoid reintroducing the issues handled by
0e093d99763e ("writeback: do not sleep on the congestion queue if there
are no congested BDIs or if significant congestion is not being
encountered in the current zone"), we limit the sleep to worker threads,
which are the ones of interest anyway. The second thing to do is to create
a dedicated workqueue for vmstat and mark it WQ_MEM_RECLAIM to note that
it participates in reclaim and to have a spare worker thread for it.
Reported-by: Tetsuo Handa
Signed-off-by: Michal Hocko
---
 mm/backing-dev.c | 19 ++++++++++++++++---
 mm/vmstat.c      |  6 ++++--
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8ed2ffd963c5..7340353f8aea 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,8 +957,9 @@ EXPORT_SYMBOL(congestion_wait);
 * jiffies for either a BDI to exit congestion of the given @sync queue
 * or a write to complete.
 *
- * In the absence of zone congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
+ * In the absence of zone congestion, a short sleep or a cond_resched is
+ * performed to yield the processor and to allow other subsystems to make
+ * a forward progress.
 *
 * The return value is 0 if the sleep is for the full timeout. Otherwise,
 * it is the number of jiffies that were still remaining when the function
@@ -978,7 +979,19 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
	 */
	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
-		cond_resched();
+
+		/*
+		 * Memory allocation/reclaim might be called from a WQ
+		 * context and the current implementation of the WQ
+		 * concurrency control doesn't recognize that a particular
+		 * WQ is congested if the worker thread is looping without
+		 * ever sleeping. Therefore we have to do a short sleep
+		 * here rather than calling cond_resched().
+		 */
+		if (current->flags & PF_WQ_WORKER)
+			schedule_timeout(1);
+		else
+			cond_resched();

		/* In case we scheduled, work out time remaining */
		ret = timeout - (jiffies - start);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 45dcbcb5c594..0975da8e3432 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1381,6 +1381,7 @@ static const struct file_operations proc_vmstat_file_operations = {
 #endif /* CONFIG_PROC_FS */

 #ifdef CONFIG_SMP
+static struct workqueue_struct *vmstat_wq;
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 static cpumask_var_t cpu_stat_off;
@@ -1393,7 +1394,7 @@ static void vmstat_update(struct work_struct *w)
		 * to occur in the future. Keep on running the
		 * update worker thread.
		 */
-		schedule_delayed_work_on(smp_processor_id(),
+		queue_delayed_work_on(smp_processor_id(), vmstat_wq,
			this_cpu_ptr(&vmstat_work),
			round_jiffies_relative(sysctl_stat_interval));
	} else {
@@ -1462,7 +1463,7 @@ static void vmstat_shepherd(struct work_struct *w)
		if (need_update(cpu) &&
		    cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
-			schedule_delayed_work_on(cpu,
+			queue_delayed_work_on(cpu, vmstat_wq,
				&per_cpu(vmstat_work, cpu), 0);

	put_online_cpus();
@@ -1551,6 +1552,7 @@ static int
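The worker-vs-ordinary-task branch added above can be illustrated with a userspace analogy. This is only a sketch of the idea, not kernel code: `yield_or_short_sleep()` and the `is_worker` flag are invented names standing in for `current->flags & PF_WQ_WORKER`, `schedule_timeout(1)` and `cond_resched()`. The point is that a worker-like thread must genuinely sleep, so that a concurrency manager which only notices sleeping threads can detect the stall, while everyone else just yields the CPU as before.

#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <time.h>

/*
 * Userspace analogy of the wait_iff_congested() hunk: a worker-like
 * thread does a real (short) sleep; an ordinary task merely yields
 * the processor without ever going to sleep.
 */
static void yield_or_short_sleep(bool is_worker)
{
	if (is_worker) {
		/* analogue of schedule_timeout(1): sleep ~1ms */
		struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 * 1000 };
		nanosleep(&ts, NULL);
	} else {
		/* analogue of cond_resched(): yield without sleeping */
		sched_yield();
	}
}

int main(void)
{
	struct timespec a, b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	yield_or_short_sleep(true);
	clock_gettime(CLOCK_MONOTONIC, &b);
	/* the worker path must have actually slept for at least ~1ms */
	assert((b.tv_sec - a.tv_sec) * 1000000000L +
	       (b.tv_nsec - a.tv_nsec) >= 1000 * 1000);

	yield_or_short_sleep(false); /* non-worker path: plain yield */
	return 0;
}

A one-jiffy sleep is the shortest interval the scheduler can account for, which is why the patch uses `schedule_timeout(1)` rather than a longer `congestion_wait()`-style delay.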
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu 05-11-15 19:16:48, Tejun Heo wrote:
> Hello,
>
> On Thu, Nov 05, 2015 at 11:45:42AM -0600, Christoph Lameter wrote:
> > Sorry but we need work queue processing for vmstat counters that is
>
> I made this analogy before but this is similar to looping with
> preemption off. If anything on workqueue stays RUNNING w/o making
> forward progress, it's buggy. I'd venture to say any code which busy
> loops without making forward progress in the time scale noticeable to
> human beings is borderline buggy too.

Well, the caller asked for memory but the request cannot succeed. Due to
the memory allocator semantics we cannot fail the request, so we have to
loop. If we had an event to wait for, we would do so, of course.

Now wrt. a small sleep: we used to do that and called
congestion_wait(HZ/50) before the retry. This proved to cause stalls
during high memory pressure; see 0e093d99763e ("writeback: do not sleep
on the congestion queue if there are no congested BDIs or if significant
congestion is not being encountered in the current zone"). I do not
really remember what CONFIG_HZ was in those reports, but it is quite
possible it was 250. So there is a risk of (partially) reintroducing
those stalls with the patch from Tetsuo
(http://lkml.kernel.org/r/201510251952.cef04109.osotlfhfvfj...@i-love.sakura.ne.jp).

If we really have to do a short sleep, though, then I would suggest
sticking it into wait_iff_congested rather than spreading it into more
places, and reducing it to worker threads only. This should be much
safer. Thoughts?

---
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8ed2ffd963c5..7340353f8aea 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,8 +957,9 @@ EXPORT_SYMBOL(congestion_wait);
 * jiffies for either a BDI to exit congestion of the given @sync queue
 * or a write to complete.
 *
- * In the absence of zone congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
+ * In the absence of zone congestion, a short sleep or a cond_resched is
+ * performed to yield the processor and to allow other subsystems to make
+ * a forward progress.
 *
 * The return value is 0 if the sleep is for the full timeout. Otherwise,
 * it is the number of jiffies that were still remaining when the function
@@ -978,7 +979,19 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
	 */
	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
-		cond_resched();
+
+		/*
+		 * Memory allocation/reclaim might be called from a WQ
+		 * context and the current implementation of the WQ
+		 * concurrency control doesn't recognize that a particular
+		 * WQ is congested if the worker thread is looping without
+		 * ever sleeping. Therefore we have to do a short sleep
+		 * here rather than calling cond_resched().
+		 */
+		if (current->flags & PF_WQ_WORKER)
+			schedule_timeout(1);
+		else
+			cond_resched();

		/* In case we scheduled, work out time remaining */
		ret = timeout - (jiffies - start);
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello,

On Thu, Nov 05, 2015 at 11:45:42AM -0600, Christoph Lameter wrote:
> Sorry but we need work queue processing for vmstat counters that is

I made this analogy before but this is similar to looping with preemption
off. If anything on a workqueue stays RUNNING w/o making forward
progress, it's buggy. I'd venture to say any code which busy loops
without making forward progress on a time scale noticeable to human
beings is borderline buggy too. If things need to be retried on that time
scale, putting a short sleep between trials is a sensible thing to do.
There's no point in occupying the cpu and burning cycles without making
forward progress. These things actually matter. The freezer used to burn
cycles this way and was really good at burning off the last remaining
battery reserve during emergency hibernation if freezing took some amount
of time.

It is true that, as it currently stands, this is error-prone because
workqueue can't detect these conditions and warn about them. The same
goes for workqueues which sit in the memory reclaim path but forget
WQ_MEM_RECLAIM. I'm going to add lockup detection, similar to how
softlockup works, but that's a different issue, so please update the
code.

Thanks.

--
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 5 Nov 2015, Tetsuo Handa wrote:
> memory allocation. By allowing workqueue items to be processed (by using
> short sleep), some task might release memory when workqueue item is
> processed.
>
> Therefore, not only to keep vmstat counters up to date, but also to
> avoid wasting CPU cycles, I prefer a short sleep.

Sorry, but we need workqueue processing for vmstat counters that is
independent of other submitted requests that may block. Adding points
where we sleep / schedule everywhere to do this is not the right
approach.
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Michal Hocko wrote:
> As already pointed out I really detest a short sleep and would prefer
> a way to tell WQ what we really need. vmstat is not the only user. OOM
> sysrq will need this special treatment as well. While the
> zone_reclaimable can be fixed in an easy patch
> (http://lkml.kernel.org/r/201510212126.JIF90648.HOOFJVFQLMStOF%40I-love.SAKURA.ne.jp)
> which is perfectly suited for the stable backport, OOM sysrq resp. any
> sysrq which runs from the WQ context should be as robust as possible and
> shouldn't rely on all the code running from WQ context to issue a sleep
> to get unstuck. So I definitely support something like this patch.

I still prefer a short sleep, from a different perspective. I tested the
above patch with the patch below applied

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0499ff..54bedd8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2992,6 +2992,53 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
 }

+static atomic_t stall_tasks;
+
+static int kmallocwd(void *unused)
+{
+	struct task_struct *g, *p;
+	unsigned int sigkill_pending;
+	unsigned int memdie_pending;
+	unsigned int stalling_tasks;
+
+ not_stalling: /* Healthy case. */
+	schedule_timeout_interruptible(HZ);
+	if (likely(!atomic_read(&stall_tasks)))
+		goto not_stalling;
+ maybe_stalling: /* Maybe something is wrong. Let's check. */
+	/* Count stalling tasks, dying and victim tasks. */
+	sigkill_pending = 0;
+	memdie_pending = 0;
+	stalling_tasks = atomic_read(&stall_tasks);
+	preempt_disable();
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (test_tsk_thread_flag(p, TIF_MEMDIE))
+			memdie_pending++;
+		if (fatal_signal_pending(p))
+			sigkill_pending++;
+	}
+	rcu_read_unlock();
+	preempt_enable();
+	pr_warn("MemAlloc-Info: %u stalling task, %u dying task, %u victim task.\n",
+		stalling_tasks, sigkill_pending, memdie_pending);
+	show_workqueue_state();
+	schedule_timeout_interruptible(10 * HZ);
+	if (atomic_read(&stall_tasks))
+		goto maybe_stalling;
+	goto not_stalling;
+	return 0; /* To suppress "no return statement" compiler warning. */
+}
+
+static int __init start_kmallocwd(void)
+{
+	struct task_struct *task = kthread_run(kmallocwd, NULL,
+					       "kmallocwd");
+	BUG_ON(IS_ERR(task));
+	return 0;
+}
+late_initcall(start_kmallocwd);
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
						struct alloc_context *ac)
@@ -3004,6 +3051,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
	enum migrate_mode migration_mode = MIGRATE_ASYNC;
	bool deferred_compaction = false;
	int contended_compaction = COMPACT_CONTENDED_NONE;
+	unsigned long start = jiffies;
+	bool stall_counted = false;

	/*
	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3095,6 +3144,11 @@ retry:
	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
		goto nopage;

+	if (!stall_counted && time_after(jiffies, start + 10 * HZ)) {
+		atomic_inc(&stall_tasks);
+		stall_counted = true;
+	}
+
	/*
	 * Try direct compaction. The first pass is asynchronous. Subsequent
	 * attempts after direct reclaim are synchronous
@@ -3188,6 +3242,8 @@ noretry:
 nopage:
	warn_alloc_failed(gfp_mask, order, NULL);
 got_pg:
+	if (stall_counted)
+		atomic_dec(&stall_tasks);
	return page;
 }

using a crazy stressing program. (Not a TIF_MEMDIE stall.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>

static void child(void)
{
	char *buf = NULL;
	unsigned long size = 0;
	const int fd = open("/dev/zero", O_RDONLY);

	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);

		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	read(fd, buf, size); /* Will cause OOM due to overcommit */
}

int main(int argc, char *argv[])
{
	if (argc > 1) {
		int i;
		char buffer[4096];

		for (i = 0; i < 1000; i++) {
			if (fork() == 0) {
				sleep(20);
				memset(buffer, 0, sizeof(buffer));
				_exit(0);
			}
		}
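The kmallocwd pattern in the patch above — stalled allocators bump a shared counter, and a watchdog thread periodically samples it and reports — can be sketched in userspace with pthreads. This is a rough analogy only: `slow_path()` and the timing constants are invented for illustration, while the kernel version additionally walks the task list and calls show_workqueue_state().

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

/* Counter of threads currently stuck in a slow path, analogous to
 * the stall_tasks atomic in the kmallocwd patch. */
static atomic_int stall_tasks;
static atomic_int reports;	/* how many times the watchdog fired */
static atomic_int stop;

/* Watchdog: stay quiet while nothing stalls, report while something does. */
static void *kmallocwd(void *unused)
{
	(void)unused;
	while (!atomic_load(&stop)) {
		usleep(10 * 1000);	/* analogue of the HZ-based sleep */
		int n = atomic_load(&stall_tasks);
		if (n) {
			fprintf(stderr, "MemAlloc-Info: %d stalling task(s)\n", n);
			atomic_fetch_add(&reports, 1);
		}
	}
	return NULL;
}

/* A "slow allocation": mark ourselves stalled, do slow work, unmark. */
static void slow_path(void)
{
	atomic_fetch_add(&stall_tasks, 1);
	usleep(100 * 1000);
	atomic_fetch_sub(&stall_tasks, 1);
}

int main(void)
{
	pthread_t wd;

	pthread_create(&wd, NULL, kmallocwd, NULL);
	slow_path();
	atomic_store(&stop, 1);
	pthread_join(wd, NULL);
	assert(atomic_load(&reports) >= 1);	/* the stall was noticed */
	assert(atomic_load(&stall_tasks) == 0);	/* counter balanced */
	return 0;
}

The design point mirrored here is that the watchdog never blocks the stalling tasks themselves; it only observes a counter they maintain, so it keeps working even when the allocator side makes no progress at all.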
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, Tetsuo.

On Tue, Nov 03, 2015 at 11:32:06AM +0900, Tetsuo Handa wrote:
> Tejun Heo wrote:
> > If
> > the possibility of sysrq getting stuck behind concurrency management
> > is an issue, queueing them on an unbound or highpri workqueue should
> > be good enough.
>
> Regarding SysRq-f, we could do like below. Though I think that converting
> the OOM killer into a dedicated kernel thread would allow more things to do
> (e.g. Oleg's memory zapping code, my timeout based next victim selection).

I'm not sure doing anything to sysrq-f is warranted. If workqueue can't
make forward progress due to memory exhaustion, OOM will be triggered
anyway. Getting stuck behind concurrency management isn't that different
a failure mode from getting stuck behind a busy loop with preemption off.
We should just plug them at the source. If necessary, what we can do is
add a stall watchdog (which can probably be combined with the usual
watchdog) so that it can better point out the culprit.

Thanks.

--
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Tejun Heo wrote:
> If
> the possibility of sysrq getting stuck behind concurrency management
> is an issue, queueing them on an unbound or highpri workqueue should
> be good enough.

Regarding SysRq-f, we could do like below. Though I think that converting
the OOM killer into a dedicated kernel thread would allow more things to
do (e.g. Oleg's memory zapping code, my timeout based next victim
selection).

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 5381a72..46b951aa 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -47,6 +47,7 @@
 #include <...>
 #include <...>
 #include <...>
+#include <linux/kthread.h>
 #include <...>
 #include <...>
@@ -351,27 +352,35 @@ static struct sysrq_key_op sysrq_term_op = {
	.enable_mask	= SYSRQ_ENABLE_SIGNAL,
 };

-static void moom_callback(struct work_struct *ignored)
+static DECLARE_WAIT_QUEUE_HEAD(moom_wait);
+
+static int moom_callback(void *unused)
 {
	const gfp_t gfp_mask = GFP_KERNEL;
-	struct oom_control oc = {
-		.zonelist = node_zonelist(first_memory_node, gfp_mask),
-		.nodemask = NULL,
-		.gfp_mask = gfp_mask,
-		.order = -1,
-	};
-
-	mutex_lock(&oom_lock);
-	if (!out_of_memory(&oc))
-		pr_info("OOM request ignored because killer is disabled\n");
-	mutex_unlock(&oom_lock);
+	DEFINE_WAIT(wait);
+
+	while (1) {
+		struct oom_control oc = {
+			.zonelist = node_zonelist(first_memory_node, gfp_mask),
+			.nodemask = NULL,
+			.gfp_mask = gfp_mask,
+			.order = -1,
+		};
+
+		prepare_to_wait(&moom_wait, &wait, TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&moom_wait, &wait);
+		mutex_lock(&oom_lock);
+		if (!out_of_memory(&oc))
+			pr_info("OOM request ignored because killer is disabled\n");
+		mutex_unlock(&oom_lock);
+	}
+	return 0;
 }

-static DECLARE_WORK(moom_work, moom_callback);
-
 static void sysrq_handle_moom(int key)
 {
-	schedule_work(&moom_work);
+	wake_up(&moom_wait);
 }

 static struct sysrq_key_op sysrq_moom_op = {
	.handler	= sysrq_handle_moom,
@@ -1116,6 +1125,9 @@ static inline void sysrq_init_procfs(void)

 static int __init sysrq_init(void)
 {
+	struct task_struct *task = kthread_run(moom_callback, NULL,
+					       "manual_oom");
+	BUG_ON(IS_ERR(task));
	sysrq_init_procfs();

	if (sysrq_on())
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Mon, Nov 02, 2015 at 04:01:37PM +0100, Michal Hocko wrote:
...
> which is perfectly suited for the stable backport, OOM sysrq resp. any
> sysrq which runs from the WQ context should be as robust as possible and
> shouldn't rely on all the code running from WQ context to issue a sleep
> to get unstuck. So I definitely support something like this patch.

Well, sysrq wouldn't run successfully either on a cpu which is busy looping with preemption off. I don't think this calls for a new flag to modify workqueue behavior, especially given that missing such a flag would lead to the same kind of lockup. It's a shitty solution.

If the possibility of sysrq getting stuck behind concurrency management is an issue, queueing them on an unbound or highpri workqueue should be good enough.

Thanks.

-- 
tejun
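Tejun's suggestion above could look roughly like the sketch below. This is an illustration only, not code from the thread: the workqueue name `sysrq_wq` and the init helper are invented, and WQ_HIGHPRI would be the alternative flag per the discussion.

```c
/*
 * Hypothetical sketch of "queueing them on an unbound or highpri
 * workqueue". Unbound workers are not subject to the per-cpu
 * concurrency management, so a worker busy-looping in the allocator
 * on some CPU cannot strand this work item behind it.
 */
static struct workqueue_struct *sysrq_wq;	/* assumed name */
static DECLARE_WORK(moom_work, moom_callback);

static int __init sysrq_wq_init(void)
{
	sysrq_wq = alloc_workqueue("sysrq", WQ_UNBOUND, 0);
	return sysrq_wq ? 0 : -ENOMEM;
}

static void sysrq_handle_moom(int key)
{
	/* instead of schedule_work(), which uses the managed system wq */
	queue_work(sysrq_wq, &moom_work);
}
```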
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 13:26:49, Tejun Heo wrote:
> Hello,
>
> So, something like the following. Just compile tested but this is
> essentially partial revert of 3270476a6c0c ("workqueue: reimplement
> WQ_HIGHPRI using a separate worker_pool") - resurrecting the old
> WQ_HIGHPRI implementation under WQ_IMMEDIATE, so we know this works.
> If for some reason, it gets decided against simply adding one jiffy
> sleep, please let me know. I'll verify the operation and post a
> proper patch. That said, given that this prolly needs -stable
> backport and vmstat is likely to be the only user (busy loops are
> really rare in the kernel after all), I think the better approach
> would be reinstating the short sleep.

As already pointed out, I really detest a short sleep and would prefer a way to tell the WQ what we really need. vmstat is not the only user. OOM sysrq will need this special treatment as well. While zone_reclaimable can be fixed by an easy patch (http://lkml.kernel.org/r/201510212126.JIF90648.HOOFJVFQLMStOF%40I-love.SAKURA.ne.jp) which is perfectly suited for the stable backport, OOM sysrq, resp. any sysrq which runs from the WQ context, should be as robust as possible and shouldn't rely on all the code running from WQ context to issue a sleep to get unstuck. So I definitely support something like this patch.

I am still not sure whether other WQ_MEM_RECLAIM users need this flag as well because I am not familiar with their implementations, but at least vmstat and sysrq should use it, and it should be safe to do so without risk of breaking anything AFAICS.

Thanks!
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Tue 27-10-15 19:55:06, Tejun Heo wrote:
> On Tue, Oct 27, 2015 at 10:22:31AM +0100, Michal Hocko wrote:
> ...
> > stable kernels without causing any other regressions. 2) is the way
> > to move forward for next kernels and we should really think whether
> > WQ_MEM_RECLAIM should imply also WQ_HIGHPRI by default. If there is a
> > general consensus that there are legitimate WQ_MEM_RECLAIM users which
> > can do without the other flag then I am perfectly OK to use it for
> > vmstat and oom sysrq dedicated workqueues.
>
> I don't think flagging these things is a good approach. These are too
> easy to miss. If this is a problem which needs to be solved, which
> I'm not convinced it is at this point, the right thing to do would be
> doing stall detection and kicking the next work item automatically.

To be honest, I do not really care whether this gets "fixed" in the stall detection code or by making WQ_MEM_RECLAIM flag a special behavior implicitly. All I would like to see is a guarantee that such workqueues are not staying behind just because all current workers are in the allocator. Adding artificial schedule_timeouts in the allocator is a fragile way to work around the issue.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Tue, Oct 27, 2015 at 08:07:38PM +0900, Tetsuo Handa wrote:
> Can't we have a waitqueue like
> http://lkml.kernel.org/r/201510142121.ide86954.sovoffqofmj...@i-love.sakura.ne.jp
> ?

There's no reason to complicate it. It wouldn't buy anything meaningful. Can we please stop trying to solve a non-existent problem?

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Michal Hocko wrote:
> > On Fri, Oct 23, 2015 at 01:11:45PM +0200, Michal Hocko wrote:
> > > > The problem here is not lack
> > > > of execution resource but concurrency management misunderstanding the
> > > > situation.
> > >
> > > And this sounds like a bug to me.
> >
> > I don't know. It can be argued either way, the other direction being a
> > kernel thread going RUNNING non-stop is buggy. Given how this has
> > been a complete non-issue for all the years, I'm not sure how useful
> > plugging this is.
>
> Well, I guess we haven't noticed because this is a pathological case. It
> also triggers OOM livelocks which were not reported in the past either.
> You do not reach this state normally unless you really _want_ to kill your
> machine

I don't think we can say this is a pathological case. Customers' servers might have hit this state. We have no code for warning about this state.

> And vmstat is not the only instance. E.g. sysrq oom trigger is known
> to stay behind in similar cases. It should be changed to a dedicated
> WQ_MEM_RECLAIM wq and it would require runnable item guarantee as well.

Well, this seems to be the cause of SysRq-f being unresponsive...
http://lkml.kernel.org/r/201411231349.cag78628.vfqfotosfjm...@i-love.sakura.ne.jp

Picking up from http://lkml.kernel.org/r/201506112212.jag26531.flsvfmoqjot...@i-love.sakura.ne.jp
--
[  515.536393] Showing busy workqueues and worker pools:
[  515.538185] workqueue events: flags=0x0
[  515.539758]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=8/256
[  515.541872]     pending: vmpressure_work_fn, console_callback, vmstat_update, flush_to_ldisc, push_to_pool, moom_callback, sysrq_reinject_alt_sysrq, fb_deferred_io_work
[  515.546684] workqueue events_power_efficient: flags=0x80
[  515.548589]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=2/256
[  515.550829]     pending: neigh_periodic_work, check_lifetime
[  515.552884] workqueue events_freezable_power_: flags=0x84
[  515.554742]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  515.556846]     in-flight: 3837:disk_events_workfn
[  515.558665] workqueue writeback: flags=0x4e
[  515.560291]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[  515.562271]     in-flight: 3812:bdi_writeback_workfn bdi_writeback_workfn
[  515.564544] workqueue xfs-data/sda1: flags=0xc
[  515.566265]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
[  515.568359]     in-flight: 374(RESCUER):xfs_end_io, 3759:xfs_end_io, 26:xfs_end_io, 3836:xfs_end_io
[  515.571018]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[  515.573113]     in-flight: 179:xfs_end_io
[  515.574782] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=4 idle: 3790 237 3820
[  515.577230] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=5 manager: 219
[  515.579488] pool 16: cpus=0-7 flags=0x4 nice=0 workers=3 idle: 356 357
--
We want immediate execution guarantee for not only vmstat_update and moom_callback but also vmstat_shepherd and console_callback?

> > > Don't we have some IO related paths which would suffer from the same
> > > problem. I haven't checked all the WQ_MEM_RECLAIM users but from the
> > > name I would expect they _do_ participate in the reclaim and so they
> > > should be able to make a progress.
> > > Now if your new IMMEDIATE flag will
> >
> > Seriously, nobody goes full-on RUNNING.
>
> Looping with cond_resched seems like general pattern in the kernel when
> there is no clear source to wait for. We have io_schedule when we know
> we should wait for IO (in case of congestion) but this is not necessarily
> the case - as you can see here. What should we wait for? A short nap
> without actually waiting on anything sounds like a dirty workaround to
> me.

Can't we have a waitqueue like http://lkml.kernel.org/r/201510142121.ide86954.sovoffqofmj...@i-love.sakura.ne.jp ?
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Tue, Oct 27, 2015 at 10:22:31AM +0100, Michal Hocko wrote:
...
> stable kernels without causing any other regressions. 2) is the way
> to move forward for next kernels and we should really think whether
> WQ_MEM_RECLAIM should imply also WQ_HIGHPRI by default. If there is a
> general consensus that there are legitimate WQ_MEM_RECLAIM users which
> can do without the other flag then I am perfectly OK to use it for
> vmstat and oom sysrq dedicated workqueues.

I don't think flagging these things is a good approach. These are too easy to miss. If this is a problem which needs to be solved, which I'm not convinced it is at this point, the right thing to do would be doing stall detection and kicking the next work item automatically.

Thanks.

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, Michal.

On Tue, Oct 27, 2015 at 10:16:03AM +0100, Michal Hocko wrote:
> > Seriously, nobody goes full-on RUNNING.
>
> Looping with cond_resched seems like general pattern in the kernel when
> there is no clear source to wait for. We have io_schedule when we know
> we should wait for IO (in case of congestion) but this is not necessarily
> the case - as you can see here. What should we wait for? A short nap
> without actually waiting on anything sounds like a dirty workaround to
> me.

It's one thing to do cond_resched() in long loops to avoid long priority inversions and another to indefinitely loop without making any difference.

> > > guarantee that then I would argue that it should be implicit for
> > > WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would
> > > be a counter argument for doing that?
> >
> > Not serving any actual purpose and degrading execution behavior.
>
> I dunno, I am not familiar with WQ internals to see the risks but to me
> it sounds like WQ_MEM_RECLAIM gives an incorrect impression of safety
> wrt. memory pressure and as demonstrated it doesn't do that. Even if you

It generally does. This is an extremely rare corner case where an infinite loop w/o forward progress is introduced w/o the user being outright buggy.

> consider cond_resched behavior of the page allocator as bug we should be
> able to handle this gracefully.

We can argue this back and forth forever but we'll either need to special-case it (be it a short sleep or a special flag) or implement a rather complex detection logic which is dubious in its practical usefulness. It's a trade-off, and given the circumstances adding a short sleep looks like a reasonable one to me. If this is more common, we definitely wanna go for automatic detection.

Thanks.

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Sun 25-10-15 19:52:59, Tetsuo Handa wrote:
[...]
> Three approaches are proposed for fixing this silent livelock problem.
>
> (1) Use zone_page_state_snapshot() instead of zone_page_state()
>     when doing zone_reclaimable() checks. This approach is clear,
>     straightforward and easy to backport. So far I cannot reproduce
>     this livelock using this change. But there might be more locations
>     which should use zone_page_state_snapshot().
>
> (2) Use a dedicated workqueue for vmstat_update item which is guaranteed
>     to be processed immediately. So far I cannot reproduce this livelock
>     using a dedicated workqueue created with WQ_MEM_RECLAIM|WQ_HIGHPRI
>     (patch proposed by Christoph Lameter). But according to Tejun Heo,
>     if we want to guarantee that nobody can reproduce this livelock, we
>     need to modify workqueue API because commit 3270476a6c0c ("workqueue:
>     reimplement WQ_HIGHPRI using a separate worker_pool") which went to
>     Linux 3.6 lost the guarantee.
>
> (3) Use a !TASK_RUNNING sleep inside page allocator side. This approach
>     is easy to backport. So far I cannot reproduce this livelock using
>     this approach. And I think that nobody can reproduce this livelock
>     because this changes the page allocator to obey the workqueue's
>     expectations. Even if we leave this livelock problem aside, not
>     entering into !TASK_RUNNING state for too long is an exclusive
>     occupation of workqueue which will make other items in the workqueue
>     needlessly deferred. We don't need to defer other items which do not
>     invoke a __GFP_WAIT allocation.
>
> This patch does approach (3), by inserting an uninterruptible sleep into
> page allocator side before retrying, in order to make sure that other
> workqueue items (especially vmstat_update item) are given a chance to be
> processed.
>
> Although a different problem, by using approach (3), we can alleviate
> needlessly burning CPU cycles even when we hit OOM-killer livelock problem
> (hang up after the OOM-killer messages are printed because the OOM victim
> cannot terminate due to dependency).

I really dislike this approach. Waiting without having an event to wait for is just too ugly. I think 1) is easiest to backport to stable kernels without causing any other regressions. 2) is the way to move forward for next kernels and we should really think whether WQ_MEM_RECLAIM should imply also WQ_HIGHPRI by default. If there is a general consensus that there are legitimate WQ_MEM_RECLAIM users which can do without the other flag then I am perfectly OK to use it for vmstat and oom sysrq dedicated workqueues.

> Signed-off-by: Tetsuo Handa
[...]
-- 
Michal Hocko
SUSE Labs
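For reference, approach (2) above would amount to something like the following sketch for mm/vmstat.c. The workqueue name, helper name, and exact flags are assumptions drawn from the discussion (Christoph Lameter's WQ_MEM_RECLAIM|WQ_HIGHPRI proposal), not a merged patch.

```c
/*
 * Hypothetical sketch of approach (2): give vmstat_update its own
 * rescuer-backed workqueue so it cannot be starved behind workers
 * that are busy-looping in the page allocator.
 */
static struct workqueue_struct *vmstat_wq;	/* assumed name */

static int __init setup_vmstat_wq(void)
{
	/* WQ_MEM_RECLAIM guarantees a rescuer thread; WQ_HIGHPRI is the
	 * part whose immediate-execution guarantee is disputed in this
	 * thread since 3270476a6c0c */
	vmstat_wq = alloc_workqueue("vmstat",
				    WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
	return vmstat_wq ? 0 : -ENOMEM;
}

/* callers would then queue the per-cpu item on vmstat_wq instead of
 * the shared system workqueue, e.g.:
 *	queue_delayed_work_on(cpu, vmstat_wq,
 *			      &per_cpu(vmstat_work, cpu),
 *			      round_jiffies_relative(sysctl_stat_interval));
 */
```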
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Sat 24-10-15 03:21:09, Tejun Heo wrote:
> Hello,
>
> On Fri, Oct 23, 2015 at 01:11:45PM +0200, Michal Hocko wrote:
> > > The problem here is not lack
> > > of execution resource but concurrency management misunderstanding the
> > > situation.
> >
> > And this sounds like a bug to me.
>
> I don't know. It can be argued either way, the other direction being a
> kernel thread going RUNNING non-stop is buggy. Given how this has
> been a complete non-issue for all the years, I'm not sure how useful
> plugging this is.

Well, I guess we haven't noticed because this is a pathological case. It also triggers OOM livelocks which were not reported in the past either. You do not reach this state normally unless you really _want_ to kill your machine.

And vmstat is not the only instance. E.g. sysrq oom trigger is known to stay behind in similar cases. It should be changed to a dedicated WQ_MEM_RECLAIM wq and it would require a runnable item guarantee as well.

> > Don't we have some IO related paths which would suffer from the same
> > problem. I haven't checked all the WQ_MEM_RECLAIM users but from the
> > name I would expect they _do_ participate in the reclaim and so they
> > should be able to make a progress. Now if your new IMMEDIATE flag will
>
> Seriously, nobody goes full-on RUNNING.

Looping with cond_resched seems like a general pattern in the kernel when there is no clear source to wait for. We have io_schedule when we know we should wait for IO (in case of congestion) but this is not necessarily the case - as you can see here. What should we wait for? A short nap without actually waiting on anything sounds like a dirty workaround to me.

> > guarantee that then I would argue that it should be implicit for
> > WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would
> > be a counter argument for doing that?
>
> Not serving any actual purpose and degrading execution behavior.

I dunno, I am not familiar with WQ internals to see the risks, but to me it sounds like WQ_MEM_RECLAIM gives an incorrect impression of safety wrt. memory pressure and as demonstrated it doesn't do that. Even if you consider the cond_resched behavior of the page allocator as a bug we should be able to handle this gracefully.
-- 
Michal Hocko
SUSE Labs
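The disagreement above comes down to two loop shapes, sketched schematically below. The helper `reclaim_made_progress()` is invented for illustration; the point is only the scheduling state of the looping worker.

```c
/* (a) the pattern under discussion: cond_resched() never leaves
 * TASK_RUNNING, so workqueue concurrency management considers the
 * worker busy and will not start the next work item (e.g. the
 * pending vmstat_update) on that pool */
while (!reclaim_made_progress())	/* hypothetical helper */
	cond_resched();

/* (b) the proposed workaround: a one-jiffy !TASK_RUNNING sleep lets
 * the per-pool idle accounting fire and another worker run, at the
 * cost of an artificial delay in the allocator */
while (!reclaim_made_progress())
	schedule_timeout_uninterruptible(1);
```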
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, Michal. On Tue, Oct 27, 2015 at 10:16:03AM +0100, Michal Hocko wrote: > > Seriously, nobody goes full-on RUNNING. > > Looping with cond_resched seems like general pattern in the kernel when > there is no clear source to wait for. We have io_schedule when we know > we should wait for IO (in case of congestion) but this is not necessarily > the case - as you can see here. What should we wait for? A short nap > without actually waiting on anything sounds like a dirty workaround to > me. It's one thing to do cond_resched() in long loops to avoid long priority inversions and another to indefinitely loop without making any difference. > > > guarantee that then I would argue that it should be implicit for > > > WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would > > > be a counter argument for doing that? > > > > Not serving any actual purpose and degrading execution behavior. > > I dunno, I am not familiar with WQ internals to see the risks but to me > it sounds like WQ_MEM_RECLAIM gives an incorrect impression of safety > wrt. memory pressure and as demonstrated it doesn't do that. Even if you It generally does. This is an extremely rare corner case where infinite loop w/o forward progress is introduce w/o the user being outright buggy. > consider cond_resched behavior of the page allocator as bug we should be > able to handle this gracefully. We can argue this back and forth forever but we'll either need to special case it (be it short sleep or a special flag) or implement a rather complex detection logic which will likely involve some level of complexity and is dubious in its practical usefulness. It's a trade-off and given the circumstances adding short sleep looks like a reasonable one to me. If this is more common, we definitely wanna go for automatic detection. Thanks. 
-- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Michal Hocko wrote: > > On Fri, Oct 23, 2015 at 01:11:45PM +0200, Michal Hocko wrote: > > > > The problem here is not lack > > > > of execution resource but concurrency management misunderstanding the > > > > situation. > > > > > > And this sounds like a bug to me. > > > > I don't know. It can be argued either way, the other direction being that a > > kernel thread going RUNNING non-stop is buggy. Given how this has > > been a complete non-issue for all the years, I'm not sure how useful > > plugging this is. > > Well, I guess we haven't noticed because this is a pathological case. It > also triggers OOM livelocks which were not reported in the past either. > You do not reach this state normally unless you really _want_ to kill your > machine I don't think we can say this is a pathological case. Customers' servers might have hit this state. We have no code for warning about this state. > > And vmstat is not the only instance. E.g. sysrq oom trigger is known > to stay behind in similar cases. It should be changed to a dedicated > WQ_MEM_RECLAIM wq and it would require runnable item guarantee as well. > Well, this seems to be the cause of SysRq-f being unresponsive...
http://lkml.kernel.org/r/201411231349.cag78628.vfqfotosfjm...@i-love.sakura.ne.jp

Picking up from http://lkml.kernel.org/r/201506112212.jag26531.flsvfmoqjot...@i-love.sakura.ne.jp

--
[ 515.536393] Showing busy workqueues and worker pools:
[ 515.538185] workqueue events: flags=0x0
[ 515.539758]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=8/256
[ 515.541872]     pending: vmpressure_work_fn, console_callback, vmstat_update, flush_to_ldisc, push_to_pool, moom_callback, sysrq_reinject_alt_sysrq, fb_deferred_io_work
[ 515.546684] workqueue events_power_efficient: flags=0x80
[ 515.548589]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=2/256
[ 515.550829]     pending: neigh_periodic_work, check_lifetime
[ 515.552884] workqueue events_freezable_power_: flags=0x84
[ 515.554742]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 515.556846]     in-flight: 3837:disk_events_workfn
[ 515.558665] workqueue writeback: flags=0x4e
[ 515.560291]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[ 515.562271]     in-flight: 3812:bdi_writeback_workfn bdi_writeback_workfn
[ 515.564544] workqueue xfs-data/sda1: flags=0xc
[ 515.566265]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
[ 515.568359]     in-flight: 374(RESCUER):xfs_end_io, 3759:xfs_end_io, 26:xfs_end_io, 3836:xfs_end_io
[ 515.571018]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 515.573113]     in-flight: 179:xfs_end_io
[ 515.574782] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=4 idle: 3790 237 3820
[ 515.577230] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=5 manager: 219
[ 515.579488] pool 16: cpus=0-7 flags=0x4 nice=0 workers=3 idle: 356 357
--

We want immediate execution guarantee for not only vmstat_update and moom_callback but also vmstat_shepherd and console_callback? > > > Don't we have some IO related paths which would suffer from the same > > > problem. I haven't checked all the WQ_MEM_RECLAIM users but from the > > > name I would expect they _do_ participate in the reclaim and so they > > > should be able to make a progress.
Now if your new IMMEDIATE flag will > > > > Seriously, nobody goes full-on RUNNING. > > Looping with cond_resched seems like general pattern in the kernel when > there is no clear source to wait for. We have io_schedule when we know > we should wait for IO (in case of congestion) but this is not necessarily > the case - as you can see here. What should we wait for? A short nap > without actually waiting on anything sounds like a dirty workaround to > me. Can't we have a waitqueue like http://lkml.kernel.org/r/201510142121.ide86954.sovoffqofmj...@i-love.sakura.ne.jp ?
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Tue, Oct 27, 2015 at 08:07:38PM +0900, Tetsuo Handa wrote: > Can't we have a waitqueue like > http://lkml.kernel.org/r/201510142121.ide86954.sovoffqofmj...@i-love.sakura.ne.jp > ? There's no reason to complicate it. It wouldn't buy anything meaningful. Can we please stop trying to solve a non-existent problem? -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Tue 27-10-15 19:55:06, Tejun Heo wrote: > On Tue, Oct 27, 2015 at 10:22:31AM +0100, Michal Hocko wrote: > ... > > stable kernels without causing any other regressions. 2) is the way > > to move forward for next kernels and we should really think whether > > WQ_MEM_RECLAIM should imply also WQ_HIGHPRI by default. If there is a > > general consensus that there are legitimate WQ_MEM_RECLAIM users which > > can do without the other flag then I am perfectly OK to use it for > > vmstat and oom sysrq dedicated workqueues. > > I don't think flagging these things is a good approach. These are too > easy to miss. If this is a problem which needs to be solved, which > I'm not convinced it is at this point, the right thing to do would be > doing stall detection and kicking the next work item automatically. To be honest, I do not really care whether this gets "fixed" in the stall detection code or by making WQ_MEM_RECLAIM flag a special behavior implicitly. All I would like to see is a guarantee that such workqueues are not staying behind just because all current workers are in the allocator. Adding artificial schedule_timeouts in the allocator is a fragile way to work around the issue. -- Michal Hocko SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, On Sun, Oct 25, 2015 at 07:52:59PM +0900, Tetsuo Handa wrote: ... > This means that any kernel code which invokes a __GFP_WAIT allocation > might fail to do (4) when invoked via workqueue, regardless of flags > passed to alloc_workqueue()? Sounds that way and yeah (3) should technically be okay and that's why HIGHPRI was implemented the way it was at the beginning; however, in practice, this is the first time it's noticeable in all the years. I think it comes down to the fact that there just aren't many places which need such looping behavior and even in those places it's often very undesirable to busy-loop while not making forward-progress (and if forward-progress is being made, it won't be indefinite). > I think that inserting a short sleep into page allocator is better > because the vmstat_update fix will not require workqueue tweaks if > we sleep inside page allocator. Also, from the point of view of > protecting page allocator from going unresponsive when hundreds of tasks > started busy-waiting at __alloc_pages_slowpath() because we can observe > that XXX value in the "MemAlloc-Info: XXX stalling task," line grows > when we are unable to make forward progress. This looks good to me too; however, it still needs a dedicated workqueue with WQ_MEM_RECLAIM set. That deadlock probably is very unlikely as the side effect of vmstat failing to execute due to worker exhaustion is more memory reclaim but it still is theoretically possible and it could just be that it happens at low enough frequency that it hasn't been reported yet. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Tejun Heo wrote: > If this is an actual problem, a better approach would be something > which detects the stall condition and kicks off the next work item but > if we do that I think I'd still trigger a warning there. I don't > know. Don't go busy waiting in kernel. Busy waiting in the kernel refers to several cases. (1) Wait for something with interrupts disabled. (2) Wait for something with interrupts enabled but without calling cond_resched() etc. (3) Wait for something with interrupts enabled and with calling cond_resched() etc. (4) Wait for something with interrupts enabled and with calling schedule_timeout() etc. Kernel code tries to minimize (1). Kernel code does (2) if it is not allowed to sleep. But kernel code is allowed to do (3) if it is allowed to sleep, as long as cond_resched() is sometimes called. And currently the page allocator does (3). But kernel code invoked via workqueue is expected to do (4) rather than (3). This means that any kernel code which invokes a __GFP_WAIT allocation might fail to do (4) when invoked via workqueue, regardless of flags passed to alloc_workqueue()? Michal Hocko wrote: > On Fri 23-10-15 06:42:43, Tetsuo Handa wrote: > > Tejun Heo wrote: > > > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote: > > > > I am confused. What makes rescuer to not run? Nothing seems to be > > > > hogging CPUs, we are just out of workers which are looping in the > > > > allocator but that is preemptible context. > > > > > > It's concurrency management. Workqueue thinks that the pool is making > > > positive forward progress and doesn't schedule anything else for > > > execution while that work item is burning cpu cycles. > > > > Then, isn't below change easier to backport which will also alleviate > > needlessly burning CPU cycles? > > This is quite obscure. If the vmstat_update fix needs workqueue tweaks > as well then I would vote for your original patch which is clear, > straightforward and easy to backport.
I think that inserting a short sleep into the page allocator is better because the vmstat_update fix will not require workqueue tweaks if we sleep inside the page allocator. It also helps from the point of view of protecting the page allocator from going unresponsive when hundreds of tasks start busy-waiting at __alloc_pages_slowpath(), because we can observe that the XXX value in the "MemAlloc-Info: XXX stalling task," line grows when we are unable to make forward progress.

>From a2f34850c26b5bb124d44983f5a2020b51249d53 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa
Date: Sun, 25 Oct 2015 19:42:15 +0900
Subject: [PATCH] mm,page_alloc: Insert an uninterruptible sleep before retrying.

Since "struct zone"->vm_stat[] is an array of atomic_long_t, an effort to reduce the frequency of updating values in vm_stat[] is made by using the per-CPU variables "struct per_cpu_pageset"->vm_stat_diff[]. Values in vm_stat_diff[] are merged into vm_stat[] periodically using the vmstat_update workqueue item (struct delayed_work vmstat_work).

When a task attempts to allocate memory and reaches the direct reclaim path, shrink_zones() checks whether there are reclaimable pages by calling zone_reclaimable(). zone_reclaimable() makes its decision based on values in vm_stat[] by calling zone_page_state(). This is usually fine because values in vm_stat_diff[] are expected to be merged into vm_stat[] shortly. But the workqueue and the page allocator have different assumptions.

(A) The workqueue defers processing of other items unless the currently in-flight item enters the !TASK_RUNNING state.

(B) The page allocator never enters the !TASK_RUNNING state if there is nothing to reclaim. (The page allocator calls cond_resched() via wait_iff_congested(), but cond_resched() does not make the task enter the !TASK_RUNNING state.)

Therefore, if a workqueue item which is processed before the vmstat_update item got stuck inside a memory allocation request, values in vm_stat_diff[] cannot be merged into vm_stat[].
As a result, zone_reclaimable() continues using outdated vm_stat[] values and the task doing the direct reclaim path thinks that there are still reclaimable pages and therefore continues looping. The consequence is a silent livelock (hang up without any kernel messages) because the OOM killer will not be invoked. We can hit such a livelock by e.g. a disk_events_workfn workqueue item doing memory allocation from bio_copy_kern().

[ 255.054205] kworker/3:1 R running task045 2 0x0008
[ 255.056063] Workqueue: events_freezable_power_ disk_events_workfn
[ 255.057715] 88007f805680 88007c55f6d0 8116463d 88007c55f758
[ 255.059705] 88007f82b870 88007c55f6e0 811646be 88007c55f710
[ 255.061694] 811bdaf0 88007f82b870 0400
[ 255.063690] Call Trace:
[ 255.064664] [] ?
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, Tetsuo. On Fri, Oct 23, 2015 at 09:25:11PM +0900, Tetsuo Handa wrote: > WQ_MEM_RECLAIM only guarantees that a "struct task_struct" is preallocated > in order to avoid failing to allocate it on demand due to a GFP_KERNEL > allocation? Is this correct? > > WQ_CPU_INTENSIVE only guarantees that work items don't participate in > concurrency management in order to avoid failing to wake up a "struct > task_struct" which will process the work items? Is this correct? CPU_INTENSIVE avoids the tail end of concurrency management. The previous HIGHPRI or the posted IMMEDIATE avoids the head end. > Is Michal's question "does it make sense to use WQ_MEM_RECLAIM without > WQ_CPU_INTENSIVE"? In other words, any "struct task_struct" which calls > rescuer_thread() must imply WQ_CPU_INTENSIVE in order to avoid failing to > wake up due to being participated in concurrency management? If this is an actual problem, a better approach would be something which detects the stall condition and kicks off the next work item but if we do that I think I'd still trigger a warning there. I don't know. Don't go busy waiting in kernel. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, On Fri, Oct 23, 2015 at 01:11:45PM +0200, Michal Hocko wrote: > > The problem here is not lack > > of execution resource but concurrency management misunderstanding the > > situation. > > And this sounds like a bug to me. I don't know. It can be argued either way, the other direction being that a kernel thread going RUNNING non-stop is buggy. Given how this has been a complete non-issue for all the years, I'm not sure how useful plugging this is. > Don't we have some IO related paths which would suffer from the same > problem. I haven't checked all the WQ_MEM_RECLAIM users but from the > name I would expect they _do_ participate in the reclaim and so they > should be able to make a progress. Now if your new IMMEDIATE flag will Seriously, nobody goes full-on RUNNING. > guarantee that then I would argue that it should be implicit for > WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would > be a counter argument for doing that? Not serving any actual purpose and degrading execution behavior. Thanks. -- tejun
Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On Fri, 23 Oct 2015, Sergey Senozhatsky wrote: > by the way, tick_nohz_stop_sched_tick() receives cpu from > __tick_nohz_idle_enter(). > do you want to pass it to quiet_vmstat()? No this is quite wrong at this point. quiet_vmstat() needs to be called from the cpu going into idle state.
Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On (10/23/15 09:12), Christoph Lameter wrote: [..] > > > + if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off)) > > > + cancel_delayed_work(this_cpu_ptr(&vmstat_work)); > > > > shouldn't preemption be disabled for smp_processor_id() here? > > Preemption is disabled when quiet_vmstat() is called. > cond_resched()

[ 29.607725] BUG: sleeping function called from invalid context at mm/vmstat.c:487
[ 29.607729] in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/7
[ 29.607731] no locks held by swapper/7/0.
[ 29.607732] irq event stamp: 48932
[ 29.607733] hardirqs last enabled at (48931): [] _raw_spin_unlock_irq+0x2c/0x37
[ 29.607739] hardirqs last disabled at (48932): [] tick_nohz_idle_enter+0x3c/0x5f
[ 29.607743] softirqs last enabled at (48924): [] __do_softirq+0x2bb/0x3a9
[ 29.607747] softirqs last disabled at (48893): [] irq_exit+0x41/0x95
[ 29.607752] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.3.0-rc6-next-20151022-dbg-3-g01184ff-dirty #261
[ 29.607754] 88041dae7da0 811dd4f3 88041dacd100
[ 29.607756] 88041dae7dc8 8105f144 8169f800
[ 29.607759] 0007 88041dae7e70 811040b1 0002
[ 29.607761] Call Trace:
[ 29.607767] [] dump_stack+0x4b/0x63
[ 29.607770] [] ___might_sleep+0x1e7/0x1ee
[ 29.607773] [] refresh_cpu_vm_stats+0x8b/0xb5
[ 29.607776] [] quiet_vmstat+0x3a/0x41
[ 29.607778] [] __tick_nohz_idle_enter+0x292/0x410
[ 29.607781] [] tick_nohz_idle_enter+0x57/0x5f
[ 29.607784] [] cpu_startup_entry+0x36/0x330
[ 29.607788] [] start_secondary+0xf3/0xf6

by the way, tick_nohz_stop_sched_tick() receives cpu from __tick_nohz_idle_enter(). do you want to pass it to quiet_vmstat()?

 	if (!ts->tick_stopped) {
 		nohz_balance_enter_idle(cpu);
-		quiet_vmstat();
+		quiet_vmstat(cpu);
 		calc_load_enter_idle();

-ss
Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On Fri, 23 Oct 2015, Sergey Senozhatsky wrote: > On (10/23/15 06:43), Christoph Lameter wrote: > > Is this ok? > > kernel/sched/loadavg.c: In function ‘calc_load_enter_idle’: > kernel/sched/loadavg.c:195:2: error: implicit declaration of function > ‘quiet_vmstat’ [-Werror=implicit-function-declaration] > quiet_vmstat(); > ^ Oww... Not good to do that in the scheduler. Ok, new patch follows that does the call from tick_nohz_stop_sched_tick(). Hopefully that is the right location to call quiet_vmstat(). > > + if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off)) > > + cancel_delayed_work(this_cpu_ptr(&vmstat_work)); > > shouldn't preemption be disabled for smp_processor_id() here? Preemption is disabled when quiet_vmstat() is called.

Subject: Fix vmstat: make vmstat_updater deferrable again and shut down on idle V2

V1->V2 - Call vmstat_quiet from tick_nohz_stop_sched_tick() instead.

Currently the vmstat updater is not deferrable as a result of commit ba4877b9ca51f80b5d30f304a46762f0509e1635. This in turn can cause multiple interruptions of the applications because the vmstat updater may run at different times than tick processing. No good.

Make vmstat_update deferrable again and provide a function that shuts down the vmstat updater when we go idle by folding the differentials. Shut it down from the load average calculation logic introduced by nohz.

Note that the shepherd thread will continue scanning the differentials from another processor and will reenable the vmstat workers if it detects any changes.

Fixes: ba4877b9ca51f80b5d30f304a46762f0509e1635 (do not use deferrable delay)
Signed-off-by: Christoph Lameter

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1395,6 +1395,20 @@ static void vmstat_update(struct work_st
 }
 
 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
@@ -1426,7 +1440,7 @@ static bool need_update(int cpu)
  */
 static void vmstat_shepherd(struct work_struct *w);
 
-static DECLARE_DELAYED_WORK(shepherd, vmstat_shepherd);
+static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);
 
 static void vmstat_shepherd(struct work_struct *w)
 {
Index: linux/include/linux/vmstat.h
===================================================================
--- linux.orig/include/linux/vmstat.h
+++ linux/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }
 static inline void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) { }
 
Index: linux/kernel/time/tick-sched.c
===================================================================
--- linux.orig/kernel/time/tick-sched.c
+++ linux/kernel/time/tick-sched.c
@@ -667,6 +667,7 @@ static ktime_t tick_nohz_stop_sched_tick
 	 */
 	if (!ts->tick_stopped) {
 		nohz_balance_enter_idle(cpu);
+		quiet_vmstat();
 		calc_load_enter_idle();
 
 		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Michal Hocko wrote: > On Fri 23-10-15 19:36:30, Tejun Heo wrote: > > Hello, Michal. > > > > On Fri, Oct 23, 2015 at 10:33:16AM +0200, Michal Hocko wrote: > > > Ohh, OK I can see wq_worker_sleeping now. I've missed your point in > > > other email, sorry about that. But now I am wondering whether this > > > is an intended behavior. The documentation says: > > > > This is. > > > > > WQ_MEM_RECLAIM > > > > > > All wq which might be used in the memory reclaim paths _MUST_ > > > have this flag set. The wq is guaranteed to have at least one > > > execution context regardless of memory pressure. > > > > > > Which doesn't seem to be true currently, right? Now I can see your patch > > > > It is true. > > > > > to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users > > > could do without WQ_IMMEDIATE? I mean all current workers might be > > > looping in the page allocator and it seems possible that WQ_MEM_RECLAIM > > > work items might be waiting behind them so they cannot help to relieve > > > the memory pressure. This doesn't sound right to me. Or I am completely > > > confused and still fail to understand what is WQ_MEM_RECLAIM supposed to > > > be used for. > > > > It guarantees that there always is enough execution resource to > > execute a work item from that workqueue. > > OK, strictly speaking the rescuer is there but it is kind of pointless > if it doesn't fire up and do a work. > > > The problem here is not lack > > of execution resource but concurrency management misunderstanding the > > situation. > > And this sounds like a bug to me. > > > This also can be fixed by teaching concurrency management > > to be a bit smarter - e.g. 
if a work item is burning a lot of CPU > > cycles continuously or pool hasn't finished a work item over a certain > > amount of time, automatically ignore the in-flight work item for the > > purpose of concurrency management; however, this sort of inter-work > > item busy waits are so extremely rare and undesirable that I'm not > > sure the added complexity would be worthwhile. > > Don't we have some IO related paths which would suffer from the same > problem. I haven't checked all the WQ_MEM_RECLAIM users but from the > name I would expect they _do_ participate in the reclaim and so they > should be able to make a progress. Now if your new IMMEDIATE flag will > guarantee that then I would argue that it should be implicit for > WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would > be a counter argument for doing that? WQ_MEM_RECLAIM only guarantees that a "struct task_struct" is preallocated in order to avoid failing to allocate it on demand due to a GFP_KERNEL allocation? Is this correct? WQ_CPU_INTENSIVE only guarantees that work items don't participate in concurrency management in order to avoid failing to wake up a "struct task_struct" which will process the work items? Is this correct? Is Michal's question "does it make sense to use WQ_MEM_RECLAIM without WQ_CPU_INTENSIVE"? In other words, any "struct task_struct" which calls rescuer_thread() must imply WQ_CPU_INTENSIVE in order to avoid failing to wake up due to participating in concurrency management?
Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On (10/23/15 06:43), Christoph Lameter wrote: > Is this ok? kernel/sched/loadavg.c: In function ‘calc_load_enter_idle’: kernel/sched/loadavg.c:195:2: error: implicit declaration of function ‘quiet_vmstat’ [-Werror=implicit-function-declaration] quiet_vmstat(); ^ > Subject: Fix vmstat: make vmstat_updater deferrable again and shut down on > idle > > Currently the vmstat updater is not deferrable as a result of commit > ba4877b9ca51f80b5d30f304a46762f0509e1635. This in turn can cause multiple > interruptions of the applications because the vmstat updater may run at > different times than tick processing. No good. > > Make vmstate_update deferrable again and provide a function that > shuts down the vmstat updater when we go idle by folding the differentials. > Shut it down from the load average calculation logic introduced by nohz. > > Note that the shepherd thread will continue scanning the differentials > from another processor and will reenable the vmstat workers if it > detects any changes. > > Fixes: ba4877b9ca51f80b5d30f304a46762f0509e1635 (do not use deferrable delay) > Signed-off-by: Christoph Lameter > > Index: linux/mm/vmstat.c > === > --- linux.orig/mm/vmstat.c > +++ linux/mm/vmstat.c > @@ -1395,6 +1395,20 @@ static void vmstat_update(struct work_st > } > > /* > + * Switch off vmstat processing and then fold all the remaining differentials > + * until the diffs stay at zero. The function is used by NOHZ and can only be > + * invoked when tick processing is not active. > + */ > +void quiet_vmstat(void) > +{ > + do { > + if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off)) > + cancel_delayed_work(this_cpu_ptr(_work)); shouldn't preemption be disable for smp_processor_id() here? -ss -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On Fri, 23 Oct 2015, Michal Hocko wrote:

> On Thu 22-10-15 10:33:20, Christoph Lameter wrote:
> > Ok that also makes me rethink commit
> > ba4877b9ca51f80b5d30f304a46762f0509e1635 which seems to be a similar fix
> > this time related to idle mode not updating the counters.
> >
> > Could we fix that by folding the counters before going to idle mode?
>
> This would work as well.

Is this ok?

Subject: Fix vmstat: make vmstat_updater deferrable again and shut down on idle

Currently the vmstat updater is not deferrable as a result of commit
ba4877b9ca51f80b5d30f304a46762f0509e1635. This in turn can cause multiple
interruptions of the applications because the vmstat updater may run at
different times than tick processing. No good.

Make vmstat_update deferrable again and provide a function that
shuts down the vmstat updater when we go idle by folding the differentials.
Shut it down from the load average calculation logic introduced by nohz.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Fixes: ba4877b9ca51f80b5d30f304a46762f0509e1635 (do not use deferrable delay)
Signed-off-by: Christoph Lameter

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1395,6 +1395,20 @@ static void vmstat_update(struct work_st
 }

 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
@@ -1426,7 +1440,7 @@ static bool need_update(int cpu)
  */
 static void vmstat_shepherd(struct work_struct *w);

-static DECLARE_DELAYED_WORK(shepherd, vmstat_shepherd);
+static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);

 static void vmstat_shepherd(struct work_struct *w)
 {
Index: linux/include/linux/vmstat.h
===================================================================
--- linux.orig/include/linux/vmstat.h
+++ linux/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);

+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);

@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }

 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
Index: linux/kernel/sched/loadavg.c
===================================================================
--- linux.orig/kernel/sched/loadavg.c
+++ linux/kernel/sched/loadavg.c
@@ -191,6 +191,8 @@ void calc_load_enter_idle(void)
 		atomic_long_add(delta, &calc_load_idle[idx]);
 	}
+
+	/* Fold the current vmstat counters and disable vmstat updater */
+	quiet_vmstat();
 }

 void calc_load_exit_idle(void)
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 19:36:30, Tejun Heo wrote: > Hello, Michal. > > On Fri, Oct 23, 2015 at 10:33:16AM +0200, Michal Hocko wrote: > > Ohh, OK I can see wq_worker_sleeping now. I've missed your point in > > other email, sorry about that. But now I am wondering whether this > > is an intended behavior. The documentation says: > > This is. > > > WQ_MEM_RECLAIM > > > > All wq which might be used in the memory reclaim paths _MUST_ > > have this flag set. The wq is guaranteed to have at least one > > execution context regardless of memory pressure. > > > > Which doesn't seem to be true currently, right? Now I can see your patch > > It is true. > > > to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users > > could do without WQ_IMMEDIATE? I mean all current workers might be > > looping in the page allocator and it seems possible that WQ_MEM_RECLAIM > > work items might be waiting behind them so they cannot help to relieve > > the memory pressure. This doesn't sound right to me. Or I am completely > > confused and still fail to understand what is WQ_MEM_RECLAIM supposed to > > be used for. > > It guarantees that there always is enough execution resource to > execute a work item from that workqueue. OK, strictly speaking the rescuer is there but it is kind of pointless if it doesn't fire up and do a work. > The problem here is not lack > of execution resource but concurrency management misunderstanding the > situation. And this sounds like a bug to me. > This also can be fixed by teaching concurrency management > to be a bit smarter - e.g. if a work item is burning a lot of CPU > cycles continuously or pool hasn't finished a work item over a certain > amount of time, automatically ignore the in-flight work item for the > purpose of concurrency management; however, this sort of inter-work > item busy waits are so extremely rare and undesirable that I'm not > sure the added complexity would be worthwhile. 
Don't we have some IO related paths which would suffer from the same problem. I haven't checked all the WQ_MEM_RECLAIM users but from the name I would expect they _do_ participate in the reclaim and so they should be able to make a progress. Now if your new IMMEDIATE flag will guarantee that then I would argue that it should be implicit for WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would be a counter argument for doing that? -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri, Oct 23, 2015 at 10:36:12AM +0200, Michal Hocko wrote:
> If WQ_MEM_RECLAIM can really guarantee one worker as described in the
> documentation then I agree that fixing vmstat is a better fix. But that
> doesn't seem to be the case currently.

It does.

--
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, Michal.

On Fri, Oct 23, 2015 at 10:33:16AM +0200, Michal Hocko wrote:
> Ohh, OK I can see wq_worker_sleeping now. I've missed your point in
> other email, sorry about that. But now I am wondering whether this
> is an intended behavior. The documentation says:

This is.

> WQ_MEM_RECLAIM
>
>	All wq which might be used in the memory reclaim paths _MUST_
>	have this flag set. The wq is guaranteed to have at least one
>	execution context regardless of memory pressure.
>
> Which doesn't seem to be true currently, right? Now I can see your patch

It is true.

> to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users
> could do without WQ_IMMEDIATE? I mean all current workers might be
> looping in the page allocator and it seems possible that WQ_MEM_RECLAIM
> work items might be waiting behind them so they cannot help to relieve
> the memory pressure. This doesn't sound right to me. Or I am completely
> confused and still fail to understand what is WQ_MEM_RECLAIM supposed to
> be used for.

It guarantees that there always is enough execution resource to
execute a work item from that workqueue. The problem here is not lack
of execution resource but concurrency management misunderstanding the
situation. This also can be fixed by teaching concurrency management
to be a bit smarter - e.g. if a work item is burning a lot of CPU
cycles continuously or pool hasn't finished a work item over a certain
amount of time, automatically ignore the in-flight work item for the
purpose of concurrency management; however, this sort of inter-work
item busy waits are so extremely rare and undesirable that I'm not
sure the added complexity would be worthwhile.

Thanks.

--
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu 22-10-15 10:33:20, Christoph Lameter wrote:
> Ok that also makes me rethink commit
> ba4877b9ca51f80b5d30f304a46762f0509e1635 which seems to be a similar fix
> this time related to idle mode not updating the counters.
>
> Could we fix that by folding the counters before going to idle mode?

This would work as well.

--
Michal Hocko
SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 06:42:43, Tetsuo Handa wrote:
> Tejun Heo wrote:
> > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote:
> > > I am confused. What makes rescuer to not run? Nothing seems to be
> > > hogging CPUs, we are just out of workers which are looping in the
> > > allocator but that is preemptible context.
> >
> > It's concurrency management. Workqueue thinks that the pool is making
> > positive forward progress and doesn't schedule anything else for
> > execution while that work item is burning cpu cycles.
>
> Then, isn't the below change easier to backport, which will also alleviate
> needlessly burning CPU cycles?

This is quite obscure. If the vmstat_update fix needs workqueue tweaks as
well then I would vote for your original patch which is clear,
straightforward and easy to backport.

If WQ_MEM_RECLAIM can really guarantee one worker as described in the
documentation then I agree that fixing vmstat is a better fix. But that
doesn't seem to be the case currently.

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3385,6 +3385,7 @@ retry:
> 	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
> 		/* Wait for some write requests to complete then retry */
> 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
> +		schedule_timeout_uninterruptible(1);
> 		goto retry;
> 	}

--
Michal Hocko
SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 03:42:26, Tejun Heo wrote: > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote: > > I am confused. What makes rescuer to not run? Nothing seems to be > > hogging CPUs, we are just out of workers which are loopin in the > > allocator but that is preemptible context. > > It's concurrency management. Workqueue thinks that the pool is making > positive forward progress and doesn't schedule anything else for > execution while that work item is burning cpu cycles. Ohh, OK I can see wq_worker_sleeping now. I've missed your point in other email, sorry about that. But now I am wondering whether this is an intended behavior. The documentation says: WQ_MEM_RECLAIM All wq which might be used in the memory reclaim paths _MUST_ have this flag set. The wq is guaranteed to have at least one execution context regardless of memory pressure. Which doesn't seem to be true currently, right? Now I can see your patch to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users could do without WQ_IMMEDIATE? I mean all current workers might be looping in the page allocator and it seems possible that WQ_MEM_RECLAIM work items might be waiting behind them so they cannot help to relieve the memory pressure. This doesn't sound right to me. Or I am completely confused and still fail to understand what is WQ_MEM_RECLAIM supposed to be used for. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello,

On Fri, Oct 23, 2015 at 01:11:45PM +0200, Michal Hocko wrote:
> > The problem here is not lack
> > of execution resource but concurrency management misunderstanding the
> > situation.
>
> And this sounds like a bug to me.

I don't know. It can be argued either way, the other direction being that
a kernel thread going RUNNING non-stop is buggy. Given how this has been a
complete non-issue for all the years, I'm not sure how useful plugging
this is.

> Don't we have some IO related paths which would suffer from the same
> problem. I haven't checked all the WQ_MEM_RECLAIM users but from the
> name I would expect they _do_ participate in the reclaim and so they
> should be able to make a progress. Now if your new IMMEDIATE flag will

Seriously, nobody goes full-on RUNNING.

> guarantee that then I would argue that it should be implicit for
> WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would
> be a counter argument for doing that?

Not serving any actual purpose and degrading execution behavior.

Thanks.

--
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, Tetsuo.

On Fri, Oct 23, 2015 at 09:25:11PM +0900, Tetsuo Handa wrote:
> WQ_MEM_RECLAIM only guarantees that a "struct task_struct" is preallocated
> in order to avoid failing to allocate it on demand due to a GFP_KERNEL
> allocation? Is this correct?
>
> WQ_CPU_INTENSIVE only guarantees that work items don't participate in
> concurrency management in order to avoid failing to wake up a "struct
> task_struct" which will process the work items? Is this correct?

CPU_INTENSIVE avoids the tail end of concurrency management. The
previous HIGHPRI or the posted IMMEDIATE avoids the head end.

> Is Michal's question "does it make sense to use WQ_MEM_RECLAIM without
> WQ_CPU_INTENSIVE"? In other words, any "struct task_struct" which calls
> rescuer_thread() must imply WQ_CPU_INTENSIVE in order to avoid failing to
> wake up due to being participated in concurrency management?

If this is an actual problem, a better approach would be something which
detects the stall condition and kicks off the next work item but if we do
that I think I'd still trigger a warning there. I don't know. Don't go
busy waiting in kernel.

Thanks.

--
tejun
Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On (10/23/15 09:12), Christoph Lameter wrote:
[..]
> > > +	if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
> > > +		cancel_delayed_work(this_cpu_ptr(&vmstat_work));
> >
> > shouldn't preemption be disable for smp_processor_id() here?
>
> Preemption is disabled when quiet_vmstat() is called.

cond_resched()

[   29.607725] BUG: sleeping function called from invalid context at mm/vmstat.c:487
[   29.607729] in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/7
[   29.607731] no locks held by swapper/7/0.
[   29.607732] irq event stamp: 48932
[   29.607733] hardirqs last enabled at (48931): [] _raw_spin_unlock_irq+0x2c/0x37
[   29.607739] hardirqs last disabled at (48932): [] tick_nohz_idle_enter+0x3c/0x5f
[   29.607743] softirqs last enabled at (48924): [] __do_softirq+0x2bb/0x3a9
[   29.607747] softirqs last disabled at (48893): [] irq_exit+0x41/0x95
[   29.607752] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.3.0-rc6-next-20151022-dbg-3-g01184ff-dirty #261
[   29.607754]  88041dae7da0 811dd4f3 88041dacd100
[   29.607756]  88041dae7dc8 8105f144 8169f800
[   29.607759]  0007 88041dae7e70 811040b1 0002
[   29.607761] Call Trace:
[   29.607767]  [] dump_stack+0x4b/0x63
[   29.607770]  [] ___might_sleep+0x1e7/0x1ee
[   29.607773]  [] refresh_cpu_vm_stats+0x8b/0xb5
[   29.607776]  [] quiet_vmstat+0x3a/0x41
[   29.607778]  [] __tick_nohz_idle_enter+0x292/0x410
[   29.607781]  [] tick_nohz_idle_enter+0x57/0x5f
[   29.607784]  [] cpu_startup_entry+0x36/0x330
[   29.607788]  [] start_secondary+0xf3/0xf6

by the way, tick_nohz_stop_sched_tick() receives cpu from
__tick_nohz_idle_enter(). do you want to pass it to quiet_vmstat()?

	if (!ts->tick_stopped) {
		nohz_balance_enter_idle(cpu);
-		quiet_vmstat();
+		quiet_vmstat(cpu);
		calc_load_enter_idle();

-ss
Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On Fri, 23 Oct 2015, Sergey Senozhatsky wrote:
> by the way, tick_nohz_stop_sched_tick() receives cpu from
> __tick_nohz_idle_enter().
> do you want to pass it to quiet_vmstat()?

No, this is quite wrong at this point. quiet_vmstat() needs to be called
from the cpu going into idle state.
Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
On Fri, 23 Oct 2015, Sergey Senozhatsky wrote:

> On (10/23/15 06:43), Christoph Lameter wrote:
> > Is this ok?
>
> kernel/sched/loadavg.c: In function ‘calc_load_enter_idle’:
> kernel/sched/loadavg.c:195:2: error: implicit declaration of function
> ‘quiet_vmstat’ [-Werror=implicit-function-declaration]
>   quiet_vmstat();
>   ^

Oww... Not good to do that in the scheduler. Ok, new patch follows that
does the call from tick_nohz_stop_sched_tick(). Hopefully that is the
right location to call quiet_vmstat().

> > +	if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
> > +		cancel_delayed_work(this_cpu_ptr(&vmstat_work));
>
> shouldn't preemption be disable for smp_processor_id() here?

Preemption is disabled when quiet_vmstat() is called.

Subject: Fix vmstat: make vmstat_updater deferrable again and shut down on idle V2

V1->V2
- Call quiet_vmstat() from tick_nohz_stop_sched_tick() instead.

Currently the vmstat updater is not deferrable as a result of commit
ba4877b9ca51f80b5d30f304a46762f0509e1635. This in turn can cause multiple
interruptions of the applications because the vmstat updater may run at
different times than tick processing. No good.

Make vmstat_update deferrable again and provide a function that
shuts down the vmstat updater when we go idle by folding the differentials.
Shut it down from the load average calculation logic introduced by nohz.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Fixes: ba4877b9ca51f80b5d30f304a46762f0509e1635 (do not use deferrable delay)
Signed-off-by: Christoph Lameter

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1395,6 +1395,20 @@ static void vmstat_update(struct work_st
 }

 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
@@ -1426,7 +1440,7 @@ static bool need_update(int cpu)
  */
 static void vmstat_shepherd(struct work_struct *w);

-static DECLARE_DELAYED_WORK(shepherd, vmstat_shepherd);
+static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);

 static void vmstat_shepherd(struct work_struct *w)
 {
Index: linux/include/linux/vmstat.h
===================================================================
--- linux.orig/include/linux/vmstat.h
+++ linux/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);

+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);

@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }

 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
Index: linux/kernel/time/tick-sched.c
===================================================================
--- linux.orig/kernel/time/tick-sched.c
+++ linux/kernel/time/tick-sched.c
@@ -667,6 +667,7 @@ static ktime_t tick_nohz_stop_sched_tick
 	 */
 	if (!ts->tick_stopped) {
 		nohz_balance_enter_idle(cpu);
+		quiet_vmstat();
 		calc_load_enter_idle();

 		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable()checks
On Fri, Oct 23, 2015 at 10:36:12AM +0200, Michal Hocko wrote: > If WQ_MEM_RECLAIM can really guarantee one worker as described in the > documentation then I agree that fixing vmstat is a better fix. But that > doesn't seem to be the case currently. It does. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 19:36:30, Tejun Heo wrote: > Hello, Michal. > > On Fri, Oct 23, 2015 at 10:33:16AM +0200, Michal Hocko wrote: > > Ohh, OK I can see wq_worker_sleeping now. I've missed your point in > > other email, sorry about that. But now I am wondering whether this > > is an intended behavior. The documentation says: > > This is. > > > WQ_MEM_RECLAIM > > > > All wq which might be used in the memory reclaim paths _MUST_ > > have this flag set. The wq is guaranteed to have at least one > > execution context regardless of memory pressure. > > > > Which doesn't seem to be true currently, right? Now I can see your patch > > It is true. > > > to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users > > could do without WQ_IMMEDIATE? I mean all current workers might be > > looping in the page allocator and it seems possible that WQ_MEM_RECLAIM > > work items might be waiting behind them so they cannot help to relieve > > the memory pressure. This doesn't sound right to me. Or I am completely > > confused and still fail to understand what is WQ_MEM_RECLAIM supposed to > > be used for. > > It guarantees that there always is enough execution resource to > execute a work item from that workqueue. OK, strictly speaking the rescuer is there but it is kind of pointless if it doesn't fire up and do a work. > The problem here is not lack > of execution resource but concurrency management misunderstanding the > situation. And this sounds like a bug to me. > This also can be fixed by teaching concurrency management > to be a bit smarter - e.g. if a work item is burning a lot of CPU > cycles continuously or pool hasn't finished a work item over a certain > amount of time, automatically ignore the in-flight work item for the > purpose of concurrency management; however, this sort of inter-work > item busy waits are so extremely rare and undesirable that I'm not > sure the added complexity would be worthwhile. 
Don't we have some IO-related paths which would suffer from the same problem? I haven't checked all the WQ_MEM_RECLAIM users but from the name I would expect they _do_ participate in the reclaim and so they should be able to make progress. Now if your new IMMEDIATE flag will guarantee that then I would argue that it should be implicit for WQ_MEM_RECLAIM, otherwise we always risk a similar situation. What would be a counter-argument for doing that? -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, Michal. On Fri, Oct 23, 2015 at 10:33:16AM +0200, Michal Hocko wrote: > Ohh, OK I can see wq_worker_sleeping now. I've missed your point in > other email, sorry about that. But now I am wondering whether this > is an intended behavior. The documentation says: This is. > WQ_MEM_RECLAIM > > All wq which might be used in the memory reclaim paths _MUST_ > have this flag set. The wq is guaranteed to have at least one > execution context regardless of memory pressure. > > Which doesn't seem to be true currently, right? Now I can see your patch It is true. > to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users > could do without WQ_IMMEDIATE? I mean all current workers might be > looping in the page allocator and it seems possible that WQ_MEM_RECLAIM > work items might be waiting behind them so they cannot help to relieve > the memory pressure. This doesn't sound right to me. Or I am completely > confused and still fail to understand what is WQ_MEM_RECLAIM supposed to > be used for. It guarantees that there always is enough execution resource to execute a work item from that workqueue. The problem here is not lack of execution resource but concurrency management misunderstanding the situation. This also can be fixed by teaching concurrency management to be a bit smarter - e.g. if a work item is burning a lot of CPU cycles continuously or pool hasn't finished a work item over a certain amount of time, automatically ignore the in-flight work item for the purpose of concurrency management; however, this sort of inter-work item busy waits are so extremely rare and undesirable that I'm not sure the added complexity would be worthwhile. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Michal Hocko wrote: > On Fri 23-10-15 19:36:30, Tejun Heo wrote: > > Hello, Michal. > > > > On Fri, Oct 23, 2015 at 10:33:16AM +0200, Michal Hocko wrote: > > > Ohh, OK I can see wq_worker_sleeping now. I've missed your point in > > > other email, sorry about that. But now I am wondering whether this > > > is an intended behavior. The documentation says: > > > > This is. > > > > > WQ_MEM_RECLAIM > > > > > > All wq which might be used in the memory reclaim paths _MUST_ > > > have this flag set. The wq is guaranteed to have at least one > > > execution context regardless of memory pressure. > > > > > > Which doesn't seem to be true currently, right? Now I can see your patch > > > > It is true. > > > > > to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users > > > could do without WQ_IMMEDIATE? I mean all current workers might be > > > looping in the page allocator and it seems possible that WQ_MEM_RECLAIM > > > work items might be waiting behind them so they cannot help to relieve > > > the memory pressure. This doesn't sound right to me. Or I am completely > > > confused and still fail to understand what is WQ_MEM_RECLAIM supposed to > > > be used for. > > > > It guarantees that there always is enough execution resource to > > execute a work item from that workqueue. > > OK, strictly speaking the rescuer is there but it is kind of pointless > if it doesn't fire up and do a work. > > > The problem here is not lack > > of execution resource but concurrency management misunderstanding the > > situation. > > And this sounds like a bug to me. > > > This also can be fixed by teaching concurrency management > > to be a bit smarter - e.g. 
if a work item is burning a lot of CPU > > cycles continuously or pool hasn't finished a work item over a certain > > amount of time, automatically ignore the in-flight work item for the > > purpose of concurrency management; however, this sort of inter-work > > item busy waits are so extremely rare and undesirable that I'm not > > sure the added complexity would be worthwhile. > > Don't we have some IO-related paths which would suffer from the same > problem? I haven't checked all the WQ_MEM_RECLAIM users but from the > name I would expect they _do_ participate in the reclaim and so they > should be able to make progress. Now if your new IMMEDIATE flag will > guarantee that then I would argue that it should be implicit for > WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would > be a counter-argument for doing that? WQ_MEM_RECLAIM only guarantees that a "struct task_struct" is preallocated in order to avoid failing to allocate it on demand due to a GFP_KERNEL allocation? Is this correct? WQ_CPU_INTENSIVE only guarantees that work items don't participate in concurrency management in order to avoid failing to wake up a "struct task_struct" which will process the work items? Is this correct? Is Michal's question "does it make sense to use WQ_MEM_RECLAIM without WQ_CPU_INTENSIVE"? In other words, any "struct task_struct" which calls rescuer_thread() must imply WQ_CPU_INTENSIVE in order to avoid failing to wake up due to participating in concurrency management?
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 03:42:26, Tejun Heo wrote: > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote: > > I am confused. What makes the rescuer not run? Nothing seems to be > > hogging CPUs, we are just out of workers which are looping in the > > allocator but that is preemptible context. > > It's concurrency management. Workqueue thinks that the pool is making > positive forward progress and doesn't schedule anything else for > execution while that work item is burning cpu cycles. Ohh, OK I can see wq_worker_sleeping now. I've missed your point in other email, sorry about that. But now I am wondering whether this is an intended behavior. The documentation says: WQ_MEM_RECLAIM All wq which might be used in the memory reclaim paths _MUST_ have this flag set. The wq is guaranteed to have at least one execution context regardless of memory pressure. Which doesn't seem to be true currently, right? Now I can see your patch to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users could do without WQ_IMMEDIATE? I mean all current workers might be looping in the page allocator and it seems possible that WQ_MEM_RECLAIM work items might be waiting behind them so they cannot help to relieve the memory pressure. This doesn't sound right to me. Or I am completely confused and still fail to understand what is WQ_MEM_RECLAIM supposed to be used for. -- Michal Hocko SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 06:42:43, Tetsuo Handa wrote: > Tejun Heo wrote: > > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote: > > > I am confused. What makes the rescuer not run? Nothing seems to be > > > hogging CPUs, we are just out of workers which are looping in the > > > allocator but that is preemptible context. > > > > It's concurrency management. Workqueue thinks that the pool is making > > positive forward progress and doesn't schedule anything else for > > execution while that work item is burning cpu cycles. > > Then, isn't below change easier to backport which will also alleviate > needlessly burning CPU cycles? This is quite obscure. If the vmstat_update fix needs workqueue tweaks as well then I would vote for your original patch which is clear, straightforward and easy to backport. If WQ_MEM_RECLAIM can really guarantee one worker as described in the documentation then I agree that fixing vmstat is a better fix. But that doesn't seem to be the case currently. > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3385,6 +3385,7 @@ retry: > ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) { > /* Wait for some write requests to complete then retry */ > wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50); > + schedule_timeout_uninterruptible(1); > goto retry; > } -- Michal Hocko SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu 22-10-15 10:33:20, Christoph Lameter wrote: > Ok that also makes me rethink commit > ba4877b9ca51f80b5d30f304a46762f0509e1635 which seems to be a similar fix > this time related to idle mode not updating the counters. > > Could we fix that by folding the counters before going to idle mode? This would work as well. -- Michal Hocko SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, So, something like the following. Just compile tested but this is essentially partial revert of 3270476a6c0c ("workqueue: reimplement WQ_HIGHPRI using a separate worker_pool") - resurrecting the old WQ_HIGHPRI implementation under WQ_IMMEDIATE, so we know this works. If for some reason, it gets decided against simply adding one jiffy sleep, please let me know. I'll verify the operation and post a proper patch. That said, given that this prolly needs -stable backport and vmstat is likely to be the only user (busy loops are really rare in the kernel after all), I think the better approach would be reinstating the short sleep. Thanks. --- include/linux/workqueue.h | 7 ++--- kernel/workqueue.c | 63 +++--- 2 files changed, 63 insertions(+), 7 deletions(-) --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -278,9 +278,10 @@ enum { WQ_UNBOUND = 1 << 1, /* not bound to any cpu */ WQ_FREEZABLE = 1 << 2, /* freeze during suspend */ WQ_MEM_RECLAIM = 1 << 3, /* may be used for memory reclaim */ - WQ_HIGHPRI = 1 << 4, /* high priority */ - WQ_CPU_INTENSIVE = 1 << 5, /* cpu intensive workqueue */ - WQ_SYSFS = 1 << 6, /* visible in sysfs, see wq_sysfs_register() */ + WQ_IMMEDIATE = 1 << 4, /* bypass concurrency management */ + WQ_HIGHPRI = 1 << 5, /* high priority */ + WQ_CPU_INTENSIVE = 1 << 6, /* cpu intensive workqueue */ + WQ_SYSFS = 1 << 7, /* visible in sysfs, see wq_sysfs_register() */ /* * Per-cpu workqueues are generally preferred because they tend to --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -68,6 +68,7 @@ enum { * attach_mutex to avoid changing binding state while * worker_attach_to_pool() is in progress. 
*/ + POOL_IMMEDIATE_PENDING = 1 << 0, /* WQ_IMMEDIATE items on queue */ POOL_DISASSOCIATED = 1 << 2, /* cpu can't serve workers */ /* worker flags */ @@ -731,7 +732,8 @@ static bool work_is_canceling(struct wor static bool __need_more_worker(struct worker_pool *pool) { - return !atomic_read(&pool->nr_running); + return !atomic_read(&pool->nr_running) || + (pool->flags & POOL_IMMEDIATE_PENDING); } /* @@ -757,7 +759,8 @@ static bool may_start_working(struct wor static bool keep_working(struct worker_pool *pool) { return !list_empty(&pool->worklist) && - atomic_read(&pool->nr_running) <= 1; + (atomic_read(&pool->nr_running) <= 1 || + (pool->flags & POOL_IMMEDIATE_PENDING)); } /* Do we need a new worker? Called from manager. */ @@ -1021,6 +1024,42 @@ static void move_linked_works(struct wor } /** + * pwq_determine_ins_pos - find insertion position + * @pwq: pwq a work is being queued for + * + * A work for @pwq is about to be queued on @pwq->pool, determine insertion + * position for the work. If @pwq is for IMMEDIATE wq, the work item is + * queued at the head of the queue but in FIFO order with respect to other + * IMMEDIATE work items; otherwise, at the end of the queue. This function + * also sets POOL_IMMEDIATE_PENDING flag to hint @pool that there are + * IMMEDIATE works pending. + * + * CONTEXT: + * spin_lock_irq(gcwq->lock). + * + * RETURNS: + * Pointer to insertion position. 
+ */ +static struct list_head *pwq_determine_ins_pos(struct pool_workqueue *pwq) +{ + struct worker_pool *pool = pwq->pool; + struct work_struct *twork; + + if (likely(!(pwq->wq->flags & WQ_IMMEDIATE))) + return &pool->worklist; + + list_for_each_entry(twork, &pool->worklist, entry) { + struct pool_workqueue *tpwq = get_work_pwq(twork); + + if (!(tpwq->wq->flags & WQ_IMMEDIATE)) + break; + } + + pool->flags |= POOL_IMMEDIATE_PENDING; + return &twork->entry; +} + +/** * get_pwq - get an extra reference on the specified pool_workqueue * @pwq: pool_workqueue to get * @@ -1081,9 +1120,10 @@ static void put_pwq_unlocked(struct pool static void pwq_activate_delayed_work(struct work_struct *work) { struct pool_workqueue *pwq = get_work_pwq(work); + struct list_head *pos = pwq_determine_ins_pos(pwq); trace_workqueue_activate_work(work); - move_linked_works(work, &pwq->pool->worklist, NULL); + move_linked_works(work, pos, NULL); __clear_bit(WORK_STRUCT_DELAYED_BIT, work_data_bits(work)); pwq->nr_active++; } @@ -1384,7 +1424,7 @@ retry: if (likely(pwq->nr_active < pwq->max_active)) { trace_workqueue_activate_work(work); pwq->nr_active++; - worklist = &pwq->pool->worklist; + worklist = pwq_determine_ins_pos(pwq); } else { work_flags |= WORK_STRUCT_DELAYED; worklist =
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, On Fri, Oct 23, 2015 at 06:42:43AM +0900, Tetsuo Handa wrote: > Then, isn't below change easier to backport which will also alleviate > needlessly burning CPU cycles? > > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3385,6 +3385,7 @@ retry: > ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) { > /* Wait for some write requests to complete then retry */ > wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50); > + schedule_timeout_uninterruptible(1); > goto retry; > } Yeah, that works too. It should still be put on a dedicated wq with MEM_RECLAIM tho. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Tejun Heo wrote: > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote: > > I am confused. What makes the rescuer not run? Nothing seems to be > > hogging CPUs, we are just out of workers which are looping in the > > allocator but that is preemptible context. > > It's concurrency management. Workqueue thinks that the pool is making > positive forward progress and doesn't schedule anything else for > execution while that work item is burning cpu cycles. Then, isn't below change easier to backport which will also alleviate needlessly burning CPU cycles? --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3385,6 +3385,7 @@ retry: ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) { /* Wait for some write requests to complete then retry */ wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50); + schedule_timeout_uninterruptible(1); goto retry; }
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote: > I am confused. What makes the rescuer not run? Nothing seems to be > hogging CPUs, we are just out of workers which are looping in the > allocator but that is preemptible context. It's concurrency management. Workqueue thinks that the pool is making positive forward progress and doesn't schedule anything else for execution while that work item is burning cpu cycles. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 00:37:03, Tejun Heo wrote: > On Thu, Oct 22, 2015 at 05:35:59PM +0200, Michal Hocko wrote: > > But that shouldn't happen because the allocation path does cond_resched > > even when nothing is really reclaimable (e.g. wait_iff_congested from > > __alloc_pages_slowpath). > > cond_resched() isn't enough. The work item should go !RUNNING, not > just yielding. I am confused. What makes the rescuer not run? Nothing seems to be hogging CPUs, we are just out of workers which are looping in the allocator but that is preemptible context. -- Michal Hocko SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 05:35:59PM +0200, Michal Hocko wrote: > But that shouldn't happen because the allocation path does cond_resched > even when nothing is really reclaimable (e.g. wait_iff_congested from > __alloc_pages_slowpath). cond_resched() isn't enough. The work item should go !RUNNING, not just yielding. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Fri 23-10-15 00:15:28, Tejun Heo wrote: > On Thu, Oct 22, 2015 at 05:06:23PM +0200, Michal Hocko wrote: > > Do I get it right that if vmstat_update has its own workqueue with > > WQ_MEM_RECLAIM then there is a _guarantee_ that the rescuer will always > > be able to process vmstat_update work from the requested CPU? > > Yeah. Thanks for the confirmation. > > That should be sufficient because vmstat_update doesn't sleep on > > allocation. I agree that this would be a more appropriate fix. > > The problem seems to be reclaim path busy looping waiting for > vmstat_update and workqueue thinking that the work item must be making > forward-progress and thus not starting the next work item. But that shouldn't happen because the allocation path does cond_resched even when nothing is really reclaimable (e.g. wait_iff_congested from __alloc_pages_slowpath). -- Michal Hocko SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Ok that also makes me rethink commit ba4877b9ca51f80b5d30f304a46762f0509e1635 which seems to be a similar fix this time related to idle mode not updating the counters. Could we fix that by folding the counters before going to idle mode? That fix seems to now create 2 separate application interruptions because the vmstat update is not deferred anymore to occur with other events.
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello, On Thu, Oct 22, 2015 at 09:41:11AM -0500, Christoph Lameter wrote: > > If this is actually a legit busy-waiting cyclic dependency, just let > > me know. > > There is no dependency of the vmstat updater on anything. > They can run anytime. If there is a dependency then it's created by the > kworker subsystem itself. Sure, the other direction is from workqueue concurrency detection. I was asking whether a work item can busy-wait on vmstat_update work item cuz that's what confuses workqueue. Looking at the original dump, the pool has two idle workers indicating that the workqueue wasn't short of execution resources and it really looks like that work item was live-locking the pool. I'll go ahead and add WQ_IMMEDIATE. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 05:06:23PM +0200, Michal Hocko wrote: > Do I get it right that if vmstat_update has its own workqueue with > WQ_MEM_RECLAIM then there is a _guarantee_ that the rescuer will always > be able to process vmstat_update work from the requested CPU? Yeah. > That should be sufficient because vmstat_update doesn't sleep on > allocation. I agree that this would be a more appropriate fix. The problem seems to be reclaim path busy looping waiting for vmstat_update and workqueue thinking that the work item must be making forward-progress and thus not starting the next work item. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu 22-10-15 23:09:44, Tejun Heo wrote: > On Thu, Oct 22, 2015 at 08:39:11AM -0500, Christoph Lameter wrote: > > On Thu, 22 Oct 2015, Tetsuo Handa wrote: > > > > > The problem would be that the "struct task_struct" to execute > > > vmstat_update > > > job does not exist, and will not be able to create one on demand because > > > we > > > are stuck at __GFP_WAIT allocation. Therefore adding a dedicated kernel > > > thread for vmstat_update job would work. But ... > > > > Yuck. Can someone please get this major screwup out of the work queue > > subsystem? Tejun? > > Hmmm? Just use a dedicated workqueue with WQ_MEM_RECLAIM. Do I get it right that if vmstat_update has its own workqueue with WQ_MEM_RECLAIM then there is a _guarantee_ that the rescuer will always be able to process vmstat_update work from the requested CPU? That should be sufficient because vmstat_update doesn't sleep on allocation. I agree that this would be a more appropriate fix. -- Michal Hocko SUSE Labs
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tejun Heo wrote: > The only way to hang the execution for a work item w/ WQ_MEM_RECLAIM > is to create a cyclic dependency on another work item and keep that > work item busy wait. Workqueue thinks that work item is making > progress as it's running and doesn't schedule the next one. > > (I was misremembering here) HIGHPRI originally was implemented > head-queueing on the same pool followed by immediate execution, so > could get around cases where this could happen, but that got lost > while converting it to a separate pool. I can introduce another flag > to bypass concurrency management if necessary (it's kinda trivial) but > busy-waiting cyclic dependency is a pretty unusual thing. > > If this is actually a legit busy-waiting cyclic dependency, just let > me know. There is no dependency of the vmstat updater on anything. They can run anytime. If there is a dependency then it's created by the kworker subsystem itself.
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 09:25:49AM -0500, Christoph Lameter wrote: > On Thu, 22 Oct 2015, Tejun Heo wrote: > > > On Thu, Oct 22, 2015 at 09:23:54AM -0500, Christoph Lameter wrote: > > > I guess we need that otherwise vm statistics are not updated while worker > > > threads are blocking on memory reclaim. > > > > And the blocking one is just constantly running? > > I was told that there is just one task struct so additional work queue > items cannot be processed while waiting? lol, no, what it tries to do is trying to keep the number of RUNNING workers at minimum so that minimum number of workers can be used and work items are executed back-to-back on the same workers. The moment a work item blocks, the next worker kicks in and starts executing the next work item in line. The only way to hang the execution for a work item w/ WQ_MEM_RECLAIM is to create a cyclic dependency on another work item and keep that work item busy wait. Workqueue thinks that work item is making progress as it's running and doesn't schedule the next one. (I was misremembering here) HIGHPRI originally was implemented head-queueing on the same pool followed by immediate execution, so could get around cases where this could happen, but that got lost while converting it to a separate pool. I can introduce another flag to bypass concurrency management if necessary (it's kinda trivial) but busy-waiting cyclic dependency is a pretty unusual thing. If this is actually a legit busy-waiting cyclic dependency, just let me know. Thanks. -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tejun Heo wrote: > On Thu, Oct 22, 2015 at 09:23:54AM -0500, Christoph Lameter wrote: > > I guess we need that otherwise vm statistics are not updated while worker > > threads are blocking on memory reclaim. > > And the blocking one is just constantly running? I was told that there is just one task struct so additional work queue items cannot be processed while waiting?
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 09:23:54AM -0500, Christoph Lameter wrote: > I guess we need that otherwise vm statistics are not updated while worker > threads are blocking on memory reclaim. And the blocking one is just constantly running? -- tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tejun Heo wrote: > > Hmmm? Just use a dedicated workqueue with WQ_MEM_RECLAIM. If > > concurrency management is a problem and there's something live-locking > > for that work item (really?), WQ_CPU_INTENSIVE escapes it. If this is > > a common occurrence that it makes sense to give vmstat higher > > priority, set WQ_HIGHPRI. > > Oooh, HIGHPRI + CPU_INTENSIVE immediate scheduling guarantee got lost > while converting HIGHPRI to a separate pool but guaranteeing immediate > scheduling for CPU_INTENSIVE is trivial. If vmstat requires that, > please let me know. I guess we need that otherwise vm statistics are not updated while worker threads are blocking on memory reclaim.
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tejun Heo wrote: > > Yuck. Can someone please get this major screwup out of the work queue > > subsystem? Tejun? > > Hmmm? Just use a dedicated workqueue with WQ_MEM_RECLAIM. If > concurrency management is a problem and there's something live-locking > for that work item (really?), WQ_CPU_INTENSIVE escapes it. If this is > a common occurrence that it makes sense to give vmstat higher > priority, set WQ_HIGHPRI. I did. Check the thread. The result was that other tasks were still blocking the thread. Ok, I did not use HIGHPRI; here is a newer version: From: Christoph Lameter Subject: vmstat: Create our own workqueue V2 V1->V2: - Add a couple of workqueue flags that may fix things. Seems that vmstat needs its own workqueue now since the general workqueue mechanism has been *enhanced* which means that the vmstat_updates cannot run reliably but are being blocked by work requests doing memory allocation. Which causes vmstat to be unable to keep the counters up to date. Bad. Fix this by creating our own workqueue. Signed-off-by: Christoph Lameter Index: linux/mm/vmstat.c === --- linux.orig/mm/vmstat.c +++ linux/mm/vmstat.c @@ -1382,6 +1382,8 @@ static const struct file_operations proc #endif /* CONFIG_PROC_FS */ #ifdef CONFIG_SMP +static struct workqueue_struct *vmstat_wq; + static DEFINE_PER_CPU(struct delayed_work, vmstat_work); int sysctl_stat_interval __read_mostly = HZ; static cpumask_var_t cpu_stat_off; @@ -1394,7 +1396,7 @@ static void vmstat_update(struct work_st * to occur in the future. Keep on running the * update worker thread. 
*/ - schedule_delayed_work_on(smp_processor_id(), + queue_delayed_work_on(smp_processor_id(), vmstat_wq, this_cpu_ptr(&vmstat_work), round_jiffies_relative(sysctl_stat_interval)); } else { @@ -1463,7 +1465,7 @@ static void vmstat_shepherd(struct work_ if (need_update(cpu) && cpumask_test_and_clear_cpu(cpu, cpu_stat_off)) - schedule_delayed_work_on(cpu, + queue_delayed_work_on(cpu, vmstat_wq, &per_cpu(vmstat_work, cpu), 0); put_online_cpus(); @@ -1552,6 +1554,12 @@ static int __init setup_vmstat(void) start_shepherd_timer(); cpu_notifier_register_done(); + vmstat_wq = alloc_workqueue("vmstat", + WQ_FREEZABLE | + WQ_SYSFS | + WQ_MEM_RECLAIM | + WQ_HIGHPRI | + WQ_CPU_INTENSIVE, 0); #endif #ifdef CONFIG_PROC_FS proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 11:09:44PM +0900, Tejun Heo wrote:
> On Thu, Oct 22, 2015 at 08:39:11AM -0500, Christoph Lameter wrote:
> > On Thu, 22 Oct 2015, Tetsuo Handa wrote:
> > > The problem would be that the "struct task_struct" to execute vmstat_update
> > > job does not exist, and will not be able to create one on demand because we
> > > are stuck at __GFP_WAIT allocation. Therefore adding a dedicated kernel
> > > thread for vmstat_update job would work. But ...
> >
> > Yuck. Can someone please get this major screwup out of the work queue
> > subsystem? Tejun?
>
> Hmmm? Just use a dedicated workqueue with WQ_MEM_RECLAIM. If
> concurrency management is a problem and there's something live-locking
> for that work item (really?), WQ_CPU_INTENSIVE escapes it. If this is
> a common occurrence that it makes sense to give vmstat higher
> priority, set WQ_HIGHPRI.

Oooh, HIGHPRI + CPU_INTENSIVE immediate scheduling guarantee got lost
while converting HIGHPRI to a separate pool but guaranteeing immediate
scheduling for CPU_INTENSIVE is trivial. If vmstat requires that,
please let me know.

Thanks.

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 08:39:11AM -0500, Christoph Lameter wrote:
> On Thu, 22 Oct 2015, Tetsuo Handa wrote:
> > The problem would be that the "struct task_struct" to execute vmstat_update
> > job does not exist, and will not be able to create one on demand because we
> > are stuck at __GFP_WAIT allocation. Therefore adding a dedicated kernel
> > thread for vmstat_update job would work. But ...
>
> Yuck. Can someone please get this major screwup out of the work queue
> subsystem? Tejun?

Hmmm? Just use a dedicated workqueue with WQ_MEM_RECLAIM. If
concurrency management is a problem and there's something live-locking
for that work item (really?), WQ_CPU_INTENSIVE escapes it. If this is
a common occurrence that it makes sense to give vmstat higher
priority, set WQ_HIGHPRI.

Thanks.

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tetsuo Handa wrote:
> The problem would be that the "struct task_struct" to execute vmstat_update
> job does not exist, and will not be able to create one on demand because we
> are stuck at __GFP_WAIT allocation. Therefore adding a dedicated kernel
> thread for vmstat_update job would work. But ...

Yuck. Can someone please get this major screwup out of the work queue
subsystem? Tejun?
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Christoph Lameter wrote:
> On Wed, 21 Oct 2015, Michal Hocko wrote:
> > I am not sure how to achieve that. Requiring non-sleeping worker would
> > work out but do we have enough users to add such an API?
> >
> > I would rather see vmstat using dedicated kernel thread(s) for this
> > purpose. We have discussed that in the past but it hasn't led anywhere.
>
> How about this one? I really would like to have the vm statistics work as
> designed and apparently they no longer work right with the existing
> workqueue mechanism.

No, it won't help. Adding a dedicated workqueue for the vmstat_update job
merely moves that job from the "events" workqueue to the "vmstat"
workqueue. The "vmstat" workqueue still appears in the list of busy
workqueues. The problem is that all workqueues are assigned the same CPU
(cpus=2 in the example below, tested on a 4-CPU VM) and therefore only one
job is in the in-flight state. All other jobs wait in the pending list for
the in-flight job to complete, while the in-flight job is blocked at a
memory allocation.

The problem would be that the "struct task_struct" to execute the
vmstat_update job does not exist, and we will not be able to create one on
demand because we are stuck at a __GFP_WAIT allocation. Therefore adding a
dedicated kernel thread for the vmstat_update job would work. But ...
[ 133.132322] Showing busy workqueues and worker pools:
[ 133.133878] workqueue events: flags=0x0
[ 133.135215]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[ 133.137076]     pending: vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx]
[ 133.139075] workqueue events_freezable_power_: flags=0x84
[ 133.140745]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[ 133.142638]     in-flight: 20:disk_events_workfn
[ 133.144199]     pending: disk_events_workfn
[ 133.145699] workqueue vmstat: flags=0xc
[ 133.147055]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 133.148910]     pending: vmstat_update
[ 133.150354] pool 4: cpus=2 node=0 flags=0x0 nice=0 workers=4 idle: 43 189 183
[ 133.174523] DMA32 zone_reclaimable: reclaim:2(30186,30162,2) free:11163(25154,-20) min:11163 pages_scanned:0(30158,0) prio:12
[ 133.177264] DMA32 zone_reclaimable: reclaim:2(30189,30165,2) free:11163(25157,-20) min:11163 pages_scanned:0(30161,0) prio:11
[ 133.180139] DMA32 zone_reclaimable: reclaim:2(30191,30167,2) free:11163(25159,-20) min:11163 pages_scanned:0(30163,0) prio:10
[ 133.182847] DMA32 zone_reclaimable: reclaim:2(30194,30170,2) free:11163(25162,-20) min:11163 pages_scanned:0(30166,0) prio:9
[ 133.207048] DMA32 zone_reclaimable: reclaim:2(30219,30195,2) free:11163(25187,-20) min:11163 pages_scanned:0(30191,0) prio:8
[ 133.209770] DMA32 zone_reclaimable: reclaim:2(30221,30197,2) free:11163(25189,-20) min:11163 pages_scanned:0(30193,0) prio:7
[ 133.212470] DMA32 zone_reclaimable: reclaim:2(30224,30200,2) free:11163(25192,-20) min:11163 pages_scanned:0(30196,0) prio:6
[ 133.215149] DMA32 zone_reclaimable: reclaim:2(30227,30203,2) free:11163(25195,-20) min:11163 pages_scanned:0(30199,0) prio:5
[ 133.239013] DMA32 zone_reclaimable: reclaim:2(30251,30227,2) free:11163(25219,-20) min:11163 pages_scanned:0(30223,0) prio:4
[ 133.241688] DMA32 zone_reclaimable: reclaim:2(30253,30229,2) free:11163(25221,-20) min:11163 pages_scanned:0(30225,0) prio:3
[ 133.244332] DMA32 zone_reclaimable: reclaim:2(30256,30232,2) free:11163(25224,-20) min:11163 pages_scanned:0(30228,0) prio:2
[ 133.246919] DMA32 zone_reclaimable: reclaim:2(30258,30234,2) free:11163(25226,-20) min:11163 pages_scanned:0(30230,0) prio:1
[ 133.270967] DMA32 zone_reclaimable: reclaim:2(30283,30259,2) free:11163(25251,-20) min:11163 pages_scanned:0(30255,0) prio:0
[ 133.273587] DMA32 zone_reclaimable: reclaim:2(30285,30261,2) free:11163(25253,-20) min:11163 pages_scanned:0(30257,0) prio:12
[ 133.276224] DMA32 zone_reclaimable: reclaim:2(30287,30263,2) free:11163(25255,-20) min:11163 pages_scanned:0(30259,0) prio:11
[ 133.278852] DMA32 zone_reclaimable: reclaim:2(30290,30266,2) free:11163(25258,-20) min:11163 pages_scanned:0(30262,0) prio:10
[ 133.302964] DMA32 zone_reclaimable: reclaim:2(30315,30291,2) free:11163(25283,-20) min:11163 pages_scanned:0(30287,0) prio:9
[ 133.305518] DMA32 zone_reclaimable: reclaim:2(30317,30293,2) free:11163(25285,-20) min:11163 pages_scanned:0(30289,0) prio:8
[ 133.308095] DMA32 zone_reclaimable: reclaim:2(30319,30295,2) free:11163(25287,-20) min:11163 pages_scanned:0(30291,0) prio:7
[ 133.310683] DMA32 zone_reclaimable: reclaim:2(30322,30298,2) free:11163(25290,-20) min:11163 pages_scanned:0(30294,0) prio:6
[ 133.334904] DMA32 zone_reclaimable: reclaim:2(30347,30323,2) free:11163(25315,-20) min:11163 pages_scanned:0(30319,0) prio:5
[ 133.337590] DMA32 zone_reclaimable: reclaim:2(30349,30325,2) free:11163(25317,-20) min:11163 pages_scanned:0(30321,0) prio:4
[ 133.340147] DMA32 zone_reclaimable: reclaim:2(30351,30327,2)
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote:
> I am confused. What makes rescuer to not run? Nothing seems to be
> hogging CPUs, we are just out of workers which are looping in the
> allocator but that is preemptible context.

It's concurrency management. Workqueue thinks that the pool is making
positive forward progress and doesn't schedule anything else for
execution while that work item is burning cpu cycles.

Thanks.

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Tejun Heo wrote:
> On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote:
> > I am confused. What makes rescuer to not run? Nothing seems to be
> > hogging CPUs, we are just out of workers which are looping in the
> > allocator but that is preemptible context.
>
> It's concurrency management. Workqueue thinks that the pool is making
> positive forward progress and doesn't schedule anything else for
> execution while that work item is burning cpu cycles.

Then, isn't the below change easier to backport, which will also alleviate
needlessly burning CPU cycles?

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3385,6 +3385,7 @@ retry:
	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
		/* Wait for some write requests to complete then retry */
		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
+		schedule_timeout_uninterruptible(1);
		goto retry;
	}
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello,

On Fri, Oct 23, 2015 at 06:42:43AM +0900, Tetsuo Handa wrote:
> Then, isn't below change easier to backport which will also alleviate
> needlessly burning CPU cycles?
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3385,6 +3385,7 @@ retry:
> 	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
> 		/* Wait for some write requests to complete then retry */
> 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
> +		schedule_timeout_uninterruptible(1);
> 		goto retry;
> 	}

Yeah, that works too. It should still be put on a dedicated wq with
MEM_RECLAIM tho.

Thanks.

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tejun Heo wrote:
> > Hmmm? Just use a dedicated workqueue with WQ_MEM_RECLAIM. If
> > concurrency management is a problem and there's something live-locking
> > for that work item (really?), WQ_CPU_INTENSIVE escapes it. If this is
> > a common occurrence that it makes sense to give vmstat higher
> > priority, set WQ_HIGHPRI.
>
> Oooh, HIGHPRI + CPU_INTENSIVE immediate scheduling guarantee got lost
> while converting HIGHPRI to a separate pool but guaranteeing immediate
> scheduling for CPU_INTENSIVE is trivial. If vmstat requires that,
> please let me know.

I guess we need that, otherwise vm statistics are not updated while worker
threads are blocking on memory reclaim.
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 09:23:54AM -0500, Christoph Lameter wrote:
> I guess we need that otherwise vm statistics are not updated while worker
> threads are blocking on memory reclaim.

And the blocking one is just constantly running?

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tejun Heo wrote:
> On Thu, Oct 22, 2015 at 09:23:54AM -0500, Christoph Lameter wrote:
> > I guess we need that otherwise vm statistics are not updated while worker
> > threads are blocking on memory reclaim.
>
> And the blocking one is just constantly running?

I was told that there is just one task struct so additional work queue
items cannot be processed while waiting?
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, 22 Oct 2015, Tejun Heo wrote:
> The only way to hang the execution for a work item w/ WQ_MEM_RECLAIM
> is to create a cyclic dependency on another work item and keep that
> work item busy wait. Workqueue thinks that work item is making
> progress as it's running and doesn't schedule the next one.
>
> (I was misremembering here) HIGHPRI originally was implemented
> head-queueing on the same pool followed by immediate execution, so
> could get around cases where this could happen, but that got lost
> while converting it to a separate pool. I can introduce another flag
> to bypass concurrency management if necessary (it's kinda trivial) but
> busy-waiting cyclic dependency is a pretty unusual thing.
>
> If this is actually a legit busy-waiting cyclic dependency, just let
> me know.

There is no dependency of the vmstat updater on anything. It can run
anytime. If there is a dependency then it's created by the kworker
subsystem itself.
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
On Thu, Oct 22, 2015 at 09:25:49AM -0500, Christoph Lameter wrote:
> On Thu, 22 Oct 2015, Tejun Heo wrote:
> > On Thu, Oct 22, 2015 at 09:23:54AM -0500, Christoph Lameter wrote:
> > > I guess we need that otherwise vm statistics are not updated while worker
> > > threads are blocking on memory reclaim.
> >
> > And the blocking one is just constantly running?
>
> I was told that there is just one task struct so additional work queue
> items cannot be processed while waiting?

lol, no, what it tries to do is to keep the number of RUNNING workers at
a minimum so that the minimum number of workers can be used and work
items are executed back-to-back on the same workers. The moment a work
item blocks, the next worker kicks in and starts executing the next work
item in line.

The only way to hang the execution for a work item w/ WQ_MEM_RECLAIM is
to create a cyclic dependency on another work item and keep that work
item busy waiting. Workqueue thinks that work item is making progress as
it's running and doesn't schedule the next one.

(I was misremembering here) HIGHPRI originally was implemented as
head-queueing on the same pool followed by immediate execution, so it
could get around cases where this could happen, but that got lost while
converting it to a separate pool. I can introduce another flag to bypass
concurrency management if necessary (it's kinda trivial) but a
busy-waiting cyclic dependency is a pretty unusual thing.

If this is actually a legit busy-waiting cyclic dependency, just let me
know.

Thanks.

-- 
tejun
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello,

So, something like the following. Just compile tested, but this is
essentially a partial revert of 3270476a6c0c ("workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool") - resurrecting the old
WQ_HIGHPRI implementation under WQ_IMMEDIATE, so we know this works.

If for some reason it gets decided against simply adding a one jiffy
sleep, please let me know. I'll verify the operation and post a proper
patch. That said, given that this prolly needs -stable backport and
vmstat is likely to be the only user (busy loops are really rare in the
kernel after all), I think the better approach would be reinstating the
short sleep.

Thanks.

---
 include/linux/workqueue.h |  7 ++---
 kernel/workqueue.c        | 63 +++---
 2 files changed, 63 insertions(+), 7 deletions(-)

--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -278,9 +278,10 @@ enum {
	WQ_UNBOUND		= 1 << 1, /* not bound to any cpu */
	WQ_FREEZABLE		= 1 << 2, /* freeze during suspend */
	WQ_MEM_RECLAIM		= 1 << 3, /* may be used for memory reclaim */
-	WQ_HIGHPRI		= 1 << 4, /* high priority */
-	WQ_CPU_INTENSIVE	= 1 << 5, /* cpu intensive workqueue */
-	WQ_SYSFS		= 1 << 6, /* visible in sysfs, see wq_sysfs_register() */
+	WQ_IMMEDIATE		= 1 << 4, /* bypass concurrency management */
+	WQ_HIGHPRI		= 1 << 5, /* high priority */
+	WQ_CPU_INTENSIVE	= 1 << 6, /* cpu intensive workqueue */
+	WQ_SYSFS		= 1 << 7, /* visible in sysfs, see wq_sysfs_register() */

	/*
	 * Per-cpu workqueues are generally preferred because they tend to
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -68,6 +68,7 @@ enum {
	 * attach_mutex to avoid changing binding state while
	 * worker_attach_to_pool() is in progress.
	 */
+	POOL_IMMEDIATE_PENDING	= 1 << 0, /* WQ_IMMEDIATE items on queue */
	POOL_DISASSOCIATED	= 1 << 2, /* cpu can't serve workers */

	/* worker flags */
@@ -731,7 +732,8 @@ static bool work_is_canceling(struct wor

 static bool __need_more_worker(struct worker_pool *pool)
 {
-	return !atomic_read(&pool->nr_running);
+	return !atomic_read(&pool->nr_running) ||
+		(pool->flags & POOL_IMMEDIATE_PENDING);
 }

 /*
@@ -757,7 +759,8 @@ static bool may_start_working(struct wor
 static bool keep_working(struct worker_pool *pool)
 {
	return !list_empty(&pool->worklist) &&
-		atomic_read(&pool->nr_running) <= 1;
+		(atomic_read(&pool->nr_running) <= 1 ||
+		 (pool->flags & POOL_IMMEDIATE_PENDING));
 }

 /* Do we need a new worker? Called from manager. */
@@ -1021,6 +1024,42 @@ static void move_linked_works(struct wor
 }

 /**
+ * pwq_determine_ins_pos - find insertion position
+ * @pwq: pwq a work is being queued for
+ *
+ * A work for @pwq is about to be queued on @pwq->pool, determine insertion
+ * position for the work.  If @pwq is for an IMMEDIATE wq, the work item is
+ * queued at the head of the queue but in FIFO order with respect to other
+ * IMMEDIATE work items; otherwise, at the end of the queue.  This function
+ * also sets POOL_IMMEDIATE_PENDING flag to hint @pool that there are
+ * IMMEDIATE works pending.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to insertion position.
+ */
+static struct list_head *pwq_determine_ins_pos(struct pool_workqueue *pwq)
+{
+	struct worker_pool *pool = pwq->pool;
+	struct work_struct *twork;
+
+	if (likely(!(pwq->wq->flags & WQ_IMMEDIATE)))
+		return &pool->worklist;
+
+	list_for_each_entry(twork, &pool->worklist, entry) {
+		struct pool_workqueue *tpwq = get_work_pwq(twork);
+
+		if (!(tpwq->wq->flags & WQ_IMMEDIATE))
+			break;
+	}
+
+	pool->flags |= POOL_IMMEDIATE_PENDING;
+	return &twork->entry;
+}
+
+/**
  * get_pwq - get an extra reference on the specified pool_workqueue
  * @pwq: pool_workqueue to get
  *
@@ -1081,9 +1120,10 @@ static void put_pwq_unlocked(struct pool
 static void pwq_activate_delayed_work(struct work_struct *work)
 {
	struct pool_workqueue *pwq = get_work_pwq(work);
+	struct list_head *pos = pwq_determine_ins_pos(pwq);

	trace_workqueue_activate_work(work);
-	move_linked_works(work, &pwq->pool->worklist, NULL);
+	move_linked_works(work, pos, NULL);
	__clear_bit(WORK_STRUCT_DELAYED_BIT, work_data_bits(work));
	pwq->nr_active++;
 }
@@ -1384,7 +1424,7 @@ retry:
	if (likely(pwq->nr_active < pwq->max_active)) {
		trace_workqueue_activate_work(work);
		pwq->nr_active++;
-		worklist = &pwq->pool->worklist;
+		worklist = pwq_determine_ins_pos(pwq);
	} else {
		work_flags |= WORK_STRUCT_DELAYED;
		worklist =
Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
Hello,

On Thu, Oct 22, 2015 at 09:41:11AM -0500, Christoph Lameter wrote:
> > If this is actually a legit busy-waiting cyclic dependency, just let
> > me know.
>
> There is no dependency of the vmstat updater on anything.
> They can run anytime. If there is a dependency then its created by the
> kworker subsystem itself.

Sure, the other direction is from workqueue concurrency detection. I was
asking whether a work item can busy-wait on the vmstat_update work item,
cuz that's what confuses workqueue. Looking at the original dump, the
pool has idle workers, indicating that the workqueue wasn't short of
execution resources, and it really looks like that work item was
live-locking the pool. I'll go ahead and add WQ_IMMEDIATE.

Thanks.

-- 
tejun