Re: Detecting page cache trashing state
Hello,

On Mon, Nov 20, 2017 at 09:40:56PM +0200, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> Hi Johannes,
>
> I tested with your patches but the situation is still mostly the same.
>
> Spent some time debugging and found that the problem is squashfs
> specific (probably some other FSes too). The point is that the iowait
> for squashfs reads is awaited inside the squashfs readpage() callback.
> Here is a backtrace for page fault handling to illustrate this:
>
> 1)               |  handle_mm_fault() {
> 1)               |    filemap_fault() {
> 1)               |      __do_page_cache_readahead()
> 1)               |        add_to_page_cache_lru()
> 1)               |        squashfs_readpage() {
> 1)               |          squashfs_readpage_block() {
> 1)               |            squashfs_get_datablock() {
> 1)               |              squashfs_cache_get() {
> 1)               |                squashfs_read_data() {
> 1)               |                  ll_rw_block() {
> 1)               |                    submit_bh_wbc.isra.42()
> 1)               |                  __wait_on_buffer() {
> 1)               |                    io_schedule() {
>  --
>  0)  kworker-79  =>  <idle>-0
>  --
> 0)   0.382 us    |  blk_complete_request();
> 0)               |  blk_done_softirq() {
> 0)               |    blk_update_request() {
> 0)               |      end_buffer_read_sync()
> 0) + 38.559 us   |    }
> 0) + 48.367 us   |  }
>  --
>  0)  kworker-79  =>  memhog-781
>  --
> 0) ! 278.848 us  |                    }
> 0) ! 279.612 us  |                  }
> 0)               |                  squashfs_decompress() {
> 0) # 4919.082 us |                    squashfs_xz_uncompress();
> 0) # 4919.864 us |                  }
> 0) # 5479.212 us |                } /* squashfs_read_data */
> 0) # 5479.749 us |              } /* squashfs_cache_get */
> 0) # 5480.177 us |            } /* squashfs_get_datablock */
> 0)               |            squashfs_copy_cache() {
> 0)   0.057 us    |              unlock_page();
> 0) ! 142.773 us  |            }
> 0) # 5624.113 us |          } /* squashfs_readpage_block */
> 0) # 5628.814 us |        } /* squashfs_readpage */
> 0) # 5665.097 us |      } /* __do_page_cache_readahead */
> 0) # 5667.437 us |    } /* filemap_fault */
> 0) # 5672.880 us |  } /* handle_mm_fault */
>
> As you can see, squashfs_read_data() schedules the IO by ll_rw_block()
> and then waits for the IO to finish inside wait_on_buffer().
> After that, the read buffer is decompressed and the page is unlocked
> inside the squashfs_readpage() handler.
> Thus by the time filemap_fault() calls lock_page_or_retry(), the page
> will be uptodate and unlocked, wait_on_page_bit() is not called at all,
> and the time spent on read/decompress is not accounted.

A weakness in the current approach is that it relies on the page lock.
That means it cannot work with synchronous devices like DAX, zram and so
on, I think.

Johannes, can we add memdelay_enter to every fault handler's prologue?
Then we can check in the epilogue whether the faulted page is in the
workingset. If it was, we can accumulate the time spent. That would work
with synchronous devices, especially zram, without hacking FSes like
squashfs. I think the page fault handler/kswapd/direct reclaim would
cover most cases of *real* memory pressure, but the [un]lock_page
friends would account superfluously; for example, FSes can call them
easily without memory pressure.

> Tried to apply a quick workaround for testing:
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index c4ca702..5e2be2b 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -126,9 +126,21 @@ static int read_pages(struct address_space *mapping, struct file *filp,
>
>  	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
>  		struct page *page = lru_to_page(pages);
> +		bool refault = false;
> +		unsigned long mdflags;
> +
>  		list_del(&page->lru);
> -		if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
> +		if (!add_to_page_cache_lru(page, mapping, page->index, gfp)) {
> +			if (!PageUptodate(page) && PageWorkingset(page)) {
> +				memdelay_enter(&mdflags);
> +				refault = true;
> +			}
> +
>  			mapping->a_ops->readpage(filp, page);
> +
> +			if (refault)
> +				memdelay_leave(&mdflags);
> +		}
>  		put_page(page);
Re: Detecting page cache trashing state
Hi Johannes,

I tested with your patches but the situation is still mostly the same.

Spent some time debugging and found that the problem is squashfs
specific (probably some other FSes too). The point is that the iowait
for squashfs reads is awaited inside the squashfs readpage() callback.
Here is a backtrace for page fault handling to illustrate this:

1)               |  handle_mm_fault() {
1)               |    filemap_fault() {
1)               |      __do_page_cache_readahead()
1)               |        add_to_page_cache_lru()
1)               |        squashfs_readpage() {
1)               |          squashfs_readpage_block() {
1)               |            squashfs_get_datablock() {
1)               |              squashfs_cache_get() {
1)               |                squashfs_read_data() {
1)               |                  ll_rw_block() {
1)               |                    submit_bh_wbc.isra.42()
1)               |                  __wait_on_buffer() {
1)               |                    io_schedule() {
 --
 0)  kworker-79  =>  <idle>-0
 --
0)   0.382 us    |  blk_complete_request();
0)               |  blk_done_softirq() {
0)               |    blk_update_request() {
0)               |      end_buffer_read_sync()
0) + 38.559 us   |    }
0) + 48.367 us   |  }
 --
 0)  kworker-79  =>  memhog-781
 --
0) ! 278.848 us  |                    }
0) ! 279.612 us  |                  }
0)               |                  squashfs_decompress() {
0) # 4919.082 us |                    squashfs_xz_uncompress();
0) # 4919.864 us |                  }
0) # 5479.212 us |                } /* squashfs_read_data */
0) # 5479.749 us |              } /* squashfs_cache_get */
0) # 5480.177 us |            } /* squashfs_get_datablock */
0)               |            squashfs_copy_cache() {
0)   0.057 us    |              unlock_page();
0) ! 142.773 us  |            }
0) # 5624.113 us |          } /* squashfs_readpage_block */
0) # 5628.814 us |        } /* squashfs_readpage */
0) # 5665.097 us |      } /* __do_page_cache_readahead */
0) # 5667.437 us |    } /* filemap_fault */
0) # 5672.880 us |  } /* handle_mm_fault */

As you can see, squashfs_read_data() schedules the IO by ll_rw_block()
and then waits for the IO to finish inside wait_on_buffer(). After that,
the read buffer is decompressed and the page is unlocked inside the
squashfs_readpage() handler.

Thus by the time filemap_fault() calls lock_page_or_retry(), the page
will be uptodate and unlocked, wait_on_page_bit() is not called at all,
and the time spent on read/decompress is not accounted.
Tried to apply a quick workaround for testing:

diff --git a/mm/readahead.c b/mm/readahead.c
index c4ca702..5e2be2b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -126,9 +126,21 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = lru_to_page(pages);
+		bool refault = false;
+		unsigned long mdflags;
+
 		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
+		if (!add_to_page_cache_lru(page, mapping, page->index, gfp)) {
+			if (!PageUptodate(page) && PageWorkingset(page)) {
+				memdelay_enter(&mdflags);
+				refault = true;
+			}
+
 			mapping->a_ops->readpage(filp, page);
+
+			if (refault)
+				memdelay_leave(&mdflags);
+		}
 		put_page(page);

But found that the situation is not much different. The reason is that,
at least in my synthetic tests, I'm exhausting the whole memory, leaving
almost no place for the page cache:

Active(anon):   15901788 kB
Inactive(anon):    44844 kB
Active(file):        488 kB
Inactive(file):      612 kB

As a result, the refault distance is always higher than the
LRU_ACTIVE_FILE size, and the Workingset flag is not set for the
refaulting page even if it was active during its lifecycle before
eviction:

workingset_refault     7773
workingset_activate     250
workingset_restore      233
workingset_nodereclaim   49

Tried to apply the following workaround:

diff --git a/mm/workingset.c b/mm/workingset.c
index 264f049..8035ef6 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -305,6 +305,11 @@ void workingset_refault(struct page *page, void *shadow)
 
 	inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
 
+	/* Page was active prior to eviction */
+	if (workingset) {
+		SetPageWorkingset(page);
+		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+	}
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't act on pages that couldn't stay resident even if all
@@ -314,13 +319,9 @@ void workingset_refault(struct page *page, void *shadow)
 		goto out;
Re: Detecting page cache trashing state
On 10/26/2017 06:53 AM, vinayak menon wrote:
> On Thu, Sep 28, 2017 at 9:19 PM, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> > Hi Johannes,
> >
> > Hopefully I was able to rebase the patch on top of v4.9.26 (the
> > latest version supported by us right now) and test a bit.
> > The overall idea definitely looks promising, although I have one
> > question on usage.
> > Will it be able to account the time which processes spend on handling
> > major page faults (including fs and iowait time) of refaulting pages?
> >
> > As we have one big application whose code occupies a big amount of
> > place in the page cache, when the system under heavy memory usage
> > reclaims some of it, the application starts constantly thrashing.
> > Since its code is placed on squashfs, it spends the whole CPU time
> > decompressing the pages, and the memdelay counters do not seem to
> > detect this situation. Here are some counters to indicate this:
> >
> > 19:02:44     CPU  %user  %nice  %system  %iowait  %steal  %idle
> > 19:02:45     all   0.00   0.00   100.00     0.00    0.00   0.00
> >
> > 19:02:44  pgpgin/s  pgpgout/s  fault/s  majflt/s  pgfree/s  pgscank/s  pgscand/s  pgsteal/s  %vmeff
> > 19:02:45  15284.00       0.00   428.00    352.00  19990.00       0.00       0.00   15802.00    0.00
> >
> > And as nobody is actively allocating memory anymore, it looks like
> > the memdelay counters are not actively incremented:
> >
> > [:~]$ cat /proc/memdelay
> > 268035776
> > 6.13 5.43 3.58
> > 1.90 1.89 1.26
> >
> > Just in case, I have attached the v4.9.26 rebased patches.
>
> Looks like this 4.9 version does not contain the accounting in
> lock_page.

In v4.9 there is no wait_on_page_bit_common(), thus the accounting moved
to wait_on_page_bit(_killable|_killable_timeout). Related functionality
around lock_page_or_retry() seems to be mostly the same in v4.9.
Re: Detecting page cache trashing state
Hi Johannes,

On 10/25/2017 08:54 PM, Johannes Weiner wrote:
> Hi Ruslan,
>
> sorry about the delayed response, I missed the new activity in this
> older thread.
>
> On Thu, Sep 28, 2017 at 06:49:07PM +0300, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> > Hi Johannes,
> >
> > Hopefully I was able to rebase the patch on top of v4.9.26 (the
> > latest version supported by us right now) and test a bit.
> > The overall idea definitely looks promising, although I have one
> > question on usage.
> > Will it be able to account the time which processes spend on handling
> > major page faults (including fs and iowait time) of refaulting pages?
>
> That's the main thing it should measure! :)
>
> The lock_page() and wait_on_page_locked() calls are where iowaits
> happen on a cache miss. If those are refaults, they'll be counted.
>
> > As we have one big application whose code occupies a big amount of
> > place in the page cache, when the system under heavy memory usage
> > reclaims some of it, the application starts constantly thrashing.
> > Since its code is placed on squashfs, it spends the whole CPU time
> > decompressing the pages, and the memdelay counters do not seem to
> > detect this situation. Here are some counters to indicate this:
> >
> > 19:02:44     CPU  %user  %nice  %system  %iowait  %steal  %idle
> > 19:02:45     all   0.00   0.00   100.00     0.00    0.00   0.00
> >
> > 19:02:44  pgpgin/s  pgpgout/s  fault/s  majflt/s  pgfree/s  pgscank/s  pgscand/s  pgsteal/s  %vmeff
> > 19:02:45  15284.00       0.00   428.00    352.00  19990.00       0.00       0.00   15802.00    0.00
> >
> > And as nobody is actively allocating memory anymore, it looks like
> > the memdelay counters are not actively incremented:
> >
> > [:~]$ cat /proc/memdelay
> > 268035776
> > 6.13 5.43 3.58
> > 1.90 1.89 1.26
>
> How does it correlate with /proc/vmstat::workingset_activate during
> that time? It only counts thrashing time of refaults it can actively
> detect.

The workingset counters are growing quite actively too. Here are some
numbers per second:

workingset_refault     8201
workingset_activate     389
workingset_restore      187
workingset_nodereclaim  313

> Btw, how many CPUs does this system have?
> There is a bug in this version on how idle time is aggregated across
> multiple CPUs. The error compounds with the number of CPUs in the
> system.

The system has 2 CPU cores.

> I'm attaching 3 bugfixes that go on top of what you have. There might
> be some conflicts, but they should be minor variable naming issues.

I will test with your patches and get back to you.

Thanks,
Ruslan
Re: Detecting page cache trashing state
On Thu, Sep 28, 2017 at 9:19 PM, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> Hi Johannes,
>
> Hopefully I was able to rebase the patch on top of v4.9.26 (the latest
> version supported by us right now) and test a bit.
> The overall idea definitely looks promising, although I have one
> question on usage.
> Will it be able to account the time which processes spend on handling
> major page faults (including fs and iowait time) of refaulting pages?
>
> As we have one big application whose code occupies a big amount of
> place in the page cache, when the system under heavy memory usage
> reclaims some of it, the application starts constantly thrashing.
> Since its code is placed on squashfs, it spends the whole CPU time
> decompressing the pages, and the memdelay counters do not seem to
> detect this situation. Here are some counters to indicate this:
>
> 19:02:44     CPU  %user  %nice  %system  %iowait  %steal  %idle
> 19:02:45     all   0.00   0.00   100.00     0.00    0.00   0.00
>
> 19:02:44  pgpgin/s  pgpgout/s  fault/s  majflt/s  pgfree/s  pgscank/s  pgscand/s  pgsteal/s  %vmeff
> 19:02:45  15284.00       0.00   428.00    352.00  19990.00       0.00       0.00   15802.00    0.00
>
> And as nobody is actively allocating memory anymore, it looks like the
> memdelay counters are not actively incremented:
>
> [:~]$ cat /proc/memdelay
> 268035776
> 6.13 5.43 3.58
> 1.90 1.89 1.26
>
> Just in case, I have attached the v4.9.26 rebased patches.

Looks like this 4.9 version does not contain the accounting in
lock_page.
Re: Detecting page cache trashing state
Hi Ruslan,

sorry about the delayed response, I missed the new activity in this
older thread.

On Thu, Sep 28, 2017 at 06:49:07PM +0300, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> Hi Johannes,
>
> Hopefully I was able to rebase the patch on top of v4.9.26 (the latest
> version supported by us right now) and test a bit.
> The overall idea definitely looks promising, although I have one
> question on usage.
> Will it be able to account the time which processes spend on handling
> major page faults (including fs and iowait time) of refaulting pages?

That's the main thing it should measure! :)

The lock_page() and wait_on_page_locked() calls are where iowaits happen
on a cache miss. If those are refaults, they'll be counted.

> As we have one big application whose code occupies a big amount of
> place in the page cache, when the system under heavy memory usage
> reclaims some of it, the application starts constantly thrashing.
> Since its code is placed on squashfs, it spends the whole CPU time
> decompressing the pages, and the memdelay counters do not seem to
> detect this situation. Here are some counters to indicate this:
>
> 19:02:44     CPU  %user  %nice  %system  %iowait  %steal  %idle
> 19:02:45     all   0.00   0.00   100.00     0.00    0.00   0.00
>
> 19:02:44  pgpgin/s  pgpgout/s  fault/s  majflt/s  pgfree/s  pgscank/s  pgscand/s  pgsteal/s  %vmeff
> 19:02:45  15284.00       0.00   428.00    352.00  19990.00       0.00       0.00   15802.00    0.00
>
> And as nobody is actively allocating memory anymore, it looks like the
> memdelay counters are not actively incremented:
>
> [:~]$ cat /proc/memdelay
> 268035776
> 6.13 5.43 3.58
> 1.90 1.89 1.26

How does it correlate with /proc/vmstat::workingset_activate during that
time? It only counts thrashing time of refaults it can actively detect.

Btw, how many CPUs does this system have?

There is a bug in this version on how idle time is aggregated across
multiple CPUs. The error compounds with the number of CPUs in the
system.
I'm attaching 3 bugfixes that go on top of what you have. There might be
some conflicts, but they should be minor variable naming issues.

From 7318c963a582833d4556c51fc2e1658e00c14e3e Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Thu, 5 Oct 2017 12:32:47 -0400
Subject: [PATCH 1/3] mm: memdelay: fix task flags race condition

WARNING: CPU: 35 PID: 2263 at ../include/linux/memcontrol.h:466

This is the memcg warning that current->memcg_may_oom is set when it
doesn't expect it to be. The warning came in new with the memdelay
patches. They add another task flag in the same int as memcg_may_oom,
but modify it from try_to_wake_up, from a task that isn't current. This
isn't safe.

Move the flag to the other int holding task flags, whose modifications
are serialized through the scheduler locks.

Signed-off-by: Johannes Weiner
---
 include/linux/sched.h | 2 +-
 kernel/sched/core.c   | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index de15e3c8c43a..d1aa8f4c19ab 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -627,6 +627,7 @@ struct task_struct {
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
 	unsigned			sched_remote_wakeup:1;
+	unsigned			sched_memdelay_requeue:1;
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
@@ -651,7 +652,6 @@ struct task_struct {
 	/* disallow userland-initiated cgroup migration */
 	unsigned			no_cgroup_migration:1;
 #endif
-	unsigned			memdelay_migrate_enqueue:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bf105c870da6..b4fa806bf153 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -760,10 +760,10 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & ENQUEUE_RESTORE))
 		sched_info_queued(rq, p);
 
-	WARN_ON_ONCE(!(flags & ENQUEUE_WAKEUP) && p->memdelay_migrate_enqueue);
-	if (!(flags & ENQUEUE_WAKEUP) || p->memdelay_migrate_enqueue) {
+	WARN_ON_ONCE(!(flags & ENQUEUE_WAKEUP) && p->sched_memdelay_requeue);
+	if (!(flags & ENQUEUE_WAKEUP) || p->sched_memdelay_requeue) {
 		memdelay_add_runnable(p);
-		p->memdelay_migrate_enqueue = 0;
+		p->sched_memdelay_requeue = 0;
 	} else {
 		memdelay_wakeup(p);
 	}
@@ -2065,8 +2065,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		rq = __task_rq_lock(p, &rf);
 		memdelay_del_sleeping(p);
+		p->sched_memdelay_requeue = 1;
 		__task_rq_unlock(rq, &rf);
-		p->memdelay_migrate_enqueue = 1;
 
 		set_task_cpu(p, cpu);
 	}
-- 
2.14.2

From 7157c70aed93990f59942d39d1c0d8948164cfe2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Thu, 5 Oct 2017 12:34:49 -0400
Subject: [PATCH 2/3] mm: memdelay: idle time is not productive time

There is an error in the multi-core logic, where memory delay numbers
drop as the number
Re: Detecting page cache trashing state
On 09/28/2017 08:49 AM, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> Hi Johannes,
>
> Hopefully I was able to rebase the patch on top of v4.9.26 (the latest
> version supported by us right now) and test a bit.
> The overall idea definitely looks promising, although I have one
> question on usage.
> Will it be able to account the time which processes spend on handling
> major page faults (including fs and iowait time) of refaulting pages?

Johannes, did you get a chance to review the changes from Ruslan?

Daniel
Re: Detecting page cache trashing state
Hi Johannes,

Hopefully I was able to rebase the patch on top of v4.9.26 (the latest
version supported by us right now) and test a bit.
The overall idea definitely looks promising, although I have one
question on usage.
Will it be able to account the time which processes spend on handling
major page faults (including fs and iowait time) of refaulting pages?

As we have one big application whose code occupies a big amount of place
in the page cache, when the system under heavy memory usage reclaims
some of it, the application starts constantly thrashing. Since its code
is placed on squashfs, it spends the whole CPU time decompressing the
pages, and the memdelay counters do not seem to detect this situation.
Here are some counters to indicate this:

19:02:44     CPU  %user  %nice  %system  %iowait  %steal  %idle
19:02:45     all   0.00   0.00   100.00     0.00    0.00   0.00

19:02:44  pgpgin/s  pgpgout/s  fault/s  majflt/s  pgfree/s  pgscank/s  pgscand/s  pgsteal/s  %vmeff
19:02:45  15284.00       0.00   428.00    352.00  19990.00       0.00       0.00   15802.00    0.00

And as nobody is actively allocating memory anymore, it looks like the
memdelay counters are not actively incremented:

[:~]$ cat /proc/memdelay
268035776
6.13 5.43 3.58
1.90 1.89 1.26

Just in case, I have attached the v4.9.26 rebased patches. Also attached
the patch with our current solution. The current implementation mostly
fits the squashfs-only thrashing situation; in the general case, iowait
time would be the major part of page fault handling and thus needs to be
accounted too.

Thanks,
Ruslan

On 09/18/2017 07:34 PM, Johannes Weiner wrote:
> Hi Taras,
>
> On Fri, Sep 15, 2017 at 10:28:30AM -0700, Taras Kondratiuk wrote:
> > Quoting Michal Hocko (2017-09-15 07:36:19)
> > > On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> > > > Has somebody faced a similar issue? How are you solving it?
> > >
> > > Yes, this is a pain point for a _long_ time. And we still do not
> > > have a good answer upstream. Johannes has been playing in this
> > > area [1].
> > > The main problem is that our OOM detection logic is based on the
> > > ability to reclaim memory to allocate new memory. And that is
> > > pretty much true for the pagecache when you are thrashing. So we
> > > do not know that basically the whole time is spent refaulting the
> > > memory back and forth. We do have some refault stats for the page
> > > cache, but that is not integrated into the OOM detection logic
> > > because this is really a non-trivial problem to solve without
> > > triggering early OOM killer invocations.
> > >
> > > [1] http://lkml.kernel.org/r/20170727153010.23347-1-han...@cmpxchg.org
> >
> > Thanks Michal. memdelay looks promising. We will check it.
>
> Great, I'm obviously interested in more users of it :)
>
> Please find attached the latest version of the patch series based on
> v4.13. It needs a bit more refactoring in the scheduler bits before
> resubmission, but it already contains a couple of fixes and
> improvements since the first version I sent out.
>
> Let me know if you need help rebasing to a different kernel version.
>
> From 708131b315b5a5da1beed167bca80ba067aa77a1 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner
> Date: Wed, 27 Sep 2017 20:01:41 +0300
> Subject: [PATCH 2/2] mm/sched: memdelay: memory health interface for
>  systems and workloads
>
> Linux doesn't have a useful metric to describe the memory health of a
> system, a cgroup container, or individual tasks.
>
> When workloads are bigger than available memory, they spend a certain
> amount of their time inside page reclaim, waiting on thrashing cache,
> and swapping in. This has an impact on latency, and depending on the
> CPU capacity in the system can also translate to a decrease in
> throughput.
>
> While Linux exports some stats and counters for these events, it does
> not quantify the true impact they have on throughput and latency. How
> much of the execution time is spent unproductively? This is important
> to know when sizing workloads to systems and containers.
> It also comes in handy when evaluating the effectiveness and
> efficiency of the kernel's memory management policies and heuristics.
>
> This patch implements a metric that quantifies memory pressure in a
> unit that matters most to applications and does not rely on hardware
> aspects to be meaningful: wallclock time lost while waiting on memory.
>
> Whenever a task is blocked on refaults, swapins, or direct reclaim,
> the time it spends is accounted on the task level and aggregated into
> a domain state along with other tasks on the system and cgroup level.
>
> Each task has a /proc/<pid>/memdelay file that lists the microseconds
> the task has been delayed since it was forked. That file can be
> sampled periodically for recent delays, or before and after certain
> operations to measure their memory-related latencies.
>
> On the system and cgroup level, there are /proc/memdelay and
> memory.memdelay, respectively, and their format is as such:
>
> $ cat /proc/memdelay
> 2489084
> 41.61 47.28 29.66
> 0.00 0.00 0.00
>
> The first line shows the cumulative delay times of all tasks in the
> domain - in this case, all
Re: Detecting page cache trashing state
Hi Taras,

On Fri, Sep 15, 2017 at 10:28:30AM -0700, Taras Kondratiuk wrote:
> Quoting Michal Hocko (2017-09-15 07:36:19)
> > On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> > > Has somebody faced a similar issue? How are you solving it?
> >
> > Yes, this is a pain point for a _long_ time. And we still do not have
> > a good answer upstream. Johannes has been playing in this area [1].
> > The main problem is that our OOM detection logic is based on the
> > ability to reclaim memory to allocate new memory. And that is pretty
> > much true for the pagecache when you are thrashing. So we do not know
> > that basically the whole time is spent refaulting the memory back and
> > forth. We do have some refault stats for the page cache, but that is
> > not integrated into the OOM detection logic because this is really a
> > non-trivial problem to solve without triggering early OOM killer
> > invocations.
> >
> > [1] http://lkml.kernel.org/r/20170727153010.23347-1-han...@cmpxchg.org
>
> Thanks Michal. memdelay looks promising. We will check it.

Great, I'm obviously interested in more users of it :)

Please find attached the latest version of the patch series based on
v4.13. It needs a bit more refactoring in the scheduler bits before
resubmission, but it already contains a couple of fixes and improvements
since the first version I sent out.

Let me know if you need help rebasing to a different kernel version.

From d5ffeb4d9d65fcff1b7e50dbde8264b4c32824a5 Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Wed, 14 Jun 2017 11:12:05 -0400
Subject: [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros

There are several identical definitions of those macros in places that
mess with fixed-point load averages. Provide an official version.
Signed-off-by: Johannes Weiner
---
 arch/powerpc/platforms/cell/spufs/sched.c | 3 ---
 arch/s390/appldata/appldata_os.c          | 4 ----
 drivers/cpuidle/governors/menu.c          | 4 ----
 fs/proc/loadavg.c                         | 3 ---
 include/linux/sched/loadavg.h             | 3 +++
 kernel/debug/kdb/kdb_main.c               | 7 +------
 6 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 1fbb5da17dd2..de544070def3 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -1071,9 +1071,6 @@ void spuctx_switch_state(struct spu_context *ctx,
 	}
 }
 
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static int show_spu_loadavg(struct seq_file *s, void *private)
 {
 	int a, b, c;
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 45b3178200ab..a8aac17e1e82 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -24,10 +24,6 @@
 #include "appldata.h"
 
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 /*
  * OS data
 *
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 61b64c2b2cb8..e215a2c10a61 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -132,10 +132,6 @@ struct menu_device {
 	int		interval_ptr;
 };
 
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static inline int get_loadavg(unsigned long load)
 {
 	return LOAD_INT(load) * 10 + LOAD_FRAC(load) / 10;
diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index 983fce5c2418..111a25e4b088 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -9,9 +9,6 @@
 #include
 #include
 
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static int loadavg_proc_show(struct seq_file *m, void *v)
 {
 	unsigned long avnrun[3];
diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index 4264bc6b2c27..745483bb5cca 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -26,6 +26,9 @@ extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
 	load += n*(FIXED_1-exp); \
 	load >>= FSHIFT;
 
+#define LOAD_INT(x) ((x) >> FSHIFT)
+#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
+
 extern void calc_global_load(unsigned long ticks);
 
 #endif /* _LINUX_SCHED_LOADAVG_H */
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index c8146d53ca67..225ccd7a 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -2571,16 +2571,11 @@ static int kdb_summary(int argc, const char **argv)
 	}
 	kdb_printf("%02ld:%02ld\n", val.uptime/(60*60), (val.uptime/60)%60);
 
-	/* lifted from fs/proc/proc_misc.c::loadavg_read_proc() */
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
 	kdb_printf("load avg   %ld.%02ld %ld.%02ld %ld.%02ld\n",
 		LOAD_INT(val.loads[0]), LOAD_FRAC(val.loads[0]),
 		LOAD_I
Re: Detecting page cache trashing state
On Fri 15-09-17 14:20:28, vcap...@pengaru.com wrote:
> On Fri, Sep 15, 2017 at 04:36:19PM +0200, Michal Hocko wrote:
> > On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> > > Hi
> > >
> > > In our devices under low memory conditions we often get into a
> > > thrashing state when the system spends most of the time re-reading
> > > pages of .text sections from a file system (squashfs in our case).
> > > The working set doesn't fit into the available page cache, so that
> > > is expected. The issue is that the OOM killer doesn't get triggered
> > > because there is still memory for reclaiming. The system may be
> > > stuck in this state for quite some time and usually dies because of
> > > watchdogs.
> > >
> > > We are trying to detect such a thrashing state early to take some
> > > preventive actions. It should be a pretty common issue, but so far
> > > we haven't found any existing VM/IO statistics that can reliably
> > > detect such a state.
> > >
> > > Most metrics provide absolute values: number/rate of page faults,
> > > rate of IO operations, number of stolen pages, etc. For a specific
> > > device configuration we can determine threshold values for those
> > > parameters that will detect the thrashing state, but that is not
> > > feasible for hundreds of device configurations.
> > >
> > > We are looking for some relative metric like "percent of CPU time
> > > spent handling major page faults". With such a relative metric we
> > > could use a common threshold across all devices. For now we have
> > > added such a metric to /proc/stat in our kernel, but we would like
> > > to find some mechanism available in the upstream kernel.
> > >
> > > Has somebody faced a similar issue? How are you solving it?
> >
> > Yes, this is a pain point for a _long_ time. And we still do not have
> > a good answer upstream. Johannes has been playing in this area [1].
> > The main problem is that our OOM detection logic is based on the
> > ability to reclaim memory to allocate new memory.
> > And that is pretty much true for the pagecache when you are
> > thrashing. So we do not know that basically the whole time is spent
> > refaulting the memory back and forth. We do have some refault stats
> > for the page cache, but that is not integrated into the OOM detection
> > logic because this is really a non-trivial problem to solve without
> > triggering early OOM killer invocations.
> >
> > [1] http://lkml.kernel.org/r/20170727153010.23347-1-han...@cmpxchg.org
>
> For desktop users running without swap, couldn't we just provide a
> kernel setting which marks all executable pages as unevictable when
> first faulted in?

This could result in an immediate DoS vector, and you could see
thrashing elsewhere. In fact we already protect executable pages and
reclaim them later (see page_check_references).

I am afraid that the only way to resolve the thrashing behavior is to
release a larger amount of memory, because shifting the reclaim priority
will just push the suboptimal behavior somewhere else. In order to do
that we really have to detect that the working set doesn't fit into
memory and that refaults are the predominant system activity.
-- 
Michal Hocko
SUSE Labs
Re: Detecting page cache trashing state
Quoting vcap...@pengaru.com (2017-09-15 14:20:28)
> On Fri, Sep 15, 2017 at 04:36:19PM +0200, Michal Hocko wrote:
> > On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote:
> > > Hi
> > >
> > > In our devices under low memory conditions we often get into a
> > > thrashing state when the system spends most of the time re-reading
> > > pages of .text sections from a file system (squashfs in our case).
> > > The working set doesn't fit into the available page cache, so that
> > > is expected. The issue is that the OOM killer doesn't get triggered
> > > because there is still memory for reclaiming. The system may be
> > > stuck in this state for quite some time and usually dies because of
> > > watchdogs.
> > >
> > > We are trying to detect such a thrashing state early to take some
> > > preventive actions. It should be a pretty common issue, but so far
> > > we haven't found any existing VM/IO statistics that can reliably
> > > detect such a state.
> > >
> > > Most metrics provide absolute values: number/rate of page faults,
> > > rate of IO operations, number of stolen pages, etc. For a specific
> > > device configuration we can determine threshold values for those
> > > parameters that will detect the thrashing state, but that is not
> > > feasible for hundreds of device configurations.
> > >
> > > We are looking for some relative metric like "percent of CPU time
> > > spent handling major page faults". With such a relative metric we
> > > could use a common threshold across all devices. For now we have
> > > added such a metric to /proc/stat in our kernel, but we would like
> > > to find some mechanism available in the upstream kernel.
> > >
> > > Has somebody faced a similar issue? How are you solving it?
> >
> > Yes, this is a pain point for a _long_ time. And we still do not have
> > a good answer upstream. Johannes has been playing in this area [1].
> > The main problem is that our OOM detection logic is based on the
> > ability to reclaim memory to allocate new memory.
> > And that is pretty much true for the pagecache when you are trashing. So
> > we do not know that basically whole time is spent refaulting the memory
> > back and forth. We do have some refault stats for the page cache but that
> > is not integrated to the oom detection logic because this is really a
> > non-trivial problem to solve without triggering early oom killer
> > invocations.
> >
> > [1] http://lkml.kernel.org/r/20170727153010.23347-1-han...@cmpxchg.org
>
> For desktop users running without swap, couldn't we just provide a kernel
> setting which marks all executable pages as unevictable when first faulted
> in? Then at least thrashing within the space occupied by executables and
> shared libraries before eventual OOM would be avoided, and only the
> remaining file-backed non-executable pages would be thrashable.
>
> On my swapless laptops I'd much rather have OOM killer kick in immediately
> rather than wait for a few minutes of thrashing to pass while the bogged
> down system crawls through depleting what's left of technically reclaimable
> memory. It's much improved on modern SSDs, but still annoying.

Usually a significant part of an executable is used rarely, or only once during initialization. Pinning all executable pages forever would waste a lot of memory.
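The objection above (pinning everything wastes memory) can be quantified per process from /proc/&lt;pid&gt;/smaps: comparing Size against Rss for the executable text mappings shows how much of the mapped text is actually resident. A minimal parsing sketch; the sample values below are invented for illustration, and a real policy would of course read the live smaps file:

```python
def text_residency(smaps_text):
    """Sum Size and Rss (in kB) over r-x file mappings in
    /proc/<pid>/smaps-style text."""
    size_kb = rss_kb = 0
    in_text = False
    for line in smaps_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "-" in fields[0] and ":" not in fields[0]:
            # mapping header, e.g. "00400000-004f0000 r-xp ... /bin/bash"
            in_text = len(fields) >= 6 and fields[1].startswith("r-x")
        elif in_text and fields[0] == "Size:":
            size_kb += int(fields[1])
        elif in_text and fields[0] == "Rss:":
            rss_kb += int(fields[1])
    return size_kb, rss_kb

# Hypothetical smaps fragment; numbers are made up.
SAMPLE = """\
00400000-004f0000 r-xp 00000000 08:01 123 /bin/bash
Size:                960 kB
Rss:                 320 kB
006ef000-006f0000 r--p 000ef000 08:01 123 /bin/bash
Size:                  4 kB
Rss:                   4 kB
"""

if __name__ == "__main__":
    size, rss = text_residency(SAMPLE)
    print(f"text mapped: {size} kB, resident: {rss} kB")
```

In this (invented) sample only a third of the mapped text is resident, which is the kind of gap that pinning all executable pages would turn into permanently unreclaimable memory.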
Re: Detecting page cache trashing state
On Fri, Sep 15, 2017 at 04:36:19PM +0200, Michal Hocko wrote: > On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote: > > Hi > > > > In our devices under low memory conditions we often get into a trashing > > state when system spends most of the time re-reading pages of .text > > sections from a file system (squashfs in our case). Working set doesn't > > fit into available page cache, so it is expected. The issue is that > > OOM killer doesn't get triggered because there is still memory for > > reclaiming. System may stuck in this state for a quite some time and > > usually dies because of watchdogs. > > > > We are trying to detect such trashing state early to take some > > preventive actions. It should be a pretty common issue, but for now we > > haven't find any existing VM/IO statistics that can reliably detect such > > state. > > > > Most of metrics provide absolute values: number/rate of page faults, > > rate of IO operations, number of stolen pages, etc. For a specific > > device configuration we can determine threshold values for those > > parameters that will detect trashing state, but it is not feasible for > > hundreds of device configurations. > > > > We are looking for some relative metric like "percent of CPU time spent > > handling major page faults". With such relative metric we could use a > > common threshold across all devices. For now we have added such metric > > to /proc/stat in our kernel, but we would like to find some mechanism > > available in upstream kernel. > > > > Has somebody faced similar issue? How are you solving it? > > Yes this is a pain point for a _long_ time. And we still do not have a > good answer upstream. Johannes has been playing in this area [1]. > The main problem is that our OOM detection logic is based on the ability > to reclaim memory to allocate new memory. And that is pretty much true > for the pagecache when you are trashing. So we do not know that > basically whole time is spent refaulting the memory back and forth. 
> We do have some refault stats for the page cache but that is not
> integrated to the oom detection logic because this is really a
> non-trivial problem to solve without triggering early oom killer
> invocations.
>
> [1] http://lkml.kernel.org/r/20170727153010.23347-1-han...@cmpxchg.org

For desktop users running without swap, couldn't we just provide a kernel setting which marks all executable pages as unevictable when first faulted in? Then at least thrashing within the space occupied by executables and shared libraries before eventual OOM would be avoided, and only the remaining file-backed non-executable pages would be thrashable.

On my swapless laptops I'd much rather have the OOM killer kick in immediately than wait for a few minutes of thrashing to pass while the bogged-down system crawls through depleting what's left of technically reclaimable memory. It's much improved on modern SSDs, but still annoying.

Regards,
Vito Caputo
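The regions such a "mark executables unevictable" setting would pin are the file-backed r-x mappings visible in /proc/&lt;pid&gt;/maps. A sketch that only identifies the candidates (actually pinning them would need mlock()/mlockall() or kernel-side support; the sample maps text below is invented for illustration):

```python
def executable_file_mappings(maps_text):
    """Parse /proc/<pid>/maps-style text and return (start, end, path)
    for r-x mappings backed by a file."""
    regions = []
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # anonymous mapping, no backing path
        addr, perms, path = fields[0], fields[1], fields[5]
        if perms.startswith("r-x") and path.startswith("/"):
            start, end = (int(x, 16) for x in addr.split("-"))
            regions.append((start, end, path))
    return regions

# Hypothetical /proc/<pid>/maps fragment.
SAMPLE = """\
00400000-004f0000 r-xp 00000000 08:01 123 /bin/bash
006ef000-006f0000 r--p 000ef000 08:01 123 /bin/bash
7f0000000000-7f0000021000 rw-p 00000000 00:00 0
7f0000200000-7f00003a0000 r-xp 00000000 08:01 456 /lib/x86_64-linux-gnu/libc-2.24.so
"""

if __name__ == "__main__":
    for start, end, path in executable_file_mappings(SAMPLE):
        print(f"{(end - start) // 1024:6d} KiB  {path}")
```

Note that this covers exactly the .text pages the thread describes being re-read from squashfs under memory pressure.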
Re: Detecting page cache trashing state
On 09/15/2017 09:38 AM, Taras Kondratiuk wrote:
> Quoting Daniel Walker (2017-09-15 07:22:27)
>> On 09/14/2017 05:16 PM, Taras Kondratiuk wrote:
>>> Hi
>>>
>>> In our devices under low memory conditions we often get into a trashing state when system spends most of the time re-reading pages of .text sections from a file system (squashfs in our case). Working set doesn't fit into available page cache, so it is expected. The issue is that OOM killer doesn't get triggered because there is still memory for reclaiming. System may stuck in this state for a quite some time and usually dies because of watchdogs.
>>>
>>> We are trying to detect such trashing state early to take some preventive actions. It should be a pretty common issue, but for now we haven't find any existing VM/IO statistics that can reliably detect such state.
>>>
>>> Most of metrics provide absolute values: number/rate of page faults, rate of IO operations, number of stolen pages, etc. For a specific device configuration we can determine threshold values for those parameters that will detect trashing state, but it is not feasible for hundreds of device configurations.
>>>
>>> We are looking for some relative metric like "percent of CPU time spent handling major page faults". With such relative metric we could use a common threshold across all devices. For now we have added such metric to /proc/stat in our kernel, but we would like to find some mechanism available in upstream kernel.
>>>
>>> Has somebody faced similar issue? How are you solving it?
>>
>> Did you make any attempt to tune swappiness ?
>>
>> Documentation/sysctl/vm.txt
>>
>> swappiness
>>
>> This control is used to define how aggressive the kernel will swap memory pages. Higher values will increase agressiveness, lower values decrease the amount of swap.
>>
>> The default value is 60.
>> ===
>>
>> Since your using squashfs I would guess that's going to act like swap. The default tune of 60 is most likely for x86 servers which may not be a good value for some other device.
> Swap is disabled in our systems, so anonymous pages can't be evicted. As per my understanding swappiness tune is irrelevant. Even with enabled swap swappiness tune can't help much in this case. If working set doesn't fit into available page cache we will hit the same trashing state.

I think it's our lack of understanding of how the VM works. If the system has no swap, then the system shouldn't start evicting pages unless you have 100% memory utilization; then the only place for those pages to go is back into the backing store, squashfs in this case. What you're suggesting is that there is still free memory, which means something must be evicting pages more aggressively rather than waiting until 100% utilization. Maybe someone more knowledgeable about the VM subsystem can clear this up.

Daniel
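The answer to Daniel's question is the per-zone watermarks: kswapd starts background reclaim once a zone's free pages drop below its "low" watermark, well before free memory reaches zero, and direct reclaim kicks in below "min". The watermarks are visible in /proc/zoneinfo. A minimal parsing sketch; the sample fragment and its numbers below are invented for illustration:

```python
def zone_watermarks(zoneinfo_text):
    """Return {zone_name: {'min': .., 'low': .., 'high': ..}} (in pages)
    from /proc/zoneinfo-style text."""
    zones, current = {}, None
    for line in zoneinfo_text.splitlines():
        line = line.strip()
        if line.startswith("Node") and ", zone" in line:
            current = line.split("zone")[1].strip()
            zones[current] = {}
        elif current:
            for key in ("min", "low", "high"):
                if line.startswith(key + " "):
                    zones[current][key] = int(line.split()[1])
    return zones

# Hypothetical /proc/zoneinfo fragment.
SAMPLE = """\
Node 0, zone   Normal
  pages free     12543
        min      2892
        low      3615
        high     4338
"""

if __name__ == "__main__":
    wm = zone_watermarks(SAMPLE)["Normal"]
    print("kswapd wakes below", wm["low"], "free pages,",
          "direct reclaim below", wm["min"])
```

So eviction before "100% utilization" is expected behavior, not a misconfiguration: reclaim is driven by the watermarks, not by total utilization.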
Re: Detecting page cache trashing state
Quoting Michal Hocko (2017-09-15 07:36:19) > On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote: > > Hi > > > > In our devices under low memory conditions we often get into a trashing > > state when system spends most of the time re-reading pages of .text > > sections from a file system (squashfs in our case). Working set doesn't > > fit into available page cache, so it is expected. The issue is that > > OOM killer doesn't get triggered because there is still memory for > > reclaiming. System may stuck in this state for a quite some time and > > usually dies because of watchdogs. > > > > We are trying to detect such trashing state early to take some > > preventive actions. It should be a pretty common issue, but for now we > > haven't find any existing VM/IO statistics that can reliably detect such > > state. > > > > Most of metrics provide absolute values: number/rate of page faults, > > rate of IO operations, number of stolen pages, etc. For a specific > > device configuration we can determine threshold values for those > > parameters that will detect trashing state, but it is not feasible for > > hundreds of device configurations. > > > > We are looking for some relative metric like "percent of CPU time spent > > handling major page faults". With such relative metric we could use a > > common threshold across all devices. For now we have added such metric > > to /proc/stat in our kernel, but we would like to find some mechanism > > available in upstream kernel. > > > > Has somebody faced similar issue? How are you solving it? > > Yes this is a pain point for a _long_ time. And we still do not have a > good answer upstream. Johannes has been playing in this area [1]. > The main problem is that our OOM detection logic is based on the ability > to reclaim memory to allocate new memory. And that is pretty much true > for the pagecache when you are trashing. So we do not know that > basically whole time is spent refaulting the memory back and forth. 
> We do have some refault stats for the page cache but that is not > integrated to the oom detection logic because this is really a > non-trivial problem to solve without triggering early oom killer > invocations. > > [1] http://lkml.kernel.org/r/20170727153010.23347-1-han...@cmpxchg.org Thanks Michal. memdelay looks promising. We will check it.
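The relative metric the thread converges on (whether via the poster's /proc/stat extension or Johannes' memdelay work) boils down to sampling a cumulative memory-stall-time counter and computing a delta ratio against wall time. The counter name and export path here are hypothetical, not something the upstream kernel of the time provided; this only sketches the shape of the computation:

```python
def stall_percent(stall_ns_prev, stall_ns_now, wall_ns_prev, wall_ns_now):
    """Percent of the sampling interval spent stalled on memory
    (major faults / refaults), given two cumulative samples."""
    wall = wall_ns_now - wall_ns_prev
    if wall <= 0:
        return 0.0
    return 100.0 * (stall_ns_now - stall_ns_prev) / wall

def is_thrashing(percent, threshold=40.0):
    # Because the metric is relative, one threshold can work across
    # hundreds of device configurations (the thread's stated goal).
    # The 40% value is an arbitrary placeholder.
    return percent >= threshold

if __name__ == "__main__":
    # e.g. 0.6 s of memory stalls within a 1 s sampling window
    p = stall_percent(2_000_000_000, 2_600_000_000,
                      5_000_000_000, 6_000_000_000)
    print(f"{p:.1f}% of time stalled, thrashing={is_thrashing(p)}")
```

This is exactly the "percent of CPU time spent handling major page faults" signal Taras describes, expressed as a two-sample delta so it can be polled by a userspace watchdog.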
Re: Detecting page cache trashing state
Quoting Daniel Walker (2017-09-15 07:22:27) > On 09/14/2017 05:16 PM, Taras Kondratiuk wrote: > > Hi > > > > In our devices under low memory conditions we often get into a trashing > > state when system spends most of the time re-reading pages of .text > > sections from a file system (squashfs in our case). Working set doesn't > > fit into available page cache, so it is expected. The issue is that > > OOM killer doesn't get triggered because there is still memory for > > reclaiming. System may stuck in this state for a quite some time and > > usually dies because of watchdogs. > > > > We are trying to detect such trashing state early to take some > > preventive actions. It should be a pretty common issue, but for now we > > haven't find any existing VM/IO statistics that can reliably detect such > > state. > > > > Most of metrics provide absolute values: number/rate of page faults, > > rate of IO operations, number of stolen pages, etc. For a specific > > device configuration we can determine threshold values for those > > parameters that will detect trashing state, but it is not feasible for > > hundreds of device configurations. > > > > We are looking for some relative metric like "percent of CPU time spent > > handling major page faults". With such relative metric we could use a > > common threshold across all devices. For now we have added such metric > > to /proc/stat in our kernel, but we would like to find some mechanism > > available in upstream kernel. > > > > Has somebody faced similar issue? How are you solving it? > > > Did you make any attempt to tune swappiness ? > > Documentation/sysctl/vm.txt > > swappiness > > This control is used to define how aggressive the kernel will swap > memory pages. Higher values will increase agressiveness, lower values > decrease the amount of swap. > > The default value is 60. > === > > Since your using squashfs I would guess that's going to act like swap. 
> The default tune of 60 is most likely for x86 servers which may not be a
> good value for some other device.

Swap is disabled in our systems, so anonymous pages can't be evicted. As per my understanding swappiness tuning is irrelevant. Even with swap enabled, swappiness can't help much in this case: if the working set doesn't fit into the available page cache, we will hit the same trashing state.
Re: Detecting page cache trashing state
On Thu 14-09-17 17:16:27, Taras Kondratiuk wrote: > Hi > > In our devices under low memory conditions we often get into a trashing > state when system spends most of the time re-reading pages of .text > sections from a file system (squashfs in our case). Working set doesn't > fit into available page cache, so it is expected. The issue is that > OOM killer doesn't get triggered because there is still memory for > reclaiming. System may stuck in this state for a quite some time and > usually dies because of watchdogs. > > We are trying to detect such trashing state early to take some > preventive actions. It should be a pretty common issue, but for now we > haven't find any existing VM/IO statistics that can reliably detect such > state. > > Most of metrics provide absolute values: number/rate of page faults, > rate of IO operations, number of stolen pages, etc. For a specific > device configuration we can determine threshold values for those > parameters that will detect trashing state, but it is not feasible for > hundreds of device configurations. > > We are looking for some relative metric like "percent of CPU time spent > handling major page faults". With such relative metric we could use a > common threshold across all devices. For now we have added such metric > to /proc/stat in our kernel, but we would like to find some mechanism > available in upstream kernel. > > Has somebody faced similar issue? How are you solving it? Yes this is a pain point for a _long_ time. And we still do not have a good answer upstream. Johannes has been playing in this area [1]. The main problem is that our OOM detection logic is based on the ability to reclaim memory to allocate new memory. And that is pretty much true for the pagecache when you are trashing. So we do not know that basically whole time is spent refaulting the memory back and forth. 
We do have some refault stats for the page cache but that is not integrated into the oom detection logic, because this is really a non-trivial problem to solve without triggering early oom killer invocations.

[1] http://lkml.kernel.org/r/20170727153010.23347-1-han...@cmpxchg.org

--
Michal Hocko
SUSE Labs
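The refault stats Michal refers to are the workingset_* counters in /proc/vmstat (workingset_refault counts previously-evicted page-cache pages being faulted back in); a fast-growing delta between samples is the thrashing signature the thread is after. A minimal sketch, with invented sample values (note that counter names can vary by kernel version, e.g. later kernels split the refault counter into anon/file variants):

```python
def vmstat_counters(vmstat_text, prefix="workingset_"):
    """Extract the workingset_* counters from /proc/vmstat-style text."""
    out = {}
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name.startswith(prefix):
            out[name] = int(value)
    return out

# Hypothetical /proc/vmstat snapshots taken one sampling interval apart.
SAMPLE_T0 = "pgfault 1000\nworkingset_refault 200\nworkingset_activate 50\n"
SAMPLE_T1 = "pgfault 9000\nworkingset_refault 6200\nworkingset_activate 90\n"

if __name__ == "__main__":
    a = vmstat_counters(SAMPLE_T0)
    b = vmstat_counters(SAMPLE_T1)
    refaults = b["workingset_refault"] - a["workingset_refault"]
    # A large refault delta means reclaimed pages are being read right
    # back in, i.e. the working set does not fit in memory.
    print("refaults in interval:", refaults)
```

As Michal notes, the hard part is not reading these counters but wiring them into OOM decisions without triggering premature kills.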
Re: Detecting page cache trashing state
On 09/14/2017 05:16 PM, Taras Kondratiuk wrote:
> Hi
>
> In our devices under low memory conditions we often get into a trashing state when system spends most of the time re-reading pages of .text sections from a file system (squashfs in our case). Working set doesn't fit into available page cache, so it is expected. The issue is that OOM killer doesn't get triggered because there is still memory for reclaiming. System may stuck in this state for a quite some time and usually dies because of watchdogs.
>
> We are trying to detect such trashing state early to take some preventive actions. It should be a pretty common issue, but for now we haven't find any existing VM/IO statistics that can reliably detect such state.
>
> Most of metrics provide absolute values: number/rate of page faults, rate of IO operations, number of stolen pages, etc. For a specific device configuration we can determine threshold values for those parameters that will detect trashing state, but it is not feasible for hundreds of device configurations.
>
> We are looking for some relative metric like "percent of CPU time spent handling major page faults". With such relative metric we could use a common threshold across all devices. For now we have added such metric to /proc/stat in our kernel, but we would like to find some mechanism available in upstream kernel.
>
> Has somebody faced similar issue? How are you solving it?

Did you make any attempt to tune swappiness?

Documentation/sysctl/vm.txt

swappiness

This control is used to define how aggressive the kernel will swap memory pages. Higher values will increase aggressiveness, lower values decrease the amount of swap.

The default value is 60.
===

Since you're using squashfs I would guess that's going to act like swap. The default tune of 60 is most likely for x86 servers, which may not be a good value for some other device.

Daniel
Re: Detecting page cache trashing state
On 15.9.2017 at 02:16, Taras Kondratiuk wrote:
> Hi
>
> In our devices under low memory conditions we often get into a trashing state when system spends most of the time re-reading pages of .text sections from a file system (squashfs in our case). Working set doesn't fit into available page cache, so it is expected. The issue is that OOM killer doesn't get triggered because there is still memory for reclaiming. System may stuck in this state for a quite some time and usually dies because of watchdogs.
>
> We are trying to detect such trashing state early to take some preventive actions. It should be a pretty common issue, but for now we haven't find any existing VM/IO statistics that can reliably detect such state.
>
> Most of metrics provide absolute values: number/rate of page faults, rate of IO operations, number of stolen pages, etc. For a specific device configuration we can determine threshold values for those parameters that will detect trashing state, but it is not feasible for hundreds of device configurations.
>
> We are looking for some relative metric like "percent of CPU time spent handling major page faults". With such relative metric we could use a common threshold across all devices. For now we have added such metric to /proc/stat in our kernel, but we would like to find some mechanism available in upstream kernel.
>
> Has somebody faced similar issue? How are you solving it?

Hi

Well, I witness this when running Firefox & Thunderbird for a while on my desktop machine with just 4G of RAM, until these two apps eat all free RAM... It gets to the point (when I open a new tab) where the mouse hardly moves - kswapd eats CPU (I have no swap in fact, so it is likely just page-cache reclaim). The only 'quick' solution for me as a desktop user is to manually invoke the OOM killer with the SysRq+F key - and I'm also wondering why the system is not reacting better. In most cases it kills one of those two apps - but sometimes it kills the whole X session...

Regards

Zdenek