RE: mm: pages are not freed from lru_add_pvecs after process termination
On Tue 07-06-16 13:20:00, Michal Hocko wrote:
> I guess you want something like posix_memalign or start faulting in from
> an aligned address to guarantee you will fault 2MB pages.

Good catch.

> Besides that I am really suspicious that this will be measurable at all.
> I would just go and spin a patch assuming you are still able to trigger
> OOM with the vanilla kernel.

Yes, I am still able to trigger OOM. The tests I did are more like sanity
checks than benchmarks. lru_cache_add takes very little time, so the runs
were meant to look for unexpected side effects rather than to measure it.

Thanks,
Lukas
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Tue 07-06-16 09:02:02, Odzioba, Lukasz wrote:
[...]
> //compile with: gcc bench.c -o bench_2M -fopenmp
> //compile with: gcc -D SMALL_PAGES bench.c -o bench_4K -fopenmp
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
>
> #define MAP_HUGE_SHIFT 26
> #define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
>
> #ifndef SMALL_PAGES
> #define PAGE_SIZE (1024*1024*2)
> #define MAP_PARAM (MAP_HUGE_2MB)

Isn't MAP_HUGE_2MB ignored for !hugetlb pages?

> #else
> #define PAGE_SIZE (1024*4)
> #define MAP_PARAM (0)
> #endif
>
> void main() {
>     size_t size = ((60 * 1000 * 1000) / 288) * 1000; // 60GB of memory across 288 CPUs
>     #pragma omp parallel
>     {
>         unsigned int k;
>         for (k = 0; k < 10; k++) {
>             void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
>                            MAP_PRIVATE | MAP_ANON | MAP_PARAM, -1, 0);

I guess you want something like posix_memalign or start faulting in from
an aligned address to guarantee you will fault 2MB pages. Also note that
the default behavior for THP during the fault has changed recently (see
444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a
stall-free defrag option")), so you might need MADV_HUGEPAGE.

Besides that I am really suspicious that this will be measurable at all.
I would just go and spin a patch assuming you are still able to trigger
OOM with the vanilla kernel. The bug fix is more important...
--
Michal Hocko
SUSE Labs
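As a concrete illustration of the suggestion above, here is a minimal
sketch of an aligned, explicitly THP-friendly allocation. The sizes are
arbitrary examples, and the MADV_HUGEPAGE call reflects the
post-444eb2a449ef "madvise" fault-time default mentioned above:

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

int main(void)
{
	size_t size = 64 * HPAGE_SIZE;	/* arbitrary example size */
	void *p;

	/* 2MB alignment so each fault can be satisfied with a 2MB page */
	if (posix_memalign(&p, HPAGE_SIZE, size))
		return 1;

	/* opt in to THP explicitly; the fault-time default is "madvise"
	 * after 444eb2a449ef */
	madvise(p, size, MADV_HUGEPAGE);

	memset(p, 0, size);	/* fault the whole range in */
	free(p);
	return 0;
}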
RE: mm: pages are not freed from lru_add_pvecs after process termination
On Wed 05-11-16 09:53:00, Michal Hocko wrote:
> Yes I think this makes sense. The only case where it would be suboptimal
> is when the pagevec was already full and then we just created a single
> page pvec to drain it. This can be handled better though by:
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 95916142fc46..3fe4f180e8bf 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -391,9 +391,8 @@ static void __lru_cache_add(struct page *page)
>  	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
>
>  	get_page(page);
> -	if (!pagevec_space(pvec))
> +	if (!pagevec_add(pvec, page) || PageCompound(page))
>  		__pagevec_lru_add(pvec);
> -	pagevec_add(pvec, page);
>  	put_cpu_var(lru_add_pvec);
>  }

It's been a while, but I am back with some results.

For 2M and 4K pages I wrote a simple app which mmaps and unmaps a lot of
memory (60GB across 288 CPUs) in parallel, and does it ten times to get
rid of some OS/threading overhead. Then I created an app which mixes page
sizes in a pseudo-random way. I executed those 10 times under the "time"
command (once with THP=on and once with THP=off) and calculated the sum,
min, max, and avg of sys, real, and user time, which was necessary due to
significant variance in the results.

Overall it seems that this change has no negative impact on performance:
4K  THP=on,off -> no significant change
2M  THP=on,off -> it might be a tiny bit slower, but still close to measurement error
MIX THP=on,off -> no significant change

If you have any concerns about test correctness please let me know.
Below I added the test applications and test results.

Thanks,
Lukas

--

//compile with: gcc bench.c -o bench_2M -fopenmp
//compile with: gcc -D SMALL_PAGES bench.c -o bench_4K -fopenmp
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define MAP_HUGE_SHIFT 26
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)

#ifndef SMALL_PAGES
#define PAGE_SIZE (1024*1024*2)
#define MAP_PARAM (MAP_HUGE_2MB)
#else
#define PAGE_SIZE (1024*4)
#define MAP_PARAM (0)
#endif

void main() {
	size_t size = ((60 * 1000 * 1000) / 288) * 1000; // 60GB of memory across 288 CPUs
	#pragma omp parallel
	{
		unsigned int k;
		for (k = 0; k < 10; k++) {
			void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANON | MAP_PARAM, -1, 0);
			if (p != MAP_FAILED) {
				char *cp = (char *)p;
				size_t i;
				for (i = 0; i < size / PAGE_SIZE; i++) {
					*cp = 0;
					cp += PAGE_SIZE;
				}
				munmap(p, size);
			}
		}
	}
}

//compile with: gcc bench_mixed.c -o bench_mixed -fopenmp
#include <stdio.h>
#include <sys/mman.h>
#include <omp.h>

#define SMALL_PAGE (1024*4)
#define HUGE_PAGE (1024*1024*2)
#define MAP_HUGE_SHIFT 26
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)

void main() {
	size_t size = ((60 * 1000 * 1000) / 288) * 1000; // 60GB of memory across 288 CPUs
	#pragma omp parallel
	{
		unsigned int k, MAP_PARAM = 0;
		unsigned int PAGE_SIZE = SMALL_PAGE;
		for (k = 0; k < 10; k++) {
			if ((k + omp_get_thread_num()) % 2) {
				MAP_PARAM = MAP_HUGE_2MB;
				PAGE_SIZE = HUGE_PAGE;
			}
			void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANON | MAP_PARAM, -1, 0);
			if (p != MAP_FAILED) {
				char *cp = (char *)p;
				size_t i;
				for (i = 0; i < size / PAGE_SIZE; i++) {
					*cp = 0;
					cp += PAGE_SIZE;
				}
				munmap(p, size);
			}
		}
	}
}

***
# 4K THP=ON
###real
unpatched		patched###
sum = 428.737s		sum = 421.339s
min = 41.187s		min = 41.492s
max = 44.948s		max = 42.822s
avg = 42.874s		avg = 42.134s

###user
unpatched		patched###
sum = 145.241s		sum = 147.283s
min = 13.760s		min = 14.418s
max = 15.532s		max = 15.201s
avg = 14.524s		avg = 14.728s

###sys
unpatched		patched###
sum = 4882.708s		sum = 5020.581s
min = 441.922s		min = 490.516s
max = 535.294s		max = 532.137s
avg = 488.271s		avg = 502.058s

# 4K THP=OFF###
###real
unpatched		patched###
sum = 2149.288s		sum = 2144.336s
min = 214.589s		min = 212.642s
max = 215.937s		max = 215.579s
avg = 214.929s		avg = 214.434s

###user
unpatched		patched###
sum = 858.659s		sum = 858.166s
min = 81.655s		min = 82.084s
max = 87.790s		max = 88.649s
avg = 85.866s		avg = 85.817s

###sys
unpatched		patched###
sum = 32357.867s	sum = 31126.183s
min = 2952.685s		min = 2783.157s
max = 3442.004s		max =
RE: mm: pages are not freed from lru_add_pvecs after process termination
On Wed 05-11-16 09:53:00, Michal Hocko wrote:
> Yes I think this makes sense. The only case where it would be suboptimal
> is when the pagevec was already full and then we just created a single
> page pvec to drain it. This can be handled better though by:
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 95916142fc46..3fe4f180e8bf 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -391,9 +391,8 @@ static void __lru_cache_add(struct page *page)
>  	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
>
>  	get_page(page);
> -	if (!pagevec_space(pvec))
> +	if (!pagevec_add(pvec, page) || PageCompound(page))
>  		__pagevec_lru_add(pvec);
> -	pagevec_add(pvec, page);
>  	put_cpu_var(lru_add_pvec);
>  }

Oh yeah, that's exactly what I meant. I couldn't find such an elegant way
of handling this special case and didn't want to obscure the idea.
I'll do the tests proposed by Dave and be back here with results next week.

Thank you guys for the involvement,
Lukas
Re: mm: pages are not freed from lru_add_pvecs after process termination
On 05/11/2016 09:53 AM, Michal Hocko wrote:
> On Fri 06-05-16 09:04:34, Dave Hansen wrote:
>> On 05/06/2016 08:10 AM, Odzioba, Lukasz wrote:
>>> On Thu 05-05-16 09:21:00, Michal Hocko wrote:
>>>> Or maybe the async nature of flushing turns out to be just impractical
>>>> and unreliable and we will end up skipping THP (or all compound pages)
>>>> for pcp LRU add cache. Let's see...
>>>
>>> What if we simply skip lru_add pvecs for compound pages?
>>> That way we still have compound pages on LRU's, but the problem goes
>>> away. It is not quite what this naïve patch does, but it works nicely for me.
>>>
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index 03aacbc..c75d5e1 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -392,7 +392,9 @@ static void __lru_cache_add(struct page *page)
>>>  	get_page(page);
>>>  	if (!pagevec_space(pvec))
>>>  		__pagevec_lru_add(pvec);
>>>  	pagevec_add(pvec, page);
>>> +	if (PageCompound(page))
>>> +		__pagevec_lru_add(pvec);
>>>  	put_cpu_var(lru_add_pvec);
>>>  }
>>
>> That's not _quite_ what I had in mind since that drains the entire pvec
>> every time a large page is encountered. But I'm conflicted about what
>> the right behavior _is_.
>>
>> We'd be taking the LRU lock for 'page' anyway, so we might as well drain
>> the pvec.

Note that pages in the pagevec can come from different zones, so this is
not universally true.

> Yes I think this makes sense. The only case where it would be suboptimal
> is when the pagevec was already full and then we just created a single
> page pvec to drain it. This can be handled better though by:
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 95916142fc46..3fe4f180e8bf 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -391,9 +391,8 @@ static void __lru_cache_add(struct page *page)
>  	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
>
>  	get_page(page);
> -	if (!pagevec_space(pvec))
> +	if (!pagevec_add(pvec, page) || PageCompound(page))
>  		__pagevec_lru_add(pvec);
> -	pagevec_add(pvec, page);
>  	put_cpu_var(lru_add_pvec);
>  }

Yeah, that could work. There might be more complex solutions at the level
of lru_cache_add_active_or_unevictable(), which we call either from the
base page code (mm/memory.c) or from functions in mm/huge_memory.c. We
could redirect it at that point, but it's likely not worth the trouble
unless this simple solution shows some performance regression...

>> Or, does the additional work to put the page on to a pvec and then
>> immediately drain it overwhelm that advantage?
>
> pagevec_add is quite trivial so I would be really surprised if it mattered.
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Fri 06-05-16 09:04:34, Dave Hansen wrote:
> On 05/06/2016 08:10 AM, Odzioba, Lukasz wrote:
> > On Thu 05-05-16 09:21:00, Michal Hocko wrote:
> >> Or maybe the async nature of flushing turns
> >> out to be just impractical and unreliable and we will end up skipping
> >> THP (or all compound pages) for pcp LRU add cache. Let's see...
> >
> > What if we simply skip lru_add pvecs for compound pages?
> > That way we still have compound pages on LRU's, but the problem goes
> > away. It is not quite what this naïve patch does, but it works nicely for me.
> >
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 03aacbc..c75d5e1 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -392,7 +392,9 @@ static void __lru_cache_add(struct page *page)
> >  	get_page(page);
> >  	if (!pagevec_space(pvec))
> >  		__pagevec_lru_add(pvec);
> >  	pagevec_add(pvec, page);
> > +	if (PageCompound(page))
> > +		__pagevec_lru_add(pvec);
> >  	put_cpu_var(lru_add_pvec);
> >  }
>
> That's not _quite_ what I had in mind since that drains the entire pvec
> every time a large page is encountered. But I'm conflicted about what
> the right behavior _is_.
>
> We'd be taking the LRU lock for 'page' anyway, so we might as well drain
> the pvec.

Yes I think this makes sense. The only case where it would be suboptimal
is when the pagevec was already full and then we just created a single
page pvec to drain it. This can be handled better though by:

diff --git a/mm/swap.c b/mm/swap.c
index 95916142fc46..3fe4f180e8bf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -391,9 +391,8 @@ static void __lru_cache_add(struct page *page)
 	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
 
 	get_page(page);
-	if (!pagevec_space(pvec))
+	if (!pagevec_add(pvec, page) || PageCompound(page))
 		__pagevec_lru_add(pvec);
-	pagevec_add(pvec, page);
 	put_cpu_var(lru_add_pvec);
 }

> Or, does the additional work to put the page on to a pvec and then
> immediately drain it overwhelm that advantage?

pagevec_add is quite trivial so I would be really surprised if it mattered.
--
Michal Hocko
SUSE Labs
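For context on the one-liner above: pagevec_add reports the space left
after adding, which is why !pagevec_add(pvec, page) detects a just-filled
pagevec. Roughly, from that era's include/linux/pagevec.h (a paraphrase
from memory, not a verbatim copy):

static inline unsigned pagevec_space(struct pagevec *pvec)
{
	return PAGEVEC_SIZE - pvec->nr;	/* free slots left */
}

/* Returns the space remaining after the add, so a zero return means
 * the pagevec just became full and should be drained. */
static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
{
	pvec->pages[pvec->nr++] = page;
	return pagevec_space(pvec);
}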
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Thu 05-05-16 17:25:07, Odzioba, Lukasz wrote:
> On Thu 05-05-16 09:21:00, Michal Hocko wrote:
> > OK, it wasn't that tricky after all. Maybe I have missed something but
> > the following should work. Or maybe the async nature of flushing turns
> > out to be just impractical and unreliable and we will end up skipping
> > THP (or all compound pages) for pcp LRU add cache. Let's see...
>
> Initially this issue was found on RH's 3.10.x kernel, but now I am using
> 4.6-rc6.
>
> Overall it does help, and under heavy load it is slightly better than the
> second patch. Unfortunately I am still able to hit 10-20% oom kills with it
> (down from 30-50%), partially due to the earlier vmstat_update call -
> it went up to 25-25% with this patch below:

This simply shows that this is not a viable option. So I guess we really
want to skip THP (compound pages) for the LRU add pcp cache instead.

Thanks for your effort and testing!
--
Michal Hocko
SUSE Labs
Re: mm: pages are not freed from lru_add_pvecs after process termination
On 05/06/2016 08:10 AM, Odzioba, Lukasz wrote:
> On Thu 05-05-16 09:21:00, Michal Hocko wrote:
>> Or maybe the async nature of flushing turns
>> out to be just impractical and unreliable and we will end up skipping
>> THP (or all compound pages) for pcp LRU add cache. Let's see...
>
> What if we simply skip lru_add pvecs for compound pages?
> That way we still have compound pages on LRU's, but the problem goes
> away. It is not quite what this naïve patch does, but it works nicely for me.
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 03aacbc..c75d5e1 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -392,7 +392,9 @@ static void __lru_cache_add(struct page *page)
>  	get_page(page);
>  	if (!pagevec_space(pvec))
>  		__pagevec_lru_add(pvec);
>  	pagevec_add(pvec, page);
> +	if (PageCompound(page))
> +		__pagevec_lru_add(pvec);
>  	put_cpu_var(lru_add_pvec);
>  }

That's not _quite_ what I had in mind since that drains the entire pvec
every time a large page is encountered. But I'm conflicted about what
the right behavior _is_.

We'd be taking the LRU lock for 'page' anyway, so we might as well drain
the pvec. Or, does the additional work to put the page on to a pvec and
then immediately drain it overwhelm that advantage? Or does it just not
matter?

Kirill, do you have a suggestion for how we should be checking for THP
pages in code like this? PageCompound() will surely _work_ for anon-THP
and your file-THP, but is it the best way to check?

> Do we have any tests that I could use to measure the performance impact
> of such changes before I start to tweak it up? Or maybe it doesn't make
> sense at all?

You probably want to very carefully calculate the time to fault a page,
then separately the time to free a page. If we can't manage to detect a
delta in a little microbenchmark like that, then we'll probably never see
one in practice.

You'll want to measure the fault time for 4k pages, 2M pages, and then
possibly a mix. You'll want to do this in a highly parallel test to make
sure any additional LRU lock overhead shows up.
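A minimal sketch of the kind of microbenchmark described above, timing the
fault and free phases separately; the page size and region size are
placeholder assumptions, and an OpenMP parallel region around the body
would cover the highly parallel case:

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define PAGE_SZ (2UL * 1024 * 1024)	/* placeholder: time 2MB pages */

static double now_sec(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	size_t size = 1024 * PAGE_SZ;	/* placeholder region size */
	size_t pages = size / PAGE_SZ, i;
	double t0, t1, t2;
	char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANON, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	t0 = now_sec();
	for (i = 0; i < pages; i++)	/* fault cost */
		p[i * PAGE_SZ] = 0;
	t1 = now_sec();
	munmap(p, size);		/* free cost */
	t2 = now_sec();

	printf("fault: %.3f us/page, free: %.3f us/page\n",
	       (t1 - t0) * 1e6 / pages, (t2 - t1) * 1e6 / pages);
	return 0;
}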
RE: mm: pages are not freed from lru_add_pvecs after process termination
On Thu 05-05-16 09:21:00, Michal Hocko wrote:
> Or maybe the async nature of flushing turns
> out to be just impractical and unreliable and we will end up skipping
> THP (or all compound pages) for pcp LRU add cache. Let's see...

What if we simply skip lru_add pvecs for compound pages?
That way we still have compound pages on LRU's, but the problem goes
away. It is not quite what this naïve patch does, but it works nicely for me.

diff --git a/mm/swap.c b/mm/swap.c
index 03aacbc..c75d5e1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -392,7 +392,9 @@ static void __lru_cache_add(struct page *page)
 	get_page(page);
 	if (!pagevec_space(pvec))
 		__pagevec_lru_add(pvec);
 	pagevec_add(pvec, page);
+	if (PageCompound(page))
+		__pagevec_lru_add(pvec);
 	put_cpu_var(lru_add_pvec);
 }

Do we have any tests that I could use to measure the performance impact
of such changes before I start to tweak it up? Or maybe it doesn't make
sense at all?

Thanks,
Lukas
RE: mm: pages are not freed from lru_add_pvecs after process termination
On Thu 05-05-16 09:21:00, Michal Hocko wrote:
> OK, it wasn't that tricky after all. Maybe I have missed something but
> the following should work. Or maybe the async nature of flushing turns
> out to be just impractical and unreliable and we will end up skipping
> THP (or all compound pages) for pcp LRU add cache. Let's see...

Initially this issue was found on RH's 3.10.x kernel, but now I am using
4.6-rc6.

Overall it does help, and under heavy load it is slightly better than the
second patch. Unfortunately I am still able to hit 10-20% oom kills with it
(down from 30-50%), partially due to the earlier vmstat_update call -
it went up to 25-25% with this patch below:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b4359f8..7a5ab0d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3264,17 +3264,17 @@ retry:
 	if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
 		migration_mode = MIGRATE_SYNC_LIGHT;
 
-	if (!vmstat_updated) {
-		vmstat_updated = true;
-		kick_vmstat_update();
-	}
-
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					alloc_flags, ac,
 					&did_some_progress);
 	if (page)
 		goto got_pg;
 
+	if (!vmstat_updated) {
+		vmstat_updated = true;
+		kick_vmstat_update();
+	}

I don't quite see an uninvasive way to make sure that we drain all pvecs
before failing an allocation, and doing it asynchronously will race with
allocations anyway - I guess.

Thanks,
Lukas
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Wed 04-05-16 22:36:43, Michal Hocko wrote:
> On Wed 04-05-16 19:41:59, Odzioba, Lukasz wrote:
[...]
> > I have an app which allocates almost all of the memory from a numa node
> > and with just the second patch and 100 consecutive executions 30-50%
> > got killed.
>
> This is still not acceptable. So I guess we need a way to kick
> vmstat_shepherd from the reclaim path. I will think about that. Sounds a
> bit tricky at first sight.

OK, it wasn't that tricky after all. Maybe I have missed something but
the following should work. Or maybe the async nature of flushing turns
out to be just impractical and unreliable and we will end up skipping
THP (or all compound pages) for pcp LRU add cache. Let's see...
---
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 0aa613df463e..7f2c1aef6a09 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -274,4 +274,5 @@ static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
 extern const char * const vmstat_text[];
 
+extern void kick_vmstat_update(void);
 #endif /* _LINUX_VMSTAT_H */
diff --git a/mm/internal.h b/mm/internal.h
index b6ead95a0184..876125bd11f4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -488,4 +488,5 @@ extern const struct trace_print_flags pageflag_names[];
 extern const struct trace_print_flags vmaflag_names[];
 extern const struct trace_print_flags gfpflag_names[];
 
+extern bool pcp_lru_add_need_drain(int cpu);
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 056baf55a88d..5ca829e707f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3556,6 +3556,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum compact_result compact_result;
 	int compaction_retries = 0;
 	int no_progress_loops = 0;
+	bool vmstat_updated = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3658,6 +3659,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (order && compaction_made_progress(compact_result))
 		compaction_retries++;
 
+	if (!vmstat_updated) {
+		vmstat_updated = true;
+		kick_vmstat_update();
+	}
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					alloc_flags, ac,
 					&did_some_progress);
diff --git a/mm/swap.c b/mm/swap.c
index 95916142fc46..3937e6caef96 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -667,6 +667,15 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
 
 static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
 
+bool pcp_lru_add_need_drain(int cpu)
+{
+	return pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+	       pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+	       pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+	       pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+	       need_activate_page_drain(cpu);
+}
+
 void lru_add_drain_all(void)
 {
 	static DEFINE_MUTEX(lock);
@@ -680,11 +689,7 @@ void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
-		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		if (pcp_lru_add_need_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
 			cpumask_set_cpu(cpu, &has_work);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7397d9548f21..cf4b095ace1c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -479,6 +479,13 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
+	/*
+	 * Do not try to drain LRU pcp caches because that might be
+	 * expensive - we take locks there etc.
+	 */
+	if (do_pagesets && pcp_lru_add_need_drain(smp_processor_id()))
+		lru_add_drain();
+
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset __percpu *p = zone->pageset;
@@ -1477,7 +1484,8 @@ static bool need_update(int cpu)
 			return true;
 	}
 
-	return false;
+
+	return pcp_lru_add_need_drain(cpu);
 }
 
 void quiet_vmstat(void)
@@ -1542,6 +1550,16 @@ static void vmstat_shepherd(struct work_struct *w)
 			round_jiffies_relative(sysctl_stat_interval));
 }
 
+void kick_vmstat_update(void)
+{
+#ifdef CONFIG_SMP
+	might_sleep();
+
+	if (cancel_delayed_work(&shepherd))
+		vmstat_shepherd(
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Wed 04-05-16 19:41:59, Odzioba, Lukasz wrote:
> On Thu 02-05-16 03:00:00, Michal Hocko wrote:
> > So I have given this a try (not tested yet) and it doesn't look terribly
> > complicated. It is hijacking vmstat for a purpose it wasn't intended for
> > originally but creating dedicated kernel threads/WQ sounds like an
> > overkill to me. Does this help or do we have to be more aggressive and
> > wake up the shepherd from the allocator slow path? Could you give it a
> > try please?
>
> It seems to work fine, but it takes quite a random time to drain the
> lists, sometimes a couple of seconds, sometimes over two minutes. It is
> acceptable I believe.

I guess you mean that some CPUs are not drained for a few minutes, right?
This might be quite long, and I deliberately did not flush the LRU pvecs
on idle entry because I felt it would be too expensive. Maybe it would be
better to kick the vmstat_shepherd from the allocator slow path. It would
still take an unpredictable amount of time, but it would at least be
triggered when we are getting short on memory.

> I have an app which allocates almost all of the memory from a numa node
> and with just the second patch and 100 consecutive executions 30-50% got
> killed.

This is still not acceptable. So I guess we need a way to kick
vmstat_shepherd from the reclaim path. I will think about that. Sounds a
bit tricky at first sight.

> After applying also your first patch I haven't seen any oom kill
> activity - great.

As I've said, the first patch is quite dangerous as it depends on the WQ
to make forward progress, which might itself depend on a memory allocation
to create a new worker.

> I was wondering how many lru_add_drain()'s are called, and after boot
> when the machine was idle it was a bit over 5k calls during the first
> 400s, and with some activity it went up to 15k calls during 700s
> (including the 5k from the previous experiment), which sounds fair to me
> given the big CPU count.
>
> Do you see any advantages of dropping THP from pagevecs over this
> solution?

Well, the general purpose of pcp pagevecs is to reduce lru_lock contention.
I have never measured the effect for THP pages. It is true that THP
amortizes the contention by the number of base pages handled at once, so
skipping it might be the easiest way (and certainly more acceptable for an
old kernel, which you seem to be running as mentioned by Dave), but it
sounds too special cased and I would rather see less special casing for
THP. So if the async pcp sync is not too tricky or hard to maintain and
works, I would rather go that way.

Thanks for testing those patches!
--
Michal Hocko
SUSE Labs
Re: mm: pages are not freed from lru_add_pvecs after process termination
On 05/04/2016 12:41 PM, Odzioba, Lukasz wrote:
> Do you see any advantages of dropping THP from pagevecs over this solution?

It's a more foolproof solution. Even with this patch, there might still
be some corner cases where the draining doesn't occur. That "two minutes"
might become 20 or 200 under some circumstances.
RE: mm: pages are not freed from lru_add_pvecs after process termination
On Thu 02-05-16 03:00:00, Michal Hocko wrote:
> So I have given this a try (not tested yet) and it doesn't look terribly
> complicated. It is hijacking vmstat for a purpose it wasn't intended for
> originally but creating dedicated kernel threads/WQ sounds like an
> overkill to me. Does this help or do we have to be more aggressive and
> wake up the shepherd from the allocator slow path? Could you give it a
> try please?

It seems to work fine, but it takes quite a random time to drain the lists,
sometimes a couple of seconds, sometimes over two minutes. It is acceptable
I believe.

I have an app which allocates almost all of the memory from a numa node and
with just the second patch and 100 consecutive executions 30-50% got killed.
After applying also your first patch I haven't seen any oom kill activity -
great.

I was wondering how many lru_add_drain()'s are called, and after boot when
the machine was idle it was a bit over 5k calls during the first 400s, and
with some activity it went up to 15k calls during 700s (including the 5k
from the previous experiment), which sounds fair to me given the big CPU
count.

Do you see any advantages of dropping THP from pagevecs over this solution?

Thanks,
Lukas
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Tue, May 03, 2016 at 09:37:57AM +0200, Michal Hocko wrote:
> On Mon 02-05-16 19:02:50, Kirill A. Shutemov wrote:
> > On Mon, May 02, 2016 at 08:49:03AM -0700, Dave Hansen wrote:
> > > On 05/02/2016 08:01 AM, Kirill A. Shutemov wrote:
> > > > On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
> > > > > On 04/27/2016 07:11 PM, Dave Hansen wrote:
> > > > > > 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
> > > > > >    severity of the problem.
> > > > >
> > > > > I think that makes sense. Being large already amortizes the cost per base
> > > > > page much more than pagevecs do (512 vs ~22 pages?).
> > > >
> > > > We try to do this already, don't we? Any specific case where we have
> > > > THPs on pagevecs?
> > >
> > > Lukas was hitting this on a RHEL 7 era kernel. In his kernel at least,
> > > I'm pretty sure THP's were ending up on pagevecs. Are you saying you
> > > don't think we're doing that any more?
> >
> > As Vlastimil pointed out, we do. It needs to be fixed, I think.
>
> It seems that offloading the draining to the vmstat context doesn't look
> terribly bad. Don't we rather want to go that way?

Maybe. My knowledge about the lru cache is limited.
--
 Kirill A. Shutemov
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Mon 02-05-16 19:02:50, Kirill A. Shutemov wrote:
> On Mon, May 02, 2016 at 08:49:03AM -0700, Dave Hansen wrote:
> > On 05/02/2016 08:01 AM, Kirill A. Shutemov wrote:
> > > On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
> > > > On 04/27/2016 07:11 PM, Dave Hansen wrote:
> > > > > 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
> > > > >    severity of the problem.
> > > >
> > > > I think that makes sense. Being large already amortizes the cost per base
> > > > page much more than pagevecs do (512 vs ~22 pages?).
> > >
> > > We try to do this already, don't we? Any specific case where we have
> > > THPs on pagevecs?
> >
> > Lukas was hitting this on a RHEL 7 era kernel. In his kernel at least,
> > I'm pretty sure THP's were ending up on pagevecs. Are you saying you
> > don't think we're doing that any more?
>
> As Vlastimil pointed out, we do. It needs to be fixed, I think.

It seems that offloading the draining to the vmstat context doesn't look
terribly bad. Don't we rather want to go that way?
--
Michal Hocko
SUSE Labs
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Mon, May 02, 2016 at 08:49:03AM -0700, Dave Hansen wrote:
> On 05/02/2016 08:01 AM, Kirill A. Shutemov wrote:
> > On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
> > > On 04/27/2016 07:11 PM, Dave Hansen wrote:
> > > > 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
> > > >    severity of the problem.
> > >
> > > I think that makes sense. Being large already amortizes the cost per base
> > > page much more than pagevecs do (512 vs ~22 pages?).
> >
> > We try to do this already, don't we? Any specific case where we have
> > THPs on pagevecs?
>
> Lukas was hitting this on a RHEL 7 era kernel. In his kernel at least,
> I'm pretty sure THP's were ending up on pagevecs. Are you saying you
> don't think we're doing that any more?

As Vlastimil pointed out, we do. It needs to be fixed, I think.
Any volunteer? :-P
--
 Kirill A. Shutemov
Re: mm: pages are not freed from lru_add_pvecs after process termination
On 05/02/2016 08:01 AM, Kirill A. Shutemov wrote:
> On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
>> On 04/27/2016 07:11 PM, Dave Hansen wrote:
>>> 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
>>>    severity of the problem.
>>
>> I think that makes sense. Being large already amortizes the cost per base
>> page much more than pagevecs do (512 vs ~22 pages?).
>
> We try to do this already, don't we? Any specific case where we have THPs
> on pagevecs?

Lukas was hitting this on a RHEL 7 era kernel. In his kernel at least,
I'm pretty sure THP's were ending up on pagevecs. Are you saying you
don't think we're doing that any more?
Re: mm: pages are not freed from lru_add_pvecs after process termination
On 05/02/2016 05:01 PM, Kirill A. Shutemov wrote:
> On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
>> On 04/27/2016 07:11 PM, Dave Hansen wrote:
>>> 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
>>>    severity of the problem.
>>
>> I think that makes sense. Being large already amortizes the cost per base
>> page much more than pagevecs do (512 vs ~22 pages?).
>
> We try to do this already, don't we? Any specific case where we have THPs
> on pagevecs?

For example like this?

__do_huge_pmd_anonymous_page
  lru_cache_add_active_or_unevictable
    lru_cache_add
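For reference, that call chain ends in __lru_cache_add, which (as the
patches elsewhere in this thread show) takes any page, compound or not:

static void __lru_cache_add(struct page *page)
{
	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

	get_page(page);
	if (!pagevec_space(pvec))
		__pagevec_lru_add(pvec);
	pagevec_add(pvec, page);	/* a THP sits here until the next drain */
	put_cpu_var(lru_add_pvec);
}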
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
> On 04/27/2016 07:11 PM, Dave Hansen wrote:
> > 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
> >    severity of the problem.
>
> I think that makes sense. Being large already amortizes the cost per base
> page much more than pagevecs do (512 vs ~22 pages?).

We try to do this already, don't we? Any specific case where we have THPs
on pagevecs?
--
 Kirill A. Shutemov
Re: mm: pages are not freed from lru_add_pvecs after process termination
On 04/27/2016 07:11 PM, Dave Hansen wrote:
> 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
>    severity of the problem.

I think that makes sense. Being large already amortizes the cost per base
page much more than pagevecs do (512 vs ~22 pages?).
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Thu 28-04-16 16:37:10, Michal Hocko wrote:
[...]
> 7. Hook into vmstat and flush from there? This would drain them
> periodically, but it would also introduce nondeterministic interference.

So I have given this a try (not tested yet) and it doesn't look terribly
complicated. It is hijacking vmstat for a purpose it wasn't intended for
originally, but creating dedicated kernel threads/WQ sounds like an
overkill to me. Does this help, or do we have to be more aggressive and
wake up the shepherd from the allocator slow path? Could you give it a try
please?
---
diff --git a/mm/internal.h b/mm/internal.h
index b6ead95a0184..876125bd11f4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -488,4 +488,5 @@ extern const struct trace_print_flags pageflag_names[];
 extern const struct trace_print_flags vmaflag_names[];
 extern const struct trace_print_flags gfpflag_names[];
 
+extern bool pcp_lru_add_need_drain(int cpu);
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/swap.c b/mm/swap.c
index 95916142fc46..3937e6caef96 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -667,6 +667,15 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
 
 static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
 
+bool pcp_lru_add_need_drain(int cpu)
+{
+	return pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+	       pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+	       pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+	       pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+	       need_activate_page_drain(cpu);
+}
+
 void lru_add_drain_all(void)
 {
 	static DEFINE_MUTEX(lock);
@@ -680,11 +689,7 @@ void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
-		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		if (pcp_lru_add_need_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
 			cpumask_set_cpu(cpu, &has_work);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7397d9548f21..766f751e3467 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -479,6 +479,13 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
+	/*
+	 * Do not try to drain LRU pcp caches because that might be
+	 * expensive - we take locks there etc.
+	 */
+	if (do_pagesets && pcp_lru_add_need_drain(smp_processor_id()))
+		lru_add_drain();
+
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset __percpu *p = zone->pageset;
@@ -1477,7 +1484,8 @@ static bool need_update(int cpu)
 			return true;
 	}
 
-	return false;
+
+	return pcp_lru_add_need_drain(cpu);
 }
 
 void quiet_vmstat(void)
--
Michal Hocko
SUSE Labs
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Wed 27-04-16 10:11:04, Dave Hansen wrote:
> On 04/27/2016 10:01 AM, Odzioba, Lukasz wrote:
[...]
>> 1. We need some statistics on the number and total *SIZES* of all pages
>>    in the lru pagevecs. It's too opaque now.
>> 2. We need to make darn sure we drain the lru pagevecs before failing
>>    any kind of allocation.

lru_add_drain_all is unfortunately too costly (especially on large
machines). You are right that failing an allocation with a lot of cached
pages is less than optimal though. So maybe we can do it from the slow
path after the first round of direct reclaim failed to allocate anything.
Something like the following:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dd65d9fb76a..0743c58c2e9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3559,6 +3559,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum compact_result compact_result;
 	int compaction_retries = 0;
 	int no_progress_loops = 0;
+	bool drained_lru = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3667,6 +3668,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
+	if (!drained_lru) {
+		drained_lru = true;
+		lru_add_drain_all();
+	}
+
 	/* Do not loop if specifically requested */
 	if (gfp_mask & __GFP_NORETRY)
 		goto noretry;

The downside would be that we really depend on the WQ to make any progress
here. If we are really out of memory then we are screwed, so we would need
a flush_work_timeout() or something else that would guarantee a maximum
timeout. That something else might be to stop using the WQ and move the
flushing into the IRQ context. Not free either, but at least not dependent
on having some memory to make progress.

>> 3. We need some way to drain the lru pagevecs directly. Maybe the buddy
>>    pcp lists too.
>> 4. We need to make sure that a zone_reclaim_mode=0 system still drains
>>    too.
>> 5. The VM stats and their updates are now related to how often
>>    drain_zone_pages() gets run. That might be interacting here too.
> 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
>    severity of the problem.

7. Hook into vmstat and flush from there? This would drain them
periodically, but it would also introduce nondeterministic interference.
--
Michal Hocko
SUSE Labs
Re: mm: pages are not freed from lru_add_pvecs after process termination
On 04/27/2016 10:01 AM, Odzioba, Lukasz wrote:
> Pieces of the puzzle:
> A) after process termination memory is not getting freed nor accounted as free

I don't think this part is necessarily a bug. As long as we have stats
*somewhere*, and we really do "reclaim" them, I don't think we need to
call these pages "free".

> I am not sure whether it is expected behavior or a side effect of something
> else not going as it should. Temporarily I added lru_add_drain_all() to
> try_to_free_pages(), which sort of hammers case B, but A is still present.

It's not expected behavior. It's an unanticipated side effect of large
numbers of cpu threads, large pages on the LRU, and (relatively) small
zones.

> I am not familiar with this code, but I feel like draining lru_add work
> should be split into smaller pieces and done by kswapd to fix A, and only
> as many pages as needed should be drained in try_to_free_pages to fix B.
>
> Any comments/ideas/patches for a proper fix are welcome.

Here are my suggestions. I've passed these along multiple times, but I
guess I'll repeat them again for good measure.

> 1. We need some statistics on the number and total *SIZES* of all pages
>    in the lru pagevecs. It's too opaque now.
> 2. We need to make darn sure we drain the lru pagevecs before failing
>    any kind of allocation.
> 3. We need some way to drain the lru pagevecs directly. Maybe the buddy
>    pcp lists too.
> 4. We need to make sure that a zone_reclaim_mode=0 system still drains
>    too.
> 5. The VM stats and their updates are now related to how often
>    drain_zone_pages() gets run. That might be interacting here too.

6. Perhaps don't use the LRU pagevecs for large pages. It limits the
   severity of the problem.