RE: mm: pages are not freed from lru_add_pvecs after process termination
On Tue 07-06-16 13:20:00, Michal Hocko wrote:
> I guess you want something like posix_memalign or start faulting in from
> an aligned address to guarantee you will fault 2MB pages.

Good catch.

> Besides that I am really suspicious that this will be measurable at all.
> I would just go and spin a patch assuming you are still able to trigger
> OOM with the vanilla kernel.

Yes, I am still able to trigger OOM. The tests I did are more like sanity
checks than benchmarks. lru_cache_add takes very little time, so the runs
were meant to look for unexpected side effects rather than to measure it.

Thanks,
Lukas
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Tue 07-06-16 09:02:02, Odzioba, Lukasz wrote:
[...]
> //compile with: gcc bench.c -o bench_2M -fopenmp
> //compile with: gcc -D SMALL_PAGES bench.c -o bench_4K -fopenmp
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
>
> #define MAP_HUGE_SHIFT 26
> #define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
>
> #ifndef SMALL_PAGES
> #define PAGE_SIZE (1024*1024*2)
> #define MAP_PARAM (MAP_HUGE_2MB)

Isn't MAP_HUGE_2MB ignored for !hugetlb pages?

> #else
> #define PAGE_SIZE (1024*4)
> #define MAP_PARAM (0)
> #endif
>
> void main() {
>     size_t size = ((60 * 1000 * 1000) / 288) * 1000; // 60GB of memory across 288 CPUs
>     #pragma omp parallel
>     {
>         unsigned int k;
>         for (k = 0; k < 10; k++) {
>             void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
>                            MAP_PRIVATE | MAP_ANON | MAP_PARAM, -1, 0);

I guess you want something like posix_memalign or start faulting in from
an aligned address to guarantee you will fault 2MB pages. Also note that
the default behavior for THP during the fault has changed recently (see
444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a
stall-free defrag option")), so you might need MADV_HUGEPAGE.

Besides that I am really suspicious that this will be measurable at all.
I would just go and spin a patch assuming you are still able to trigger
OOM with the vanilla kernel. The bug fix is more important...
--
Michal Hocko
SUSE Labs
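As a concrete illustration of the suggestion above, here is a minimal
sketch of an aligned, explicitly THP-friendly allocation. The sizes are
arbitrary examples, and the MADV_HUGEPAGE call reflects the
post-444eb2a449ef "madvise" fault-time default mentioned above:

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

int main(void)
{
	size_t size = 64 * HPAGE_SIZE;	/* arbitrary example size */
	void *p;

	/* 2MB alignment so each fault can be satisfied with a 2MB page */
	if (posix_memalign(&p, HPAGE_SIZE, size))
		return 1;

	/* opt in to THP explicitly; the fault-time default is "madvise"
	 * after 444eb2a449ef */
	madvise(p, size, MADV_HUGEPAGE);

	memset(p, 0, size);	/* fault the whole range in */
	free(p);
	return 0;
}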
RE: mm: pages are not freed from lru_add_pvecs after process termination
On Wed 05-11-16 09:53:00, Michal Hocko wrote:
> Yes I think this makes sense. The only case where it would be suboptimal
> is when the pagevec was already full and then we just created a single
> page pvec to drain it. This can be handled better though by:
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 95916142fc46..3fe4f180e8bf 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -391,9 +391,8 @@ static void __lru_cache_add(struct page *page)
>  	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
>
>  	get_page(page);
> -	if (!pagevec_space(pvec))
> +	if (!pagevec_add(pvec, page) || PageCompound(page))
>  		__pagevec_lru_add(pvec);
> -	pagevec_add(pvec, page);
>  	put_cpu_var(lru_add_pvec);
>  }

It's been a while, but I am back with some results.

For 2M and 4K pages I wrote a simple app which mmaps and unmaps a lot of
memory (60GB across 288 CPUs) in parallel, and does it ten times to get
rid of some OS/threading overhead. Then I created an app which mixes page
sizes in a pseudo-random way. I executed those 10 times under the "time"
command (once with THP=on and once with THP=off) and calculated the sum,
min, max, and avg of sys, real, and user time, which was necessary due to
significant variance in the results.

Overall it seems that this change has no negative impact on performance:
4K  THP=on,off -> no significant change
2M  THP=on,off -> it might be a tiny bit slower, but still close to measurement error
MIX THP=on,off -> no significant change

If you have any concerns about test correctness please let me know.
Below I added the test applications and test results.

Thanks,
Lukas

--

//compile with: gcc bench.c -o bench_2M -fopenmp
//compile with: gcc -D SMALL_PAGES bench.c -o bench_4K -fopenmp
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define MAP_HUGE_SHIFT 26
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)

#ifndef SMALL_PAGES
#define PAGE_SIZE (1024*1024*2)
#define MAP_PARAM (MAP_HUGE_2MB)
#else
#define PAGE_SIZE (1024*4)
#define MAP_PARAM (0)
#endif

void main() {
	size_t size = ((60 * 1000 * 1000) / 288) * 1000; // 60GB of memory across 288 CPUs
	#pragma omp parallel
	{
		unsigned int k;
		for (k = 0; k < 10; k++) {
			void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANON | MAP_PARAM, -1, 0);
			if (p != MAP_FAILED) {
				char *cp = (char *)p;
				size_t i;
				for (i = 0; i < size / PAGE_SIZE; i++) {
					*cp = 0;
					cp += PAGE_SIZE;
				}
				munmap(p, size);
			}
		}
	}
}

//compile with: gcc bench_mixed.c -o bench_mixed -fopenmp
#include <stdio.h>
#include <sys/mman.h>
#include <omp.h>

#define SMALL_PAGE (1024*4)
#define HUGE_PAGE (1024*1024*2)
#define MAP_HUGE_SHIFT 26
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)

void main() {
	size_t size = ((60 * 1000 * 1000) / 288) * 1000; // 60GB of memory across 288 CPUs
	#pragma omp parallel
	{
		unsigned int k, MAP_PARAM = 0;
		unsigned int PAGE_SIZE = SMALL_PAGE;
		for (k = 0; k < 10; k++) {
			if ((k + omp_get_thread_num()) % 2) {
				MAP_PARAM = MAP_HUGE_2MB;
				PAGE_SIZE = HUGE_PAGE;
			}
			void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANON | MAP_PARAM, -1, 0);
			if (p != MAP_FAILED) {
				char *cp = (char *)p;
				size_t i;
				for (i = 0; i < size / PAGE_SIZE; i++) {
					*cp = 0;
					cp += PAGE_SIZE;
				}
				munmap(p, size);
			}
		}
	}
}

***
# 4K THP=ON
###real
unpatched		patched###
sum = 428.737s		sum = 421.339s
min = 41.187s		min = 41.492s
max = 44.948s		max = 42.822s
avg = 42.874s		avg = 42.134s

###user
unpatched		patched###
sum = 145.241s		sum = 147.283s
min = 13.760s		min = 14.418s
max = 15.532s		max = 15.201s
avg = 14.524s		avg = 14.728s

###sys
unpatched		patched###
sum = 4882.708s		sum = 5020.581s
min = 441.922s		min = 490.516s
max = 535.294s		max = 532.137s
avg = 488.271s		avg = 502.058s

# 4K THP=OFF###
###real
unpatched		patched###
sum = 2149.288s		sum = 2144.336s
min = 214.589s		min = 212.642s
max = 215.937s		max = 215.579s
avg = 214.929s		avg = 214.434s

###user
unpatched		patched###
sum = 858.659s		sum = 858.166s
min = 81.655s		min = 82.084s
max = 87.790s		max = 88.649s
avg = 85.866s		avg = 85.817s

###sys
unpatched		patched###
sum = 32357.867s	sum = 31126.183s
min = 2952.685s		min = 2783.157s
max = 3442.004s		max =
RE: mm: pages are not freed from lru_add_pvecs after process termination
On Wed 05-11-16 09:53:00, Michal Hocko wrote:
> Yes I think this makes sense. The only case where it would be suboptimal
> is when the pagevec was already full and then we just created a single
> page pvec to drain it. This can be handled better though by:
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 95916142fc46..3fe4f180e8bf 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -391,9 +391,8 @@ static void __lru_cache_add(struct page *page)
>  	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
>
>  	get_page(page);
> -	if (!pagevec_space(pvec))
> +	if (!pagevec_add(pvec, page) || PageCompound(page))
>  		__pagevec_lru_add(pvec);
> -	pagevec_add(pvec, page);
>  	put_cpu_var(lru_add_pvec);
>  }

Oh yeah, that's exactly what I meant. I couldn't find such an elegant way
of handling this special case and didn't want to obscure the idea.
I'll do the tests proposed by Dave and be back here with results next week.

Thank you guys for the involvement,
Lukas
Re: mm: pages are not freed from lru_add_pvecs after process termination
On 05/11/2016 09:53 AM, Michal Hocko wrote:
> On Fri 06-05-16 09:04:34, Dave Hansen wrote:
>> On 05/06/2016 08:10 AM, Odzioba, Lukasz wrote:
>>> On Thu 05-05-16 09:21:00, Michal Hocko wrote:
>>>> Or maybe the async nature of flushing turns out to be just impractical
>>>> and unreliable and we will end up skipping THP (or all compound pages)
>>>> for pcp LRU add cache. Let's see...
>>>
>>> What if we simply skip lru_add pvecs for compound pages?
>>> That way we still have compound pages on LRU's, but the problem goes
>>> away. It is not quite what this naïve patch does, but it works nicely for me.
>>>
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index 03aacbc..c75d5e1 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -392,7 +392,9 @@ static void __lru_cache_add(struct page *page)
>>>  	get_page(page);
>>>  	if (!pagevec_space(pvec))
>>>  		__pagevec_lru_add(pvec);
>>>  	pagevec_add(pvec, page);
>>> +	if (PageCompound(page))
>>> +		__pagevec_lru_add(pvec);
>>>  	put_cpu_var(lru_add_pvec);
>>>  }
>>
>> That's not _quite_ what I had in mind since that drains the entire pvec
>> every time a large page is encountered. But I'm conflicted about what
>> the right behavior _is_.
>>
>> We'd be taking the LRU lock for 'page' anyway, so we might as well drain
>> the pvec.

Note that pages in the pagevec can come from different zones, so this is
not universally true.

> Yes I think this makes sense. The only case where it would be suboptimal
> is when the pagevec was already full and then we just created a single
> page pvec to drain it. This can be handled better though by:
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 95916142fc46..3fe4f180e8bf 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -391,9 +391,8 @@ static void __lru_cache_add(struct page *page)
>  	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
>
>  	get_page(page);
> -	if (!pagevec_space(pvec))
> +	if (!pagevec_add(pvec, page) || PageCompound(page))
>  		__pagevec_lru_add(pvec);
> -	pagevec_add(pvec, page);
>  	put_cpu_var(lru_add_pvec);
>  }

Yeah, that could work. There might be more complex solutions at the level
of lru_cache_add_active_or_unevictable(), which we call either from the
base page code (mm/memory.c) or from functions in mm/huge_memory.c. We
could redirect it at that point, but it's likely not worth the trouble
unless this simple solution shows some performance regression...

>> Or, does the additional work to put the page on to a pvec and then
>> immediately drain it overwhelm that advantage?
>
> pagevec_add is quite trivial so I would be really surprised if it mattered.
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Fri 06-05-16 09:04:34, Dave Hansen wrote:
> On 05/06/2016 08:10 AM, Odzioba, Lukasz wrote:
> > On Thu 05-05-16 09:21:00, Michal Hocko wrote:
> >> Or maybe the async nature of flushing turns
> >> out to be just impractical and unreliable and we will end up skipping
> >> THP (or all compound pages) for pcp LRU add cache. Let's see...
> >
> > What if we simply skip lru_add pvecs for compound pages?
> > That way we still have compound pages on LRU's, but the problem goes
> > away. It is not quite what this naïve patch does, but it works nicely for me.
> >
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 03aacbc..c75d5e1 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -392,7 +392,9 @@ static void __lru_cache_add(struct page *page)
> >  	get_page(page);
> >  	if (!pagevec_space(pvec))
> >  		__pagevec_lru_add(pvec);
> >  	pagevec_add(pvec, page);
> > +	if (PageCompound(page))
> > +		__pagevec_lru_add(pvec);
> >  	put_cpu_var(lru_add_pvec);
> >  }
>
> That's not _quite_ what I had in mind since that drains the entire pvec
> every time a large page is encountered. But I'm conflicted about what
> the right behavior _is_.
>
> We'd be taking the LRU lock for 'page' anyway, so we might as well drain
> the pvec.

Yes I think this makes sense. The only case where it would be suboptimal
is when the pagevec was already full and then we just created a single
page pvec to drain it. This can be handled better though by:

diff --git a/mm/swap.c b/mm/swap.c
index 95916142fc46..3fe4f180e8bf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -391,9 +391,8 @@ static void __lru_cache_add(struct page *page)
 	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
 
 	get_page(page);
-	if (!pagevec_space(pvec))
+	if (!pagevec_add(pvec, page) || PageCompound(page))
 		__pagevec_lru_add(pvec);
-	pagevec_add(pvec, page);
 	put_cpu_var(lru_add_pvec);
 }

> Or, does the additional work to put the page on to a pvec and then
> immediately drain it overwhelm that advantage?

pagevec_add is quite trivial so I would be really surprised if it mattered.
--
Michal Hocko
SUSE Labs
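For context on the one-liner above: pagevec_add reports the space left
after adding, which is why !pagevec_add(pvec, page) detects a just-filled
pagevec. Roughly, from that era's include/linux/pagevec.h (a paraphrase
from memory, not a verbatim copy):

static inline unsigned pagevec_space(struct pagevec *pvec)
{
	return PAGEVEC_SIZE - pvec->nr;	/* free slots left */
}

/* Returns the space remaining after the add, so a zero return means
 * the pagevec just became full and should be drained. */
static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
{
	pvec->pages[pvec->nr++] = page;
	return pagevec_space(pvec);
}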
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Thu 05-05-16 17:25:07, Odzioba, Lukasz wrote:
> On Thu 05-05-16 09:21:00, Michal Hocko wrote:
> > OK, it wasn't that tricky after all. Maybe I have missed something but
> > the following should work. Or maybe the async nature of flushing turns
> > out to be just impractical and unreliable and we will end up skipping
> > THP (or all compound pages) for pcp LRU add cache. Let's see...
>
> Initially this issue was found on RH's 3.10.x kernel, but now I am using
> 4.6-rc6.
>
> Overall it does help, and under heavy load it is slightly better than the
> second patch. Unfortunately I am still able to hit 10-20% oom kills with it
> (down from 30-50%), partially due to the earlier vmstat_update call -
> it went up to 25-25% with this patch below:

This simply shows that this is not a viable option. So I guess we really
want to skip THP (compound pages) for the LRU add pcp cache instead.

Thanks for your effort and testing!
--
Michal Hocko
SUSE Labs
Re: mm: pages are not freed from lru_add_pvecs after process termination
On 05/06/2016 08:10 AM, Odzioba, Lukasz wrote:
> On Thu 05-05-16 09:21:00, Michal Hocko wrote:
>> Or maybe the async nature of flushing turns
>> out to be just impractical and unreliable and we will end up skipping
>> THP (or all compound pages) for pcp LRU add cache. Let's see...
>
> What if we simply skip lru_add pvecs for compound pages?
> That way we still have compound pages on LRU's, but the problem goes
> away. It is not quite what this naïve patch does, but it works nicely for me.
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 03aacbc..c75d5e1 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -392,7 +392,9 @@ static void __lru_cache_add(struct page *page)
>  	get_page(page);
>  	if (!pagevec_space(pvec))
>  		__pagevec_lru_add(pvec);
>  	pagevec_add(pvec, page);
> +	if (PageCompound(page))
> +		__pagevec_lru_add(pvec);
>  	put_cpu_var(lru_add_pvec);
>  }

That's not _quite_ what I had in mind since that drains the entire pvec
every time a large page is encountered. But I'm conflicted about what
the right behavior _is_.

We'd be taking the LRU lock for 'page' anyway, so we might as well drain
the pvec. Or, does the additional work to put the page on to a pvec and
then immediately drain it overwhelm that advantage? Or does it just not
matter?

Kirill, do you have a suggestion for how we should be checking for THP
pages in code like this? PageCompound() will surely _work_ for anon-THP
and your file-THP, but is it the best way to check?

> Do we have any tests that I could use to measure the performance impact
> of such changes before I start to tweak it up? Or maybe it doesn't make
> sense at all?

You probably want to very carefully calculate the time to fault a page,
then separately the time to free a page. If we can't manage to detect a
delta in a little microbenchmark like that, then we'll probably never see
one in practice.

You'll want to measure the fault time for 4k pages, 2M pages, and then
possibly a mix. You'll want to do this in a highly parallel test to make
sure any additional LRU lock overhead shows up.
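A minimal sketch of the kind of microbenchmark described above, timing the
fault and free phases separately; the page size and region size are
placeholder assumptions, and an OpenMP parallel region around the body
would cover the highly parallel case:

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define PAGE_SZ (2UL * 1024 * 1024)	/* placeholder: time 2MB pages */

static double now_sec(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	size_t size = 1024 * PAGE_SZ;	/* placeholder region size */
	size_t pages = size / PAGE_SZ, i;
	double t0, t1, t2;
	char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANON, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	t0 = now_sec();
	for (i = 0; i < pages; i++)	/* fault cost */
		p[i * PAGE_SZ] = 0;
	t1 = now_sec();
	munmap(p, size);		/* free cost */
	t2 = now_sec();

	printf("fault: %.3f us/page, free: %.3f us/page\n",
	       (t1 - t0) * 1e6 / pages, (t2 - t1) * 1e6 / pages);
	return 0;
}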
RE: mm: pages are not freed from lru_add_pvecs after process termination
On Thu 05-05-16 09:21:00, Michal Hocko wrote:
> Or maybe the async nature of flushing turns
> out to be just impractical and unreliable and we will end up skipping
> THP (or all compound pages) for pcp LRU add cache. Let's see...

What if we simply skip lru_add pvecs for compound pages?
That way we still have compound pages on LRU's, but the problem goes
away. It is not quite what this naïve patch does, but it works nicely for me.

diff --git a/mm/swap.c b/mm/swap.c
index 03aacbc..c75d5e1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -392,7 +392,9 @@ static void __lru_cache_add(struct page *page)
 	get_page(page);
 	if (!pagevec_space(pvec))
 		__pagevec_lru_add(pvec);
 	pagevec_add(pvec, page);
+	if (PageCompound(page))
+		__pagevec_lru_add(pvec);
 	put_cpu_var(lru_add_pvec);
 }

Do we have any tests that I could use to measure the performance impact
of such changes before I start to tweak it up? Or maybe it doesn't make
sense at all?

Thanks,
Lukas
RE: mm: pages are not freed from lru_add_pvecs after process termination
On Thu 05-05-16 09:21:00, Michal Hocko wrote:
> OK, it wasn't that tricky after all. Maybe I have missed something but
> the following should work. Or maybe the async nature of flushing turns
> out to be just impractical and unreliable and we will end up skipping
> THP (or all compound pages) for pcp LRU add cache. Let's see...

Initially this issue was found on RH's 3.10.x kernel, but now I am using
4.6-rc6.

Overall it does help, and under heavy load it is slightly better than the
second patch. Unfortunately I am still able to hit 10-20% oom kills with it
(down from 30-50%), partially due to the earlier vmstat_update call -
it went up to 25-25% with this patch below:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b4359f8..7a5ab0d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3264,17 +3264,17 @@ retry:
 	if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
 		migration_mode = MIGRATE_SYNC_LIGHT;
 
-	if (!vmstat_updated) {
-		vmstat_updated = true;
-		kick_vmstat_update();
-	}
-
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					alloc_flags, ac,
 					&did_some_progress);
 	if (page)
 		goto got_pg;
 
+	if (!vmstat_updated) {
+		vmstat_updated = true;
+		kick_vmstat_update();
+	}

I don't quite see an uninvasive way to make sure that we drain all pvecs
before failing an allocation, and doing it asynchronously will race with
allocations anyway - I guess.

Thanks,
Lukas
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Wed 04-05-16 22:36:43, Michal Hocko wrote:
> On Wed 04-05-16 19:41:59, Odzioba, Lukasz wrote:
[...]
> > I have an app which allocates almost all of the memory from a numa node
> > and with just the second patch and 100 consecutive executions 30-50%
> > got killed.
>
> This is still not acceptable. So I guess we need a way to kick
> vmstat_shepherd from the reclaim path. I will think about that. Sounds a
> bit tricky at first sight.

OK, it wasn't that tricky after all. Maybe I have missed something but
the following should work. Or maybe the async nature of flushing turns
out to be just impractical and unreliable and we will end up skipping
THP (or all compound pages) for pcp LRU add cache. Let's see...
---
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 0aa613df463e..7f2c1aef6a09 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -274,4 +274,5 @@ static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
 extern const char * const vmstat_text[];
 
+extern void kick_vmstat_update(void);
 #endif /* _LINUX_VMSTAT_H */
diff --git a/mm/internal.h b/mm/internal.h
index b6ead95a0184..876125bd11f4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -488,4 +488,5 @@ extern const struct trace_print_flags pageflag_names[];
 extern const struct trace_print_flags vmaflag_names[];
 extern const struct trace_print_flags gfpflag_names[];
 
+extern bool pcp_lru_add_need_drain(int cpu);
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 056baf55a88d..5ca829e707f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3556,6 +3556,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum compact_result compact_result;
 	int compaction_retries = 0;
 	int no_progress_loops = 0;
+	bool vmstat_updated = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3658,6 +3659,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (order && compaction_made_progress(compact_result))
 		compaction_retries++;
 
+	if (!vmstat_updated) {
+		vmstat_updated = true;
+		kick_vmstat_update();
+	}
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					alloc_flags, ac,
 					&did_some_progress);
diff --git a/mm/swap.c b/mm/swap.c
index 95916142fc46..3937e6caef96 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -667,6 +667,15 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
 
 static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
 
+bool pcp_lru_add_need_drain(int cpu)
+{
+	return pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+	       pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+	       pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+	       pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+	       need_activate_page_drain(cpu);
+}
+
 void lru_add_drain_all(void)
 {
 	static DEFINE_MUTEX(lock);
@@ -680,11 +689,7 @@ void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
-		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		if (pcp_lru_add_need_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
 			cpumask_set_cpu(cpu, &has_work);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7397d9548f21..cf4b095ace1c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -479,6 +479,13 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
+	/*
+	 * Do not try to drain LRU pcp caches because that might be
+	 * expensive - we take locks there etc.
+	 */
+	if (do_pagesets && pcp_lru_add_need_drain(smp_processor_id()))
+		lru_add_drain();
+
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset __percpu *p = zone->pageset;
@@ -1477,7 +1484,8 @@ static bool need_update(int cpu)
 			return true;
 	}
 
-	return false;
+
+	return pcp_lru_add_need_drain(cpu);
 }
 
 void quiet_vmstat(void)
@@ -1542,6 +1550,16 @@ static void vmstat_shepherd(struct work_struct *w)
 			round_jiffies_relative(sysctl_stat_interval));
 }
 
+void kick_vmstat_update(void)
+{
+#ifdef CONFIG_SMP
+	might_sleep();
+
+	if (cancel_delayed_work(&shepherd))
+		vmstat_shepherd(
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Wed 04-05-16 19:41:59, Odzioba, Lukasz wrote:
> On Thu 02-05-16 03:00:00, Michal Hocko wrote:
> > So I have given this a try (not tested yet) and it doesn't look terribly
> > complicated. It is hijacking vmstat for a purpose it wasn't intended for
> > originally but creating dedicated kernel threads/WQ sounds like an
> > overkill to me. Does this help or do we have to be more aggressive and
> > wake up the shepherd from the allocator slow path? Could you give it a
> > try please?
>
> It seems to work fine, but it takes quite a random time to drain the
> lists, sometimes a couple of seconds, sometimes over two minutes. It is
> acceptable I believe.

I guess you mean that some CPUs are not drained for a few minutes, right?
This might be quite long, and I deliberately did not flush the LRU pvecs
on idle entry because I felt it would be too expensive. Maybe it would be
better to kick the vmstat_shepherd from the allocator slow path. It would
still take an unpredictable amount of time, but it would at least be
triggered when we are getting short on memory.

> I have an app which allocates almost all of the memory from a numa node
> and with just the second patch and 100 consecutive executions 30-50% got
> killed.

This is still not acceptable. So I guess we need a way to kick
vmstat_shepherd from the reclaim path. I will think about that. Sounds a
bit tricky at first sight.

> After applying also your first patch I haven't seen any oom kill
> activity - great.

As I've said, the first patch is quite dangerous as it depends on the WQ
to make forward progress, which might itself depend on a memory allocation
to create a new worker.

> I was wondering how many lru_add_drain()'s are called, and after boot
> when the machine was idle it was a bit over 5k calls during the first
> 400s, and with some activity it went up to 15k calls during 700s
> (including the 5k from the previous experiment), which sounds fair to me
> given the big CPU count.
>
> Do you see any advantages of dropping THP from pagevecs over this
> solution?

Well, the general purpose of pcp pagevecs is to reduce lru_lock contention.
I have never measured the effect for THP pages. It is true that THP
amortizes the contention by the number of base pages handled at once, so
skipping it might be the easiest way (and certainly more acceptable for an
old kernel, which you seem to be running as mentioned by Dave), but it
sounds too special cased and I would rather see less special casing for
THP. So if the async pcp sync is not too tricky or hard to maintain and
works, I would rather go that way.

Thanks for testing those patches!
--
Michal Hocko
SUSE Labs
Re: mm: pages are not freed from lru_add_pvecs after process termination
On 05/04/2016 12:41 PM, Odzioba, Lukasz wrote:
> Do you see any advantages of dropping THP from pagevecs over this solution?

It's a more foolproof solution. Even with this patch, there might still
be some corner cases where the draining doesn't occur. That "two minutes"
might become 20 or 200 under some circumstances.
RE: mm: pages are not freed from lru_add_pvecs after process termination
On Thu 02-05-16 03:00:00, Michal Hocko wrote:
> So I have given this a try (not tested yet) and it doesn't look terribly
> complicated. It is hijacking vmstat for a purpose it wasn't intended for
> originally but creating dedicated kernel threads/WQ sounds like an
> overkill to me. Does this help or do we have to be more aggressive and
> wake up the shepherd from the allocator slow path? Could you give it a
> try please?

It seems to work fine, but it takes quite a random time to drain the lists,
sometimes a couple of seconds, sometimes over two minutes. It is acceptable
I believe.

I have an app which allocates almost all of the memory from a numa node and
with just the second patch and 100 consecutive executions 30-50% got killed.
After applying also your first patch I haven't seen any oom kill activity -
great.

I was wondering how many lru_add_drain()'s are called, and after boot when
the machine was idle it was a bit over 5k calls during the first 400s, and
with some activity it went up to 15k calls during 700s (including the 5k
from the previous experiment), which sounds fair to me given the big CPU
count.

Do you see any advantages of dropping THP from pagevecs over this solution?

Thanks,
Lukas
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Tue, May 03, 2016 at 09:37:57AM +0200, Michal Hocko wrote:
> On Mon 02-05-16 19:02:50, Kirill A. Shutemov wrote:
> > On Mon, May 02, 2016 at 08:49:03AM -0700, Dave Hansen wrote:
> > > On 05/02/2016 08:01 AM, Kirill A. Shutemov wrote:
> > > > On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
> > > > > On 04/27/2016 07:11 PM, Dave Hansen wrote:
> > > > > > 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
> > > > > >    severity of the problem.
> > > > >
> > > > > I think that makes sense. Being large already amortizes the cost per base
> > > > > page much more than pagevecs do (512 vs ~22 pages?).
> > > >
> > > > We try to do this already, don't we? Any specific case where we have
> > > > THPs on pagevecs?
> > >
> > > Lukas was hitting this on a RHEL 7 era kernel. In his kernel at least,
> > > I'm pretty sure THP's were ending up on pagevecs. Are you saying you
> > > don't think we're doing that any more?
> >
> > As Vlastimil pointed out, we do. It needs to be fixed, I think.
>
> It seems that offloading the draining to the vmstat context doesn't look
> terribly bad. Don't we rather want to go that way?

Maybe. My knowledge about the lru cache is limited.
--
 Kirill A. Shutemov
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Mon 02-05-16 19:02:50, Kirill A. Shutemov wrote:
> On Mon, May 02, 2016 at 08:49:03AM -0700, Dave Hansen wrote:
> > On 05/02/2016 08:01 AM, Kirill A. Shutemov wrote:
> > > On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
> > > > On 04/27/2016 07:11 PM, Dave Hansen wrote:
> > > > > 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
> > > > >    severity of the problem.
> > > >
> > > > I think that makes sense. Being large already amortizes the cost per base
> > > > page much more than pagevecs do (512 vs ~22 pages?).
> > >
> > > We try to do this already, don't we? Any specific case where we have
> > > THPs on pagevecs?
> >
> > Lukas was hitting this on a RHEL 7 era kernel. In his kernel at least,
> > I'm pretty sure THP's were ending up on pagevecs. Are you saying you
> > don't think we're doing that any more?
>
> As Vlastimil pointed out, we do. It needs to be fixed, I think.

It seems that offloading the draining to the vmstat context doesn't look
terribly bad. Don't we rather want to go that way?
--
Michal Hocko
SUSE Labs
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Mon, May 02, 2016 at 08:49:03AM -0700, Dave Hansen wrote:
> On 05/02/2016 08:01 AM, Kirill A. Shutemov wrote:
> > On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
> > > On 04/27/2016 07:11 PM, Dave Hansen wrote:
> > > > 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
> > > >    severity of the problem.
> > >
> > > I think that makes sense. Being large already amortizes the cost per base
> > > page much more than pagevecs do (512 vs ~22 pages?).
> >
> > We try to do this already, don't we? Any specific case where we have
> > THPs on pagevecs?
>
> Lukas was hitting this on a RHEL 7 era kernel. In his kernel at least,
> I'm pretty sure THP's were ending up on pagevecs. Are you saying you
> don't think we're doing that any more?

As Vlastimil pointed out, we do. It needs to be fixed, I think.
Any volunteer? :-P
--
 Kirill A. Shutemov
Re: mm: pages are not freed from lru_add_pvecs after process termination
On 05/02/2016 08:01 AM, Kirill A. Shutemov wrote:
> On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
>> On 04/27/2016 07:11 PM, Dave Hansen wrote:
>>> 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
>>>    severity of the problem.
>>
>> I think that makes sense. Being large already amortizes the cost per base
>> page much more than pagevecs do (512 vs ~22 pages?).
>
> We try to do this already, don't we? Any specific case where we have THPs
> on pagevecs?

Lukas was hitting this on a RHEL 7 era kernel. In his kernel at least,
I'm pretty sure THP's were ending up on pagevecs. Are you saying you
don't think we're doing that any more?
Re: mm: pages are not freed from lru_add_pvecs after process termination
On 05/02/2016 05:01 PM, Kirill A. Shutemov wrote:
> On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
>> On 04/27/2016 07:11 PM, Dave Hansen wrote:
>>> 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
>>>    severity of the problem.
>>
>> I think that makes sense. Being large already amortizes the cost per base
>> page much more than pagevecs do (512 vs ~22 pages?).
>
> We try to do this already, don't we? Any specific case where we have THPs
> on pagevecs?

For example like this?

__do_huge_pmd_anonymous_page
  lru_cache_add_active_or_unevictable
    lru_cache_add
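For reference, that call chain ends in __lru_cache_add, which (as the
patches elsewhere in this thread show) takes any page, compound or not:

static void __lru_cache_add(struct page *page)
{
	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

	get_page(page);
	if (!pagevec_space(pvec))
		__pagevec_lru_add(pvec);
	pagevec_add(pvec, page);	/* a THP sits here until the next drain */
	put_cpu_var(lru_add_pvec);
}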
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Mon, May 02, 2016 at 04:39:35PM +0200, Vlastimil Babka wrote:
> On 04/27/2016 07:11 PM, Dave Hansen wrote:
> > 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
> >    severity of the problem.
>
> I think that makes sense. Being large already amortizes the cost per base
> page much more than pagevecs do (512 vs ~22 pages?).

We try to do this already, don't we? Any specific case where we have THPs
on pagevecs?
--
 Kirill A. Shutemov
Re: mm: pages are not freed from lru_add_pvecs after process termination
On 04/27/2016 07:11 PM, Dave Hansen wrote:
> 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
>    severity of the problem.

I think that makes sense. Being large already amortizes the cost per base
page much more than pagevecs do (512 vs ~22 pages?).
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Thu 28-04-16 16:37:10, Michal Hocko wrote:
[...]
> 7. Hook into vmstat and flush from there? This would drain them
> periodically, but it would also introduce nondeterministic interference.

So I have given this a try (not tested yet) and it doesn't look terribly
complicated. It is hijacking vmstat for a purpose it wasn't intended for
originally, but creating dedicated kernel threads/WQ sounds like an
overkill to me. Does this help, or do we have to be more aggressive and
wake up the shepherd from the allocator slow path? Could you give it a try
please?
---
diff --git a/mm/internal.h b/mm/internal.h
index b6ead95a0184..876125bd11f4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -488,4 +488,5 @@ extern const struct trace_print_flags pageflag_names[];
 extern const struct trace_print_flags vmaflag_names[];
 extern const struct trace_print_flags gfpflag_names[];
 
+extern bool pcp_lru_add_need_drain(int cpu);
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/swap.c b/mm/swap.c
index 95916142fc46..3937e6caef96 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -667,6 +667,15 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
 
 static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
 
+bool pcp_lru_add_need_drain(int cpu)
+{
+	return pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+	       pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+	       pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+	       pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+	       need_activate_page_drain(cpu);
+}
+
 void lru_add_drain_all(void)
 {
 	static DEFINE_MUTEX(lock);
@@ -680,11 +689,7 @@ void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
-		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		if (pcp_lru_add_need_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
 			cpumask_set_cpu(cpu, &has_work);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7397d9548f21..766f751e3467 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -479,6 +479,13 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
+	/*
+	 * Do not try to drain LRU pcp caches because that might be
+	 * expensive - we take locks there etc.
+	 */
+	if (do_pagesets && pcp_lru_add_need_drain(smp_processor_id()))
+		lru_add_drain();
+
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset __percpu *p = zone->pageset;
@@ -1477,7 +1484,8 @@ static bool need_update(int cpu)
 			return true;
 	}
 
-	return false;
+
+	return pcp_lru_add_need_drain(cpu);
 }
 
 void quiet_vmstat(void)
--
Michal Hocko
SUSE Labs
Re: mm: pages are not freed from lru_add_pvecs after process termination
On Wed 27-04-16 10:11:04, Dave Hansen wrote:
> On 04/27/2016 10:01 AM, Odzioba, Lukasz wrote:
[...]
>> 1. We need some statistics on the number and total *SIZES* of all pages
>>    in the lru pagevecs. It's too opaque now.
>> 2. We need to make darn sure we drain the lru pagevecs before failing
>>    any kind of allocation.

lru_add_drain_all is unfortunately too costly (especially on large
machines). You are right that failing an allocation with a lot of cached
pages is less than optimal though. So maybe we can do it from the slow
path after the first round of direct reclaim failed to allocate anything.
Something like the following:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dd65d9fb76a..0743c58c2e9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3559,6 +3559,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum compact_result compact_result;
 	int compaction_retries = 0;
 	int no_progress_loops = 0;
+	bool drained_lru = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3667,6 +3668,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
+	if (!drained_lru) {
+		drained_lru = true;
+		lru_add_drain_all();
+	}
+
 	/* Do not loop if specifically requested */
 	if (gfp_mask & __GFP_NORETRY)
 		goto noretry;

The downside would be that we really depend on the WQ to make any progress
here. If we are really out of memory then we are screwed, so we would need
a flush_work_timeout() or something else that would guarantee a maximum
timeout. That something else might be to stop using the WQ and move the
flushing into the IRQ context. Not free either, but at least not dependent
on having some memory to make progress.

>> 3. We need some way to drain the lru pagevecs directly. Maybe the buddy
>>    pcp lists too.
>> 4. We need to make sure that a zone_reclaim_mode=0 system still drains
>>    too.
>> 5. The VM stats and their updates are now related to how often
>>    drain_zone_pages() gets run. That might be interacting here too.
> 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
>    severity of the problem.

7. Hook into vmstat and flush from there? This would drain them
periodically, but it would also introduce nondeterministic interference.
--
Michal Hocko
SUSE Labs
Re: mm: pages are not freed from lru_add_pvecs after process termination
On 04/27/2016 10:01 AM, Odzioba, Lukasz wrote:
> Pieces of the puzzle:
> A) after process termination memory is not getting freed nor accounted as free

I don't think this part is necessarily a bug. As long as we have stats
*somewhere*, and we really do "reclaim" them, I don't think we need to
call these pages "free".

> I am not sure whether it is expected behavior or a side effect of something
> else not going as it should. Temporarily I added lru_add_drain_all() to
> try_to_free_pages(), which sort of hammers case B, but A is still present.

It's not expected behavior. It's an unanticipated side effect of large
numbers of cpu threads, large pages on the LRU, and (relatively) small
zones.

> I am not familiar with this code, but I feel like draining lru_add work
> should be split into smaller pieces and done by kswapd to fix A, and only
> as many pages as needed should be drained in try_to_free_pages to fix B.
>
> Any comments/ideas/patches for a proper fix are welcome.

Here are my suggestions. I've passed these along multiple times, but I
guess I'll repeat them again for good measure.

> 1. We need some statistics on the number and total *SIZES* of all pages
>    in the lru pagevecs. It's too opaque now.
> 2. We need to make darn sure we drain the lru pagevecs before failing
>    any kind of allocation.
> 3. We need some way to drain the lru pagevecs directly. Maybe the buddy
>    pcp lists too.
> 4. We need to make sure that a zone_reclaim_mode=0 system still drains
>    too.
> 5. The VM stats and their updates are now related to how often
>    drain_zone_pages() gets run. That might be interacting here too.

6. Perhaps don't use the LRU pagevecs for large pages. It limits the
   severity of the problem.