Re: [PATCH v7] mm: support madvise(MADV_FREE)

2014-05-22 Thread Michael Kerrisk
Hi Minchan,

On Mon, May 19, 2014 at 5:16 AM, Minchan Kim  wrote:
> Linux has no way to free pages lazily, while other operating systems
> already support this through madvise(MADV_FREE).

Since this patch changes the ABI, could you please CC future
iterations to linux-...@vger.kernel.org as per
Documentation/SubmitChecklist.

Thanks,

Michael



[PATCH v7] mm: support madvise(MADV_FREE)

2014-05-18 Thread Minchan Kim
Linux has no way to free pages lazily, while other operating systems
already support this through madvise(MADV_FREE).

The gain is clear: the kernel can discard freed pages rather than
swapping them out or hitting OOM when memory pressure happens.

Without memory pressure, freed pages are reused by userspace
without any additional overhead (e.g., page fault + allocation
+ zeroing).

It works as follows.

When the madvise syscall is called, the VM clears the dirty bit of the
ptes in the range. If memory pressure happens, the VM checks the dirty
bit in the page table; if it is still "clean", the page is a "lazyfree"
page, so the VM can discard it instead of swapping it out.
If there was a store to the page before the VM picks it for reclaim,
the dirty bit is set, so the VM swaps the page out instead of
discarding it.
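
The decision described above can be modelled in a few lines of
userspace C. This is only an illustrative sketch, not code from the
patch: struct model_pte and the model_* functions are hypothetical
names, and a single boolean stands in for the real pte dirty bit.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a page table entry. */
struct model_pte {
	bool dirty;
};

enum reclaim_action { RECLAIM_DISCARD, RECLAIM_SWAP_OUT };

/* madvise(MADV_FREE) time: clear the dirty bit of the pte. */
void model_madvise_free(struct model_pte *pte)
{
	pte->dirty = false;
}

/* A later store from userspace sets the dirty bit again. */
void model_store(struct model_pte *pte)
{
	pte->dirty = true;
}

/* Reclaim time: a still-clean page is lazyfree and can be dropped. */
enum reclaim_action model_reclaim(const struct model_pte *pte)
{
	return pte->dirty ? RECLAIM_SWAP_OUT : RECLAIM_DISCARD;
}
```

In this model a page that is written again after model_madvise_free()
is treated like any other dirty anonymous page at reclaim time.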

The first heavy users would be general-purpose allocators (e.g.,
jemalloc and tcmalloc, and hopefully glibc will support it too);
jemalloc and tcmalloc already support the feature on other OSes
(e.g., FreeBSD).
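
For illustration, an allocator could hand an unused range back to the
kernel roughly as below. This is a minimal sketch under assumptions:
release_range() is a hypothetical helper, not jemalloc or tcmalloc
code, and the fallback value 8 for MADV_FREE is the one found in
mainline asm-generic headers, which may differ from the value used by
this version of the patch or on other architectures.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_FREE
/* Assumption: value from mainline asm-generic/mman-common.h. */
#define MADV_FREE 8
#endif

/*
 * Hypothetical allocator helper: tell the kernel that [addr, addr+len)
 * holds no useful data. Tries the lazy MADV_FREE first and falls back
 * to the eager MADV_DONTNEED if the kernel rejects it.
 * Returns 0 on success, -1 with errno set on failure.
 */
int release_range(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_FREE) == 0)
		return 0;
	return madvise(addr, len, MADV_DONTNEED);
}
```

After a successful call the old contents of the range must be treated
as lost; the range stays mapped, so the allocator can simply store to
it again when it reuses the memory.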

barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 42
Stepping:              7
CPU MHz:               2801.000
BogoMIPS:              5581.64
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              4096K
NUMA node0 CPU(s):     0-3

ebizzy benchmark (./ebizzy -S 10 -n 512)

 vanilla-jemalloc             MADV_free-jemalloc

1 thread
records:  10                  records:  10
avg:      7682.10             avg:      15306.10
std:      62.35(0.81%)        std:      347.99(2.27%)
max:      7770.00             max:      15622.00
min:      7598.00             min:      14772.00

2 thread
records:  10                  records:  10
avg:      12747.50            avg:      24171.00
std:      792.06(6.21%)       std:      895.18(3.70%)
max:      13337.00            max:      26023.00
min:      10535.00            min:      23152.00

4 thread
records:  10                  records:  10
avg:      16474.60            avg:      33717.90
std:      1496.45(9.08%)      std:      2008.97(5.96%)
max:      17877.00            max:      35958.00
min:      12224.00            min:      29565.00

8 thread
records:  10                  records:  10
avg:      16778.50            avg:      33308.10
std:      825.53(4.92%)       std:      1668.30(5.01%)
max:      17543.00            max:      36010.00
min:      14576.00            min:      29577.00

16 thread
records:  10                  records:  10
avg:      20614.40            avg:      35516.30
std:      602.95(2.92%)       std:      1283.65(3.61%)
max:      21753.00            max:      37178.00
min:      19605.00            min:      33217.00

32 thread
records:  10                  records:  10
avg:      22771.70            avg:      36018.50
std:      598.94(2.63%)       std:      1046.76(2.91%)
max:      24035.00            max:      37266.00
min:      22108.00            min:      34149.00

In summary, MADV_FREE is about 2 times faster than MADV_DONTNEED in
this benchmark (roughly 2x at 1 to 8 threads, 1.6x at 32 threads).
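
The ratio behind this summary can be recomputed from the "avg" rows of
the ebizzy results. The sketch below is only a convenience for checking
the arithmetic; struct ebizzy_avg, ebizzy_avgs and speedup() are
hypothetical names, and the numbers are copied from the table.

```c
/* Average records per run, copied from the ebizzy table. */
struct ebizzy_avg {
	int threads;
	double vanilla;    /* vanilla-jemalloc */
	double madv_free;  /* MADV_free-jemalloc */
};

const struct ebizzy_avg ebizzy_avgs[] = {
	{  1,  7682.10, 15306.10 },
	{  2, 12747.50, 24171.00 },
	{  4, 16474.60, 33717.90 },
	{  8, 16778.50, 33308.10 },
	{ 16, 20614.40, 35516.30 },
	{ 32, 22771.70, 36018.50 },
};

/* Throughput of the MADV_FREE run relative to the vanilla run. */
double speedup(const struct ebizzy_avg *r)
{
	return r->madv_free / r->vanilla;
}
```

The ratio is close to 2 for 1 to 8 threads and drops to about 1.7 and
1.6 at 16 and 32 threads, on a machine with 4 CPUs.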

* From v6
 * Remove page from swapcache at syscall time
 * Move utility functions from memory.c to madvise.c - Johannes
 * Rename utility functions - Johannes
 * Remove unnecessary checks from vmscan.c - Johannes
 * Rebased on v3.15-rc5-mmotm-2014-05-16-16-56
 * Drop Reviewed-by because there were some changes since then.

* From v5
 * Fix PPC problem where the TLB was not flushed - Rik
 * Remove unnecessary lazyfree_range stub function - Rik
 * Rebased on v3.15-rc5

* From v4
 * Add Reviewed-by: Zhang Yanfei
 * Rebase on v3.15-rc1-mmotm-2014-04-15-16-14

* From v3
 * Add "how it works" part to description - Zhang
 * Add page_discardable utility function - Zhang
 * Clean up

* From v2
 * Remove forceful dirty marking of pages read in from swap - Johannes
 * Remove deactivation logic of lazyfreed page
 * Rebased on 3.14
 * Remove RFC tag

* From v1
 * Use custom page table walker for madvise_free - Johannes
 * Remove PG_lazypage flag - Johannes
 * Do madvise_dontneed instead of madvise_free in swapless system

Cc: Hugh Dickins 
Cc: Johannes Weiner 
Cc: Rik van Riel 
Cc: KOSAKI Motohiro 
Cc: Mel Gorman 
Cc: Jason Evans 
Cc: Zhang Yanfei 
Signed-off-by: Minchan Kim 
---
 include/linux/rmap.h                   |   8 +-
 include/linux/vm_event_item.h          |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 174 +
 mm/rmap.c                              |  34 ++-
 mm/vmscan.c                            |  37 +--
 mm/vmstat.c                            |   1 +
 7 files changed, 245 insertions(+), 11 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 9be55c7617da..1fb2beb351f8 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -182,7 +182,8 @@ static inline v